Web Reliability
Chapter 10. What is the Web Reliability Framework?
Mitchell Kimbrough
Founder & CEO
The Web Reliability Framework is the practical expression of the principles we’ve laid out so far, an organized structure that guides the real-world practice of Web Reliability.
The framework is structured as a 3 x 3 matrix of interlacing principles. The result is nine separate aspects of Web Reliability that interact with one another to determine the quality of customer flow through a website.
The three top-level parts of the system are team, plan, and action. The remaining three interlacing principles are motivation, resistance, and management. As an example, consider what happens when the flow of a website degrades. The failure shows first in the action management cell. This is the intersection where the action layer of the framework (the servers, routers, switches, and fulfillment software) is managed and monitored by separate uptime systems. If your web server slows down, the monitoring system detects the problem and someone's pager goes off. The team action cell is activated next, since actual humans normally need to get involved to troubleshoot why an issue has surfaced in the action management cell. Other cells of the framework are also involved here, but that comes later.
Flow as a first principle serves us best when we use it as a metaphor and ask that metaphor to teach us about the flow of customer desire through a website. Consider a city's power grid. Talk to one of the people in your town involved in managing the power grid and you may learn that there are some fundamental principles they adhere to in order to make the flow of power through the grid reliable. This person is always thinking about the three principles of flow in the system they manage: force, resistance, and monitoring. In a power grid, the voltage, or the force of energy flowing through the system, is of great importance. This part of the flow must be kept steady and consistent. In some places it must be amplified; in other places, dampened. Resistance in the system must also be given great attention. Anything that encumbers the flow is resistance, and it makes the entire grid less efficient. In a large, complex power grid, monitoring and management are of supreme importance. The flow of electricity, or current, through the system must be monitored and managed at all times so that emerging issues can be identified and addressed immediately. An unsupervised flow is one that will soon fail.
Continuing with the metaphor of the power grid, there are three other critically important factors to consider in maintaining reliable flow: team, plan, and action. A large team of people is needed to maintain the dependable flow of electricity through your city’s power grid. Many of these people are highly trained and experienced specialists, and the system depends on their specific expertise for smooth flow. These individuals must work together as a coordinated unit in service of the common goal of optimal flow. The team that builds and runs the power grid must in turn work in accord with a master plan or a strategy for optimizing flow. There must be an overarching objective along with a method for achieving it that governs their decisions and actions. The team must then execute against that strategy. They must take action. On a daily basis, working together as a single unit, working in accordance with a governing strategy, they do the ongoing work of executing the plan and consistently maintaining the flow of power coursing through the system.
The Web Reliability Framework embodies this metaphor of power grid flow and applies exactly the same principles to the flow of customer desire through a website.
We’ve established that the Web Reliability Framework is a matrix of interlacing principles. The three top-level parts of the system are team, plan, and action. The remaining three interlacing principles are motivation, resistance, and management. These intersecting ideas work together to make up the Web Reliability Framework.
The first top-level component is the team, the human part of the system, and the most important. A good team, over time, can overcome deficiencies in planning and action. And without a good team, even the best strategies and execution tactics will never come to life.
The web team is the group of people who conceive, plan, design, build, launch, iterate on, and maintain a web property. The web team is normally composed of a few key internal players within the organization that owns the web property, along with a number of outside agencies and experts who bring their skills and experience to bear. Working in a unified way towards a common goal, the web team is the unit that serves the customer of the website. The web team's dedication to providing excellent service converts directly into revenue reliability.
The second top-level component is the plan behind the web property - the concept, marketing, process, design, structure, and underlying reliability architecture. The plan is the strategy. It guides customers to the web property. It seeks to serve their needs efficiently and reliably. And it does so with stability and effectiveness over time.
The third top-level component is the execution of the strategy by the team - the actual on-the-ground manifestation of the strategic plan. The action level includes the code that causes the web property to appear and function according to design in the web browser, mobile device, or audible representation of the property. Action also includes the servers, cloud architecture, backup and caching systems, as well as monitoring and uptime services. Finally, action includes the working process of iterating on, versioning, and maintaining the web property over time. In an optimally running website, these three top-level system components only function in relationship to each other, and no part is sufficient on its own.
The three second-level components in the Web Reliability Framework are motivation, resistance, and management. For this, we can use a new metaphor. Let’s consider a water pipeline. Flow is the most important thing – the primary purpose of the pipeline. In order to effectively serve the homes that rely on the water pipeline, those who manage it must always be thinking about the pressure of the flow going through the system, the friction and resistance that the flow runs into, and the overall ongoing management of the health of the pipeline.
The metaphors are helpful in picturing what exactly Web Reliability is. Imagine it as a living structure made of coordinated, interdependent parts, like the water pipeline or the electrical grid. The customers are analogous to the water molecules or electrons flowing through those utility systems. Customer motivation or purpose corresponds to the pressure or voltage in a system. Customer resistance corresponds to resistance in the pipeline or circuit. Management corresponds to the supervision of the successful flow of the system overall.
These three properties - motivation, resistance, and management - intersect with team, plan, and action. These intersections create a matrix. This matrix as a whole represents the component parts of the flow that make up the first principle of Web Reliability.
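For readers who think in code, the matrix can be pictured as a small lookup table. The Python sketch below is only an illustration, not part of the framework itself: it crosses the three top-level parts with the three interlacing principles to enumerate the nine cells. The names `PARTS`, `PRINCIPLES`, and `build_matrix`, and the "part principle" labeling of each cell (as in "action management"), are assumptions made for this example.

```python
from itertools import product

# The three top-level parts and three interlacing principles named in the text.
PARTS = ("team", "plan", "action")
PRINCIPLES = ("motivation", "resistance", "management")


def build_matrix():
    """Return the nine cells keyed by (part, principle).

    The "part principle" label format is an assumption for illustration.
    """
    return {
        (part, principle): f"{part} {principle}"
        for part, principle in product(PARTS, PRINCIPLES)
    }


matrix = build_matrix()

# Print the matrix one row per top-level part.
for part in PARTS:
    print(" | ".join(matrix[(part, principle)] for principle in PRINCIPLES))
```

Each row of the printout is one top-level part and each column is one interlacing principle, so a cell such as action management sits where the action row meets the management column.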