Web Reliability

34. Frictionless System Design

Written March 12, 2020 by Mitchell Kimbrough, Founder & CEO

In this conversation about strategies to reduce friction, we’ve talked a great deal about web applications, so of course, we must talk about servers and infrastructure too.

When you look behind the frontend of a website that functions reliably and well over time, you generally find uncomplicated server and software infrastructure. Like a good foundation for a house, this infrastructure is the dependable footing that supports the day-to-day stability and growth of your systems over time. Complex systems always require ongoing care and feeding, and they contain more inherent risk. A simpler system is easier to understand and maintain, carries less risk, and is therefore more reliable.

Even the simplest infrastructure system is talked about in the language of layers. You’ll hear developers and administrators say things like, "What are you using in your caching layer?" and "How is your database layer configured?" [Caching means saving a copy of specific data, or of an entire page, in a separate location so it can be reused later without being regenerated from scratch.]
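The concept is easy to sketch. This toy time-to-live cache in Python is purely illustrative and not tied to any particular web stack; the 60-second lifetime is an assumed value:

```python
import time

# A toy cache: store a rendered page alongside the time it was stored,
# and reuse it until a time-to-live (TTL) expires.
_cache = {}
TTL_SECONDS = 60  # illustrative TTL, not a recommendation

def render_page(path):
    # Stand-in for expensive work (database queries, templating, etc.)
    return f"<html><body>Content for {path}</body></html>"

def get_page(path):
    entry = _cache.get(path)
    if entry is not None:
        body, stored_at = entry
        if time.time() - stored_at < TTL_SECONDS:
            return body  # cache hit: no expensive work needed
    body = render_page(path)  # cache miss: do the work once
    _cache[path] = (body, time.time())
    return body
```

Every layer we discuss below applies some version of this trade: spend a little staleness, save a lot of work.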

By tracing the path of a web request and thinking about it in terms of the functional layers involved, you get into the proper frame of mind for planning the appropriate components. The term “web request” refers to the call to a specific URL from a given device. The variations between websites and mobile apps are worth noting here, as mobile apps may have additional layers to consider that are device and OS-specific, and present their own unique challenges and opportunities.

There are six different layers to talk about. But let’s start at the beginning. The web request originates when your customer reaches out to you looking for more information. Depending on how you are marketing your services, this might come from a paid search click, an organic search, a direct URL entry, a link from networked content or a partner site, etc. Wherever it comes from, the request is made and sent to your URL.

The first infrastructure layer encountered by the request is the browser or device layer. At this layer, there are a number of caching opportunities to take advantage of, and a sound infrastructure strategy includes planning for them. For example, a web page includes headers that are not immediately visible to the user, which provide the browser or mobile device with useful information about the page, including rules about how long a given block of data or page should be cached locally on the user's computer or mobile device. This is an important first infrastructure step. If you can ensure that a user's browser does not need to come back to your servers for fresh data before a specific time, you reduce the load on your infrastructure and you reduce system friction.
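Here is a minimal sketch of what setting such a rule looks like, using nothing but Python's standard library. The five-minute lifetime is an assumed value for illustration, not a recommendation:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class CachingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"<html><body>Hello</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        # Tell the browser it may reuse its local copy for 300 seconds
        # (five minutes) before asking the server again.
        self.send_header("Cache-Control", "max-age=300")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CachingHandler).serve_forever()
```

During those 300 seconds, repeat visits cost your servers nothing at all.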

A few years ago, while building comprehensive caching into a client's web property, we inadvertently went overboard with the local cache expiry setting. We were trying to rescue the site from really terrible page load performance by putting a caching service called Cloudflare in front of it. Part of this service includes the ability to manipulate the local cache expiration rules in the page header. As it turned out, we had a typo in our code that ended up telling users' web browsers to cache pages for one full year. And the nature of the site content was such that it needed to be updated every minute or so. This was an epic mistake, because once someone visits your site and you tell their web browser to keep a local cache of a page, their browser believes you. It doesn't come back to check again. We might not have seen those users again for a year. The good news is that we caught the error within a few minutes post-launch, so this mistake likely only impacted a few users. Our client, of course, was less than thrilled to learn about the error but relieved it had been fixed before doing any real harm.
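The exact character we fumbled isn't worth reconstructing, but the scale of the mistake is easy to show. A max-age value is denominated in seconds, so one minute and one year differ only by digits; a hypothetical sketch:

```python
# Cache-Control max-age values are expressed in seconds.
intended = 60                  # one minute, matching how often the content changed
shipped = 60 * 60 * 24 * 365   # 31,536,000 seconds: one full year

print(f"Cache-Control: max-age={intended}")  # what we meant to send
print(f"Cache-Control: max-age={shipped}")   # what browsers actually received
```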

The second infrastructure layer encountered by the web request is the edge caching layer. [Note that DNS resolution precedes edge caching, but for most sites it is not a critical concern.] This layer includes CDNs such as Cloudflare, Akamai, Fastly, and CloudFront. It is one of the most powerful parts of your infrastructure. Used effectively, the edge caching layer can vastly reduce the pressure and friction on your origin infrastructure, meaning your main web server has to work much less hard to serve content. These days CDNs are affordable and easy to use, so there's no good reason not to employ one. At Solspace, whether a CDN is in place is one of the first questions we ask when assessing the health of a new client's website.
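Standard HTTP gives you separate knobs for the browser and for shared caches like a CDN: the s-maxage directive applies only to shared caches, so an edge node can hold a page far longer than any individual browser does. The numbers below are assumptions for illustration, not advice:

```python
# Illustrative Cache-Control values; the durations are assumptions.
#
#   max-age=60     -> browsers keep a local copy for one minute
#   s-maxage=3600  -> shared caches (CDN edge nodes) keep it for an hour
#
# The CDN absorbs most of the traffic, while browsers still re-check
# often enough to pick up fresh content quickly.
EDGE_FRIENDLY_HEADER = "Cache-Control: public, max-age=60, s-maxage=3600"
```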

The third infrastructure layer encountered by the web request is not always present, but its absence can suddenly become an issue for growing and evolving sites: the load balancing layer. When a web request makes it past the edge caching layer, it may be juggled by a load balancer between two or more web servers, an additional way of reducing server load and friction. Often load balancing is set up so that two identical web servers take turns satisfying web requests, sharing the load. This type of system may not be required when extensive and robust use of edge caching is possible, as in cases where most of the site content does not need to be dynamic. However, when a website is expected to function more like a web application, where most of its data is dynamic and specific to user behavior, a load balanced infrastructure makes a lot of sense and is a wise investment.
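The turn-taking itself is simple. Here is a minimal round-robin sketch; the backend addresses are hypothetical, and real deployments use nginx, HAProxy, or a cloud load balancer rather than application code like this:

```python
import itertools

# Two identical web servers split the request load evenly.
BACKENDS = ["http://10.0.0.11:8080", "http://10.0.0.12:8080"]
_rotation = itertools.cycle(BACKENDS)

def choose_backend():
    # Each call hands back the next server in the rotation.
    return next(_rotation)

for _ in range(4):
    print(choose_backend())
# http://10.0.0.11:8080
# http://10.0.0.12:8080
# http://10.0.0.11:8080
# http://10.0.0.12:8080
```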

In the context of cloud infrastructure, this idea of distributing load across machines has been taken to the nth degree. Web servers can be configured to autoscale nearly instantly through cloud architectures. If you have the financial resources and technical knowledge to support it, your stack's capacity can scale almost without limit. Of course, as with all extreme solutions, there is a price to pay for this kind of power.
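If it helps to see the shape of it, here is a toy scaling policy. Real cloud platforms implement this logic for you (for example, instance groups tracking a target CPU utilization); the thresholds below are pure assumptions:

```python
# A toy autoscaling policy, purely to illustrate the idea.
MIN_SERVERS, MAX_SERVERS = 2, 20

def desired_server_count(current, avg_cpu_percent):
    if avg_cpu_percent > 70 and current < MAX_SERVERS:
        return current + 1   # scale out under load
    if avg_cpu_percent < 25 and current > MIN_SERVERS:
        return current - 1   # scale in when idle, to control cost
    return current

print(desired_server_count(3, 85))  # 4
print(desired_server_count(3, 10))  # 2
```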

The fourth infrastructure layer encountered by the web request is the web server itself. This is the physical server box or the cloud instance that houses your website or web application. Configuration at this level is extremely important. Most web properties are served at this layer through the use of open-source software. The bad news is that there are a lot of configuration variations, and as we know, having lots of moving parts creates inherent risk. There are a lot of dumb mistakes a person can make that add friction where there should simply be flow. The good news offsetting this inherent risk is that there is also a lot of free knowledge and experience available on the web to help with this part, so common mistakes can be easily avoided.

The fifth infrastructure layer encountered by the web request is often a CMS (content management system). The systems used by most websites these days include WordPress, Drupal, Joomla, ExpressionEngine, and Craft CMS. The CMS is the tool that content authors and editors use to manage their site content. It also usually contains rules about the permissions required to access various resources. Additionally, the CMS usually includes templating capability, so that the look and feel of the site can be abstracted into the fewest rendering files needed for easy, frictionless ongoing maintenance. Within the CMS layer you'll find numerous options for reducing friction; in fact, this layer holds the biggest opportunities for reliability gains. Conversely, of course, it also contains multiple opportunities to increase friction instead.
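Templating is the piece most easily sketched. This minimal example uses Python's standard library rather than any particular CMS's template engine, but the principle is the same:

```python
from string import Template

# One layout serves every page: the shared look and feel lives in a
# single place, so a design change touches one template, not hundreds
# of pages. Real CMSes offer far richer engines; this is only the idea.
layout = Template("""<html>
  <head><title>$title</title></head>
  <body><h1>$title</h1>$body</body>
</html>""")

page = layout.substitute(
    title="About Us",
    body="<p>We build reliable websites.</p>",
)
print(page)
```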

The sixth layer of infrastructure encountered by the web request is the database layer. It's unusual to encounter a database layer without a CMS, but it does happen. To get the most value from this layer, there should be a reasonably user-friendly method of maintaining content within the database, which is why this layer tends to come paired with a CMS. Ensuring low friction at this layer means carefully reviewing the configuration and speed of the connection between the web servers and the database server. It also means reviewing the configuration settings at the heart of the database itself. But there's one more thing that's critical for reducing friction: consistent maintenance over time. Databases never have infinite capacity; they fall over when they get overloaded. So unused data must be regularly weeded out of the database layer. A well-designed pruning strategy will reduce future friction.
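A minimal sketch of a pruning job, assuming a hypothetical sessions table and a 30-day retention window; any real schedule and schema would be specific to the site:

```python
import sqlite3

# Stale rows are deleted on a regular schedule rather than
# accumulating forever. SQLite here keeps the sketch self-contained.
conn = sqlite3.connect("site.db")
conn.execute("""CREATE TABLE IF NOT EXISTS sessions (
    id INTEGER PRIMARY KEY,
    last_seen TEXT NOT NULL
)""")

# Remove sessions untouched for more than 30 days.
conn.execute(
    "DELETE FROM sessions WHERE last_seen < datetime('now', '-30 days')"
)
conn.commit()
conn.close()
```

Run from cron or a scheduler, a job like this keeps the database lean long before capacity becomes a crisis.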

Across all of these layers of infrastructure, the rule of friction applies. Low friction produces great rewards, while high friction produces nothing but frustration and failure. Directly correlated with the friction within each layer is its level of complexity. More complexity inevitably creates more opportunity for friction and failure, while building simplicity into the system as a whole, and into each layer, greatly reduces that risk. Though it may seem as though the realm of web infrastructure is purely technical, untouched by human nature, this is not true. Any given infrastructure stack is, in fact, rife with human frailty and error.

Through every layer of infrastructure, at every point in the path of the web request, the goal is for the system and all elements of the site to remain stable and function well. There's a good argument to be made that stability over time is one of the best measures of web reliability. And reliability, in the context of infrastructure, is synonymous with simplicity.