Web Reliability
46. Low Friction Through Speed and Stability
Mitchell Kimbrough
Founder & CEO
Out of all of the elements of the Web Reliability System, I have the most experience with two in particular: the execution layer of content change management, and the optimization of website speed and stability. Both areas revolve around the content management system and its ability to reduce friction. The speed and stability of a website, often anchored firmly within the CMS, are a core part of web reliability.
Speed and Reliability
Let’s talk about speed first. A fast car isn't necessarily reliable. In fact, it’s often true that ‘high performance’ equals ‘high maintenance.’ Porsche 911 owners don't expect their car to be a maintenance-free, dependable daily ride for the next 15 years. You might expect that of a Toyota Corolla or a Subaru Outback, however. They may not be racetrack worthy, but their lack of speed is compensated for by their reliability. In fact, there is arguably an inverse relationship between speed and reliability in cars. With web applications, it's a different story.
When we’re talking about reliable websites, “reliable” means customers are having their goals met and their desire reliably fulfilled. A website that fulfills customer desire quickly is by its nature a well-built website. When websites and web applications are fast, they are also reliable.
There is a multitude of data proving that slow-loading pages on a website create friction for a customer seeking to achieve a goal. Slowness becomes a barrier to progress. Pages that load quickly do the important job of promptly responding to customer desire and purpose, keeping them engaged, and moving towards their goal. When pages load quickly the customer doesn’t think about speed because they simply serve their purpose and stay out of the way.
Let’s go back to our Porsche 911. It's extremely fast, but not nearly as reliable as our Toyota Corolla. But both cars have something important in common – when operating at the speeds they were designed for, they are quite stable.
Here's where websites have something in common with our cars. With websites, stability also goes hand-in-hand with speed. A website with pages that load slowly is just as bad as one whose pages don’t load at all. Going forward, we'll combine speed with stability in this conversation, as they are so tightly linked.
As you begin ensuring the reliability of your web app, you will be working to optimize its speed as well as its stability. So we'll call this effort website speed/stability optimization. When doing this work, think about it in the context of the web request. This is the key to coming up with the right solutions. Follow the byte and you won't miss anything.
Let’s look at the web request. The customer comes to your website with a desire, expressed as a request via a specific URL on your site. Perhaps the request originated from a click on another site, or a search result, or from directly entering the URL in a browser. In every case, the request originates in a browser or mobile device. That software sends a call out across the web to request the contents of your specific URL. Empathize with the customer and their request, and follow the request through the system as it seeks successful resolution. All along the way, if you are paying attention, you will see opportunities to optimize for speed/stability. When you use these opportunities as the basis for your optimization work, the result will be the reliability you were looking for.
Your Connection, or “Is This Thing On?”
The quality of the web request is completely dependent on the user’s connection to the web. As far as I can tell, no website anywhere is entirely reliable. And there is nothing you can do for the customer with a slow internet connection or an unreliable device. You can only pray that it annoys them enough to fix it. But there is a great deal that you do have control over.
DNS Resolution
The initiation of the web request depends on DNS resolution. The DNS (Domain Name System) is basically the address book for the entire internet. This address book will tell you that our web app does not actually live at solspace.com, but rather at a numeric address such as 216.243.555.5 (an illustrative placeholder, not a real address). Computers are built with a language of numbers. However, the users are all human, and humans generally prefer their word-based human language, so the inventors of the interwebs wisely enabled website names to be made up of human language. But for the computer to function, those names must be translated into numerical IP addresses. This translation takes place at the DNS level. A collection of servers located around the world does the actual work of mapping domain names to IP addresses and supporting DNS resolution. DNS mappings do not change very often, and so a lot of this data gets cached at various levels. We’re all familiar with having to wait for some period of time for a change in the IP address behind a domain name to take effect across the web. This lag time is due to DNS caching.
At the DNS level there's something called the “Time to Live,” or TTL. This value indicates how long a record may be cached by a DNS server. It tells the various systems that cache a DNS lookup when they need to check back for fresh information. When we are preparing to launch a new web property, we lower this TTL a few days in advance of going live so that the caching time is reduced and our launch is picked up more quickly.
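To make the TTL idea concrete, here is a minimal sketch of a cache that forgets an answer once its TTL has run out. It is a toy model, not how any particular DNS server is implemented. The class name, the upstream lookup function, and the injectable clock are all assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class TtlDnsCache:
    """Toy resolver cache: remembers an answer until its TTL runs out."""

    def __init__(self, lookup: Callable[[str], str], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._lookup = lookup      # hypothetical upstream lookup function
        self._ttl = ttl_seconds    # how long an answer stays fresh
        self._clock = clock        # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (ip, expiry)

    def resolve(self, name: str) -> str:
        now = self._clock()
        cached = self._entries.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]       # still fresh: serve the cached answer
        ip = self._lookup(name)    # expired or never seen: ask upstream again
        self._entries[name] = (ip, now + self._ttl)
        return ip
```

With a long TTL, a changed record keeps being answered from the cache until the old entry expires. That is why lowering the TTL ahead of a launch shortens the wait.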
Other than manipulating TTL, there is not a lot you can do to influence latency at the DNS lookup level. You're completely constrained by how the web itself was built and how your specific ISP uses it.
Device Request to Server: CDN First
You click on the solspace.com URL, and your device requests a DNS lookup of that URL. This means that the device has asked what the numeric IP address of the domain name part of the URL is. Once it receives the IP address (the translation of “solspace.com” into numbers), it is ready to make a request to your server.
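The lookup step can be sketched in a few lines of Python using the standard library's socket.getaddrinfo. The helper name resolve_host and the swappable resolver parameter are assumptions added for illustration. The default port of 443 is likewise just a convenient choice.

```python
import socket
from typing import Callable, List


def resolve_host(hostname: str, port: int = 443,
                 resolver: Callable = socket.getaddrinfo) -> List[str]:
    """Return the unique IP addresses the resolver reports for hostname."""
    results = resolver(hostname, port, proto=socket.IPPROTO_TCP)
    addresses: List[str] = []
    # Each result is (family, type, proto, canonname, sockaddr);
    # the IP address is the first element of sockaddr.
    for _family, _type, _proto, _canonname, sockaddr in results:
        ip = sockaddr[0]
        if ip not in addresses:
            addresses.append(ip)
    return addresses
```

Calling resolve_host("solspace.com") on a connected machine would return whatever addresses the local resolver currently reports. The answer depends on the network and on caching, so no output is shown here.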
But wait. There’s an opportunity to optimize speed/stability here. The specific IP address of a server is requested along with the rest of the URL that identifies the resource being sought. Let’s look at the flow of the request to find the opportunity.
Normally the server located at a given IP address such as 216.243.555.5 will respond with an answer to the request, but here is the opportunity. What happens if you’ve put a CDN (Content Delivery Network) in front of your server? We tend to use Cloudflare quite a bit so I'll use that as an example.
Cloudflare is a caching service that sits in front of your website in the request flow. It works at the name server and DNS level to intercept traffic headed to your server. Cloudflare, using the caching rules you have provided for it, will attempt to save copies of your web pages and page assets across its global network of servers. When a customer requests one of those assets and a cached copy exists, the request is completed with the copy on a server as close to that customer as possible. This greatly improves the overall speed of your web app. In fact, within the Web Reliability System, the intelligent use of a CDN is given significant importance. A website running without one is just asking for reliability trouble.
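The hit-or-miss behavior of a CDN edge server can be sketched as follows. This is a toy model under stated assumptions, not Cloudflare's implementation or API. The class name, the list of cacheable file suffixes, and the origin fetch function are all hypothetical.

```python
from typing import Callable, Dict, Iterable


class EdgeCache:
    """Toy CDN edge node: serve cached copies, fall back to the origin."""

    def __init__(self, fetch_from_origin: Callable[[str], bytes],
                 cacheable_suffixes: Iterable[str] = (".css", ".js", ".png", ".jpg")) -> None:
        self._fetch = fetch_from_origin             # hypothetical origin call
        self._cacheable = tuple(cacheable_suffixes)  # stand-in for caching rules
        self._store: Dict[str, bytes] = {}
        self.origin_hits = 0                         # how often the origin was asked

    def get(self, path: str) -> bytes:
        if path in self._store:
            return self._store[path]                 # cache hit: origin not involved
        self.origin_hits += 1
        body = self._fetch(path)                     # cache miss: go to the origin
        if path.endswith(self._cacheable):
            self._store[path] = body                 # keep a copy for next time
        return body
```

In this sketch a stylesheet is fetched from the origin once and served from the edge afterward, while a dynamic path such as a cart page reaches the origin on every request.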
Your origin server is still involved of course. A CDN like Cloudflare still needs to know where to find the content for any given URL request. Your server's ability to respond quickly is still hugely important.
Device Request to Server: Load Balancer Next
So, a device has requested a URL. The domain name part of the URL has been checked against a DNS lookup and an IP address has been found. The device then requests this IP address, asking for a resource. Hopefully, a CDN like Cloudflare or Fastly has been put in place to optimize the process and is waiting to serve a cached resource. But if not, then your actual servers are going to have to manage the request on their own.
A frequently used option for handling traffic is the use of a load balancer. A load balancer is a server that does the job of juggling traffic between multiple web servers. Basically, it is a traffic cop that maintains a consistent flow by utilizing more than one machine to handle the load of traffic from the web, as needed.
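One simple way a load balancer can juggle traffic is round-robin rotation that skips machines marked as unhealthy. The sketch below assumes that strategy. Real load balancers offer many others, and the class, method, and server names here are hypothetical.

```python
from itertools import cycle
from typing import Iterable, List, Set


class RoundRobinBalancer:
    """Hand requests to backends in turn, skipping any marked as down."""

    def __init__(self, backends: Iterable[str]) -> None:
        self._backends: List[str] = list(backends)
        if not self._backends:
            raise ValueError("at least one backend is required")
        self._ring = cycle(self._backends)   # endless rotation over the backends
        self._down: Set[str] = set()

    def mark_down(self, backend: str) -> None:
        self._down.add(backend)

    def mark_up(self, backend: str) -> None:
        self._down.discard(backend)

    def next_backend(self) -> str:
        # Try each backend at most once per call so we never loop forever.
        for _ in range(len(self._backends)):
            candidate = next(self._ring)
            if candidate not in self._down:
                return candidate
        raise RuntimeError("no healthy backends available")
```

If one web server fails its health check, the rotation carries on across the remaining machines, which is the stability benefit described above.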
If you are using a service like Cloudflare, it is handling most of the requests from the web and caching resources effectively. However, if your web app is mostly dynamic, meaning that a fresh new copy of the requested resource needs to be obtained and served in real time, then a load balancer is a good idea. Of course, this is one of those areas in the Web Reliability System that can go to a considerable depth and become a complex component of your system. We recommend working with an expert on the question of CDNs and load balancing if you anticipate significant traffic on your website.
Device Request to Server: Finally the Web Server
We have moved along through the request flow, and the customer's request for a resource at a given URL has made it through DNS lookup, CDN, and load balancer. The request has finally made it down to the actual web server that hosts the application which houses the content or functionality the customer seeks. Now is the time for the server to respond to and hopefully fulfill the request.
Here's a good opportunity to talk about your web host. I doubt very much that you own your own servers and that they are running under your desk next to the recycle bin. If they are, then 'OUCH!' What if you are hit by a beer truck on the way back from lunch, and your server goes down? Who would know where your server is or what to do with it? What happens if you spill your coffee all over the box? There is a reason that almost nobody maintains their own servers anymore, but rather uses a hosting company.
So, let’s assume you’re a legitimate website owner who is using a hosting company and paying them to host your web property. Here is an opportunity for you to really optimize reliability. Because you are a knowledgeable customer, you can carefully select a host and a plan that is itself optimized for speed and stability. In fact, you can evaluate your web host with the Web Reliability criteria more broadly. Ask yourself about the support team. Who are they? How big is that team? Is there a sense of ownership by individuals of support issues? Or does your issue get passed around among an assortment of equally confused IT people? It is well worth spending the time needed to assess the hosting company and their track record of performance because the speed and effectiveness of the hosting service bear directly on the speed and stability of your web property. If you select a reliable and excellent web host, you’ll find they are exceptionally well-positioned to give you guidance on other aspects of your tech stack that could be optimized for better performance.
In the context of this conversation, what we most care about is your web host's IT stack. Most hosts these days will offer you virtualization. This means they will offer you a slice of a machine for a lower price than what a whole machine would cost you. You share a machine with other customers. Because of this, the makeup and configuration of these boxes are extremely important. For example, one of our company’s favorite hosts for many years now, ArcusTech, converted their servers over from machines with physical hard drives to those with SSD storage. SSDs, or solid-state drives, run cooler and faster than their older hard drive cousins. These are next-generation servers that deliver incredible performance for a fraction of the cost of their predecessors. Make sure your web host has modern technology, is forward-looking, and adopts IT advancements steadily. This will result in high speed and low friction.
Your qualified web host also brings a great deal of expertise to the table. A Linux type of server may be set up by almost anyone with a connection to Google. But it takes real expertise to configure web servers optimally for the task to which they have been designated. A number of web hosts go as far as to optimize their machines for the software that will be running on them. For example, you can obtain hosting plans optimized for your specific CMS, such as Craft or ExpressionEngine or WordPress. This can greatly improve site performance.
Your host's connection to the internet is also important. If you host your site with your cousin’s friend’s nephew’s business, ‘Bubba's Web Hosting and Wing Shack’, you may find your server connected to the Internet on an antique 110 baud modem. Opting for a professional hosting provider ensures you have the services of a real data center. In some cases, your hosting provider's data center might be shared by a CDN provider, in which case the connection between the CDN service and your origin web server only needs to span the distance between a few racks in the same room. This can offer great performance boosts.
I want to emphasize context again because it matters that we're discussing an overview system for thinking about web reliability, not an in-depth study of each component. Nevertheless, we must discuss “The Cloud.” The Cloud is a fancy term for distributed, virtualized hosting that includes auto-scaling features. Cloud infrastructure is itself a multi-billion-dollar industry, one that warrants in-depth discussion with regard to speed, stability, and reliability. In our current context, it is useful to point out that simply running a website in the cloud does not guarantee speed. When you choose cloud computing like Amazon Web Services or Google Cloud, you are often still responsible for setting up and maintaining your server instances. And so, the quality of your team comes into the equation once more. Do you have a server team? Is the server team set up to manage a group of server instances with optimal configurations and auto-scaling rules? In a lot of cases yes, but in many cases no. All of this is part of the calculus behind speed, reliability, and stability. The Cloud can be a great ally in the mix of creating a good website, but it too requires care and feeding.
Let’s go back to the request and locate ourselves in the flow once more. A resource has now been requested from the web server that hosts the application which houses the content or functionality the customer seeks. And now the server must respond. The first step in the server response is validating the request. The device has asked the server if a specific resource exists. The server responds with a code. If the requested resource (the URL) does not exist on the server, a 404 code is returned. If the resource has been moved to a different URL, a 301 redirect is issued. If a redirect is issued, the device will then ask for the resource at the new location. Once the URL request lands on a server where it is found to be valid, a 200 code is returned along with the actual content of the requested URL.
You can really tangle up speed and reliability in this initial server response. If a server does not properly issue 404 codes, indexing bots like those used by Google and Bing will get annoyed and punish you by demoting your site ranking. Improper 404 handling is an indicator that your site is derelict and not being maintained. If you have too many redirects, each request takes too long to resolve, because every redirect sends the device back out to ask again at a new location. Slowness means friction. Friction degrades reliability.
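The 404, 301, and 200 flow described above can be sketched as a small function that follows redirects and counts the hops. This is a simplified model of what a device does. The function name, the dictionaries standing in for the server's pages and redirect rules, and the limit of five hops are all assumptions for illustration.

```python
from typing import Dict, Tuple


def resolve_request(path: str, pages: Dict[str, str], redirects: Dict[str, str],
                    max_redirects: int = 5) -> Tuple[int, str, int]:
    """Return (final status code, final path, number of redirects followed)."""
    current = path
    hops = 0
    while True:
        if current in redirects:
            # 301: the resource moved, so the device must ask again elsewhere.
            if hops == max_redirects:
                raise RuntimeError("too many redirects")
            current = redirects[current]
            hops += 1
        elif current in pages:
            return 200, current, hops   # found: content would be returned here
        else:
            return 404, current, hops   # no such resource on this server
```

Every hop in the count is another round trip before the customer sees anything, so a chain that goes through several old URLs is worth collapsing into a single redirect to the final destination.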
With cloud computing, you have the additional capability of storing data in the cloud, and autoscaling as well. The speed of the connection between the web server and the data source will be important for your success.
CMS or Other Software
We’re deep into the customer flow now. The customer's web request is making its way through your stack. You've optimized as much as possible up to this point, and things are flowing well. You've even managed to get your web server and your database server to play nicely together. But huge opportunities for creating additional speed and stability improvements still remain.
It’s very likely that you’re running a content management system (CMS) or perhaps a web framework like Laravel, Yii, or Symfony. These systems are high value, speeding up development time, and making it easier for your development team to offer new features to your customers. But they can also create a drag on server performance. The good news is that there tend to be many opportunities to optimize at the code level here. Simply keeping a regular schedule of monitoring and maintenance for scripts and server routines can result in incrementally improved page load performance. Additionally, there are great tools like New Relic and Pingdom Server Monitor for application performance monitoring, which can help optimize speed and stability.
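The core idea behind application performance monitoring is timing individual routines and flagging the slow ones. The sketch below shows that idea with a small decorator. It is not the API of New Relic or Pingdom. The decorator name, the threshold parameter, and the slow_calls list are hypothetical.

```python
import functools
import time
from typing import Callable, List, Tuple

# Records (function name, elapsed seconds) for calls that exceeded a threshold.
slow_calls: List[Tuple[str, float]] = []


def timed(threshold_seconds: float, clock: Callable[[], float] = time.perf_counter):
    """Decorator factory: record any call that takes threshold_seconds or longer."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = clock()
            try:
                return func(*args, **kwargs)
            finally:
                # Runs whether the call succeeded or raised.
                elapsed = clock() - start
                if elapsed >= threshold_seconds:
                    slow_calls.append((func.__name__, elapsed))
        return wrapper
    return decorator
```

Wrapping a template render or a database query this way and reviewing slow_calls on a regular schedule is a small-scale version of the monitoring routine described above.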
Another opportunity may be found in examining the nature of the functions you are offering to the customer. When customers are looking for specific content, they usually need to search within a website or web app. Your search tool can be a high-value part of your web environment. An entire industry worth billions of dollars can vouch for the fact that high-speed search means reliable revenue. But your search tool is also perfectly positioned to slow things down. Perhaps you run your own search tool, which means your team is responsible for indexing, caching, and overall query speed. Or maybe you’re using a third-party search solution on your site from a provider such as Algolia, Google, or Bing. Whatever your search solution, it is very much worth your while to look at opportunities for speed enhancement.
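If you run your own search tool, indexing is where much of the speed comes from: the work of scanning content is done once, ahead of time, so each query becomes a quick lookup. Below is a minimal sketch of an inverted index, assuming simple whitespace tokenization and all-terms-must-match queries. The function names and the document structure are hypothetical, and production search engines do far more.

```python
from collections import defaultdict
from typing import Dict, Set


def build_index(documents: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercase word to the set of document ids that contain it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            token = word.strip(".,!?;:\"'()")   # drop surrounding punctuation
            if token:
                index[token].add(doc_id)
    return index


def search(index: Dict[str, Set[str]], query: str) -> Set[str]:
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    matches = [index.get(term, set()) for term in terms]
    return set.intersection(*matches)
```

The index is built once when content changes, and each customer query then costs a few dictionary lookups instead of a scan over every page.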
Request Lands in the Browser
Whew! This little web request has traveled a very long way by now. It has traversed our entire IT stack and bounced around numerous locations on the internet. And now, at last, our web server has provided the requested data, and the user will receive what they were looking for. Now is the moment when the browser or mobile device will attempt to render the code and provide the customer with an attractive and useful picture.
Even though we’re nearly at the end of the process, there is still an opportunity to optimize further. A browser may be asked to draw simple things like text, or complex things like images and HTML5 graphics. Additional speed may be gained by striking an optimal balance between how fancy you’d like the browser to make your web page and how fancy the customer actually needs the web page to be. The site should be tailored to meet the real-world needs of your customers, and effectively support your brand, but it’s likely there is some editing you can do. The data may have traveled very efficiently to the browser on the customer's computer, and arrived ready to draw, but the final step of rendering may be quite demanding all by itself. Review everything that will need to be rendered, and make sure what you sent is worth the work it will take for the customer to receive it. In this case, less may be more. Remember customer friction.
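One way to review what the browser is being asked to handle is to total the size of a page's assets and compare it against a budget. The sketch below assumes you already have asset sizes in bytes from your build or from browser tooling. The function name is hypothetical, and the budget is whatever figure your team chooses, not a recommended value.

```python
from typing import Dict, List, Tuple


def over_budget(assets: Dict[str, int], budget_bytes: int) -> Tuple[int, List[str]]:
    """Return total page weight and, if over budget, assets from heaviest to lightest."""
    total = sum(assets.values())
    if total <= budget_bytes:
        return total, []                      # within budget: nothing to trim
    heaviest = sorted(assets, key=lambda name: assets[name], reverse=True)
    return total, heaviest                    # start trimming from the top
```

When a page is over budget, the returned list puts the heaviest assets first, which is usually where the editing pays off most.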