Web Reliability
Chapter 49. Continuous Infrastructure Monitoring
Mitchell Kimbrough
Founder & CEO
Infrastructure monitoring is a big subject, like many others we’ve touched on during this discussion of the Web Reliability System. Let’s look at the aspects of it that are relevant to our goals of flow and reliability.
Our job is to make sure that the customer flow through a website is smooth and enjoyable. It must genuinely matter to us that the customer's purpose is honored, supported, and unobstructed. And we have to ensure that the process is validated and monitored from start to finish. To ensure the best possible outcomes, we have implemented a complete system of tactical infrastructure to guide the customer's flow. Now that it’s in place and we’ve taken all available opportunities for optimization, we need to monitor it for uptime and effectiveness.
We’ll begin by looking at the infrastructure monitoring problem from the point of view of the web request. Remember that the request begins when your customer's device asks for your specific URL. Which brings us to DNS.
DNS monitoring
Your DNS (Domain Name System) maintains a list of domain names along with associated resources, such as IP addresses. DNS monitoring means keeping good security protocols around the system that routes your domain name to an IP address, and keeping tabs on its activity. Your DNS is your gatekeeper. Our clients occasionally get anxious about security threats from inside and outside their team, and when this happens, I reassure them that as long as they maintain strict control over their DNS records, they have full ownership of their website. It doesn’t need to be complicated. There are services available that help monitor DNS and guard against both human error from within and malicious attacks from outside. These are simple services that are well worth the trouble and the investment, especially when you consider the potentially dire consequences of not using them.
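To make the idea of keeping tabs on DNS concrete, here is a minimal sketch in Python of one check such a service might run: resolve the domain and compare the answer with the addresses you expect. The function names and the addresses passed in are hypothetical, and a real service would do considerably more (registrar locks, record change alerts, and so on).

```python
import socket


def resolve_ipv4(domain):
    """Return the set of IPv4 addresses the local resolver reports for a domain."""
    infos = socket.getaddrinfo(domain, None, family=socket.AF_INET, type=socket.SOCK_STREAM)
    return {info[4][0] for info in infos}


def check_dns(domain, expected_ips, resolver=resolve_ipv4):
    """Compare resolved addresses with the expected ones and report any drift."""
    actual = set(resolver(domain))
    expected = set(expected_ips)
    return {
        "ok": actual == expected,
        "unexpected": sorted(actual - expected),  # addresses nobody approved
        "missing": sorted(expected - actual),     # addresses that have disappeared
    }
```

Run on a schedule, a result where "ok" is false is the cue to find out who changed a record and why.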
Cache monitoring
The importance of CDN caching has been discussed, but let’s look at how best to maintain its value over time. Virtually every site will benefit from the speed boosts to be gained through smart caching, but the caching tactics need to be monitored. Here’s an example of how this type of monitoring can save the day. Our team recently made an easily avoidable mistake on a website after putting considerable work into setting up a CDN caching layer. We inadvertently added an HTTP header back into all site pages that told the CDN and the browser not to cache anything. We completely missed it. What a relief that the caching performance monitoring we’d set up caught the error! So, monitor your caching layer. This simple execution-level tactic will translate directly into website speed, which in turn will translate directly into reliability.
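A check for the kind of mistake described above can be quite small. The sketch below, in Python, inspects a page's response headers and lists the Cache-Control directives that commonly defeat CDN caching. The function name and the rule set are our own simplification for illustration, not a full reading of the HTTP caching rules.

```python
def cache_blockers(headers):
    """List Cache-Control directives that commonly stop a CDN from serving cached copies."""
    # Header names are case-insensitive, so normalize them before looking one up.
    normalized = {name.lower(): value for name, value in headers.items()}
    raw = normalized.get("cache-control", "")
    directives = [part.strip().lower() for part in raw.split(",") if part.strip()]
    # Simplified rule set (an assumption): these three are the usual culprits.
    blocking = {"no-store", "no-cache", "private"}
    return [directive for directive in directives if directive in blocking]
```

Feed it the response headers from a scheduled request to a handful of key pages, and raise an alert whenever the returned list is not empty.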
Server monitoring
The problem of server uptime is as old as the web, and there’s a wealth of information that’s been published on the topic. Even so, monitoring server uptime yourself will keep you proactive about optimizing your customer's flow. Continuously tracking uptime and speed involves multiple levels of server monitoring. Most likely you’re outsourcing your hosting, so someone else is on call 24/7 to deal with the issue of servers going down. But that doesn’t let you off the hook. Your task as the website owner is to fully understand the hosting company’s policies and procedures and to know what level of monitoring you are paying for. Previously we discussed the importance of high-touch management. Your relationship with your web host, with regard to uptime monitoring, greatly benefits from the high-touch principle. When your site goes down, and it surely will at some point, do you have an empathetic relationship with an actual human at your hosting company who will help you monitor and deal with a downtime situation? You'll be glad if you do.
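Whatever your host monitors on your behalf, a basic probe of your own is cheap to run. Here is a minimal sketch in Python that requests a page and reports whether the site is up, slow, or down. The function names and the two-second threshold are assumptions for illustration only.

```python
import time
import urllib.request


def fetch_status(url, timeout=10):
    """Request the URL and return its HTTP status code (raises on network or HTTP errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def probe(url, fetch=fetch_status, slow_after=2.0):
    """Run one uptime check and report whether the site is up, slow, or down."""
    started = time.monotonic()
    try:
        status = fetch(url)
    except Exception as error:  # any failure counts as downtime for this check
        return {"state": "down", "detail": str(error), "seconds": time.monotonic() - started}
    elapsed = time.monotonic() - started
    state = "slow" if elapsed > slow_after else "up"
    return {"state": state, "detail": status, "seconds": elapsed}
```

A scheduler would call probe every minute or so and notify someone after a few consecutive "down" results, which avoids paging the team over a single dropped request.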
Database monitoring
You have ownership of a good-sized website, so in all likelihood, your content has been abstracted out into a database. A crashed server is a major issue, but the truth is that databases tend to die more frequently than web servers. And the usual cause of death is that you have exceeded your database storage quota. Now, a database is really just a series of special files sitting on a drive on a machine somewhere. Your hosting agreement means you have paid for a certain allocation of storage space on that drive. But that specific allocation means there is no room to grow beyond the bounds of what you have paid for. So, if you exceed it, your database crashes.
How can you ensure that this doesn’t happen? By staying aware of your database storage consumption. Simple monitoring can go a long way toward preventing premature database death. The types of problems that come up are usually as simple as a database table bloated with the by-products of some routine process that does not also incorporate regular data pruning. In these situations, unnecessary data gets saved to the database every day, and eventually overwhelms it, just as weeds and ivy will take over your yard and eventually begin climbing over the house. If you let the problem go unmonitored for long enough you can't even get in the front door, and given enough time it will eventually pull the house down.
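Staying aware of storage consumption can be as simple as comparing usage with the quota and looking at which tables are largest. The Python sketch below assumes you can already obtain those numbers from your host or database. The function names, the 80% warning threshold, and the table names in the usage note are hypothetical.

```python
def storage_status(used_bytes, quota_bytes, warn_at=0.8):
    """Classify database storage use against the quota you are paying for."""
    if quota_bytes <= 0:
        raise ValueError("quota_bytes must be positive")
    fraction = used_bytes / quota_bytes
    if fraction >= 1.0:
        level = "critical"
    elif fraction >= warn_at:
        level = "warning"
    else:
        level = "ok"
    return level, fraction


def largest_tables(table_sizes, top=3):
    """Return the biggest tables first, the usual place to look for unpruned data."""
    return sorted(table_sizes.items(), key=lambda item: item[1], reverse=True)[:top]
```

For example, largest_tables({"orders": 10, "sessions": 900, "logs": 300}, top=2) puts "sessions" and "logs" at the head of the list, pointing straight at the tables most in need of pruning.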
API integration monitoring
You’ve likely seen the light by now, and have already begun to find ways to enhance the customer flow through your system. Perhaps you’ve already gotten to work integrating your site with outside sales and fulfillment systems. Maybe e-commerce orders are being passed from your site into a fulfillment warehouse system. Or maybe sales leads are flowing from your sales pages into your CRM. Reliability at the execution level can make or break your relationship with your customers. So what happens when the API connection between the two systems breaks down? The flow of customers and revenue comes to a halt. And how will you know what’s happened? The conversation with your team will go something like this: "Hey, why were there no sales leads this week? That's really weird. Is our AdWords campaign running? Is our website on? Is our lead gen form working? What about our email, has it broken? Did something happen to the sales team?"
If the failure in the sales pipeline is at the API integration layer, it will be close to invisible, and it will take time to track down, which in turn delays the restoration of the flow. A system for monitoring this API integration can be critical to avoiding interruptions. It's especially important when you consider how hard the underlying problem is to detect, and the high cost of lost time and lost connections.
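Because a broken integration shows up as silence rather than as an error, one workable approach is to alert on the silence itself. Here is a minimal sketch in Python, assuming your site records the time of the last lead or order that the downstream system accepted. The function name and the 24-hour window in the usage note are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def integration_is_stale(last_success, max_silence, now=None):
    """Return True when nothing has reached the downstream system within the allowed window."""
    # Both timestamps are expected to be timezone-aware so they compare correctly.
    now = now or datetime.now(timezone.utc)
    return now - last_success > max_silence
```

With max_silence set to timedelta(hours=24), a lead pipeline that normally delivers several records a day would raise an alert after one quiet day instead of one quiet week.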
Your website is built upon ever-increasing layers of infrastructure. Each of these layers can be trusted but should also be verified through continuous monitoring.