How To Internalize Site Reliability Engineering's Top 5 Golden Lessons
Most people who have worked as or around software engineers can probably identify the fishy smells that permeate a company poorly implementing Agile and Scrum: decomposing piles of bugs, issues, and tasks carried over from one sprint to the next, most with no real prospect of being closed or completed any time soon. There should be a connection between those leftovers and the overarching feature plans for the company's product, or at the very least between those plans and how the sprints are laid out for the developers. Paying homage to a company many consider to be at the forefront of distributed systems development, I'm going to tell you about some of the best ways I've found to internalize the most important lessons put forth in the O'Reilly book, Site Reliability Engineering: How Google Runs Production Systems. Hopefully, with better versions of these tools under your tool belt, we can all work together to make sure our office spaces don't end up smelling like Fisherman's Wharf in San Francisco.
Lesson One: Cooperation & Collaboration
Testing is one of the most important things (if not the most important thing) a company and its developers can do to ensure the long-term viability of the product(s) it makes. Developers and their managers are at their best when working in a spirit of cooperation with the coworkers they will be handing their labor over to. That said, I believe a good chunk of us understand from experience that this isn't always the easiest thing to accomplish. Pride in one's work can be a good thing, but it shouldn't come at the cost of cooperating with one's colleagues. When collaboration stops, people can start to segregate testing infrastructure and production configurations in the hopes of serving their own agenda against a perceived internal "competitor" instead of keeping the focus on the external customer. Over time, this segregation starts to damage customers' perception of the company's and the product's reliability.
Creating preemptive routines that are consistent and repetitive is one of the best ways I've found to internalize this cooperative lesson I'm spouting. Test plans should not be started when the developers are "finished" with their product and ready to hand it over; the perfect time to start designing tests is when you're designing the product. A preemptive set of meetings with the testers while a new part of the product is growing is an amazing way for developers to bring everyone onto the same page and keep knowledge from getting lost. Three one-hour meetings over three months are far better than one three-hour "handover" meeting where developers get defensive over every slight issue or misunderstanding.
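To make the "design the tests while you design the product" idea concrete, here's a minimal sketch of what that can look like in practice. The calculate_shipping_cost function, its behavior, and the dollar amounts are hypothetical placeholders I made up for the example, not anything from the book; the point is that the expected behavior gets written down as tests before the implementation exists.

    # Tests sketched during a design meeting with the testers, before any real
    # code exists. calculate_shipping_cost is a hypothetical function; the stub
    # below stands in for the implementation the developers haven't written yet,
    # so this test file can live in the repo (and fail meaningfully) from day one.

    def calculate_shipping_cost(order_total: float, weight_kg: float) -> float:
        raise NotImplementedError("behavior agreed on in design, not built yet")

    def test_free_shipping_over_threshold():
        # Decided with the testers up front: orders over $100 ship free.
        assert calculate_shipping_cost(order_total=120.00, weight_kg=2.0) == 0.0

    def test_flat_rate_below_threshold():
        # Everything else is a flat $7.99.
        assert abs(calculate_shipping_cost(order_total=35.00, weight_kg=2.0) - 7.99) < 1e-9

Run that with pytest and the failing tests become a shared to-do list that developers and testers already agree on, instead of a surprise at handover time.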
Lesson Two: Release Engineering
Planning and adhering to the policies and procedures for releasing a project are some of the most important things product owners can do, especially when it comes to approving source code changes for production environments. Sadly, at smaller companies some of the worst and most frequent offenders on both counts tend to be the very people who are supposed to enforce the rules. The section entitled "Enforcement of Policies and Procedures" on page 89 points out that Google's code review process is streamlined to the point that SREs know exactly which changes are included in a new release/build of a project and why they are there. That keeps everyone honest and saves time in plenty of troubleshooting scenarios. Internalizing this lesson is all about self-control, and my preferred way to achieve that is consistent discipline; without it, there is little standing in the way of power struggles and authority complexes causing all sorts of trouble.
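One low-tech way to practice that discipline is a pre-release check that refuses to bless a build until every commit in it can be traced back to a reviewed ticket. This is only a rough sketch built on my own assumptions: the v1.2.0/v1.3.0 tags and the PROJ-42 style ticket convention are placeholders, and it just shells out to plain git log, nothing Google-specific.

    # A rough pre-release check: list every commit going into a release and flag
    # any that can't be traced back to a reviewed ticket.
    import re
    import subprocess

    TICKET_PATTERN = re.compile(r"\b[A-Z]+-\d+\b")  # e.g. PROJ-42 in the commit subject

    def commits_between(old_tag: str, new_tag: str) -> list[tuple[str, str]]:
        """Return (sha, subject) for every commit in new_tag but not in old_tag."""
        out = subprocess.run(
            ["git", "log", f"{old_tag}..{new_tag}", "--pretty=format:%h%x09%s"],
            capture_output=True, text=True, check=True,
        ).stdout
        return [tuple(line.split("\t", 1)) for line in out.splitlines() if line]

    def untraceable_commits(old_tag: str, new_tag: str) -> list[tuple[str, str]]:
        """Commits with no ticket reference -- nobody can say why they're in the release."""
        return [(sha, subject) for sha, subject in commits_between(old_tag, new_tag)
                if not TICKET_PATTERN.search(subject)]

    if __name__ == "__main__":
        for sha, subject in untraceable_commits("v1.2.0", "v1.3.0"):
            print(f"{sha}  no ticket reference: {subject}")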
Lesson Three: Post Mortem Culture
Blameless Post Mortem Culture is my favorite of the five lessons I'm discussing in this post. On page 170, Google is serious when they say that if a culture of finger pointing and shaming individuals or teams for doing the "wrong" thing prevails, people will not bring issues to light for fear of punishment. A company is often only as strong as its product(s), and if issues keep piling up in the dark, then what does that signal about the health of the company? As individuals in the industry, it can sometimes be tough to remember that there are two sides to this lesson. One side is internalizing the humility it takes to quietly accept responsibility for your own mistakes while diagnosing and discussing the root causes of what went wrong. The other side is extending that same humility to other people when the mistakes are theirs, especially when things get difficult and they aren't showing much, if any, humility themselves. The greatest stepping stone I've found for gaining humility is being able to recognize my own faults and accept that I can't realistically be the best at everything I do. It's better to view each moment as a chance to better myself, rather than a competition to be won.
Lesson Four: Monitor Your Distributed Systems
There are several amazing reasons to monitor the important metrics of your distributed systems. These metrics can include hardware performance, error occurrences, and caching behavior. One of the best ways to improve your system and increase its performance is to let your monitoring influence how you prioritize the work you put into it. Turning these metrics into real-time dashboards helps minimize downtime by increasing visibility into active issues as soon as they come up. The topics below enumerate what engineers should be aware of as they design a new monitoring system for their product:
Choose Your Metrics Wisely
This is one of the main starting points for the decisions you make about the systems you're monitoring, so if good choices aren't made here, the problems trickle down into everything else. Typically there are two general types of metrics to monitor: those you can automate a response to, and those you have to build an alerting system around so a human gets involved.
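As a bare-bones illustration of that split, here's a sketch where one metric triggers an automated response and another only pages a human. The metric names, thresholds, and the restart/page functions are placeholders I made up for the example.

    # One metric gets an automated response, the other only pages a human.
    def restart_worker_pool() -> None:
        print("automated response: recycling the worker pool")

    def page_on_call(message: str) -> None:
        print(f"paging a human: {message}")

    # Metrics a machine is trusted to react to on its own: (threshold, action).
    AUTOMATED = {"worker_memory_percent": (90.0, restart_worker_pool)}
    # Metrics where the only sane response is getting a person involved.
    ALERT_ONLY = {"checkout_error_ratio": 0.05}

    def evaluate(snapshot: dict[str, float]) -> None:
        for name, (threshold, action) in AUTOMATED.items():
            if snapshot.get(name, 0.0) > threshold:
                action()
        for name, threshold in ALERT_ONLY.items():
            if snapshot.get(name, 0.0) > threshold:
                page_on_call(f"{name} is {snapshot[name]:.1%}, threshold {threshold:.1%}")

    evaluate({"worker_memory_percent": 93.0, "checkout_error_ratio": 0.08})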
Responsive, Real Time Dashboards
More often than not, there is a plethora of metrics to track for each system in the product. That means being able to see at a quick glance how everything is performing as a whole is extremely important for lowering the mean time to resolution.
Symptoms versus Causes
Your monitoring system should address two questions: what's broken, and why? The "what's broken" indicates the symptom; the "why" indicates a (possibly intermediate) cause. Typically both should point to runbooks that lay out how similar problems have been resolved in the past.
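Here's a tiny sketch of what keeping the symptom, the probable cause, and the runbook pointer together can look like; the fields and the wiki URL are illustrative placeholders, not a prescribed format.

    # Pairing the "what's broken" with the "why" and a runbook pointer.
    from dataclasses import dataclass

    @dataclass
    class Alert:
        symptom: str         # what the user sees is broken
        probable_cause: str  # the (possibly intermediate) why
        runbook: str         # how similar problems were resolved before

    slow_checkout = Alert(
        symptom="p99 checkout latency above 2s",
        probable_cause="database connection pool exhausted",
        runbook="https://wiki.example.com/runbooks/db-connection-pool",
    )
    print(f"{slow_checkout.symptom} -> see {slow_checkout.runbook}")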
The Four Golden Signals
The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four and page a human when one of them becomes problematic; do that much, and your service will be at least decently covered by monitoring.
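To make the four signals a bit more concrete, here's a toy pass over a window of request records that computes all four and pages when one crosses a threshold. The record fields, the capacity figure, the rough p99 calculation, and the thresholds are all assumptions made for the sake of the example.

    # A toy computation of the four golden signals over a window of requests.
    from dataclasses import dataclass

    @dataclass
    class Request:
        latency_ms: float
        failed: bool

    def golden_signals(window: list[Request], window_seconds: float, capacity_rps: float) -> dict[str, float]:
        latencies = sorted(r.latency_ms for r in window)
        return {
            "latency_p99_ms": latencies[int(0.99 * (len(latencies) - 1))],  # rough p99
            "traffic_rps": len(window) / window_seconds,
            "error_ratio": sum(r.failed for r in window) / len(window),
            "saturation": (len(window) / window_seconds) / capacity_rps,    # share of capacity in use
        }

    def page(reason: str) -> None:
        print(f"PAGE: {reason}")

    signals = golden_signals(
        [Request(120, False), Request(300, False), Request(2500, True), Request(90, False)],
        window_seconds=60, capacity_rps=0.2,
    )
    if signals["latency_p99_ms"] > 1000:
        page(f"p99 latency is {signals['latency_p99_ms']:.0f} ms")
    if signals["error_ratio"] > 0.01:
        page(f"error ratio is {signals['error_ratio']:.1%}")
    if signals["saturation"] > 0.8:
        page(f"running at {signals['saturation']:.0%} of capacity")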
Lesson Five: Eliminating Toil
What is toil? Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as a service grows. These tasks can be anything from gracefully decommissioning servers in a cluster to standing up a templated set of infrastructure for a brand-new client. This is why it's important to invest time in engineering project work that will either reduce future toil or add service features. By having a machine handle these tasks automatically, response times drop drastically, cost savings show up in the AWS bill, and a larger pipeline for onboarding new clients leads to increased revenue. Toil isn't always and invariably bad, and everyone needs to be absolutely clear that some amount of toil is unavoidable in the SRE role, and indeed in almost any engineering role. It's fine in small doses, but toil becomes toxic when experienced in large quantities. If you're burdened with too much toil, you should be very concerned and complain loudly.
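As an example of turning one of those toil items into project work, here's a skeleton for the server-decommissioning case. The drain_traffic, verify_drained, and terminate functions are hypothetical stand-ins for whatever your platform actually provides; the value is that the checklist a human used to walk through by hand becomes a loop.

    # Turning a manual decommission checklist into a script.
    import time

    def drain_traffic(host: str) -> None:
        print(f"removing {host} from the load balancer")

    def verify_drained(host: str) -> bool:
        print(f"checking that {host} is no longer serving requests")
        return True

    def terminate(host: str) -> None:
        print(f"terminating {host}")

    def decommission(hosts: list[str], settle_seconds: int = 30) -> None:
        """Walk each host through the same steps a human used to do by hand."""
        for host in hosts:
            drain_traffic(host)
            time.sleep(settle_seconds)  # let in-flight requests finish
            if verify_drained(host):
                terminate(host)
            else:
                print(f"{host} still has traffic, leaving it for a human to look at")

    decommission(["web-07.example.internal", "web-08.example.internal"], settle_seconds=1)

The first version doesn't have to be clever; it just has to make the tenth decommission cheaper than the first.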