Site Reliability Engineering (SRE) is a new discipline that is quickly gaining traction within the IT landscape. First conceived by Google, it has since been adopted mainly by large enterprises such as Netflix, LinkedIn, Target, Tesla and many more. It is known for its focus on automating manual tasks without enduring value (toil), for treating operations as a software problem, and for its unique approach to systems reliability. You might have heard of the advantages of SRE and want to introduce it to your organization. Perhaps you are a software engineer who wants to convince your peers to try it out, or a manager overseeing one or more teams who would like to introduce SRE to them. The question is: are you ready to start?
In this article I would like to give you some tips for judging how ready your organization is for this SRE journey and, if SRE has already been adopted, for determining the current level of SRE maturity.
As Google describes it, SRE is a superset of DevOps. Many of the SRE principles align with DevOps principles and build upon them. Good adoption of DevOps principles is therefore an important factor in successfully implementing SRE. If the DevOps maturity level is relatively low, you will likely benefit more from improving that first.
Google describes a number of levels in automation, the higher levels (lower on the list) being more advanced:
So, what makes the higher levels better than the lower levels? Why is it better for a system to be internally monitored than externally? If you use an external system to monitor another system, you are now managing two systems. This usually increases the effort to maintain this process.
A system that is internally automated also has more options for automation, as it can tap directly into all of its own functionality and therefore has more fine-grained control. If the automation is system-specific as opposed to generic, the impact of a particular automated task is not as broad, and the effort of using such tasks scales more linearly with the number of systems.
If you are low on this scale (less automation, and less flexible automation), there is more to be gained from automating, but some aspects of SRE might not fully come to fruition just yet. SRE works best if there is a foundation of automation upon which to build. Additionally, hiring SRE engineers is easier at the higher levels, as their profile usually includes skills that match more mature forms of automation.
When it comes to SRE as a term, there are usually three definitions used.
In practice, these definitions are often intertwined and mixed depending on the context of discussions.
These definitions are commonly used in a number of forms of adoption. Which form fits depends on the structure of your organization, how much knowledge about SRE is present, and whether some of its principles have already been adopted.
At every maturity level there are gains to be made from adopting SRE, and the aforementioned forms cater to these varying needs. The common thread in these forms is that as the SRE maturity level progresses, the scope of SRE widens and more of the service delivery flow is imbued with SRE practices and principles. Choose the form that suits the maturity level to make the most of SRE adoption.
So, now knowing that there are different forms to adopt based on the maturity level, how can you judge how high the SRE maturity level is? Assuming SRE is already implemented to some degree, there are some indicators.
The key takeaway here is to inspect how well SRE is integrated into the process: how effectively SREs handle automation and incident response, the degree to which they are involved in the design process, and whether there is executive buy-in. In other words, is SRE implemented well and is there support for it within the organization?
Next to the SRE maturity level as a whole, there is also a way to judge progression in terms of handling reliability. Handling reliability effectively is a key component of any SRE strategy, and how it is dealt with is a good indicator of SRE maturity.
The saying "it is better to prevent than to cure" also applies to reliability, so being at the very least proactive about incidents is a good move to make. However, you do not need to be very high on the spectrum per se. Becoming a visionary on reliability requires major investments, and the benefits might not outweigh the costs of doing so.
The key takeaway here is that there are various ways to judge how your organization fares in terms of SRE maturity. Depending on the maturity level, there are different rates of progression and approaches to advance on this level.
A more 'junior' organization might focus on making automation more generic, hiring consultants to help determine SLOs, and making its incident response more proactive.
A 'senior' organization will focus on developing tooling that comes with its own automation, broadening the scope of its existing SRE teams to include more of the service delivery flow, and sharing knowledge about SRE with other people. Regardless of the maturity level, there are gains to be had, but choosing the approach that fits the level will maximize them.
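For readers to whom the SLOs mentioned above are new, the arithmetic behind an availability SLO is small enough to sketch. The snippet below is a minimal illustration, not a prescribed method: the function names, the 30-day window and the 99.9% target are assumptions chosen for the example. It relies on the common convention that the error budget is the share of the window not covered by the SLO target.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) an availability SLO allows in a window.

    slo_target is a fraction, e.g. 0.999 for "99.9% available".
    The 30-day default window is an assumption for this example.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * total_minutes


def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Return the fraction of the error budget that is still unspent.

    A negative result means the budget has been exceeded.
    """
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% target, 30-day window, 10.8 minutes of downtime.
    print(round(error_budget_minutes(0.999, 30), 1))    # 43.2
    print(round(budget_remaining(0.999, 30, 10.8), 2))  # 0.75
```

A team early in its SRE journey could use a calculation like this to make the conversation about reliability targets concrete before investing in dedicated tooling.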
https://cloud.google.com/blog/products/devops-sre/the-five-phases-of-organizational-reliability