Site Reliability Engineering (SRE) is a relatively new discipline that is quickly gaining traction in the IT landscape. First conceived at Google, it has since been adopted mainly by large enterprises such as Netflix, LinkedIn, Target and Tesla. It is known for its focus on automating away toil (manual tasks without enduring value), for treating operations as a software problem, and for its distinctive approach to systems reliability.
SRE is a multi-faceted discipline, and there are many ways to implement it. You can create Service Level Objectives (SLOs) to define the level of service to deliver to users, reduce toil with targeted measures, define a risk tolerance for a service, and practice release engineering to make releases streamlined and consistent.
These, amongst some other practices, primarily dictate a certain way of working. However, some practices are more focused on designing systems to be reliable, highly available and robust. In other words, I see two categories:
Not all practices are equally easy to apply in a given business. Many of the practices Google describes in its SRE Handbook stem from dealing with its own growing pains. As the organization grew, it needed cleverer solutions to keep delivering a good service while using resources effectively. Small to medium businesses, however, do not face some of these challenges and use resources differently. Some examples:
Given the above differences, the way small to medium businesses can implement SRE differs. For example:
Even though mostly off-the-shelf software is used, you can still apply a lot of the principles in SRE. Instead of using them to design your own software, try to understand the principles and buy software which is built with these principles in mind. For example, you can buy software which has high availability as a feature.
Being able to create your own software gives you more opportunities to reduce toil, as you have more control over the way your infrastructure is managed. Even without that, you can do a lot with scripts, pipelines or small applications. A routine clean-up in an external system, for example, could easily be encapsulated in a Python script run by a cron job.
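To make the clean-up idea concrete, here is a minimal sketch of such a script. It assumes the "external system" simply leaves files behind in a directory that should be pruned after a number of days. The function name `remove_stale_files` and its parameters are hypothetical, chosen only for illustration; a real clean-up would more likely call the API of the system in question.

```python
import time
from pathlib import Path


def remove_stale_files(directory, max_age_days, now=None, dry_run=False):
    """Delete regular files in `directory` last modified more than
    `max_age_days` days ago.

    Returns the sorted names of the files that were removed (or that
    would be removed when `dry_run` is True). All names here are
    illustrative assumptions, not part of any particular product.
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 24 * 60 * 60  # age threshold in seconds
    removed = []
    for path in Path(directory).iterdir():
        # Only touch plain files; subdirectories are left alone.
        if path.is_file() and path.stat().st_mtime < cutoff:
            if not dry_run:
                path.unlink()
            removed.append(path.name)
    return sorted(removed)
```

Saved as a script that calls `remove_stale_files` on the right directory, this could be scheduled with an ordinary crontab entry. Running it with `dry_run=True` first is a cheap way to check what would be deleted before letting it run unattended.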
Many principles can be implemented with off-the-shelf software and widely available documentation. Service Level Objectives can be tracked in many monitoring solutions, such as Datadog and Splunk. Tools such as VictorOps help engineers be on call effectively.
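Monitoring tools normally do this arithmetic for you, but it helps to see what tracking an SLO amounts to. The sketch below assumes a simple request-based availability SLO: the target (for example 0.999) leaves a fraction of requests that are allowed to fail, and that fraction is the error budget. The function name and the fields it returns are hypothetical and do not correspond to any specific tool's API.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Summarise a request-based availability SLO over one time window.

    `slo_target` is a fraction such as 0.999. The returned keys are
    illustrative assumptions, not a vendor format.
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")

    availability = (total_requests - failed_requests) / total_requests
    # The error budget is the number of requests allowed to fail.
    allowed_failures = (1 - slo_target) * total_requests
    # Fraction of the budget still unspent; negative means it is exhausted.
    budget_remaining = 1 - failed_requests / allowed_failures
    return {
        "availability": availability,
        "allowed_failures": allowed_failures,
        "budget_remaining": budget_remaining,
        "slo_met": availability >= slo_target,
    }


# Example: a 99.9% target over one million requests with 250 failures
# leaves roughly three quarters of the error budget unspent.
report = error_budget_report(0.999, 1_000_000, 250)
```

A shrinking `budget_remaining` is a signal to slow down risky changes, which ties the SLO back to the risk tolerance you defined for the service.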
Your business might not benefit from all the scale advantages of companies like Google, but investments in scalability can still pay off. Using container orchestration for workloads means applications scale in a consistent and controlled manner. Cloud providers offer comparable capabilities for VMs. Having good control over scaling also helps during maintenance, as you can easily take a node out of rotation temporarily to work on it.
Google describes many principles in its SRE Handbook. The underlying principles are universally applicable, but the degree to which they can be implemented, and how, is likely to differ considerably between smaller businesses and large enterprises such as Google. Still, by choosing software that adheres to these principles and provides features that support SRE, using small pieces of software for automation, and implementing practices to the appropriate degree, your business can also benefit from SRE.