Site Reliability Engineering (SRE) is a rapidly expanding practice in the world of IT management. It offers a means for organizations to optimize the value of their IT operations via increased collaboration, bridging development and operations teams while also prioritizing the needs of end-users.
The concept of SRE was originally conceived at Google. Ben Treynor, the founder of the company’s SRE team, described it as:
“what happens when you ask a software engineer to design an operations function”
The concept sees development and operations staff sharing responsibilities in a way that fosters a holistic understanding of different viewpoints and priorities. At the same time, ‘site reliability engineers’ will constantly look for ways to automate and improve tasks for the sake of optimizing efficiency.
Mr. Treynor went on to say:
“So SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, substitute automation for human labor”
Crucially, SRE also has a focus on product reliability. Practitioners follow a formulaic approach that prioritizes this, balancing the drive to develop new code with the need to keep things stable for customers.
Although it is often paired with DevOps, SRE is still relatively unknown to many businesses. This is especially true when it comes to stakeholders who lack first-hand knowledge and experience of modern IT practices.
So, what are the most valuable (and tangible) aspects of SRE? Here are the most important points to keep in mind when selling SRE to your business.
The Relationship Between SRE & DevOps
If stakeholders and managers in your business are not aware of SRE, they are likely to at least be aware of DevOps. This IT methodology has exploded in popularity over the last decade, to the extent that even more established frameworks like ITIL and COBIT have factored it into their best practices.
There are a number of similarities between DevOps and SRE. While DevOps focuses on boosting collaboration between development and operations, site reliability engineers work to bridge the gap between them. They will often move between teams, departments, and even companies, taking on additional tasks as necessary and sharing their holistic understanding of pipelines. Indeed, these engineers will typically have skills relevant to multiple groups and will form part of a shared staffing pool between them.
SREs also have similar points of focus to DevOps engineers, such as continuous delivery and infrastructure automation. As a result, they are able to enhance DevOps pipelines, minimizing friction between teams and departments while also enabling far greater levels of efficiency.
The similarities even go so far that the Site Reliability Engineering (SRE) certification was created by the DevOps Institute (DOI).
A Holistic Perspective
One of the biggest benefits of having site reliability engineers on staff is that, as specialists who can work across multiple departments, they are aware of differing priorities and points of concern. For example, they will usually spend no more than around 50% of their time working in operations, while also taking on work for development teams.
This can be a major advantage for development staff, as by taking into account the needs of operations, they can create code that will require fewer revisions down the line. The result of this will be faster release dates, as well as higher ROIs. At the same time, the enduring focus of SRE on reliability ensures that, even as developers consider the perspectives of other teams, their work is still monitored to the extent that greater speed does not give way to lax focus elsewhere.
Remember, site reliability engineers will always prioritize customers and end-users over business and engineering concerns, an approach that can be rare in insulated IT environments.
An Increased Focus on Feasibility
Knowing when a project is feasible enough for continued development is an essential element of SRE. While other approaches like PRINCE2 focus on the business case and ensuring that time and resources are not wasted, SRE prioritizes the needs of customers following the point of deployment.
A large part of this is establishing a ‘Service Level Agreement’. This is effectively an agreement between a service provider and their client, specifying what the end-user should expect in terms of quality, reliability, availability, and so on. Site reliability engineers will also create an ‘Error Budget’, which sets a threshold for issues, errors, and outages. If the Error Budget is exceeded, new releases are effectively put on hold until the quality and reliability of code are brought back to an acceptable level.
Setting these factors in stone early on creates a strong incentive to prevent, detect, and quickly repair issues in the development stage. At the same time, if an Error Budget is not used up, the spare time and resources can then be reinvested elsewhere, such as in developing additional features. To put it simply, SREs reduce the amount of time that developers need to spend on fixing problems, creating a more efficient and valuable operation overall.
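To make the idea of an Error Budget more concrete, here is a minimal sketch of the arithmetic involved. It assumes the agreed target is expressed as an availability percentage over a fixed reporting window; the 99.9% figure, the 30-day window, and the names ErrorBudget, allowed_minutes, remaining_minutes, and is_exhausted are all illustrative assumptions rather than part of any real agreement or standard tool.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical error budget for one reporting window."""

    target: float          # assumed availability target, e.g. 0.999 for 99.9%
    window_minutes: float  # length of the reporting window, in minutes

    @property
    def allowed_minutes(self) -> float:
        # The budget is the share of the window that is allowed to be unavailable.
        return (1.0 - self.target) * self.window_minutes

    def remaining_minutes(self, downtime_minutes: float) -> float:
        # Positive: budget left to spend. Negative: budget overspent.
        return self.allowed_minutes - downtime_minutes

    def is_exhausted(self, downtime_minutes: float) -> bool:
        # True once recorded downtime has used up the whole budget.
        return downtime_minutes >= self.allowed_minutes


if __name__ == "__main__":
    # Illustrative figures only: a 99.9% target over a 30-day window.
    budget = ErrorBudget(target=0.999, window_minutes=30 * 24 * 60)
    downtime = 30.0  # minutes of downtime recorded so far (made-up value)

    print(f"Allowed downtime: {budget.allowed_minutes:.1f} minutes")
    print(f"Remaining budget: {budget.remaining_minutes(downtime):.1f} minutes")
    print("Hold releases" if budget.is_exhausted(downtime) else "Keep shipping")
```

In this sketch, a team that has used only part of its budget keeps releasing new features, while a team that has reached or passed the threshold would switch its attention to reliability work. Real error budgets are often tracked against failed requests rather than minutes of downtime, so treat this as one possible way of framing the calculation.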
Clear Performance Tracking
When selling any concept to a business, it is crucial to have a clear idea of the metrics involved. Site reliability engineers have clear KPIs in terms of performance, incident management, and more, across multiple departments.
So, not only do SREs have first-hand experience when it comes to the varying priorities and concerns across IT operations, but they also have a clear idea of how to judge how well different areas are performing. As a result, they are able to quickly highlight problems and provide the information required to make any necessary changes efficiently. This can greatly reduce the duration of, as well as the damage caused by, downtime and other significant problems.
Crucially, this approach does away with conjecture and finger-pointing between teams. The causes of problems are clearly identified and, when necessary, those responsible can be held firmly accountable.
Proactive Improvements
Site reliability engineers do not simply spend their time running around cleaning up after development and operations teams. In tracking and monitoring the performance of different departments, they are also able to lock onto areas for improvement.
This is often done via increased automation, which is treated as an essential aspect of development (again, in a similar manner to DevOps), with solutions being automated wherever possible for the sake of speed and reliability. Improvements can also take other forms, such as decentralizing service operations like responding to infrastructure alerts.
Highly Attractive for IT Specialists
One of SRE’s claims to fame is that it originated at Google. However, as SRE specialists such as Amy Tobey have pointed out, the resources available at Google enable a unique approach to SRE that cannot always be applied elsewhere. In combining SRE and DevOps, organizations are able to follow similar practices to Google without having access to the same startling budget.
With site reliability engineers becoming more valuable, learning about and gaining experience relevant to the role can be a highly desirable career opportunity, whether for DevOps engineers, operations staff, or anyone willing to develop the necessary skill sets. Remember, enthusiasm is a crucial element of managing a successful corporate training scheme. By communicating the potential benefits for career-minded learners, you can improve metrics such as exam pass rates and course completion rates.
At the same time, it is worth pointing out that SRE practitioners who are already familiar with your business’s setup will find it easier to start making improvements. This will likely include adapting your structure and practices with less of a need to take on new tools and technology, allowing you to enjoy the benefits of SRE for less.