Site Reliability Engineering (SRE) is a discipline devoted to building scalable, reliable systems by applying software engineering principles and practices to operations. Site Reliability Engineers (SREs) bridge the gap between development and operations, driving collaboration and integration across the delivery pipeline. It is a complex job, one that calls for versatile individuals who can think outside the box.
In this article, we will go over what Site Reliability Engineers do and then explore the qualities that make a good SRE.
What do SREs do?
Site Reliability Engineers (SREs) are responsible for owning a service or system and keeping it running as reliably as possible. They retain this ownership from beginning to end: from coding to deployment to operation. This makes Site Reliability Engineering very much a cross-functional role, and one that integrates work which would otherwise be spread across separate teams.
SREs usually split their time between taking shifts in the on-call rotation and building better systems that will fail less often and require less human intervention. They reduce the silos between operations, development, and other IT teams by taking on a wide variety of tasks, such as managing incident responses and writing, testing, and deploying software.
While on-call, SREs act as incident responders and manually intervene to repair any issues as and when they happen. When off-rotation, they analyze past incidents, write blameless post-mortems to draw lessons from incidents, and take action to prevent them from happening again. They use monitoring tools to observe behaviour in systems and applications and predict any possible issues. SREs also use chaos engineering to purposefully introduce failure in the system, learn from its effects, and make improvements to make the system more robust, effectively dealing with future problems in advance. They also implement automation to improve overall system performance and remove toil.
What makes a good Site Reliability Engineer?
With such a variety of tasks to undertake, SREs must be versatile in technical skills. Most SREs come from a systems or software development background, or from an operations function. Whatever their background, they need enough knowledge to work across all three areas.
In general, SREs are required to have experience in coding, automation, and systems administration. They will usually have strong knowledge of networking, monitoring, testing, CI/CD pipelines, and virtualization.
Most importantly, good SREs should be familiar with SRE principles and best practices, as well as SRE tools.
Although technical skills have a big role to play, there are many other traits that are essential to performing such a well-rounded function. After all, navigating through different tasks, teams, processes, and ideas requires some serious soft skills.
Let’s look at some of the top qualities displayed by successful Site Reliability Engineers.
1. Big-picture thinking
SREs must be able to look at any issue, idea, system, or resource and think of how it relates to the big picture. This involves considering the use the customers make of the service, as well as how each element affects the production environment as a whole.
SREs take full ownership of systems from development to operations, so being able to visualize things at a higher level is crucial. Understanding how a particular component or aspect of the system fits into the wider environment is essential when it comes to delivering better systems.
Having a good awareness of the big picture also means knowing the business requirements and challenges across all areas, as this drives effective implementation. When SREs have big-picture thinking, they can better predict issues and solutions and act proactively and adaptively towards the organization’s objectives.
2. Problem-solving
Successful SREs are able to use their good understanding of the big picture to solve problems and create solutions that will be the most effective and deliver the greatest value. Dealing with performance and reliability issues is one of the main activities of SREs, which means problem-solving is very much a key trait.
Ideally, the work of SREs will greatly contribute to preventing issues from happening, which requires a proactive approach to problem-solving. Applying SRE monitoring best practices can increase the engineer’s visibility into the system and consequently help them predict and act on problems before they reach customers. A problem-solving mindset will enable the engineer to set up solid and effective monitoring capabilities.
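As a small illustration of monitoring that surfaces a problem early, the sketch below tracks the failure ratio over the most recent requests and flags when it crosses a threshold. It is only a minimal sketch under stated assumptions: the class name ErrorRateMonitor, the window of 100 requests, and the 5% threshold are all hypothetical and chosen for illustration. In practice this kind of check would normally live in a dedicated monitoring tool rather than in hand-written code.

```python
from collections import deque


class ErrorRateMonitor:
    """Track the failure ratio over the most recent `window` requests.

    Hypothetical helper for illustration only; not part of any real tool.
    """

    def __init__(self, window: int, threshold: float) -> None:
        # A bounded deque drops the oldest result once `window` is reached.
        self.results = deque(maxlen=window)
        self.threshold = threshold

    def record(self, failed: bool) -> None:
        """Store the outcome of one request (True means it failed)."""
        self.results.append(failed)

    def error_rate(self) -> float:
        """Return the share of failed requests in the current window."""
        if not self.results:
            return 0.0
        return sum(self.results) / len(self.results)

    def should_alert(self) -> bool:
        """Alert only when the window is full and the rate exceeds the threshold."""
        window_full = len(self.results) == self.results.maxlen
        return window_full and self.error_rate() > self.threshold


# Assumed values: a 100-request window and a 5% alert threshold.
monitor = ErrorRateMonitor(window=100, threshold=0.05)
for i in range(100):
    monitor.record(failed=(i % 10 == 0))  # simulate 10 failures in 100 requests

print(monitor.error_rate(), monitor.should_alert())
```

Waiting for a full window before alerting is one possible design choice: it trades a slightly slower first alert for fewer false alarms caused by a handful of early requests.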
Even though many issues can be prevented, some will still occur. Incident response tends to be time-sensitive, requiring responders to act quickly to minimize the incident’s impact as well as the organization’s losses. This means SREs must be good at doing detective work and thinking analytically and strategically to solve problems. Often they will need to improvise to mitigate the issue promptly and restore the service, and only then keep digging to find the root cause of the problem.
3. Communication and collaboration
In order to understand the big picture and solve problems successfully, SREs need to be good communicators and work well with others. Good communication is what makes collaboration possible, and Site Reliability Engineers are the ones who drive teamwork between development, operations, and other IT teams.
Even though the responsibilities of SREs are varied, they are not expected to hold all answers to all questions or problems. Instead, they should work with others within the organization to get the information they need or to implement a certain solution. Collaborating also entails respecting others’ work, understanding their challenges to better address them, sharing issues and concerns, and working together to define priorities.
4. Curiosity
It is no surprise that with such a broad spectrum of activities, curious SREs are the ones who thrive the most. SREs constantly come across situations that they have never experienced before and issues that can be quite complex.
Being curious and eager to learn is what will motivate the engineer to investigate and get to the bottom of the problem. It is also what inspires them to question existing processes and tools and try to find new solutions.
5. Ability to embrace risk and failure
Site Reliability Engineering is all about embracing change, because without change there can be no improvement. There is also no change without risk, which SREs must be aware of and actively embrace.
Managing risk represents a big part of how SREs manage the reliability of services. That means that being averse to risk-taking can be quite problematic for an SRE. Once you accept that some risk is inevitable, it becomes a matter of identifying how much of it is acceptable for the business and then implementing changes accordingly.
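One common way to put a number on how much risk is acceptable is an error budget: the amount of unreliability that an availability target leaves room for. The sketch below is a rough illustration, not a prescribed method. The function names and the figures used (a 99.9% target, a 30-day window, 10 minutes of downtime) are assumptions made for the example. With those figures, the budget works out to 0.1% of 30 days, or about 43 minutes.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Return the downtime allowed in `window` while still meeting `slo_target`.

    `slo_target` is a fraction such as 0.999 (hypothetical example value).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(budget: timedelta, downtime: timedelta) -> float:
    """Return the unspent fraction of the budget (negative once overspent)."""
    return 1.0 - downtime / budget


# Assumed figures for illustration: 99.9% availability over 30 days,
# with 10 minutes of downtime recorded so far.
budget = error_budget(0.999, timedelta(days=30))
remaining = budget_remaining(budget, timedelta(minutes=10))

print(f"Allowed downtime: {budget.total_seconds() / 60:.1f} minutes")
print(f"Budget remaining: {remaining:.0%}")
```

While a healthy share of the budget remains, the team has room to ship riskier changes. As it runs out, the balance shifts towards stability work.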
More than accepting risk, accepting that failure is unavoidable (at least to some degree) is key. Incidents will happen and denying this only hinders your ability to be ready for them. One of the best practices in SRE is the use of blameless post-mortems, which are an opportunity for everyone involved in an incident to look back at it and understand its root cause, impact, and the actions that were taken to mitigate or resolve it. They will also help the SREs decide what can be done to prevent it from happening again.
Post-mortems in SRE are blameless because failure is normal, and pointing fingers is unproductive. Understanding the issue and how to stop it in the future is a much more fruitful approach.
Studying Site Reliability Engineering
Good e-Learning is an award-winning online training provider with an extensive portfolio of fully accredited courses. Our in-house team of training specialists works with highly experienced practitioners to offer courses with valuable and unique advice to help candidates apply their training in practice.