SRE, or Site Reliability Engineering, is an oft-discussed topic in the software development world. One way to explain how SRE works in practice is to list its tasks and responsibilities and compare them with those of DevOps.
What is SRE, and how did it emerge?
In the early 2000s, when the industry began to adopt Agile as the standard methodology, development and operations were two separate teams, each with its own goals. While developers wanted to push out application changes as quickly as possible, operations focused on keeping applications stable. This conflict is better known as the Wall of Confusion. DevOps was meant to tear down this wall. However, while DevOps made the release process faster, the new versions were often not as stable as hoped. DevOps teams had no dedicated role focused full time on keeping systems reliable, and that is how the need for SRE as a separate role emerged. Specifically, SRE was conceptualized at Google by Ben Treynor, a software engineer who was given the task of running a small team of software engineers to do what used to be operations work. In his description, SRE is what happens when you treat operations as a software problem and staff it with software engineers.
What is the system that we want to keep reliable?
The system is the server infrastructure, the platform, and more broadly the entire environment in which the application runs.
What is reliability?
Reliability is a measure of how consistently end users can access the system. In other words, it is the metric that describes the stability of a system.
What makes a system unreliable?
The leading cause of unreliability is change: a change to the infrastructure, to the platform where the application runs, or to the application itself and its services. Any such change may cause a disruption and break something in the overall setup. We could disallow changes, or limit their number, to keep systems reliable, but that approach can significantly hurt the business.
In other words, keeping software stable while shipping the new features that make an application competitive is quite complex. This is the problem that both DevOps and SRE try to solve, and it is the cornerstone of the SRE role.
What's the specific solution for SRE?
SRE tries to automate the analysis and evaluation of the effects a change will have on the system. Reliability automation means no checklists and no discussions with the operations team about whether to release or what threats and risks are involved. Instead, the evaluation is an automated process, which makes releasing changes fast and safe at the same time.
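As a rough illustration of such an automated evaluation, the sketch below approves a release only if a trial rollout of the new version does not raise the error rate noticeably. The names `ReleaseMetrics` and `should_promote`, the sample numbers, and the 0.1 percentage point tolerance are assumptions made for this example, not part of any specific tool.

```python
from dataclasses import dataclass


@dataclass
class ReleaseMetrics:
    """Hypothetical request and error counts collected for one version."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        # Fraction of requests that failed; 0.0 when there was no traffic.
        return self.errors / self.requests if self.requests else 0.0


def should_promote(baseline: ReleaseMetrics,
                   candidate: ReleaseMetrics,
                   max_increase: float = 0.001) -> bool:
    """Approve the candidate only if its error rate exceeds the
    baseline's by at most max_increase (an assumed tolerance)."""
    return candidate.error_rate <= baseline.error_rate + max_increase


# Illustrative numbers only.
baseline = ReleaseMetrics(requests=100_000, errors=50)
good = ReleaseMetrics(requests=10_000, errors=6)
bad = ReleaseMetrics(requests=10_000, errors=40)

print("promote good candidate:", should_promote(baseline, good))
print("promote bad candidate:", should_promote(baseline, bad))
```

In a real pipeline the counts would come from the monitoring system, and the decision would trigger either a wider rollout or an automatic rollback.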
What are SLAs?
An SLA (Service Level Agreement) indicates how reliable a system is and how often it may be down, expressed as a percentage. If a system has a 100% SLA, the service will be up and working 24 hours a day, 365 days a year. With a 99% SLA, the system can be down for a maximum of 3.65 days in a year.
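The arithmetic behind these figures is simple enough to sketch. The helper name `allowed_downtime_days` is made up for this example, and a 365-day year is assumed.

```python
def allowed_downtime_days(sla_percent: float, period_days: float = 365.0) -> float:
    """Return the maximum downtime, in days, that an SLA permits
    over a period (a 365-day year by default)."""
    if not 0.0 <= sla_percent <= 100.0:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_days * (100.0 - sla_percent) / 100.0


for sla in (99.0, 99.9, 99.99):
    days = allowed_downtime_days(sla)
    print(f"{sla}% SLA allows {days:.4f} days ({days * 24:.2f} hours) of downtime per year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why targets close to 100% are so much harder to meet.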
How do you calculate the SLA of your system?
Service Level Agreements are based on benchmarks, competition, user feedback, and business strategy. The SLA therefore acts like a barometer: if systems are less reliable than the SLA allows, the SRE team works to bring them back in line with the SLA goals. The closer the SLA target is to 100%, the more effort it takes to guarantee the reliability of your systems.
- The SRE, or Site Reliability Engineer, creates automated processes to calculate and evaluate whether the service is within its SLA.
- Another primary role is to define launch policies that guarantee operations support.
- A big part of an SRE's responsibilities is monitoring and observability: configuring proper monitoring tools and logging system status.
In actual production systems, the SLA is not 100%, so monitoring and alerting are essential. They give you visibility into your system's performance and, more importantly, let you detect early indications of issues before they turn into outages.
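The first responsibility listed above, automatically checking whether a service is within its SLA, might look roughly like the sketch below. It assumes that downtime is already recorded in minutes by some monitoring system. The function names and sample figures are illustrative only.

```python
def measured_availability(total_minutes: int, downtime_minutes: int) -> float:
    """Return availability over the observation window as a percentage."""
    if total_minutes <= 0:
        raise ValueError("total_minutes must be positive")
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime_minutes must be within the window")
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def within_sla(total_minutes: int, downtime_minutes: int, sla_percent: float) -> bool:
    """True if the measured availability meets or exceeds the SLA target."""
    return measured_availability(total_minutes, downtime_minutes) >= sla_percent


# A 30-day window; a 99% SLA tolerates up to 432 minutes of downtime here.
window = 30 * 24 * 60
print("400 min down, within 99% SLA:", within_sla(window, 400, 99.0))
print("500 min down, within 99% SLA:", within_sla(window, 500, 99.0))
```

Running a check like this on a schedule and alerting when it fails is one simple way to turn the SLA into something the team can act on.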
Why is it essential to prevent issues or identify them as early as possible?
A correctly configured system has multiple high-availability and self-healing mechanisms. Therefore, several things must go wrong before the system becomes unreliable and you have an outage.
1. Observability and Monitoring
When an outage does happen, there is rarely a single thing to fix. Instead, you face multiple issues, or a chain of problems on various levels. Sorting out that chain of issues is more complicated, and it takes more time before the system has recovered. The additional protection mechanisms do make the system more reliable, but they also make issues harder to detect early. Overcoming this challenge is the primary purpose of observability, which is why a proper monitoring, logging, and alerting configuration is so important.
2. Postmortem Culture: Learning from an outage
Every outage is a chance to learn and to avoid similar issues in the future. SRE teams use what's called a postmortem document, which includes a thorough analysis of the issues that caused the outage. During this analysis, it's essential to stay blameless, to encourage people to admit their mistakes and to learn from their own and other people's mistakes.
SRE vs. DevOps
Finally, what is the difference between SRE and DevOps engineers, or more generally between these two concepts? There are two characterizations of DevOps: the original definition, which is high level and broad and doesn't specify exactly how DevOps must be implemented, and a more practical one, which evolved with its own DevOps engineer role. So when we compare DevOps with SRE, it's essential to know which definition of DevOps we're using. Here we use the first one, because we are talking about general activities.
If we take that approach, we can conclude that SRE is a specific implementation of the DevOps concepts. DevOps targets not only faster releases but also higher-quality code. In practice, however, many DevOps teams optimize more for speed than for reliability, so SRE is an excellent complement to DevOps.
SRE emerged with the same principles and goals in mind, namely releasing quality code fast, but, as the name suggests, it is more focused on reliability and on keeping systems stable while allowing for fast changes.
SRE and DevOps at Encora
Encora’s DevOps and SRE experts bring world-class best practices to accelerate implementation and time-to-value. It is not uncommon for teams to have both DevOps engineers and SREs.
Fast-growing tech companies partner with Encora to outsource product development and drive growth. Contact us to learn more about our software engineering capabilities.