DevOps and SREs: Better together?

Software development is an ever-evolving industry and expectations around it are higher every day, putting pressure on development teams to create high-quality software products at a high delivery pace. In any activity, as velocity increases, it is harder to keep things under control and software development is no exception.

Expectations, high pace, and control are the keywords in this intro, as we all expect our Software Development Life Cycle (SDLC) to be efficient, productive and, most of all, reliable. These days, this conversation usually brings concepts like DevOps (referring to the mindset, the engineer, or the role) and SRE (Site Reliability Engineering) to the discussion table.

The DevOps and SRE role names are often used interchangeably. However, as organizations go further down this evolutionary path, they realize that even though the two roles share some activities and tools, they are different, and so is the skill set each one requires.

The DevOps “side”

DevOps is mainly a mindset: a culture that an organization adopts and that usually manifests as specific practices along the SDLC. This mindset has a great impact on how development teams focus their efforts to become more efficient and productive.

While some organizations think of DevOps as a way of working that every engineer on the team must embrace, others see value in clearly defining a DevOps Engineer role or position. That role ensures the development process keeps evolving and sheds its constraints by introducing processes, tools and best practices that help “traditional” roles (developers, quality engineers, data engineers, etc.) work more efficiently with some adjustments to their daily activities.

No matter the vision, most of the visible effects organizations give the DevOps culture credit for are geared towards the “Dev” side of things: Test automation, branching and release strategies, continuous integration / delivery, etc. Many of them happen before or while a system is being deployed to its production environment.

There is also a good set of DevOps-related activities that are not entirely focused on boosting the development process. Adopting an infrastructure-as-code approach or setting up observability mechanisms are very typical requests these days, but the deeper we get into these areas, the more the required profile diverges from that of a developer.

Mature organizations with a well-established DevOps culture have very clear metrics to determine whether their development processes are performing. Some of the most mentioned metrics are:

  • Deployment frequency: How often new changes are successfully released to production. It reflects the team's ability to deliver new features with efficiency and accuracy.
  • Change failure rate: The share of deployments that result in a failure in production. Deploying more often increases the risk of a failure, and knowing how often your deployments fail will help identify areas to improve in the process.
  • Mean time to recover (MTTR): Average time from failure to resolution. Failures will come, and you will need to deploy a new working version as soon as possible.
  • Lead time: Average time it takes to turn an idea into a deliverable solution. It is a metric for productivity and efficiency.
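As a rough illustration, the four metrics above can be derived from a simple log of deployments. The sketch below is not a standard tool: the `Deployment` record, its field names and the `delivery_metrics` helper are hypothetical, and lead time is approximated as the time from first commit to production, which is narrower than "idea to deliverable solution".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was first committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def delivery_metrics(deployments, period_days):
    """Compute the four delivery metrics over a reporting period."""
    total = len(deployments)
    failures = [d for d in deployments if d.failed]
    # Only failures that have already been resolved count towards MTTR.
    recovery_times = [d.restored_at - d.deployed_at
                      for d in failures if d.restored_at is not None]
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    return {
        "deployments_per_day": total / period_days,
        "change_failure_rate": len(failures) / total if total else 0.0,
        "mean_time_to_recover": (sum(recovery_times, timedelta()) / len(recovery_times)
                                 if recovery_times else None),
        "mean_lead_time": sum(lead_times, timedelta()) / total if total else None,
    }


# Made-up sample data: four deployments over two days, one of which failed
# and was restored an hour later.
start = datetime(2024, 1, 1, 9, 0)
sample = [
    Deployment(start, start + timedelta(hours=4)),
    Deployment(start, start + timedelta(hours=8), failed=True,
               restored_at=start + timedelta(hours=9)),
    Deployment(start, start + timedelta(hours=12)),
    Deployment(start, start + timedelta(hours=16)),
]
metrics = delivery_metrics(sample, period_days=2)
for name, value in metrics.items():
    print(f"{name}: {value}")
```

In practice these values would come from your CI/CD and incident tooling rather than a hand-built list; the point is only that each metric reduces to simple arithmetic once deployments and failures are recorded consistently.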

DevOps is meant to enable seamless collaboration between development and operations so that value flows constantly to the business through continuous delivery. When the practices mentioned above are in place, the development cycle will likely be far more efficient, and the team will have what it needs to improve consistently over time.

So, what happens after the system has been successfully deployed into production and the “Dev” team goes back to work on more features? Who makes sure the deployed system operates as expected, not only functionally but also from a performance and reliability perspective?

Software/Site Reliability Engineering (SRE)

Once a system is up and running in a production environment the real challenge starts. End users will come into action, and they expect not only a functional system but one they can rely on to accomplish their goals.

The SRE concept is not new at all. In the beginning, it was meant as a framework to support large-scale systems, but it has turned into a set of best practices for addressing the kinds of issues any system might face while running in production.

Just like DevOps, SRE focuses on preventing problems as much as possible. However, given the nature of a system running in production, SRE accepts that issues will come, so there must be mechanisms in place to minimize any impact on end users (e.g., downtime) and to recover quickly when impact cannot be avoided.

Many of the practices and tools introduced by a DevOps culture are also used by SRE, but the focus of any effort is on the operational side, so the approach must be practical more than philosophical. Disaster recovery, high availability, and redundancy are some of the frequently mentioned goals of the SRE team and all of them will likely be accomplished through automation.

Time is literally money for an SRE since every second of downtime has a business impact at some scale. So, they are driven by some key metrics:

  • Service Level Objectives (SLO): The targets the SRE team aims to hit in order to maintain a reliable system and meet SLAs
  • Service Level Agreements (SLA): The levels of reliability, latency, performance, etc. that are committed to users to meet their expectations of the system
  • Service Level Indicators (SLI): The specific measurements used to track how well the system is meeting the defined SLOs
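To make the relationship between these metrics concrete, here is a minimal sketch that computes an availability SLI from request counts, compares it with an SLO, and derives the remaining error budget (the amount of unreliability the SLO still tolerates). The function names, the 99.9% target and the request counts are all assumptions chosen for illustration, not values from any particular system.

```python
def availability_sli(successful_requests, total_requests):
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic means nothing failed
    return successful_requests / total_requests


def error_budget_remaining(sli, slo):
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo   # unreliability the SLO tolerates
    actual_failure = 1.0 - sli    # unreliability actually observed
    return 1.0 - actual_failure / allowed_failure


def allowed_downtime_minutes(slo, period_days=30):
    """Total downtime the SLO tolerates over the period, in minutes."""
    return (1.0 - slo) * period_days * 24 * 60


# Hypothetical numbers: a 99.9% availability SLO and one month of traffic.
slo = 0.999
sli = availability_sli(successful_requests=999_500, total_requests=1_000_000)
budget = error_budget_remaining(sli, slo)

print(f"SLI: {sli:.4%}, SLO met: {sli >= slo}")
print(f"Error budget remaining: {budget:.0%}")
print(f"Downtime allowed per 30 days: {allowed_downtime_minutes(slo):.1f} min")
```

With these made-up numbers, half of the failures the SLO tolerates have already occurred, and a 99.9% target leaves roughly 43 minutes of downtime per 30 days. An SLO is typically set somewhat stricter than the SLA it protects, so the team gets a warning before a commitment to users is actually broken.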

The SRE team will leverage many of the tools and practices introduced by the DevOps culture to improve service delivery through automation: CI/CD orchestrators, infrastructure as code, configuration management, continuous testing, monitoring, and log management are some of them.

Can DevOps and SRE coexist?

There are many similarities between DevOps and SRE, but that doesn’t mean you have to choose one or the other. In fact, you probably don’t want to, as there is great benefit in having both in your organization:

  • Both DevOps and SRE teams aim to break silos, but they do it in different ways. The former brings together developers, quality engineers, and other roles to build, test and deploy applications in a fast-paced but controlled environment, while the latter provides timely feedback to developers so they can improve the system based on real operational data.
  • While DevOps takes advantage of automation to speed up the development cycle and increase productivity without sacrificing quality, SRE keeps systems operational and available to ensure smooth business operation even when incidents happen.
  • A DevOps culture embraces failure as a way to constantly learn and improve, so it fosters continuous experimentation. The SRE understands failure will come and the way it’s handled will make a major difference.
  • From a skillset perspective, a DevOps role requires coding skills to build, test (fix) and deploy the system. SREs require a good deal of analytical thinking to analyze an issue and create a mechanism to prevent that same issue from happening in the future or handle it properly the next time it does.

An organization that successfully combines a DevOps culture with a good SRE approach will be much more likely not only to accomplish its business objectives but also to stay one step ahead on the path of innovation and constant evolution. The road to full adoption of a DevOps mindset and good SRE practices is a long one, but you travel it one step at a time.

Every organization starts from a different place and aims for different goals. At Encora, we are ready to help you take that first step.
