How to make a business case for site reliability engineering

Article, May 23, 2024

By: Rod Anami and Greg Pruett

If you could make an investment that’s almost certain to improve the dependability of your software applications, would you do it?

Sure, you would—provided the upfront costs aren’t prohibitive.

The concept of affordable, always-on performance lies at the heart of site reliability engineering, a practice described as “what happens when you ask a software engineer to design an operations team.”¹

For organizations striving to keep their systems and applications humming as efficiently and predictably as possible, site reliability engineering should be a business imperative rather than an operational nice-to-have.

Why businesses should invest in site reliability engineering services

Site reliability engineering uses software tools and engineering principles to automate IT infrastructure tasks and create highly reliable and scalable software systems. Google pioneered the practice in the early 2000s, before DevOps existed, to increase the reliability of its sites and services.

Google later used site reliability engineering methodology to achieve the famous six nines of service availability (99.9999%). This measure of operational performance—which equates to a company’s IT systems being down no more than 31.5 seconds annually—fueled the trustworthiness of cloud platforms.
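The downtime figure follows from simple arithmetic. As a minimal sketch, assuming a 365-day year and treating availability as a fraction of total time (the function name `allowed_downtime_seconds` is illustrative, not from any library), the yearly downtime allowed by a given number of nines can be computed like this:

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds in a non-leap year


def allowed_downtime_seconds(nines: int) -> float:
    """Maximum yearly downtime, in seconds, for an availability of `nines` nines."""
    unavailability = 10 ** -nines  # e.g. six nines -> 0.000001
    return SECONDS_PER_YEAR * unavailability


if __name__ == "__main__":
    # Print the yearly downtime allowance for two through six nines.
    for n in range(2, 7):
        availability_pct = 100 * (1 - 10 ** -n)
        downtime = allowed_downtime_seconds(n)
        print(f"{n} nines ({availability_pct:.4f}%): {downtime:,.1f} seconds per year")
```

For six nines, the calculation gives roughly 31.5 seconds per year, which matches the figure quoted above. Each additional nine cuts the allowance by a factor of ten, which is why every extra nine tends to be harder to reach than the one before it.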

In the years since site reliability engineering was developed, some organizations have adopted its principles and processes, yet few have fully embraced the practice. In a recent survey from the DevOps Institute, more than 62% of companies said they use site reliability engineering, yet only 19% of those practice it throughout the organization.²

This disconnect represents an opportunity for companies to explore adopting site reliability engineering in all business units and locations. When used organization-wide, site reliability engineering principles and processes should enable or enhance the following:

Observability across the IT estate to help eliminate blind spots and detect issues faster
Automation to reduce manual processes and accelerate incident resolution
Analysis that goes beyond the root cause to understand contributing factors and circumstances behind incidents or failures
Predictive maintenance to help avert issues and increase the time between failures
Capacity planning to anticipate needs for service-level objectives (SLOs) and facilitate multiple nines of availability³

Simply put, when IT service providers deploy site reliability engineering at scale, it can drive greater business value by helping to increase reliability and performance while reducing operational costs and complexity.

Site reliability engineering in practice

Site reliability engineering has numerous use cases, from monitoring system health to incident management. Engineers typically focus on more significant issues like service disruptions, scalability challenges and slow response times, all of which can cost organizations money.

For example, the site reliability engineering squad for a North American airline planned the operational capabilities needed to provide “reliable services” for a critical application. The squad worked with the application development and business teams to:

Understand the different journeys an application’s end user may follow
Publish a user journey map with service-level objectives for operational needs
Deploy an open-source monitoring platform for monitoring and observability
Set up actionable alerting to filter out less-critical alerts
Adopt ChatOps to reduce the toil and improve collaboration and automation

Within six weeks, from start to finish, the team deployed tools to measure, monitor and observe this application. The efforts helped the company avoid losses of roughly US$10 million in revenue.
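To make the link between service-level objectives and actionable alerting more concrete, here is a minimal sketch of one common approach, burn-rate alerting. It is not the airline team's implementation. The journey name, the SLO target and the `page_at` and `ticket_at` thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class JourneySLO:
    name: str      # hypothetical user journey, e.g. "check-in"
    target: float  # fraction of requests that should succeed, e.g. 0.999


def burn_rate(slo: JourneySLO, total: int, failed: int) -> float:
    """How fast the error budget is being spent in the observed window.

    A value of 1.0 means errors arrive exactly at the rate the SLO allows.
    """
    if total == 0:
        return 0.0
    error_rate = failed / total
    error_budget = 1.0 - slo.target
    return error_rate / error_budget


def classify_alert(rate: float, page_at: float = 10.0, ticket_at: float = 2.0) -> str:
    """Route an alert by severity; the thresholds are illustrative assumptions."""
    if rate >= page_at:
        return "page"    # wake someone up
    if rate >= ticket_at:
        return "ticket"  # handle during working hours
    return "ignore"      # within budget, not actionable


if __name__ == "__main__":
    slo = JourneySLO(name="check-in", target=0.999)
    for failed in (5, 30, 200):
        rate = burn_rate(slo, total=10_000, failed=failed)
        print(f"{failed} failures: burn rate {rate:.1f} -> {classify_alert(rate)}")
```

The design choice here is to alert on how quickly the error budget is being consumed rather than on every individual error, so that less-critical noise is filtered out and only budget-threatening problems reach an engineer.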

In another use case, the site reliability engineering team for a large Latin American bank discovered a product bug in its development environment that its technology vendor resolved through a workaround. The IT team tested and applied the workaround to production using a blue-green deployment approach.⁴

After successfully updating the production products and migrating applications, the site reliability engineering and operations teams implemented a process to monitor self-signed certificates to avoid additional incidents. The work prevented disruptions to human resources operations, employee recognition programs, marketing campaigns and the commercial intranet across 65 distinct systems.

Common obstacles to buy-in for site reliability engineering

Many IT leaders and organizations recognize the benefits of site reliability engineering. However, establishing and maintaining a company-wide technology solution presents challenges that can delay or even prevent some organizations from implementing the methodology.

For starters, site reliability engineers are technical generalists whose knowledge and skills span the entire technology stack. The lack of specialization makes it more complicated for companies to develop curricula and train interested employees.

Also, since site reliability engineering is still relatively young as a mainstream practice, it lacks the proven track record and widespread cross-industry adoption of its more widely known counterpart, DevOps. This lack of familiarity can make it difficult to gain buy-in from software and operations teams or management.

Despite these challenges, applying and following site reliability engineering principles as disciplines is a sound business decision.

Investing in a site reliability engineering program can lead to sustainable improvements in KPIs like mean time to detect (MTTD), mean time to resolve (MTTR) and mean time between failures (MTBF). These technical performance metrics directly impact business metrics like revenue and cost per user, total addressable market and net promoter scores.
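As a rough illustration of how these three metrics can be derived from incident records, consider the sketch below. Definitions vary between organizations. This version assumes MTTD runs from the start of a failure to its detection, MTTR from the start of a failure to its resolution, and MTBF from the resolution of one incident to the start of the next. The `Incident` type and the sample timestamps are invented for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    started: datetime   # when the failure began
    detected: datetime  # when monitoring or a person noticed it
    resolved: datetime  # when service was restored


def _mean_delta(deltas: list[timedelta]) -> timedelta:
    """Average a non-empty list of durations."""
    return timedelta(seconds=mean(d.total_seconds() for d in deltas))


def mttd(incidents: list[Incident]) -> timedelta:
    return _mean_delta([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    return _mean_delta([i.resolved - i.started for i in incidents])


def mtbf(incidents: list[Incident]) -> timedelta:
    if len(incidents) < 2:
        raise ValueError("MTBF needs at least two incidents")
    ordered = sorted(incidents, key=lambda i: i.started)
    # Uptime between the end of one incident and the start of the next.
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return _mean_delta(gaps)


if __name__ == "__main__":
    # Invented sample data for illustration only.
    sample = [
        Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 10), datetime(2024, 1, 1, 11, 0)),
        Incident(datetime(2024, 1, 11, 10, 0), datetime(2024, 1, 11, 10, 20), datetime(2024, 1, 11, 10, 30)),
    ]
    print("MTTD:", mttd(sample))
    print("MTTR:", mttr(sample))
    print("MTBF:", mtbf(sample))
```

Tracking these values over time, rather than as one-off numbers, is what lets a team show whether reliability work is shortening detection and resolution times and lengthening the gaps between failures.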

How to start a site reliability engineering program

Launching and overseeing a site reliability engineering program requires people to change how they work. Here are three tips to get started:

Establish a center of excellence. If you don’t have skilled site reliability engineers on staff, appoint a core group of engineers who can take on the project. There are plenty of educational materials online that the first group of site reliability engineers can use to train themselves, including Google’s definitive SRE books⁵ and materials from SREcon.⁶
Develop an asset repository where automation, runbooks, integrations and procedures can be curated and shared. Internal site reliability engineers can use code and documentation from these assets—which are treated as intellectual property—to enable automation at scale.
Create a certification program to assess an individual’s applied knowledge based on work they’ve completed on the job rather than focusing on product knowledge. You may want to set up mentoring programs and have coaches work on reliability problems with aspiring site reliability engineers until they achieve certification.

A final word about site reliability engineering

Starting a site reliability engineering program is as much a business decision as a technical one. By investing time and resources to build a site reliability engineering culture within your organization, you can benefit from improved reliability, efficiency and innovation for years to come.

Rod Anami is an SRE coach and the SRE profession leader for Kyndryl. Greg Pruett is a distinguished engineer and the Kyndryl SRE profession executive sponsor.

¹ Site reliability engineering, Google, 2024

² Global SRE pulse 2022, DevOps Institute, 2022

³ What is SRE (site reliability engineering)? And what do site reliability engineers do?, Dynatrace, February 2024

⁴ Blue/green deployments, AWS, 2024

⁵ SRE books, Google, 2024

⁶ SREcon, USENIX, 2024