Techniques for high-performing DevOps teams
Video transcript
We used ChatGPT to improve the grammar and syntax of this transcript.
Hello, everyone. Thank you for coming. I'm Branislav. I'm an engineering leader at Platform.sh. I'm also an engineer, a father, a husband, and a waste reduction nerd, which will come in handy in this presentation. Today, I’m going to talk about performance optimization and what strategies organizations can apply to optimize their processes.
Before we start, how many of you are managers? How many of you are engineers who intend to become managers? And how many of you work in DevOps? Great! You're in the right place.
Now, how many of you know who this is? This is Bruce McLaren, who was born in 1937 and passed away in 1970. He was a pioneer in motorsport and created some of the most amazing, advanced cars in history. Even after his death, the McLaren team kept winning, with eight constructors' championships and 12 drivers' championships in Formula 1. They know a thing or two about building highly performant machines.
There’s a problem, though: these machines are absolute snowflakes. What does that mean? Think about starting a Formula 1 car. You don’t just turn the key. Instead, technicians attach the car to an external machine that heats the oil. Once the oil is at the right temperature, they pump it into the car, slowly lubricating and warming the engine. The process involves numerous technicians using powerful computers, all supervised by an engine-starting supervisor (a real job title!). Only then are they ready to start the engine. As I said, it’s a highly specialized machine, a snowflake.
What’s the best-selling car in the world? It’s not a McLaren; it’s the Toyota Corolla. Since its inception in the 1960s, more than 50 million units have been sold. Toyota as a producer does much better than McLaren. When you think about it, Ford, Volkswagen, and pretty much every car manufacturer perform better than McLaren. Why?
Meet this man, Eiji Toyoda. Born in 1913, he lived to be 100 years old. He was a Japanese industrialist and the chairman of Toyota. His story is fascinating. After World War II, the Americans invited him to visit Ford’s production facilities to understand mass production and factory organization. He returned to a struggling Toyota, which at times was producing barely more than 1,000 cars a month, and he tried to implement better processes. He invented what they call "The Toyota Way."
As part of The Toyota Way, he introduced two techniques. One is Kanban. Who here doesn’t use Kanban? One, two, three people. He co-authored this process. The second is Kaizen, and I’m going to talk about Kaizen today. "Kai" means change, and "Zen" means good. It’s a good change or an improvement. This methodology focuses on continuous improvement in small steps, without huge innovations or revolutions, and without drastic changes.
Why? It’s obvious. A huge company introducing a massive change stops the production line, halts the process, and complicates things for everyone. Then, you must slowly restart everything from scratch. This methodology also emphasizes processes, meaning familiarity. If you have good documentation and a familiar environment, that’s where creativity can thrive. Finally, Kaizen promotes automation, which is key to performance.
Kaizen is built on 10 principles, all about making informed decisions. It’s important to disrupt the status quo in small, incremental steps. Perfectionism has no place here; the focus is on reaching the goal one step at a time. This approach fosters analytical thinking and taps into the collective knowledge of everyone in the organization. It prioritizes the economics of change, meaning you find the smallest, easiest leverage to get the best possible result. And, the tenth principle is "never stop implementing," which means the process is perpetual and cyclic—it never ends. Each cycle should bring a little more improvement to the organization.
In practice, it looks like this: First, you plan—define the problem, suggest solutions, and identify measures to track success. Then, you implement the plan and test small changes to ensure you’re on the right path. Once done, you check and evaluate all the data to ensure deviations between what you expected and what you got are minimal. Finally, you standardize the changes across the organization.
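The plan-do-check-act loop just described can be sketched in code. This is a toy illustration, not anything from the talk: `kaizen_cycle`, the metric, and the improvement step are all hypothetical placeholders for "apply one small change, measure, repeat."

```python
# A toy sketch of the plan-do-check-act (Kaizen) cycle: small
# improvements, measured after each step, repeated until the goal is met.

def kaizen_cycle(metric, target, improve, measure, max_iterations=10):
    """Repeat small improvements until the measured metric hits the target."""
    for i in range(max_iterations):
        if measure(metric) >= target:     # check: are we there yet?
            return i                      # act: standardize and stop
        metric = improve(metric)          # do: apply one small change
    return max_iterations

# Hypothetical example: nudge a deployment success rate upward in small,
# low-risk steps rather than one big risky change.
result = kaizen_cycle(
    metric=0.80,
    target=0.95,
    improve=lambda m: min(1.0, m + 0.05),
    measure=lambda m: m,
)
print(result)  # number of improvement cycles needed
```

The point of the sketch is the shape of the loop, not the arithmetic: each iteration is cheap, reversible, and measured before the next one starts.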
The goal is to increase performance by reducing waste. What is waste? Think about wasted time—meetings where you weren’t needed or had other priorities. Think about work that had to be scrapped because it was duplicate or irrelevant. For those in manufacturing—although it seems no one here is—think about wasted materials. For the rest of us, it’s wasted electricity, CPU, RAM, storage, bandwidth—everything that makes hosting more expensive.
So, we’re trying to increase performance, but how do we know if we’ve actually improved? How can we measure if we’re in a good place? According to the State of DevOps report—how many of you read that, by the way? Hands up. Alright, you know what I’m talking about. According to the State of DevOps report, which is released every year, deployment frequency is one of the key indicators to watch. Every year, for the past seven or eight years, they’ve said the same thing: deployment frequency and time to recovery are the two most critical metrics.
What does this mean? You shouldn’t be deploying every month, or even every week. You should deploy whenever it’s needed—on demand. Avoiding failure isn’t the goal because the best way to avoid failure is to do nothing. Instead, you need to make sure that when failure does happen, you can recover quickly. High-performing teams are those that can recover in less than an hour.
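As a concrete illustration, both metrics can be computed from deployment and incident timestamps. This is a minimal sketch with made-up data; in practice the numbers would come from your CI/CD system and your incident tracker.

```python
from datetime import datetime, timedelta

# Hypothetical logs: deploy dates and (failure detected, service restored)
# pairs over a ten-day window.
deploys = [datetime(2024, 5, d) for d in (1, 2, 3, 6, 7, 8, 9, 10)]
incidents = [
    (datetime(2024, 5, 3, 10, 0), datetime(2024, 5, 3, 10, 40)),
    (datetime(2024, 5, 9, 14, 0), datetime(2024, 5, 9, 14, 55)),
]

window_days = 10
deploy_frequency = len(deploys) / window_days  # deploys per day

recovery_times = [restored - detected for detected, restored in incidents]
mean_time_to_recover = sum(recovery_times, timedelta()) / len(recovery_times)

print(f"Deployment frequency: {deploy_frequency:.1f}/day")  # 0.8/day
print(f"Mean time to recover: {mean_time_to_recover}")      # 0:47:30
```

With these made-up numbers the team recovers in well under an hour, which is the threshold the talk cites for high performers.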
The latest State of DevOps report also introduces the concept of "balanced teams." These are teams with high performance, high organizational effectiveness, high job satisfaction, and low burnout—something new in this year’s report. Balanced teams use technology sustainably and, as a result, they perform better.
Let me repeat this: High performance is associated with deploying more frequently. This is the metric you need to watch. The mean time between failures isn’t as important. What matters is how quickly you recover from a failure.
If you want to categorize all the capabilities needed for high performance, you can divide them into three groups: technical capabilities, process optimization, and a culture of psychological safety. I’m going to talk a bit more about each of these, but let’s start with technical capabilities.
The first point is cutting ties with legacy technology. It seems obvious, but if you’re running on unsupported runtime versions, you’re already in trouble. Who here runs on PHP 8 or older? If you are, you have a problem. Unsupported application versions, dependencies, or redundant technologies can also drag you down. Or imagine you have a data model that no longer fits your needs—how many times have you had to tell your business stakeholders, “I can’t do that because the database doesn’t allow it”? That’s a problem. Tight coupling is another issue, as it limits your flexibility.
All these things—unsupported technology, obsolete systems, tight coupling—they create technical debt. Technical debt usually happens for many reasons, but one key reason is sticking with outdated technology stacks. Instead, we should be thinking about modern architecture.
Think about adopting cloud-friendly, cloud-native designs or patterns. Start with loosely coupled architectures. I’m not necessarily talking about microservices, but microservices could be one solution. The idea is to focus on making small, incremental improvements rather than deploying one massive update that could fail miserably. You want to test smaller changes and minimize risk.
Of course, continuous integration, delivery, and deployment are vital if you can manage them—that should be your goal. I want to emphasize one specific thing: immutable containers. Who here hosts on immutable containers? A handful of you. Do you like it? Some of you probably don’t. But immutable containers are crucial because they guarantee the integrity of your application once it’s live.
Also, think about APIs that connect applications and design them to be anti-fragile. What does that mean? It means the system can handle individual failures and still function. Even when something goes wrong, it allows you to recover quickly.
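One common building block for this kind of failure tolerance is retry with exponential backoff: a single failed call to a dependency doesn't propagate upward as a failure of the whole system. The sketch below is a generic illustration, not Platform.sh code; `flaky_service` is a hypothetical stand-in for any unreliable upstream dependency.

```python
import time

def call_with_retry(fn, attempts=3, base_delay=0.05):
    """Retry a flaky call with exponential backoff, so a transient
    dependency failure degrades gracefully instead of cascading."""
    for attempt in range(attempts):
        try:
            return fn()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide
            time.sleep(base_delay * (2 ** attempt))

# Hypothetical flaky dependency: fails twice, then succeeds.
calls = {"n": 0}
def flaky_service():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("upstream unavailable")
    return "ok"

result = call_with_retry(flaky_service)
print(result)  # "ok", after two transparent retries
```

Patterns like circuit breakers and timeouts extend the same idea; the design goal is that individual failures stay individual.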
The next step is reducing waste through standardization. What does that mean? Consider the operating systems your organization uses—think about the licensing and maintenance costs. Also, think about development standards. If they differ from one application to another, you’ll have a hard time moving people between projects, creating cross-functional teams, or onboarding new staff. Longer processes for onboarding, slower code reviews, and adapting to different standards all waste time.
Standardization should have one ultimate goal: automation. Once you’re standardized, you can automate, and that’s vital. Automate everything. In hosting and DevOps, this means having production-like environments for testing purposes.
To achieve this, we introduced "infrastructure as code," which is already a relatively well-known concept. The idea is to have the recipe for provisioning infrastructure in a file, in code. That code can be version-controlled, allowing you to track changes and fine-tune your infrastructure as needed. Think about events like Black Friday, where you may need to adjust your infrastructure quickly.
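The idea can be shown with a miniature, hypothetical "plan" step: the desired state lives in a version-controlled file, and tooling diffs it against what is actually running. Real tools (Terraform, or a PaaS configuration file) work on the same principle, just with far richer models; everything below is illustrative.

```python
# Desired infrastructure, checked into git alongside the application code.
desired = {
    "web_workers": 6,       # scaled up ahead of a Black Friday peak
    "cache_size_gb": 8,
    "php_version": "8.3",
}

# What is actually provisioned right now.
current = {
    "web_workers": 2,
    "cache_size_gb": 8,
    "php_version": "8.2",
}

def plan(current, desired):
    """Return the changes needed to reach the desired state:
    a diff of (old, new) values, like `terraform plan` in miniature."""
    return {
        key: (current.get(key), value)
        for key, value in desired.items()
        if current.get(key) != value
    }

changes = plan(current, desired)
for key, (old, new) in changes.items():
    print(f"{key}: {old} -> {new}")
```

Because the desired state is just a file in version control, the Black Friday scale-up is a reviewable, revertible commit rather than a manual change someone has to remember to undo.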
Now, here’s a crucial point: Your first deployment to production should not be the actual first deployment to production. Who knows what that means? One person? Okay. This means that before deploying to production, you need to test everything in an environment that mirrors production exactly. You need a one-to-one, byte-for-byte copy of the production environment where you can safely test everything—code, infrastructure, databases, caches, etc. Testing in such an environment is essential to avoid surprises when you go live.
It’s all well and good to have such a testing environment, but it can be difficult to maintain. Some of you might already be thinking of Kubernetes or similar tools. The end goal here is commoditization. What does that mean? It means being able to get whatever infrastructure you need on demand, in a self-service manner.
At Platform.sh, the company I work for, we provide that for you. It’s a platform-as-a-service, and our newest product, Upsun, is designed to meet these needs. This ties into the first pillar, technical capabilities. Now, let’s move on to the second pillar: lean processes.
What are lean processes? You need your rituals and processes, but they should be designed to focus on bringing value to the customer. That’s the main goal. Anything you do that doesn’t directly add value is waste.
So, how do we focus on the user? There are multiple techniques, but what works well for us is cross-functional initiatives. These allow teams to make decisions when needed without having to escalate everything to higher management levels. This approach avoids the “escalation game,” where decisions have to pass through multiple layers of approval before action can be taken.
Let’s look at this in detail. For the first time, the State of DevOps report says that teams with a strong user focus have 40% higher organizational performance. Forty percent is a huge number. Cross-functional initiatives are one way to achieve this. They allow organizations with rigid structures or divisions to bring together people from different teams, give them a task, and provide them with the infrastructure necessary to get the job done.
At Platform.sh, we’ve introduced the concept of a "product trio." This means that a product manager, a product designer, and a lead engineer work together to bring a new feature to life. Who wants to know more about that? Stay after this session, and we’ll go into more detail. The idea is to gather whoever you need to get the job done.
In this product trio, the product manager ensures the team stays focused on the user. The product designer ensures that whatever you’re building is both functional and aesthetically pleasing. The lead engineer ensures that the solution fits into the overall engineering architecture.
I’ve provided a description, but I won’t dwell on these slides too much—you’ll have a link to the slides at the end of this session. The goal here is to make decisions at the right place and avoid the organizational “telephone game,” where messages get passed from one person to another, losing clarity along the way.
How do you do this? You ensure the team has everything they need to deliver. This includes dedicated development environments, the ability to test infrastructure changes, and the ability to ensure everything works before merging to staging and, finally, production.
This gives people time to work on important things, like documentation. Documentation is the foundation of a successful organization. There are several tools you can use—Confluence, GitBook, Read the Docs, Sphinx, and Guru, among others—to manage documentation within your teams. Proper documentation helps make code reviews more efficient and fosters a culture of feedback.
Now, let’s talk about the final pillar: generative culture. Culture is the key driver of employee well-being, and I’ve listed some important aspects you need to focus on when designing or fostering a good culture. You need to care about psychological safety, ensuring that people feel safe to express their ideas and concerns. You need to promote open and quality communication, and you must ensure your colleagues have the opportunity to learn and exchange information.
This all ties into generative culture, a concept that’s very important but could take a whole session to cover on its own. Briefly, Westrum categorizes organizational cultures into three types:
- Pathological – driven by fear and threat, where people are more concerned about personal survival than organizational goals.
- Bureaucratic – where the focus is on following rules and protecting individual turfs (my team, my department).
- Generative – focused on the mission of the organization, where everyone works together to deliver value.
In a generative culture, the primary measure is the performance of the organization, not of individuals. This is a key difference that promotes collaboration and shared responsibility.
In reality, generative culture reduces burnout, which is one of the main causes of decreased team performance. According to the latest State of DevOps report, burnout is reduced by 61% in organizations with high levels of job security and psychological safety. To continue supporting your colleagues in doing their best work, you need to make sure the mission of the organization remains the focal point.
It’s much easier to keep everyone aligned with the mission when management operates based on trust, not control. This is also done by promoting others within the organization, rather than promoting oneself. Remember this: High-performing teams are not composed of single, amazing stars. They are made of constellations—teams of people that work together seamlessly.
Thank you so much.