Metrics That Matter
Engineering is stuck in firefighting mode.
We have a never-ending backlog.
We're concerned with the quality of the product.
I've paraphrased a few concerns that I've encountered as an engineering manager at PBS. In April 2018 we changed our approach to how we develop software. We started tracking metrics on throughput and stability. A few members of my team read the excellent book, Accelerate, which presents the metrics that have had a profound impact on product delivery.
The PBS engineering team has had other ideas for how to improve throughput and stability for products like pbs.org. Here's a brief history of how we came to change our approach. A problem our software developers face is that there is an endless amount of work to be done. PBS is a non-profit organization; we can't always hire more engineers. Budget and headcount are tough to come by. When we couldn't get more engineers to do the work, we considered our options.

The first option we chose was to do less work. We ignored some products during sprints to focus on others. Work got done, but we lived in the shadow of a looming backlog. Some weeks we resolved more issues than we created. Other weeks we didn't. After months of this approach we changed tack. If we can't add more people, and we can't do less work, what could we change?
We made an effort to use the engineers we have more efficiently. We did this by tracking software delivery performance. My team could make regular decisions at the issue level. They could optimize each issue for stability and throughput. These small decisions led to a noticeable impact. Here's a chart of the issues we've created vs. resolved:
Notice the change in April? That growing green area represents progress! We track four metrics:
- Deployment Frequency
- Change Failure Rate
- Mean Time To Recovery
- Cycle Time
I'd like to explore each metric in depth and comment on the impact I've seen it make on our development effort.
Deployment Frequency
How often we deploy a product
A deployment should be a non-event. Deployments shouldn't require careful scrutiny or manual monitoring. They should feel like a byproduct of writing code. Ideally a deployment is a single incremental change. By deploying as frequently as possible we incentivize several productive habits.
First, we make production deployments easy. By measuring the last mile of product development, the time and effort it takes to deploy, we encourage automation. Deployments should be as simple as pushing a button. That button should be accessible and understood by many people. We used to have a "release manager" role. This person knew the proper incantations to transmute a git repository into a production service. Now any engineer can do a deployment, and does several times a day.
Second, we push the deployment button as often as possible. Talk about gamification: you get a point every time you push the button. We're motivated to deploy every pull request that makes it through review. Feature flagging makes this even easier. If a pull request represents a partially completed effort, we put it behind a feature flag. The change remains incremental. The work is available on production and may be useful to some users.
Finally, we keep batch sizes small. Frequent deployments keep changes from building up, so each change set stays small. If something does go wrong with a release, it's easy to home in on the affected code, and the impact of any one change stays limited.
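Deployment Frequency itself is simple to measure once you have a log of deployment timestamps (from your CI system, for example). Here's a minimal sketch; the dates below are hypothetical:

```python
from datetime import date

# Hypothetical deployment log: one entry per production deployment.
deploys = [
    date(2018, 4, 2), date(2018, 4, 2), date(2018, 4, 3),
    date(2018, 4, 4), date(2018, 4, 4), date(2018, 4, 6),
]

def deployment_frequency(deploy_dates, workdays):
    """Average number of deployments per working day."""
    return len(deploy_dates) / workdays

# Six deployments over a five-day work week.
print(deployment_frequency(deploys, workdays=5))  # 1.2
```

The interesting work isn't the arithmetic, of course; it's driving that number up by making the deployment button cheap to press.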
Change Failure Rate (CFR)
The percentage of deployments we have to rollback
This metric goes hand-in-hand with "Deployment Frequency". Where that metric encourages throughput, "Change Failure Rate" encourages stability. This metric is the indicator that you're going too fast.
I've used this metric to encourage my team to slow down if we notice an uptick in rollbacks. It's like a tachometer though. I argue that a sustained 0% CFR means you're moving too slowly. I want to red-line deployments a bit. It's worth a rollback to test the boundary of how fast you can go. The boundary is different for different teams and products. For example, I'm more comfortable with >0% CFR in web development. Rollbacks are cheap and quick on the web. I'm risk-averse to rollbacks when it comes to app development. Rollbacks with packaged software are difficult and slow.
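CFR is just the ratio of rolled-back deployments to total deployments, expressed as a percentage. A minimal sketch, with hypothetical counts:

```python
def change_failure_rate(deployments, rollbacks):
    """Percentage of deployments that had to be rolled back."""
    if deployments == 0:
        return 0.0
    return 100.0 * rollbacks / deployments

# 2 rollbacks out of 40 deployments in a period -> 5.0%
print(change_failure_rate(deployments=40, rollbacks=2))  # 5.0
```

A sustained 0.0 from this function is itself a signal worth investigating: it may mean the team isn't pushing the boundary of how fast it can go.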
Mean Time To Recovery (MTTR)
Hours elapsed from the start of a service failure to system recovery
From my experience MTTR lends a sense of urgency to stability. When our system fails my team knows the clock is ticking. The benefits are twofold:
- We know the top priority is to return to stability.
- We maintain trust with our peers.
Trust is essential to any software development organization. Our peers know that we're taking the issue seriously, and they can follow along in a chat channel to pull the latest information.
After watching MTTR we changed the way we handle recovery events. We examined our incident response process and have taken steps to formalize it. We're using a light version of the process laid out in the Site Reliability Engineering book. Our process used to be a general miasma of panic. Developers didn't know whether they were supposed to be involved in an incident or not. The root cause analysis and remediation measures weren't publicly available or coordinated. Stakeholders would get infrequent updates on the situation. After adopting our process our recoveries are more orderly. We don't have MTTR metrics from before we adopted an incident response process, but over time I'd like to calculate the trend of our response times.
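Computing MTTR is a matter of averaging the elapsed time across incidents, given a record of when each failure started and when service recovered. A minimal sketch with hypothetical incident timestamps:

```python
from datetime import datetime

# Hypothetical incident records: (failure start, recovery) pairs.
incidents = [
    (datetime(2018, 5, 1, 9, 0), datetime(2018, 5, 1, 11, 30)),
    (datetime(2018, 6, 12, 14, 0), datetime(2018, 6, 12, 14, 45)),
]

def mttr_hours(incidents):
    """Mean hours elapsed from service failure to system recovery."""
    total = sum((end - start).total_seconds() for start, end in incidents)
    return total / len(incidents) / 3600

# Mean of a 2.5-hour and a 0.75-hour incident: 1.625 hours.
print(mttr_hours(incidents))
```

The hard part is capturing accurate start and recovery times, which is one more reason a formal incident response process pays off.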
Another practice we adopted after watching MTTR is the blameless postmortem. A postmortem analyzes the cause of a problem and how to avoid it in the future. Postmortems are useful ways to share perspective and knowledge. Each time, we copy a postmortem template that we crafted from a few articles: Hootsuite's 5 Whys, and Blameless PostMortems and a Just Culture.
Cycle Time *
Time it takes a task to start in development and end in production
This is my favorite metric. If I had to watch one number to gauge software delivery health it would be this one. Cycle time is the time it takes to get past all the hurdles you encounter in software development. Upstream dependencies, gaps in testing, acceptance review, and deployment pain are all captured in one number. At PBS we've reacted to this metric in a few ways.
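Mechanically, Cycle Time is just the difference between two timestamps on an issue: when development started and when the change landed in production. A minimal sketch, with hypothetical timestamps:

```python
from datetime import datetime

def cycle_time_hours(started_dev, deployed_to_prod):
    """Hours from the start of development to landing in production."""
    return (deployed_to_prod - started_dev).total_seconds() / 3600

# An issue picked up Monday morning and deployed Wednesday afternoon.
start = datetime(2018, 7, 9, 9, 0)
done = datetime(2018, 7, 11, 15, 0)
print(cycle_time_hours(start, done))  # 54.0
```

What makes the single number valuable is everything it absorbs along the way: upstream dependencies, review delays, and deployment pain all show up as hours.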
First, we scrutinize issues for upstream blockers before we start work. We're trying to minimize the amount of work we have in progress. We're destined for a stalled effort if we begin development and then find that we need something from a supporting service. It's more efficient to identify the blocker up front, before we write any code.
Second, Cycle Time incentivizes us to keep tasks small. A small task is quicker to develop and review. The risk of the change is low so we deploy it more quickly too. Feature flags work well to keep tasks small. We've adopted feature flags at PBS with great success. We recognize that feature flags are tech debt:
- They add some complexity to our code.
- We have to remove the flag eventually.
The benefits outweigh the drawbacks for us though. I love reviewing a pull request that affects three files with a feature flag. I cringe at reviewing a PR that represents the full feature, affects 15 files, and carries significantly more risk.
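The feature flag pattern itself is small. Here's a minimal sketch; the flag name and functions are hypothetical, and a real system would typically read flag state from a config service rather than a hard-coded dict:

```python
# Hypothetical flag store; real systems usually pull this from a
# config service or database so flags can flip without a deploy.
FLAGS = {"new_video_player": False}

def is_enabled(flag_name):
    """Return whether a feature flag is turned on (default: off)."""
    return FLAGS.get(flag_name, False)

def render_player():
    if is_enabled("new_video_player"):
        return "new player"   # partially completed work, dark in prod
    return "old player"       # current behavior for all users

print(render_player())  # old player
```

This is also why the flags count as tech debt: the conditional and the flag entry both have to be removed once the feature fully ships.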
The engineers who have adopted Accelerate's metrics at PBS have noticed a positive trend. After we measured our software delivery performance, we could improve it. There have been benefits outside of numbers and graphs. Work has been more fun! Some people hear "continuous delivery" and think it sounds exhausting. They think it means a never-ending, scrutinized effort. Engineers can imagine a boss calling them into the office to talk about why their Cycle Time went up 10 minutes yesterday. To me, nothing could be further from the truth. Our engineering effort sees the light of day as quickly as possible. We're shipping daily. The backlog is diminishing. We're creating a product we're proud to work on.
* It's worth noting that Accelerate refers to my "Cycle Time" as "Lead Time". From what I've read, "Lead Time" is a longer metric: it represents the time from when a requirement is identified, or an issue is put in the backlog, until that issue is completed. There may be some dispute here. A few people at PBS were more comfortable with "Cycle Time", and that's what we stuck with.