Building Software Metrics That Actually Help
We’ve heard this story before: The nail factory was measured by the total weight of nails produced. The factory responded predictably: it made enormous, heavy nails that nobody could use. When the managers switched to measuring the number of nails, the factory produced millions of tiny pins, equally useless. This wasn’t malice or stupidity. It was rational behaviour in a broken measurement system.
Every software team I’ve worked with has its own version of this story. The team measured on lines of code, which responded with verbose, repetitive programs. The QA department evaluated on bugs found, which rejected features for inconsistent capitalisation. The support team measured on ticket closure rate, which marked problems “resolved” whilst customers fumed.
The challenge isn’t that people game metrics. It’s that metrics, by their very nature, create gaming opportunities. The moment you reduce complex work to simple numbers, you create a gap between what you’re measuring and what you actually care about. And in that gap, gaming thrives.
The Manufacturing Inheritance
To understand why software metrics so often fail, we need to understand where they came from. Most of our measurement approaches descended from manufacturing, specifically from Frederick Taylor’s scientific management and later from the Toyota Production System. These systems transformed manufacturing, but software isn’t manufacturing, and pretending otherwise causes profound problems.
In manufacturing, variation is the enemy. When Toyota measures defects per million opportunities, they’re dealing with repeatable processes where consistency equals quality. A car door should fit exactly the same way every time. Variation means something’s wrong.
But software development is knowledge work, not production work. Every feature is different. Every bug is unique. Every line of code solves a new problem (or should; if you’re writing the same code repeatedly, you need better abstractions, not metrics).
When we apply manufacturing metrics to software, we’re measuring the wrong thing.
Consider velocity, perhaps the most common agile metric. Teams measure story points completed per sprint, treating it like a factory’s throughput. But story points aren’t widgets. They’re estimates of complexity and uncertainty. When you start measuring velocity as productivity, teams naturally inflate their estimates. A task that was 3 points becomes 5, then 8. The velocity chart goes up, the actual delivery stays the same, and everyone pretends not to notice.
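One way to make that drift visible is to track points alongside a plain count of items shipped. A minimal sketch, with invented numbers, of how inflation shows up as a rising points-per-item ratio even while throughput stays flat:

```python
# Hypothetical sprint history: story points "completed" vs items actually shipped.
# Points drift upward while throughput stays flat -- the inflation signature.
sprints = [
    {"points": 21, "items": 7},
    {"points": 25, "items": 7},
    {"points": 32, "items": 8},
    {"points": 40, "items": 7},
]

def points_per_item(sprint):
    """Average estimate per delivered item; a rising series suggests inflation."""
    return sprint["points"] / sprint["items"]

inflation = [round(points_per_item(s), 1) for s in sprints]
print(inflation)  # → [3.0, 3.6, 4.0, 5.7]
```

The velocity chart alone shows 21 points climbing to 40; the item count shows nothing changed except the estimates.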
W. Edwards Deming, whose work inspired the Toyota Production System, understood this problem deeply. He railed against numerical targets, arguing they drove fear and short-term thinking into organisations. “Eliminate numerical goals, numerical quotas and management by objectives,” he wrote. “Substitute leadership.”
Yet somehow, when Deming’s ideas crossed into software through lean and agile movements, we kept the metrics whilst losing the wisdom about their limitations.
The Toyota Paradox
Here’s what’s fascinating about the Toyota Production System: whilst it’s often cited as justification for software metrics, Toyota itself is remarkably sceptical about measurement. Yes, they track defects and cycle time and inventory levels. But these aren’t targets. They’re indicators of system health.
The difference is subtle but crucial. A target says “achieve this number.” An indicator says “this number helps us understand what’s happening.” When you treat metrics as indicators, gaming becomes pointless. You’re not trying to make the number look good; you’re trying to understand what the number tells you.
Taiichi Ohno, the father of the Toyota Production System, was explicit about this. The purpose of measurement was to reveal problems, not to evaluate performance. When a metric showed something wrong, the response wasn’t punishment but curiosity. Why is cycle time increasing? What’s causing these defects? How can we improve the system?
This approach, called “going to gemba” (going to where the work happens), means metrics are always grounded in reality. You don’t manage by numbers from a dashboard. You use numbers to guide your investigation of the actual work.
When software teams adopt this approach, metrics transform. Instead of velocity targets, you track cycle time variation to understand predictability. Instead of code coverage goals, you measure the economic impact of escaped defects to understand quality. The metrics serve the work, not the other way around.
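As an illustration of the first of these, a small sketch (hypothetical data) of cycle time variation expressed as a coefficient of variation, which speaks to predictability rather than raw speed:

```python
from statistics import mean, stdev

# Hypothetical cycle times in days for recently completed work items.
cycle_times = [2.0, 3.5, 2.5, 14.0, 3.0, 2.5, 4.0, 11.0, 3.0, 2.5]

# Coefficient of variation: spread relative to the average.
# High values mean delivery dates are hard to predict, regardless of speed.
cv = stdev(cycle_times) / mean(cycle_times)
print(f"mean: {mean(cycle_times):.1f} days, CV: {cv:.2f}")
```

Here the average looks healthy, but the two outliers push the CV close to 1, which is the number worth investigating at the gemba.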
Enter the DORA Metrics
In 2014, Nicole Forsgren, Jez Humble, and Gene Kim began studying what actually makes software teams effective. They surveyed thousands of teams, analysed their practices, and looked for statistical correlations with business outcomes. The result was the DORA metrics: Deployment Frequency, Lead Time for Changes, Mean Time to Restore, and Change Failure Rate.
These metrics were revolutionary because they measured outcomes, not activity. They didn’t care how many story points you completed or how many lines of code you wrote. They measured whether you could deliver value quickly and safely.
But here’s what’s often missed about the DORA metrics: they work as a system. You can’t optimise one without affecting the others. If you deploy more frequently without improving your testing, your failure rate increases. If you reduce lead time by skipping reviews, your restoration time grows. The metrics create a natural balance that resists gaming.
This systems thinking reflects a deeper truth about measurement. Individual metrics can always be gamed, but systems of metrics that have natural tensions are much harder to manipulate. It’s like the difference between a unicycle and a bicycle. A unicycle can fall in any direction; a bicycle’s two wheels create stability through balance.
The DORA metrics also demonstrate another crucial principle: measuring closer to customer value reduces gaming opportunities. Deployment frequency matters because customers can’t use features stuck in staging. Lead time matters because faster feedback loops improve quality. These aren’t proxy metrics; they’re direct measures of value delivery.
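A minimal sketch of how the four indicators might be computed from a team’s deployment log; the field names and figures are invented, not a standard schema:

```python
from statistics import median

# Hypothetical deployment records over a 28-day window; field names are
# illustrative, not a standard schema.
deployments = [
    {"lead_time_hours": 20, "failed": False},
    {"lead_time_hours": 30, "failed": True, "restore_hours": 2},
    {"lead_time_hours": 16, "failed": False},
    {"lead_time_hours": 26, "failed": False},
]

def dora(deploys, period_days):
    """Compute the four DORA indicators over an observation window."""
    failures = [d for d in deploys if d["failed"]]
    return {
        "deploy_frequency_per_week": len(deploys) / (period_days / 7),
        "median_lead_time_hours": median(d["lead_time_hours"] for d in deploys),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_restore_hours": (
            sum(d["restore_hours"] for d in failures) / len(failures)
            if failures else 0.0
        ),
    }

print(dora(deployments, 28))
```

Note that all four come from the same records: pushing one number around inevitably moves the others, which is the balancing property described above.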
Yet even DORA metrics have limitations. They assume you’re building the right thing. You can have excellent deployment frequency whilst building features nobody wants. You can have superb lead time whilst solving the wrong problems. Technical excellence without product sense is just very efficient waste.
The Goodhart Trap
In 1975, British economist Charles Goodhart articulated what’s become known as Goodhart’s Law, now popularly paraphrased as: “When a measure becomes a target, it ceases to be a good measure.” This isn’t just about gaming. It’s about how measurement changes behaviour in subtle, often destructive ways.
Consider code coverage, a metric that seems purely objective. What percentage of your code is tested? Surely higher is better? But watch what happens when coverage becomes a target. Developers write tests that execute code without verifying behaviour. They test getters and setters. They avoid refactoring because it might temporarily reduce coverage. The metric improves, the code quality deteriorates.
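The gap between executing code and verifying behaviour is easy to demonstrate. In the sketch below, `apply_discount` is a hypothetical function; the first test earns full coverage without checking anything, the second encodes actual intent:

```python
def apply_discount(price, rate):
    """Apply a fractional discount, clamping the rate to [0, 1]."""
    rate = max(0.0, min(1.0, rate))
    return price * (1 - rate)

# A "coverage test": executes every line but asserts nothing.
# Coverage tools count this as fully covered, yet it can never fail.
def test_for_coverage():
    apply_discount(100.0, 0.5)
    apply_discount(100.0, 5.0)

# A behavioural test: encodes what the function is actually for.
def test_for_behaviour():
    assert apply_discount(100.0, 0.5) == 50.0
    assert apply_discount(100.0, 5.0) == 0.0  # an over-large rate is clamped

test_for_coverage()
test_for_behaviour()
```

Both tests produce identical coverage numbers; only one would catch a regression in the clamping logic.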
Donald Reinertsen, in “The Principles of Product Development Flow,” offers a way out of the Goodhart trap: measure economic impact, not proxy metrics. Don’t measure test coverage; measure the cost of production defects. Don’t measure velocity; measure the cost of delay. When metrics tie directly to economic outcomes, gaming them requires actually improving outcomes.
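One common way to operationalise this economic view is CD3 (cost of delay divided by duration): schedule the work whose delay costs most per week of effort. A sketch with invented figures:

```python
# Hypothetical backlog items with an estimated cost of delay (per week)
# and an estimated duration in weeks.
backlog = [
    {"name": "reporting", "cost_of_delay": 2000, "weeks": 4},
    {"name": "checkout fix", "cost_of_delay": 9000, "weeks": 2},
    {"name": "rebrand", "cost_of_delay": 1000, "weeks": 6},
]

def cd3(item):
    """Cost of Delay Divided by Duration: economic urgency per week of effort."""
    return item["cost_of_delay"] / item["weeks"]

schedule = sorted(backlog, key=cd3, reverse=True)
print([item["name"] for item in schedule])  # → ['checkout fix', 'reporting', 'rebrand']
```

Gaming this number would mean either genuinely reducing duration or genuinely understanding the cost of delay better, both of which are improvements.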
But economic metrics have their own challenges. They’re often lagging indicators, telling you about problems weeks or months after they occur. They can be influenced by factors outside the team’s control. And they can create their own perverse incentives, like avoiding innovative work because it’s economically risky.
The Human Element
Here’s what every discussion of metrics misses: metrics aren’t neutral. They encode values, assumptions, and power structures. When you measure individual productivity, you’re saying individual work matters more than collaboration. When you measure defects, you’re implying that perfection is achievable. When you measure velocity, you’re suggesting that faster is always better.
John Seddon, the British systems thinker, argues that most measurement in organisations is about control, not improvement. Managers measure to ensure compliance, not to enable learning. This command and control approach doesn’t just fail to improve performance; it actively damages it by destroying intrinsic motivation and creating fear.
The alternative is what Seddon calls “measures for method”: metrics chosen by teams to help them improve their work. These metrics might be temporary, used to test a hypothesis and then discarded. They might be local, relevant only to a specific team or project. They’re tools for learning, not weapons for judgment.
This approach aligns with what we know about human motivation. Daniel Pink’s research shows that people are motivated by autonomy, mastery, and purpose. Imposed metrics undermine all three. They remove autonomy by dictating behaviour. They replace mastery with compliance. They substitute numbers for purpose.
When teams choose their own metrics, something remarkable happens. Gaming disappears, not because it’s prevented, but because it becomes pointless. Why would you game a metric you chose to help yourself improve? It would be like cheating at solitaire.
Building a Measurement Framework
So how do we build metrics that actually help? How do we measure improvement without creating gaming, fear, or dysfunction?
First, separate indicators from targets. Use metrics to understand system behaviour, not to evaluate people. When a metric shows a problem, respond with curiosity, not judgment. Ask “what is this telling us?” not “who’s responsible?”
Second, measure at multiple levels. Team metrics help teams improve. Department metrics reveal coordination problems. Organisation metrics show systemic issues. But don’t cascade targets down through these levels. That path leads to the nail factory in the opening paragraph.
Third, create balanced metric sets. For every efficiency metric, have a quality metric. For every speed metric, have a sustainability metric. The DORA metrics work because they balance deployment frequency with failure rate, lead time with restoration time. Single metrics can always be gamed; balanced sets resist manipulation.
Fourth, measure closer to value. Instead of measuring hours worked, measure features delivered. Instead of measuring features delivered, measure customer problems solved. Instead of measuring problems solved, measure business impact. The closer you get to actual value, the harder gaming becomes.
Fifth, make metrics temporary. A metric that helps you improve this quarter might hinder you next quarter. When you’ve improved cycle time, switch to measuring predictability. When you’ve improved quality, focus on innovation. Permanent metrics become targets; temporary metrics remain tools.
The Continuous Improvement Perspective
Lean management teaches us that improvement is continuous, not episodic. You don’t improve once and stop. You create a culture of constant experimentation and learning. Metrics should support this culture, not replace it.
This means metrics should help you test hypotheses. “We think pair programming will reduce defects.” “We believe smaller batches will improve flow.” “We hypothesise that customer involvement will reduce rework.” Each hypothesis needs appropriate metrics, chosen for their ability to validate or refute the hypothesis.
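Such a hypothesis needs more than a gut feeling about the numbers. A sketch, with invented counts, of a two-proportion z-test comparing escaped-defect rates before and after adopting pair programming:

```python
from math import sqrt

# Hypothetical counts: defects escaping to production, per changes shipped,
# before and after the team adopted pair programming.
before = {"defects": 30, "changes": 200}
after = {"defects": 14, "changes": 190}

def two_proportion_z(a, b):
    """z statistic for the difference between two defect proportions."""
    p1, p2 = a["defects"] / a["changes"], b["defects"] / b["changes"]
    pooled = (a["defects"] + b["defects"]) / (a["changes"] + b["changes"])
    se = sqrt(pooled * (1 - pooled) * (1 / a["changes"] + 1 / b["changes"]))
    return (p1 - p2) / se

z = two_proportion_z(before, after)
print(f"z = {z:.2f}")  # |z| > 1.96 suggests a real change at the usual 5% level
```

The point isn’t statistical ceremony; it’s that the metric exists to settle the hypothesis, after which it can be retired.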
Once you’ve learned what you need to learn, the metric might become irrelevant. This is fine. Metrics are tools, not monuments. Use them whilst they’re useful, discard them when they’re not.
The Toyota Kata, developed by Mike Rother to codify Toyota’s improvement practices, provides a structure for this approach. It involves four steps: understand the direction, grasp the current condition, establish the next target condition, and iterate toward that condition. Metrics serve each step differently.
When understanding direction, metrics help establish long-term aspirations. When grasping current condition, they reveal system behaviour. When establishing targets, they define experiments. When iterating, they provide feedback. The same metric might be useful in one step and harmful in another.
The Psychological Safety Factor
Amy Edmondson’s research on psychological safety reveals another crucial aspect of metrics: they only work in environments where people feel safe to tell the truth. If admitting problems leads to punishment, metrics will always lie.
This creates a paradox. The organisations that most need good metrics (those with problems) are least likely to get them (because people hide problems). Meanwhile, organisations with good metrics probably don’t need them as much, because their culture of openness makes problems visible anyway.
Building psychological safety requires treating metrics as information, not evaluation. When a metric reveals a problem, the response should be “how can we help?” not “who’s to blame?” This isn’t just morally right; it’s practically necessary. Blame drives problems underground, making them harder to solve.
Google’s Project Aristotle found that psychological safety was the strongest predictor of team effectiveness. Teams that felt safe to fail, to admit uncertainty, to ask for help, consistently outperformed teams that didn’t. Metrics in psychologically safe environments become tools for learning. In unsafe environments, they become weapons for survival.
The Remote Work Challenge
The shift to remote work has intensified metric dysfunction. Unable to see people working, managers reach for metrics as a proxy for presence. Lines of code, commits per day, hours logged, tickets closed. The digital panopticon emerges, with surveillance masquerading as measurement.
But remote work also offers opportunities for better metrics. Without the theatre of looking busy in an office, we can focus on actual outcomes. Without the pressure of visible presence, we can measure value delivery rather than activity.
The key is trust. If you don’t trust your team to work without surveillance, no metric will help. If you do trust them, you don’t need surveillance metrics. This isn’t about metrics at all; it’s about management philosophy.
Modern Complexities
Modern software development adds layers of complexity that traditional metrics can’t capture. Microservices mean that performance problems might emerge from interaction effects no single team controls. Machine learning models mean that behaviour might change without code changes. Cloud infrastructure means that costs scale non-linearly with usage.
These complexities require new approaches to measurement. Instead of static metrics, we need adaptive ones. Instead of single point measurements, we need time series analysis. Instead of absolute values, we need statistical distributions.
Consider reliability. Traditional metrics might track uptime percentage. But modern systems are too complex for simple up or down states. You need error budgets, service level objectives, and service level indicators. You need to measure the user experience, not just system state.
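A minimal sketch of the error-budget arithmetic, with invented figures: the SLO implies a tolerated failure allowance, and burn is the fraction of that allowance consumed:

```python
# Hypothetical SLO: 99.9% of requests succeed over a 30-day window.
slo_target = 0.999
window_requests = 10_000_000
failed_requests = 6_200

# The error budget is the failure allowance implied by the SLO.
budget = (1 - slo_target) * window_requests  # ~10,000 failures tolerated
burn = failed_requests / budget              # fraction of budget consumed

print(f"budget consumed: {burn:.0%}")  # → budget consumed: 62%
```

A budget at 62% part-way through the window is information, not a verdict: it might prompt slowing risky deployments, or it might be entirely expected.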
Or consider performance. It’s not enough to measure average response time. You need percentiles to understand the distribution. You need to segment by user type, geographic location, and device type. You need to correlate performance with business outcomes.
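A sketch of percentile reporting segmented by region, using a simple nearest-rank approximation and invented samples; note how the p95 exposes a tail the average would hide:

```python
# Hypothetical response times in milliseconds, segmented by region.
samples = {
    "eu": [80, 85, 90, 95, 100, 110, 120, 400, 90, 95],
    "us": [60, 65, 70, 68, 72, 75, 66, 70, 69, 71],
}

def percentile(values, p):
    """Nearest-rank percentile: small, dependency-free approximation."""
    ordered = sorted(values)
    rank = max(1, round(p / 100 * len(ordered)))
    return ordered[rank - 1]

for region, times in samples.items():
    p50, p95 = percentile(times, 50), percentile(times, 95)
    print(f"{region}: p50={p50}ms p95={p95}ms")
```

The EU median looks fine at 95ms, but the p95 of 400ms is the number a user on a slow request actually experiences.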
These sophisticated metrics require sophisticated interpretation. They can’t be reduced to simple dashboards with red and green lights. They require context, analysis, and judgment. They’re tools for experts, not substitutes for expertise.
The Path Forward
So where does this leave us? How do we build metrics that help rather than harm?
Start with purpose. Why are you measuring? If it’s for control, stop. If it’s for learning, proceed. But be honest. Many organisations claim they measure for learning whilst using metrics for performance reviews.
Involve the people doing the work. Let teams choose metrics that help them improve. Support their choices, even if they’re not what you would choose. Trust that people want to do good work and will choose metrics accordingly.
Create safety. Make it safe to have bad metrics. If a team’s cycle time increases, respond with support, not scrutiny. If quality decreases, offer help, not criticism. Problems revealed are problems that can be solved. Problems hidden fester and grow.
Think in systems. Individual metrics lie. Systems of metrics reveal truth. Look for patterns across multiple metrics. Watch for unintended consequences. Notice what’s not being measured.
Stay curious. Metrics are questions, not answers. When velocity drops, ask why. When quality improves, understand how. When patterns change, investigate causes. Let metrics guide inquiry, not replace it.
Accept ambiguity. Not everything meaningful can be measured. Not everything measurable is meaningful. Some of the most important aspects of software development (creativity, collaboration, innovation) resist quantification. Don’t let the measurable drive out the important.
The Ultimate Metric
If I had to choose one metric for software teams, it wouldn’t be velocity or quality or efficiency. It would be learning rate. How quickly does the team recognise and correct mistakes? How effectively do they incorporate feedback? How readily do they adapt to change?
Learning rate compounds. A team that learns faster pulls away from teams that don’t. They solve problems better, adapt quicker, and innovate more. But learning rate is hard to measure directly. You can only infer it from other changes.
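One way to do that inference: pick a proxy such as time from defect detection to fix and estimate its trend. A sketch, with invented data, fitting a log-linear slope to get an average fractional improvement per period:

```python
from math import exp, log

# Hypothetical proxy for learning: hours from defect detection to fix, per month.
# A shrinking series suggests the team is correcting mistakes faster over time.
time_to_fix = [40.0, 33.0, 26.0, 22.0, 18.0, 15.0]

def improvement_rate(series):
    """Least-squares slope of log(values): average fractional change per period."""
    n = len(series)
    xs = range(n)
    ys = [log(v) for v in series]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    slope = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sum(
        (x - x_mean) ** 2 for x in xs
    )
    return 1 - exp(slope)  # e.g. 0.18 means roughly 18% faster each month

rate = improvement_rate(time_to_fix)
print(f"~{rate:.0%} improvement per month")
```

The absolute numbers matter less than the compounding: a team improving even a few percent per month looks very different a year later.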
This might be the ultimate truth about metrics: the most important things can’t be measured directly. You can’t measure learning, but you can see its effects. You can’t measure culture, but you can observe its symptoms. You can’t measure potential, but you can create conditions for it to flourish.
Conclusion: Measuring What Matters
The nail factory failed not because the planners were stupid, but because they believed measurement could substitute for understanding. They thought that if they could just find the right metric, they could manage from afar. Software organisations make the same mistake, believing that dashboards can replace judgment, that metrics can replace management, that numbers can replace knowledge.
The path forward isn’t to abandon metrics but to humanise them. To use them as tools for understanding rather than weapons for control. To let teams own their measurements rather than having measurements own them. To measure what matters rather than what’s easy.
This requires courage. It’s scary to give up the illusion of control that metrics provide. It’s uncomfortable to accept ambiguity. It’s hard to trust people to improve without targets driving them.
But the alternative is the path to the nail factory: perfectly optimised metrics producing perfectly useless outcomes. We can do better. We must do better. The future of software development depends not on measuring more, but on measuring more wisely.
The next time someone proposes a new metric, ask not “what will this measure?” but “how will this help us learn?” Ask not “what’s the target?” but “what’s the hypothesis?” Ask not “who’s accountable?” but “how can we improve?”
In the end, metrics are just information. What matters is what we do with that information. Do we use it to punish or improve? To control or enable? To judge or learn? The choice we make determines whether metrics help or harm, whether they reveal or conceal, whether they improve our work or destroy it.
Choose wisely. Your team’s future depends on it.
References
Books
Deming, W. Edwards. Out of the Crisis. MIT Press, 1986.
DeMarco, Tom and Lister, Timothy. Peopleware: Productive Projects and Teams. Addison-Wesley, 1987.
Edmondson, Amy. The Fearless Organization: Creating Psychological Safety in the Workplace for Learning, Innovation, and Growth. Wiley, 2018.
Forsgren, Nicole, Humble, Jez, and Kim, Gene. Accelerate: The Science of Lean Software and DevOps. IT Revolution Press, 2018.
Goldratt, Eliyahu M. The Goal: A Process of Ongoing Improvement. North River Press, 1984.
Humble, Jez and Farley, David. Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley, 2010.
Kim, Gene, Humble, Jez, Debois, Patrick, and Willis, John. The DevOps Handbook. IT Revolution Press, 2016.
Liker, Jeffrey. The Toyota Way: 14 Management Principles from the World’s Greatest Manufacturer. McGraw-Hill, 2004.
Ohno, Taiichi. Toyota Production System: Beyond Large-Scale Production. Productivity Press, 1988.
Pink, Daniel H. Drive: The Surprising Truth About What Motivates Us. Riverhead Books, 2009.
Poppendieck, Mary and Poppendieck, Tom. Lean Software Development: An Agile Toolkit. Addison-Wesley, 2003.
Reinertsen, Donald G. The Principles of Product Development Flow: Second Generation Lean Product Development. Celeritas Publishing, 2009.
Ries, Eric. The Lean Startup. Crown Business, 2011.
Rother, Mike. Toyota Kata: Managing People for Improvement, Adaptiveness and Superior Results. McGraw-Hill, 2009.
Seddon, John. Freedom from Command and Control: Rethinking Management for Lean Service. Productivity Press, 2005.
Shingo, Shigeo. A Study of the Toyota Production System. Productivity Press, 1989.
Womack, James P. and Jones, Daniel T. Lean Thinking: Banish Waste and Create Wealth in Your Corporation. Simon & Schuster, 1996.
Papers and Articles
Goodhart, Charles. “Problems of Monetary Management: The UK Experience.” Papers in Monetary Economics. Reserve Bank of Australia, 1975.
Forsgren, Nicole and Kersten, Mik. “DevOps Metrics.” Communications of the ACM, Vol. 61, No. 4, 2018.
Google. “Guide: Understand Team Effectiveness.” re:Work, 2015. (Project Aristotle findings)
Humble, Jez, Molesky, Joanne, and O’Reilly, Barry. “Lean Enterprise: How High Performance Organizations Innovate at Scale.” O’Reilly Media, 2015.
State of DevOps Report. DevOps Research and Assessment (DORA), 2014-2023 annual reports.
Online Resources
DORA Metrics: https://dora.dev/
Lean Enterprise Institute: https://www.lean.org/
The DevOps Institute: https://devopsinstitute.com/
Related Academic Work
Campbell, Donald T. “Assessing the Impact of Planned Social Change.” Evaluation and Program Planning, Vol. 2, No. 1, 1979.
Muller, Jerry Z. The Tyranny of Metrics. Princeton University Press, 2018.
Austin, Robert D. Measuring and Managing Performance in Organizations. Dorset House, 1996.
Bevan, Gwyn and Hood, Christopher. “What’s Measured is What Matters: Targets and Gaming in the English Public Health Care System.” Public Administration, Vol. 84, No. 3, 2006.
