The Developer Productivity Trap
Why Everything We Think We Know About Developer Productivity Is Wrong — And Why AI Is Making It Worse
Abstract
For decades, the software industry has been trying to measure developer productivity the way factories measure widget output — and failing spectacularly. From lines of code to story points, from velocity charts to McKinsey frameworks, every attempt to reduce the complex, creative act of software development to a number has produced the same result: perverse incentives, gaming, and surrogation — the nonconscious process by which people forget what the metric was supposed to represent and treat the number itself as the goal. Modern frameworks like DORA and SPACE represent genuine progress, yet they too are routinely misapplied. Now, AI coding assistants have entered the picture, promising to make every developer a “10x engineer.” But the research tells a far more nuanced and troubling story: AI amplifies whatever is already there — good practices and bad metrics alike. This article traces the history of productivity measurement in software, dismantles the most persistent myths, examines what the evidence actually says about AI-assisted development, and offers leadership a fundamentally different way to think about the question. The answer, it turns out, is not a better metric. It is a better question.
1. Minus Two Thousand Lines of Code
In 1982, Bill Atkinson was rewriting QuickDraw’s region calculation engine for the Apple Lisa. He replaced a complex algorithm with a simpler, more general one that ran approximately six times faster. The new code was elegant, compact, and superior in every measurable way. It also happened to be 2,000 lines shorter than what it replaced.
That week, Apple’s managers required every engineer to submit a form reporting how many lines of code they had written. Atkinson wrote -2000 on his form.
Management discontinued the form shortly afterward.
Atkinson believed that “lines of code was a silly measure of software productivity” and that such metrics “only encouraged writing sloppy, bloated, broken code.” He was right — and yet, more than four decades later, the software industry is still searching for the number that captures what a developer is worth. We are still, in essence, asking engineers to fill out the same form. We have just made the form more sophisticated.
The pursuit itself is the problem. And now, with AI generating code at unprecedented speed, we are not getting closer to an answer. We are getting further away, faster.
2. The Graveyard of Bad Metrics
Every era of software engineering has produced its own favourite way to measure developer productivity. Every one of them has failed. Understanding why they failed is more instructive than understanding what they measured.
Lines of Code
The most intuitive metric is also the most thoroughly discredited. Bill Gates reportedly said that “measuring software productivity by lines of code is like measuring progress on an airplane by how much it weighs.” The problems are well-documented: a task requiring 100 lines in C++ might take 10 in Python, making cross-project comparisons meaningless. Refactoring, deduplication, and simplification — the hallmarks of good engineering — all reduce line count. Under an LOC regime, the best work registers as negative productivity.
Martin Fowler put it simply: well-designed code is shorter because it eliminates duplication. Copy-paste programming inflates LOC while degrading design. LOC indicates system size, not value created.
Story Points and Velocity
Story points and velocity were designed as planning tools — ways for teams to forecast how much work they could take on in a sprint. They were never meant to measure productivity. But in organization after organization, they have been weaponized for exactly that purpose.
The problems are predictable. When velocity becomes a target, teams inflate their estimates to appear more productive. Individual teams can easily game the system, and when some members start inflating story points, others notice. From there, it is a short path to broken trust, frustration, and culture damage. Story points focus on effort, not value. A team can complete 50 points of features that nobody uses and score higher than a team that delivers 20 points of work that transforms the business.
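The inflation dynamic is easy to see in a toy simulation. All the numbers below are hypothetical; the point is only that when estimates drift upward a little each sprint, reported velocity climbs steadily while the real work delivered never changes:

```python
# Toy model: story-point inflation under a velocity target.
# All figures are hypothetical; this only illustrates the dynamic.

def simulate_sprints(n_sprints, real_work_per_sprint=20, inflation_per_sprint=0.10):
    """Each sprint the team delivers the same real work, but pads
    its estimates a little more, so reported velocity keeps climbing."""
    history = []
    padding = 1.0
    for sprint in range(1, n_sprints + 1):
        reported_velocity = real_work_per_sprint * padding
        history.append((sprint, round(reported_velocity, 1), real_work_per_sprint))
        padding *= 1 + inflation_per_sprint  # estimates drift upward
    return history

for sprint, reported, real in simulate_sprints(6):
    print(f"Sprint {sprint}: reported velocity {reported:>5}, real work {real}")
```

After six sprints the dashboard shows velocity up by more than 60 percent; the product has not moved an inch. Any chart built from the reported column tells a story of continuous improvement that is entirely fictional.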
Hours Worked
In knowledge work, more hours frequently produce worse outcomes. Each interruption costs approximately 23 minutes of refocus time, and attention residue from task switching can impair cognitive capacity for 10 to 30 minutes afterwards. Cal Newport’s research on “deep work” demonstrates that sustained focus periods of 90 or more minutes are necessary for complex problem-solving. Measuring hours measures presence, not thought. And in an industry where the most valuable breakthroughs often happen during a walk or in the shower, presence is a particularly poor proxy for contribution.
Three Concepts That Explain Everything
Three concepts explain why every simple metric fails — and will always fail.
The McNamara Fallacy, named for U.S. Secretary of Defense Robert McNamara, describes a four-step descent into delusion: First, measure whatever can be easily measured. Second, disregard what cannot be easily measured. Third, presume that what cannot be measured is not important. Fourth, declare that what cannot be measured does not exist.
During the Vietnam War, McNamara used enemy body counts as the primary measure of success. When a general suggested adding a factor for the sentiments of the Vietnamese people, McNamara erased it from the report — he could not quantify it. The war was lost despite the metrics showing consistent progress. The software industry repeats this pattern with striking fidelity: we measure commits, pull requests, and deployment frequency while ignoring design quality, knowledge sharing, and whether anyone actually uses what was built.
Goodhart’s Law states: “When a measure becomes a target, it ceases to be a good measure.” In software engineering, this manifests everywhere: developers rush deployments to meet frequency targets, producing unstable code; teams close easy tickets to inflate resolution rates; engineers write verbose code to increase line counts; managers ship unnecessary features to hit delivery targets. The metric improves. The system degrades.
Surrogation is the most insidious of the three, because unlike gaming, it is invisible to the person it affects. Formally defined by Choi, Hecht, and Tayler in a 2012 paper in The Accounting Review, surrogation describes the nonconscious process by which people stop treating a metric as a proxy for a goal and start treating the metric as the goal. The metric does not just become a target — it replaces the thing it was supposed to represent in people’s minds.
The mechanism is what psychologists call attribute substitution: when the thing you actually care about (developer productivity, code quality, customer satisfaction) is abstract and hard to observe, and the metric (velocity, deployment frequency, NPS score) is concrete and immediately available, your brain quietly swaps one for the other. You do not notice the substitution. You genuinely believe you are pursuing the original goal when you are, in fact, optimising a number.
A person affected by Goodhart’s Law may know they are gaming. A person affected by surrogation does not know they are substituting. And critically, research by Black, Meservy, Tayler, and Williams (2021) demonstrated that you do not even need to tie compensation to a metric for surrogation to occur. Simply tracking and reporting a number is enough. The mere existence of the metric triggers the substitution.
Harris and Tayler, writing in Harvard Business Review in 2019, identified three conditions that create surrogation risk: the strategic objective is abstract, the metric is concrete and visible, and people accept the measure as representing the goal. Software engineering meets all three conditions for virtually every metric it uses.
These three concepts are not bugs in specific metrics. They are features of the relationship between measurement and human cognition. The McNamara Fallacy explains why we ignore what we cannot count. Goodhart’s Law explains why people game what we do count. Surrogation explains why we forget what we were counting in the first place. Any single metric used for evaluation will eventually be gamed, surrogated, or both. This is not cynicism — it is a predictable consequence of how human minds interact with numbers.
3. The Modern Frameworks: Progress and Pitfalls
The good news is that the industry has produced genuinely better thinking about developer productivity. The bad news is that better thinking is routinely applied in the same old ways.
DORA Metrics
The four DORA metrics — Deployment Frequency, Lead Time for Changes, Mean Time to Recovery, and Change Failure Rate — were introduced by Dr. Nicole Forsgren, Jez Humble, and Gene Kim in Accelerate (2018). They represent a significant leap forward because they measure outcomes (how effectively software reaches users) rather than outputs (how much stuff developers produce).
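Mechanically, the four metrics are straightforward to derive from deployment history. The sketch below computes them from a handful of records; the record schema and field names are illustrative, not a standard format:

```python
# A minimal sketch of computing the four DORA metrics from deployment
# records. The dict schema here is made up for illustration.
from datetime import datetime
from statistics import mean

deployments = [
    dict(committed=datetime(2025, 1, 1, 9), deployed=datetime(2025, 1, 1, 15),
         failed=False, restored=None),
    dict(committed=datetime(2025, 1, 2, 10), deployed=datetime(2025, 1, 3, 11),
         failed=True, restored=datetime(2025, 1, 3, 13)),
    dict(committed=datetime(2025, 1, 5, 8), deployed=datetime(2025, 1, 5, 17),
         failed=False, restored=None),
]
window_days = 7

# Deployment Frequency: deploys per day over the observation window.
deploy_frequency = len(deployments) / window_days

# Lead Time for Changes: commit-to-production time, averaged.
lead_time_hours = mean(
    (d["deployed"] - d["committed"]).total_seconds() / 3600 for d in deployments)

# Change Failure Rate: share of deployments that caused a failure.
failures = [d for d in deployments if d["failed"]]
change_failure_rate = len(failures) / len(deployments)

# Mean Time to Recovery: failure-to-restoration time, averaged.
mttr_hours = mean(
    (d["restored"] - d["deployed"]).total_seconds() / 3600
    for d in failures) if failures else 0.0

print(f"Deployment frequency: {deploy_frequency:.2f}/day")
print(f"Lead time for changes: {lead_time_hours:.1f} h")
print(f"Change failure rate: {change_failure_rate:.0%}")
print(f"Mean time to recovery: {mttr_hours:.1f} h")
```

The arithmetic is the easy part. Everything that follows in this section is about what the arithmetic cannot tell you.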
But DORA’s creators explicitly warn against using these metrics for team-by-team comparison. Yet that is precisely what happens: leadership benchmarks teams against one another, comparing the deployment frequency of a mobile app team with that of a web service team as if the numbers were comparable. DORA metrics do not capture developer satisfaction, cognitive load, code quality, technical debt, or business value. A team can score “elite” on all four metrics while building a product that nobody wants.
The 2025 DORA Report acknowledged these limitations by abandoning the low/medium/high/elite performance tiers entirely, replacing them with seven team archetypes — from “harmonious high-achievers” to “legacy bottleneck” teams — reflecting a more nuanced understanding that performance is contextual and multi-dimensional.
SPACE Framework
SPACE, developed in 2021 by Nicole Forsgren, Margaret-Anne Storey, and colleagues at Microsoft Research, explicitly addresses the multi-dimensional nature of productivity across five dimensions: Satisfaction and well-being, Performance, Activity, Communication and collaboration, and Efficiency and flow. Its key insight is powerful: “Productivity cannot be reduced to a single dimension (or metric!). Only by examining a constellation of metrics in tension can we understand and influence developer productivity.”
Forsgren herself warned against the most common misuse: “One of the most common myths — and potentially most threatening to developer happiness — is the notion that productivity is all about developer activity, things like lines of code or number of commits. More activity can appear for various reasons: working longer hours may signal developers having to ‘brute-force’ work to overcome bad systems or poor planning.”
Developer Experience as a Lens
A more recent strand of thinking shifts the focus away from what developers produce and toward what developers experience. The core insight is that three dimensions shape how effectively developers can work: feedback loops (how quickly they get information about their work), cognitive load (how much mental effort their environment demands), and flow state (how often they can achieve sustained, focused work). When these conditions improve, outcomes follow. When they degrade, no amount of measurement or exhortation will compensate.
The McKinsey Debacle
In August 2023, McKinsey published “Yes, you can measure software developer productivity,” claiming their framework was already in use at nearly 20 companies. The response from the engineering community was swift and devastating — the strongest collective rebuttal in recent memory.
Kent Beck, the creator of Extreme Programming, called the report “so absurd and naive that it makes no sense to critique it in detail.” But then he critiqued it in detail, because “what they published damages people I care about. I’m here to help geeks feel safe in the world. This kind of surveillance makes geeks feel less safe.”
Dave Farley, co-author of Continuous Delivery, was blunt: “Apart from the use of DORA metrics in this model, the rest is pretty much astrology.” He drew a sharp line between DORA’s evidence-based approach and the rest of McKinsey’s framework — “the difference between astronomy and astrology.”
Beck and Gergely Orosz, in a detailed joint response, argued that the McKinsey framework “only measures effort or output, not outcomes and impact, which misses half of the software developer lifecycle.” They warned that “introducing a kind of framework that McKinsey is proposing is wrong-headed and certain to backfire. Such a framework will most likely do far more harm than good to organizations — and to the engineering culture at companies and the damage could take years to undo.”
Beck shared a cautionary tale from Facebook that perfectly illustrates surrogation in action. The company had introduced developer satisfaction surveys — a reasonable approach. But then managers computed an overall score. Then those scores appeared in performance reviews. Directors pressured managers for better scores. Managers began negotiating with engineers: higher survey scores in exchange for better performance ratings. The result was worse organisational outcomes despite improved metrics. First, surrogation: leadership began treating the score as developer satisfaction, forgetting that it was merely a proxy. Then Goodhart’s Law kicked in: once the surrogate became a target, people gamed it. The metric had consumed the thing it was supposed to measure.
The McKinsey episode matters because it revealed a fault line that runs through every discussion of developer productivity: the tension between what management consulting wants (simple numbers that enable comparison and control) and what software engineering actually is (complex, collaborative, creative knowledge work that resists simplification).
4. Why Software Is Fundamentally Different
The reason every productivity metric fails or gets corrupted is not that we haven’t found the right metric yet. It is that software development is a fundamentally different kind of work than what most measurement frameworks were designed for.
Knowledge Work Is Not Manufacturing
Peter Drucker, who coined the term “knowledge work,” identified the crucial distinction: in manufacturing, the task is defined externally, and the worker’s job is to execute efficiently. In knowledge work, “the task of what to do is controlled by knowledge workers, they own the means of production (knowledge not machines), they decide what methods and steps to use, and they focus on what to do and on the right things — effectiveness.” Drucker considered knowledge-work productivity “the greatest management task of this century, just as making manual work productive was the great management task of the last century.”
The manufacturing metaphor fails for software because output is not standardised (every piece of software solves a different problem), efficiency is not the primary constraint (figuring out what to build matters more than how fast you build it), quality is not independently measurable (a feature that works but nobody needs has negative value), and the work itself is fundamentally creative — requiring “anticipatory imagination, problem solving, problem seeking, and generating ideas.”
Output vs. Outcome
Martin Fowler articulates the core principle: “If a team delivers lots of functionality... that functionality doesn’t matter if it doesn’t help the user improve their activity.” Output is what is produced: features shipped, code written, pull requests merged. Outcome is the change that the output creates: increased revenue, reduced support tickets, improved user satisfaction.
The insight is disarmingly simple: “Your job is to minimize output, and maximize outcome and impact.” The best solution is often the smallest change, the feature not built, the code deleted. Under any output-based metric, these acts of productive restraint are invisible or actively penalised.
Fowler dismisses the objection that outcomes are hard to measure: “We are very good at measuring financial outcomes.” The problem is not that outcomes cannot be measured, but that organisations prefer the comfort of output metrics because they are immediate and controllable — even when they measure the wrong thing.
The Invisible Work Problem
Charity Majors, CTO of Honeycomb, offers perhaps the most uncomfortable truth: “Some of the hardest and most impactful engineering work will be all but invisible on any set of individual metrics.”
Consider the work that keeps a team effective: pairing and mobbing that improve quality and spread knowledge across the team — but show up as “two people doing one person’s job” in output metrics. Mentoring that multiplies team capacity — but reduces the mentor’s personal output. Architectural decisions that save months of future work — but take hours that produce no visible deliverable. Cross-team coordination that unblocks others. Documentation that prevents future confusion. Technical debt reduction that improves future velocity but produces zero features.
None of this registers on any output metric. All of it is essential.
Majors pushes further: “Metrics are for easy problems — discrete, self-contained, well-understood problems. The more challenging and novel a problem, the less reliable these metrics will be.” And the most provocative observation: “To the extent you can reduce a job to a set of metrics, that job can be automated away.” If developer productivity could truly be captured in a number, developers would already be obsolete.
The Individual Measurement Trap
Dave Farley puts it plainly: “Measuring software development in terms of individual developer productivity is a terrible idea. Being smart in how we structure and organize our work is much more important than the level of individual genius.”
The harms are well-documented. Individual measurement distorts behaviour: people optimise for their own numbers at the expense of the team. It creates misaligned incentives: short-term outputs over long-term value. It renders collaborative work invisible: the developer who writes little code but dramatically improves team performance through guidance and knowledge sharing is undervalued. And it damages culture: “Monitoring individual performance can cause unnecessary anxiety, even for top-performing contributors. It can also lead to overworking, an overly competitive work culture, and developer burnout.”
5. Enter AI: The Amplifier of Everything
Into this already confused landscape, artificial intelligence has arrived with the promise of making every developer dramatically more productive. The reality is far more complex — and for leaders relying on traditional metrics, far more dangerous.
We Do Not Know What We Think We Know
Studies on AI-assisted coding productivity contradict each other wildly — some report massive speed gains, others find slowdowns and more bugs. But the honest takeaway is not that “it depends.” It is that we simply do not have reliable data yet. Every developer uses AI differently — different tools, different prompting styles, different levels of trust and scrutiny, different types of work. The studies measure wildly different things under wildly different conditions. They are not two sides of a debate. They are measurements of entirely different activities that happen to share the label “coding with AI.”
Until the industry converges on how AI is actually used in practice, the numbers should be treated with deep scepticism. Leaders who cite any of these studies to justify or reject AI investment are building on sand.
What we can observe, however, is a persistent perception gap: developers consistently believe AI is helping them more than it measurably does. This matters for leadership. If developers cannot accurately assess whether AI tools are making them more effective, then the comfortable feedback loop — “we bought the tool, developers say they like it, therefore it’s working” — may be an illusion.
We can also observe quality trends that should concern anyone paying attention. Code duplication is rising. Refactoring is collapsing. Code churn is increasing. Even the CEO of Cursor — a company that sells an AI coding tool — has warned publicly that developers accept AI-generated code “simply because it appears to work, without properly reviewing its structure, logic and long-term impact.” When the people selling the tools are urging caution, it is worth listening.
AI Makes Bad Metrics Worse
Every problem with metrics described in this article — gaming, perverse incentives, the McNamara Fallacy, Goodhart’s Law, surrogation — becomes dramatically worse when AI enters the picture.
A human developer gaming a velocity target has natural friction: there are only so many hours in a day, only so much code a person can write. AI has none of these constraints. If the metric is lines of code, AI will generate mountains of it. If the metric is pull requests merged, AI will create dozens. If the metric is deployment frequency, AI will ship constantly. All while the codebase bloats, technical debt compounds, and the actual product — the thing users need — remains unchanged or gets worse.
AI also deepens surrogation. When a human writes code, reviewers can draw on shared context to assess whether the work actually serves the original goal. AI-generated output lacks that shared context, making it harder to notice the gap between the metric and the thing the metric was supposed to represent. And AI tool adoption metrics are themselves susceptible to surrogation: “AI adoption rate” or “percentage of code generated by AI” can quietly replace the actual goal — effective use of AI to improve engineering outcomes — in leaders’ minds. Adoption becomes the goal, regardless of whether it improves anything.
Goodhart’s Law becomes more dangerous as optimisation power increases. AI is a step-function increase in optimisation power. Applied to bad metrics, it does not merely fail — it fails spectacularly, at speed, and at scale.
The 2025 DORA Report puts it simply: “AI doesn’t fix a team; it amplifies what’s already there.” AI is not a productivity solution. It is a productivity amplifier. And amplifiers do not care what signal they are boosting.
6. What Actually Works
If simple metrics fail, modern frameworks are misapplied, and AI amplifies dysfunction, what should leadership actually do? The evidence points toward a fundamentally different approach — one that requires more organisational maturity but produces genuinely better results.
Measure Experience, Not Output
The most actionable path forward is to stop measuring what developers produce and start measuring the conditions that enable production:
Feedback loops. How quickly do developers get information about their work? How long does a CI pipeline take? How fast do code reviews come back? How soon after deployment do they know if something broke? Every minute of waiting is a minute of lost context, and context — not typing speed — is the primary constraint on developer effectiveness.
Cognitive load. How much mental effort does the development environment demand? How many systems must a developer understand to make a change? How clear is the documentation? How usable are the tools? Teams drowning in complexity do not need productivity metrics. They need simpler systems.
Flow state. How often can developers achieve sustained, uninterrupted focus? Research shows it takes approximately 23 minutes to refocus after each interruption, and minimum effective deep work periods are around 90 minutes. An organisation that interrupts developers every 30 minutes for status meetings and Slack messages is not suffering from a productivity problem. It is suffering from a management problem.
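One way to make the flow-state condition concrete is to estimate, from a day’s calendar, how much deep-work time is actually left once each interruption’s refocus cost is paid. The sketch below uses the ~23-minute refocus cost and 90-minute minimum block cited above; the meeting times are hypothetical:

```python
# A sketch estimating how much deep-work time a calendar actually leaves,
# using the ~23-minute refocus cost and 90-minute minimum focus block.
# All event times below are hypothetical.
from datetime import datetime, timedelta

REFOCUS = timedelta(minutes=23)
MIN_DEEP_BLOCK = timedelta(minutes=90)

def deep_work_blocks(day_start, day_end, interruptions):
    """Return the focus blocks long enough for deep work, assuming each
    interruption burns a fixed refocus cost before focus resumes."""
    blocks, cursor = [], day_start
    for start, end in sorted(interruptions):
        if start - cursor >= MIN_DEEP_BLOCK:
            blocks.append((cursor, start))
        cursor = max(cursor, end + REFOCUS)  # pay the refocus cost
    if day_end - cursor >= MIN_DEEP_BLOCK:
        blocks.append((cursor, day_end))
    return blocks

day = datetime(2025, 1, 6)
meetings = [
    (day.replace(hour=10), day.replace(hour=10, minute=30)),           # standup
    (day.replace(hour=11), day.replace(hour=11, minute=15)),           # "quick sync"
    (day.replace(hour=14), day.replace(hour=15)),                      # planning
]
blocks = deep_work_blocks(day.replace(hour=9), day.replace(hour=18), meetings)
for start, end in blocks:
    print(f"deep work possible: {start:%H:%M}-{end:%H:%M}")
```

With just three meetings, a nine-hour day yields only two windows of real focus, and the entire morning is lost: a 30-minute standup and a 15-minute “quick sync”, placed an hour apart, fragment everything between 9:00 and 11:38.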
Almost half of tech managers now report that their companies measure developer productivity, developer experience, or both. Google, Microsoft, and Spotify have long relied on developer surveys to understand the conditions their teams work in. The shift is happening — but too slowly, and too often alongside the old metrics rather than replacing them.
Focus on Team Outcomes, Not Individual Metrics
Kent Beck’s recommendation is elegant in its simplicity: focus on “producing at least one customer-facing thing per team, per week.” This is not a metric to be gamed. It is a practice that aligns incentives: the whole team collaborates to deliver something a customer can see and respond to. Value is delivered. Feedback is gathered. The cycle continues.
Gergely Orosz and Abi Noda studied 17 major tech companies and found that “rather than wholesale adoption of frameworks like DORA, leading teams use a mix of org-specific qualitative and quantitative metrics.” The best organisations do not adopt a framework off the shelf. They build their own understanding of what “good” looks like in their specific context.
Use Qualitative Approaches Seriously
Google’s approach to developer productivity measurement reveals a counterintuitive truth: qualitative metrics — “measurements comprised of data provided by humans” — capture what automated systems cannot. Flow state, codebase navigability, technical debt perception, satisfaction — these are real and consequential, and they can only be measured by asking people.
Google’s own analysis of 117 metrics for technical debt found that none were valid indicators. Human judgement about the gap between ideal and actual state proved essential. The numbers, on their own, told the wrong story.
Practical recommendations from the research: start with qualitative baselines to identify opportunities, then deploy targeted quantitative metrics for deeper analysis. Segment by team and persona rather than aggregating company-wide. Prioritise free-text comments — developers suggest improvements and identify gaps that structured questions miss. Use transactional surveys at workflow touchpoints for granular, timely feedback. And above all, act on what you learn. Nothing kills a survey programme faster than asking for feedback and visibly ignoring it.
Think in Systems, Not Individuals
W. Edwards Deming’s insight from 1993 remains foundational: “Left to themselves, system components become selfish, competitive, independent profit centres and thus destroy the system. The secret is cooperation between components toward the aim of the organisation.”
The practical application is straightforward: use Theory of Constraints thinking. Identify the bottleneck in the system and focus improvement efforts there. Individual developer speed is rarely the actual constraint. More often, it is slow code reviews, unclear requirements, flaky CI pipelines, cumbersome deployment processes, or poor cross-team communication. Making developers type faster — whether through training or AI — does nothing to address these systemic bottlenecks.
Use AI Wisely
The evidence does not say “don’t use AI.” It says “use AI with open eyes.” AI coding assistants are genuinely valuable for boilerplate, repetitive tasks, and exploration. They are genuinely dangerous when used uncritically on complex, context-rich work — and when their output is measured by the same broken metrics that have always failed.
The 2025 DORA finding bears repeating: AI amplifies what is already there. Before investing in AI tools, invest in the practices that AI will amplify. Fix the feedback loops, reduce the cognitive load, protect the flow state, clarify the team outcomes. Then introduce AI into a healthy system — where it will make good things better rather than making bad things faster.
7. The Question Behind the Question
When an organisation asks “How do we measure developer productivity?”, it is worth pausing to ask why.
Sometimes the answer is benign: we want to understand where our bottlenecks are so we can remove them. Sometimes it is less benign: we want to identify which engineers to fire. Often, it is anxious: we are spending a lot on engineering and we do not know if we are getting value.
Each of these motivations leads to a different approach. The first calls for systems thinking and understanding what developers actually experience. The second calls for an honest conversation about whether the problem is individual performance or organisational dysfunction. The third calls for outcome measurement — connecting engineering work to business results, accepting that the feedback loop is long, and resisting the temptation to substitute easy output metrics for hard outcome questions.
The real question is not “How productive are our developers?” The real question is: “How do we create the conditions where developers can do their best work?”
This is a harder question. It does not produce a single number. It cannot be answered by a dashboard. It requires leadership to engage with the actual work of software development — its complexity, its creativity, its inherent resistance to simplification.
But it is the right question. And in the age of AI, where the temptation to optimise bad metrics at machine speed has never been greater, asking the right question has never mattered more.
Bill Atkinson understood this in 1982. He made software six times faster by writing 2,000 fewer lines of code. By every metric except the ones that matter, he had a terrible week.
By the ones that matter, it was one of the most productive weeks in the history of software engineering.
References
Folklore.org, “Negative 2000 Lines Of Code” — https://www.folklore.org/Negative_2000_Lines_Of_Code.html
Martin Fowler, “CannotMeasureProductivity” (2003) — https://martinfowler.com/bliki/CannotMeasureProductivity.html
Martin Fowler, “OutcomeOverOutput” — https://martinfowler.com/bliki/OutcomeOverOutput.html
Martin Fowler, “Measuring Developer Productivity via Humans” — https://martinfowler.com/articles/measuring-developer-productivity-humans.html
Nicole Forsgren et al., “The SPACE of Developer Productivity,” ACM Queue (2021) — https://queue.acm.org/detail.cfm?id=3454124
Nicole Forsgren, Jez Humble, Gene Kim, Accelerate (2018)
DORA Metrics Guide — https://dora.dev/guides/dora-metrics/
DORA Report 2025, Google Cloud — https://cloud.google.com/blog/products/ai-machine-learning/announcing-the-2025-dora-report
Kent Beck & Gergely Orosz, “Measuring developer productivity? A response to McKinsey,” The Pragmatic Engineer (2023)
Gergely Orosz, “Measuring Developer Productivity: Real-World Examples,” The Pragmatic Engineer
Dave Farley, “What McKinsey got wrong about developer productivity,” LeadDev — https://leaddev.com/process/what-mckinsey-got-wrong-about-developer-productivity
Charity Majors, “Questionable Advice: Can Engineering Productivity Be Measured?” — https://charity.wtf/2020/07/07/questionable-advice-can-engineering-productivity-be-measured/
Peng et al., “The Impact of AI on Developer Productivity: Evidence from GitHub Copilot,” ArXiv (2023) — https://arxiv.org/abs/2302.06590
GitHub Blog, “Research: Quantifying GitHub Copilot’s Impact on Developer Productivity and Happiness” — https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-on-developer-productivity-and-happiness/
METR, “Measuring the Impact of Early 2025 AI on Experienced Open-Source Developer Productivity” (2025) — https://metr.org/blog/2025-07-10-early-2025-ai-experienced-os-dev-study/
Uplevel Data Labs, “A Data-Driven Look at Gen AI for Coding” (2024) — https://resources.uplevelteam.com/gen-ai-for-coding
GitClear, “AI Copilot Code Quality Research 2025” — https://www.gitclear.com/ai_assistant_code_quality_2025_research
Fortune, “Cursor CEO Michael Truell warns vibe coding builds ‘shaky foundations’” (2025) — https://fortune.com/2025/12/25/cursor-ceo-michael-truell-vibe-coding-warning-generative-ai-assistant/
Peter Drucker, “Knowledge-Worker Productivity: The Biggest Challenge”
Cal Newport, Deep Work: Rules for Focused Success in a Distracted World
W. Edwards Deming, The New Economics for Industry, Government, Education (1993)
Wikipedia, “McNamara Fallacy” — https://en.wikipedia.org/wiki/McNamara_fallacy
Wikipedia, “Goodhart’s Law” — https://en.wikipedia.org/wiki/Goodhart's_law
Scrum.org, “Velocity, the False Metric of Productivity” — https://www.scrum.org/resources/blog/velocity-false-metric-productivity
LinearB, “Why Agile Velocity is the Most Dangerous Metric” — https://linearb.io/blog/why-agile-velocity-is-the-most-dangerous-metric-for-software-development-teams
Aviator, “Everything Wrong with DORA Metrics” — https://www.aviator.co/blog/everything-wrong-with-dora-metrics/
Matt Hopkins, “Goodhart’s Law for AI Agents” — https://matthopkins.com/business/goodharts-law-ai-agents/
InfoWorld, “Software development meets the McNamara Fallacy” — https://www.infoworld.com/article/4010318/software-development-meets-the-mcnamara-fallacy.html
Microsoft Research, “The SPACE of Developer Productivity” — https://www.microsoft.com/en-us/research/publication/the-space-of-developer-productivity-theres-more-to-it-than-you-think/
ArXiv, “Leveraging Creativity in Software Engineering” — https://arxiv.org/html/2502.03280v1
Choi, J., Hecht, G., and Tayler, W.B. (2012). “Lost in Translation: The Effects of Incentive Compensation on Strategy Surrogation.” The Accounting Review, 87(4), 1135-1164.
Black, P., Meservy, T., Tayler, W.B., and Williams, J.O. (2021). “Surrogation Fundamentals: Measurement and Cognition.” Journal of Management Accounting Research, 34(1), 9-28.
Harris, M. and Tayler, W.B. (2019). “Don’t Let Metrics Undermine Your Business.” Harvard Business Review, September-October 2019.
Kahneman, D. and Frederick, S. (2002). “Representativeness Revisited: Attribute Substitution in Intuitive Judgment.” In Heuristics and Biases: The Psychology of Intuitive Judgment. Cambridge University Press.