Stop Counting Tickets, Start Watching the Work
A practical guide for managers who want to understand whether their developers are thriving, struggling, or somewhere in between
Picture a mob programming session. A manager joins, not to check on anyone, just genuinely interested in how the team is approaching a tricky migration. But about twenty minutes in, they notice something: one of the developers is letting others drive every decision, never pushing back, never offering an alternative. He is present but passive. In a stand-up or a sprint review, nothing about him would look wrong: his tickets are getting done, his pull requests are merging. By every metric the team tracks, he is fine.
He is not fine. He is stuck. And no dashboard would ever show it.
If you have lived a version of this moment — and if you have managed engineers for any length of time, you almost certainly have — then you already know where this is going.
That is the core problem this article is about. Not that measurement is bad — measurement is essential — but that most of what we measure in software engineering tells us almost nothing about the thing we actually need to understand: whether our people are growing, struggling, or coasting. The question is not whether to assess your developers. It is whether your assessment tools are capable of seeing what matters.
The measurement problem
Most managers, when asked how they evaluate their developers, will point to some combination of activity metrics: tickets closed, story points delivered, lines of code written, pull requests merged. These feel scientific. They feel objective. And as tools for evaluating individuals, they are actively counterproductive.
There is a name for this. Goodhart’s Law states that when a measure becomes a target, it ceases to be a good measure. In software engineering, this plays out predictably: if lines of code become a performance indicator, developers produce verbose, unnecessarily complex code. If story points become the yardstick, estimates get inflated. If pull requests merged is the metric, people split work into trivially small changes to pump the numbers. When McKinsey published a paper in 2023 claiming they could measure individual developer productivity through activity metrics, Dan North called the thinking “absurdly naive.” Kent Beck warned that such measurements create perverse incentives — managers pressure for better scores, engineers game the system, and the actual work suffers.
But the answer is not to stop measuring. It is to measure the right things, at the right level, for the right purpose.
Metrics like deployment frequency, lead time for changes, change failure rate, and mean time to recovery — the DORA metrics, validated by Forsgren, Humble, and Kim across 23,000 data points over four years of research — are genuinely useful. They tell you about the health of the system: how quickly value flows from idea to production, how often things break, and how fast the team recovers. What they do not tell you is which individuals are thriving and which are struggling. That requires a different kind of assessment entirely — one that is richer, more contextual, and necessarily more human.
The distinction matters enormously. Metrics for understanding how a system is performing are productive. Metrics for evaluating individual people are destructive. Everything in this article sits on that line.
Watch people work
The richest source of signal about a developer is watching them work. Not in a surveillance sense, but through practices like pair programming and mob programming. When you are pairing with someone, or watching them pair with others, you learn things within a few sessions that no dashboard could ever reveal. You see how they think through problems, how they break down complexity, whether they write tests first or bolt them on as an afterthought, how they respond when their assumptions are wrong, and how they communicate with the person next to them. Academic studies on pair programming confirm it reveals individual skill levels, communication ability, and problem-solving approaches in ways no other assessment method matches.
There is a catch, and it is worth naming upfront. Harvard researcher Ethan Bernstein demonstrated through a rigorous field experiment that being observed can counterintuitively make people worse at their jobs — not because they are lazy and get caught, but because they hide the creative, experimental behaviour that produces the best work. I will return to this tension later, because it is real and it does not have a tidy resolution. But the alternative — assessing people from a distance through thin numerical proxies — is not just less effective. It is actively misleading.
Pay attention to how quickly code gets from someone’s machine to production. If a developer is sitting on branches for days, that tells you something. If their work integrates smoothly and frequently, that tells you something else entirely. The ability to work in small, safe increments is one of the most reliable markers of engineering maturity. The Accelerate research confirms this: elite teams deploy code multiple times per day, while low performers deploy once per month or less. Working in small batches reduces cycle times, accelerates feedback, reduces risk, and improves efficiency. And here is the finding that surprises many managers: speed and stability are not trade-offs. Teams that deploy more frequently also have lower failure rates and faster recovery times.
One caveat matters here. A developer sitting on branches for days might not be signalling personal weakness. The CI/CD pipeline might be broken. Code review might be bottlenecked. The team might lack trunk-based development practices. Before judging the individual, check the system. This is a principle that runs through the entire article: whenever you see individual behaviour that looks like a problem, ask whether the environment is causing it before concluding that the person is.
Watch how people behave when something breaks. Good developers do not panic and do not blame. They have a systematic approach to diagnosis. They narrow things down. They ask good questions. And crucially, after they fix the problem, they ask “how do we make sure this class of problem cannot happen again?” Google’s Project Aristotle — a two-year study of over 180 teams — found that psychological safety was the single strongest predictor of team performance, accounting for 43 percent of the variance. Teams where people blame and panic are teams where psychological safety is low. Teams where people diagnose calmly and think about prevention are teams where it is high.
Then there is the question of learning. Good developers are visibly curious. They read, they experiment, they ask “why” a lot. But here is the nuance: you are not looking for people who chase every shiny new framework. You are looking for people who deepen their understanding of fundamentals. Someone who properly understands testing, design, and feedback loops will pick up any new tool in a week. Someone who only knows tools but not principles will struggle every time the landscape shifts. Daniel Pink’s research on motivation calls this mastery — the desire to continually improve at something that matters. It is one of three core intrinsic motivators for knowledge work, alongside autonomy and purpose. The developers who chase depth are driven by mastery. The ones who chase breadth without depth are often driven by anxiety.
That said, this is not a binary. In a fast-moving field, some exploration of new tools is adaptive and necessary. The signal is not whether someone looks at new things, but whether they can evaluate them critically against fundamentals. Curiosity about new tools combined with deep understanding of principles is the strongest combination.
Practical actions for managers
Let us get concrete about what a manager can actually do day to day to understand their team, not in a way that punishes or controls, but in a way that genuinely helps.
Sit with your teams regularly. Not in a “let me observe you” way, but genuinely participate. Join a mob session. Pair with someone on a real task. Even if you are not coding yourself, being in the room while work happens gives you an enormous amount of information. You will see who drives the design conversation, who asks clarifying questions, who just waits to be told what to do, and who challenges assumptions constructively. Do this often enough and you will build a mental model of every person on the team that no spreadsheet could ever give you.
Pay attention to what happens after someone’s code merges. Does it cause problems downstream? Do other people frequently have to rework it? Or does it land cleanly and become something the team builds on with confidence? You do not need a fancy tool for this. Just ask the team during retrospectives: “What slowed us down this week?” If the same person’s work keeps coming up as a source of friction, that is a signal. If someone’s work is invisible because it just works, that is also a signal, and you should notice it.
Give people a small, ambiguous problem and see what they do with it. Not as a test, but as real work. Something like “we have had three customer complaints about this area of the product, can you investigate and come back with a recommendation?” A strong developer will dig in, ask smart questions, maybe prototype something, and come back with options. Someone less experienced, or someone who has not been given the space to develop this skill, will either freeze, ask you to define every detail upfront, or jump straight to coding without understanding the problem. This tells you far more than any technical interview ever could, because it is real. Will Larson’s research on staff engineering identifies exactly this — the capacity to self-direct through uncertainty rather than waiting for someone to remove it — as the defining characteristic that separates mid-level from senior engineers, and senior from staff.
Look at how people handle knowledge gaps. Give someone a task that is slightly outside their comfort zone and watch what happens. Do they go and learn what they need? Do they ask for help at the right moment, not too early and not after they have wasted three days? Do they share what they learned afterwards? The ability to self-direct learning and then bring it back to the team is one of the strongest indicators of someone who will grow into the role versus someone who has plateaued.
Have honest one-to-ones where you ask open questions and then actually listen. Not “how are your tickets going” but things like “what is the hardest technical decision you have made recently and why?” or “what part of our codebase worries you most?” or “if you could change one thing about how we work, what would it be?” The quality of the answers tells you a lot. Someone who engages thoughtfully with these questions is someone who cares about the craft. Someone who consistently has nothing to say may not be thinking deeply about the work — or they may not yet trust that it is safe to say what they think. Some people are internal processors. Some come from cultures where questioning how the team works is not done casually. The manager’s job is to figure out which it is, not to assume the worst. If someone goes quiet in one-to-ones, the first question should be “have I made this space safe enough?” not “does this person care?”
And get feedback from peers. Not formal 360 reviews with scores and ratings used for appraisal — research shows those tend to produce manipulated data, with gaming documented at companies including GE, IBM, and Amazon. But do not dismiss structured peer feedback entirely. When 360-degree feedback is used for development rather than evaluation, and when responses are anonymous, research shows it genuinely increases communication, productivity, and team effectiveness. The best approach combines informal peer conversations — “What is it like working with Sarah?” — with structured but anonymous developmental feedback that lets people say things they would not say to someone’s face. Some of the most important signals, especially upward feedback about a manager’s own behaviour, only surface when anonymity is guaranteed.
Being honest about scale
Everything I have described so far sounds reasonable for a manager with five or six direct reports. But what if you have twelve? What if you manage two teams across different time zones? The honest answer is that you cannot pair with every engineer every week, attend every mob session, and have deep one-to-ones with everyone on a fortnightly cycle. If this approach only works at small scale, it is not a complete approach.
There are three things that make it practical at scale.
First, you do not need to be the only observer. Tech leads, senior engineers, and other experienced team members are already watching the work. They see how people pair, how they handle ambiguity, how they respond to feedback. Your job is not to personally observe everything but to build a network of people whose judgement you trust and to triangulate their perspectives with your own. Ask your tech lead: “How is Maria doing on the migration? What have you noticed?” This is not delegation of responsibility. It is distributed observation, and it produces a richer, less biased picture than a single manager’s viewpoint ever could.
Second, you can rotate your depth of focus. Rather than shallowly observing everyone all the time, spend a few weeks deeply engaged with one part of the team — joining their sessions, reading their pull requests, having longer conversations — then rotate. Over a quarter, you will have built a detailed picture of everyone. This is not ideal, but it is honest, and it is better than the alternative of seeing no one deeply.
Third, mob programming is more efficient for observation than pairing, precisely because you see the whole team’s dynamics at once. A single mob session reveals who drives, who supports, who withdraws, who challenges, and who defers. You do not need to attend many sessions to learn a lot. If you are time-constrained, mobs give you the highest signal per hour invested.
The honest tension: observation is still observation
There is a tension here that deserves honesty rather than a neat resolution. If you are joining a session partly to learn about the work and partly to assess the people doing it, then dressing it up as pure curiosity is a form of dishonesty. People are not stupid. They can feel when they are being evaluated, regardless of what words you use. And if they later discover that your “just joining in” sessions fed into a performance conversation, the trust damage is significant.
This tension is not just intuition. It has academic backing. Bernstein’s “Transparency Paradox” research, which I mentioned earlier, demonstrated this through a field experiment at a large manufacturing plant. Workers who knew they were being watched concealed innovative approaches, suppressed experimentation, and avoided productive deviation from standard practice. Bernstein found that even a modest increase in group-level privacy sustainably and significantly improved performance. The implication is uncomfortable: the very act of a manager watching can make people worse at their jobs, because they hide the creative, experimental behaviour that produces the best work.
So the better approach is transparency. Not “I am here to check on you,” but something closer to the truth: “Part of my job is understanding how the team works, what is going well, and where people might need support. I cannot do that if I am never near the work. So I am going to join sessions from time to time. Not to catch anyone out, but because I need to see reality if I am going to be any use to you.” That is honest. It acknowledges the evaluative dimension without making it adversarial.
But here is the deeper point, and Bernstein’s research supports it. If the only time a manager is near the work is when they are assessing people, then of course it feels like surveillance, no matter how it is framed. The real fix is not about how you explain your presence. It is about whether your presence is normal or exceptional. If a manager is routinely involved in technical discussions, regularly joins sessions, frequently asks about design decisions, and has ongoing conversations about the codebase, then there is no single moment that feels like “the inspection.” It is just how things work. The observation happens as a side effect of genuine involvement, not as a discrete activity with a hidden agenda.
The managers who struggle most with this are the ones who are distant from the work ninety percent of the time and then suddenly show up. In that context, no amount of friendly framing will stop people from feeling watched. The discomfort is not caused by the words. It is caused by the pattern.
There is also something worth saying about the broader culture. In teams where feedback is frequent and open, where people regularly talk about what is going well and what is not, evaluation loses its menacing quality. People feel micromanaged when assessment is something that happens to them in secret and then gets revealed in a formal review. If instead the manager is consistently saying “I noticed this went well” and “I think this area needs work” as part of normal conversation, then the evaluative aspect of being present becomes something people are used to rather than something they dread.
None of this fully resolves the tension. A manager who is present will, by definition, be forming judgements. People who are being observed will, by definition, behave somewhat differently. You cannot eliminate that dynamic entirely. But you can make it honest, make it routine, and make it part of a relationship where people trust that the judgements being formed are fair and in their interest. That is the best you can do, and it is a lot better than the alternative, which is forming judgements from a distance based on terrible data and then surprising people with them twice a year.
The biases you carry into the room
If observation is your primary assessment tool — and I am arguing it should be — then you need to be honest about the fact that you are not an objective instrument. Decades of research on performance evaluation show that managers are systematically biased in ways they are rarely aware of.
Recency bias means you overweight what happened in the last few weeks and forget the previous months. SHRM research shows that recency and central tendency errors affect nearly 40 percent of annual appraisals, and they do not disappear just because you are observing in person rather than reading a spreadsheet. That developer who had a rough week when you happened to join the mob session? You will remember that disproportionately.
Similarity bias means you unconsciously favour people who remind you of yourself — same background, same communication style, same approach to problem-solving. The developer who thinks the way you do will seem “stronger” than the one who thinks differently but equally well.
The halo and horns effects mean that one strong or weak impression colours everything else. If someone impressed you with an excellent design decision, you will be more generous when evaluating their testing practices, even if those are mediocre. If someone fumbled an incident response, you may underrate their day-to-day coding, even if it is excellent.
And then there is proximity bias, which matters more than ever. Managers form more favourable impressions of people they see frequently. In hybrid or remote teams, this creates systematic unfairness: research shows employees may receive better review outcomes simply because they work in the same office as their manager. The person you pair with every week will feel more “known” — and therefore rated more positively — than the remote team member you interact with only in stand-ups.
One of the largest studies on feedback found that more than half of the variance in performance ratings had more to do with the quirks of the person giving the rating than the person being rated. That means the biggest variable in your evaluation is you, not them.
What do you do about this? You cannot eliminate bias, but you can discipline your observation. Keep running notes. Not a surveillance dossier, but a simple habit: after a pairing session or a notable interaction, write down what you actually saw, not what you felt about it. Over time, patterns emerge from evidence rather than from impressions. Seek disconfirming evidence actively — if you think someone is weak, look specifically for moments where they are strong, and vice versa. Calibrate with peers: ask other managers or tech leads who work with the same people whether they see what you see. And be especially deliberate about equalising your attention across the team. If you are pairing more often with some people than others, you are building a biased dataset whether you mean to or not.
When the team is not in the room
Everything I have described so far assumes you can physically or synchronously join a session with your team. But many teams are distributed. Some are fully remote. Some span time zones. If observation-based assessment only works when you can sit next to someone, then it is not a complete approach.
The principles remain the same, but the channels change. In distributed teams, the work leaves more written traces, and those traces become your primary observation material.
Code review is the most revealing. Not reviewing code yourself as a gatekeeper, but reading how people engage with each other’s code. The difference between a thoughtful reviewer and a careless one is immediately visible. Compare “this is wrong, fix it” with “I think this approach might cause issues under concurrency — have you considered using a lock here? Happy to pair on it if useful.” The first tells you someone is going through the motions. The second tells you someone understands the system, communicates with care, and is willing to invest time in a colleague’s growth. How people respond to critique is equally telling: do they engage with the substance, or do they get defensive? Do they ask follow-up questions, or do they silently apply the change without understanding why? The quality of someone’s code review comments tells you as much about their engineering judgement as watching them code would.
Look at how people communicate in writing more broadly. In asynchronous teams, the ability to write a clear problem statement, a well-structured RFC, or a concise incident summary is itself a form of engineering skill. People who can articulate their thinking in writing are usually people who think clearly, and that matters regardless of whether you are ever in the same room.
Pay attention to how distributed team members handle the particular challenges of remote work. Do they proactively communicate blockers or sit silently for days? Do they make themselves available for synchronous collaboration when it matters, or are they always unavailable? Do they contribute to team discussions, or do they disappear between assigned tasks?
None of this is as rich as pairing with someone in real time. But it is far better than falling back on ticket counts and activity dashboards, which is what most managers of remote teams end up doing by default.
Promotions, recognition, and the evidence problem
This is where it gets uncomfortable, because most organisations pretend they have a system for promotions when really they have a ritual. Someone fills in a form, a manager writes a justification, a calibration meeting happens where people who have never seen the work argue about ratings, and then a decision gets made that is mostly political. Everyone involved knows it is theatre, but nobody says it out loud.
The data you need is not data in the traditional sense. It is evidence. And the best evidence comes from the practices already described, but you need to be deliberate about collecting it. Not in a creepy dossier way, but as a habit of noticing and writing things down. When someone handles an incident well, make a note. When someone’s pairing session lifts the whole team’s understanding, write it down. When someone repeatedly delivers work that needs reworking, note that too. Over time you build a picture that is grounded in real events, not vibes and not metrics.
For promotions specifically, the question should not be “has this person earned a reward?” It should be “is this person already operating at the next level?” That distinction matters. If someone is already doing the work of a senior engineer, meaning they are influencing design decisions, mentoring others, taking ownership of ambiguous problems, thinking about the system rather than just their task, then promoting them is just recognising reality. If you are promoting someone because they have been around long enough or because they will leave otherwise, you are creating problems.
The evidence for this is observable. You can point to specific moments. “In the last six months, you led the redesign of the payment service. You brought three junior developers along with you through pairing. You identified the performance issue before it hit production. You pushed back on the product team when the requirements did not make sense and proposed a better alternative.” That is a promotion case built on things that actually happened, not on a self-assessment form where someone writes “I demonstrated leadership” with no context.
But there are two fairness problems here that deserve honesty.
The first is access. The “already operating at the next level” criterion only works if everyone has equal access to the opportunities that let them demonstrate next-level work. In practice, they often do not. People from underrepresented groups, quieter team members, people in less visible parts of the codebase, and remote workers may all have fewer chances to lead a high-profile redesign or push back on product in a visible way. Research on equitable promotion policies confirms that marginalised groups are promoted at a slower rate, partly because the opportunities to demonstrate next-level capability are unequally distributed.
This means managers have an active responsibility, not just to observe who steps up, but to deliberately distribute stretch opportunities. If you only promote people who naturally volunteer for visible work, you are filtering for confidence and political skill, not engineering ability. Make sure the quiet person on the team gets the chance to lead something. Make sure the remote team member gets the same ambiguous problems as the person sitting next to you.
The second is compensation. “Already operating at the next level” means, in practice, doing a harder job for months while being paid for the easier one. Most companies expect six to twelve months of sustained performance at the next level before promoting. That is six to twelve months of uncompensated labour at a higher level of responsibility. This might sound reasonable in the abstract, but it disproportionately affects people who cannot afford to “invest” unpaid effort — and it is the single most common criticism of this promotion model among the engineers who live inside it.
I do not have a clean solution for this. But I think the manager’s obligation is clear: make the gap between “doing the work” and “getting the title” as short as organisationally possible. If someone has been operating at the next level for two full review cycles and you still have not promoted them, that is a management failure, not a demonstration of rigour. And be transparent about the timeline. If someone is doing senior-level work, tell them: “I see it, I am building the case, and here is when I expect it to happen.” Silence on this topic is how you lose your best people.
Feedback that actually changes behaviour
For feedback conversations, the same principle applies. Be specific and be timely. “Your code in the checkout service last week had no tests and broke the build twice” is useful feedback. “You need to improve your quality” is not. The first gives someone something to act on. The second just makes them feel bad.
Here is something most managers get wrong: feedback should not be saved up for a quarterly review. If you see something that needs addressing, address it within days, not months. If you see something excellent, say so immediately. The idea that feedback is a formal event is one of the most damaging conventions in management. It means people spend months not knowing where they stand, which breeds anxiety and kills trust.
The research on this is overwhelming. Gallup found that employees who receive feedback weekly are 2.7 times more likely to be engaged at work. More strikingly, employees are 3.6 times more likely to be motivated to do outstanding work when their manager provides daily feedback compared to annual feedback. Companies with strong feedback cultures see 14.9 percent lower turnover. When Adobe shifted from annual reviews to continuous check-ins, they saw a 30 percent drop in voluntary turnover within a single year.
If you are giving feedback regularly, the formal review becomes a summary of things both of you already know, which is exactly what it should be. Nothing in a performance review should ever be a surprise. If it is, you have failed as a manager, not because the assessment is wrong, but because you waited too long to share it.
On recognition, be careful about making it purely individual. Software is a team activity. If you only recognise individual heroes, you incentivise hero behaviour, and that is corrosive. Hero culture leads to knowledge silos, bus factor risks, and burnout among the very people being celebrated. Perhaps most damningly, research suggests that hero culture is itself a sign of low psychological safety — when only heroes are recognised, others stop taking initiative because the implicit message is that only extraordinary individual acts matter.
Google’s Project Aristotle found that the number one predictor of team performance was not individual talent but psychological safety. Teams with high psychological safety showed 19 percent higher productivity, 31 percent more innovation, and 27 percent lower turnover. Individual heroics were not on the list.
Recognise teams. Recognise collaborative moments. “The way you and Tom worked through that production issue together was excellent” reinforces the behaviour you actually want. Individual recognition has its place, but it should be tied to team-enabling behaviours — mentoring, unblocking others, sharing knowledge — not solo heroics.
As for what to base the conversation on practically, keep it simple. For each person, maintain a running list of observations under three headings: things they are doing well, things they need to work on, and situations where you were not sure what to make of their contribution. Review your notes before any one-to-one. Share them openly. Ask the person if they see it the same way. The best feedback conversations are ones where the person mostly agrees with your assessment because nothing in it is a surprise.
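The three-heading habit above needs no tooling at all — a text file works. But to make the structure concrete, here is a minimal sketch of it in code. All the names (`Observation`, `Heading`, `prep_for_one_to_one`) are illustrative, not a real tool:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

# The three headings from the running list.
class Heading(Enum):
    DOING_WELL = "doing well"
    NEEDS_WORK = "needs work"
    UNSURE = "unsure what to make of it"

# One concrete observation: what you actually saw, not what you felt about it.
@dataclass
class Observation:
    when: date
    heading: Heading
    note: str  # a specific event, e.g. "led the incident call calmly"

def prep_for_one_to_one(notes: list[Observation]) -> dict[Heading, list[str]]:
    """Group observations by heading, oldest first, to review before a one-to-one."""
    grouped: dict[Heading, list[str]] = {h: [] for h in Heading}
    for obs in sorted(notes, key=lambda o: o.when):
        grouped[obs.heading].append(f"{obs.when.isoformat()}: {obs.note}")
    return grouped
```

The point of the structure is the discipline it enforces: every entry is dated, tied to one heading, and describes an event rather than an impression, which is exactly what makes the eventual conversation evidence-based rather than vibe-based.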
The environment is the assessment
The “did I hire the right people” question is worth reframing entirely. The better question is “have I created an environment where good people can do good work and where struggling people can improve?” If you have strong collaborative practices in place, meaning pairing, mobbing, frequent code review as a learning exercise rather than a gate, continuous integration, and short feedback loops, then you will find that most people rise to the level of the environment. The ones who genuinely cannot are not usually a mystery. They become visible quite quickly when the work is transparent.
This is not a new idea. W. Edwards Deming, whose work on quality management became foundational to both Agile and DevOps, argued decades ago that the system, as designed by leaders, is almost always to blame for problems — not the individual working within the system. Deming saw the manager’s role as improving the system in which people work, not judging the people and leaving the system untouched. His insight that “mistakes typically come from bad systems, not bad workers” has been directly supported by the Accelerate research, which identified 24 organisational capabilities — not individual traits — as the drivers of software delivery performance.
Pink’s motivation framework reinforces the point from the individual’s perspective. Metrics-based evaluation systems undermine autonomy by telling people what to optimise for, distort mastery by rewarding the wrong skills, and corrode purpose by making people focus on numbers rather than outcomes. When you build an environment that gives developers genuine autonomy over how they work, supports their drive for mastery through pairing and challenging problems, and connects their work to a purpose they care about, you create the conditions where good people do their best work — and where struggling people either rise to the challenge or become honestly visible.
The real danger is not hiring the wrong person. It is creating an environment where you cannot tell the difference between a good developer in a bad system and a bad developer in a good system. Make the system transparent and the assessment mostly takes care of itself.
That said, environment-first thinking does not mean never addressing individual performance. Some people genuinely underperform even in excellent environments. The point is the order of operations: fix the system first, then assess the individual. If someone is still struggling after you have given them good practices, clear expectations, psychological safety, timely feedback, and opportunities to grow, then you have a genuine individual performance problem — and you have the evidence to address it fairly, because you have been watching the work all along.
Here is what that conversation sounds like when you have done the work: “Over the last three months, I have noticed a pattern. When you paired with Laura on the billing service, she drove all the design decisions and you did not push back on any of them. The migration task I gave you came back without tests, and the team spent two days fixing the issues it caused downstream. In our last three one-to-ones, I asked what you would change about how we work and you said you were not sure. I have given you feedback on each of these as they happened, and I have not seen a change. I want to help you get to where you need to be, but I need to see a shift in the next few weeks, and here is specifically what that looks like.” That is a hard conversation, but it is not an unfair one. Nothing in it is a surprise. Nothing in it is a vibe. Every claim points to something that actually happened, and the person has already heard about each of those moments because you addressed them at the time. That is the payoff of proximity: when the difficult conversation finally has to happen, it is grounded in evidence both of you recognise.
The underlying principle running through all of this is proximity. You cannot evaluate people from a distance. Dashboards, velocity charts, and ticket counts are all ways of trying to understand people without actually being close to the work, and they consistently fail. The managers who truly understand their teams are the ones who stay close to the work itself. Not micromanaging, not controlling, just present and paying attention — while remaining honest about their own biases, deliberate about distributing their attention fairly, and disciplined about grounding their judgements in evidence rather than impressions.
References
Books
Forsgren, N., Humble, J., Kim, G. (2018). Accelerate: The Science of Lean Software and DevOps. IT Revolution Press.
Pink, D. (2009). Drive: The Surprising Truth About What Motivates Us. Riverhead Books.
Larson, W. (2021). Staff Engineer: Leadership Beyond the Management Track. Stripe Press.
Deming, W.E. (1986). Out of the Crisis. MIT Press.
Academic and peer-reviewed research
Bernstein, E. (2012). “The Transparency Paradox: A Role for Privacy in Organizational Learning and Operational Control.” Administrative Science Quarterly, 57(2), 181–216.
Forsgren, N., Storey, M-A., Maddila, C., Zimmermann, T., Houck, B., Butler, J. (2021). “The SPACE of Developer Productivity: There’s More to It Than You Think.” ACM Queue, 19(1).
Google re:Work. “Guide: Understand Team Effectiveness” (Project Aristotle).
Mayo, E., et al. (1924–1932). The Hawthorne studies at Western Electric.
Industry sources
Orosz, G. (2023). “Measuring Developer Productivity? A Response to McKinsey.” The Pragmatic Engineer.
Orosz, G. “Common Performance Review Biases.” The Pragmatic Engineer.
Orosz, G. “Software Developer Promotions: Advice to Get to That Next Level.” The Pragmatic Engineer.
Swarmia (2025). “Engineering Metrics Leaders Should Track.”
DORA. “DORA’s Software Delivery Performance Metrics.” dora.dev.
Organisational research
Gallup (2020). Employee engagement and feedback frequency research.
SHRM. Research on recency and central tendency errors in annual appraisals.
ISACA (2024). “Examining the Risks of IT Hero Culture.”
Adobe Systems. Continuous check-in programme results.
Behavioural science
Goodhart, C. “Goodhart’s Law.” Popularised as: “When a measure becomes a target, it ceases to be a good measure.”
Harvard Business Review (2018). “3 Biases That Hijack Performance Reviews, and How to Address Them.”