Yes, you can measure software developer productivity… but are you sure that’s what you’re measuring or want to measure?
McKinsey wrote an article “Yes, you can measure software developer productivity” (Aug 2023).
Gergely Orosz and Kent Beck responded in 2 parts: Part 1 and Part 2.
Dan North responded.
Here’s my response.
Argument: We should use quantitative measurement and not just rely on expert opinion to assess developer performance.
The long-held belief by many in tech is that it’s not possible to do it correctly — and that, in any case, only trained engineers are knowledgeable enough to assess the performance of their peers.
One of the overall arguments McKinsey makes is that we should use quantitative measurement and not just rely on expert opinion to assess developer performance.
This seems reasonable.
“Facts” over “data”
This does remind me though of something said by Taiichi Ohno:
Data is of course important in manufacturing, but I place the greatest emphasis on facts.
What Ohno is talking about is the importance of direct observation, not just looking at numbers in a spreadsheet, to be able to interpret what is actually happening.
Quantitative data might be useful to assess developer performance, but I’d also want direct observation, including from the developers themselves, to be able to interpret what is actually happening.
Argument: Different productivity metrics are required for different levels (individuals, teams, systems)
To use a sufficiently nuanced system of measuring developer productivity, it’s essential to understand the three types of metrics that need to be tracked: those at the system level, the team level, and the individual level.
McKinsey suggests that productivity should be considered at the individual, team, and system (aka organisation) levels.
I’ve made a similar argument in some posts I wrote:
- 3 Fs for individual productivity: Focus, Feedback, Friction | by Jason Yip | Medium (Dec 2020)
- 2 Fs for team productivity: Flow and Frequent Integration | by Jason Yip | Medium (Dec 2020)
- RAP for organisational productivity: Relearning, Allocation, Parallel bets | by Jason Yip | Medium (Jan 2023)
My point was more to encourage focusing on broader productivity, versus what I consider an overemphasis on individual productivity (team productivity is not just the sum of individual productivity; organisational productivity is not just the sum of team productivity). McKinsey, in contrast, seems to be trying to encourage a focus back on individual productivity.
For instance, while deployment frequency is a perfectly good metric to assess systems or teams, it depends on all team members doing their respective tasks and is, therefore, not a useful way to track individual performance.
Argument: Different productivity metrics are required for different types of focus (outcomes, optimisation, opportunities).
Another critical dimension to recognize is what the various metrics do and do not tell you. For example, measuring deployment frequency or lead time for changes can give you a clear view of certain outcomes, but not of whether an engineering organization is optimized. And while metrics such as story points completed or interruptions can help determine optimization, they require more investigation to identify improvements that might be beneficial.
Beyond levels, McKinsey suggests 3 types of metrics to measure: outcomes, optimisation, and opportunities. The idea is that even if you’re producing outcomes, you might not be doing it efficiently (optimisation). And even if you’re efficiently producing outcomes, you might need to measure other things to more easily identify opportunities for improvement.
The questions they ask are:
- Outcomes: “Are you delivering products satisfactorily?”
- Optimisation: “Are you delivering products in an optimized way?”
- Opportunities: “Are there specific opportunities to improve how you develop products, and what are they worth?”
Given these questions, they categorise the DORA and SPACE metrics and propose 5 new opportunities-focused metrics that “offer clearer paths to improvement”.
“Productivity” — I don’t think that word means what you think it means
Simply put, productivity measures the amount of value created for each hour that is worked in a society.
“What is productivity”, McKinsey & Company
Productivity is the amount of value we get divided by what we put in. Outcomes are what we define as value. Optimisation is about being more efficient at producing those outcomes, aka productivity. So it seems like they’re proposing, in a roundabout way, to measure outcomes, productivity, and opportunities to improve productivity.
It’s confusing to lump every metric under “measuring productivity”
The Lean community tends to measure a set of goals: Safety, Quality, Delivery, Cost, Morale. Productivity maps most closely to “delivery”.
The DORA, SPACE, and McKinsey opportunity metrics don’t just measure productivity (arguably, they don’t measure productivity at all).
Customer satisfaction, reliability, change failure rate, and time to restore service are quality metrics, not productivity metrics.
Developer satisfaction and retention are morale metrics, not productivity metrics. Granted, morale may be a leading indicator of future productivity (and other) problems.
I think it’s both confusing and not useful to lump everything under productivity.
Effort, Output, Outcome, Impact
Kent Beck and Gergely Orosz proposed a model for the software engineering life cycle: effort → output → outcome → impact.
Developers engage in effort to produce tangible outputs (e.g., features), which are intended to lead to changes in customer behaviour (outcomes), which in turn lead to value flowing back to the organisation (impact).
Productivity is not the amount of effort you put in. That’s closer to measuring utilization. Very low utilization might indicate a problem, such as a bottleneck somewhere else in the system. Very high utilization might also indicate a problem: overburden and an inability to respond to variation.
Productivity is not the amount of output you produce. That’s closer to measuring raw throughput. Productive software product development is not just a matter of pumping out features (aka a “feature factory”). If the outputs are valuable, low throughput might indicate an opportunity for improvement.
Productivity is not just the amount of value (outcomes and impact) you are able to produce. You can produce that value in an inefficient way.
Productivity is the amount of value you are able to produce given the investment you put in. None of the suggested metrics in the McKinsey matrix measure this.
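To make that distinction concrete, here’s a minimal sketch in Python with entirely made-up numbers (none of the figures or variable names come from McKinsey, DORA, or SPACE); the point is only that utilization, throughput, and value-per-investment are three different ratios that can tell three different stories:

```python
# Illustrative numbers only -- the same team data yields different "productivity"
# stories depending on which ratio you compute.

hours_spent_on_work = 320    # effort: hours the team actually spent on tasks
hours_available = 400        # capacity: hours the team could have spent
features_shipped = 8         # output: things delivered
value_delivered = 120_000    # outcomes/impact: e.g., estimated value created
investment = 80_000          # what producing that value cost (salaries, tools, ...)

utilization = hours_spent_on_work / hours_available  # busy-ness, not productivity
throughput = features_shipped                        # raw output, not productivity
productivity = value_delivered / investment          # value out per unit of investment in

print(f"utilization:  {utilization:.0%}")            # 80% says nothing about value
print(f"throughput:   {throughput} features")        # says nothing about whether they mattered
print(f"productivity: {productivity:.1f}x value per unit of investment")
```

A team could score well on the first two and still be unproductive in the last sense, which is why treating effort or output metrics as productivity metrics is misleading.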
Argument: Ideally, developers should spend more time on “inner loop” than “outer loop” activities.
McKinsey argues that developers are more productive if they spend more time on “inner loop” activities (e.g., code, build, test) and less time on “outer loop” activities (e.g., meetings, integration, security & compliance, deploying at scale).
This reminds me of the Lean concept of “necessary” or “type one” waste (aka “muda”):
Type one muda creates no value but is unavoidable with current technologies and production assets. An example would be inspecting welds to ensure they are safe.
So, if we’re just saying that developers should spend more time on value-adding activities and adjust technologies, process design, etc. to reduce currently necessary but non-value-adding activities, then I agree.
I disagree though that the only value-adding activities are coding, building, and testing.
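To illustrate why the classification matters more than the arithmetic, here’s a rough sketch with a hypothetical time log (the activity names and hours are invented, not taken from the article). Move a single activity between categories and the “inner loop” share changes substantially:

```python
# Hypothetical week of activity for one developer (hours are invented).
time_log = {
    "coding": 12,
    "building and testing": 6,
    "design sessions": 5,
    "managing cross-team dependencies": 4,
    "meetings": 8,
    "security & compliance": 3,
}

# The arithmetic is trivial; the contested part is which activities count as
# "inner loop" (value-adding) versus "outer loop" (necessary but non-value-adding).
inner_loop = {"coding", "building and testing"}

total_hours = sum(time_log.values())
inner_hours = sum(h for activity, h in time_log.items() if activity in inner_loop)
print(f"inner loop share: {inner_hours / total_hours:.0%}")

# Reclassify design sessions as value-adding and the picture shifts noticeably.
inner_loop.add("design sessions")
inner_hours = sum(h for activity, h in time_log.items() if activity in inner_loop)
print(f"inner loop share (design counted as inner): {inner_hours / total_hours:.0%}")
```

The ratio itself is easy to compute; the judgement call about what counts as value-adding is where I differ from McKinsey.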
Proposed metric: Developer Velocity Index benchmark
The proposed Developer Velocity Index benchmark is described in more detail in the article, “How software developers can drive business growth | McKinsey” and involves 46 different drivers across 13 dimensions.
The most interesting part is what they found correlated most with performance, which is not what I would have expected.
We found the four with the greatest impact on business performance are tools, culture, product management, and talent management.
The Developer Velocity Index reminds me of the 24 key capabilities from Accelerate. It gives you some ideas of where and how to improve.
I generally prefer starting from specific problems, but lists can be helpful to remind you of potential countermeasures.
Proposed metric: Contribution analysis
Assessing contributions by individuals to a team’s backlog (starting with data from backlog management tools such as Jira, and normalizing data using a proprietary algorithm to account for nuances) can help surface trends that inhibit the optimization of that team’s capacity.
Contribution analysis seems like a way to assess the inner loop vs outer loop activity ratio. The issue is with the example.
For example, one company found that its most talented developers were spending excessive time on noncoding activities such as design sessions or managing interdependencies across teams. In response, the company changed its operating model and clarified roles and responsibilities to enable those highest-value developers to do what they do best: code.
The highest-leverage activity for the highest-value developers might not be coding. In fact, facilitating design sessions or addressing dependencies across teams might have more impact on organisational-level capability and delivery than coding.
I don’t necessarily have a problem with contribution analysis per se, but I do have a problem with what McKinsey understands as value-adding versus non-value-adding activity AND with the overemphasis on individual coding utilization over delivering outcomes and impact.
Proposed metric: Talent capability score
Talent capability score looks at the distribution of capability across developers in order to identify opportunities for coaching, upskilling, and recruiting. According to It’s Time to Reset the IT Talent Model (mit.edu), the target is a diamond shape, with the majority of developers in the middle, rather than a bottom-heavy or top-heavy distribution.
This one is interesting. I agree that it’s useful to understand your competency distribution. The diamond model as a target is not something I’m familiar with but it seems plausible.
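As a toy illustration of what a “diamond” check might look like, here’s a sketch where the capability levels, headcounts, and the majority-in-the-middle rule are my own assumptions (neither McKinsey nor the MIT article specifies categories or thresholds):

```python
from collections import Counter

# Hypothetical capability levels and headcounts (invented for illustration).
developers = ["junior"] * 10 + ["intermediate"] * 25 + ["senior"] * 8

levels = ["junior", "intermediate", "senior"]
counts = Counter(developers)
shares = {level: counts[level] / len(developers) for level in levels}

for level in levels:
    print(f"{level:>12}: {shares[level]:.0%}")

# "Diamond" target: the middle band holds the largest share, rather than the
# distribution being bottom-heavy (mostly junior) or top-heavy (mostly senior).
largest = max(shares, key=shares.get)
shape = {"junior": "bottom-heavy", "intermediate": "diamond", "senior": "top-heavy"}[largest]
print(f"distribution shape: {shape}")
```

Whether the diamond is the right target presumably depends on the organisation; the useful part is simply seeing the distribution at all.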
Argument: Focusing on a single metric or an overly simplified collection of metrics can incentivise poor practices
Focusing on a single metric or too simple a collection of metrics can also easily incentivize poor practices; in the case of measuring commits, for instance, developers may submit smaller changes more frequently as they seek to game the system.
I agree that focusing on a single metric or an overly simplified collection of metrics can incentivise poor practices. The example McKinsey uses is not an example of this though: submitting smaller changes more frequently is explicitly the intention. We want you to game the system that way.
Argument: “Leaders and developers alike need to move past the outdated notion that leaders “cannot” understand the intricacies of software engineering, or that engineering is too complex to measure.”
I agree. See How to Measure Anything.