I’ve been thinking a lot lately about measuring engineering efficiency. Specifically, about how measuring knowledge work output is pretty much impossible.

The work is just too variable. A senior engineer might spend three weeks on an RFC that reshapes your entire platform architecture and produce zero mergeable code in that time. Meanwhile, a junior dev knocks out eight bug fixes in a week. Who was more productive? It depends on what you mean by “productive,” and that’s the whole problem.

Peter Drucker is someone I come back to a lot when I think about this. You’ve probably heard the line most often attributed to him: “if you can’t measure it, you can’t improve it.” He wrote extensively about how the methods we used to optimize factory work (break it into motions, eliminate waste, measure output) simply don’t transfer to knowledge work. The task itself is ambiguous, the most important contributions are often invisible, and quality matters more than quantity. It’s a maddeningly hard problem. But as engineering leaders, we don’t get to throw our hands up and say “it’s unmeasurable.” We still need to assess whether our teams are healthy, whether individuals are growing, and whether the organization is actually shipping. We have to measure something.

And at the limit, it’s obvious that measurement works. Consider a simple thought experiment. You have two software engineers on the same team, working in the same codebase, for a full year. At the end of that year, one has had 100 PRs merged and released to production. The other has had one. You don’t need a sophisticated metrics framework to know that something is off.

PR throughput is a simple concept: the count of pull requests merged over a given time period. Per engineer, per team, per org, however you want to slice it. It’s almost stupidly basic on the surface. A PR gets opened, reviewed, and merged into the main branch. That’s one unit of throughput. You track it weekly, monthly, quarterly, and you watch the trend. No weighting for complexity, no adjustment for lines of code, no attempt to capture “impact.” Just: how many completed units of work moved through the system?
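To make that concrete, here’s a minimal sketch of the computation, assuming you’ve already exported merged-PR records (author plus merge timestamp) from your Git host. The field names and data are illustrative, not any particular API:

```python
from collections import Counter
from datetime import datetime

# Illustrative records only: in practice these would come from your Git
# host's API or a data export.
merged_prs = [
    {"author": "alice", "merged_at": "2024-03-04T15:21:00Z"},
    {"author": "bob",   "merged_at": "2024-03-05T09:02:00Z"},
    {"author": "alice", "merged_at": "2024-03-12T11:45:00Z"},
]

def iso_week(ts: str) -> str:
    """Bucket a merge timestamp into its ISO year-week, e.g. '2024-W10'."""
    dt = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    year, week, _ = dt.isocalendar()
    return f"{year}-W{week:02d}"

# Team-level throughput: just a count of merged PRs per week.
weekly = Counter(iso_week(pr["merged_at"]) for pr in merged_prs)

# The same count, sliced per engineer per week.
per_engineer = Counter(
    (pr["author"], iso_week(pr["merged_at"])) for pr in merged_prs
)

for week, count in sorted(weekly.items()):
    print(week, count)
```

That’s the whole metric: a group-by and a count, with the trend over time doing the interesting work.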

What makes it more useful than most alternatives is that a merged PR represents the completion of the entire software development lifecycle. Someone wrote code, got it reviewed by a peer, passed it through CI, and shipped it to a branch that’s heading to production. It’s not a half-finished ticket sitting in “In Progress.” It’s not a pile of work lobbed over to QA that your users will never see. It’s done. Your customers don’t get any value from code sitting on a developer’s laptop, from open PRs waiting on review, or from features stuck in a QA backlog. They get value from code that’s merged and released. That’s what makes the merged PR a meaningful unit: it’s the closest proxy we have to “value actually delivered.”

PR throughput is far from perfect, and I’ll get into the ways it breaks down. But I’ve found a lot of value in it. Not as a leaderboard or a performance score, but as a diagnostic. When I look at throughput data across an organization, patterns emerge. Teams with consistently low output almost always have process problems lurking underneath: slow CI pipelines, unclear ownership, review bottlenecks, too much work-in-progress. At the individual level, I don’t care about someone who’s at the P60 vs the P63. That’s noise. But there’s almost always an interesting story below the P25 and above the P75. The people in the bottom quartile are usually struggling with something: they’re blocked and not speaking up, or stuck on an ill-defined project with no clear path to shipping, or drowning in context-switching. And the people above the P75 are often doing something worth understanding too, whether it’s a workflow trick, a particularly well-scoped problem space, or just a team that’s gotten review culture right. The metric doesn’t tell you why, but it tells you where to look.
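For what it’s worth, the quartile cut is about as simple as it sounds. A rough sketch, assuming you have per-engineer merged-PR counts for some window (the names and numbers here are made up):

```python
from statistics import quantiles

# Illustrative per-engineer merged-PR counts for one quarter.
pr_counts = {"alice": 38, "bob": 7, "carol": 21, "dave": 55, "erin": 19, "frank": 24}

# quantiles(..., n=4) returns the three quartile boundaries.
p25, _p50, p75 = quantiles(pr_counts.values(), n=4)

below = sorted(name for name, count in pr_counts.items() if count < p25)
above = sorted(name for name, count in pr_counts.items() if count > p75)

# These are conversation starters, not scores: the bottom quartile is often
# blocked or stuck on ill-defined work; the top quartile may have a workflow
# or review culture worth copying.
print("worth a check-in:", below)
print("worth learning from:", above)
```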

Of course, the moment you start paying attention to any metric, people will optimize for it. There’s actually a name for this: Goodhart’s Law, after the British economist Charles Goodhart, who observed it in monetary policy back in the 1970s. The idea is simple: when a measure becomes a target, it ceases to be a good measure. Once people know what’s being counted, they change their behavior to move the number, often in ways that have nothing to do with the outcome you actually cared about. And PR throughput is especially vulnerable to this. Engineers can split one cohesive change into five tiny PRs to inflate their count. They can avoid complex, risky work in favor of low-stakes fixes that ship fast. They can rubber-stamp reviews to keep the queue moving.

In my observation, though, even when people optimize for this metric, the side effects aren’t all bad. If engineers are breaking work into smaller PRs and shipping more frequently, that’s actually a behavior I want. It means value is getting to customers faster, changes are easier to review, and rollbacks are cheaper when something goes wrong. There are worse things to optimize for.