“When a measure becomes a target, it ceases to be a good measure” — Charles Goodhart1
“You show me anything that depicts institutional progress in America, school test scores, crime stats, arrest reports, arrest stats, anything that a politician can run on, anything that somebody can get a promotion on. And as soon as you invent that statistical category, 50 people in that institution will be at work trying to figure out a way to make it look as if progress is actually occurring when actually no progress is…” — David Simon, creator of The Wire.
Merry Christmas, dear reader, and thank you very much for being a subscriber to Silicon Continent. Since our Wednesday publication date falls on the 25th of December, I am not going to hit you with some econ-tech thing. Instead, I will write about what we learn about organizations by watching The Wire. I hope the article will be fun, even if you have not seen The Wire (but what are you waiting for? It is the best TV series ever, and a great economics-of-organizations textbook!).
I recently rewatched The Wire — for the third time. In the show, we see the people behind two organizational charts — one for the Baltimore Police Department, another for a street-level drug operation. Both are complex bureaucracies fighting for survival. But only the second gets anything done.
The difference comes down to how they measure success. In the police department, the objective is to reduce the crime stats and look good in CompStat meetings with the commanders. The result? Crime doesn't actually decrease — it just disappears from official records. Assault becomes misdemeanor battery. Felony robbery transforms into petty theft. The numbers look great. The streets don't get any safer.
In one of the most haunting sequences, bodies disappear into the walls of abandoned row houses. When they are discovered — decomposing corpses sealed behind plaster and lath — the police bosses prefer not to officially find them: they are optimizing for keeping the crime stats low rather than for solving the murders.
Drug lord Stringer Bell's operation provides a stark contrast. While city institutions twist themselves into knots, Stringer Bell runs his drug operation like a business. He maximises profits, and the market rewards practices that deliver ‘value’ to customers. As a result, the organization does what it is supposed to do.
Bengt Holmström and Paul Milgrom's work on contract theory contains the first analysis in economics of this problem under the heading of “multitask incentives”. Human nature (yes, the hated but quite resilient homo economicus) is to respond to incentives. Give people a goal with meaningful consequences, and they'll find the most efficient path to achieve it, by whatever means they find suitable. But when several tasks compete for an agent's attention, rewards based on measured performance can be counterproductive. The problem is not the accuracy of the measurement, but alignment between what we can measure and what we actually want to achieve.
Think of these as two vectors. The first vector is what we can measure: test scores, crime statistics, quarterly profits. The second is what we actually want: student learning, public safety, sustainable business growth. Sometimes these vectors are closely aligned – like in Stringer Bell's operation where the profit of selling drugs directly measures success. But often they're orthogonal or even opposed – like in the Baltimore police department where lower crime statistics might indicate worse policing.2
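A toy simulation can make the vector picture concrete. In this sketch, every number is invented for illustration and `best_response` is a deliberately crude agent model: the agent pours all effort wherever the measured metric pays most, so measured performance and true value come apart whenever the two vectors are misaligned:

```python
import numpy as np

def best_response(reward_weights, budget=1.0):
    """Agent puts all effort on the task the *measured* metric rewards most.

    reward_weights: how much each task counts in the measured metric.
    Returns the effort allocation (a corner solution: game the metric).
    """
    w = np.asarray(reward_weights, dtype=float)
    effort = np.zeros_like(w)
    effort[np.argmax(w)] = budget
    return effort

# Task 1 = "real police work", task 2 = "reclassifying crimes".
true_value = np.array([1.0, 0.0])   # what society wants
measured   = np.array([0.3, 1.0])   # what the stats reward (hypothetical)

effort = best_response(measured)
print("effort:", effort)                      # all effort on stat-juking
print("measured score:", measured @ effort)   # looks great: 1.0
print("true value:", true_value @ effort)     # nothing achieved: 0.0
```

With the two vectors aligned (say, `measured = true_value`), the same best response would maximize true value as well — that is Stringer Bell's operation; the corner solution above is the Baltimore PD.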
The problem with the public sector, and the reason bureaucracy feels so bureaucratic, is that the outputs of public organisations are necessarily hard to measure. In the drug trade (or the oil trade, for that matter), the success metric is simple: profits earned. But in public service, the true goals – "effective policing," "public safety" – are inherently complex and multidimensional. (The same challenges affect NGOs.) This complexity makes gaming the system not just possible but inevitable once simplified metrics are imposed.
Reducing crime, improving education, delivering public health — none of these lends itself to simple metrics. The police cannot accurately track how safe and confident citizens feel — or they can, but only until, as per Goodhart's law, the measure is used as a target and linked to incentives, at which point the correlation disappears. Once you choose a metric, employees end up optimizing for that proxy rather than the real thing, leading to misalignment. The greater the complexity, the more the system turns into a dance of appearances. In The Wire, City Hall tries to show progress by shuffling homeless people around, officers try to look good for CompStat, and teachers teach to the test.
Conversely, an honest worker who notices this misalignment and tries to do the ‘right’ thing finds little reward and much frustration. The incentives urge them to follow everyone else and game the metrics, and they encourage others to punish defectors who challenge the misalignment. Bunny Colvin, the police commander who understands that the war on drugs is a shell game, challenges it and gets fired for it. Tommy Carcetti, who becomes mayor as an idealist trying to change the system, ends up playing the numbers like the best of the corrupt machine politicians.
In Season 4, the school system provides another good example. Teachers don't actually teach better when test scores become the only metric – they become experts at test preparation. The entire curriculum gets warped. Struggling students get hidden. Advanced material gets abandoned in favor of test preparation rituals that make the school look good. If, like Prez, the wonderful cop-turned-teacher, you are trying to do the right thing for the kids, you are putting your entire school’s existence at risk. Don’t expect anyone to recognize you—you will be lucky if you are not the one fired.
“ROLAND "PREZ" PRYZBYLEWSKI: I don't get it, all this so we score higher on the state tests? If we're teaching the kids the test questions, what is it assessing in them?
TEACHER: Nothing, it assesses us. The test scores go up, they can say the schools are improving. The scores stay down, they can't.
PREZ: Juking the stats.
TEACHER: Excuse me?
PREZ: Making robberies into larcenies, making rapes disappear. You juke the stats, and majors become colonels. I've been here before.”
(Source: Bill Moyers Journal)
The more complex the desired outcome, the more careful we must be about how we measure and incentivize performance. Simple metrics work well for simple goals, but they can catastrophically fail when applied to complex social objectives. The Wire's genius lies in showing us this truth across multiple institutions, each struggling with the gap between what they can measure and what they actually want to achieve.
The Wire is more than just great television. It's a masterclass in organizational economics – a brutal, beautiful demonstration of how incentives shape human behavior, and how the most well-intentioned systems can produce the most perverse results.
These problems — the gap between what we desire and what we can measure — are pervasive. Consider two current, and topical, applications:
1. Greenwashing. The world of carbon credits is rife with companies polishing their environmental metrics through creative accounting. Take carbon offsets through forest-preservation credits. Companies can claim carbon neutrality by paying to protect forests that were never at risk of being cut down: the more trees you save compared to a reference forest, the more carbon credits you earn.
A recent case in Zimbabwe illustrates the problem perfectly: South Pole, the world's largest carbon-offsetting firm, sold credits for protecting forest land near Lake Kariba. When they discovered that both their protected forest and the reference forest were largely intact – good news for the planet but bad for the carbon credit business – they kept selling credits anyway. One executive, when asked if the credits reflected reality, answered: "What is reality?" The measurement becomes the goal, rather than actual carbon reduction.
As Matt Levine from Bloomberg has pointed out, there is a perverse incentive: you could make even more money by secretly arranging to have the reference forest destroyed, making your protected forest look more successful by comparison. Just as Baltimore police commanders preferred not finding bodies to keep murder statistics low, carbon offset traders might prefer seeing reference forests burn to prove their protection efforts "worked."
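The baseline logic behind this perverse incentive is easy to sketch. Here is a stylized avoided-deforestation calculation — the formula and the per-hectare carbon figure are invented for illustration and do not follow any real offset methodology:

```python
def offset_credits(ref_loss_ha, project_loss_ha, tco2_per_ha=500):
    """Stylized avoided-deforestation credits (illustrative only).

    Credits = hectares 'saved' relative to the reference forest,
    times an assumed tonnes-of-CO2-per-hectare factor.
    """
    avoided_ha = max(ref_loss_ha - project_loss_ha, 0)
    return avoided_ha * tco2_per_ha

# Both forests intact: nothing avoided, no credits to sell.
print(offset_credits(ref_loss_ha=0, project_loss_ha=0))     # 0
# Reference forest burns: the 'protected' project suddenly mints credits.
print(offset_credits(ref_loss_ha=100, project_loss_ha=0))   # 50000
```

Because credits grow with losses in the reference forest, the seller is paid more the worse the baseline looks — exactly the incentive Levine describes.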
Similarly, ESG ratings are meant to channel capital toward clean investing, but maximising ESG scores can go hand in hand with financing dirty firms. For instance, an energy company splits its operations between a "green" subsidiary that gets favorable financing through green bonds, and a "brown" entity holding the dirtier assets that uses traditional financing. Saudi Aramco creates two separate pipeline subsidiaries and sells 49% of each to investment vehicles in Luxembourg. These vehicles use bank loans to pay Aramco. The Luxembourg vehicles — EIG Pearl and Greensaif Pipelines — aren't flagged by ESG rating systems the same way Aramco is. Though they are essentially funding Aramco's fossil fuel infrastructure, ESG funds can buy their bonds without triggering exclusion criteria. The credit rating agencies see through this — giving these bonds the same rating as Aramco — but ESG frameworks treat them as separate entities. Like The Wire's stat-juking, it is about making the numbers look good rather than achieving real change.
2. AI benchmarks. The big news for humanity this pre-Christmas week was the astonishing progress of OpenAI’s o3 model. The company announced breakthrough performance on François Chollet's Abstraction and Reasoning Corpus (ARC), a test designed to measure genuine intelligence. Tech media buzzed with excitement. Some even joked about AGI arriving. Then a huge dispute erupted. Some (notably Gary Marcus) argued OpenAI was, to put it in language Prez would understand, teaching o3 to the test.
When benchmark performance is the measure of AI progress, the risk is that rather than developing genuine abstract reasoning, AI models engage in an elaborate form of metric optimization. Just as "effective policing" resists simple measurement, "artificial intelligence" defies easy benchmarking. ARC’s creator, François Chollet, is now developing a harder version, acknowledging that the current one has been "saturated" through such techniques.
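The "teaching to the test" worry can be caricatured in a few lines. A hypothetical "model" that has simply memorized the benchmark's answer key — a lookup table, emphatically not how o3 actually works — scores perfectly on the benchmark while learning nothing transferable:

```python
# A hypothetical "model" that has memorized the benchmark's answer key.
benchmark_answers = {"2+2": "4", "capital of France": "Paris"}

def memorizer(question: str) -> str:
    """Perfect on the benchmark, useless one step off it."""
    return benchmark_answers.get(question, "I don't know")

print(memorizer("2+2"))   # "4": a perfect benchmark score
print(memorizer("3+3"))   # "I don't know": no generalization
```

Real training is far subtler than a lookup table, but any benchmark whose items leak into training data drifts in this direction — hence Chollet's push for a harder, held-out version of ARC.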
Epilogue: Career Advice—Choose your organization wisely
Some career advice for our younger readers. Figure out fast whether your organization (and your job) values doing the right thing — and if it does not, quit. If you are actually solving crimes, you may be happy in your job; if your organization is engaged in pervasive cover-your-ass (CYA) behaviour, you probably won't be. Work takes up almost a third of your life. Whether you are cooking, solving crimes, or teaching classes, you can do honest, serious work — what McNulty calls “doing good police” — or you can be in a place that wants you to pretend.
I have worked in academia, in politics and in the private sector, and I have found both types. There are academic organizations that are constantly looking at rankings and emphasizing what “counts” (by some arbitrary metric concocted by, say, an FT editor), and others where you are encouraged to do your best work regardless of where it ends up. Political parties are notorious, of course, for putting other metrics, notably loyalty, ahead of merit — although a start-up party is more likely to reward merit. Even in business, where profit maximization should provide a clear metric, I have seen a lot of gaming — e.g. investing in negative-NPV projects that make the short-run numbers look good.3
In all of these cases, you must look past the mission statements and corporate values plastered on the walls. Enron famously championed "Integrity, Communication, Respect, Excellence" right until its collapse. Instead, study what actually gets rewarded and who gets promoted. Does this place reward good, creative, intense, honest work, or does it prefer stat-manipulating yes-men?
References
Garicano, Luis, and Luis Rayo. 2016. “Why Organizations Fail: Models and Cases.” Journal of Economic Literature 54 (1): 137–192.
Gibbons, Robert. 1998. “Incentives in Organizations.” Journal of Economic Perspectives 12 (4): 115–132.
Holmström, Bengt, and Paul Milgrom. 1991. “Multitask Principal-Agent Analyses: Incentive Contracts, Asset Ownership, and Job Design.” Journal of Law, Economics, and Organization 7: 24–52.
Charles Goodhart introduced what became known as Goodhart's Law in his 1975 paper, "Problems of Monetary Management: The U.K. Experience," presented at a conference organized by the Reserve Bank of Australia. He observed that any observed statistical regularity will tend to collapse once pressure is placed upon it for control purposes. The quoted phrase is a pithy summarization of this point by others.
See, e.g., Robert Gibbons, “Lecture Note 1: Agency Theory,” MIT.
See the analysis of this variational problem in Section 2 of Garicano and Rayo (2016).
Thanks for this fantastic piece Luis!
I’m a huge fan of The Wire (one of my all-time favorite shows), and I loved how you have captured those perpetual “juking the stats” moments in Baltimore PD (where everybody scrambles to make the numbers look good instead of actually doing good).
Regarding the AI part (which is my field of work), I wanted to add a small note about the recent o3 AI developments you mention, especially around the debate of “teaching to the test” versus real breakthroughs.
Yes, we can always wonder whether an AI has simply been “trained” to ace a particular benchmark—much like students can be “taught to the test” in school. But the truly exciting part about o3 is its inference-time chain-of-thought reasoning.
Being able to dynamically write and refine its own code as part of the reasoning process (which I suppose is what this new model is doing under the hood) suggests we’re inching closer to a world where AI can explore solutions beyond what was explicitly laid out in training. It’s a little like seeing McNulty in the detail room, working through leads step-by-step; only now, the detective has superhuman coding abilities and almost infinite knowledge.
One of the best indicators of o3’s potential is its performance on FrontierMath, jumping from 2% with o1 to 25% under o3 — an undeniable zero-to-one leap that makes me think we are at the beginning of AGI (we have to think about AGI not as a singular point in time but as a process), or very, very close to it.
Again, thanks for the great read—I’m excited to see what next year brings us!
Acta non verba.