When the AI Reflex Gets Expensive: The Real Value of Expertise
Dr. Oliver Gausmann · March 22, 2026 · 12 min read

Key Takeaways
- AI improves output quality by 40% on standard tasks, yet users overestimate their productivity gains by 39 percentage points
- Expert plus AI (Centaur) reduces extreme errors by 90%, while the Cyborg approach without clear role separation lowers quality
- No LLM guarantees reproducible outputs, and the EU AI Act high-risk deadline is August 2, 2026
Executive Summary
Two conversations in the same week prompted this article. A CEO told me his team was using ChatGPT for NIS2 preparation. The results looked convincing, and nobody was really checking them. Two days later, a professor I deeply respect at the University of Zurich described how students solve programming assignments with AI and can no longer explain the code they submit. Both observations describe the same mechanism: people shifting from active AI users to passive recipients. Harvard's research confirms this with numbers. Within AI's capability range, output quality improves by 40% [1]Dell'Acqua et al.: Navigating the Jagged Technological Frontier, Harvard/BCG 2023. Outside that range, it drops by 19 percentage points [1]Dell'Acqua et al.: Navigating the Jagged Technological Frontier, Harvard/BCG 2023. Users don't notice the difference. They overestimate their own productivity gains by 39 percentage points [2]METR: Measuring the Impact of Early-2025 AI on Developer Productivity.
Are We Becoming AI Drones?
The question sounds provocative. The data isn't. According to MIT, 90% of employees use private AI tools, bypassing official company systems [3]MIT NANDA: The GenAI Divide, State of AI in Business 2025. Shadow AI doesn't come from bad intentions. Official systems simply don't map to specific workflows. So people ask ChatGPT. The answer sounds plausible. A quick skim reveals nothing off. And because the output looks convincing, deeper verification often doesn't happen at all.
The pattern is everywhere. CEOs sign off on AI-generated compliance analyses they can't verify. Students submit code they didn't write. In both cases, the role shifts from actively thinking to passively consuming. The Harvard researchers call this the "Cyborg" pattern: AI woven into every step until the capacity for critical evaluation fades [1]Dell'Acqua et al.: Navigating the Jagged Technological Frontier, Harvard/BCG 2023.
Aviation has dealt with this for decades. Modern airliners fly on autopilot 95% of the time. That works beautifully as long as everything goes to plan. But accidents in recent years have shown repeatedly: when technology fails, when sensors deliver wrong data or systems shut down unexpectedly, human intuition is what decides. The ability to read a situation in context, spot contradictions, and act under uncertainty can't be automated. Pilots who only monitor the autopilot and have unlearned manual flying become a risk in a crisis. CEOs who only consume AI output and have unlearned independent judgment become one too.
47% of enterprise AI users made at least one significant business decision based on hallucinated content in 2024 [8]Enterprise AI Accuracy Data 2024-2025, aggregated. 95% of enterprise AI pilots deliver no measurable P&L impact according to MIT [3]MIT NANDA: The GenAI Divide, State of AI in Business 2025. These aren't worst-case numbers. This is the current state.
I love technology when it makes life easier. I use it daily. But I don't want to depend on it. I'm glad I can still fly through the storm by hand when it counts.
Where AI Actually Helps and Where It Hurts
Harvard tested this with 758 real BCG consultants [1]Dell'Acqua et al.: Navigating the Jagged Technological Frontier, Harvard/BCG 2023. On standard tasks like writing, data analysis, and research synthesis, results with GPT-4 were 40% better and 25% faster. On a complex task combining quantitative analysis with qualitative judgment, AI users performed 19 percentage points worse than the control group.
The researchers call this boundary the "Jagged Frontier." Jagged and invisible. Massive gains on one side. Measurable quality losses on the other. The insidious part: you only realize you've crossed it after the damage is done. Because AI still delivers professional-looking output beyond the frontier. It's just wrong.
METR's 2025 study shows it even more sharply [2]METR: Measuring the Impact of Early-2025 AI on Developer Productivity. Sixteen experienced open-source developers, people with five years' familiarity with their own codebases, worked with AI coding tools. They became 19% slower. They believed they were 20% faster. Picture that: a 39-percentage-point gap between feeling and reality. When your Head of Engineering reports that the new AI tool saves the team 20% of their time, that could be true. It could also mean the team got slower and didn't notice. Without independent measurement, you don't know.
Stanford adds a crucial dimension: time [4]Stanford HAI AI Index 2025: RE-Bench. On tasks under two hours, AI outperforms humans 4:1. On complex analyses over 32 hours, humans outperform AI 2:1. Makes intuitive sense. If you want to draft an email, ask AI. If you're running a weeks-long NIS2 gap analysis requiring company-specific context, you need a human. Or better: a human with AI.
Centaur or Cyborg: Two Ways to Use AI in Your Company
Harvard identified two usage patterns [1]Dell'Acqua et al.: Navigating the Jagged Technological Frontier, Harvard/BCG 2023. "Cyborgs" interweave AI into every step. Prompt, read, prompt again, accept, prompt again. The boundary between own thinking and AI output eventually blurs. "Centaurs" divide work clearly. AI does the research, the human evaluates. AI writes the draft, the human decides. Centaurs deliver better results. That's not accidental.
In practice, the difference looks like this: A Cyborg approach to NIS2 analysis means ChatGPT writes the analysis, the CEO skims it, signs off. A Centaur approach means ChatGPT delivers a summary of regulatory requirements. The consultant checks it against the company's specific IT infrastructure. Identifies gaps that only someone with knowledge of internal structures can see. Builds a concrete action plan from there. In the first case, the result looks professional. In the second case, it's correct.
LSU quantified the Centaur effect in equity analysis [5]LSU Finance Centaur Analyst Study 2025. AI alone beats human analysts 54.5% of the time. The Centaur, expert plus AI, beats AI alone in 55% of forecasts and reduces extreme forecast errors by 90%. The average improves slightly. The catastrophes nearly disappear. For a CEO making a strategic decision, the average doesn't matter. What matters is how bad it gets in the worst case. That's precisely where the human makes the difference.
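A toy simulation with synthetic numbers (illustrative only, not the LSU data) shows why correcting the rare extreme errors barely moves typical performance while eliminating the catastrophes:

```python
import random
random.seed(0)

# Synthetic forecast errors: mostly small, with a rare heavy tail.
# "AI alone" occasionally produces an extreme miss; the expert in the
# Centaur setup catches and corrects exactly those implausible outliers.
ai_errors = [abs(random.gauss(0, 1)) + (20 if random.random() < 0.01 else 0)
             for _ in range(100_000)]
centaur_errors = [min(e, 3.0) for e in ai_errors]  # expert caps the outliers

def median(xs):
    return sorted(xs)[len(xs) // 2]

print(f"median error  AI: {median(ai_errors):.2f}  Centaur: {median(centaur_errors):.2f}")
print(f"max error     AI: {max(ai_errors):.1f}  Centaur: {max(centaur_errors):.1f}")
# Typical error is nearly identical; the catastrophic tail is gone.
```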
Why does the human make that difference? Because AI lacks individual context. What ChatGPT knows about NIS2, anyone with internet access knows. What it doesn't know: how your IT architecture connects, which suppliers create critical dependencies, where your organizational structure has gaps that appear in no org chart. 95% of enterprise AI pilots fail because of exactly this problem [3]MIT NANDA: The GenAI Divide, State of AI in Business 2025. Between $30 and $40 billion flowed into enterprise GenAI with minimal returns. Research on tacit knowledge confirms it: extracting experience-based expertise from humans is expensive, slow, and yields only a partial picture even under ideal conditions [17]Sanzogni et al.: Tacit Knowledge and AI, MDPI Technologies 2025.
Why AI Poses a Structural Risk in Regulated Environments
LLMs work probabilistically. The same prompt yields different answers on different days. No commercial provider guarantees deterministic outputs. OpenAI calls its API "mostly deterministic." Anthropic states that even at temperature 0, outputs won't be fully deterministic [9]Thinking Machines Lab: Defeating Nondeterminism in LLM Inference 2025. The root cause is floating-point non-associativity: GPU kernels sum the same numbers in different orders depending on batch size and load, and the results differ [9]Thinking Machines Lab: Defeating Nondeterminism in LLM Inference 2025. Sounds like a technical detail. For regulated processes, it's fundamental.
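Non-associativity is easy to demonstrate in plain Python; the same mechanism plays out at scale inside GPU kernels, where reduction order varies with batch size:

```python
# Floating-point addition is not associative: summing the same numbers in a
# different order can change the result. GPU kernels reorder reductions
# depending on batch size and scheduling, so identical prompts can yield
# slightly different logits, and thus different tokens.
a, b, c = 0.1, 1e16, -1e16

left = (a + b) + c   # 0.1 is absorbed by the huge intermediate sum
right = a + (b + c)  # the huge terms cancel first, so 0.1 survives

print(left, right, left == right)  # 0.0 0.1 False
```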
The EU AI Act requires consistent accuracy metrics for high-risk AI under Article 15 and automatic logging of all inputs and outputs under Article 12 [10]EU AI Act, Regulation (EU) 2024/1689, Art. 12, 15. Compliance deadline: August 2, 2026. Germany's BaFin explicitly names stochastic behavior as an AI-specific risk in its December 2025 guidance [11]BaFin: Guidance on ICT Risks in AI, December 2025. Under GDPR Article 22, decisions with legal effect require human decision-makers [12]DSK: Guidance on AI and Data Protection 2024. Germany's BSI recommends against using AI applications unchecked in critical business processes [13]BSI Management-Blitzlicht: Secure Generative AI 2024.
Stanford measured hallucination rates in professional legal AI tools [7]Magesh et al.: Hallucination-Free? Stanford Legal AI Study 2025. Even specialized tools with access to verified databases deliver wrong answers in one-sixth to one-third of cases. In an NIS2 gap analysis with 50 checkpoints and a 17% error rate, roughly eight to nine vulnerabilities go undetected (estimate). Human-in-the-loop review raises accuracy from 82% to 98% [8]Enterprise AI Accuracy Data 2024-2025, aggregated. In a compliance review, that's the difference between nine missed vulnerabilities and one.
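The back-of-the-envelope arithmetic behind those figures:

```python
checkpoints = 50

# Tool-only review at a 17% error rate (Stanford's range: one-sixth to one-third)
print(checkpoints * 0.17)        # 8.5 -> roughly eight to nine undetected gaps

# Human-in-the-loop comparison: 82% vs. 98% accuracy
print(checkpoints * (1 - 0.82))  # 9.0 missed vulnerabilities
print(checkpoints * (1 - 0.98))  # 1.0 missed vulnerability
```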
What I See at the University
When I first lectured a class of students 22 years ago, I was the source. Students didn't have the knowledge, and I delivered it. In 2026, they have it before I enter the room. 92% of students use AI tools regularly [14]HEPI/Kortext: Student Generative AI Survey 2025. A Harvard experiment showed an AI tutor producing twice the learning gain of active classroom instruction [15]Kestin et al.: AI tutoring outperforms active learning, Scientific Reports 2025.
Today I teach Global Software Management and AI at the University of Zurich. The value of my lectures isn't in the material anymore. That's on YouTube, in Claude, in ChatGPT. The value is in connecting practice with theory, in bringing real implementation experience to an academic setting. What surprised me: this shift gives me more satisfaction in 2026 than it did in 2004. Because students bring the foundational knowledge and we can go straight to depth. Latest knowledge, years of experience, modern tools, combined. That's teaching as it should be.
Yet 95% of faculty fear student over-reliance on AI [16]AAC&U/Elon University Faculty Survey 2026. 48% say research quality has declined [16]AAC&U/Elon University Faculty Survey 2026. I see both. Students who use AI as a shortcut and stop thinking. And students who use AI as a tool and arrive at better results faster. The difference? The second group can explain what they submitted. The first group can't.
The parallel to mid-market AI consulting is clear. AI delivers data collection, synthesis, first drafts. Interpretation, evaluation, strategic application to specific contexts, that requires someone who knows the context. And who's willing to ask uncomfortable questions.
What CEOs Can Do Now
Measure actual AI productivity gains. Employee self-reports are off by 39 percentage points [2]METR: Measuring the Impact of Early-2025 AI on Developer Productivity. Compare AI-assisted and AI-free results on the same task. Measure output quality and turnaround times. If AI output is better, use AI. If not, now you know why.
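A minimal sketch of such a measurement, with illustrative numbers (the groups, times, and scores are placeholders): assign comparable tasks to an AI-assisted group and a baseline group, score the output blind, and compare.

```python
# Compare AI-assisted vs. AI-free work on the same task type.
# All numbers below are illustrative; replace with your own measurements.
from statistics import mean, stdev

def summarize(label, minutes, quality):
    print(f"{label}: {mean(minutes):.0f} min avg "
          f"(sd {stdev(minutes):.0f}), quality {mean(quality):.1f}/10")

ai_minutes,   ai_quality   = [95, 110, 80, 120, 100], [7.5, 8.0, 6.5, 7.0, 8.5]
base_minutes, base_quality = [120, 90, 130, 105, 115], [7.0, 7.5, 8.0, 6.5, 7.5]

summarize("AI-assisted", ai_minutes, ai_quality)
summarize("Baseline   ", base_minutes, base_quality)
```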
Separate AI work from human work deliberately. What works for me: research, data preparation, and first drafts run through AI tools. Evaluating what matters for a specific client, setting priorities, planning implementation, that stays human. In an AI strategy workshop, I prepare market analysis with Claude beforehand. The workshop itself centers on questions no tool answers. Which department is ready for change? Where are the informal decision-makers? What legacy architecture blocks the next step?
For regulated processes, verify that AI output meets the traceability requirements the EU AI Act demands from August 2026 [10]EU AI Act, Regulation (EU) 2024/1689, Art. 12, 15. Probabilistic AI outputs in regulated processes without documented traceability create personal liability.
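What documented traceability can look like in practice: a minimal sketch, assuming all AI calls in a workflow are funneled through one wrapper. `call_model`, the field names, and the JSONL file are illustrative placeholders, not a compliance-certified design; a real Article 12 setup also needs retention rules, access control, and legal review.

```python
# Log every AI input/output pair with a timestamp and content hash,
# plus a slot for the human sign-off that GDPR Art. 22 workflows require.
import hashlib, json
from datetime import datetime, timezone

def logged_completion(call_model, prompt, log_path="ai_audit_log.jsonl", **params):
    output = call_model(prompt, **params)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "params": params,  # model id, temperature, etc.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "output": output,
        "human_signoff": None,  # filled in when a reviewer approves the output
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return output
```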
External expertise pays off precisely where your organization crosses the Jagged Frontier. Where tasks are complex, context-dependent, and carry regulatory weight. For NIS2 implementation, AI strategy, or software delivery process redesign, a graduate student with a ChatGPT subscription is the cheapest option with the most expensive risk profile. When evaluating external AI consulting, check whether the consultant uses AI as a tool themselves. Anyone consulting without AI in 2026 wastes speed. Anyone consulting only with AI wastes quality.
For a structured approach to building your AI strategy, see our AI Strategy Guide for Mid-Market Companies.
My Take
This article is itself an example of the boundary it describes. I used Claude to run the research. Seven studies reviewed, data points extracted, contradictions between Harvard, MIT, and Stanford flagged. That saved hours. The judgment call, which numbers matter for a CEO with 200 employees and which stay academic, that's not something a prompt delivers. The decision to structure this piece around the Jagged Frontier and the Centaur metaphor came from client experience, not from AI output.
A CEO recently asked whether his team could handle NIS2 requirements using ChatGPT. I asked to see their result. Three of eight critical requirements were missing. The gaps were exactly where things got company-specific: supply chain dependencies, cross-departmental responsibilities, legacy IT systems that appear in no official documentation. ChatGPT knew the regulation. It didn't know the company.
With software delivery, I see the same thing. AI agents generate code, write tests, create documentation. Whether the generated code fits your existing architecture, meets your compliance requirements, or whether your team can maintain it, no agent can judge that. It takes operational experience. And the willingness to give uncomfortable answers.
The data from seven studies shows a clear pattern. AI outperforms humans 4:1 on tasks under two hours [4]Stanford HAI AI Index 2025: RE-Bench. Humans outperform AI 2:1 on analyses over 32 hours [4]Stanford HAI AI Index 2025: RE-Bench. Consulting projects take weeks. They require context that exists in no training dataset. They require judgment that no temperature setting delivers.
Back to aviation: I love technology when it makes life easier. But I don't want to depend on it. I'm glad to still be able to navigate my course by hand when it counts. That goes for aircraft. And for companies.
Sources
- [1]Dell'Acqua et al.: Navigating the Jagged Technological Frontier, Harvard/BCG 2023
- [2]METR: Measuring the Impact of Early-2025 AI on Developer Productivity
- [3]MIT NANDA: The GenAI Divide, State of AI in Business 2025
- [4]Stanford HAI AI Index 2025: RE-Bench
- [5]LSU Finance Centaur Analyst Study 2025
- [6]Noy & Zhang: Experimental evidence on the productivity effects of generative AI, Science 2023
- [7]Magesh et al.: Hallucination-Free? Stanford Legal AI Study 2025
- [8]Enterprise AI Accuracy Data 2024-2025, aggregated
- [9]Thinking Machines Lab: Defeating Nondeterminism in LLM Inference 2025
- [10]EU AI Act, Regulation (EU) 2024/1689, Art. 12, 15
- [11]BaFin: Guidance on ICT Risks in AI, December 2025
- [12]DSK: Guidance on AI and Data Protection 2024
- [13]BSI Management-Blitzlicht: Secure Generative AI 2024
- [14]HEPI/Kortext: Student Generative AI Survey 2025
- [15]Kestin et al.: AI tutoring outperforms active learning, Scientific Reports 2025
- [16]AAC&U/Elon University Faculty Survey 2026
- [17]Sanzogni et al.: Tacit Knowledge and AI, MDPI Technologies 2025