When the AI Reflex Gets Expensive: The Real Value of Expertise
Dr. Oliver Gausmann · March 22, 2026 · 12 min read

Key Takeaways
- AI improves output quality by 40% on standard tasks, yet users overestimate their productivity gains by 39 percentage points
- Expert plus AI (Centaur) reduces extreme errors by 90%, while the Cyborg approach without clear role separation lowers quality
- No LLM guarantees reproducible outputs, and the EU AI Act high-risk deadline is August 2, 2026
Executive Summary
Two conversations in the same week prompted this article. A CEO told me his team was using ChatGPT for NIS2 preparation. The results looked convincing, and nobody was really checking them. Two days later, a professor I deeply respect at the University of Zurich described how students solve programming assignments with AI and can no longer explain the code they submit. Both observations describe the same mechanism: people shifting from active AI users to passive recipients. Harvard's research confirms this with numbers. Within AI's capability range, output quality improves by 40% [1]Dell'Acqua et al.: Navigating the Jagged Technological Frontier, Harvard/BCG 2023. Outside that range, it drops by 19 percentage points [1]Dell'Acqua et al.: Navigating the Jagged Technological Frontier, Harvard/BCG 2023. Users don't notice the difference. They overestimate their own productivity gains by 39 percentage points [2]METR: Measuring the Impact of Early-2025 AI on Developer Productivity.
Are We Becoming AI Drones?
The question sounds provocative. The data isn't. According to MIT, 90% of employees use private AI tools, bypassing official company systems [3]MIT NANDA: The GenAI Divide, State of AI in Business 2025. Shadow AI doesn't come from bad intentions. Official systems simply don't map to specific workflows. So people ask ChatGPT. The answer sounds plausible. A quick skim reveals nothing off. And because the output looks convincing, deeper verification often doesn't happen at all.
The pattern is everywhere. CEOs sign off on AI-generated compliance analyses they can't verify. Students submit code they didn't write. In both cases, the role shifts from actively thinking to passively consuming. The Harvard researchers call this the "Cyborg" pattern: AI woven into every step until the capacity for critical evaluation fades [1]Dell'Acqua et al.: Navigating the Jagged Technological Frontier, Harvard/BCG 2023.
Aviation has dealt with this for decades. Modern airliners fly on autopilot 95% of the time. That works beautifully as long as everything goes to plan. But accidents in recent years have shown repeatedly: when technology fails, when sensors deliver wrong data or systems shut down unexpectedly, human intuition is what decides. The ability to read a situation in context, spot contradictions, and act under uncertainty can't be automated. Pilots who only monitor the autopilot and have unlearned manual flying become a risk in a crisis. CEOs who only consume AI output and have unlearned independent judgment become one too.
47% of enterprise AI users made at least one significant business decision based on hallucinated content in 2024 [8]Enterprise AI Accuracy Data 2024-2025, aggregated. 95% of enterprise AI pilots deliver no measurable P&L impact according to MIT [3]MIT NANDA: The GenAI Divide, State of AI in Business 2025. These aren't worst-case numbers. This is the current state.
I love technology when it makes life easier. I use it daily. But I don't want to depend on it. I'm glad I can still fly through the storm by hand when it counts.
Where AI Actually Helps and Where It Hurts
Harvard tested this with 758 real BCG consultants [1]Dell'Acqua et al.: Navigating the Jagged Technological Frontier, Harvard/BCG 2023. On standard tasks like writing, data analysis, and research synthesis, results with GPT-4 were 40% better and 25% faster. On a complex task combining quantitative analysis with qualitative judgment, AI users performed 19 percentage points worse than the control group.
The researchers call this boundary the "Jagged Frontier." Jagged and invisible. Massive gains on one side. Measurable quality losses on the other. The insidious part: you only realize you've crossed it after the damage is done. Because AI still delivers professional-looking output beyond the frontier. It's just wrong.
METR's 2025 study shows it even more sharply [2]METR: Measuring the Impact of Early-2025 AI on Developer Productivity. Sixteen experienced open-source developers, people with five years' familiarity with their own codebases, worked with AI coding tools. They became 19% slower. They believed they were 20% faster. Picture that: a 39-percentage-point gap between feeling and reality. When your Head of Engineering reports that the new AI tool saves the team 20% of their time, that could be true. It could also mean the team got slower and didn't notice. Without independent measurement, you don't know.
Stanford adds a crucial dimension: time [4]Stanford HAI AI Index 2025: RE-Bench. On tasks under two hours, AI outperforms humans 4:1. On complex analyses over 32 hours, humans outperform AI 2:1. Makes intuitive sense. If you want to draft an email, ask AI. If you're running a weeks-long NIS2 gap analysis requiring company-specific context, you need a human. Or better: a human with AI.
Centaur or Cyborg: Two Ways to Use AI in Your Company
Harvard identified two usage patterns [1]Dell'Acqua et al.: Navigating the Jagged Technological Frontier, Harvard/BCG 2023. "Cyborgs" interweave AI into every step. Prompt, read, prompt again, accept, prompt again. The boundary between own thinking and AI output eventually blurs. "Centaurs" divide work clearly. AI does the research, the human evaluates. AI writes the draft, the human decides. Centaurs deliver better results. That's not accidental.
In practice, the difference looks like this: A Cyborg approach to NIS2 analysis means ChatGPT writes the analysis, the CEO skims it, signs off. A Centaur approach means ChatGPT delivers a summary of regulatory requirements. The consultant checks it against the company's specific IT infrastructure. Identifies gaps that only someone with knowledge of internal structures can see. Builds a concrete action plan from there. In the first case, the result looks professional. In the second case, it's correct.
LSU quantified the Centaur effect in equity analysis [5]LSU Finance Centaur Analyst Study 2025. AI alone beats human analysts 54.5% of the time. The Centaur, expert plus AI, beats AI alone in 55% of forecasts and reduces extreme forecast errors by 90%. The average improves slightly. The catastrophes nearly disappear. For a CEO making a strategic decision, the average doesn't matter. What matters is how bad it gets in the worst case. That's precisely where the human makes the difference.
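A toy simulation with synthetic numbers (illustrative only, not the LSU data) shows why correcting the rare extreme errors barely moves typical performance while eliminating the catastrophes:

```python
import random
random.seed(0)

# Synthetic forecast errors: mostly small, with a rare heavy tail.
# "AI alone" occasionally produces an extreme miss; the expert in the
# Centaur setup catches and corrects exactly those implausible outliers.
ai_errors = [abs(random.gauss(0, 1)) + (20 if random.random() < 0.01 else 0)
             for _ in range(100_000)]
centaur_errors = [min(e, 3.0) for e in ai_errors]  # expert caps the outliers

def median(xs):
    return sorted(xs)[len(xs) // 2]

print(f"median error  AI: {median(ai_errors):.2f}  Centaur: {median(centaur_errors):.2f}")
print(f"max error     AI: {max(ai_errors):.1f}  Centaur: {max(centaur_errors):.1f}")
# Typical error is nearly identical; the catastrophic tail is gone.
```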
Why does the human make that difference? Because AI lacks individual context. What ChatGPT knows about NIS2, anyone with internet access knows. What it doesn't know: how your IT architecture connects, which suppliers create critical dependencies, where your organizational structure has gaps that appear in no org chart. 95% of enterprise AI pilots fail because of exactly this problem [3]MIT NANDA: The GenAI Divide, State of AI in Business 2025. Between $30 and $40 billion flowed into enterprise GenAI with minimal returns. Research on tacit knowledge confirms it: extracting experience-based expertise from humans is expensive, slow, and yields only a partial picture even under ideal conditions [17]Sanzogni et al.: Tacit Knowledge and AI, MDPI Technologies 2025.
Why AI Poses a Structural Risk in Regulated Environments
LLMs work probabilistically. The same prompt yields different answers on different days. No commercial provider guarantees deterministic outputs. OpenAI calls its API "mostly deterministic." Anthropic states that even at temperature 0, outputs won't be fully deterministic [9]Thinking Machines Lab: Defeating Nondeterminism in LLM Inference 2025. The root cause is floating-point non-associativity: GPU kernels sum the same numbers in different orders depending on batch size and load, and the results differ [9]Thinking Machines Lab: Defeating Nondeterminism in LLM Inference 2025. Sounds like a technical detail. For regulated processes, it's fundamental.
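Non-associativity is easy to demonstrate in plain Python; the same mechanism plays out at scale inside GPU kernels, where reduction order varies with batch size:

```python
# Floating-point addition is not associative: summing the same numbers in a
# different order can change the result. GPU kernels reorder reductions
# depending on batch size and scheduling, so identical prompts can yield
# slightly different logits, and thus different tokens.
a, b, c = 0.1, 1e16, -1e16

left = (a + b) + c   # 0.1 is absorbed by the huge intermediate sum
right = a + (b + c)  # the huge terms cancel first, so 0.1 survives

print(left, right, left == right)  # 0.0 0.1 False
```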
The EU AI Act requires consistent accuracy metrics for high-risk AI under Article 15 and automatic logging of all inputs and outputs under Article 12 [10]EU AI Act, Regulation (EU) 2024/1689, Art. 12, 15. Compliance deadline: August 2, 2026. Germany's BaFin explicitly names stochastic behavior as an AI-specific risk in its December 2025 guidance [11]BaFin: Guidance on ICT Risks in AI, December 2025. Under GDPR Article 22, decisions with legal effect require human decision-makers [12]DSK: Guidance on AI and Data Protection 2024. Germany's BSI recommends against using AI applications unchecked in critical business processes [13]BSI Management-Blitzlicht: Secure Generative AI 2024.
Stanford measured hallucination rates in professional legal AI tools [7]Magesh et al.: Hallucination-Free? Stanford Legal AI Study 2025. Even specialized tools with access to verified databases deliver wrong answers in one-sixth to one-third of cases. In an NIS2 gap analysis with 50 checkpoints and a 17% error rate, roughly eight to nine vulnerabilities go undetected (estimate). Human-in-the-loop review raises accuracy from 82% to 98% [8]Enterprise AI Accuracy Data 2024-2025, aggregated. In a compliance review, that's the difference between nine missed vulnerabilities and one.
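The back-of-the-envelope arithmetic behind those figures:

```python
checkpoints = 50

# Tool-only review at a 17% error rate (Stanford's range: one-sixth to one-third)
print(checkpoints * 0.17)        # 8.5 -> roughly eight to nine undetected gaps

# Human-in-the-loop comparison: 82% vs. 98% accuracy
print(checkpoints * (1 - 0.82))  # 9.0 missed vulnerabilities
print(checkpoints * (1 - 0.98))  # 1.0 missed vulnerability
```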
What I See at the University
When I first lectured a class of students 22 years ago, I was the source. Students didn't have the knowledge, and I delivered it. In 2026, they have it before I enter the room. 92% of students use AI tools regularly [14]HEPI/Kortext: Student Generative AI Survey 2025. A Harvard experiment showed an AI tutor producing twice the learning gain of active classroom instruction [15]Kestin et al.: AI tutoring outperforms active learning, Scientific Reports 2025.
Today I teach Global Software Management and AI at the University of Zurich. The value of my lectures isn't in the material anymore. That's on YouTube, in Claude, in ChatGPT. The value is in connecting practice with theory, in bringing real implementation experience to an academic setting. What surprised me: this shift gives me more satisfaction in 2026 than it did in 2004. Because students bring the foundational knowledge and we can go straight to depth. Latest knowledge, years of experience, modern tools, combined. That's teaching as it should be.
Yet 95% of faculty fear student over-reliance on AI [16]AAC&U/Elon University Faculty Survey 2026. 48% say research quality has declined [16]AAC&U/Elon University Faculty Survey 2026. I see both. Students who use AI as a shortcut and stop thinking. And students who use AI as a tool and arrive at better results faster. The difference? The second group can explain what they submitted. The first group can't.
The parallel to mid-market AI consulting is clear. AI delivers data collection, synthesis, first drafts. Interpretation, evaluation, strategic application to specific contexts, that requires someone who knows the context. And who's willing to ask uncomfortable questions.
What CEOs Can Do Now
Measure actual AI productivity gains. Employee self-reports are off by 39 percentage points [2]METR: Measuring the Impact of Early-2025 AI on Developer Productivity. Compare AI-assisted and AI-free results on the same task. Measure output quality and turnaround times. If AI output is better, use AI. If not, now you know why.
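A minimal sketch of such a measurement, with illustrative numbers (the groups, times, and scores are placeholders): assign comparable tasks to an AI-assisted group and a baseline group, score the output blind, and compare.

```python
# Compare AI-assisted vs. AI-free work on the same task type.
# All numbers below are illustrative; replace with your own measurements.
from statistics import mean, stdev

def summarize(label, minutes, quality):
    print(f"{label}: {mean(minutes):.0f} min avg "
          f"(sd {stdev(minutes):.0f}), quality {mean(quality):.1f}/10")

ai_minutes,   ai_quality   = [95, 110, 80, 120, 100], [7.5, 8.0, 6.5, 7.0, 8.5]
base_minutes, base_quality = [120, 90, 130, 105, 115], [7.0, 7.5, 8.0, 6.5, 7.5]

summarize("AI-assisted", ai_minutes, ai_quality)
summarize("Baseline   ", base_minutes, base_quality)
```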
Separate AI work from human work deliberately. What works for me: research, data preparation, and first drafts run through AI tools. Evaluating what matters for a specific client, setting priorities, planning implementation, that stays human. In an AI strategy workshop, I prepare market analysis with Claude beforehand. The workshop itself centers on questions no tool answers. Which department is ready for change? Where are the informal decision-makers? What legacy architecture blocks the next step?
For regulated processes, verify that AI output meets the traceability requirements the EU AI Act demands from August 2026 [10]EU AI Act, Regulation (EU) 2024/1689, Art. 12, 15. Probabilistic AI outputs in regulated processes without documented traceability create personal liability.
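What documented traceability can look like in practice: a minimal sketch, assuming all AI calls in a workflow are funneled through one wrapper. `call_model`, the field names, and the JSONL file are illustrative placeholders, not a compliance-certified design; a real Article 12 setup also needs retention rules, access control, and legal review.

```python
# Log every AI input/output pair with a timestamp and content hash,
# plus a slot for the human sign-off that GDPR Art. 22 workflows require.
import hashlib, json
from datetime import datetime, timezone

def logged_completion(call_model, prompt, log_path="ai_audit_log.jsonl", **params):
    output = call_model(prompt, **params)
    record = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "params": params,  # model id, temperature, etc.
        "prompt_sha256": hashlib.sha256(prompt.encode()).hexdigest(),
        "prompt": prompt,
        "output": output,
        "human_signoff": None,  # filled in when a reviewer approves the output
    }
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
    return output
```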
External expertise pays off precisely where your organization crosses the Jagged Frontier. Where tasks are complex, context-dependent, and carry regulatory weight. For NIS2 implementation, AI strategy, or software delivery process redesign, a graduate student with a ChatGPT subscription is the cheapest option with the most expensive risk profile. When evaluating external AI consulting, check whether the consultant uses AI as a tool themselves. Anyone consulting without AI in 2026 wastes speed. Anyone consulting only with AI wastes quality.
For a structured approach to building your AI strategy, see our AI Strategy Guide for Mid-Market Companies.
My Take
This article is itself an example of the boundary it describes. I used Claude to run the research. Seven studies reviewed, data points extracted, contradictions between Harvard, MIT, and Stanford flagged. That saved hours. The judgment call, which numbers matter for a CEO with 200 employees and which stay academic, that's not something a prompt delivers. The decision to structure this piece around the Jagged Frontier and the Centaur metaphor came from client experience, not from AI output.
A CEO recently asked whether his team could handle NIS2 requirements using ChatGPT. I asked to see their result. Three of eight critical requirements were missing. The gaps were exactly where things got company-specific: supply chain dependencies, cross-departmental responsibilities, legacy IT systems that appear in no official documentation. ChatGPT knew the regulation. It didn't know the company.
With software delivery, I see the same thing. AI agents generate code, write tests, create documentation. Whether the generated code fits your existing architecture, meets your compliance requirements, or whether your team can maintain it, no agent can judge that. It takes operational experience. And the willingness to give uncomfortable answers.
The data from seven studies shows a clear pattern. AI outperforms humans 4:1 on tasks under two hours [4]Stanford HAI AI Index 2025: RE-Bench. Humans outperform AI 2:1 on analyses over 32 hours [4]Stanford HAI AI Index 2025: RE-Bench. Consulting projects take weeks. They require context that exists in no training dataset. They require judgment that no temperature setting delivers.
Back to aviation: I love technology when it makes life easier. But I don't want to depend on it. I'm glad to still be able to navigate my course by hand when it counts. That goes for aircraft. And for companies.
Sources
- [1]Dell'Acqua et al.: Navigating the Jagged Technological Frontier, Harvard/BCG 2023
- [2]METR: Measuring the Impact of Early-2025 AI on Developer Productivity
- [3]MIT NANDA: The GenAI Divide, State of AI in Business 2025
- [4]Stanford HAI AI Index 2025: RE-Bench
- [5]LSU Finance Centaur Analyst Study 2025
- [6]Noy & Zhang: Experimental evidence on the productivity effects of generative AI, Science 2023
- [7]Magesh et al.: Hallucination-Free? Stanford Legal AI Study 2025
- [8]Enterprise AI Accuracy Data 2024-2025, aggregated
- [9]Thinking Machines Lab: Defeating Nondeterminism in LLM Inference 2025
- [10]EU AI Act, Regulation (EU) 2024/1689, Art. 12, 15
- [11]BaFin: Guidance on ICT Risks in AI, December 2025
- [12]DSK: Guidance on AI and Data Protection 2024
- [13]BSI Management-Blitzlicht: Secure Generative AI 2024
- [14]HEPI/Kortext: Student Generative AI Survey 2025
- [15]Kestin et al.: AI tutoring outperforms active learning, Scientific Reports 2025
- [16]AAC&U/Elon University Faculty Survey 2026
- [17]Sanzogni et al.: Tacit Knowledge and AI, MDPI Technologies 2025