In January, Air Canada's autonomous booking agent rebooked 1,247 passengers onto incorrect flights during a Toronto weather disruption. The agent worked fine in testing. It worked fine on a normal Tuesday. It fell apart the first time it encountered the kind of operational chaos every airline encounters several times a year.

That failure isn't an outlier. It's a deployment pattern. Step Finance watched its AI trading agents autonomously move $27–30 million off compromised executive devices because the agents had blanket SOL transfer authority and no value-threshold checks. Between December 2025 and February 2026, a single attacker used Claude Code and GPT-4.1 to breach nine Mexican government agencies and exfiltrate 195 million taxpayer records. In all three, the agents were doing exactly what their permissions allowed. The mistake was upstream.

Most C-suite leaders are budgeting demo-to-production as the last 20% of the work. It's the next 60%. Cloud Security Alliance research found 65% of organizations have already had an AI agent–related incident. Gartner predicts that by 2027, 40% of enterprises will demote or decommission autonomous agents over governance gaps surfaced only after production incidents.

Here's what the in-between actually contains.

The work you do once

Foundational pieces that have to exist before an agent is allowed in production — hard, real engineering, built once. These map directly to the OWASP Top 10 for Agentic Applications (December 2025) and NIST's AI Agent Standards Initiative (February 2026):

Narrow scoping of permissions. Least-privilege access — what data the agent can read, write, or never touch. Step Finance's agents had blanket transfer authority. That's the agentic equivalent of plaintext root access.
The data and identity boundary. Which systems the agent can authenticate into, with what credentials, and how those credentials are scoped. The Mexican government breach didn't happen because the AI was wrong; it happened because the AI had access to credentials nobody had bounded.
Human-in-the-loop checkpoints. Which actions require human confirmation, which don't, and how that decision was made — documented and reviewable. Especially for actions above a value or impact threshold.
Observability infrastructure. Logging every prompt, decision, tool call, and outcome so failures are diagnosable after the fact. If you can't reconstruct what the agent did and why, you can't fix it.
Kill switches. How fast you can turn this off, and who has the authority to do so. If the answer is "we'd have to file a ticket," you're not ready for production.

Table stakes. An agent that lacks any of these shouldn't be in production.
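To make the list above concrete, here is a minimal sketch of how four of those pieces (an action allowlist, a value threshold for human approval, a kill switch, and a decision log) might sit in one gate in front of an agent's tool calls. Everything in it is an assumption for illustration: the AgentPolicy class, the action names, the 10,000 threshold, and the "treasury-oncall" approver are hypothetical and not taken from any of the incidents or frameworks cited here.

```python
import json
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AgentPolicy:
    """Hypothetical policy gate; every name and limit is illustrative."""
    allowed_actions: frozenset       # least privilege: explicit allowlist
    approval_threshold: float        # above this value a human must confirm
    enabled: bool = True             # kill switch: set False to stop all actions
    audit_log: List[str] = field(default_factory=list)

    def check(self, action: str, value: float,
              approved_by: Optional[str] = None) -> str:
        """Return 'allow', 'needs_approval', or 'deny', and log the decision."""
        if not self.enabled:
            decision, reason = "deny", "kill switch engaged"
        elif action not in self.allowed_actions:
            decision, reason = "deny", "action not on allowlist"
        elif value > self.approval_threshold and approved_by is None:
            decision, reason = "needs_approval", "value above threshold"
        else:
            decision, reason = "allow", "within scope"
        # Observability: one structured record per decision, kept for later review.
        self.audit_log.append(json.dumps({
            "ts": time.time(), "action": action, "value": value,
            "approved_by": approved_by, "decision": decision, "reason": reason,
        }))
        return decision


# Assumed example: an agent allowed only to transfer, with human sign-off above 10,000.
policy = AgentPolicy(allowed_actions=frozenset({"transfer"}),
                     approval_threshold=10_000.0)

small = policy.check("transfer", 500.0)
large = policy.check("transfer", 250_000.0)
large_ok = policy.check("transfer", 250_000.0, approved_by="treasury-oncall")
out_of_scope = policy.check("delete_account", 0.0)
policy.enabled = False               # an operator flips the kill switch
after_kill = policy.check("transfer", 500.0)
```

The ordering of the checks is the design choice worth noticing: the kill switch is evaluated first so that nothing, including an approved action, gets through once it is off. A real deployment would keep the switch and the log outside the agent's own process; this sketch keeps them in memory only to stay short.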

The work that never ends

Pieces most leaders never budget for, because they don't look like projects. They're permanent functions:

Permissions narrowing. As you learn what the agent actually does in production, you should be tightening scope, not loosening it. Most teams loosen because tightening creates friction. That's how Step Finance happens.
Edge case discovery. Production reveals failure modes nobody designed against. Air Canada's weather disruption was one. Each new one has to be cataloged, tested against, and fed back into the agent's behavior. Permanent loop.
Trust-tier graduation. Agents earn additional scope by demonstrating reliability — and lose it when they fail. A ratchet, not a binary switch. Most enterprises don't have a graduation framework at all; they have a launch date.
Drift monitoring. The model changes. The data distribution changes. User behavior changes. Performance silently degrades unless someone is watching — what IEEE Spectrum recently called the "quiet failure" problem. The agent that worked in February isn't the agent you have in August.
Incident retro discipline. Every production incident is data. The question is whether your organization has the muscle to turn it into the next iteration — or whether incidents get filed, blamed, and forgotten.
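Two of the functions above, trust-tier graduation and drift monitoring, can be sketched in a few lines. This is a rough illustration under stated assumptions, not a reference framework: the tier names, the TrustTracker class, the promotion streak, the 0.98 baseline, and the 0.05 tolerance are all hypothetical values picked for the example.

```python
from collections import deque
from dataclasses import dataclass, field

# Hypothetical tiers, ordered from least to most scope.
TIERS = ["read_only", "suggest", "act_with_approval", "act_autonomously"]


@dataclass
class TrustTracker:
    promote_after: int = 50          # consecutive successes needed to move up a tier
    baseline_success: float = 0.98   # success rate measured at launch (assumed)
    drift_tolerance: float = 0.05    # allowed drop below baseline before flagging
    tier: int = 0                    # index into TIERS; agents start at the bottom
    streak: int = 0
    recent: deque = field(default_factory=lambda: deque(maxlen=100))

    def record(self, success: bool) -> None:
        """Ratchet: promote one tier after a clean streak, demote one on failure."""
        self.recent.append(success)
        if success:
            self.streak += 1
            if self.streak >= self.promote_after and self.tier < len(TIERS) - 1:
                self.tier += 1
                self.streak = 0
        else:
            self.tier = max(0, self.tier - 1)
            self.streak = 0

    def drifting(self) -> bool:
        """Flag drift once a full window falls below baseline minus tolerance."""
        if len(self.recent) < self.recent.maxlen:
            return False             # not enough data to judge yet
        rate = sum(self.recent) / len(self.recent)
        return rate < self.baseline_success - self.drift_tolerance


# Assumed example: a short streak requirement so the ratchet is easy to see.
tracker = TrustTracker(promote_after=3)
for _ in range(3):
    tracker.record(True)
tier_after_streak = TIERS[tracker.tier]    # promoted one step
tracker.record(False)
tier_after_failure = TIERS[tracker.tier]   # demoted one step
```

What counts as a "success" is the hard part and is left undefined here on purpose; in practice that definition comes out of the incident retros and edge-case catalog described above.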

What this changes for C-suite sponsorship

Three things to do this quarter if you're sponsoring an agentic AI program:

Budget the in-between explicitly. Demo-to-production for a non-trivial agent is six to eighteen months, not weeks. If your roadmap shows the latter, you're funding a future incident.
Measure progress by trust graduation, not launch dates. Ship dates reward velocity, which is what produces the Step Finance pattern. Trust graduation rewards agents that have earned their scope.
Build the iterative muscle as a permanent function. Incident retros, narrowing reviews, drift monitoring — these need owners with continuity, not a project team that disbands after launch.

The gap between demo and production is where trust in agentic AI is built — or where the next headline gets written. The companies that price the in-between honestly are the ones whose agents are still running in eighteen months.

Sources & further reading:

OWASP GenAI Security Project, "OWASP Top 10 for Agentic Applications 2026," December 9, 2025. https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/

NIST, "AI Risk Management Framework" and "AI Agent Standards Initiative" (CAISI), February 2026. https://www.nist.gov/itl/ai-risk-management-framework

Cloud Security Alliance & Token Security, "Autonomous but Not Controlled: AI Agent Incidents Now Common in Enterprises," April 21, 2026.

Cloud Security Alliance Labs, "Agentic NIST AI RMF Profile," April 2026. https://labs.cloudsecurityalliance.org/agentic/agentic-nist-ai-rmf-profile-v1/

Gartner, "Gartner Says Applying Uniform Governance Across AI Agents Will Lead to Enterprise AI Agent Failure," May 26, 2026. https://www.gartner.com/en/newsroom/press-releases/2026-05-26-gartner-says-applying-uniform-governance-across-ai-agents-will-lead-to-enterprise-ai-agent-failure

IEEE Spectrum, "How Quiet Failures Are Redefining AI Reliability," April 22, 2026. https://spectrum.ieee.org/ai-reliability

ISO/IEC TS 22440, "Artificial Intelligence — Functional Safety and AI Systems," 2025.