We Built 78 AI Agents. Here's What Actually Broke.

Feb 16, 2026

Everyone talks about multi-agent systems in theory. We built one with 78 specialized agents processing 660,000 candidates at CNO Financial, a Fortune 500 insurance company.

This is what we learned about agent coordination at enterprise scale.

Most companies building AI for hiring use one model. One prompt. One API call to GPT-4 or Claude. They send a resume, get a score back, move on.

That approach works fine until you hit enterprise volume and enterprise complexity.

At CNO Financial, we process 1.5 million applications annually across 215 locations. Different roles (sales agents, underwriters, claims processors, actuaries). Different regions. Different hiring managers with different preferences. Different performance metrics.

One model cannot handle this complexity well. Not because the model isn't smart enough—because the task is too broad.

So we built something different: 78 specialized agents orchestrated across 25 layers, each handling a specific part of the hiring workflow.

This is the architecture that processes 100% of CNO's candidates and predicts top performers with 80% accuracy, validated against their actual Q1-Q3 2025 performance reviews.

Here's how it works, why we built it this way, and what design decisions matter when you're operating at Fortune 500 scale.

Why 78 Agents Instead of One Model

The first question everyone asks: "Why not just use one really good model?"

Because specialization beats generalization at scale.

Here's the analogy: you wouldn't hire one person to handle recruiting, interviewing, sourcing, compliance, and reporting. You hire specialists. Each person gets really good at their specific function.

Same principle applies to AI agents.

The Three Agent Types

Our architecture uses three categories of specialized agents:

Screening Agents: Evaluate every candidate against the Success Profile for that specific role. Not "is this person qualified generally," but "does this person match the patterns of top performers in this exact role at this specific company?"

At CNO, we have separate screening agents for:

  • Insurance sales agents

  • Claims processors

  • Underwriters

  • Actuaries

  • Customer service representatives

Each agent is fine-tuned on the top performers in that specific role. The patterns that predict success for a sales agent don't predict success for an actuary. Different jobs require different agents.

Interview Agents: Conduct structured assessments based on what actually predicts success at that company. Not generic interview questions from the internet, but questions derived from analyzing actual top performer behavior.

These agents generate full transcripts and audit trails. Every question asked. Every answer given. Every follow-up. All logged for compliance review.

At CNO, this means legal can review any hiring decision and see exactly what was asked, how the candidate responded, and why that response contributed to their final score.

Sourcing Agents: Identify external candidates matching Success Profile patterns. These agents don't just keyword-match LinkedIn profiles. They evaluate whether someone's career trajectory, communication patterns, and demonstrated skills align with what makes people successful at CNO specifically.

Traditional sourcing relies on Boolean search: "insurance AND sales AND 5 years." That finds people with credentials.

Our sourcing agents find people with performance patterns, even if they lack traditional credentials. This is how CNO discovered that top performers in insurance sales often came from hospitality, retail, and teaching—not from other insurance companies.
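To make the contrast concrete, here's a minimal, purely illustrative sketch: a Boolean credential filter rejects a candidate from retail, while a pattern score built from behavioral signals surfaces them. The fields, weights, and signal names below are invented for illustration, not the actual Success Profile schema.

```python
# Illustrative contrast: credential filter vs. behavioral pattern score.
# All fields and weights are hypothetical, not the production schema.

profile = {"title": "store manager", "years": 6,
           "signals": {"objection_handling": 0.9, "relationship_building": 0.8}}

def boolean_match(p: dict) -> bool:
    """Traditional sourcing: 'insurance AND sales AND 5 years'."""
    return "insurance" in p["title"] and "sales" in p["title"] and p["years"] >= 5

def pattern_score(p: dict, success_pattern: dict[str, float]) -> float:
    """Pattern-based sourcing: weight behavioral signals against the Success Profile."""
    return sum(p["signals"].get(k, 0.0) * w for k, w in success_pattern.items())

success_pattern = {"objection_handling": 0.6, "relationship_building": 0.4}
print(boolean_match(profile))                              # False: no insurance credential
print(round(pattern_score(profile, success_pattern), 2))   # 0.86: strong behavioral match
```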

Why Specialization Matters

When you use one model for everything, you're making tradeoffs.

The model has to be good at screening resumes, good at evaluating interview responses, good at sourcing passive candidates, good at understanding industry-specific terminology, good at comparing candidates across different roles.

Being "pretty good" at ten different tasks means you're not excellent at any of them.

When you deploy specialized agents, each one gets really good at its specific function. The screening agent for insurance sales agents becomes expert-level at identifying sales talent patterns. The interview agent for technical roles becomes expert-level at evaluating technical depth.

This is why we achieve 80% accuracy predicting top performers at CNO. Generic models max out around 20-25% on hiring predictions. The difference is specialization.

The Small Model Strategy

Here's the counterintuitive part: each agent uses relatively small models. 7B-20B parameters.

For context, GPT-4 is rumored to be 1.7 trillion parameters. Claude 3.5 Sonnet is likely in the hundreds of billions. Our agents use models that are 10-100× smaller.

This isn't a limitation. It's a deliberate choice.

Why Small Models Win at Enterprise Scale

1. VPC deployment constraints

Enterprise infrastructure typically cannot support GPU clusters needed for 70B+ parameter models. CNO's AWS environment can run 7B-20B parameter models on CPU with acceptable latency.

When you deploy in customer infrastructure (which is why legal approves us), you have to work within their hardware constraints. Small models fit those constraints.

2. CPU inference

Small models can run on CPU with sub-second latency. Large models cannot. This keeps hardware requirements manageable and costs predictable.

At CNO's scale (1.5M applications annually), the difference between "needs GPU clusters" and "runs on standard CPU" is millions in infrastructure cost.

3. Fine-tuning feasibility

Retraining a 7B model on new data every quarter is practical. The compute requirement is measured in hours on standard cloud infrastructure.

Retraining a 70B+ model every quarter would require weeks of compute time on expensive GPU clusters. It's not economically viable for continuous learning.

4. Specialization over scale

A 7B model fine-tuned on 10,000 examples of successful insurance sales agents at CNO outperforms a 70B general-purpose model on that specific task.

The specialization compensates for the smaller parameter count. You're not trying to be good at everything. You're trying to be excellent at one thing.

The Routed Adapter Pattern

Here's how we actually deploy these specialized models:

Instead of 78 completely separate models, we use a base model with routed adapters.

Think of it like this: the base model provides general language understanding. The adapters provide specialized knowledge.

When a request comes in—"evaluate this candidate for an insurance sales role at CNO Financial"—the system routes through three adapters:

  1. Industry adapter: Insurance-specific patterns

  2. Role adapter: Sales-specific patterns

  3. Customer adapter: CNO-specific patterns

The combination produces better accuracy than a general-purpose model for this specific task.

This architecture is why we can deploy 78 "agents" without actually running 78 separate large models. We route requests through specialized adapters based on task type.

According to the technical whitepaper, this routed adapter pattern is what makes the system economically viable at enterprise scale. You get specialization without the compute cost of running 78 independent large models.
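Conceptually, the routing step looks something like the sketch below. This is a simplified illustration, not the production code: the class and adapter names are hypothetical, and the real system layers the selected adapters onto the shared base model.

```python
from dataclasses import dataclass

# Hypothetical sketch of the routed adapter pattern described above.
# Names (AdapterRouter, ScreeningRequest) are illustrative only.

@dataclass
class ScreeningRequest:
    industry: str   # e.g. "insurance"
    role: str       # e.g. "sales"
    customer: str   # e.g. "cno_financial"

class AdapterRouter:
    """Selects the adapter stack applied on top of a shared base model."""

    def __init__(self, available_adapters: set[str]):
        self.available = available_adapters

    def route(self, req: ScreeningRequest) -> list[str]:
        # One adapter per level of specialization; fall back to the
        # base model when a level has no trained adapter yet.
        candidates = [
            f"industry/{req.industry}",
            f"role/{req.role}",
            f"customer/{req.customer}",
        ]
        return [name for name in candidates if name in self.available]

router = AdapterRouter({
    "industry/insurance", "role/sales", "customer/cno_financial",
})
print(router.route(ScreeningRequest("insurance", "sales", "cno_financial")))
# ['industry/insurance', 'role/sales', 'customer/cno_financial']
```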

How Agents Coordinate: The Orchestration Layer

Having 78 specialized agents is useful. Getting them to work together without chaos is hard.

This is where orchestration matters.

Kubernetes for Agent Management

We use Kubernetes—the industry-standard container orchestration system—to manage agent coordination.

Here's what that means in practice:

Container isolation: Each agent type runs in its own container. If one agent crashes, the others keep running. Failures are isolated.

Service routing: When a request comes in ("screen this candidate"), Kubernetes routes it to the appropriate agent type based on the request. Screening requests go to screening agents. Interview requests go to interview agents.

Health checks: Kubernetes continuously monitors each agent. If an agent stops responding, Kubernetes automatically restarts it. No human intervention required.

Load balancing: When 1,000 candidates need screening simultaneously, Kubernetes distributes the load across multiple screening agent containers. No single agent gets overwhelmed.

This isn't custom orchestration we invented. It's standard Kubernetes used as designed: container orchestration with service routing and self-healing through native health checks and pod restart mechanisms.
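As a rough illustration of what "standard Kubernetes used as designed" means here, this is a minimal Deployment manifest for one agent type, expressed as a Python dict. The names, image, port, and replica count are assumptions for illustration, not CNO's actual configuration.

```python
import json

# Illustrative-only manifest for one agent type. The liveness probe gives
# self-healing (failed pods restart automatically); replicas give load
# balancing across multiple instances of the same agent.
screening_agent_deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "screening-agent-insurance-sales"},
    "spec": {
        "replicas": 10,  # Kubernetes spreads screening requests across these pods
        "selector": {"matchLabels": {"app": "screening-agent-insurance-sales"}},
        "template": {
            "metadata": {"labels": {"app": "screening-agent-insurance-sales"}},
            "spec": {
                "containers": [{
                    "name": "agent",
                    "image": "registry.internal/screening-agent:latest",  # hypothetical image
                    # If the agent stops responding, Kubernetes restarts the pod.
                    "livenessProbe": {
                        "httpGet": {"path": "/healthz", "port": 8080},
                        "initialDelaySeconds": 30,
                        "periodSeconds": 10,
                    },
                }],
            },
        },
    },
}

print(json.dumps(screening_agent_deployment, indent=2))
```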

Why does this matter? Because reliability at enterprise scale requires boring technology.

CNO processes 1.5 million applications per year. The system cannot go down. Using proven orchestration infrastructure (Kubernetes) instead of building custom coordination logic means we inherit decades of reliability engineering.

The EVP Hierarchy: Learning Without Leaking Data

Here's the hard problem with multi-agent systems at enterprise scale: how do you enable learning across customers without exposing sensitive data?

CNO's candidate data cannot leave their VPC. Neither can data from any other customer. Legal blocks any architecture where Customer A's data could be seen by Customer B.

But ideally, patterns learned at one insurance company should improve predictions at other insurance companies. Not the raw data—the patterns.

We solved this with the EVP (Evals and Patterns) hierarchy.

What EVPs Are

EVP stands for Evals and Patterns. It's the atomic unit of intelligence in our system.

Every hiring decision, every outcome validation, every exception generates EVPs. These are not raw data points. They're abstracted patterns:

"Candidates with trait pattern X tend to succeed in role Y."

Not: "John Smith at CNO Financial scored 87 and became a top performer."

The difference is crucial. Patterns can flow between deployments. Data cannot.

How the Hierarchy Works

EVPs flow through four levels:

Level 1 - Customer EVPs: Each deployment generates patterns specific to that company. What predicts success for insurance agents at CNO specifically. These patterns stay in CNO's VPC. They never leave.

Level 2 - Industry EVPs: Patterns from multiple insurers aggregate into industry-level intelligence. Not the underlying data, just the patterns.

Example: "Candidates in insurance sales with communication pattern X have 73% success rate across 5 insurers."

This aggregated pattern can improve models at a new insurance customer. But CNO's actual candidate data never left CNO's environment.

Level 3 - Sector EVPs: Industry patterns combine into sector intelligence. Insurance, banking, and fintech patterns aggregate into financial services patterns.

Level 4 - Master EVP: Cross-sector patterns that transfer across all regulated industries.

This hierarchy means each new customer benefits from accumulated intelligence while their data remains completely isolated.

When a new insurance company deploys our system, they get a better Day 1 model because patterns from existing insurance deployments (including CNO) improve baseline accuracy. But CNO's actual candidate data, employee data, performance reviews—none of that ever left CNO's VPC.
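Here's a hedged sketch of what an EVP record and the customer-to-industry aggregation step could look like. The field names and numbers are illustrative assumptions, not the production schema; the point is that only abstracted statistics cross the VPC boundary.

```python
from dataclasses import dataclass

# Hypothetical EVP record based on the description above; not the real schema.

@dataclass(frozen=True)
class EVP:
    trait_pattern: str     # e.g. "communication pattern X"
    role: str              # e.g. "insurance sales"
    success_rate: float    # fraction of hires with this trait who became top performers
    sample_size: int       # number of validated outcomes behind the pattern
    level: str             # "customer" | "industry" | "sector" | "master"

def aggregate_to_industry(customer_evps: list[EVP]) -> EVP:
    """Combine customer-level patterns into one industry-level pattern.

    Only abstracted statistics are combined; no candidate or employee
    records cross the VPC boundary.
    """
    total = sum(e.sample_size for e in customer_evps)
    weighted_rate = sum(e.success_rate * e.sample_size for e in customer_evps) / total
    first = customer_evps[0]
    return EVP(first.trait_pattern, first.role, round(weighted_rate, 3), total, "industry")

cno = EVP("communication pattern X", "insurance sales", 0.78, 1200, "customer")
other = EVP("communication pattern X", "insurance sales", 0.70, 800, "customer")
print(aggregate_to_industry([cno, other]))
# success_rate=0.748 over 2000 outcomes, level='industry'
```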

The Continuous Learning Loop

Static AI models become outdated fast. Hiring practices evolve. Role requirements change. What predicted success last year might not predict success this year.

Our architecture includes a quarterly learning cycle that keeps models current.

How Learning Works

Quarter 1: Make predictions. The system scores candidates and predicts who will become top performers. CNO hires based partly on these predictions.

Quarter 2: Collect outcomes. The first cohort of hires completes onboarding. Performance reviews start coming in. HRIS data shows who's actually succeeding.

Quarter 3: Validate predictions. We compare predictions against actual outcomes. Which of the candidates we scored 90+ became top performers? Which ones didn't? What patterns did we miss?

Quarter 4: Retrain models. Based on validated outcomes, models retrain on the new data, but only after human review approves the proposed changes.

This is how we achieve model improvement over time. At CNO, the models are 40% more accurate after 6 months of this continuous learning cycle compared to Day 1 deployment.
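The validation step in Quarter 3 amounts to comparing high-scoring predictions against HRIS ground truth. A minimal sketch, assuming predictions and outcomes are keyed by an anonymized candidate id (names and the threshold are illustrative):

```python
# Illustrative validation step: what fraction of high-scored hires
# actually became top performers according to performance reviews?

def validate_quarter(predictions: dict[str, int], outcomes: dict[str, bool],
                     threshold: int = 90) -> float:
    """Hit rate for candidates scored at or above the threshold who were hired."""
    flagged = [cid for cid, score in predictions.items()
               if score >= threshold and cid in outcomes]
    if not flagged:
        return 0.0
    hits = sum(outcomes[cid] for cid in flagged)
    return hits / len(flagged)

preds = {"c1": 94, "c2": 91, "c3": 72}
actual = {"c1": True, "c2": False, "c3": True}   # from HRIS performance reviews
print(validate_quarter(preds, actual))  # 0.5 -> feeds the retraining decision
```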

Why Quarterly Cadence

The quarterly timing isn't arbitrary. It aligns with enterprise performance review cycles.

Most companies conduct performance reviews quarterly or semi-annually. This is when ground truth becomes available. You don't know if a hire was successful until they've been in role long enough to be evaluated.

Trying to retrain monthly would mean training on insufficient data. Waiting annually means missing learning opportunities. Quarterly is the sweet spot.

Targeted Updates, Not Full Retraining

Here's an important architectural detail: we don't retrain the entire system every quarter.

If the "insurance sales" success profile needs updating based on new outcomes, only that adapter gets updated. The "underwriter" profile stays unchanged. The "claims processor" profile stays unchanged.

This targeted update approach prevents catastrophic forgetting—the problem where updating a model on new data makes it forget what it learned previously.

By updating only the relevant adapters, we preserve existing knowledge while incorporating new learnings.
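In code terms, a targeted quarterly update might look like the sketch below: only adapters with new validated outcomes get retrained, and everything else stays frozen. The registry layout and the train_adapter stub are hypothetical stand-ins for the real fine-tuning step.

```python
# Hypothetical sketch of targeted adapter updates (not the production pipeline).

def train_adapter(weights: dict, outcomes: list) -> dict:
    """Stand-in for the real fine-tuning step; illustrative only."""
    return {**weights, "examples_seen": weights.get("examples_seen", 0) + len(outcomes)}

def quarterly_update(registry: dict[str, dict],
                     new_outcomes: dict[str, list]) -> list[str]:
    """Retrain only the adapters that received validated outcomes this quarter."""
    updated = []
    for name, outcomes in new_outcomes.items():
        if not outcomes:
            continue  # no new ground truth for this role; adapter stays frozen
        registry[name] = train_adapter(registry[name], outcomes)
        updated.append(name)
    return updated

registry = {"role/insurance_sales": {"examples_seen": 10000},
            "role/underwriter": {"examples_seen": 8000}}
print(quarterly_update(registry, {"role/insurance_sales": ["hire_1", "hire_2"],
                                  "role/underwriter": []}))
# ['role/insurance_sales'] -- the underwriter adapter is untouched
```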

What This Architecture Enables at CNO

After deploying this 78-agent architecture at CNO Financial, here's what changed:

Processing 100% of Candidates

Before: Recruiters manually screened the first 150 applicants per role. That's 1.5% coverage when you're getting 10,000 applications per role.

After: Every candidate gets evaluated by screening agents. 100% coverage. Zero qualified candidates missed due to timing or volume.

CNO had 580,000 unmanaged resumes sitting in their ATS—candidates who applied but were never reviewed. We processed all of them. The system found top performer matches that had been sitting in the database for 6-18 months.

23% of their best potential candidates had applied more than 6 months ago and were never reviewed. Not because they were unqualified. Because they applied after the "first 150" window closed.

80% Prediction Accuracy

Generic AI models (GPT-4, Claude, even Gemini with best prompting) achieve around 20-25% accuracy on hiring predictions according to our whitepaper analysis.

Our fine-tuned, specialized agents achieve 80% accuracy predicting top performers, validated against CNO's Q1-Q3 2025 performance reviews.

The difference is three-fold:

  1. Training data: We train on CNO's actual top performers by integrating with their HRIS. Generic models train on internet text.

  2. Specialization: Each agent is expert-level at one task. Generic models are generalist-level at many tasks.

  3. Continuous learning: Models retrain quarterly on actual outcomes. Generic models are static after initial training.

Decision Traces as Institutional Knowledge

Every hiring decision generates a decision trace: why was this candidate advanced? What evidence supported the decision? How did we resolve disagreements between screening agents and interview agents?

These traces persist in CNO's environment. After 12 months, they become queryable institutional knowledge:

  • "Show me every candidate we hired without insurance experience and how they performed"

  • "Which sourcing channels actually produced top performers?"

  • "When the interview panel was split, which way should we have gone?"

  • "What patterns predict 90-day attrition?"

This is possible because the 78-agent architecture captures the full context at decision time. Single-model approaches generate a score and move on. Multi-agent systems generate scores plus the reasoning that produced them.
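To show what "queryable institutional knowledge" means in practice, here's a toy example of filtering persisted decision traces. The trace fields and values are invented for illustration; the real traces carry far more context.

```python
# Toy decision-trace query; fields are assumptions, not the actual schema.

traces = [
    {"candidate": "a1", "hired": True,  "insurance_experience": False,
     "source": "retail",   "performance_rating": 4.6},
    {"candidate": "a2", "hired": True,  "insurance_experience": True,
     "source": "referral", "performance_rating": 3.1},
]

# "Show me every candidate we hired without insurance experience
#  and how they performed."
no_experience_hires = [t for t in traces
                       if t["hired"] and not t["insurance_experience"]]
for t in no_experience_hires:
    print(t["candidate"], t["source"], t["performance_rating"])
```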

70% Faster Time-to-Hire

CNO's average time-to-hire dropped from 127 days to 38 days after deployment.

This isn't because agents screen faster than humans (though they do). It's because the system eliminates bottlenecks.

Before: Recruiters manually screen 150 resumes → schedule interviews with 10 candidates → hiring managers review 5 finalists → make offer.

After: Screening agents process everyone → recruiters review 15-20 pre-qualified candidates with explainable scores → hiring managers interview top 3-5 → make offer.

The manual screening bottleneck is gone. Recruiters spend time on culture fit and relationship building instead of reading resumes.

Why This Architecture Matters for Other Enterprises

The CNO deployment isn't unique to insurance. The architectural principles transfer to any regulated enterprise dealing with high-volume hiring.

Financial Services: Banks, investment firms, credit card companies processing thousands of applications for analyst, advisor, and operations roles.

FinTech: Stripe, Square, PayPal, Coinbase scaling hiring while maintaining quality bars and regulatory compliance.

Healthcare: Hospital systems, pharmaceutical companies, medical device manufacturers hiring clinical and research talent at scale.

Government: Federal agencies, state governments, municipalities required to document every hiring decision for EEOC compliance.

Any organization where:

  • Application volume exceeds recruiter capacity

  • Legal requires explainable, defensible hiring decisions

  • Multiple role types require different success patterns

  • Data sovereignty blocks traditional AI tools

This architecture solves all four constraints simultaneously.

What We'd Build Differently

If we were starting from scratch today, here's what we'd change:

More Granular Role Specialization

We currently have separate agents for major role categories (sales, underwriting, claims, etc.). But even within "sales," there are subspecialties.

Inside sales (phone-based) requires different success patterns than outside sales (relationship-driven). Enterprise sales (long cycles, complex deals) requires different patterns than transactional sales (high volume, short cycles).

The next iteration would have more granular specialization at the sub-role level. Not just "sales agent" but "inside sales agent" and "enterprise sales agent" with different success profiles.

Earlier Integration of Communication Data

We currently integrate with ATS (candidate data) and HRIS (performance data). The next iteration would integrate earlier with CRM and communication systems.

Call transcripts, email patterns, customer interaction data—this is where behavioral signals live. Top performers communicate differently. They handle objections differently. They build relationships differently.

We have this integration now, but we wish we'd prioritized it from Day 1. The communication data adds 10-15 percentage points to prediction accuracy according to our internal benchmarks.

More Transparent Agent Reasoning

The current system provides explainable scores: "This candidate scores 87/100. Here's why: communication pattern match (92), resilience indicators (85), customer interaction style (81)."

The next iteration would surface agent-level reasoning: "Screening Agent 23 scored this candidate 90 based on X. Interview Agent 12 scored them 78 based on Y. Here's how we reconciled the difference."

This level of transparency would help hiring managers understand not just the final score, but the reasoning that produced it at each stage of the workflow.

The Infrastructure Advantage

Here's why this architecture creates a compounding moat:

Year 1: Deploy 78 agents. Train on existing top performer data. Achieve 80% baseline accuracy. Process 100% of candidates.

Year 2: Continuous learning improves models to 88% accuracy. Four quarters of outcome data validates which patterns actually predict success. Decision traces become queryable precedent.

Year 3: Models trained on 12 quarters of validated outcomes. Accuracy at 92%. System knows not just who to hire but why. Institutional knowledge captured that doesn't exist anywhere else.

Competitors starting today are 3 years behind. They can replicate the architecture. They cannot replicate the training data because it lives inside customer VPCs and never leaves.

The longer you wait, the wider the gap becomes.

What Changes for Recruiting Teams

Deploying this architecture changes what recruiters actually do day-to-day:

Before:

  • Read 150 resumes per role (2-3 days)

  • Manually identify top 10 candidates (gut feel + keyword matching)

  • Schedule screens with 10 candidates (1 week of coordination)

  • Present 3-5 finalists to hiring manager (2-3 weeks elapsed)

After:

  • Review 15-20 pre-screened candidates with scores and evidence (2 hours)

  • Interview top candidates for culture fit and team dynamics (1-2 days)

  • Present 3-5 validated finalists with performance predictions (3-4 days elapsed)

Recruiters stop doing work that AI handles better (resume screening, credential verification, pattern matching). They focus on work humans handle better (culture assessment, relationship building, negotiation, candidate experience).

At CNO, recruiters reported 40% reduction in manual screening time after deployment. That time shifted to higher-value activities: sourcing passive candidates, building talent pipelines, improving candidate experience.

The Technical Reality of Multi-Agent Systems

Here's what we learned building this at production scale:

1. Orchestration is harder than individual agent quality

Getting one agent to work well is relatively straightforward. Getting 78 agents to coordinate without conflicts, race conditions, or cascading failures requires serious infrastructure engineering.

This is why we use Kubernetes. The orchestration problem is solved. We inherit that solution instead of building custom coordination logic.

2. Specialization compounds faster than generalization

A generalist model improves slowly on any specific task. A specialist model improves quickly on its narrow task.

After 6 months of continuous learning, our specialized agents are 40% more accurate at their specific functions. A generalist model would improve maybe 10-15% over the same timeframe.

3. Boring technology wins at enterprise scale

We don't use bleeding-edge research. We use Kubernetes (released 2014), standard container patterns, proven orchestration approaches.

CNO didn't deploy us because we have the most innovative architecture. They deployed us because we have the most reliable architecture. Boring is good when you're processing 1.5 million applications per year.

4. The data matters more than the models

Our models are good but not magical. 7B-20B parameters. Open-source base models (Llama, Mistral). Standard fine-tuning approaches.

What makes the system valuable isn't the model architecture. It's the training data: ground truth on what actually predicts success at specific companies, validated against actual performance outcomes.

Competitors can replicate our architecture. They cannot replicate our training data because it lives inside customer VPCs and legal won't let it leave.

What This Means for the Industry

Hiring at enterprise scale is splitting into two categories:

Category 1: Single-model approaches

One API call to GPT-4. One prompt. Generic predictions based on internet-scraped training data. 20-25% accuracy. Fast to deploy but limited effectiveness.

These approaches work fine for low-stakes hiring or small companies. They don't work at Fortune 500 scale where you need explainability, legal defensibility, and continuous improvement.

Category 2: Multi-agent infrastructure

78 specialized agents. Kubernetes orchestration. Training on actual company performance data. 80%+ accuracy. Takes 4-6 weeks to deploy but compounds in value every quarter.

This is what Fortune 500 companies need. Not faster screening—better predictions. Not generic AI—specialized intelligence trained on their specific success patterns.

The gap between these categories will compound every year. Companies using Category 1 approaches will stay at 20-25% accuracy. Companies using Category 2 approaches will improve to 85%, then 90%, then 95% as models learn from more outcomes.

After 3 years, the accuracy gap will be so wide that switching becomes nearly impossible. You can't catch up. The compound learning advantage is too large.

Why Incumbents Can't Build This

ATS vendors (Workday, Greenhouse, Lever) see candidate flow but not performance outcomes. They can't train on what actually predicts success.

HRIS vendors (Workday HCM, SAP, Oracle) see performance data but not hiring context. They don't know what the candidate pool looked like.

AI recruiting tools (HireVue, Eightfold, Paradox) process data on vendor servers. Legal blocks them for data sovereignty reasons. They can't access performance data because legal won't approve sending it externally.

Foundation model providers (OpenAI, Anthropic) can't access enterprise performance data at all. Legal will never approve it.

We're in the VPC, at decision time, connected to ATS and HRIS simultaneously. We capture the context that produces hiring decisions and the outcomes that validate them.

An observer can tell you what happened. Only a participant can tell you why.

FAQs

How long does it take to train these specialized agents for our company?

Initial training takes 4-6 weeks after deployment.

The system needs three data sources to build accurate Success Profiles:

  1. ATS data: Historical candidates and who got hired (2-3 years of data ideal)

  2. HRIS data: Performance reviews, promotions, manager ratings for current employees

  3. CRM/Communication data: Call transcripts, emails, customer interactions (optional but improves accuracy)

During weeks 1-2, we integrate with these systems inside your VPC. In weeks 3-4, models train on your top performer data. In weeks 5-6, we validate predictions against known outcomes before going live.

The first shortlist is delivered within 72 hours of go-live.

The models then improve continuously through the quarterly learning cycle. After 6 months, accuracy improves by approximately 40% from baseline as models learn from actual hiring outcomes.

Can we add new agent types for custom workflows?

Yes. The architecture is extensible by design.

CNO started with screening, interview, and sourcing agents. After 6 months, they requested a "re-engagement agent" to identify past candidates worth reconsidering.

We deployed the new agent type in 2 weeks. It integrated with the existing orchestration layer (Kubernetes handles the routing). The agent trained on CNO's hiring patterns and started identifying candidates who had applied 6-18 months ago and deserved a second look.

The routed adapter pattern makes adding new capabilities relatively straightforward. You're adding a new specialized function, not rebuilding the entire system.

What happens if one agent type fails? Does the whole system go down?

No. Container isolation means agent failures don't cascade.

Each agent type runs in its own Kubernetes pod. If a screening agent crashes, interview agents and sourcing agents keep running. The orchestration layer automatically restarts failed pods through native health checks.

This isolation is critical at enterprise scale. CNO processes 1.5 million applications per year. The system cannot have single points of failure.

Additionally, Kubernetes load balancing means work gets distributed across multiple instances of each agent type. If you have 10 screening agent pods and one fails, the other 9 handle the load while Kubernetes restarts the failed pod.

Failures are isolated, automatically healed, and don't impact the broader system.

How do you prevent agents from contradicting each other?

Through the orchestration layer and hierarchical scoring.

When different agents evaluate the same candidate, their scores feed into a weighted aggregation model that reconciles differences.

Example: Screening Agent scores candidate 92 (strong pattern match). Interview Agent scores candidate 73 (some yellow flags in responses). Sourcing Agent scores candidate 88 (career trajectory looks excellent).

The system doesn't just average these scores. It evaluates: Which agent has been most accurate historically for this role type? Which signals are most predictive? How should conflicting evidence be weighted?

The final score (let's say 85) comes with an explanation: "Strong resume match and career trajectory, but interview responses showed some areas requiring development. Recommend proceeding with culture fit conversation."

Hiring managers see the nuanced assessment, not just a black-box score. They can review agent-level reasoning and make informed decisions.
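A simplified sketch of that reconciliation step: weight each agent's score by its historical accuracy for the role type instead of taking a plain average. The reliability weights below are made up for illustration.

```python
# Illustrative weighted reconciliation of disagreeing agent scores.

def reconcile(scores: dict[str, float], reliability: dict[str, float]) -> float:
    """Weight each agent's score by how accurate it has been for this role type."""
    total_weight = sum(reliability[a] for a in scores)
    return round(sum(scores[a] * reliability[a] for a in scores) / total_weight, 1)

scores = {"screening": 92, "interview": 73, "sourcing": 88}
reliability = {"screening": 0.9, "interview": 0.5, "sourcing": 0.7}  # per-role history
print(reconcile(scores, reliability))  # 86.1, vs. a plain average of 84.3
```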

See what we're building: Nodes is reimagining enterprise hiring. We'd love to talk.
