From AUC 0.647 to 0.735: How Multi-System Data Fusion Improves Hiring Prediction
In a study of 10,765 hires, personality assessment was the strongest single predictor of production, reaching an AUC of 0.647 on its own among the agents who had personality data. Fusing that personality signal with behavioral scoring and applicant tracking data raised the model to AUC 0.735 on the same sample. ATS keywords alone reached only 0.558. No single system is sufficient by itself. The signals carry complementary information, and connecting them produces predictive power that none of them reaches in isolation.
Source: "Decision Traces," Saad Bin Shafiq, NODES, 2026. Model comparison on the retroactive cohort, with personality results on the 229 agents who had Predictive Index data. Read it on arXiv.
What the model compared
The team built regularized logistic regression models from each system's features and their combinations, using 5-fold cross-validation with all preprocessing fit inside each training fold to prevent leakage.
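The leakage point is easy to get wrong, so here is a minimal sketch of what "preprocessing fit inside each training fold" means. It is not the study's code: the function names (`k_fold_indices`, `fit_scaler`, `apply_scaler`, `scaled_folds`) are hypothetical, standardization stands in for whatever preprocessing was actually used, and the logistic regression step is left out. The only thing it illustrates is that scaling parameters come from the training rows of a fold and are then reused, unchanged, on that fold's held-out rows.

```python
from statistics import mean, pstdev


def k_fold_indices(n, k=5):
    """Split range(n) into k interleaved folds; return (train_idx, test_idx) pairs."""
    folds = [list(range(i, n, k)) for i in range(k)]
    pairs = []
    for i in range(k):
        train_idx = [j for fold in folds[:i] + folds[i + 1:] for j in fold]
        pairs.append((train_idx, folds[i]))
    return pairs


def fit_scaler(values):
    """Learn mean and standard deviation from training values only."""
    sigma = pstdev(values) or 1.0  # guard against a constant feature
    return mean(values), sigma


def apply_scaler(values, params):
    """Standardize values with parameters learned elsewhere (the training fold)."""
    mu, sigma = params
    return [(v - mu) / sigma for v in values]


def scaled_folds(feature, k=5):
    """Yield-style helper: standardize each fold without looking at held-out rows."""
    result = []
    for train_idx, test_idx in k_fold_indices(len(feature), k):
        params = fit_scaler([feature[i] for i in train_idx])  # training rows only
        train = apply_scaler([feature[i] for i in train_idx], params)
        test = apply_scaler([feature[i] for i in test_idx], params)  # reuse, do not refit
        result.append((train, test))
    return result
```

Fitting the scaler once on the full dataset before splitting would let held-out rows influence the training features, which tends to inflate cross-validated AUC.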
| Model | Features | AUC |
|---|---|---|
| Composite score only | fit score | 0.518 |
| Behavioral dimensions | six behavioral traits | 0.550 |
| ATS keywords | keywords plus source channel | 0.558 |
| All non-personality features | score, traits, ATS | 0.575 |
| Personality type only | PI type | 0.647 |
| Full fusion | all features plus PI | 0.735 |
Personality was the strongest single signal
Predictive Index type alone (0.647) carried more signal than every non-personality feature combined (0.575). Production rates varied widely by type, from Captain at 36.8% down to Promoter at 0.0%, a wider spread than for any other single variable in the dataset.
Read the 0.735 correctly
This is the part most vendors get wrong, so it is worth stating plainly. The 0.735 figure is cross-validated on the 229 agents who had personality data. It is a research-sample result, not a claim about live production accuracy. The deployed behavioral score, measured as a binary classifier on 714 agents, has an AUC of 0.57. That lower number is not a weakness. It reflects how the score actually works.
The score predicts who benefits from speed, not just who produces
The behavioral score is a moderator, not a classifier. It does not mainly predict who will produce. It predicts who converts a fast ramp into production. High-scored agents captured about $114 per day of speed acceleration, roughly 2.8 times the $41 per day captured by low-scored agents. A traditional accuracy metric misses this entirely, because the score's value lives in the interaction between the candidate and the conditions, not in a single yes-or-no prediction. See the speed-to-production findings.
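To make the moderator idea concrete, here is a small sketch built on the two figures quoted above. The constants come from the article; the function names and the linear form of `value_of_speed` are assumptions for illustration, and the sketch reads "per day of speed acceleration" as dollars per day of ramp time saved, which the source does not spell out.

```python
# Figures quoted in the article (dollars captured per day of speed acceleration).
HIGH_SCORE_CAPTURE = 114.0
LOW_SCORE_CAPTURE = 41.0


def capture_ratio(high=HIGH_SCORE_CAPTURE, low=LOW_SCORE_CAPTURE):
    """How many times more value high-scored agents capture per day of speed."""
    return high / low


def value_of_speed(days_faster, high_scored):
    """Hypothetical linear model: the score changes the slope, not the outcome directly.

    The score acts as a moderator because it sets how much each day of
    faster ramp is worth, rather than predicting production on its own.
    """
    slope = HIGH_SCORE_CAPTURE if high_scored else LOW_SCORE_CAPTURE
    return slope * days_faster


if __name__ == "__main__":
    print(round(capture_ratio(), 1))
    print(value_of_speed(10, high_scored=True), value_of_speed(10, high_scored=False))
```

With no speed-up at all (`days_faster=0`) both groups get the same value under this toy model, which is why a yes-or-no classification metric can look weak even when the score is economically useful.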
What this means
You do not need demographic or geographic data to reach 0.735. Personality, behavioral, and ATS features, fused together, get there on the 229-agent sample. Without personality type, the remaining features combined reach only 0.575. Much of that signal stays invisible until the systems are connected, which is the entire argument for a decision trace. See how.
Frequently asked questions
What is a good AUC for a hiring model? Context matters more than a single threshold. In this study, keyword screening reached 0.558, personality type 0.647, and full multi-system fusion 0.735 on the evaluable sample.
Does AUC 0.735 mean the system is 73.5% accurate? No. AUC is a ranking measure, not an accuracy percentage, and 0.735 was measured on a 229-agent sample with personality data. The deployed score's binary AUC is 0.57.
Why is the deployed AUC lower than the research AUC? Two reasons. The research figure used personality data available for a subset, and hiring managers already used scores to screen, which compresses the range you can measure. The deployed score's main value is moderating the economics of speed.
What predicted production best? Personality type was the strongest single predictor. Fusing it with behavioral and ATS features predicted better than any system on its own.
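The FAQ's point that AUC is a ranking measure, not an accuracy percentage, can be shown with a few lines of code. This is a generic sketch using the standard pairwise definition of AUC (the probability that a randomly chosen positive case is scored above a randomly chosen negative one, with ties counted as half). The function names and the toy data are illustrative and are not taken from the study.

```python
def auc(scores, labels):
    """Pairwise AUC: share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(scores, labels, threshold):
    """Share of cases classified correctly when scores >= threshold are called positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


if __name__ == "__main__":
    toy_scores = [0.2, 0.4, 0.6, 0.8]   # illustrative model outputs
    toy_labels = [0, 1, 0, 1]           # 1 = produced, 0 = did not
    print(auc(toy_scores, toy_labels))
    print(accuracy(toy_scores, toy_labels, 0.5), accuracy(toy_scores, toy_labels, 0.3))
```

On the toy data the AUC stays the same if every score is divided by ten, because the ordering is unchanged, while accuracy moves with whichever threshold you pick. That is why 0.735 should not be read as "73.5% accurate".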
Related reading
- The speed-to-production constant
- What 10,765 hires revealed about resume keywords
- Decision traces, explained
See what fusing your own systems would surface. Book a 30-minute walkthrough.