The Reliability Prophet: Sayash Kapoor and the Quest to Save AI from Its Own Promises
- Hive Research Institute
- Jul 16
- 12 min read
A Biography of the Princeton PhD Candidate Who Exposed Why Artificial Intelligence Fails When It Matters Most
The Moment of Truth
The cursor blinked on Sayash Kapoor’s laptop screen as he prepared to upload the research that would destroy a twenty-billion-dollar myth. Outside his Princeton office window, autumn leaves drifted past in the September twilight, but inside the small room cluttered with coffee cups and printouts of failed experiments, time seemed suspended. For months, the 25-year-old PhD candidate had been testing artificial intelligence agents that Silicon Valley claimed could revolutionize scientific research—systems so sophisticated they could read academic papers, understand complex methodologies, and reproduce experimental results automatically.
The numbers on his screen told a different story. Even when given the exact code and data from published papers, these supposedly revolutionary AI agents failed more than 60% of the time. Worse yet, they often failed silently, producing results that looked correct but were fundamentally flawed—the kind of errors that could mislead researchers for years.
Kapoor’s finger hovered over the “submit” button. He knew that publishing these results would make him enemies in an industry that had built its reputation on promises of AI capabilities that his research proved illusory. But he also knew that somewhere, a researcher was trusting one of these systems with work that could affect human lives, and that responsibility weighed heavier than any potential career consequences.
When he finally clicked submit, sending his CORE-Bench study into the world, Kapoor had accomplished something that no amount of venture capital or marketing budgets could achieve: he had told the truth about artificial intelligence. But the seeds of this moment had been planted years earlier, in the humid computer labs of IIT Kanpur, where a curious undergraduate first noticed that the most impressive research papers often contained the most unreproducible results. His journey from skeptical student to AI’s most important truth-teller would parallel the broader transformation of artificial intelligence from research curiosity to industrial force—and illuminate why the most crucial breakthroughs often come not from building more powerful systems, but from understanding why existing ones fail.
The Making of a Digital Detective
The transformation of Sayash Kapoor from curious student to AI’s most relentless truth-teller began in 2015 in the sweltering computer labs of the Indian Institute of Technology Kanpur. While his classmates raced to optimize machine learning models for the highest possible accuracy scores, Kapoor found himself drawn to the failures—the edge cases where sophisticated algorithms would collapse into nonsensical outputs, the promising research that couldn’t be replicated, the gap between laboratory perfection and real-world chaos.
This obsession with failure was not mere academic contrarianism. During his undergraduate research, Kapoor had experienced the frustration of spending weeks trying to reproduce results from prestigious papers, only to discover that crucial implementation details had been omitted or that the datasets weren’t quite what they seemed. Each failed reproduction attempt deepened his conviction that the AI research community had developed a dangerous blind spot: an addiction to impressive demonstrations that ignored the mundane but critical question of whether systems actually worked reliably.
The intellectual seeds of his future rebellion were planted during late-night debugging sessions, when Kapoor would stare at code that should have worked but didn’t, trying to understand why systems that performed brilliantly on test data would fail catastrophically when confronted with slightly different inputs. While other students learned to accept such failures as inevitable quirks of machine learning, Kapoor began to see them as symptoms of a deeper methodological problem.
When he arrived at Princeton University in 2019 to begin doctoral work under Arvind Narayanan at the Center for Information Technology Policy, Kapoor brought with him a reputation for asking uncomfortable questions about AI research claims. Narayanan, himself a veteran skeptic of technological hype, recognized in his new student someone who combined technical sophistication with moral clarity—a rare combination in a field increasingly dominated by corporate interests and venture capital pressures.
To Kapoor, the field’s evaluation culture was as if everyone had agreed to judge race cars solely on their top speed while ignoring reliability, fuel efficiency, or whether they could actually complete a full race without breaking down. This systematic blindness to real-world performance would become the central obsession of his academic career and the foundation for his emergence as one of the field’s most influential voices.
The Anatomy of Deception
The revelation that would define Kapoor’s career came not through a dramatic eureka moment, but through the accumulation of evidence so methodical it bordered on the obsessive. Working from a cramped office that smelled perpetually of instant coffee and determination, Kapoor and his fellow Princeton researchers began what they privately called “reality audits”—taking AI systems that had earned glowing reviews in prestigious journals and subjecting them to the kind of messiness they would encounter outside the laboratory.
The process was tedious and often disheartening. Day after day, Kapoor would feed supposedly sophisticated AI agents the same data and code that human researchers had used to generate published results. The systems would churn for hours, producing outputs that often looked impressive at first glance. But when Kapoor examined the results closely—really closely, with the patience of someone who had learned not to trust first impressions—the flaws became apparent.
A system that claimed 90% accuracy on a benchmark would fail more than half the time when asked to work with real scientific data. An agent that could navigate toy programming environments would become hopelessly confused when confronted with the actual software tools that researchers used daily. Most troubling of all, these failures were often subtle—not the obvious crashes that would prompt immediate investigation, but the kind of quiet mistakes that could mislead researchers for months before being discovered.
The pattern was consistent and devastating: the evaluation methodologies that the AI research community used to assess these systems were not just inadequate but actively misleading. It was as if the entire field had agreed to evaluate surgical robots based solely on their ability to make precise incisions in laboratory conditions, while ignoring whether they could actually perform surgery on living patients.
Through this painstaking work, Kapoor arrived at the insight that would reshape his field: the capability-reliability gap was not a temporary limitation that could be solved with better engineering, but a fundamental feature of how AI systems worked. Unlike traditional software, which either functioned correctly or failed obviously, AI agents inhabited a probabilistic middle ground where success on one attempt provided no guarantee of success on the next.
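A rough back-of-the-envelope illustration (a hypothetical sketch, not a calculation from Kapoor’s papers) shows why that probabilistic middle ground is so unforgiving: when an agent must chain many steps together, per-step accuracy badly overstates end-to-end reliability, at least under the simplifying assumption that steps fail independently.

```python
# Hypothetical illustration: per-step accuracy vs. end-to-end reliability.

def chained_success_rate(per_step_success: float, num_steps: int) -> float:
    """Probability that every step of an n-step workflow succeeds,
    assuming steps fail independently (a simplifying assumption)."""
    return per_step_success ** num_steps

# An agent that is "90% reliable" per step finishes a 10-step task only about a third of the time.
print(f"10-step completion rate: {chained_success_rate(0.90, 10):.2f}")  # ~0.35

# To reach 95% end-to-end reliability over 10 steps, each step must succeed ~99.5% of the time.
print(f"required per-step rate: {0.95 ** (1 / 10):.4f}")  # ~0.9949
```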
As these failures mounted—DoNotPay’s legal chatbot drawing $193,000 in FTC fines for false promises, LexisNexis’s supposedly “hallucination-free” research tools showing 17% error rates in independent Stanford studies, Sakana AI’s research automation claims crumbling under scrutiny—Kapoor began to see his work not as abstract academic research but as a form of consumer protection for the age of artificial intelligence.
The Architecture of Truth
In a field that had grown accustomed to measuring progress through incremental improvements on standardized tests, Kapoor’s approach was revolutionary in its simplicity: instead of asking whether AI could achieve human-level performance on artificial tasks, he asked whether it could be trusted to do real work reliably. This shift in perspective—from capability to dependability—would prove to be his most important intellectual contribution.
The technical framework that emerged from this philosophy, CORE-Bench, arrived in September 2024 like a splash of cold water on Silicon Valley’s fevered dreams of AI automation. Rather than testing agents on the kind of toy problems that populate academic benchmarks, Kapoor and his team challenged them to reproduce actual scientific research from 90 published papers spanning computer science, social science, and medicine.
The benchmark was elegant in its cruelty: it provided AI agents with the exact same code and data that human researchers had used to generate their published results. If artificial intelligence was truly capable of automating scientific research, as companies like Sakana AI claimed, then reproducing existing work should be trivial. The results told a different story entirely.
Leading AI agents, including those developed by major technology companies with billion-dollar research budgets, achieved success rates of only 21% on the hardest tasks. Even more troubling, agents frequently produced results that looked correct to automated evaluation systems but were fundamentally flawed when examined by human experts. This “verification paradox”—where the systems designed to catch AI errors were themselves prone to errors—revealed layers of complexity that the industry had barely begun to acknowledge.
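To make the evaluation setup concrete, here is a minimal, hypothetical sketch of a tolerance-based reproduction check (the names, data shapes, and tolerance are illustrative assumptions, not CORE-Bench’s actual harness). It also shows one flavor of silent failure: output that looks numerically plausible while a reported result is never produced at all.

```python
# Hypothetical sketch of a reproduction check: compare an agent's reproduced
# metrics against the values reported in the published paper.

from dataclasses import dataclass

@dataclass
class ReportedResult:
    metric: str        # e.g. "test_accuracy"
    value: float       # value reported in the published paper

def check_reproduction(reported: list[ReportedResult],
                       reproduced: dict[str, float],
                       rel_tol: float = 0.05) -> dict[str, bool]:
    """A metric passes if the reproduced value falls within a relative
    tolerance of the reported number; missing metrics fail outright."""
    verdicts = {}
    for r in reported:
        got = reproduced.get(r.metric)
        if got is None:                       # silent failure: metric never produced
            verdicts[r.metric] = False
            continue
        verdicts[r.metric] = abs(got - r.value) <= rel_tol * abs(r.value)
    return verdicts

# The agent's output looks plausible at a glance, yet one reported number is simply missing.
reported = [ReportedResult("test_accuracy", 0.873), ReportedResult("f1_macro", 0.841)]
reproduced = {"test_accuracy": 0.869}         # f1_macro was never computed
print(check_reproduction(reported, reproduced))
# {'test_accuracy': True, 'f1_macro': False}
```

The tolerance is the crux of a check like this: set it too loose and flawed reproductions pass, too tight and legitimate run-to-run variation fails, which is one reason purely automated grading struggles to replace expert review.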
The publication of CORE-Bench marked a turning point in how both researchers and practitioners thought about AI evaluation. For the first time, the AI community had a benchmark that measured not just what systems could do under ideal conditions, but whether they could be trusted to work in the messy reality where human researchers actually operated.
The Prophet’s Platform
The impact of Kapoor’s work rippled outward from Princeton’s academic halls with surprising speed and force. Within months of CORE-Bench’s publication, the young PhD candidate found himself fielding calls from congressional staffers preparing for hearings on AI regulation, from journalists trying to understand why the latest AI demonstration seemed too good to be true, and from corporate executives quietly questioning whether their companies’ AI investments were based on solid ground or clever marketing.
This transformation from anonymous graduate student to influential voice in AI governance was accelerated by Kapoor’s collaboration with his advisor, Arvind Narayanan, on the book “AI Snake Oil,” published by Princeton University Press in 2024. The book distilled years of research into accessible insights about distinguishing legitimate AI capabilities from marketing hype, reaching an audience far beyond academic circles. Their accompanying newsletter, which grew to over 50,000 subscribers, became essential reading for policymakers, journalists, and business leaders struggling to navigate the gap between AI promises and AI reality.
Broader recognition had come in 2023, when TIME magazine named him to its inaugural list of the 100 Most Influential People in AI. The honor was remarkable not just for his age, as one of the youngest people on the list, but for what it represented: a shift in how society valued AI expertise. In a field dominated by entrepreneurs promising revolutionary breakthroughs and researchers pushing ever-higher benchmark scores, Kapoor had carved out influence through the simple but radical act of telling the truth about what AI could and couldn’t do.
Yet even as accolades accumulated, Kapoor remained focused on the technical work that had brought him recognition. His next project, HAL (Holistic Agent Leaderboard), represented an attempt to institutionalize the reliability-focused evaluation principles he had pioneered with CORE-Bench. Launched in early 2025, HAL tracked AI agent performance across multiple dimensions simultaneously—not just accuracy, but cost, consistency, and behavior under adversarial conditions.
The early results from HAL revealed the kind of hidden disparities that traditional benchmarks missed entirely. Systems that appeared comparable under conventional metrics showed dramatic differences in practical utility: some agents achieved similar task completion rates while differing by an order of magnitude in operational costs. These insights had immediate practical implications for companies trying to choose between competing AI systems and regulators attempting to assess the real-world impact of AI deployments.
The Holistic Vision
As 2025 began, HAL matured from a leaderboard into a comprehensive evaluation platform. Most importantly, it incorporated human expert evaluation as a core component, recognizing that many AI failures were subtle enough to escape automated detection but obvious to domain specialists.
HAL represented more than just another evaluation framework: it embodied a philosophical shift in how the AI community approached the assessment of intelligent systems. By making cost-awareness and reliability measurement central to evaluation, it challenged the field’s traditional focus on peak performance over practical utility.
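As a rough illustration of what multi-dimensional scoring can surface, here is a hypothetical sketch (the record format, field names, and the consistency definition are assumptions for illustration, not HAL’s actual implementation):

```python
# Hypothetical sketch of multi-dimensional agent scoring: accuracy, cost, and
# run-to-run consistency computed from per-run records.

from collections import defaultdict
from statistics import mean

# One record per agent attempt at a task (field names are illustrative assumptions).
runs = [
    {"agent": "agent_a", "task": "t1", "success": True,  "cost_usd": 0.40},
    {"agent": "agent_a", "task": "t1", "success": False, "cost_usd": 0.38},
    {"agent": "agent_a", "task": "t2", "success": True,  "cost_usd": 0.35},
    {"agent": "agent_a", "task": "t2", "success": True,  "cost_usd": 0.36},
    {"agent": "agent_b", "task": "t1", "success": True,  "cost_usd": 4.10},
    {"agent": "agent_b", "task": "t1", "success": True,  "cost_usd": 4.05},
    {"agent": "agent_b", "task": "t2", "success": True,  "cost_usd": 3.90},
    {"agent": "agent_b", "task": "t2", "success": False, "cost_usd": 4.00},
]

def summarize(records):
    by_agent = defaultdict(list)
    for r in records:
        by_agent[r["agent"]].append(r)

    summary = {}
    for agent, rs in by_agent.items():
        # Accuracy: fraction of runs that succeeded.
        accuracy = mean(1.0 if r["success"] else 0.0 for r in rs)
        # Cost: average dollars per run.
        avg_cost = mean(r["cost_usd"] for r in rs)
        # Consistency: fraction of tasks solved on *every* run, not just once.
        outcomes_by_task = defaultdict(list)
        for r in rs:
            outcomes_by_task[r["task"]].append(r["success"])
        consistency = mean(1.0 if all(v) else 0.0 for v in outcomes_by_task.values())
        summary[agent] = {"accuracy": round(accuracy, 2),
                          "avg_cost_usd": round(avg_cost, 2),
                          "consistency": round(consistency, 2)}
    return summary

# Both agents complete 75% of their runs, but one costs roughly ten times more per run.
print(summarize(runs))
```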
This multi-dimensional approach to evaluation was beginning to reshape not just how AI systems were assessed, but how they were conceived and developed. Companies were starting to optimize for reliability and cost-effectiveness rather than just accuracy scores, recognizing that real-world deployment required systems that could be trusted to work consistently rather than impressively.
The Unfinished Revolution
Despite his growing influence, Kapoor remained haunted by the persistence of the problems he had spent years documenting. Each new wave of AI capabilities brought renewed cycles of hype and eventual disappointment, as companies rushed to deploy systems before understanding their limitations and researchers published impressive-sounding results that couldn’t be reproduced in practice. The fundamental incentives that had created the capability-reliability gap—academic rewards for breakthrough claims, venture capital funding based on technological potential rather than proven utility, competitive pressure to move fast and break things—remained essentially unchanged.
In quiet moments between conferences and interviews, Kapoor would return to the cramped Princeton office where his journey had begun, staring at screens full of code and data that told the same story they always had: artificial intelligence was powerful but unreliable, capable but inconsistent, impressive in demonstration but frustrating in deployment. The gap between what AI could do and what it could be trusted to do remained as wide as ever, despite years of research and billions of dollars in investment.
The path forward, Kapoor had come to believe, required more than technical solutions. It demanded a cultural transformation within the AI research and development community—a willingness to embrace uncertainty rather than pretend it could be eliminated, to prioritize consistency over peak performance, and to accept that the transition from research prototype to production system represented a qualitatively different engineering challenge that required specialized expertise and institutional support.
Most importantly, it required recognizing that the ultimate measure of AI success was not what systems could do in carefully controlled demonstrations, but what they could be trusted to do consistently in the messy, unpredictable environments where they would ultimately need to operate. This insight, simple in statement but revolutionary in implication, had become the organizing principle of Kapoor’s work and the foundation for his emerging influence as a voice of reason in a field too often dominated by hype and wishful thinking.
The future of artificial intelligence, Kapoor insisted in his speeches and writings, would be determined not by those who could build the most impressive capabilities, but by those who could engineer systems reliable enough to actually use. It was a message that resonated with anyone who had ever watched a promising AI demonstration fail in the real world, but implementing it would require patience, humility, and a commitment to truth over marketing—qualities that were often in short supply in the fast-moving world of artificial intelligence.
The Reliability Revolution
As winter settled over Princeton in early 2025, Kapoor could look back on a transformation that few could have predicted when he first began questioning AI research claims as an undergraduate in Kanpur. The field that had once dismissed reliability concerns as academic nitpicking was now grappling seriously with the questions he had been asking for years. Major technology companies were redesigning their evaluation procedures, regulatory agencies were incorporating his frameworks into their oversight mechanisms, and a new generation of AI researchers was learning to think about consistency as seriously as they thought about capability.
The change was not merely technical but philosophical: a recognition that the most important AI breakthroughs might come not from building more capable systems, but from learning to build reliable ones. The reliability engineering principles that Kapoor had helped articulate provided a foundation for this transformation. Applying them to ever more sophisticated AI systems, however, would demand continued innovation and vigilance from researchers who understood that the difference between systems that could impress audiences and systems that could be trusted to work would determine whether artificial intelligence fulfilled its promise to enhance human capabilities.
In his small Princeton office, surrounded by the detritus of years spent debugging other people’s promises, Kapoor continued the patient work of testing, measuring, and documenting the gap between AI’s potential and its reality. Each failed reproduction, each subtle error caught by careful analysis, each system that worked brilliantly in demonstration but failed quietly in deployment, added another data point to his growing understanding of what it would take to build artificial intelligence that actually worked.
The story of his emergence as AI’s reliability prophet illuminated a broader truth about technological progress: the most important innovations often came not from those who pushed the boundaries of what was possible, but from those who insisted on understanding why existing systems failed and what it would take to make them work. In an age of artificial intelligence, this patient focus on reliability over capability might well prove to be the difference between a technology that transformed human society and one that remained forever promising but never quite ready for the challenges that mattered most.
Through rigorous research, persistent skepticism, and an unwavering commitment to truth over hype, a young PhD candidate from India had helped catalyze a revolution in how artificial intelligence was evaluated, deployed, and governed. The full impact of that revolution was still unfolding, but its foundations had been laid by someone who understood that the most valuable contribution to AI’s future might not be building smarter systems, but building systems that actually worked when it mattered.
Citations and References
[1] Siegel, Z. S., Kapoor, S., Nadgir, N., Stroebl, B., & Narayanan, A. (2024). “CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark.” arXiv:2409.11363
[2] Stroebl, B., Kapoor, S., & Narayanan, A. (2025). “HAL: Holistic Agent Leaderboard.” Princeton University. hal.cs.princeton.edu
[3] Federal Trade Commission. (2025). “FTC Finalizes Order with DoNotPay That Prohibits Deceptive ‘AI Lawyer’ Claims.” Press Release, January 2025
[4] Stanford Law School. (2024). “Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools.” Stanford University
[5] Kapoor, S., & Narayanan, A. (2024). “AI Snake Oil: What Artificial Intelligence Can Do, What It Can’t, and How to Tell the Difference.” Princeton University Press
[6] Kapoor, S., et al. (2024). “REFORMS: Consensus-based Recommendations for Machine-learning-based Science.” Science Advances
[7] TIME Magazine. (2023). “The 100 Most Influential People in AI.” Annual Special Issue
[8] Kapoor, S., & Narayanan, A. (2023). “Leakage and the reproducibility crisis in machine-learning-based science.” Patterns, Cell Press