DeepEval for Healthcare AI: Eval-Driven Compliance That Actually Catches PII Leakage Before the FDA Does
The most dangerous failure mode for a healthcare AI isn't inaccuracy—it's a compliance breach you didn't test for. A model can generate a perfect clinical summary and still violate HIPAA by hallucinating a patient's name that never existed. Under the Breach Notification Rule, that fabricated yet plausible Protected Health Information (PHI) constitutes a reportable incident. Most teams discover these gaps during an audit or, worse, after a breach. The alternative is to treat compliance not as a post-hoc checklist, but as an integrated, automated evaluation layer that fails your CI pipeline before bad code ships. This is eval-driven compliance, and it's the only way to build healthcare AI that doesn't gamble with regulatory extinction.
Reference implementation: Every code example in this article is drawn from Agentic Healthcare, an open-source blood test intelligence app that tracks 7 clinical ratios over time using velocity-based trajectory analysis. The full eval suite, compliance architecture, and production code are available in the GitHub repository.
The Case Against Mandatory In-Person Work for AI Startups
The argument for an "office-first" culture is compelling on its face. It speaks to a romantic ideal of innovation: chance encounters, whiteboard epiphanies, and a shared mission forged over lunch. For a company building AI, this narrative feels intuitively correct. As a senior engineer who has worked in both colocated and globally distributed teams, I understand the appeal.
But intuition is not a strategy, and anecdotes are not data. When we examine the evidence and the unique constraints of an AI startup, a mandatory in-person policy looks like a self-imposed bottleneck. It limits access to the most critical resource—talent—and misunderstands how modern technical collaboration scales.
LLM as Judge: What AI Engineers Get Wrong About Automated Evaluation
Claude 3.5 Sonnet rates its own outputs approximately 25% higher than a human panel would. GPT-4 gives itself a 10% boost. Swap the order of two candidate responses in a pairwise comparison, and the verdict flips in 10--30% of cases -- not because the quality changed, but because the judge has a position preference it cannot override.
These are not edge cases. They are the default behavior of every LLM-as-judge pipeline that ships without explicit mitigation. And most ship without it.
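One way to make the position-preference problem measurable is to run every pairwise comparison twice, once in each order, and count how often the winner changes. The sketch below is illustrative only: the `Judge` signature, the `"first"`/`"second"` return convention, and both toy judges are assumptions made for this example, and a real pipeline would wrap an LLM call behind the same interface.

```python
from typing import Callable, Iterable, Tuple

# A judge takes (prompt, first_response, second_response) and returns
# "first" or "second". In practice this would wrap an LLM call; here it
# is any callable, so the check can run without network access.
Judge = Callable[[str, str, str], str]


def position_flip_rate(judge: Judge, cases: Iterable[Tuple[str, str, str]]) -> float:
    """Fraction of cases where swapping the response order changes the winner."""
    flips = 0
    total = 0
    for prompt, resp_a, resp_b in cases:
        # Run 1 shows A first; run 2 shows B first.
        winner_ab = resp_a if judge(prompt, resp_a, resp_b) == "first" else resp_b
        winner_ba = resp_b if judge(prompt, resp_b, resp_a) == "first" else resp_a
        if winner_ab != winner_ba:
            flips += 1
        total += 1
    return flips / total if total else 0.0


def always_first(prompt: str, first: str, second: str) -> str:
    """Toy judge with maximal position bias: it always prefers slot one."""
    return "first"


def longer_wins(prompt: str, first: str, second: str) -> str:
    """Toy judge that ignores position and picks the longer response."""
    return "first" if len(first) >= len(second) else "second"
```

A flip rate near zero suggests the verdicts depend on content rather than slot. A common mitigation, under the same assumptions, is to keep only verdicts that agree across both orders and treat the rest as ties.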
LLM-as-judge -- the practice of using a capable large language model to score or compare outputs from another LLM -- has become the dominant evaluation method for production AI systems. 53.3% of teams with deployed AI agents now use it, according to LangChain's 2025 State of AI Agents survey. The economics are compelling: 80% agreement with human preferences at 500x--5,000x lower cost. But agreement rates and cost savings obscure a deeper problem. Most teams adopt the method, measure the savings, and never measure the biases. The result is evaluation infrastructure that looks automated but is quietly wrong in systematic, reproducible ways.
This article covers the mechanism, the research, and the biases that break LLM judges in production.
What is LLM as a judge? LLM-as-a-Judge is an evaluation methodology where a capable large language model scores or compares outputs from another LLM application against defined criteria -- such as helpfulness, factual accuracy, and relevance -- using structured prompts that request chain-of-thought reasoning before a final score. The method achieves approximately 80% agreement with human evaluators, matching human-to-human consistency, at 500x--5,000x lower cost than manual review.
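The definition above can be sketched as two small pieces: a prompt builder that asks for reasoning before the score, and a parser that extracts the score from the judge's reply. The prompt wording, the 1 to 5 scale, and the `SCORE:` line format are assumptions chosen for this example, not a standard.

```python
import re
from typing import Optional, Sequence


def build_judge_prompt(question: str, answer: str, criteria: Sequence[str]) -> str:
    """Ask for reasoning first, then a final line of the form 'SCORE: <1-5>'."""
    criteria_list = "\n".join(f"- {c}" for c in criteria)
    return (
        "You are evaluating an answer against the criteria below.\n"
        f"Criteria:\n{criteria_list}\n\n"
        f"Question:\n{question}\n\n"
        f"Answer:\n{answer}\n\n"
        "First explain your reasoning step by step.\n"
        "Then finish with a single line: SCORE: <integer from 1 to 5>"
    )


# Matches a line that contains only "SCORE: n" with n between 1 and 5.
_SCORE_RE = re.compile(r"^SCORE:\s*([1-5])\s*$", re.MULTILINE)


def parse_score(judge_output: str) -> Optional[int]:
    """Return the last well-formed score line, or None if the judge gave none."""
    matches = _SCORE_RE.findall(judge_output)
    return int(matches[-1]) if matches else None
```

Returning `None` for a missing or out-of-range score, rather than guessing, keeps malformed judge replies visible so they can be counted and retried instead of silently skewing the average.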
From Research Papers to Production: ML Features Powering a Crypto Scalping Engine
Every feature in a production trading system has an origin story — a paper, a theorem, a decades-old insight from probability theory or market microstructure. This post catalogs 14 ML features implemented in a Rust crypto scalping engine, traces each back to its foundational research, shows the actual formulas, and includes real production code. The engine processes limit order book (LOB) snapshots, trade ticks, and funding rate data in real time to generate scalping signals for crypto perpetual futures.
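To give a flavor of what a feature computed from a LOB snapshot looks like, here is a minimal sketch of order book imbalance, a widely used microstructure signal. This is an assumption made for illustration: the excerpt does not say whether this feature is among the 14, the engine itself is written in Rust rather than Python, and the `Level` layout and `depth` default are hypothetical.

```python
from typing import Sequence, Tuple

# Each level is (price, quantity). Bids and asks are assumed to be sorted
# best-first, as in a typical LOB snapshot.
Level = Tuple[float, float]


def order_book_imbalance(bids: Sequence[Level], asks: Sequence[Level], depth: int = 5) -> float:
    """(bid_qty - ask_qty) / (bid_qty + ask_qty) over the top `depth` levels.

    The result lies in [-1, 1]; positive means more resting size on the bid side.
    """
    bid_qty = sum(qty for _, qty in bids[:depth])
    ask_qty = sum(qty for _, qty in asks[:depth])
    total = bid_qty + ask_qty
    # An empty book carries no directional information, so report neutral.
    return (bid_qty - ask_qty) / total if total > 0 else 0.0
```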
The Two-Layer Model That Separates AI Teams That Ship from Those That Demo
In February 2024, British Columbia's Civil Resolution Tribunal ruled that Air Canada was liable for a refund policy its chatbot had invented. The policy did not exist in any document. The bot generated it from parametric memory and presented it as fact; a passenger relied on it, and the airline refused to honor it. The tribunal concluded it did not matter whether the policy came from a static page or a chatbot: it was on Air Canada's website, and Air Canada was responsible. The chatbot was removed. Total cost: legal proceedings, compensation, reputational damage, and the permanent loss of customer trust in a support channel the company had invested in building.
This was not a model failure. GPT-class models producing plausible-sounding but false information is a known, documented behavior. It was a process failure: the team built a customer-facing system without a grounding policy, without an abstain path, and without any mechanism to verify that the bot's outputs corresponded to real company policy. Every one of those gaps maps directly to a meta approach this article covers.
In 2025, a multi-agent LangChain setup entered a recursive loop and made 47,000 API calls in six hours. Cost: $47,000+. There were no rate limits, no cost alerts, no circuit breakers. The team discovered the problem by checking their billing dashboard.
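The missing safeguard in that incident can be sketched in a few lines: a budget object that every outbound call must pass through, and that raises once a call-count or spend limit is hit. The class names, limits, and the stand-in loop below are hypothetical and are not tied to LangChain or any other framework.

```python
class BudgetExceeded(RuntimeError):
    """Raised when the call or cost budget for a run is exhausted."""


class CallBudget:
    """Trips once a run exceeds a maximum number of calls or total spend."""

    def __init__(self, max_calls: int, max_cost_usd: float) -> None:
        self.max_calls = max_calls
        self.max_cost_usd = max_cost_usd
        self.calls = 0
        self.cost_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        """Record one call; raise if either limit would be exceeded."""
        if self.calls + 1 > self.max_calls:
            raise BudgetExceeded(f"call limit {self.max_calls} reached")
        if self.cost_usd + cost_usd > self.max_cost_usd:
            raise BudgetExceeded(f"cost limit ${self.max_cost_usd:.2f} reached")
        self.calls += 1
        self.cost_usd += cost_usd


def run_agent_loop(budget: CallBudget, cost_per_call_usd: float) -> int:
    """Stand-in for a runaway agent loop; it stops only when the budget trips."""
    steps = 0
    while True:
        try:
            budget.charge(cost_per_call_usd)
        except BudgetExceeded:
            return steps
        steps += 1
```

Charging before the call is made, rather than after, means the limit is a hard ceiling instead of an alert that fires once the money is already spent.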
These are not edge cases. An August 2025 Mount Sinai study (Communications Medicine) found leading AI chatbots hallucinated on 50–82.7% of fictional medical scenarios — GPT-4o's best-case error rate was 53%. Multiple enterprise surveys found a significant share of AI users had made business decisions based on hallucinated content. Gartner estimates only 5% of GenAI pilots achieve rapid revenue acceleration. MIT research puts the fraction of enterprise AI demos that reach production-grade reliability at approximately 5%. The average prototype-to-production gap: eight months of engineering effort that often ends in rollback or permanent demo-mode operation.
The gap between a working demo and a production-grade AI system is not a technical gap. It is a strategic one. Teams that ship adopt a coherent set of meta approaches — architectural postures that define what the system fundamentally guarantees — before they choose frameworks, models, or methods. Teams that demo have the methods without the meta approaches.
This distinction matters more now that vibe coding (coding by prompting without specs, evals, or governance) has become the default entry point for many teams. Vibe coding is pure Layer 2: methods without meta approaches. It works for prototypes and internal tools where failure is cheap. But the moment a system touches customers, handles money, or makes decisions with legal consequences, the choice between vibe coding and structured AI development becomes the dividing line between a demo and a product. Meta approaches are what get you past the demo.
This article gives you both layers, how they map to each other, the real-world failures that happen when each is ignored, and how to start applying eval-first development, and each of the other approaches, in your system today.
McKinsey reports 65–71% of organizations now regularly use generative AI. Databricks found organizations put 11x more models into production year-over-year. Yet S&P Global found 42% of enterprises are now scrapping most AI initiatives — up from 17% a year earlier. IDC found 96% of organizations deploying GenAI reported costs higher than expected, and 88% of AI pilots fail to reach production. Gartner predicts 40% of enterprise applications will feature task-specific AI agents by end of 2026, up from less than 5% in 2025. Enterprise LLM spend reached $8.4 billion in H1 2025, with approximately 40% of enterprises now spending $250,000+ per year.
The Research on LLM Self-Correction
If you’re building with LLMs today, you’ve likely been sold a bill of goods about “reflection.” The narrative is seductive: just have the model check its own work, and watch quality magically improve. It’s the software equivalent of telling a student to “review your exam before turning it in.” The reality, backed by a mounting pile of peer-reviewed evidence, is far uglier. In most production scenarios, adding a self-reflection loop is the most expensive way to achieve precisely nothing—or worse, to degrade your output. The seminal paper that shattered the illusion is Huang et al.’s 2023 work, “Large Language Models Cannot Self-Correct Reasoning Yet.” Their finding was blunt: without external feedback, asking GPT-4 to review and correct its own answers on math and reasoning tasks consistently decreased accuracy. The model changed correct answers to wrong ones more often than it fixed errors. This isn’t an edge case; it’s a fundamental limitation of an autoregressive model critiquing its own autoregressive output with the same data, same biases, and zero new information.
Eval Driven Development
Here's the counterintuitive premise: for any LLM application where errors have real consequences, you must build your evaluation harness before you write a single prompt. You don't prompt-engineer by vibes, tweaking until an output looks good. You start by defining what "good" means, instrumenting its measurement, and only then do you optimize. This is Eval-Driven Development. It's the only sane way to build reliable, high-stakes AI systems.
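A minimal sketch of what "harness first" can mean in practice: define the cases and the pass criteria, wire them into a gate, and only then start iterating on prompts. The `EvalCase` fields, the substring-based check, and the threshold are simplifying assumptions for illustration; a real harness would likely use richer metrics.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    input_text: str
    must_contain: List[str]      # substrings a good output has to include
    must_not_contain: List[str]  # substrings that make an output fail


def passes(case: EvalCase, output: str) -> bool:
    """A case passes only if every required string is present and no banned one is."""
    text = output.lower()
    has_required = all(s.lower() in text for s in case.must_contain)
    has_banned = any(s.lower() in text for s in case.must_not_contain)
    return has_required and not has_banned


def pass_rate(system: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Run the system under test on every case and return the fraction that pass."""
    if not cases:
        return 0.0
    return sum(passes(c, system(c.input_text)) for c in cases) / len(cases)


def gate(system: Callable[[str], str], cases: List[EvalCase], threshold: float) -> bool:
    """True if the system meets the bar; a CI job would fail the build otherwise."""
    return pass_rate(system, cases) >= threshold
```

Because `system` is just a callable from input text to output text, the same cases can score a first-draft prompt, a revised prompt, or a different model without changing the harness.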
Forget Elite DORA Scores. Your Platform’s Job is to Make Slow Teams Less Slow.
If your platform team’s North Star is getting every development squad into the “elite” performer bracket for DORA metrics, you’re aiming at the wrong target. You’re probably making things worse. I’ve watched organizations obsess over average deployment frequency or lead time, only to see platform complexity balloon and team friction increase. The real goal isn’t to build a rocket ship for your top performers; it’s to build a reliable highway for everyone else.
The corrective lens comes from a pivotal but under-appreciated source: the CNCF’s Platform Engineering Metrics whitepaper. It makes a contrarian, data-backed claim that cuts through the industry hype. The paper states bluntly that platform teams should focus on “improving the performance of the lowest-performing teams” and “reducing the spread of outcomes, not just the average.” This isn’t about settling for mediocrity. It’s about systemic stability and scaling effectively. When you measure platform success by how much you compress the variance in team performance, you start building for adoption and predictability—not vanity metrics.
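The difference between optimizing the average and compressing the spread is easy to see with a few numbers. The sketch below uses invented lead times for five hypothetical teams, and population standard deviation as the spread measure; both are assumptions for illustration and are not taken from the whitepaper.

```python
from statistics import mean, pstdev
from typing import Dict, List


def summarize(lead_times_hours: List[float]) -> Dict[str, float]:
    """Average, spread, and worst case for per-team lead times (in hours)."""
    return {
        "mean": mean(lead_times_hours),
        "spread": pstdev(lead_times_hours),  # population standard deviation
        "worst": max(lead_times_hours),
    }


# Hypothetical lead times for five teams, invented for illustration only.
baseline = [4.0, 6.0, 10.0, 40.0, 90.0]

# Option A: speed up the two fastest teams even further.
boost_top = [1.0, 2.0, 10.0, 40.0, 90.0]

# Option B: bring the two slowest teams closer to the middle.
lift_bottom = [4.0, 6.0, 10.0, 20.0, 30.0]
```

With these made-up numbers, Option A leaves the worst team at 90 hours and widens the spread slightly, while Option B shrinks both the spread and the worst case. Comparing `summarize(baseline)` with `summarize(lift_bottom)` makes the variance-compression goal concrete.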
Claude Code Doesn't Index Your Codebase. Here's What It Does Instead.
Last verified: March 2026
Boris Cherny's team built RAG into early Claude Code. They tested it against agentic search. Agentic search won — not narrowly. A Claude engineer confirmed it in a Hacker News thread: "In our testing we found that agentic search outperformed [it] by a lot, and this was surprising."
That thread is the clearest primary source on how Claude Code actually works — and why it works that way. Most articles on the topic paraphrase it from memory. This one starts from the source.
Q: Does Claude Code index your codebase? A: No. Claude Code does not pre-index your codebase or use vector embeddings. Instead, it uses filesystem tools — Glob for file pattern matching, Grep for content search, and Read for loading specific files — to explore code on demand as it works through each task. Anthropic calls this "agentic search."
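The glob, grep, read pattern described in that answer can be imitated with the Python standard library. This is an analogy under stated assumptions, not Claude Code's implementation: the function names and the context-window parameter are hypothetical, and the point is only that each step touches the filesystem on demand instead of consulting a prebuilt index.

```python
import re
from pathlib import Path
from typing import List, Tuple


def glob_files(root: Path, pattern: str) -> List[Path]:
    """Glob step: list files under `root` whose name matches the pattern."""
    return sorted(p for p in root.rglob(pattern) if p.is_file())


def grep_files(paths: List[Path], regex: str) -> List[Tuple[Path, int, str]]:
    """Grep step: return (path, line_number, line) for every matching line."""
    compiled = re.compile(regex)
    hits = []
    for path in paths:
        text = path.read_text(encoding="utf-8", errors="ignore")
        for number, line in enumerate(text.splitlines(), start=1):
            if compiled.search(line):
                hits.append((path, number, line.strip()))
    return hits


def read_window(path: Path, line_number: int, context: int = 2) -> str:
    """Read step: load only the lines around a hit instead of the whole repo."""
    lines = path.read_text(encoding="utf-8", errors="ignore").splitlines()
    start = max(line_number - 1 - context, 0)
    end = min(line_number + context, len(lines))
    return "\n".join(lines[start:end])
```

A typical chain would be `glob_files(root, "*.py")`, then `grep_files(files, r"def\s+login")`, then `read_window` on each hit. Nothing is embedded or cached between tasks, which is the trade-off the excerpt describes.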
