Miriam Kümmel & Mathis Lucka · 10 min read

Hallucination Detection Comparison

What's the best tool for hallucination detection? We put 7 of them to the test.

Intro

Hallucinations are far from a solved issue. In our recent pharmaceutical RAG benchmark, we found that 24 to 65% of LLM responses contained hallucinations. Companies deploying AI applications need a way to systematically detect them in order to reduce the amount of ungrounded or false information produced by their systems. But which approach actually works?

We tested seven hallucination detection tools on PlaceboBench, including both open source frameworks (MiniCheck by Bespoke Labs, RAGAS) and proprietary cloud APIs (Azure Groundedness Detection, Google Cloud Check Grounding, AWS Bedrock Guardrails, Vectara HHEM), as well as our own, Blue Guardrails. Six tools achieved accuracies between 53.6% and 62.3% at message level, marginally better than guessing. Blue Guardrails reached 94.4%.

On the more granular claim level, the three existing tools (Azure Groundedness Detection, Google Cloud Check Grounding, MiniCheck) reached F1-scores of 22–24%. Blue Guardrails reached 92.3% F1. The remaining three tools (Vectara, RAGAS, AWS Bedrock Guardrails) operate only at message level and cannot be evaluated on claim-level hallucination detection.

What we tested

To test the performance of the different hallucination detection tools, we ran them on PlaceboBench, a pharmaceutical hallucination benchmark built from 69 questions that healthcare professionals submitted to drug information centers, paired with official regulatory documents from the European Medicines Agency. Seven state-of-the-art LLMs then generated responses to the questions, resulting in a total of 483 data points. The hallucination annotations were created and reviewed by humans.

We measured two things: whether each tool correctly identified which responses contained hallucinations (message-level accuracy, reported for all tools), and, for the claim-level tools, how precisely they located the hallucinated text within the response (claim-level F1).
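
To make the claim-level metric concrete, here is a minimal sketch of how predicted hallucination spans can be scored against human annotations. The character-overlap matching rule and the 0.5 threshold are illustrative assumptions for this sketch, not necessarily the exact criteria used in the benchmark.

```python
def overlap(a: tuple[int, int], b: tuple[int, int]) -> int:
    """Number of characters shared by two (start, end) spans."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def span_f1(predicted: list[tuple[int, int]],
            annotated: list[tuple[int, int]],
            min_overlap: float = 0.5) -> float:
    """F1 for claim-level detection under a simple overlap criterion.

    A predicted span counts as correct if it covers at least `min_overlap`
    of some human-annotated span; an annotated span counts as found if any
    predicted span covers it to the same degree.
    """
    def hit(p, a):
        return overlap(p, a) >= min_overlap * (a[1] - a[0])

    tp = sum(1 for p in predicted if any(hit(p, a) for a in annotated))
    fp = len(predicted) - tp
    fn = sum(1 for a in annotated if not any(hit(p, a) for p in predicted))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


# One predicted span overlaps the annotation, a second one is spurious.
print(span_f1(predicted=[(10, 42), (100, 120)], annotated=[(12, 40)]))  # ~0.67
```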

The seven tools rely on different underlying approaches:

  • RAGAS uses an LLM-as-judge (we chose GPT-5.2) to detect hallucinations.
  • MiniCheck and Azure Groundedness Detection each use a fine-tuned Transformer model.
  • Blue Guardrails uses an LLM-based verification agent.
  • The two remaining cloud APIs (Google Cloud Check Grounding, AWS Bedrock Guardrails) don't disclose their internals.

Azure Groundedness Detection, Google Cloud Check Grounding, MiniCheck, and Blue Guardrails provide claim-level hallucination detection, meaning they identify the exact text spans within a response that are hallucinated. AWS Bedrock Guardrails, Vectara, and RAGAS operate at the message level only, providing a binary yes/no verdict or score for whether the entire response contains hallucinations, without pinpointing where.

Vectara and RAGAS return a continuous score for "consistency" and "faithfulness" respectively; AWS Bedrock Guardrails returns both a score and a binary blocked/not-blocked verdict; Azure Groundedness Detection, Google Cloud Check Grounding, and Blue Guardrails return spans with character offsets; MiniCheck works sentence-by-sentence and also returns spans.

We ran each tool against the same dataset (PlaceboBench). Every data sample consists of

  • chunks of medical documents (the context)
  • a user query
  • an LLM-generated response
  • human-annotated hallucination spans marking exactly which parts of the response are not grounded in the source, and therefore considered hallucinated.
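
Concretely, a single sample can be pictured roughly as follows; the field names are illustrative stand-ins, not the exact PlaceboBench schema.

```python
sample = {
    # Retrieved chunks from EMA regulatory documents (the context)
    "context": [
        "4.2 Posology and method of administration: ...",
        "4.8 Undesirable effects: ...",
    ],
    # A question submitted by a healthcare professional
    "query": "Can drug X be given to patients with renal impairment?",
    # The LLM-generated answer that the detection tools must check
    "response": "Drug X is safe for all patients with renal impairment ...",
    # Human-annotated (start, end) character offsets into the response,
    # marking text that is not grounded in the context
    "hallucination_spans": [(10, 42)],
}
```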

For tools that return spans directly (Azure Groundedness Detection, Google Cloud Check Grounding, MiniCheck, and Blue Guardrails), we compared predicted spans against the human annotations. For tools that return a continuous score (Vectara, RAGAS) or a binary verdict plus score (AWS Bedrock Guardrails), we performed a "threshold sweep" to find the threshold that maximizes their F1.
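
For the score-based tools, the sweep itself is simple; a minimal sketch (assuming lower scores mean less grounded, and using scikit-learn only for the F1 computation) could look like this:

```python
import numpy as np
from sklearn.metrics import f1_score


def best_threshold(scores: list[float], labels: list[int]) -> tuple[float, float]:
    """Sweep candidate thresholds and return the (threshold, F1) maximizing F1.

    scores: per-response consistency/faithfulness score from the tool
    labels: 1 if the response contains a hallucination (human annotation)
    A response is flagged as hallucinated when its score falls below the
    threshold, which assumes lower scores mean less grounded.
    """
    scores, labels = np.asarray(scores), np.asarray(labels)
    best_t, best_f1 = 0.0, -1.0
    for t in np.unique(scores):
        preds = (scores < t).astype(int)
        f1 = f1_score(labels, preds, zero_division=0)
        if f1 > best_f1:
            best_t, best_f1 = float(t), f1
    return best_t, best_f1


# Toy example: two hallucinated responses with low scores.
print(best_threshold([0.95, 0.80, 0.30, 0.88, 0.15], [0, 0, 1, 0, 1]))  # (0.8, 1.0)
```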

Azure Groundedness Detection comes with a practical constraint: responses with context longer than 55,000 characters had to be excluded because the API cannot handle contexts of this length. This led to a reduced test dataset of 294 samples (60.9% of the full dataset). The other tools ran on the full dataset.

Results

Message-level accuracy

Hallucination detection accuracy at message level across all seven tools: Blue Guardrails reaches 94.4%, while the six other tools range from 53.6% to 62.3%.

Across the six existing tools, accuracy at the message level (correctly identifying whether a response contains a hallucination) ranged from 53.6% to 62.3%. AWS Bedrock Guardrails performed best at 62.3%, with RAGAS close behind at 62.0%. Vectara was weakest at 53.6%. Blue Guardrails reached 94.4%. To put this in perspective: a classifier that flags every response as hallucinated would achieve roughly 45% accuracy on our dataset, and one that flags nothing would achieve roughly 55%, so the margin above these trivial baselines is slim for most tools.

Claim-level F1

Claim-level hallucination detection F1 scores: Azure Groundedness Detection, Google Cloud Check Grounding, and MiniCheck cluster between 22% and 24%, while Blue Guardrails reaches 92.3%.

For the tools that operate at claim level (Azure Groundedness Detection, Google Cloud Check Grounding, MiniCheck, and Blue Guardrails), we additionally measured how precisely they located the hallucinated text within a response. Azure Groundedness Detection, Google Cloud Check Grounding, and MiniCheck clustered tightly between 22% and 24% F1. Blue Guardrails reached 92.3% F1. Vectara, RAGAS, and AWS Bedrock Guardrails cannot be evaluated at this level since they don't return spans.

Overall takeaway

The other six tools show relatively consistent results regardless of approach: none reach accuracy above 63%, and claim-level detection stays below 25% F1. Blue Guardrails' substantially higher performance (94.4% accuracy, 92.3% F1) suggests a different hallucination detection architecture is needed.

Discussion

To understand the differences in performance of the tools, it's helpful to categorize the underlying technologies.

Transformer-based NLI tools

MiniCheck, Vectara (HHEM), and Azure Groundedness Detection all use Transformer models trained or fine-tuned on the task of Natural Language Inference. In NLI, a model determines whether one text snippet entails, contradicts, or is neutral with respect to another text snippet. It became prominent in 2015 with Stanford's SNLI dataset, which was subsequently used to train many popular Transformer-based NLI classifiers (2018 and onwards). For hallucination detection, entailment means grounded, while both contradiction and neutral are flagged as hallucinations.
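
To make the recipe concrete, here is a minimal sketch using an off-the-shelf MNLI model from the Hugging Face Hub. The specific model, the sentence-wise loop, and the label mapping are our illustrative assumptions, not the internals of any of the tested tools.

```python
from transformers import pipeline

# Any off-the-shelf NLI classifier works for the illustration; this one
# predicts CONTRADICTION / NEUTRAL / ENTAILMENT for a (premise, hypothesis) pair.
nli = pipeline("text-classification", model="roberta-large-mnli")


def flag_hallucinations(context: str, response_sentences: list[str]) -> list[str]:
    """Flag response sentences the NLI model does not judge as entailed.

    Entailment means grounded; contradiction and neutral are flagged.
    Note: the model truncates inputs to 512 tokens, so most of a
    13,000-token RAG context is simply thrown away, which is one reason
    this recipe struggles on realistic contexts.
    """
    flagged = []
    for sentence in response_sentences:
        result = nli([{"text": context, "text_pair": sentence}], truncation=True)[0]
        if result["label"] != "ENTAILMENT":
            flagged.append(sentence)
    return flagged
```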

Google Cloud and AWS don't disclose their technical implementations, but documentation clues suggest similar approaches. Google's API states that "a sentence is considered a single claim" and defines grounding as whether "the claim is wholly entailed by the facts". AWS describes checking "for relevance for each chunk processed" with confidence scoring. Both tools showed latencies of 500 ms or below in our benchmarking tests, a strong indicator that they use small language models rather than LLMs.

Looking at the SNLI dataset gives us an idea of why NLI does not translate well into hallucination detection:

Example entries from the Stanford Natural Language Inference (SNLI) dataset: short, simple sentence pairs used for NLI training.

The training data is too simple. SNLI, like many other NLI datasets, consists of short sentence pairs. PlaceboBench, on the other hand, was modeled after state-of-the-art RAG setups; its contexts average 13,000 tokens (10–20 pages of dense medical text). Judging entailment between two isolated sentences is a very different task from judging whether a generated answer is grounded in thousands of tokens of such material.

Additionally, realistic hallucinations are much more subtle than what we see in SNLI. The hallucinations annotated in PlaceboBench include context misattribution (mixing information about drug A with drug B), incorrect generalization (applying information specific to one patient population to all patients), wrong frequencies (calling a rare side effect "common"), and omissions. They aren't clean "entailment versus contradiction" cases.

Research has also shown that NLI models learn shortcuts rather than actual semantic understanding: They pick up on surface patterns (like negation words correlating with contradiction) that don't transfer to real-world hallucination detection tasks.

LLM-as-a-judge-based tool

RAGAS' faithfulness metric uses an LLM to detect hallucinations, but applies the same fragmented verification logic as NLI-based tools: it slices responses into individual sentences, resolves pronouns to their referents, then asks the LLM whether each sentence is supported by the full context.
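
A minimal sketch of that sentence-by-sentence pattern is shown below; it is not the actual RAGAS code, the prompt wording and the `llm_judge` callable are our stand-ins, and pronoun resolution is omitted.

```python
import re
from typing import Callable


def faithfulness(context: str, response: str,
                 llm_judge: Callable[[str], str]) -> float:
    """Fraction of response sentences the judge considers supported.

    llm_judge is any callable that sends a prompt to an LLM and returns a
    short answer. Each sentence is judged in isolation against the full
    context, which is exactly the fragmentation criticized above:
    cross-sentence dependencies are lost before the judge sees them.
    """
    # Naive sentence splitting; RAGAS additionally resolves pronouns per sentence.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    supported = sum(
        1 for s in sentences
        if llm_judge(
            f"Context:\n{context}\n\nClaim: {s}\n"
            "Is the claim fully supported by the context? Answer yes or no."
        ).strip().lower().startswith("yes")
    )
    return supported / len(sentences) if sentences else 1.0
```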

RAGAS introduces similar methodological problems as NLI-based tools, along with a few more. Decomposing responses into sentences, and therefore isolating the claims from each other, discards cross-sentence relationships that are necessary to judge the groundedness of claims. Also, automatic pronoun resolution can introduce semantic errors that further impede accurate hallucination detection.

Message level versus claim level, and the role of explanations

Three tools in our benchmark (AWS Bedrock Guardrails, Vectara, RAGAS) can only determine whether a response contains a hallucination, not where it occurs or what specifically is wrong. This limitation renders them ineffective for any use case beyond basic filtering.

Without claim-level detection, the only available action is to discard the entire response before it reaches the (human or agentic) user. This prevents immediate harm but offers no path to improvement. You cannot analyze patterns in what types of claims hallucinate, which sources are frequently misattributed, or where in the reasoning chain errors occur.

Claim-level detection is indispensable, but to actually improve a system, explanations of detected hallucinations are required as well. Google Cloud Check Grounding and MiniCheck identify which text is hallucinated but give no explanation of why it was flagged. Azure Groundedness Detection can optionally generate explanations after detection, but the detection itself still relies on its NLI model.

Blue Guardrails integrates reasoning directly into the hallucination detection process, with each flagged span including both the classification and an explanation of why it was identified as hallucinated.
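
For illustration, the output for a single flagged span might look like the following; the field names and the example content are invented for this sketch and are not the actual Blue Guardrails API.

```python
flagged_span = {
    "claim": "This side effect occurs commonly in patients.",
    "char_offsets": (128, 174),
    "verdict": "hallucinated",
    "explanation": (
        "The cited EMA document lists the side effect as rare (<1/1,000); "
        "the response upgrades it to 'common' without support in the context."
    ),
}
```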

Tools that both locate hallucinations and explain them enable iterative improvement. When you know that a model consistently hallucinates drug dosages but not side effects, or conflates information between similar medications, you can adjust retrieval strategies, enrich metadata, add verification steps, or refine prompts to address specific failure modes. There's a significant difference between "This response has a problem somewhere" and "This specific claim about treatment interruption applies rheumatoid arthritis data to a psoriasis patient without noting the indication mismatch": only the latter gives you the tools to actually improve your system.

You can't solve a 2026 problem with 2018 tooling

The NLI models powering most of the hallucination detectors we tested were designed in a different era of language technology. When most NLI models were trained, the task was to classify relationships between single sentences. In today's production agents and RAG systems, LLMs generate multi-paragraph responses drawing on contexts that span tens of thousands of tokens, with reasoning chains that reference (and conflate) information across multiple sources.

Just upgrading the judging model doesn't fix the underlying methodology. RAGAS uses GPT-5.2 as a judge but still achieves only 62% accuracy because it applies the same sentence-by-sentence verification approach. The LLM is modern, but the framework treats hallucination detection as if responses were single isolated sentences.

Blue Guardrails achieves substantially better performance by taking a different approach altogether: verification agents that can process semantic dependencies across lengthy contexts and surface the actual mechanisms behind each hallucination. The results suggest that closing the gap between 2015 NLI benchmarks and 2026 RAG systems requires rethinking the detection architecture itself, not just swapping in a larger model.

Miriam Kümmel
Co-Founder at Blue Guardrails
Mathis Lucka
Co-Founder at Blue Guardrails

Create reliable AI agents

We are AI quality specialists who help engineering teams improve the reliability and accuracy of their AI applications through systematic hallucination detection and mitigation.
