LLM Hallucination Detection Pipeline

Python, Groq API, HuggingFace Datasets, DeepEval, Pandas, NumPy, Matplotlib, Seaborn, Pytest, Jinja-style HTML

✦ Project Overview

An end-to-end evaluation and analysis pipeline that benchmarks Llama 3.1 8B against Llama 3.3 70B on the TruthfulQA (MCQ1) dataset to assess factual accuracy and identify hallucination patterns. The system runs closed-book MCQ evaluation across 38 topic categories, performs deterministic scoring, and produces a fully visualized, self-contained HTML report — surfacing exactly where scaling model parameters does and doesn't reduce hallucinations.

✦ Key Features

♥ Benchmarks Llama 3.1 8B against Llama 3.3 70B on TruthfulQA (MCQ1): 817 questions across 38 categories, specifically designed to trigger human misconceptions rather than reward simple statistical recall.
♥ Deterministic, judge-free evaluation layer that matches model answers directly against labeled ground truth, avoiding the noise and inconsistency of LLM-as-judge scoring.
♥ Domain risk profiling that separates high-risk categories (health, law, finance, politics) from standard trivia/myth categories, revealing that both models hallucinate less in high-stakes domains than on general-misconception questions.
♥ Model-delta analysis pinpointing which categories benefit most from scaling 8B → 70B (e.g. +43.6% on Indexical Error: Time) and which ones regress (e.g. -25.9% on Superstitions), challenging the assumption that bigger models are uniformly more truthful.
♥ Automated visualization and reporting layer that generates 5 diagnostic Matplotlib/Seaborn charts and a standalone HTML dashboard, with resumable, cached batch inference to survive Groq API rate limits.

✦ Methodology

A modular, five-stage pipeline that takes raw benchmark data all the way to a deployable HTML report, treating evaluation rigor as a first-class concern:

01. Data Loading & Categorization

Downloads the TruthfulQA multiple_choice and generation configurations from HuggingFace, merges them to attach category tags to each MCQ question, flags high-risk-keyword questions, and caches the flattened dataset as a local CSV.
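
As an illustration of the merge-and-flag step, here is a minimal sketch in plain Python. It assumes rows shaped like the HuggingFace `multiple_choice` config (a `question` plus `mc1_targets` with `choices` and `labels`) and the `generation` config (a `question` plus `category`), joined on question text. The function names, the keyword list, and the choice of matching keywords against the category name are all assumptions made for illustration, not the project's actual code.

```python
import csv
import io

# Hypothetical keyword list: the project's real high-risk keywords are not given.
HIGH_RISK_KEYWORDS = ("health", "law", "finance", "politics")


def flatten_mc_rows(mc_rows, gen_rows):
    """Attach a category to each MCQ row by joining on the question text."""
    category_by_question = {
        row["question"].strip(): row["category"] for row in gen_rows
    }
    flat = []
    for row in mc_rows:
        question = row["question"].strip()
        choices = row["mc1_targets"]["choices"]
        labels = row["mc1_targets"]["labels"]  # 1 marks the single correct choice
        category = category_by_question.get(question, "Unknown")
        flat.append({
            "question": question,
            "choices": " || ".join(choices),
            "correct_index": labels.index(1),
            "category": category,
            # Assumption: a question is high-risk if its category matches a keyword.
            "high_risk": any(k in category.lower() for k in HIGH_RISK_KEYWORDS),
        })
    return flat


def to_csv(rows):
    """Serialize the flattened rows to CSV text, ready to cache on disk."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

In a real run the two row lists would come from the downloaded dataset configurations; the sketch takes them as arguments so it can be exercised without network access.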

02. Model Inference

Queries the Groq API for llama-3.1-8b-instant and llama-3.3-70b-versatile at zero temperature, prompting each model to respond with only the letter of its chosen answer, then maps that letter back to the full answer text. Batch progress is cached to allow resuming after interruptions or rate limits.
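
The letter-only prompting and resumable caching described here could look roughly like the sketch below. The `ask` argument is a hypothetical callable that wraps the actual Groq chat request (model name, zero temperature) and returns the reply text; the prompt wording, the JSON cache layout, and the strict "reply must start with the letter" parsing rule are likewise assumptions.

```python
import json
import re
import string
from pathlib import Path

LETTERS = string.ascii_uppercase


def build_prompt(question, choices):
    """Format a closed-book MCQ prompt that asks for a single letter."""
    options = "\n".join(f"{LETTERS[i]}. {c}" for i, c in enumerate(choices))
    return f"{question}\n{options}\nAnswer with only the letter of the correct option."


def parse_letter(reply, n_choices):
    """Return the chosen letter if the reply starts with a valid one, else None."""
    valid = LETTERS[:n_choices]
    match = re.match(rf"\s*\(?([{valid}])\b", reply.upper())
    return match.group(1) if match else None


def run_with_cache(items, ask, cache_path):
    """Answer each item once, persisting progress so a rerun resumes where it stopped."""
    cache_path = Path(cache_path)
    done = json.loads(cache_path.read_text()) if cache_path.exists() else {}
    for item in items:
        key = str(item["id"])
        if key in done:
            continue  # answered in a previous run
        reply = ask(build_prompt(item["question"], item["choices"]))
        letter = parse_letter(reply, len(item["choices"]))
        # Map the letter back to the full answer text; None marks a parsing failure.
        answer = item["choices"][LETTERS.index(letter)] if letter else None
        done[key] = {"letter": letter, "answer": answer}
        cache_path.write_text(json.dumps(done))  # save after every item
    return done
```

Writing the cache after every item is slower than batching, but it means an interruption or rate-limit error loses at most one response.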

03. Deterministic Response Evaluation

Rather than using an LLM judge, this stage directly compares the model's selected answer to the labeled correct choice, assigning a hallucination score of 0.0 for correct answers and 1.0 for incorrect ones, while skipping API/parsing errors to avoid skewing metrics.
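
The scoring rule above is simple enough to sketch in a few lines. The function names are hypothetical, and representing a failed API call or unparseable reply as `None` is an assumption about how errors are carried through the pipeline.

```python
def score_response(model_answer, correct_answer):
    """Return 0.0 for a correct answer, 1.0 for an incorrect one, None for errors."""
    if model_answer is None:
        return None  # API or parsing error: excluded from the metrics
    return 0.0 if model_answer.strip() == correct_answer.strip() else 1.0


def hallucination_rate(scores):
    """Mean hallucination score over valid responses only."""
    valid = [s for s in scores if s is not None]
    return sum(valid) / len(valid) if valid else None
```

For example, one correct answer, one incorrect answer, and one skipped error give a hallucination rate of 0.5, because the error is left out of the denominator rather than counted as a miss.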

04. Analytical Aggregation

Aggregates results by category, risk type, and model to compute overall accuracy/hallucination rates, per-category rates, 8B-vs-70B performance deltas, the top 10 worst-performing categories, and a high-risk-vs-standard summary.
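
A dependency-free sketch of this aggregation is shown below. The project lists Pandas in its stack, so the real implementation presumably uses group-by operations; plain dictionaries are used here only to keep the example self-contained. The record fields (`model`, `category`, `score`), the model labels `"8b"` and `"70b"`, and the reading of the delta as an accuracy gain in percentage points are assumptions.

```python
from collections import defaultdict


def rates_by(records, *keys):
    """Mean hallucination score per group, ignoring records whose score is None."""
    totals = defaultdict(lambda: [0.0, 0])
    for record in records:
        if record["score"] is None:
            continue
        group = tuple(record[k] for k in keys)
        totals[group][0] += record["score"]
        totals[group][1] += 1
    return {group: total / count for group, (total, count) in totals.items()}


def accuracy_delta(records, small="8b", large="70b"):
    """Per-category accuracy gain, in percentage points, from small to large model."""
    rates = rates_by(records, "model", "category")
    deltas = {}
    for (model, category), small_rate in rates.items():
        if model != small or (large, category) not in rates:
            continue
        # Accuracy is 1 - hallucination rate, so the gain is small_rate - large_rate.
        deltas[category] = round((small_rate - rates[(large, category)]) * 100, 1)
    return deltas


def worst_categories(records, model, n=10):
    """The n categories with the highest hallucination rate for one model."""
    rates = rates_by(records, "model", "category")
    ranked = sorted(
        ((cat, rate) for (m, cat), rate in rates.items() if m == model),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:n]
```

The same `rates_by` helper covers the high-risk-vs-standard summary if each record also carries a risk flag, for example `rates_by(records, "model", "high_risk")`.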

05. Visualization & Reporting

Uses Matplotlib and Seaborn to render five diagnostic charts (including a model comparison, worst categories, the 8B-vs-70B delta, and a high-risk comparison) and compiles them, together with the underlying statistics, into a self-contained, styled HTML dashboard.
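
One common way to make an HTML report self-contained is to embed each chart as a base64 data URI instead of linking to image files. The sketch below shows that idea with the standard library only; the template, the function name, and the input shapes (a dict of statistics and a dict of caption to PNG bytes) are assumptions, and the project's actual Jinja-style template is not reproduced here.

```python
import base64
import html
from string import Template

# Hypothetical minimal page template; the real dashboard is styled.
PAGE = Template("""<!DOCTYPE html>
<html><head><meta charset="utf-8"><title>$title</title></head>
<body><h1>$title</h1>
$stats
$figures
</body></html>""")


def build_report(title, stats, charts):
    """Build one HTML string with statistics and inline PNG charts.

    stats:  dict mapping a label to a value
    charts: dict mapping a caption to raw PNG bytes
    """
    rows = "".join(
        f"<tr><td>{html.escape(str(label))}</td><td>{html.escape(str(value))}</td></tr>"
        for label, value in stats.items()
    )
    figures = []
    for caption, png_bytes in charts.items():
        encoded = base64.b64encode(png_bytes).decode("ascii")
        safe_caption = html.escape(caption)
        figures.append(
            f'<figure><img alt="{safe_caption}" src="data:image/png;base64,{encoded}">'
            f"<figcaption>{safe_caption}</figcaption></figure>"
        )
    return PAGE.substitute(
        title=html.escape(title),
        stats=f"<table>{rows}</table>",
        figures="\n".join(figures),
    )
```

With Matplotlib, the PNG bytes for each chart can be obtained by saving the figure into an `io.BytesIO` buffer with `fig.savefig(buffer, format="png")` and reading `buffer.getvalue()`, so no image files need to ship alongside the report.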