Changelog
Updates to the CatholicBench dataset, models, and platform.
Scoring Accuracy Fix
Fixed an aggregation bug where raw scores were not correctly mapped to the score field in result details. Individual model result breakdowns now match the normalized scores displayed on the leaderboard.
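As an illustration of the kind of mapping involved, here is a minimal sketch that fills a normalized score field from raw grader output and derives a leaderboard value from the same numbers. The field names (rawScore, maxScore, questionId), the 0-100 scale, and the mean aggregation are assumptions made for this example and may not match the project's actual schema.

```typescript
// Hypothetical shapes: these field names are assumptions, not the real schema.
interface RawResult {
  questionId: string;
  rawScore: number; // points awarded by the grader
  maxScore: number; // maximum points available for the question
}

interface ResultDetail {
  questionId: string;
  score: number; // normalized to an assumed 0-100 range
}

// Map each raw result to a detail record whose `score` is normalized.
function toResultDetails(results: RawResult[]): ResultDetail[] {
  return results.map((r) => ({
    questionId: r.questionId,
    // Guard against a zero maximum so the division never yields NaN.
    score: r.maxScore > 0 ? (r.rawScore / r.maxScore) * 100 : 0,
  }));
}

// Leaderboard value: the mean of the same normalized per-question scores,
// so the breakdown and the leaderboard cannot drift apart.
function leaderboardScore(details: ResultDetail[]): number {
  if (details.length === 0) return 0;
  const total = details.reduce((sum, d) => sum + d.score, 0);
  return total / details.length;
}
```

Deriving both views from one normalized value is one way to keep per-question breakdowns and the leaderboard consistent.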
Claude Opus 4.6, GPT-5.2 & Gemini 3 Flash
Added three new models to the benchmark: Claude Opus 4.6, GPT-5.2, and Gemini 3 Flash. These represent the latest generation of frontier models and provide fresh data points for evaluating theological reasoning across providers.
Monorepo Consolidation & Unified Pipeline
Consolidated the project into a Bun workspaces monorepo with a shared library layer and unified benchmarking pipeline. This architectural overhaul streamlines development and ensures consistency between the CLI engine and the web dashboard.
Opus 4.5
Added Opus 4.5 to the benchmark. This model navigates complex theological nuance with high precision. It shows a marked improvement in pastoral tone, particularly on sensitive bioethical questions, striking a better balance between doctrinal clarity and compassionate delivery than previous iterations.
Model Updates & UI Enhancements
Updated benchmark data with the latest model runs. Refined the Dashboard to exclude incomplete model runs and added tooltip indicators for 'stealth' models currently in testing.
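A rough sketch of how such filtering could work is shown below. It assumes a run counts as complete once every benchmark question has a graded answer, and that stealth status is a simple flag. The ModelRun shape, the completeness rule, and the tooltip text are all hypothetical.

```typescript
// Hypothetical run summary: field names are assumptions for illustration.
interface ModelRun {
  model: string;
  answered: number; // questions with a graded result
  totalQuestions: number; // questions in the benchmark
  stealth?: boolean; // true for models still in testing
}

// Assumed rule: a run is complete when every question has been answered.
function isComplete(run: ModelRun): boolean {
  return run.totalQuestions > 0 && run.answered >= run.totalQuestions;
}

// Keep only complete runs for display.
function dashboardRuns(runs: ModelRun[]): ModelRun[] {
  return runs.filter(isComplete);
}

// Return tooltip text for stealth models, or undefined when none applies.
function tooltipFor(run: ModelRun): string | undefined {
  return run.stealth ? `${run.model} is a stealth model currently in testing` : undefined;
}
```

Filtering before rendering means a partially finished run never appears with a misleadingly low aggregate.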
Dataset Browser & Historical Bias
Introduced the Dataset Browser component, which lets users search and explore specific benchmark questions. Added a dedicated analysis section for Historical Bias to evaluate model performance on controversial topics.
Data Refresh
Refreshed the core analysis dataset with new models and corrected specific result details.
Initial Dashboard Launch
Launched the comprehensive results dashboard featuring category-based visualization, normalized scoring, and interactive model comparisons.
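To make the category-based view concrete, the following sketch groups already-normalized scores by category and averages them. The ScoredAnswer shape and the use of a plain mean are assumptions for this example. The dashboard's real aggregation may differ.

```typescript
// Hypothetical input: one normalized score per answered question.
interface ScoredAnswer {
  category: string;
  score: number; // assumed to be normalized already
}

// Average the normalized scores within each category.
function meanByCategory(answers: ScoredAnswer[]): Map<string, number> {
  const sums = new Map<string, { total: number; count: number }>();
  for (const a of answers) {
    const entry = sums.get(a.category) ?? { total: 0, count: 0 };
    entry.total += a.score;
    entry.count += 1;
    sums.set(a.category, entry);
  }
  const means = new Map<string, number>();
  sums.forEach((entry, category) => {
    means.set(category, entry.total / entry.count);
  });
  return means;
}
```

Each entry in the returned map could then feed one bar or axis in a per-category chart when comparing models.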
Stay Updated
New models are benchmarked weekly. Check back to see how rankings evolve as models improve.