Changelog

Updates to the CatholicBench dataset, models, and platform.

February 6, 2026 · Bug Fix · Data

Scoring Accuracy Fix

Fixed an aggregation bug where raw scores were not correctly mapped to the score field in result details. Individual model result breakdowns now accurately reflect the normalized scores displayed on the leaderboard.
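As a rough illustration of the mapping this fix restores (the field names rawScores and score, and the 0-5 scale, are assumptions for the sketch, not the actual schema):

```typescript
// Hypothetical shape of a per-model result detail. Field names are
// illustrative assumptions, not the real CatholicBench schema.
interface ResultDetail {
  rawScores: number[]; // per-question raw grades, assumed on a 0-5 scale
  score?: number;      // normalized aggregate shown on the leaderboard
}

// Aggregate raw grades and write the result back to `score`, so the
// breakdown view and the leaderboard read the same number. The bug was
// the write-back step being skipped, leaving `score` unset.
function normalize(detail: ResultDetail): ResultDetail {
  const sum = detail.rawScores.reduce((a, b) => a + b, 0);
  const avg = detail.rawScores.length > 0 ? sum / detail.rawScores.length : 0;
  return { ...detail, score: Number(avg.toFixed(2)) };
}
```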

February 6, 2026 · New Models · Data

Claude Opus 4.6, GPT-5.2 & Gemini 3 Flash

Added three new models to the benchmark: Claude Opus 4.6, GPT-5.2, and Gemini 3 Flash. These represent the latest generation of frontier models and provide fresh data points for evaluating theological reasoning across providers.

February 5, 2026 · Architecture · Infrastructure

Monorepo Consolidation & Unified Pipeline

Consolidated the project into a Bun workspaces monorepo with a shared library layer and unified benchmarking pipeline. This architectural overhaul streamlines development and ensures consistency between the CLI engine and the web dashboard.
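For context, a Bun workspaces monorepo is driven by a workspaces field in the root package.json; a minimal sketch of such a layout might look like this (the package names and directory structure are assumptions, not the actual repo layout):

```json
{
  "name": "catholicbench",
  "private": true,
  "workspaces": [
    "packages/shared",
    "packages/cli",
    "apps/dashboard"
  ]
}
```

With this in place, `bun install` at the root links the shared library into both the CLI engine and the dashboard, which is what keeps the two consumers of the benchmarking pipeline in sync.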

November 24, 2025 · New Model · Analysis

Opus 4.5

We've introduced Opus 4.5 to the benchmark. This model excels at navigating complex theological nuances with high precision. It demonstrates remarkable improvement in pastoral tone, particularly when addressing sensitive bioethical questions, striking a better balance between doctrinal clarity and compassionate delivery than previous iterations.

November 24, 2025 · Data · UI

Model Updates & UI Enhancements

Updated benchmark data with latest model runs. Refined the Dashboard to exclude incomplete model runs and added tooltip indicators for 'stealth' models currently in testing.

November 19, 2025 · Feature · Analysis

Dataset Browser & Historical Bias

Introduced the Dataset Browser component, allowing users to search and explore individual benchmark questions. Added a dedicated Historical Bias analysis section to evaluate model performance on controversial topics.

November 19, 2025 · Data

Data Refresh

Refreshed the core analysis dataset with new models and updated specific result details for accuracy.

November 18, 2025 · Launch

Initial Dashboard Launch

Launched the comprehensive results dashboard featuring category-based visualization, normalized scoring, and interactive model comparisons.

Current Rankings
1. openai/gpt-5.2 — 4.77
2. anthropic/claude-opus-4.6 — 4.67
3. google/gemini-3-flash-preview — 4.67
4. moonshotai/kimi-k2.5 — 4.60
5. google/gemini-3-pro-preview — 4.50

Stay Updated

New models are benchmarked weekly. Check back to see how rankings evolve as models improve.
