AI Research & Evaluation

DRQ Benchmark

Multi-Provider LLM Core War Arena

An enhanced fork of Sakana AI's DRQ benchmark for evaluating LLM adversarial code generation. Features multi-provider support, real-time web interface, and battle visualization.

DRQ Benchmark hero image

// the problem

Challenge

Evaluating LLM code generation requires controlled benchmarks with measurable outcomes. The original DRQ (Digital Red Queen) research showed convergent evolution in LLM-generated programs, but single-provider evaluation limits insights. Building a fair multi-model battle arena requires consistent prompting, parallel generation, and deterministic battle simulation.

// what we built

Solution

DRQ Benchmark extends the original research with multi-provider LLM support: leading models from multiple providers generate warriors under identical prompts, then compete head-to-head in Core War. Parallel warrior generation significantly reduces total benchmark time.
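The parallel-generation step above can be sketched with a thread pool that fans out one request per model. This is a minimal sketch, not the project's actual code: `generate_warrior` is a hypothetical stand-in for a real provider API call, and it returns a canned placeholder warrior.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_warrior(model_name: str, prompt: str) -> str:
    # Hypothetical stand-in: the real benchmark would call the
    # provider's API here and return Redcode source.
    return f"; warrior by {model_name}\nMOV 0, 1  ; classic Imp: copies itself forward"

def generate_all(models: list[str], prompt: str) -> dict[str, str]:
    # Fan out one generation request per model and collect the results.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {m: pool.submit(generate_warrior, m, prompt) for m in models}
        return {m: f.result() for m, f in futures.items()}

warriors = generate_all(["model-a", "model-b"], "Write a Core War warrior.")
```

Because each generation is an independent network-bound call, a thread pool is enough to overlap the slow provider round-trips without any shared state.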

// shipped

Key features

  • Multi-provider LLM battles
  • Real-time benchmark monitoring
  • Pygame visualizer with color-coded warriors
  • Parallel warrior generation
  • Warrior code inspection
  • Battle history tracking
  • Score tracking with win rates
  • Docker containerization
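Score tracking with win rates, one of the features above, reduces to a per-model tally of wins, losses, and ties. A minimal sketch (the class name and method names are illustrative, not the project's API):

```python
from collections import defaultdict

class Scoreboard:
    """Per-model win/loss/tie tally with win-rate computation (sketch)."""

    def __init__(self):
        self.record = defaultdict(lambda: {"wins": 0, "losses": 0, "ties": 0})

    def add_result(self, model_a: str, model_b: str, tie: bool = False) -> None:
        # On a tie both models record a tie; otherwise model_a is the winner.
        if tie:
            self.record[model_a]["ties"] += 1
            self.record[model_b]["ties"] += 1
        else:
            self.record[model_a]["wins"] += 1
            self.record[model_b]["losses"] += 1

    def win_rate(self, model: str) -> float:
        r = self.record[model]
        total = r["wins"] + r["losses"] + r["ties"]
        return r["wins"] / total if total else 0.0

board = Scoreboard()
board.add_result("model-a", "model-b")
board.add_result("model-a", "model-b")
board.add_result("model-b", "model-a")
# model-a has won 2 of its 3 battles
```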

// stack.json

Tech stack

The exact tools shipping this product in production.

  • Python
  • Flask
  • Core War
  • Pygame
  • Leading AI Models
  • Docker

// system.diagram()

Architecture

Multi-provider LLM battle arena for adversarial program evolution research

Pipeline: Config → Generate Warriors → Replay → Results Stream

  • Flask Web Server (backend)
  • Real-Time Monitor (frontend)
  • Model Selection (service)
  • LLM Generators (ai)
  • Core War Arena (service)
  • Pygame Visualizer (frontend)
  • Battle History (database)
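The real-time monitor can be fed by a streaming Flask endpoint. This sketch assumes a Server-Sent Events design; the `/stream` route and the `battle_events` generator are illustrative, not the project's actual interface.

```python
import json
import time

from flask import Flask, Response

app = Flask(__name__)

def battle_events():
    # Placeholder event source; the real server would emit results
    # from the running benchmark as each round completes.
    for round_no in range(1, 4):
        payload = {"round": round_no, "status": "complete"}
        yield f"data: {json.dumps(payload)}\n\n"
        time.sleep(0.01)

@app.route("/stream")
def stream():
    # Server-Sent Events keep the monitor page updated without polling.
    return Response(battle_events(), mimetype="text/event-stream")
```

SSE is a natural fit here because updates flow one way (server to browser) and the frontend can consume them with a plain `EventSource`.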

// receipts

Results

  • Multi-provider LLM support across leading models
  • Real-time web monitoring interface
  • Pygame battle visualization
  • Significantly faster with parallel warrior generation
  • Player vs Player mode (any model combination)
  • Battle history with localStorage persistence
  • Providers: multiple leading providers
  • Models: broad model support
  • Speedup: significant with parallel generation
  • Battle rounds: configurable (default 24)

// next()

Have a project like this?

We build production systems with the same engineering rigor you see here. Let's talk.