AI Research & Evaluation

DRQ Benchmark

Multi-Provider LLM Core War Arena

An enhanced fork of Sakana AI's DRQ benchmark for evaluating LLM adversarial code generation. Features multi-provider support, real-time web interface, and battle visualization.

DRQ Benchmark hero image

// the problem

Challenge

Evaluating LLM code generation requires controlled benchmarks with measurable outcomes. The original DRQ (Digital Red Queen) research showed convergent evolution in LLM-generated programs, but single-provider evaluation limits insights. Building a fair multi-model battle arena requires consistent prompting, parallel generation, and deterministic battle simulation.

// what we built

Solution

DRQ Benchmark extends the original research with multi-provider LLM support: leading models from multiple providers generate warriors under identical prompts, then compete head-to-head in Core War. Parallel warrior generation significantly reduces total benchmark time.
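The parallel-generation step above can be sketched with a thread pool that fans out one request per model. This is a minimal sketch, not the project's actual code: `generate_warrior` is a hypothetical stand-in for a real provider API call, and it returns a canned placeholder warrior.

```python
from concurrent.futures import ThreadPoolExecutor

def generate_warrior(model_name: str, prompt: str) -> str:
    # Hypothetical stand-in: the real benchmark would call the
    # provider's API here and return Redcode source.
    return f"; warrior by {model_name}\nMOV 0, 1  ; classic Imp: copies itself forward"

def generate_all(models: list[str], prompt: str) -> dict[str, str]:
    # Fan out one generation request per model and collect the results.
    with ThreadPoolExecutor(max_workers=len(models)) as pool:
        futures = {m: pool.submit(generate_warrior, m, prompt) for m in models}
        return {m: f.result() for m, f in futures.items()}

warriors = generate_all(["model-a", "model-b"], "Write a Core War warrior.")
```

Because each generation is an independent network-bound call, a thread pool is enough to overlap the slow provider round-trips without any shared state.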

// shipped

Key features

  • Multi-provider LLM battles
  • Real-time benchmark monitoring
  • Pygame visualizer with color-coded warriors
  • Parallel warrior generation
  • Warrior code inspection
  • Battle history tracking
  • Score tracking with win rates
  • Docker containerization
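Score tracking with win rates, one of the features above, reduces to a per-model tally of wins, losses, and ties. A minimal sketch (the class name and method names are illustrative, not the project's API):

```python
from collections import defaultdict

class Scoreboard:
    """Per-model win/loss/tie tally with win-rate computation (sketch)."""

    def __init__(self):
        self.record = defaultdict(lambda: {"wins": 0, "losses": 0, "ties": 0})

    def add_result(self, model_a: str, model_b: str, tie: bool = False) -> None:
        # On a tie both models record a tie; otherwise model_a is the winner.
        if tie:
            self.record[model_a]["ties"] += 1
            self.record[model_b]["ties"] += 1
        else:
            self.record[model_a]["wins"] += 1
            self.record[model_b]["losses"] += 1

    def win_rate(self, model: str) -> float:
        r = self.record[model]
        total = r["wins"] + r["losses"] + r["ties"]
        return r["wins"] / total if total else 0.0

board = Scoreboard()
board.add_result("model-a", "model-b")
board.add_result("model-a", "model-b")
board.add_result("model-b", "model-a")
# model-a has won 2 of its 3 battles
```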

// stack.json

Tech stack

The exact tools shipping this product in production.

  • Python
  • Flask
  • Core War
  • Pygame
  • Leading AI Models
  • Docker

// system.diagram()

Architecture

Multi-provider LLM battle arena for adversarial program evolution research

Pipeline: Config → Generate Warriors → Replay → Results Stream

  • Flask Web Server (backend)
  • Real-Time Monitor (frontend)
  • Model Selection (service)
  • LLM Generators (ai)
  • Core War Arena (service)
  • Pygame Visualizer (frontend)
  • Battle History (database)
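The real-time monitor can be fed by a streaming Flask endpoint. This sketch assumes a Server-Sent Events design; the `/stream` route and the `battle_events` generator are illustrative, not the project's actual interface.

```python
import json
import time

from flask import Flask, Response

app = Flask(__name__)

def battle_events():
    # Placeholder event source; the real server would emit results
    # from the running benchmark as each round completes.
    for round_no in range(1, 4):
        payload = {"round": round_no, "status": "complete"}
        yield f"data: {json.dumps(payload)}\n\n"
        time.sleep(0.01)

@app.route("/stream")
def stream():
    # Server-Sent Events keep the monitor page updated without polling.
    return Response(battle_events(), mimetype="text/event-stream")
```

SSE is a natural fit here because updates flow one way (server to browser) and the frontend can consume them with a plain `EventSource`.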

// receipts

Results

  • Multi-provider LLM support across leading models
  • Real-time web monitoring interface
  • Pygame battle visualization
  • Significantly faster with parallel warrior generation
  • Player vs Player mode (any model combination)
  • Battle history with localStorage persistence
  • Providers: multiple leading providers
  • Models: broad model support
  • Speedup: significant with parallel generation
  • Battle rounds: configurable (default 24)

// next()

Have a project like this?

We build production systems with the same engineering rigor you see here. Let's talk.