DRQ Benchmark
Multi-Provider LLM Core War Arena
An enhanced fork of Sakana AI's DRQ benchmark for evaluating LLM adversarial code generation. Features multi-provider support, real-time web interface, and battle visualization.
// the problem
Challenge
Evaluating LLM code generation requires controlled benchmarks with measurable outcomes. The original DRQ (Digital Red Queen) research showed convergent evolution in LLM-generated programs, but single-provider evaluation limits insights. Building a fair multi-model battle arena requires consistent prompting, parallel generation, and deterministic battle simulation.
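To make "deterministic battle simulation" concrete: the sketch below (all names hypothetical, not taken from the codebase) derives a fixed round-robin schedule and a per-match seed from a single run seed, so a benchmark run can be replayed exactly.

```python
# Hypothetical sketch: deterministic round-robin schedule with per-match seeds.
# None of these names come from the DRQ Benchmark codebase; they only
# illustrate how a fair, replayable battle schedule could be derived.
import hashlib
from itertools import combinations

def battle_schedule(model_names: list[str], run_seed: int) -> list[tuple[str, str, int]]:
    """Pair every model against every other, with a per-match seed derived
    from the run seed so replays are bit-for-bit identical."""
    schedule = []
    for a, b in combinations(sorted(model_names), 2):
        digest = hashlib.sha256(f"{run_seed}:{a}:{b}".encode()).hexdigest()
        match_seed = int(digest[:8], 16)  # stable 32-bit seed for this pairing
        schedule.append((a, b, match_seed))
    return schedule

if __name__ == "__main__":
    for matchup in battle_schedule(["model-a", "model-b", "model-c"], run_seed=42):
        print(matchup)
```

Deriving seeds from a hash of the pairing rather than a counter means the schedule stays stable even if models are added or removed between runs.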
// what we built
Solution
DRQ Benchmark extends the original research with multi-provider LLM support, so models from different providers can be compared under identical prompting and battle conditions. Warriors generated by each model compete in Core War, and parallel warrior generation substantially reduces total benchmark time.
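A minimal sketch of how parallel multi-provider generation could look, assuming each provider is wrapped behind a uniform client with a `complete()` method; the names here are illustrative, not the project's actual API.

```python
# Hypothetical sketch of parallel warrior generation across providers.
# `generate_warrior` and the client objects are stand-ins, not the project's
# API; the point is fanning one shared prompt out to several LLM backends.
from concurrent.futures import ThreadPoolExecutor

PROMPT = "Write a Core War (Redcode '94) warrior that survives against imps."

def generate_warrior(client) -> str:
    """Ask one provider for a warrior using the same prompt every time."""
    return client.complete(PROMPT)  # assumed uniform wrapper method

def generate_all(clients: dict[str, object]) -> dict[str, str]:
    """Generate warriors from every provider concurrently."""
    with ThreadPoolExecutor(max_workers=len(clients)) as pool:
        futures = {name: pool.submit(generate_warrior, c) for name, c in clients.items()}
        return {name: f.result() for name, f in futures.items()}
```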
// shipped
Key features
- Multi-provider LLM battles
- Real-time benchmark monitoring
- Pygame visualizer with color-coded warriors
- Parallel warrior generation
- Warrior code inspection
- Battle history tracking
- Score tracking with win rates (a minimal sketch follows this list)
- Docker containerization
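Here is a minimal sketch of the win-rate bookkeeping implied by the score-tracking feature above; field and function names are assumptions, not the project's actual code.

```python
# Hypothetical sketch of per-model score tracking with win rates.
# Field and function names are illustrative only.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Record:
    wins: int = 0
    losses: int = 0
    ties: int = 0

    @property
    def win_rate(self) -> float:
        played = self.wins + self.losses + self.ties
        return self.wins / played if played else 0.0

scores: dict[str, Record] = defaultdict(Record)

def record_battle(winner: str | None, a: str, b: str) -> None:
    """Update both models' records after a battle; winner=None means a tie."""
    if winner is None:
        scores[a].ties += 1
        scores[b].ties += 1
    else:
        loser = b if winner == a else a
        scores[winner].wins += 1
        scores[loser].losses += 1
```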
// stack.json
Tech stack
The tools used to build and run this system in production.
- Python
- Flask
- Core War
- Pygame
- Multiple LLM provider APIs
- Docker
// system.diagram()
Architecture
Multi-provider LLM battle arena for adversarial program evolution research
- Backend (see the minimal Flask sketch after this list)
- Frontend
- Service
- AI
- Database
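As a sketch of how the backend and frontend pieces could meet, the following Flask snippet exposes a run-start endpoint and a polling endpoint for real-time monitoring; the routes and payload shapes are assumptions, not the project's actual API.

```python
# Hypothetical Flask sketch of how the backend could expose a benchmark run
# to the web UI. Route names and payload shapes are assumptions.
from flask import Flask, jsonify, request

app = Flask(__name__)
RUNS: dict[str, dict] = {}  # in-memory run state; a real service might persist this

@app.post("/api/runs")
def start_run():
    """Kick off a benchmark run for the requested models (stubbed here)."""
    models = request.get_json(force=True).get("models", [])
    run_id = f"run-{len(RUNS) + 1}"
    RUNS[run_id] = {"models": models, "status": "running", "battles": []}
    return jsonify({"run_id": run_id}), 202

@app.get("/api/runs/<run_id>")
def run_status(run_id: str):
    """Poll endpoint the frontend could use for real-time monitoring."""
    run = RUNS.get(run_id)
    return (jsonify(run), 200) if run else (jsonify({"error": "not found"}), 404)
```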
// receipts
Results
- Multi-provider LLM support across leading models
- Real-time web monitoring interface
- Pygame battle visualization (a rendering sketch follows this list)
- Significantly faster benchmark runs through parallel warrior generation
- Player vs Player mode (any model combination)
- Battle history with localStorage persistence
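For a sense of how a color-coded Pygame core view could be drawn, here is an illustrative sketch; grid dimensions, colors, and the ownership representation are assumptions, not the project's actual visualizer.

```python
# Hypothetical Pygame sketch: draw the Core War core as a grid of cells,
# color-coded by which warrior last wrote each address. Sizes and colors
# are illustrative only.
import pygame

CORE_SIZE = 8000
COLS, CELL = 100, 6                      # 100 cells per row, 6 px per cell
COLORS = {0: (40, 40, 40), 1: (220, 80, 80), 2: (80, 160, 220)}  # empty, warrior 1, warrior 2

def draw_core(screen: pygame.Surface, ownership: list[int]) -> None:
    """ownership[i] says which warrior last wrote core address i (0 = untouched)."""
    for addr, owner in enumerate(ownership):
        x, y = (addr % COLS) * CELL, (addr // COLS) * CELL
        pygame.draw.rect(screen, COLORS.get(owner, (255, 255, 255)), (x, y, CELL - 1, CELL - 1))

if __name__ == "__main__":
    pygame.init()
    screen = pygame.display.set_mode((COLS * CELL, (CORE_SIZE // COLS) * CELL))
    draw_core(screen, [0] * CORE_SIZE)   # empty core; a real run would feed live state
    pygame.display.flip()
    pygame.time.wait(2000)
```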
// next()
Have a project like this?
We build production systems with the same engineering rigor you see here. Let's talk.