SWE-bench
A benchmark framework for evaluating AI coding agents on real GitHub issues and pull requests.
About SWE-bench
A benchmark and evaluation framework for testing AI coding agents on real-world software engineering tasks. Each task pairs a GitHub issue from a popular Python repository with the repository state at the time the issue was reported; an agent must produce a patch that resolves the issue, and the patch is graded by running the repository's own test suite.
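For a concrete sense of what a task looks like, the sketch below loads one instance of the benchmark with the Hugging Face datasets library. The dataset name and field names (problem_statement, base_commit, etc.) reflect the published SWE-bench schema at the time of writing and should be treated as illustrative rather than definitive.

```python
# Minimal sketch: inspect one SWE-bench task instance.
# Assumes the Hugging Face `datasets` library and the public
# princeton-nlp/SWE-bench_Lite dataset; field names may change across versions.
from datasets import load_dataset

# The "Lite" subset is a smaller curated slice of the full benchmark.
dataset = load_dataset("princeton-nlp/SWE-bench_Lite", split="test")

task = dataset[0]
print(task["instance_id"])        # e.g. "astropy__astropy-12907" (repo + PR number)
print(task["repo"])               # source repository, e.g. "astropy/astropy"
print(task["base_commit"])        # commit the agent's patch is applied against
print(task["problem_statement"])  # the GitHub issue text shown to the agent
# The gold patch and the tests used for grading are also included:
# task["patch"], task["test_patch"], task["FAIL_TO_PASS"], task["PASS_TO_PASS"]
```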
Key Features
- Real-world task evaluation
- GitHub issue benchmarks
- Agent comparison
- Leaderboard
- Reproducible testing (see the predictions sketch after this list)
- Python repository focus
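Evaluation is driven by a predictions file that maps each task to the patch an agent produced; the harness then applies each patch in an isolated environment and reruns the repository's tests. The sketch below shows the commonly documented predictions format and a typical harness invocation; the exact keys and CLI flags are assumptions based on the project's documentation and may differ between swebench versions.

```python
# Sketch: prepare a predictions file for the SWE-bench evaluation harness.
# The three keys below follow the documented predictions format; treat them
# and the CLI flags in the trailing comment as assumptions to verify against
# the version of the harness you install.
import json

predictions = [
    {
        "instance_id": "astropy__astropy-12907",    # which task is being answered
        "model_name_or_path": "my-coding-agent",    # label reported in the results
        "model_patch": "diff --git a/... b/...\n",  # unified diff produced by the agent
    },
]

with open("preds.json", "w") as f:
    json.dump(predictions, f)

# The official harness (which requires Docker) can then score the file, e.g.:
#   python -m swebench.harness.run_evaluation \
#       --dataset_name princeton-nlp/SWE-bench_Lite \
#       --predictions_path preds.json \
#       --max_workers 4 \
#       --run_id demo
```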
Pricing
Free
Free and open source research benchmark.
Pros
- Industry-standard benchmark
- Real-world tasks
- Open source
- Active leaderboard
Cons
- Limited to Python repositories
- Benchmark gaming concerns
- Limited to issue resolution tasks