pre.dev
pre.dev Labs · RL Environments

Verifiable RL environments
from real production codebases.

Real, multi-file engineering tasks lifted from private production repos in the pre.dev agency network. Each is a runnable harness whose verifier returns continuous reward on [0, 1], not a binary pass/fail. Authored by pre.dev's own coding agent, calibrated across model families, and QC'd before it ships.

Continuous rewardZero contaminationMulti-file, full-stack

For frontier-lab post-training · RL & SFT data procurement

The corpus

Hundreds of production codebases. None in anyone's pretraining.

Every environment is mined from a real private repo in the pre.dev network. The corpus spans 15 languages and 11 project types, 11.2M lines of real engineering with full commit history back to 2013, and billions of tokens once history is included. None of it has ever been public, and it grows every week.

0
ever public on GitHub
15
languages
11
project types
11.2M
lines of real code
162M
tokens of source
Languages · 15
JavaScript
Swift
TypeScript
Kotlin
C#
Python
Objective-C
PHP
Java
+ Dart, C++, Solidity, Rust, C
Project types · 11
Full-stackMobileBack-endFront-endML / dataGamesDevOpsWeb3DesktopOther

Real iOS and Android apps, full-stack web, backends, ML, games, and on-chain code. The hard-to-source distribution, not just another pile of JavaScript.

Why this isn't commodity

Five things other benchmarks miss.

A training-grade environment is not a leaderboard task. Five properties separate something you can train on from something you can only score against.

01

Continuous reward on [0, 1], not binary.

Most public coding benchmarks (Terminal-Bench, SWE-bench Verified) emit 0/1: full credit or nothing. A run that gets 70% of the feature right scores the same as one that submitted nothing, so the gradient is mostly flat. Here, reward is weighted_passed / weighted_total over a two-layer verifier (bash unit ×1, mocked-Prisma behavioral ×4), so every trial exposes exactly which sub-criteria held and which broke.

02

Real production source, not contractor-constructed.

83K LoC across 840 files, 763 commits over 13 months, 7 contributors, a Prisma schema with 200+ migrations, full Next.js + TypeScript + Postgres stack with Vitest + Playwright infra already in place, AWS/GCP/OAuth wired in. Real engineering, not a hand-rolled scaffold.

03

Private source, held out by construction.

The source repo is private: not on GitHub search, not in CommonCrawl, not in any public code corpus. Common patterns (a Prisma aggregation, a season filter) obviously appear in pretraining, but the specific repo, task, and verifier are held out. A file-tree hash ships with every bundle so you can confirm none of these files are already in your training set.

04

Multi-file, full-stack scope.

The agent edits 8 files spanning actions, type signatures, components, table footers, and sidebar nav. Cross-package coordination is required, and single-file benchmarks miss this entirely.

05

Pre-released QC, not ship-and-pray.

Every environment passes oracle, nop, mutation, and cold-cache gates before delivery, with the raw verifier output included for your own audit.

How it's made

A supply chain for verifiable coding data.

The hard part of training-grade coding data isn't the grader. It's sourcing real, private, uncontaminated codebases at volume and turning each one into a calibrated environment. Our agency network supplies the code. Our coding agent does the rest.

01

Source

We license real production codebases from dev agencies in the pre.dev network. Real commit history, real engineering, never indexed on public GitHub.

02

Author

pre.dev’s internal coding agent mines a real engineering task from the repo and builds it into a runnable harness with a two-layer verifier.

03

Calibrate & QC

Run pass@5 baselines across model families, then six FP/FN gates. Only environments that land in the training band ship.

04

Deliver

Ship the runnable env, trajectories, and QC evidence, deduped against your training data by file-tree hash.

A real sample environment · Task #001

Season-aware sales tracking, across 8 files.

Add season-aware sales tracking to a production admin dashboard. The agent extends user records with two new current-season aggregate metrics, sourced by grouping purchase records and applying a season filter consistently across the relevant data and reward paths. The user table picks up sorting + totals on the new metrics, and the sidebar reflects the updated emphasis.

The agent edits 8 files spanning actions, type signatures, constants, table components, and sidebar config: coordinated changes across the web app and shared packages of a Next.js + Prisma monorepo.

The full instruction the agent sees lives in env/instruction.md. It names the contract fields and hints at the data source, but never dictates the implementation.

8
files edited
2
new metrics
83K
LoC repo
840
files
763
commits / 13mo
200+
Prisma migrations
Capability calibration

Weighted reward across 5 samples, terminus-2 on harbor.

A calibration probe, not a leaderboard: 5 samples per model on a single task, one agent scaffold. Enough to show the env separates capability tiers (top-tier near 0.65 against a 0.07 no-op floor and a 1.0 oracle). Not a precise model ranking, and the gap between the two top-tier means is well inside the noise.

Claude Opus 4.7
σ 0.171 · range 0.49–0.94
0.65
Claude Sonnet 4.6
σ 0.174 · range 0.48–0.94
0.65
GPT-5.5
σ 0.201 · range 0.28–0.80
0.45
Claude Haiku 4.5
σ 0.021 · range 0.38–0.44
0.40
Dynamic range
nop 0.07top-tier ~0.65oracle 1.00

Mean near 0.65 places the env in the canonical RL training band: room to improve, not flat, not unlearnable. These numbers are a property of the environment crossed with the terminus-2 scaffold, not the weights alone, so read them as a difficulty signal, not a benchmark score. Both top-tier runs peak at 0.94, held off 1.0 by one behavioral edge case (malformed-row tolerance), a reproducible sub-criterion RL can pressure directly.

What's in the bundle

Nine artifacts, all reproducible.

Everything needed to run the environment, reproduce the baselines, and audit the reward. Not just a prompt and a grader.

Runnable harbor taskenv/Drop-in for harbor run -p ./env
Per-task instructionenv/instruction.mdWhat the agent is asked to do
Verifier harnessenv/tests/Two-layer reward (bash unit + vitest behavioral)
Oracle solutionenv/solution/solve.shCanonical diff for sanity-checking
Pass@5 trajectoriestrajectories/pass-at-5-*4 model families × 5 trials: raw output + traces.parquet + traces.jsonl (911 SFT rows, ShareGPT)
Gold trajectorytrajectories/gold/1.0 reference run for SFT use
QA evidenceqa-evidence/Raw verifier outputs for oracle, nop, and the mutation tests
QC reportqc-report.mdCapability + contamination evidence
Reproduction kitreproduction-kit.mdRun instructions per model
predev-task-001/
├── manifest.json, README.md, qc-report.md, reproduction-kit.md
├── env/
│   ├── task.toml, instruction.md
│   ├── environment/   Dockerfile, repo/ (Next.js + Prisma monorepo), init scripts
│   ├── solution/      canonical.patch, solve.sh
│   └── tests/         run.sh, *.behavioral.test.ts, vitest.config.ts, ...
├── qa-evidence/
│   ├── oracle/                      reward 1.0000  (canonical diff applied)
│   ├── nop/                         reward 0.0704  (no edits, real floor)
│   ├── mutation-drop-season-filter/ reward 0.8310  (1 line removed)
│   └── mutation-rename-internal-vars/ reward 1.0000 (cosmetic rename, no penalty)
└── trajectories/
    ├── gold/                        43-turn 1.0 reference run
    └── pass-at-5-{opus,sonnet,gpt,haiku}/
        ├── stats.json, traces.parquet, traces.jsonl   (HF Dataset / ShareGPT)
        └── trial-{1..5}/   reward.txt, trajectory.json, recording.cast, verifier-run.log
QC evidence · FP / FN audit

Six gates, every env.

The verifier is the artifact under test. Before an environment ships, we probe it for false positives and false negatives, so the reward you train on means what it says. The grade lives in the behavioral layer (weight 4): it asserts query shape, aggregation semantics, and end-to-end returned values against a mocked database. The structural greps (weight 1) are dense shaping, not a path to credit a policy can string-match its way through.

Oracle (canonical solution)

Env reaches max reward; harness is solvable.

1.0000
reward

Nop agent (no edits)

Floor reward from pre-existing structural elements. Real dynamic range 0.07 → 1.0.

0.0704
reward

Mutation: rename internal accumulator vars

No penalty for naming choices not in the spec. The env grades the contract, not style.

1.0000
reward

Mutation: drop the season filter from groupBy

Removing 1 line of core logic drops reward by 0.169. Multiple behavioral tests fire (call-shape, helper invocation, aggregation, end-to-end math).

0.8310
reward

Verifier with zero parsed tests

No fake-green fallback when the runner doesn’t produce countable results.

exits
non-zero

Cold-cache install path

Works on harbor’s fresh container, with no warm-cache dependency, reproducible across environments.

pass
cold-cache
Gold trajectory

A 43-turn, 1.0 reference run.

trajectories/gold/ ships a 1.0 reference run: file exploration → multi-file edits → typecheck → behavioral test execution → edge-case review → fix → verification → submit. SFT-grade success demonstration.

Replay verification
$ cd env/environment/repo
$ git apply ../../trajectories/gold/diff.patch
$ bash ../../tests/setup.sh && bash ../../tests/run.sh
# Test Results: 11 passed, 0 failed (unit)
# Test Results: 15 passed, 0 failed (behavioral)
# REWARD: 1.0000
Reproducibility

One command, any model.

$ harbor run -a terminus-2 \
    -m <model-name> -p ./env
Bring your own infra: a self-contained container plus a reward entrypoint returning a float in [0,1]; harbor is just how we produce the baselines
Oracle baseline: deterministically 1.0 every run
Sampling: append -k 5 for parallel multi-trial sampling
Stack-agnostic: pre-warmed image, cold-cache reproducible
Model-agnostic: any LiteLLM-supported model with an API key
Reward-labeled trajectory data

911 rows (turns across the 1.0 gold run and the 20 pass@5 trials) as traces.parquet + traces.jsonl. Every row carries its trial's verifier reward, so you filter to your own bar: keep the gold path for SFT warmstart, or use the full reward-labeled set for RLVR and reward-model work. A conversations_sharegpt column drops into standard pipelines.

Pipeline at scale

Same methodology, at any volume.

The pipeline behind Task #001 runs at any volume. The same sourcing, authoring, and QC produce a single environment, a themed volume pack, or a continuous stream, including environments generated against your own private repos.

Single tasks

A specific, hand-targeted environment with the full QC + capability bundle, like Task #001.

Volume packs

Batches of calibrated environments across stacks and difficulty bands, delivered on a schedule.

Continuous subscriptions

Ongoing generation against your private repos, deduped against your training data by file-tree hash.

Want the data? Let's talk.

We'll walk you through a full environment end to end, then scope a volume pack or a continuous stream against the slice of the corpus you care about.

Continuous rewardZero contaminationMulti-file, full-stack