pre.dev Labs · RL Environments

Verifiable RL environments
from real production codebases.

Real, multi-file engineering tasks lifted from private production repos in the pre.dev agency network. Each is a runnable harness whose verifier returns continuous reward on [0, 1], not a binary pass/fail. Authored by pre.dev's own coding agent, calibrated across model families, and QC'd before it ships.

Continuous rewardZero contaminationMulti-file, full-stack

Book an intro call

For frontier-lab post-training · RL & SFT data procurement

The corpus

Hundreds of production codebases. None in anyone's pretraining.

Every environment is mined from a real private repo in the pre.dev network. The corpus spans 15 languages and 11 project types, 11.2M lines of real engineering with full commit history back to 2013, and billions of tokens once history is included. None of it has ever been public, and it grows every week.

ever public on GitHub

languages

project types

11.2M

lines of real code

162M

tokens of source

Languages · 15

JavaScript

Swift

TypeScript

Kotlin

Python

Objective-C

PHP

Java

+ Dart, C++, Solidity, Rust, C

Project types · 11

Full-stackMobileBack-endFront-endML / dataGamesDevOpsWeb3DesktopOther

Real iOS and Android apps, full-stack web, backends, ML, games, and on-chain code. The hard-to-source distribution, not just another pile of JavaScript.

Why this isn't commodity

Five things other benchmarks miss.

A training-grade environment is not a leaderboard task. Five properties separate something you can train on from something you can only score against.

Continuous reward on [0, 1], not binary.

Most public coding benchmarks (Terminal-Bench, SWE-bench Verified) emit 0/1: full credit or nothing. A run that gets 70% of the feature right scores the same as one that submitted nothing, so the gradient is mostly flat. Here, reward is weighted_passed / weighted_total over a two-layer verifier (bash unit ×1, mocked-Prisma behavioral ×4), so every trial exposes exactly which sub-criteria held and which broke.

Real production source, not contractor-constructed.

83K LoC across 840 files, 763 commits over 13 months, 7 contributors, a Prisma schema with 200+ migrations, full Next.js + TypeScript + Postgres stack with Vitest + Playwright infra already in place, AWS/GCP/OAuth wired in. Real engineering, not a hand-rolled scaffold.

Private source, held out by construction.

The source repo is private: not on GitHub search, not in CommonCrawl, not in any public code corpus. Common patterns (a Prisma aggregation, a season filter) obviously appear in pretraining, but the specific repo, task, and verifier are held out. A file-tree hash ships with every bundle so you can confirm none of these files are already in your training set.

Multi-file, full-stack scope.

The agent edits 8 files spanning actions, type signatures, components, table footers, and sidebar nav. Cross-package coordination is required, and single-file benchmarks miss this entirely.

Pre-released QC, not ship-and-pray.

Every environment passes oracle, nop, mutation, and cold-cache gates before delivery, with the raw verifier output included for your own audit.

How it's made

A supply chain for verifiable coding data.

The hard part of training-grade coding data isn't the grader. It's sourcing real, private, uncontaminated codebases at volume and turning each one into a calibrated environment. Our agency network supplies the code. Our coding agent does the rest.

Source

We license real production codebases from dev agencies in the pre.dev network. Real commit history, real engineering, never indexed on public GitHub.

Author

pre.dev’s internal coding agent mines a real engineering task from the repo and builds it into a runnable harness with a two-layer verifier.

Calibrate & QC

Run pass@5 baselines across model families, then six FP/FN gates. Only environments that land in the training band ship.

Deliver

Ship the runnable env, trajectories, and QC evidence, deduped against your training data by file-tree hash.

A real sample environment · Task #001

Season-aware sales tracking, across 8 files.

Add season-aware sales tracking to a production admin dashboard. The agent extends user records with two new current-season aggregate metrics, sourced by grouping purchase records and applying a season filter consistently across the relevant data and reward paths. The user table picks up sorting + totals on the new metrics, and the sidebar reflects the updated emphasis.

The agent edits 8 files spanning actions, type signatures, constants, table components, and sidebar config: coordinated changes across the web app and shared packages of a Next.js + Prisma monorepo.

The full instruction the agent sees lives in env/instruction.md. It names the contract fields and hints at the data source, but never dictates the implementation.

files edited

new metrics

83K

LoC repo

840

files

763

commits / 13mo

200+

Prisma migrations

Capability calibration

Weighted reward across 5 samples, terminus-2 on harbor.

A calibration probe, not a leaderboard: 5 samples per model on a single task, one agent scaffold. Enough to show the env separates capability tiers (top-tier near 0.65 against a 0.07 no-op floor and a 1.0 oracle). Not a precise model ranking, and the gap between the two top-tier means is well inside the noise.

Claude Opus 4.7

σ 0.171 · range 0.49–0.94

0.65

Claude Sonnet 4.6

σ 0.174 · range 0.48–0.94

0.65

GPT-5.5

σ 0.201 · range 0.28–0.80

0.45

Claude Haiku 4.5

σ 0.021 · range 0.38–0.44

0.40

Dynamic rangefloor → discrimination band → ceiling

nop 0.07top-tier ~0.65oracle 1.00

Mean near 0.65 places the env in the canonical RL training band: room to improve, not flat, not unlearnable. These numbers are a property of the environment crossed with the terminus-2 scaffold, not the weights alone, so read them as a difficulty signal, not a benchmark score. Both top-tier runs peak at 0.94, held off 1.0 by one behavioral edge case (malformed-row tolerance), a reproducible sub-criterion RL can pressure directly.

What's in the bundle

Nine artifacts, all reproducible.

Everything needed to run the environment, reproduce the baselines, and audit the reward. Not just a prompt and a grader.

ItemPathPurpose

Runnable harbor taskenv/Drop-in for harbor run -p ./env

Per-task instructionenv/instruction.mdWhat the agent is asked to do

Verifier harnessenv/tests/Two-layer reward (bash unit + vitest behavioral)

Oracle solutionenv/solution/solve.shCanonical diff for sanity-checking

Pass@5 trajectoriestrajectories/pass-at-5-*4 model families × 5 trials: raw output + traces.parquet + traces.jsonl (911 SFT rows, ShareGPT)

Gold trajectorytrajectories/gold/1.0 reference run for SFT use

QA evidenceqa-evidence/Raw verifier outputs for oracle, nop, and the mutation tests

QC reportqc-report.mdCapability + contamination evidence

Reproduction kitreproduction-kit.mdRun instructions per model

predev-task-001/
├── manifest.json, README.md, qc-report.md, reproduction-kit.md
├── env/
│   ├── task.toml, instruction.md
│   ├── environment/   Dockerfile, repo/ (Next.js + Prisma monorepo), init scripts
│   ├── solution/      canonical.patch, solve.sh
│   └── tests/         run.sh, *.behavioral.test.ts, vitest.config.ts, ...
├── qa-evidence/
│   ├── oracle/                      reward 1.0000  (canonical diff applied)
│   ├── nop/                         reward 0.0704  (no edits, real floor)
│   ├── mutation-drop-season-filter/ reward 0.8310  (1 line removed)
│   └── mutation-rename-internal-vars/ reward 1.0000 (cosmetic rename, no penalty)
└── trajectories/
    ├── gold/                        43-turn 1.0 reference run
    └── pass-at-5-{opus,sonnet,gpt,haiku}/
        ├── stats.json, traces.parquet, traces.jsonl   (HF Dataset / ShareGPT)
        └── trial-{1..5}/   reward.txt, trajectory.json, recording.cast, verifier-run.log

QC evidence · FP / FN audit

Six gates, every env.

The verifier is the artifact under test. Before an environment ships, we probe it for false positives and false negatives, so the reward you train on means what it says. The grade lives in the behavioral layer (weight 4): it asserts query shape, aggregation semantics, and end-to-end returned values against a mocked database. The structural greps (weight 1) are dense shaping, not a path to credit a policy can string-match its way through.

Oracle (canonical solution)

Env reaches max reward; harness is solvable.

1.0000

reward

Nop agent (no edits)

Floor reward from pre-existing structural elements. Real dynamic range 0.07 → 1.0.

0.0704

reward

Mutation: rename internal accumulator vars

No penalty for naming choices not in the spec. The env grades the contract, not style.

1.0000

reward

Mutation: drop the season filter from groupBy

Removing 1 line of core logic drops reward by 0.169. Multiple behavioral tests fire (call-shape, helper invocation, aggregation, end-to-end math).

0.8310

reward

Verifier with zero parsed tests

No fake-green fallback when the runner doesn’t produce countable results.

exits

non-zero

Cold-cache install path

Works on harbor’s fresh container, with no warm-cache dependency, reproducible across environments.

pass

cold-cache

Gold trajectory

A 43-turn, 1.0 reference run.

trajectories/gold/ ships a 1.0 reference run: file exploration → multi-file edits → typecheck → behavioral test execution → edge-case review → fix → verification → submit. SFT-grade success demonstration.

Replay verification

$ cd env/environment/repo
$ git apply ../../trajectories/gold/diff.patch
$ bash ../../tests/setup.sh && bash ../../tests/run.sh
# Test Results: 11 passed, 0 failed (unit)
# Test Results: 15 passed, 0 failed (behavioral)
# REWARD: 1.0000

Reproducibility

One command, any model.

$ harbor run -a terminus-2 \
    -m <model-name> -p ./env

Bring your own infra: a self-contained container plus a reward entrypoint returning a float in [0,1]; harbor is just how we produce the baselines

Oracle baseline: deterministically 1.0 every run

Sampling: append -k 5 for parallel multi-trial sampling

Stack-agnostic: pre-warmed image, cold-cache reproducible

Model-agnostic: any LiteLLM-supported model with an API key

Reward-labeled trajectory data

911 rows (turns across the 1.0 gold run and the 20 pass@5 trials) as traces.parquet + traces.jsonl. Every row carries its trial's verifier reward, so you filter to your own bar: keep the gold path for SFT warmstart, or use the full reward-labeled set for RLVR and reward-model work. A conversations_sharegpt column drops into standard pipelines.

Pipeline at scale

Same methodology, at any volume.

The pipeline behind Task #001 runs at any volume. The same sourcing, authoring, and QC produce a single environment, a themed volume pack, or a continuous stream, including environments generated against your own private repos.

Single tasks

A specific, hand-targeted environment with the full QC + capability bundle, like Task #001.

Volume packs

Batches of calibrated environments across stacks and difficulty bands, delivered on a schedule.

Continuous subscriptions

Ongoing generation against your private repos, deduped against your training data by file-tree hash.

Want the data? Let's talk.

We'll walk you through a full environment end to end, then scope a volume pack or a continuous stream against the slice of the corpus you care about.

Book an intro call

Continuous rewardZero contaminationMulti-file, full-stack

Verifiable RL environmentsfrom real production codebases.

Hundreds of production codebases. None in anyone's pretraining.

Five things other benchmarks miss.

Continuous reward on [0, 1], not binary.

Real production source, not contractor-constructed.

Private source, held out by construction.

Multi-file, full-stack scope.

Pre-released QC, not ship-and-pray.

A supply chain for verifiable coding data.

Source

Author

Calibrate & QC

Deliver

Season-aware sales tracking, across 8 files.

Weighted reward across 5 samples, terminus-2 on harbor.

Nine artifacts, all reproducible.

Six gates, every env.

Oracle (canonical solution)

Nop agent (no edits)

Mutation: rename internal accumulator vars

Mutation: drop the season filter from groupBy

Verifier with zero parsed tests

Cold-cache install path

A 43-turn, 1.0 reference run.

One command, any model.

Same methodology, at any volume.

Single tasks

Volume packs

Continuous subscriptions

Want the data? Let's talk.

Verifiable RL environments
from real production codebases.