Training
Hosted training is modeled as a durable control plane, not as long-running work inside the API web process.
Each job is owned by an organization and workspace, then dispatched to a backend-specific worker.
Execution model
The planned v1 job matrix is:
- tinker + sft
- tinker + rl
- ray + rl
Supported algorithm types include grpo, ppo, and importance_sampling (depending on trainer type and backend).
The control plane owns job lifecycle, logs, metrics, artifacts, cancellation, and retries. Backend executors own the actual training work.
For Tinker today, that means:
- the worker resolves visible refs and builds a sandbox payload
- the worker provisions a training sandbox with DREADNODE_* and TINKER_* env vars
- the sandbox runs python -m dreadnode.training.jobs --payload ... --result ...
- the worker reads the structured result back into the control plane
- Docker-backed training sandboxes use platform/docker/Dockerfile.training
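The payload-in, result-out handoff above can be sketched as follows. This is an illustrative reconstruction, not the shipped worker: the `build_sandbox_command` / `run_tinker_job` helper names are hypothetical, and only the `python -m dreadnode.training.jobs --payload ... --result ...` invocation is taken from this page.

```python
import json
import subprocess
import tempfile
from pathlib import Path

# Entrypoint documented above; everything else here is an assumption.
SANDBOX_ENTRYPOINT = ["python", "-m", "dreadnode.training.jobs"]


def build_sandbox_command(payload_path: Path, result_path: Path) -> list[str]:
    # The sandbox is driven entirely by file paths: payload in, result out.
    return [
        *SANDBOX_ENTRYPOINT,
        "--payload", str(payload_path),
        "--result", str(result_path),
    ]


def run_tinker_job(payload: dict, env: dict) -> dict:
    """Write the payload, invoke the sandbox entrypoint with the
    DREADNODE_*/TINKER_* env vars, and read the structured result back."""
    with tempfile.TemporaryDirectory() as tmp:
        payload_path = Path(tmp) / "payload.json"
        result_path = Path(tmp) / "result.json"
        payload_path.write_text(json.dumps(payload))
        subprocess.run(
            build_sandbox_command(payload_path, result_path),
            env=env,
            check=True,  # a non-zero exit marks the job failed upstream
        )
        return json.loads(result_path.read_text())
```

The file-based contract keeps the worker and the sandbox decoupled: the control plane never imports backend training code, it only reads the structured result file.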
API surface
Training jobs are workspace-scoped:
- POST /api/v1/org/{org}/ws/{workspace}/training/jobs
- GET /api/v1/org/{org}/ws/{workspace}/training/jobs
- GET /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}
- POST /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}/cancel
- POST /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}/retry
- GET /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}/logs
- GET /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}/artifacts
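Since every route shares the same workspace-scoped prefix, client code can build them from one helper. A minimal sketch, assuming a hypothetical host; only the path shapes come from the list above:

```python
# Hypothetical host for illustration.
BASE = "https://platform.example.com"


def training_jobs_url(org: str, workspace: str, suffix: str = "") -> str:
    # Every training-job route shares the workspace-scoped prefix;
    # per-job routes append "/{job_id}" plus an optional action.
    return f"{BASE}/api/v1/org/{org}/ws/{workspace}/training/jobs{suffix}"


# Usage with any HTTP client, e.g.:
#   POST training_jobs_url("acme", "red-team")                  -> create a job
#   GET  training_jobs_url("acme", "red-team", "/<job_id>/logs") -> fetch logs
```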
Current RL implementation
The first implemented hosted RL path is intentionally narrow:
- rollouts come from prompt datasets, not ad hoc uploaded prompts
- Worlds trajectory datasets can also drive an offline RL baseline through trajectory_dataset_refs
- Worlds manifests can now pre-sample native-agent trajectory datasets through world_manifest_id + world_runtime_id
- policy scaffolding comes from a versioned capability_ref
- rewards are resolved from named server-side recipes
- task_ref is available during reward evaluation and instruction rendering
- execution can be sync, one_step_off_async, or fully_async
Current limitations:
- the built-in task_verifier_v1 recipe only supports flag-based task verification
- the current Tinker RL sandbox runtime is prompt-dataset driven, not full multi-turn task replay
- the current Worlds offline RL baseline is still trajectory imitation, not native environment outcome optimization
- the primary Worlds RL path now asks Worlds to generate native-agent trajectories first, publish them as datasets, and then reuses the offline/async RL runtime over those published datasets
- the older live Worlds HTTP rollout path remains available when a job supplies world_reward
- the async modes are rollout-group schedulers:
  - one_step_off_async overlaps generation and training with one-step staleness
  - fully_async allows multiple queued rollout groups with bounded staleness via max_steps_off_policy
  - neither mode is a partial-rollout continuation runtime for long-horizon Worlds episodes yet
- environment lifecycle and richer verifier execution still need follow-up work
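The bounded-staleness behavior of the two async modes can be illustrated with a small gate function. This is a hypothetical sketch of the scheduling rule, not the shipped scheduler; only the mode names and max_steps_off_policy come from this page:

```python
from dataclasses import dataclass


@dataclass
class RolloutGroup:
    policy_step: int  # trainer step of the policy these rollouts were sampled from


def can_train_on(group: RolloutGroup, current_step: int, max_steps_off_policy: int) -> bool:
    """Accept a queued rollout group only while its sampling policy is within
    max_steps_off_policy steps of the current trainer step."""
    return current_step - group.policy_step <= max_steps_off_policy


# one_step_off_async corresponds to max_steps_off_policy == 1: generation for
# step N+1 overlaps training on rollouts sampled at step N, never older.
# fully_async raises the bound, allowing multiple queued groups in flight.
```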
Current SFT implementation
Hosted Tinker SFT now supports:
- dataset-artifact backed conversation loading
- Worlds trajectory dataset ETL into SFT conversations
- prompt/answer row normalization into chat conversations
- capability prompt injection as a system message scaffold
- cross-entropy training, optional eval, and checkpoint persistence
Reference resolution
Hosted training resolves references before execution instead of passing opaque strings directly into backend workers.
- capability_ref is resolved at submission time and persisted onto the job as a capability snapshot
- task_ref is resolved to an org-visible task definition before RL execution starts
- dataset_ref and prompt_dataset_ref are resolved to org-visible dataset artifacts before SFT or prompt-dataset RL execution starts
- trajectory_dataset_refs are validated on submission and resolved again when building the sandbox payload for Worlds-backed SFT and offline RL jobs
- world_manifest_id is validated on submission and resolved again when building the sandbox payload for Worlds sampling or live-rollout RL jobs
Current ref conventions:
- dataset refs are structured { name, version } objects with explicit versions
- task refs can use name for the latest visible task or name@version for an explicit version
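A minimal sketch of both conventions, assuming hypothetical helper names (the real resolver is server-side and not shown on this page):

```python
def parse_task_ref(ref: str) -> dict:
    """'name' selects the latest visible task; 'name@version' pins a version."""
    name, sep, version = ref.partition("@")
    return {"name": name, "version": version if sep else None}


def validate_dataset_ref(ref: dict) -> dict:
    """Dataset refs are structured objects with explicit versions, so reject
    anything that is missing either field."""
    if not isinstance(ref.get("name"), str) or "version" not in ref:
        raise ValueError(f"dataset ref must be {{ name, version }}: {ref!r}")
    return ref
```

Keeping versions explicit in dataset refs (rather than allowing an implicit "latest") means a resolved job snapshot is reproducible after new dataset versions are published.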
Policy, environment, and reward boundaries
Hosted RL jobs use three separate references:
- capability_ref identifies the versioned policy scaffold
- task_ref identifies the environment or task definition when the RL mode needs one
- reward_recipe identifies the server-side reward or verification logic
That split keeps capability versioning explicit and avoids sending ad hoc agent_spec blobs over
the API.
Request shape
Training job creation uses typed request bodies instead of one shared config blob.
- SFT requests carry dataset and LoRA-oriented settings in config
- SFT requests can use either dataset_ref or trajectory_dataset_refs
- RL requests carry prompt-dataset fields or trajectory_dataset_refs, plus reward settings and rollout controls in config
- RL requests can also carry world_manifest_id, world_runtime_id, and optional world_agent_name to pre-sample native-agent Worlds datasets before training
- RL requests can still carry world_reward for the older live-rollout bridge
prompt_dataset_ref is an RL concern and stays inside the RL config block rather than the common
job envelope. For Worlds-driven offline RL, trajectory_dataset_refs can replace both
prompt_dataset_ref and task_ref. For Worlds-native sampling, world_manifest_id plus
world_runtime_id becomes the primary environment target. If world_reward is supplied, the job
falls back to the older live-rollout path so reward shaping can still happen during rollout
generation.
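Putting the field placement rules together, illustrative request bodies might look like the following. The field names are taken from this page; the overall envelope shape, nesting, and example values are assumptions, not the authoritative schema:

```python
# Illustrative only: field names from the docs, structure and values assumed.
sft_request = {
    "config": {
        # dataset and LoRA-oriented settings live in config; a job uses
        # either dataset_ref or trajectory_dataset_refs, not both.
        "dataset_ref": {"name": "conversations", "version": 3},
    },
}

rl_request = {
    "capability_ref": "my-capability@2",   # versioned policy scaffold
    "task_ref": "flag-hunt@1",             # environment / task definition
    "config": {
        # prompt_dataset_ref is an RL concern, so it stays inside config
        # rather than the common job envelope.
        "prompt_dataset_ref": {"name": "prompts", "version": 1},
        "reward_recipe": "task_verifier_v1",
        "execution_mode": "one_step_off_async",
    },
}
```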
Worker configuration
Self-hosted API deployments can run queued Tinker jobs inside the API process:
- TRAINING_IN_PROCESS_WORKER_ENABLED
- TRAINING_IN_PROCESS_WORKER_CONCURRENCY
- TRAINING_IN_PROCESS_WORKER_POLL_INTERVAL_SEC
- TRAINING_IN_PROCESS_WORKER_LEASE_SECONDS
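A sketch of how these env vars might be parsed into worker settings. The variable names come from this page; the defaults and parsing rules here are assumptions for illustration:

```python
import os


def worker_settings(env: dict = os.environ) -> dict:
    """Parse the in-process worker env vars; defaults are assumptions."""
    return {
        "enabled": env.get("TRAINING_IN_PROCESS_WORKER_ENABLED", "false").lower()
        in ("1", "true"),
        "concurrency": int(env.get("TRAINING_IN_PROCESS_WORKER_CONCURRENCY", "1")),
        "poll_interval_sec": float(
            env.get("TRAINING_IN_PROCESS_WORKER_POLL_INTERVAL_SEC", "5")
        ),
        "lease_seconds": float(
            env.get("TRAINING_IN_PROCESS_WORKER_LEASE_SECONDS", "60")
        ),
    }
```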
Those workers still hand execution off to training sandboxes. Set these runtime env vars so the sandboxed job can talk to the right backends:
- TINKER_BASE_URL
- TINKER_API_KEY
If TINKER_BASE_URL is unset, the sandbox runtime falls back to the Tinker client default
behavior.
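The fallback behavior amounts to only passing a base URL when the env var is set. A minimal sketch, assuming hypothetical keyword names for the Tinker client constructor:

```python
import os


def tinker_client_kwargs() -> dict:
    """Build client kwargs from the sandbox env. Omitting base_url when
    TINKER_BASE_URL is unset lets the Tinker client default apply.
    (Kwarg names here are assumptions, not the Tinker client's actual API.)"""
    kwargs = {"api_key": os.environ["TINKER_API_KEY"]}
    base_url = os.environ.get("TINKER_BASE_URL")
    if base_url:
        kwargs["base_url"] = base_url
    return kwargs
```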