Training

Hosted training is modeled as a durable control plane, not as long-running work inside the API web process.

Each job is owned by an organization and workspace, then dispatched to a backend-specific worker.

The planned v1 job matrix is:

  • tinker + sft
  • tinker + rl
  • ray + rl

Supported algorithm types include grpo, ppo, and importance_sampling (depending on trainer type and backend).

The control plane owns job lifecycle, logs, metrics, artifacts, cancellation, and retries. Backend executors own the actual training work.

For Tinker today, that means:

  • the worker resolves visible refs and builds a sandbox payload
  • the worker provisions a training sandbox with DREADNODE_* and TINKER_* env vars
  • the sandbox runs python -m dreadnode.training.jobs --payload ... --result ...
  • the worker reads the structured result back into the control plane
  • Docker-backed training sandboxes use platform/docker/Dockerfile.training

Training jobs are workspace-scoped:

  • POST /api/v1/org/{org}/ws/{workspace}/training/jobs
  • GET /api/v1/org/{org}/ws/{workspace}/training/jobs
  • GET /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}
  • POST /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}/cancel
  • POST /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}/retry
  • GET /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}/logs
  • GET /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}/artifacts

The first implemented hosted RL path is intentionally narrow:

  • rollouts come from prompt datasets, not ad hoc uploaded prompts
  • Worlds trajectory datasets can also drive an offline RL baseline through trajectory_dataset_refs
  • Worlds manifests can now pre-sample native-agent trajectory datasets through world_manifest_id + world_runtime_id
  • policy scaffolding comes from a versioned capability_ref
  • rewards are resolved from named server-side recipes
  • task_ref is available during reward evaluation and instruction rendering
  • execution can be:
    • sync
    • one_step_off_async
    • fully_async

Current limitations:

  • the built-in task_verifier_v1 recipe only supports flag-based task verification
  • the current Tinker RL sandbox runtime is prompt-dataset driven, not full multi-turn task replay
  • the current Worlds offline RL baseline is still trajectory imitation, not native environment outcome optimization
  • the primary Worlds RL path now asks Worlds to generate native-agent trajectories and publish them as datasets first, then reuses the offline/async RL runtime over those published datasets
  • the older live Worlds HTTP rollout path remains available when a job supplies world_reward
  • the async modes are rollout-group schedulers:
    • one_step_off_async overlaps generation and training with one-step staleness
    • fully_async allows multiple queued rollout groups with bounded staleness via max_steps_off_policy
    • neither mode is a partial-rollout continuation runtime for long-horizon Worlds episodes yet
  • environment lifecycle and richer verifier execution still need follow-up work
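One way to picture the two async schedulers is as an admission check over queued rollout groups. This is an illustrative model, not the platform's scheduler: the staleness accounting and the exact boundary condition on max_steps_off_policy are assumptions:

```python
def can_start_rollout_group(mode: str, queued_groups: int, staleness: int,
                            max_steps_off_policy: int = 4) -> bool:
    """Decide whether another rollout group may start.

    `staleness` models how many trainer steps have passed since the acting
    policy snapshot; an illustrative sketch, not the real scheduler."""
    if mode == "sync":
        # generation and training strictly alternate
        return queued_groups == 0 and staleness == 0
    if mode == "one_step_off_async":
        # generation overlaps training with at most one-step staleness
        return queued_groups <= 1 and staleness <= 1
    if mode == "fully_async":
        # multiple queued groups allowed; staleness bounded by max_steps_off_policy
        return staleness <= max_steps_off_policy
    raise ValueError(f"unknown execution mode: {mode}")
```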

Hosted Tinker SFT now supports:

  • dataset-artifact backed conversation loading
  • Worlds trajectory dataset ETL into SFT conversations
  • prompt/answer row normalization into chat conversations
  • capability prompt injection as a system message scaffold
  • cross-entropy training, optional eval, and checkpoint persistence

Hosted training resolves references before execution instead of passing opaque strings directly into backend workers.

  • capability_ref is resolved at submission time and persisted onto the job as a capability snapshot
  • task_ref is resolved to an org-visible task definition before RL execution starts
  • dataset_ref and prompt_dataset_ref are resolved to org-visible dataset artifacts before SFT or prompt-dataset RL execution starts
  • trajectory_dataset_refs are validated on submission and resolved again when building the sandbox payload for Worlds-backed SFT and offline RL jobs
  • world_manifest_id is validated on submission and resolved again when building the sandbox payload for Worlds sampling or live-rollout RL jobs
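The validate-then-resolve pattern amounts to one lookup that runs twice: once at submission to fail fast, and again at payload-build time because visibility can change in between. The artifact shape below is an assumption for illustration:

```python
def resolve_dataset_ref(ref: dict, visible: dict[tuple[str, str], dict]) -> dict:
    """Resolve a { name, version } ref against org-visible dataset artifacts.

    Raises if the ref is not visible; callers run this at submission time
    and again when building the sandbox payload."""
    key = (ref["name"], ref["version"])
    artifact = visible.get(key)
    if artifact is None:
        raise LookupError(f"dataset ref not visible: {ref['name']}@{ref['version']}")
    return artifact
```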

Current ref conventions:

  • dataset refs are structured { name, version } objects with explicit versions
  • task refs can use name for the latest visible task or name@version for an explicit version

Policy, environment, and reward boundaries

Hosted RL jobs use three separate references:

  • capability_ref identifies the versioned policy scaffold
  • task_ref identifies the environment or task definition when the RL mode needs one
  • reward_recipe identifies the server-side reward or verification logic

That split keeps capability versioning explicit and avoids sending ad hoc agent_spec blobs over the API.

Training job creation uses typed request bodies instead of one shared config blob.

  • SFT requests carry dataset and LoRA-oriented settings in config
  • SFT requests can use either dataset_ref or trajectory_dataset_refs
  • RL requests carry prompt-dataset fields or trajectory_dataset_refs, plus reward settings and rollout controls in config
  • RL requests can also carry world_manifest_id, world_runtime_id, and optional world_agent_name to pre-sample native-agent Worlds datasets before training
  • RL requests can still carry world_reward for the older live-rollout bridge

prompt_dataset_ref is an RL concern and stays inside the RL config block rather than the common job envelope. For Worlds-driven offline RL, trajectory_dataset_refs can replace both prompt_dataset_ref and task_ref. For Worlds-native sampling, world_manifest_id plus world_runtime_id becomes the primary environment target. If world_reward is supplied, the job falls back to the older live-rollout path so reward shaping can still happen during rollout generation.

Self-hosted API deployments can run queued Tinker jobs inside the API process:

  • TRAINING_IN_PROCESS_WORKER_ENABLED
  • TRAINING_IN_PROCESS_WORKER_CONCURRENCY
  • TRAINING_IN_PROCESS_WORKER_POLL_INTERVAL_SEC
  • TRAINING_IN_PROCESS_WORKER_LEASE_SECONDS

Those workers still hand execution off to training sandboxes. Set these runtime env vars so the sandboxed job can talk to the right backends:

  • TINKER_BASE_URL
  • TINKER_API_KEY

If TINKER_BASE_URL is unset, the sandbox runtime falls back to the Tinker client default behavior.
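That fallback can be expressed as simply omitting the setting. The function name and kwargs shape below are assumptions, not the sandbox runtime's actual code:

```python
import os

def tinker_client_kwargs() -> dict:
    """Collect Tinker connection settings from the sandbox environment.

    Leaving base_url out when TINKER_BASE_URL is unset lets the Tinker
    client fall back to its own default endpoint."""
    kwargs = {}
    if base_url := os.environ.get("TINKER_BASE_URL"):
        kwargs["base_url"] = base_url
    if api_key := os.environ.get("TINKER_API_KEY"):
        kwargs["api_key"] = api_key
    return kwargs
```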