Training

Hosted training is modeled as a durable control plane, not as long-running work inside the API web process.

Each job is owned by an organization and workspace, then dispatched to a backend-specific worker.

The planned v1 job matrix is:

  • tinker + sft
  • tinker + rl
  • ray + rl

Supported algorithm types include grpo, ppo, and importance_sampling (depending on trainer type and backend).

The control plane owns job lifecycle, logs, metrics, artifacts, cancellation, and retries. Backend executors own the actual training work.

For Tinker today, that means:

  • the worker resolves visible refs and builds a sandbox payload
  • the worker provisions a training sandbox with DREADNODE_* and TINKER_* env vars
  • the sandbox runs python -m dreadnode.training.jobs --payload ... --result ...
  • the worker reads the structured result back into the control plane
  • Docker-backed training sandboxes use platform/docker/Dockerfile.training

Training jobs are workspace-scoped:

  • POST /api/v1/org/{org}/ws/{workspace}/training/jobs
  • GET /api/v1/org/{org}/ws/{workspace}/training/jobs
  • GET /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}
  • POST /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}/cancel
  • POST /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}/retry
  • GET /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}/logs
  • GET /api/v1/org/{org}/ws/{workspace}/training/jobs/{job_id}/artifacts

The first implemented hosted RL path is intentionally narrow:

  • rollouts come from prompt datasets, not ad hoc uploaded prompts
  • Worlds trajectory datasets can also drive an offline RL baseline through trajectory_dataset_refs
  • Worlds manifests can now pre-sample native-agent trajectory datasets through world_manifest_id + world_runtime_id
  • policy scaffolding comes from a versioned capability_ref
  • rewards are resolved from named server-side recipes
  • task_ref is available during reward evaluation and instruction rendering
  • execution can be:
    • sync
    • one_step_off_async
    • fully_async

Current limitations:

  • the built-in task_verifier_v1 recipe only supports flag-based task verification
  • the current Tinker RL sandbox runtime is prompt-dataset driven, not full multi-turn task replay
  • the current Worlds offline RL baseline is still trajectory imitation, not native environment outcome optimization
  • the primary Worlds RL path now asks Worlds to generate native-agent trajectories and publish them as datasets first, then reuses the offline/async RL runtime over those published datasets
  • the older live Worlds HTTP rollout path remains available when a job supplies world_reward
  • the async modes are rollout-group schedulers:
    • one_step_off_async overlaps generation and training with one-step staleness
    • fully_async allows multiple queued rollout groups with bounded staleness via max_steps_off_policy
    • neither mode is a partial-rollout continuation runtime for long-horizon Worlds episodes yet
  • environment lifecycle and richer verifier execution still need follow-up work
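One way to picture the two async schedulers is as an admission check over queued rollout groups. This is an illustrative model, not the platform's scheduler: the staleness accounting and the exact boundary condition on max_steps_off_policy are assumptions:

```python
def can_start_rollout_group(mode: str, queued_groups: int, staleness: int,
                            max_steps_off_policy: int = 4) -> bool:
    """Decide whether another rollout group may start.

    `staleness` models how many trainer steps have passed since the acting
    policy snapshot; an illustrative sketch, not the real scheduler."""
    if mode == "sync":
        # generation and training strictly alternate
        return queued_groups == 0 and staleness == 0
    if mode == "one_step_off_async":
        # generation overlaps training with at most one-step staleness
        return queued_groups <= 1 and staleness <= 1
    if mode == "fully_async":
        # multiple queued groups allowed; staleness bounded by max_steps_off_policy
        return staleness <= max_steps_off_policy
    raise ValueError(f"unknown execution mode: {mode}")
```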

Hosted Tinker SFT now supports:

  • dataset-artifact backed conversation loading
  • Worlds trajectory dataset ETL into SFT conversations
  • prompt/answer row normalization into chat conversations
  • capability prompt injection as a system message scaffold
  • cross-entropy training, optional eval, and checkpoint persistence

Hosted training resolves references before execution instead of passing opaque strings directly into backend workers.

  • capability_ref is resolved at submission time and persisted onto the job as a capability snapshot
  • task_ref is resolved to an org-visible task definition before RL execution starts
  • dataset_ref and prompt_dataset_ref are resolved to org-visible dataset artifacts before SFT or prompt-dataset RL execution starts
  • trajectory_dataset_refs are validated on submission and resolved again when building the sandbox payload for Worlds-backed SFT and offline RL jobs
  • world_manifest_id is validated on submission and resolved again when building the sandbox payload for Worlds sampling or live-rollout RL jobs
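The validate-then-resolve pattern amounts to one lookup that runs twice: once at submission to fail fast, and again at payload-build time because visibility can change in between. The artifact shape below is an assumption for illustration:

```python
def resolve_dataset_ref(ref: dict, visible: dict[tuple[str, str], dict]) -> dict:
    """Resolve a { name, version } ref against org-visible dataset artifacts.

    Raises if the ref is not visible; callers run this at submission time
    and again when building the sandbox payload."""
    key = (ref["name"], ref["version"])
    artifact = visible.get(key)
    if artifact is None:
        raise LookupError(f"dataset ref not visible: {ref['name']}@{ref['version']}")
    return artifact
```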

Current ref conventions:

  • dataset refs are structured { name, version } objects with explicit versions
  • task refs can use name for the latest visible task or name@version for an explicit version

Policy, environment, and reward boundaries

Hosted RL jobs use three separate references:

  • capability_ref identifies the versioned policy scaffold
  • task_ref identifies the environment or task definition when the RL mode needs one
  • reward_recipe identifies the server-side reward or verification logic

That split keeps capability versioning explicit and avoids sending ad hoc agent_spec blobs over the API.

Training job creation uses typed request bodies instead of one shared config blob.

  • SFT requests carry dataset and LoRA-oriented settings in config
  • SFT requests can use either dataset_ref or trajectory_dataset_refs
  • RL requests carry prompt-dataset fields or trajectory_dataset_refs, plus reward settings and rollout controls in config
  • RL requests can also carry world_manifest_id, world_runtime_id, and optional world_agent_name to pre-sample native-agent Worlds datasets before training
  • RL requests can still carry world_reward for the older live-rollout bridge

prompt_dataset_ref is an RL concern and stays inside the RL config block rather than the common job envelope. For Worlds-driven offline RL, trajectory_dataset_refs can replace both prompt_dataset_ref and task_ref. For Worlds-native sampling, world_manifest_id plus world_runtime_id becomes the primary environment target. If world_reward is supplied, the job falls back to the older live-rollout path so reward shaping can still happen during rollout generation.

Self-hosted API deployments can run queued Tinker jobs inside the API process:

  • TRAINING_IN_PROCESS_WORKER_ENABLED
  • TRAINING_IN_PROCESS_WORKER_CONCURRENCY
  • TRAINING_IN_PROCESS_WORKER_POLL_INTERVAL_SEC
  • TRAINING_IN_PROCESS_WORKER_LEASE_SECONDS

Those workers still hand execution off to training sandboxes. Set these runtime env vars so the sandboxed job can talk to the right backends:

  • TINKER_BASE_URL
  • TINKER_API_KEY

If TINKER_BASE_URL is unset, the sandbox runtime falls back to the Tinker client default behavior.
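That fallback can be expressed as simply omitting the setting. The function name and kwargs shape below are assumptions, not the sandbox runtime's actual code:

```python
import os

def tinker_client_kwargs() -> dict:
    """Collect Tinker connection settings from the sandbox environment.

    Leaving base_url out when TINKER_BASE_URL is unset lets the Tinker
    client fall back to its own default endpoint."""
    kwargs = {}
    if base_url := os.environ.get("TINKER_BASE_URL"):
        kwargs["base_url"] = base_url
    if api_key := os.environ.get("TINKER_API_KEY"):
        kwargs["api_key"] = api_key
    return kwargs
```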