# Training
## Submit SFT jobs

Use `dn train sft` to submit a hosted Tinker supervised fine-tuning job against an uploaded dataset and capability:

```sh
dn train sft \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --steps 100 \
  --wait \
  --json
```

You can also train directly from one or more published Worlds trajectory datasets:

```sh
dn train sft \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --steps 50
```

Common flags:
| Flag | Description |
|---|---|
| `--trajectory-dataset NAME@VERSION` | Worlds trajectory dataset input, repeatable |
| `--eval-dataset NAME@VERSION` | Optional evaluation dataset |
| `--batch-size <n>` | Per-step batch size |
| `--gradient-accumulation-steps <n>` | Gradient accumulation factor |
| `--learning-rate <float>` | Optimizer learning rate |
| `--checkpoint-interval <n>` | Save checkpoint every N steps |
| `--wait` | Poll until the hosted job reaches a terminal state |
| `--json` | Print the full job payload instead of a compact summary |
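As a sanity check on the batch flags above: the effective number of examples per optimizer update is the per-step batch size multiplied by the gradient accumulation factor. This is the standard relationship between the two settings, sketched in Python below — assumed, not confirmed, to be how the hosted sandbox combines them.

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int) -> int:
    """Examples consumed per optimizer update when gradients are
    accumulated over several forward/backward passes before stepping."""
    return batch_size * grad_accum_steps

# e.g. --batch-size 8 --gradient-accumulation-steps 4
print(effective_batch_size(8, 4))  # 32
```

This is useful when tuning `--learning-rate`, since the effective batch size, not the per-step value, is what usually matters for scaling decisions.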
## Submit RL jobs

Use `dn train rl` to submit a hosted Tinker reinforcement learning job:

```sh
dn train rl \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --algorithm importance_sampling \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --reward-recipe contains_v1 \
  --reward-params '{"needle":"flag"}'
```

For Worlds-driven offline RL, replace the prompt dataset input with one or more published trajectory datasets:

```sh
dn train rl \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --algorithm importance_sampling
```

When `dn train rl` runs from trajectory datasets without an explicit reward recipe, the sandbox
uses the trajectory_imitation_v1 baseline. That recipe only rewards completions that match the
recorded next assistant action, and it scales that reward by the published trajectory outcome
metadata from Worlds.
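To make that baseline concrete, here is a minimal Python sketch of the reward shape described above. The function name, signature, and exact-match rule are illustrative assumptions — the real `trajectory_imitation_v1` recipe runs server-side and is not shown here.

```python
def trajectory_imitation_reward(
    completion: str,
    recorded_action: str,
    outcome_score: float,
) -> float:
    """Illustrative sketch: reward only completions that match the
    recorded next assistant action, then scale by the trajectory's
    published outcome metadata from Worlds."""
    matched = 1.0 if completion.strip() == recorded_action.strip() else 0.0
    return matched * outcome_score

# A matching completion on a high-outcome trajectory earns a scaled reward:
print(trajectory_imitation_reward("run whoami", "run whoami", 0.9))  # 0.9
print(trajectory_imitation_reward("run id", "run whoami", 0.9))     # 0.0
```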
For Worlds-first RL, point the job at a manifest plus the runtime that should generate native agent trajectories. The control plane samples and publishes a Worlds dataset first, then the RL sandbox trains from that published dataset:
```sh
dn train rl \
  --server http://127.0.0.1:8000 \
  --api-key "$DREADNODE_API_KEY" \
  --organization dreadnode \
  --workspace localdev \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --world-manifest-id c8af2b7b-9b54-4b21-95a9-b8d403cd8c11 \
  --world-runtime-id 8b8fd3af-9a5e-47c8-9f67-7b87ca9387eb \
  --world-agent-name operator \
  --world-goal "Escalate to Domain Admin in corp.local" \
  --execution-mode fully_async \
  --max-steps-off-policy 3 \
  --num-rollouts 8
```

When `--world-runtime-id` is supplied, hosted RL treats Worlds-published native-agent datasets as
the primary input path. The selected runtime and capability generate trajectories in Worlds first,
and the training sandbox then reuses the existing offline/async RL runtime over the published
dataset.
If you also supply `--world-reward`, the job falls back to the older live-backend rollout bridge so
the SDK can apply that reward policy directly during rollout generation.
Common RL flags:
| Flag | Description |
|---|---|
| `--trajectory-dataset REF` | Worlds trajectory dataset input, repeatable |
| `--world-manifest-id ID` | Live Worlds manifest target for online RL |
| `--world-runtime-id ID` | Runtime used to sample native-agent Worlds trajectories |
| `--world-agent-name NAME` | Optional agent selected within the runtime capability |
| `--world-goal TEXT` | Optional goal override for the live Worlds rollout agent |
| `--world-reward NAME` | Named live Worlds reward policy |
| `--world-reward-params JSON` | JSON params passed to the selected Worlds reward policy |
| `--prompt-split <name>` | Prompt split selector inside the prompt dataset |
| `--execution-mode <mode>` | RL runtime mode: `sync`, `one_step_off_async`, or `fully_async` |
| `--steps <n>` | Number of optimization steps |
| `--num-rollouts <n>` | Number of rollouts per update |
| `--max-turns <n>` | Maximum turns per episode |
| `--max-episode-steps <n>` | Environment step limit |
| `--weight-sync-interval <n>` | Refresh sampler weights every N updates |
| `--max-steps-off-policy <n>` | Max rollout staleness for async RL; `one_step_off_async` requires 1 |
| `--stop <token>` | Add a stop token, repeatable |
Hosted Tinker RL now supports two async modes:
- `one_step_off_async` keeps one rollout group in flight and bounds staleness to one step
- `fully_async` widens the same pipeline to multiple queued rollout groups, with bounded staleness controlled by `--max-steps-off-policy`
- Both modes still operate on rollout groups, not partial in-flight episode continuation
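The staleness bound both modes rely on can be sketched as a simple version check. This is an illustrative model of the constraint `--max-steps-off-policy` expresses, not the sandbox's actual implementation:

```python
def rollout_is_fresh_enough(
    current_policy_step: int,
    rollout_policy_step: int,
    max_steps_off_policy: int,
) -> bool:
    """A rollout group sampled with older weights is only consumed if it
    lags the current policy by at most max_steps_off_policy updates.
    one_step_off_async corresponds to a bound of 1."""
    return current_policy_step - rollout_policy_step <= max_steps_off_policy

print(rollout_is_fresh_enough(10, 8, 3))  # True: two steps stale, bound is 3
print(rollout_is_fresh_enough(10, 8, 1))  # False under a one_step_off_async bound
```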
`--task` and `--prompt-dataset` are optional for Worlds-driven offline RL. `--world-manifest-id`
plus `--world-runtime-id` is the native-agent alternative when you want the control plane to
generate and publish fresh Worlds trajectories before training. `--world-reward` keeps the older
live-rollout bridge available when you explicitly want reward shaping during rollout generation.
## Inspect and manage jobs

The training subcommands also expose job management:

```sh
dn train get <job-id>
dn train wait <job-id> --json
dn train logs <job-id>
dn train artifacts <job-id>
dn train cancel <job-id> --json
```

`dn train wait` exits non-zero if the job finishes in `failed` or `cancelled`.
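That exit-code contract maps naturally onto terminal-state handling in scripts. The sketch below models the same idea in Python; the terminal-state names other than `failed` and `cancelled` (here `succeeded`) are an assumption, not confirmed by the CLI docs.

```python
TERMINAL_STATES = {"succeeded", "failed", "cancelled"}  # "succeeded" is assumed

def exit_code_for(state: str) -> int:
    """Mirror the dn train wait contract: zero on success, non-zero
    when the job ends in failed or cancelled."""
    if state not in TERMINAL_STATES:
        raise ValueError(f"{state!r} is not a terminal state")
    return 0 if state == "succeeded" else 1

print(exit_code_for("failed"))  # 1
```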
## Platform context

Hosted training commands require platform credentials plus an active organization and workspace. Pass them explicitly with flags or configure them in your SDK profile first:

```sh
dn configure
dn train get <job-id>
```