Training

Use dn train sft to submit a hosted Tinker supervised fine-tuning job against an uploaded dataset and capability:

Terminal window
dn train sft \
--server http://127.0.0.1:8000 \
--api-key "$DREADNODE_API_KEY" \
--organization dreadnode \
--workspace localdev \
--model meta-llama/Llama-3.1-8B-Instruct \
--capability [email protected] \
--dataset [email protected] \
--steps 100 \
--wait \
--json

You can also train directly from one or more published Worlds trajectory datasets:

Terminal window
dn train sft \
--server http://127.0.0.1:8000 \
--api-key "$DREADNODE_API_KEY" \
--organization dreadnode \
--workspace localdev \
--model meta-llama/Llama-3.1-8B-Instruct \
--capability [email protected] \
--trajectory-dataset dreadnode/[email protected] \
--trajectory-dataset dreadnode/[email protected] \
--steps 50

Common flags:

Flag                                  Description
--trajectory-dataset NAME@VERSION     Worlds trajectory dataset input; repeatable
--eval-dataset NAME@VERSION           Optional evaluation dataset
--batch-size <n>                      Per-step batch size
--gradient-accumulation-steps <n>     Gradient accumulation factor
--learning-rate <float>               Optimizer learning rate
--checkpoint-interval <n>             Save a checkpoint every N steps
--wait                                Poll until the hosted job reaches a terminal state
--json                                Print the full job payload instead of a compact summary
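As a quick sanity check on the flags above, the number of examples consumed per optimizer update is the per-step batch size multiplied by the gradient accumulation factor. This is standard gradient-accumulation semantics, not hosted-sandbox documentation, so treat the exact accounting as an assumption:

```python
def effective_batch_size(batch_size: int, grad_accum_steps: int) -> int:
    """Examples consumed per optimizer update under gradient accumulation."""
    return batch_size * grad_accum_steps

# e.g. --batch-size 8 --gradient-accumulation-steps 4
print(effective_batch_size(8, 4))  # → 32
```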

Use dn train rl to submit a hosted Tinker reinforcement learning job:

Terminal window
dn train rl \
--server http://127.0.0.1:8000 \
--api-key "$DREADNODE_API_KEY" \
--organization dreadnode \
--workspace localdev \
--model meta-llama/Llama-3.1-8B-Instruct \
--capability [email protected] \
--prompt-dataset [email protected] \
--algorithm importance_sampling \
--execution-mode fully_async \
--max-steps-off-policy 3 \
--reward-recipe contains_v1 \
--reward-params '{"needle":"flag"}'
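The contains_v1 recipe name and its needle parameter suggest a simple substring reward. A hypothetical sketch of that shape (the hosted implementation is not shown here, and it may score partial matches differently):

```python
def contains_reward(completion: str, needle: str = "flag") -> float:
    """Hypothetical substring reward in the spirit of contains_v1:
    1.0 when the needle appears anywhere in the completion, else 0.0."""
    return 1.0 if needle in completion else 0.0

print(contains_reward("the flag is hidden here"))  # → 1.0
print(contains_reward("no luck this time"))        # → 0.0
```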

For Worlds-driven offline RL, replace the prompt dataset input with one or more published trajectory datasets:

Terminal window
dn train rl \
--server http://127.0.0.1:8000 \
--api-key "$DREADNODE_API_KEY" \
--organization dreadnode \
--workspace localdev \
--model meta-llama/Llama-3.1-8B-Instruct \
--capability [email protected] \
--trajectory-dataset dreadnode/[email protected] \
--trajectory-dataset dreadnode/[email protected] \
--algorithm importance_sampling

When dn train rl runs from trajectory datasets without an explicit reward recipe, the sandbox uses the trajectory_imitation_v1 baseline. That recipe only rewards completions that match the recorded next assistant action, and it scales that reward by the published trajectory outcome metadata from Worlds.
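A minimal sketch of that baseline's shape, assuming an exact match against the recorded next assistant action and a scalar outcome score in the published trajectory metadata (both are assumptions; the hosted recipe may tokenize, normalize, or weight matches differently):

```python
def imitation_reward(completion: str, recorded_action: str, outcome_score: float) -> float:
    """Hypothetical shape of trajectory_imitation_v1: reward only completions
    that reproduce the recorded next assistant action, then scale that reward
    by the trajectory's published outcome score."""
    matched = 1.0 if completion.strip() == recorded_action.strip() else 0.0
    return matched * outcome_score

print(imitation_reward("run whoami", "run whoami", 0.8))  # → 0.8
print(imitation_reward("run id", "run whoami", 0.8))      # → 0.0
```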

For Worlds-first RL, point the job at a manifest plus the runtime that should generate native agent trajectories. The control plane samples and publishes a Worlds dataset first, then the RL sandbox trains from that published dataset:

Terminal window
dn train rl \
--server http://127.0.0.1:8000 \
--api-key "$DREADNODE_API_KEY" \
--organization dreadnode \
--workspace localdev \
--model meta-llama/Llama-3.1-8B-Instruct \
--capability dreadnode/[email protected] \
--world-manifest-id c8af2b7b-9b54-4b21-95a9-b8d403cd8c11 \
--world-runtime-id 8b8fd3af-9a5e-47c8-9f67-7b87ca9387eb \
--world-agent-name operator \
--world-goal "Escalate to Domain Admin in corp.local" \
--execution-mode fully_async \
--max-steps-off-policy 3 \
--num-rollouts 8

When --world-runtime-id is supplied, hosted RL treats Worlds-published native-agent datasets as the primary input path. The selected runtime and capability generate trajectories in Worlds first, and the training sandbox then reuses the existing offline/async RL runtime over the published dataset.

If you also supply --world-reward, the job falls back to the older live-backend rollout bridge so the SDK can apply that reward policy directly during rollout generation.

Common RL flags:

Flag                              Description
--trajectory-dataset REF          Worlds trajectory dataset input; repeatable
--world-manifest-id ID            Live Worlds manifest target for online RL
--world-runtime-id ID             Runtime used to sample native-agent Worlds trajectories
--world-agent-name NAME           Optional agent selected within the runtime capability
--world-goal TEXT                 Optional goal override for the live Worlds rollout agent
--world-reward NAME               Named live Worlds reward policy
--world-reward-params JSON        JSON params passed to the selected Worlds reward policy
--prompt-split <name>             Prompt split selector inside the prompt dataset
--execution-mode <mode>           RL runtime mode: sync, one_step_off_async, or fully_async
--steps <n>                       Number of optimization steps
--num-rollouts <n>                Number of rollouts per update
--max-turns <n>                   Maximum turns per episode
--max-episode-steps <n>           Environment step limit
--weight-sync-interval <n>        Refresh sampler weights every N updates
--max-steps-off-policy <n>        Max rollout staleness for async RL; one_step_off_async requires 1
--stop <token>                    Add a stop token; repeatable

Hosted Tinker RL supports two async modes:

  • one_step_off_async keeps one rollout group in flight and bounds staleness to one step
  • fully_async widens the same pipeline to multiple queued rollout groups, with staleness bounded by --max-steps-off-policy

Both modes still operate on whole rollout groups; neither continues partially completed in-flight episodes.
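The bounded-staleness rule above can be sketched as a simple admission check, assuming staleness is measured in optimizer steps between the weights that sampled a rollout group and the learner's current step (an illustration of the policy, not the sandbox's actual scheduler):

```python
def rollout_group_usable(current_step: int, sampled_at_step: int,
                         max_steps_off_policy: int) -> bool:
    """A rollout group remains trainable while its sampling weights lag the
    learner by at most max_steps_off_policy optimizer steps."""
    return current_step - sampled_at_step <= max_steps_off_policy

# one_step_off_async corresponds to max_steps_off_policy == 1
print(rollout_group_usable(10, 9, 1))  # → True
print(rollout_group_usable(10, 7, 3))  # → True
print(rollout_group_usable(10, 6, 3))  # → False
```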

--task and --prompt-dataset are optional for Worlds-driven offline RL. Supplying --world-manifest-id together with --world-runtime-id is the native-agent alternative when you want the control plane to generate and publish fresh Worlds trajectories before training, and --world-reward keeps the older live-rollout bridge available when you explicitly want reward shaping during rollout generation.

The training subcommands also expose job management:

Terminal window
dn train get <job-id>
dn train wait <job-id> --json
dn train logs <job-id>
dn train artifacts <job-id>
dn train cancel <job-id> --json

dn train wait exits non-zero if the job finishes in the failed or cancelled state.

Hosted training commands require platform credentials plus an active organization and workspace. Pass them explicitly with flags, or configure them in your SDK profile first:

Terminal window
dn configure
dn train get <job-id>