# Training
The Python SDK now includes typed models and `ApiClient` methods for the hosted training control plane.
## Python API client

Hosted training methods live on `dreadnode.app.api.client.ApiClient`:

- `create_training_job()`
- `list_training_jobs()`
- `get_training_job()`
- `cancel_training_job()`
- `retry_training_job()`
- `list_training_job_logs()`
- `get_training_job_artifacts()`
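As a usage sketch, these methods compose into a simple wait-for-completion loop. The snippet below uses a minimal in-memory stand-in for `ApiClient` so it runs standalone; the real client returns typed job models, and the status values shown are illustrative assumptions, not the API's actual status vocabulary.

```python
import time

class FakeClient:
    """Stand-in for ApiClient.get_training_job(); statuses are illustrative."""

    def __init__(self) -> None:
        self._statuses = iter(["pending", "running", "completed"])

    def get_training_job(self, org: str, project: str, job_id: str) -> dict:
        return {"id": job_id, "status": next(self._statuses)}

def wait_for_job(client, org: str, project: str, job_id: str, interval: float = 0.0) -> dict:
    """Poll get_training_job() until the job reaches a terminal status."""
    while True:
        job = client.get_training_job(org, project, job_id)
        if job["status"] in {"completed", "failed", "cancelled"}:
            return job
        time.sleep(interval)

final = wait_for_job(FakeClient(), "acme", "default", "job-123")
print(final["status"])  # completed
```

The same loop shape works against the real client once the status names are swapped for the SDK's own.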
## Typed request models

Use explicit request types for each backend and trainer combination:
```python
from dreadnode.app.api.client import ApiClient
from dreadnode.app.api.models import (
    CapabilityRef,
    CreateTinkerSFTJobRequest,
    DatasetRef,
    TinkerSFTJobConfig,
)

client = ApiClient("https://api.example.com", api_key="dn_...")

job = client.create_training_job(
    "acme",
    "default",
    CreateTinkerSFTJobRequest(
        model="meta-llama/Llama-3.1-8B-Instruct",
        capability_ref=CapabilityRef(name="assistant", version="1.2.0"),
        config=TinkerSFTJobConfig(
            dataset_ref=DatasetRef(name="acme/default", version="train"),
            batch_size=8,
            lora_rank=16,
        ),
    ),
)
```

For Worlds trajectory datasets, use `trajectory_dataset_refs` instead of a plain SFT dataset:
```python
job = client.create_training_job(
    "acme",
    "default",
    CreateTinkerSFTJobRequest(
        model="meta-llama/Llama-3.1-8B-Instruct",
        capability_ref=CapabilityRef(name="assistant", version="1.2.0"),
        config=TinkerSFTJobConfig(
            trajectory_dataset_refs=[
                DatasetRef(name="acme/worlds-trajectories-a", version="0.1.0"),
                DatasetRef(name="acme/worlds-trajectories-b", version="0.1.0"),
            ],
            batch_size=8,
        ),
    ),
)
```

For prompt-dataset RL jobs, `prompt_dataset_ref` is nested under the RL config payload:
```python
from dreadnode.app.api.models import (
    CapabilityRef,
    CreateTinkerRLJobRequest,
    DatasetRef,
    TinkerRLJobConfig,
)

request = CreateTinkerRLJobRequest(
    model="meta-llama/Llama-3.1-8B-Instruct",
    capability_ref=CapabilityRef(name="web-agent", version="2.0.1"),
    config=TinkerRLJobConfig(
        algorithm="importance_sampling",
        task_ref="security-mutillidae-sqli-login-bypass",
        prompt_dataset_ref=DatasetRef(name="seed-prompts", version="sqli-v1"),
        reward_recipe={"name": "task_verifier_v1"},
        execution_mode="fully_async",
        prompt_split="train",
        steps=10,
        max_steps_off_policy=3,
        num_rollouts=32,
        lora_rank=16,
        max_new_tokens=128,
        temperature=0.1,
        stop=["</answer>"],
    ),
)
```

For Worlds-driven offline RL, use `trajectory_dataset_refs` instead. In this mode the sandbox runtime converts each published trajectory into assistant-step prompt rows and defaults to `trajectory_imitation_v1` when no explicit reward recipe is supplied. The published Worlds dataset now carries trajectory outcome metadata, so matched steps inherit the recorded trajectory reward weight instead of using a flat imitation score:
```python
request = CreateTinkerRLJobRequest(
    model="meta-llama/Llama-3.1-8B-Instruct",
    capability_ref=CapabilityRef(name="worlds-agent", version="2.0.1"),
    config=TinkerRLJobConfig(
        algorithm="importance_sampling",
        trajectory_dataset_refs=[
            DatasetRef(name="acme/worlds-trajectories-a", version="0.1.0"),
            DatasetRef(name="acme/worlds-trajectories-b", version="0.1.0"),
        ],
        steps=10,
        num_rollouts=32,
        lora_rank=16,
    ),
)
```

For Worlds-first RL, use a live manifest plus a runtime id. The control plane samples native-agent Worlds trajectories first, publishes them as a dataset, and then runs the existing offline/async RL runtime against that published dataset:
```python
request = CreateTinkerRLJobRequest(
    model="meta-llama/Llama-3.1-8B-Instruct",
    capability_ref=CapabilityRef(name="worlds-agent", version="2.0.1"),
    config=TinkerRLJobConfig(
        algorithm="importance_sampling",
        world_manifest_id="c8af2b7b-9b54-4b21-95a9-b8d403cd8c11",
        world_runtime_id="8b8fd3af-9a5e-47c8-9f67-7b87ca9387eb",
        world_agent_name="operator",
        world_goal="Escalate to Domain Admin in corp.local",
        execution_mode="fully_async",
        max_steps_off_policy=3,
        num_rollouts=4,
        max_turns=8,
    ),
)
```

When `world_runtime_id` is present, hosted RL treats Worlds-published native-agent datasets as the primary input path. The selected runtime and capability generate trajectories inside Worlds, and the training sandbox then reuses the same offline/async RL runtime used for published trajectory datasets.
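The trajectory-weighted imitation reward described above can be sketched as follows. This is a hypothetical stand-in for exposition, not the SDK's `trajectory_imitation_v1` implementation: it assumes each published trajectory carries an outcome reward in its metadata and assigns that weight to every matched assistant step, falling back to a flat score when no outcome was recorded.

```python
FLAT_IMITATION_SCORE = 1.0  # fallback when no outcome metadata is present

def step_rewards(trajectory: dict) -> list[float]:
    """Give each matched assistant step the trajectory's recorded outcome
    reward, or a flat imitation score if none was published."""
    weight = trajectory.get("metadata", {}).get("reward", FLAT_IMITATION_SCORE)
    steps = [m for m in trajectory["messages"] if m["role"] == "assistant"]
    return [weight for _ in steps]

traj = {
    "metadata": {"reward": 2.0},  # recorded trajectory outcome
    "messages": [
        {"role": "user", "content": "enumerate hosts"},
        {"role": "assistant", "content": "nmap -sn 10.0.0.0/24"},
        {"role": "assistant", "content": "crackmapexec smb 10.0.0.5"},
    ],
}
print(step_rewards(traj))  # [2.0, 2.0]
```

A successful trajectory therefore pushes all of its steps harder than a neutral one, instead of every imitated step counting equally.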
If you need live rollout-time reward shaping, keep using `world_reward`. That preserves the older HTTP-backed live-rollout bridge instead of the new dataset-primary path:
```python
from dreadnode.app.api.models import WorldRewardPolicy

request = CreateTinkerRLJobRequest(
    model="meta-llama/Llama-3.1-8B-Instruct",
    capability_ref=CapabilityRef(name="worlds-agent", version="2.0.1"),
    config=TinkerRLJobConfig(
        algorithm="importance_sampling",
        world_manifest_id="c8af2b7b-9b54-4b21-95a9-b8d403cd8c11",
        world_goal="Escalate to Domain Admin in corp.local",
        world_reward=WorldRewardPolicy(
            name="goal_only_v1",
            params={"success_reward": 2.0},
        ),
    ),
)
```

The API validates refs on submission:
- dataset refs are structured objects with explicit `name` and `version`
- task refs can use `task-name` for the latest visible task, or a `task-name@version` reference for an explicit version
Current hosted SFT behavior:

- datasets can provide full `messages` conversations or simple prompt/answer rows
- Worlds trajectory datasets can be supplied through `trajectory_dataset_refs` and are converted from ATIF into SFT conversations inside the sandbox runtime
- capability prompts are injected as the system scaffold before tokenization
- eval runs if `eval_dataset_ref` is supplied
Current hosted RL limitations:

- `task_verifier_v1` currently supports flag-based verification only
- Tinker RL supports:
  - prompt-dataset RL
  - offline Worlds trajectory RL
  - runtime-driven Worlds manifest sampling into published native-agent datasets
  - the older live Worlds manifest bridge when `world_reward` is supplied
- hosted Tinker RL now supports:
  - `execution_mode="sync"`
  - `execution_mode="one_step_off_async"`
  - `execution_mode="fully_async"`
- the async modes are rollout-group schedulers:
  - `one_step_off_async` keeps one rollout group in flight and bounds staleness to one step
  - `fully_async` widens the queue to bounded multi-group async training using `max_steps_off_policy`
  - neither mode is a partial-rollout continuation runtime yet
- the primary Worlds RL path now pre-samples native-agent trajectories from the selected manifest and runtime, then trains from the published dataset
- live Worlds RL over HTTP is now the compatibility path used when you explicitly request a `world_reward`
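The staleness bound the two async modes enforce can be sketched as a simple admission check. This is a toy model for exposition, not the hosted scheduler: it assumes a rollout group is trainable only while the policy version that produced it lags the trainer by at most `max_steps_off_policy` optimizer steps (effectively one step for `one_step_off_async`).

```python
def can_admit_group(group_policy_step: int, trainer_step: int,
                    max_steps_off_policy: int) -> bool:
    """A rollout group may enter training only if the policy that
    generated it is at most max_steps_off_policy steps behind."""
    return trainer_step - group_policy_step <= max_steps_off_policy

# one_step_off_async: one group in flight, staleness bounded to a single step
print(can_admit_group(group_policy_step=4, trainer_step=5, max_steps_off_policy=1))  # True
# fully_async: multiple groups queued, bounded by max_steps_off_policy
print(can_admit_group(group_policy_step=2, trainer_step=5, max_steps_off_policy=3))  # True
print(can_admit_group(group_policy_step=1, trainer_step=5, max_steps_off_policy=3))  # False
```

Groups that fall outside the bound are regenerated from a fresher policy rather than trained on.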
For self-hosted workers, set `TINKER_BASE_URL` to point the executor at a non-default Tinker service endpoint.
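For example, assuming a hypothetical internal endpoint (the URL below is illustrative only):

```shell
# Point the executor at a self-hosted Tinker service
export TINKER_BASE_URL="https://tinker.internal.example.com"
```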
## Worlds RL rollout prototype

For agentic Worlds data collection, the SDK now includes a concrete rollout helper under `dreadnode.training.rollouts`. It wraps a normal SDK `Agent`, attaches reward/trace hooks, and returns a `RolloutResult` that can seed a later RL loop.
```python
from dreadnode import Agent
from dreadnode.training.rollouts import (
    CompositeWorldsRewardShaper,
    HostDiscoveryRewardShaper,
    ReasoningTraceRewardShaper,
    TerminalStateRewardShaper,
    run_worlds_agent_rollout,
)

agent = Agent(
    model="openai/gpt-5",
    instructions="Enumerate the AD environment and escalate toward Domain Admin.",
    tools=[...],
)

result = await run_worlds_agent_rollout(
    agent,
    "Enumerate the domain controller and gather credentials.",
    reward_shaper=CompositeWorldsRewardShaper(
        ReasoningTraceRewardShaper(value=0.05),
        HostDiscoveryRewardShaper(value=0.25),
        TerminalStateRewardShaper(success_reward=2.0),
    ),
)

print(result.final_reward)
print(result.metadata["turns"][0]["reasoning_content"])
```

This is still a prototype surface:

- rewards are attached through agent hooks and can be defined as composable shapers in `dreadnode.training.rollouts.worlds`
- the result is built from the SDK `Agent` event stream, not the current algorithmic Worlds walker
- it is intended as the basis for a later sandbox-backed Worlds RL loop
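The composition pattern behind `CompositeWorldsRewardShaper` can be sketched with minimal stand-ins. These classes are hypothetical and exist only for exposition, not the SDK's shapers; the sketch assumes each shaper maps a rollout event to a reward increment and that the composite sums its children.

```python
class FlagShaper:
    """Hypothetical shaper: fixed reward whenever its event kind matches."""

    def __init__(self, kind: str, value: float) -> None:
        self.kind = kind
        self.value = value

    def __call__(self, event: dict) -> float:
        return self.value if event.get("kind") == self.kind else 0.0

class CompositeShaper:
    """Sum the rewards of all child shapers for each event."""

    def __init__(self, *shapers) -> None:
        self.shapers = shapers

    def __call__(self, event: dict) -> float:
        return sum(shaper(event) for shaper in self.shapers)

shaper = CompositeShaper(
    FlagShaper("reasoning", 0.05),
    FlagShaper("host_discovered", 0.25),
)
events = [{"kind": "reasoning"}, {"kind": "host_discovered"}, {"kind": "noop"}]
print(round(sum(shaper(e) for e in events), 2))  # 0.3
```

Because each shaper is independent, new reward signals can be added or removed without touching the rollout loop itself.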
## Sandbox job entrypoints

The SDK now includes sandbox-facing payload and result contracts in `dreadnode.training.jobs`, plus a module entrypoint that a training sandbox can run directly:

```sh
python -m dreadnode.training.jobs \
  --payload /tmp/dreadnode-training/payloads/job-123.json \
  --result /tmp/dreadnode-training/results/job-123.json
```

That boundary is intentionally narrow:
- the API resolves refs and writes the payload JSON
- the sandbox runtime reads `DREADNODE_*` and `TINKER_*` env vars
- the SDK runtime executes the job and writes a structured result JSON back out
- the current job runtime supports:
  - hosted Tinker SFT
  - prompt-dataset Tinker RL in synchronous mode
  - prompt-dataset Tinker RL in one-step-off async mode
  - online Worlds-manifest Tinker RL in synchronous or one-step-off async mode
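The file boundary itself can be exercised with plain JSON. The field names below (`job_id`, `backend`, `status`, and so on) are illustrative assumptions, not the actual payload/result schemas defined in `dreadnode.training.jobs`.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    payload_path = Path(tmp) / "job-123.json"
    result_path = Path(tmp) / "job-123.result.json"

    # What the API side would write before launching the sandbox.
    payload_path.write_text(json.dumps({
        "job_id": "job-123",
        "backend": "tinker",
        "trainer": "sft",
        "model": "meta-llama/Llama-3.1-8B-Instruct",
    }))

    # What the sandbox-side runtime would write back when done.
    payload = json.loads(payload_path.read_text())
    result_path.write_text(json.dumps({"job_id": payload["job_id"],
                                       "status": "completed"}))

    result = json.loads(result_path.read_text())

print(result["status"])  # completed
```

Keeping the contract to two JSON files means the sandbox needs no network access back to the control plane during execution.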
## Worlds ETL helpers

The SDK also now includes reusable ETL helpers for converting Worlds ATIF trajectory datasets into chat-template style examples:

```python
from pathlib import Path

from dreadnode.training.etl import (
    convert_atif_trajectories_to_chat_template,
    load_atif_trajectories_jsonl,
)

trajectories = load_atif_trajectories_jsonl(Path("trajectories.atif.jsonl"))
examples = convert_atif_trajectories_to_chat_template(
    trajectories,
    tool_mode="command",
)

print(examples[0]["messages"][0]["role"])
print(examples[0]["tools"][0]["function"]["name"])
```

This is the reusable library path that hosted SFT jobs now use for published Worlds trajectory datasets. It replaces the old script-shaped conversion logic in `dreadnode.training.utils`.
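As a sketch of the loading side, JSONL trajectory files carry one JSON object per line. The stdlib loader below is a stand-in for `load_atif_trajectories_jsonl` and assumes nothing about the actual ATIF record schema; the `id` field is illustrative.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def load_jsonl(path: Path) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]

with TemporaryDirectory() as tmp:
    path = Path(tmp) / "trajectories.atif.jsonl"
    path.write_text('{"id": "t-1"}\n{"id": "t-2"}\n')
    records = load_jsonl(path)

print([r["id"] for r in records])  # ['t-1', 't-2']
```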
For hosted SFT preparation, the SDK also exposes reusable normalization helpers under `dreadnode.training.etl.sft` for turning dataset records into chat conversations with an optional injected system prompt.
## Local training (Ray)

Local training helpers live under `dreadnode.training` and wrap Ray-based trainers. These are useful for iterating on reward functions or fine-tuning on local hardware.

```python
from dreadnode.training import train_dpo, train_grpo, train_ppo, train_sft

def reward_fn(prompts: list[str], completions: list[str]) -> list[float]:
    return [0.0 for _ in completions]

train_sft({"model_name": "meta-llama/Llama-3.1-8B-Instruct"}, prompts=["hello"])
train_dpo({"model_name": "meta-llama/Llama-3.1-8B-Instruct"}, prompts=["hello"])
train_grpo({"model_name": "meta-llama/Llama-3.1-8B-Instruct"}, prompts=["hello"], reward_fn=reward_fn)
train_ppo({"model_name": "meta-llama/Llama-3.1-8B-Instruct"}, prompts=["hello"], reward_fn=reward_fn)
```

For more control, use the underlying trainers directly:

```python
from dreadnode.training import RayGRPOConfig, RayGRPOTrainer

config = RayGRPOConfig(model_name="meta-llama/Llama-3.1-8B-Instruct")
trainer = RayGRPOTrainer(config)
trainer.train(prompts=["hello"], reward_fn=lambda prompts, completions: [0.0])
```

## Cloud trainers
The SDK also exposes trainer classes for managed execution on cloud backends:

- `AnyscaleTrainer`
- `AzureMLTrainer`
- `SageMakerTrainer`
- `PrimeTrainer`

Import them from `dreadnode.training` alongside their corresponding config objects.