# Training
The Python SDK now includes typed models and `ApiClient` methods for the hosted training control plane.
## Python API client

Hosted training methods live on `dreadnode.app.api.client.ApiClient`:

- `create_training_job()`
- `list_training_jobs()`
- `get_training_job()`
- `cancel_training_job()`
- `retry_training_job()`
- `list_training_job_logs()`
- `get_training_job_artifacts()`
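As a usage sketch, these methods compose into a simple wait-for-completion loop. The snippet below uses a minimal in-memory stand-in for `ApiClient` so it runs standalone; the real client returns typed job models, and the status values shown are illustrative assumptions, not the API's actual status vocabulary.

```python
import time

class FakeClient:
    """Stand-in for ApiClient.get_training_job(); statuses are illustrative."""

    def __init__(self) -> None:
        self._statuses = iter(["pending", "running", "completed"])

    def get_training_job(self, org: str, project: str, job_id: str) -> dict:
        return {"id": job_id, "status": next(self._statuses)}

def wait_for_job(client, org: str, project: str, job_id: str, interval: float = 0.0) -> dict:
    """Poll get_training_job() until the job reaches a terminal status."""
    while True:
        job = client.get_training_job(org, project, job_id)
        if job["status"] in {"completed", "failed", "cancelled"}:
            return job
        time.sleep(interval)

final = wait_for_job(FakeClient(), "acme", "default", "job-123")
print(final["status"])  # completed
```

The same loop shape works against the real client once the status names are swapped for the SDK's own.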
## Typed request models

Use explicit request types for each backend and trainer combination:
```python
from dreadnode.app.api.client import ApiClient
from dreadnode.app.api.models import (
    CapabilityRef,
    CreateTinkerSFTJobRequest,
    DatasetRef,
    TinkerSFTJobConfig,
)

client = ApiClient("https://api.example.com", api_key="dn_...")

job = client.create_training_job(
    "acme",
    "default",
    CreateTinkerSFTJobRequest(
        model="meta-llama/Llama-3.1-8B-Instruct",
        capability_ref=CapabilityRef(name="assistant", version="1.2.0"),
        config=TinkerSFTJobConfig(
            dataset_ref=DatasetRef(name="acme/default", version="train"),
            batch_size=8,
            lora_rank=16,
        ),
    ),
)
```

For Worlds trajectory datasets, use `trajectory_dataset_refs` instead of a plain SFT dataset:
```python
job = client.create_training_job(
    "acme",
    "default",
    CreateTinkerSFTJobRequest(
        model="meta-llama/Llama-3.1-8B-Instruct",
        capability_ref=CapabilityRef(name="assistant", version="1.2.0"),
        config=TinkerSFTJobConfig(
            trajectory_dataset_refs=[
                DatasetRef(name="acme/worlds-trajectories-a", version="0.1.0"),
                DatasetRef(name="acme/worlds-trajectories-b", version="0.1.0"),
            ],
            batch_size=8,
        ),
    ),
)
```

For prompt-dataset RL jobs, `prompt_dataset_ref` is nested under the RL config payload:
```python
from dreadnode.app.api.models import (
    CapabilityRef,
    CreateTinkerRLJobRequest,
    DatasetRef,
    TinkerRLJobConfig,
)

request = CreateTinkerRLJobRequest(
    model="meta-llama/Llama-3.1-8B-Instruct",
    capability_ref=CapabilityRef(name="web-agent", version="2.0.1"),
    config=TinkerRLJobConfig(
        algorithm="importance_sampling",
        task_ref="security-mutillidae-sqli-login-bypass",
        prompt_dataset_ref=DatasetRef(name="seed-prompts", version="sqli-v1"),
        reward_recipe={"name": "task_verifier_v1"},
        execution_mode="fully_async",
        prompt_split="train",
        steps=10,
        max_steps_off_policy=3,
        num_rollouts=32,
        lora_rank=16,
        max_new_tokens=128,
        temperature=0.1,
        stop=["</answer>"],
    ),
)
```

For Worlds-driven offline RL, use `trajectory_dataset_refs` instead. In this mode the sandbox runtime converts each published trajectory into assistant-step prompt rows and defaults to `trajectory_imitation_v1` when no explicit reward recipe is supplied. The published Worlds dataset now carries trajectory outcome metadata, so matched steps inherit the recorded trajectory reward weight instead of using a flat imitation score:
```python
request = CreateTinkerRLJobRequest(
    model="meta-llama/Llama-3.1-8B-Instruct",
    capability_ref=CapabilityRef(name="worlds-agent", version="2.0.1"),
    config=TinkerRLJobConfig(
        algorithm="importance_sampling",
        trajectory_dataset_refs=[
            DatasetRef(name="acme/worlds-trajectories-a", version="0.1.0"),
            DatasetRef(name="acme/worlds-trajectories-b", version="0.1.0"),
        ],
        steps=10,
        num_rollouts=32,
        lora_rank=16,
    ),
)
```

For Worlds-first RL, use a live manifest plus a runtime id. The control plane samples native-agent Worlds trajectories first, publishes them as a dataset, and then runs the existing offline/async RL runtime against that published dataset:
```python
request = CreateTinkerRLJobRequest(
    model="meta-llama/Llama-3.1-8B-Instruct",
    capability_ref=CapabilityRef(name="worlds-agent", version="2.0.1"),
    config=TinkerRLJobConfig(
        algorithm="importance_sampling",
        world_manifest_id="c8af2b7b-9b54-4b21-95a9-b8d403cd8c11",
        world_runtime_id="8b8fd3af-9a5e-47c8-9f67-7b87ca9387eb",
        world_agent_name="operator",
        world_goal="Escalate to Domain Admin in corp.local",
        execution_mode="fully_async",
        max_steps_off_policy=3,
        num_rollouts=4,
        max_turns=8,
    ),
)
```

When `world_runtime_id` is present, hosted RL treats Worlds-published native-agent datasets as the primary input path. The selected runtime and capability generate trajectories inside Worlds, and the training sandbox then reuses the same offline/async RL runtime used for published trajectory datasets.
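The trajectory-weighted imitation reward described above can be sketched as follows. This is a hypothetical stand-in for exposition, not the SDK's `trajectory_imitation_v1` implementation: it assumes each published trajectory carries an outcome reward in its metadata and assigns that weight to every matched assistant step, falling back to a flat score when no outcome was recorded.

```python
FLAT_IMITATION_SCORE = 1.0  # fallback when no outcome metadata is present

def step_rewards(trajectory: dict) -> list[float]:
    """Give each matched assistant step the trajectory's recorded outcome
    reward, or a flat imitation score if none was published."""
    weight = trajectory.get("metadata", {}).get("reward", FLAT_IMITATION_SCORE)
    steps = [m for m in trajectory["messages"] if m["role"] == "assistant"]
    return [weight for _ in steps]

traj = {
    "metadata": {"reward": 2.0},  # recorded trajectory outcome
    "messages": [
        {"role": "user", "content": "enumerate hosts"},
        {"role": "assistant", "content": "nmap -sn 10.0.0.0/24"},
        {"role": "assistant", "content": "crackmapexec smb 10.0.0.5"},
    ],
}
print(step_rewards(traj))  # [2.0, 2.0]
```

A successful trajectory therefore pushes all of its steps harder than a neutral one, instead of every imitated step counting equally.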
If you need live rollout-time reward shaping, keep using `world_reward`. That preserves the older HTTP-backed live-rollout bridge instead of the new dataset-primary path:
```python
from dreadnode.app.api.models import WorldRewardPolicy

request = CreateTinkerRLJobRequest(
    model="meta-llama/Llama-3.1-8B-Instruct",
    capability_ref=CapabilityRef(name="worlds-agent", version="2.0.1"),
    config=TinkerRLJobConfig(
        algorithm="importance_sampling",
        world_manifest_id="c8af2b7b-9b54-4b21-95a9-b8d403cd8c11",
        world_goal="Escalate to Domain Admin in corp.local",
        world_reward=WorldRewardPolicy(
            name="goal_only_v1",
            params={"success_reward": 2.0},
        ),
    ),
)
```

The API validates refs on submission:
- dataset refs are structured objects with explicit `name` and `version`
- task refs can use `task-name` for the latest visible task, or a `task-name@version` reference for an explicit version
Current hosted SFT behavior:

- datasets can provide full `messages` conversations or simple prompt/answer rows
- Worlds trajectory datasets can be supplied through `trajectory_dataset_refs` and are converted from ATIF into SFT conversations inside the sandbox runtime
- capability prompts are injected as the system scaffold before tokenization
- eval runs if `eval_dataset_ref` is supplied
Current hosted RL limitations:

- `task_verifier_v1` currently supports flag-based verification only
- Tinker RL supports:
  - prompt-dataset RL
  - offline Worlds trajectory RL
  - runtime-driven Worlds manifest sampling into published native-agent datasets
  - the older live Worlds manifest bridge when `world_reward` is supplied
- hosted Tinker RL now supports:
  - `execution_mode="sync"`
  - `execution_mode="one_step_off_async"`
  - `execution_mode="fully_async"`
- the async modes are rollout-group schedulers:
  - `one_step_off_async` keeps one rollout group in flight and bounds staleness to one step
  - `fully_async` widens the queue to bounded multi-group async training using `max_steps_off_policy`
  - neither mode is a partial-rollout continuation runtime yet
- the primary Worlds RL path now pre-samples native-agent trajectories from the selected manifest and runtime, then trains from the published dataset
- live Worlds RL over HTTP is now the compatibility path used when you explicitly request a `world_reward`
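The staleness bound the two async modes enforce can be sketched as a simple admission check. This is a toy model for exposition, not the hosted scheduler: it assumes a rollout group is trainable only while the policy version that produced it lags the trainer by at most `max_steps_off_policy` optimizer steps (effectively one step for `one_step_off_async`).

```python
def can_admit_group(group_policy_step: int, trainer_step: int,
                    max_steps_off_policy: int) -> bool:
    """A rollout group may enter training only if the policy that
    generated it is at most max_steps_off_policy steps behind."""
    return trainer_step - group_policy_step <= max_steps_off_policy

# one_step_off_async: one group in flight, staleness bounded to a single step
print(can_admit_group(group_policy_step=4, trainer_step=5, max_steps_off_policy=1))  # True
# fully_async: multiple groups queued, bounded by max_steps_off_policy
print(can_admit_group(group_policy_step=2, trainer_step=5, max_steps_off_policy=3))  # True
print(can_admit_group(group_policy_step=1, trainer_step=5, max_steps_off_policy=3))  # False
```

Groups that fall outside the bound are regenerated from a fresher policy rather than trained on.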
For self-hosted workers, set `TINKER_BASE_URL` to point the executor at a non-default Tinker service endpoint.
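For example, assuming a hypothetical internal endpoint (the URL below is illustrative only):

```shell
# Point the executor at a self-hosted Tinker service
export TINKER_BASE_URL="https://tinker.internal.example.com"
```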
## Worlds RL rollout prototype

For agentic Worlds data collection, the SDK now includes a concrete rollout helper under `dreadnode.training.rollouts`. It wraps a normal SDK `Agent`, attaches reward/trace hooks, and returns a `RolloutResult` that can seed a later RL loop.
```python
from dreadnode import Agent
from dreadnode.training.rollouts import (
    CompositeWorldsRewardShaper,
    HostDiscoveryRewardShaper,
    ReasoningTraceRewardShaper,
    TerminalStateRewardShaper,
    run_worlds_agent_rollout,
)

agent = Agent(
    model="openai/gpt-5",
    instructions="Enumerate the AD environment and escalate toward Domain Admin.",
    tools=[...],
)

result = await run_worlds_agent_rollout(
    agent,
    "Enumerate the domain controller and gather credentials.",
    reward_shaper=CompositeWorldsRewardShaper(
        ReasoningTraceRewardShaper(value=0.05),
        HostDiscoveryRewardShaper(value=0.25),
        TerminalStateRewardShaper(success_reward=2.0),
    ),
)

print(result.final_reward)
print(result.metadata["turns"][0]["reasoning_content"])
```

This is still a prototype surface:

- rewards are attached through agent hooks and can be defined as composable shapers in `dreadnode.training.rollouts.worlds`
- the result is built from the SDK `Agent` event stream, not the current algorithmic Worlds walker
- it is intended as the basis for a later sandbox-backed Worlds RL loop
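The composition pattern behind `CompositeWorldsRewardShaper` can be sketched with minimal stand-ins. These classes are hypothetical and exist only for exposition, not the SDK's shapers; the sketch assumes each shaper maps a rollout event to a reward increment and that the composite sums its children.

```python
class FlagShaper:
    """Hypothetical shaper: fixed reward whenever its event kind matches."""

    def __init__(self, kind: str, value: float) -> None:
        self.kind = kind
        self.value = value

    def __call__(self, event: dict) -> float:
        return self.value if event.get("kind") == self.kind else 0.0

class CompositeShaper:
    """Sum the rewards of all child shapers for each event."""

    def __init__(self, *shapers) -> None:
        self.shapers = shapers

    def __call__(self, event: dict) -> float:
        return sum(shaper(event) for shaper in self.shapers)

shaper = CompositeShaper(
    FlagShaper("reasoning", 0.05),
    FlagShaper("host_discovered", 0.25),
)
events = [{"kind": "reasoning"}, {"kind": "host_discovered"}, {"kind": "noop"}]
print(round(sum(shaper(e) for e in events), 2))  # 0.3
```

Because each shaper is independent, new reward signals can be added or removed without touching the rollout loop itself.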
## Sandbox job entrypoints

The SDK now includes sandbox-facing payload and result contracts in `dreadnode.training.jobs`, plus a module entrypoint that a training sandbox can run directly:

```sh
python -m dreadnode.training.jobs \
  --payload /tmp/dreadnode-training/payloads/job-123.json \
  --result /tmp/dreadnode-training/results/job-123.json
```

That boundary is intentionally narrow:
- the API resolves refs and writes the payload JSON
- the sandbox runtime reads `DREADNODE_*` and `TINKER_*` env vars
- the SDK runtime executes the job and writes a structured result JSON back out
- the current job runtime supports:
  - hosted Tinker SFT
  - prompt-dataset Tinker RL in synchronous mode
  - prompt-dataset Tinker RL in one-step-off async mode
  - online Worlds-manifest Tinker RL in synchronous or one-step-off async mode
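The file boundary itself can be exercised with plain JSON. The field names below (`job_id`, `backend`, `status`, and so on) are illustrative assumptions, not the actual payload/result schemas defined in `dreadnode.training.jobs`.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

with TemporaryDirectory() as tmp:
    payload_path = Path(tmp) / "job-123.json"
    result_path = Path(tmp) / "job-123.result.json"

    # What the API side would write before launching the sandbox.
    payload_path.write_text(json.dumps({
        "job_id": "job-123",
        "backend": "tinker",
        "trainer": "sft",
        "model": "meta-llama/Llama-3.1-8B-Instruct",
    }))

    # What the sandbox-side runtime would write back when done.
    payload = json.loads(payload_path.read_text())
    result_path.write_text(json.dumps({"job_id": payload["job_id"],
                                       "status": "completed"}))

    result = json.loads(result_path.read_text())

print(result["status"])  # completed
```

Keeping the contract to two JSON files means the sandbox needs no network access back to the control plane during execution.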
## Worlds ETL helpers

The SDK also now includes reusable ETL helpers for converting Worlds ATIF trajectory datasets into chat-template style examples:

```python
from pathlib import Path

from dreadnode.training.etl import (
    convert_atif_trajectories_to_chat_template,
    load_atif_trajectories_jsonl,
)

trajectories = load_atif_trajectories_jsonl(Path("trajectories.atif.jsonl"))
examples = convert_atif_trajectories_to_chat_template(
    trajectories,
    tool_mode="command",
)

print(examples[0]["messages"][0]["role"])
print(examples[0]["tools"][0]["function"]["name"])
```

This is the reusable library path that hosted SFT jobs now use for published Worlds trajectory datasets. It replaces the old script-shaped conversion logic in `dreadnode.training.utils`.
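As a sketch of the loading side, JSONL trajectory files carry one JSON object per line. The stdlib loader below is a stand-in for `load_atif_trajectories_jsonl` and assumes nothing about the actual ATIF record schema; the `id` field is illustrative.

```python
import json
from pathlib import Path
from tempfile import TemporaryDirectory

def load_jsonl(path: Path) -> list[dict]:
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in path.read_text().splitlines() if line.strip()]

with TemporaryDirectory() as tmp:
    path = Path(tmp) / "trajectories.atif.jsonl"
    path.write_text('{"id": "t-1"}\n{"id": "t-2"}\n')
    records = load_jsonl(path)

print([r["id"] for r in records])  # ['t-1', 't-2']
```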
For hosted SFT preparation, the SDK also exposes reusable normalization helpers under `dreadnode.training.etl.sft` for turning dataset records into chat conversations with an optional injected system prompt.
## Local training (Ray)

Local training helpers live under `dreadnode.training` and wrap Ray-based trainers. These are useful for iterating on reward functions or fine-tuning on local hardware.

```python
from dreadnode.training import train_dpo, train_grpo, train_ppo, train_sft

def reward_fn(prompts: list[str], completions: list[str]) -> list[float]:
    return [0.0 for _ in completions]

train_sft({"model_name": "meta-llama/Llama-3.1-8B-Instruct"}, prompts=["hello"])
train_dpo({"model_name": "meta-llama/Llama-3.1-8B-Instruct"}, prompts=["hello"])
train_grpo({"model_name": "meta-llama/Llama-3.1-8B-Instruct"}, prompts=["hello"], reward_fn=reward_fn)
train_ppo({"model_name": "meta-llama/Llama-3.1-8B-Instruct"}, prompts=["hello"], reward_fn=reward_fn)
```

For more control, use the underlying trainers directly:

```python
from dreadnode.training import RayGRPOConfig, RayGRPOTrainer

config = RayGRPOConfig(model_name="meta-llama/Llama-3.1-8B-Instruct")
trainer = RayGRPOTrainer(config)
trainer.train(prompts=["hello"], reward_fn=lambda prompts, completions: [0.0])
```

## Cloud trainers
The SDK also exposes trainer classes for managed execution on cloud backends:

- `AnyscaleTrainer`
- `AzureMLTrainer`
- `SageMakerTrainer`
- `PrimeTrainer`

Import them from `dreadnode.training` alongside their corresponding config objects.