Hyperparameters

Guide to tuning hyperparameters for optimal training.

Config Files

All hyperparameters are defined in TOML config files under configs/<species>/. Each species has three stage configs:

configs/
├── velociraptor/
│   ├── stage1_balance.toml
│   ├── stage2_locomotion.toml
│   └── stage3_strike.toml
├── trex/
│   ├── stage1_balance.toml
│   ├── stage2_locomotion.toml
│   └── stage3_bite.toml
└── brachiosaurus/
    ├── stage1_balance.toml
    ├── stage2_locomotion.toml
    └── stage3_food_reach.toml

Each TOML file contains [ppo], [sac], [env], and [curriculum] sections.

Per-Stage Hyperparameters

Each stage has its own [ppo] and [sac] sections. When the curriculum command advances to the next stage it re-initialises the model with that stage's hyperparameters automatically. The values follow a deliberate progression across all three species:

| Hyperparameter | Stage 1 (Balance) | Stage 2 (Locomotion) | Stage 3 (Behavior) | Rationale |
| --- | --- | --- | --- | --- |
| learning_rate | 3e-4 | 1e-4 | 5e-5 | Coarser search early, fine-tune for complex behaviour |
| ent_coef (PPO) | 0.005–0.01 | 0.005–0.01 | 0.001 | High exploration for balance, exploit for strike/bite/food |
| clip_range (PPO) | 0.2 | 0.2 | 0.1 | Conservative updates for sparse terminal rewards |
| gamma | 0.99–0.998 | 0.99 | 0.995 | More farsighted when rewards are sparse |
| n_steps (PPO) | 2048–4096 | 2048–4096 | 4096 | Larger rollout buffer for complex behaviour |
| batch_size | 64–256 | 128–256 | 256 | Larger batches for later stages |

The env reward weights also shift significantly between stages — that is the core mechanic of curriculum learning:

| Reward component | Stage 1 | Stage 2 | Stage 3 |
| --- | --- | --- | --- |
| alive_bonus | High (1.0–2.0) | Moderate (0.5–1.0) | Low (0.1) |
| forward_vel_weight | 0.0 | High (1.0–2.0) | Moderate (1.0) |
| Behavior bonus (strike/bite/food) | 0.0 | 0.0 | High (10–500) |
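To see how the same behaviour is scored differently per stage, here is a rough sketch that treats the reward as a weighted sum of three terms. The real environment reward has more components and may combine them differently. `behavior_bonus` is a hypothetical stand-in for the strike/bite/food bonus, and the weights are example values picked from the ranges in the table.

```python
# Example weights per stage, picked from the ranges in the table above.
STAGE_WEIGHTS = {
    1: {"alive_bonus": 1.0, "forward_vel_weight": 0.0, "behavior_bonus": 0.0},
    2: {"alive_bonus": 0.5, "forward_vel_weight": 1.0, "behavior_bonus": 0.0},
    3: {"alive_bonus": 0.1, "forward_vel_weight": 1.0, "behavior_bonus": 100.0},
}


def step_reward(stage: int, forward_vel: float, behavior_event: bool) -> float:
    """Simplified per-step reward: alive term + velocity term + one-off behaviour bonus."""
    w = STAGE_WEIGHTS[stage]
    reward = w["alive_bonus"] + w["forward_vel_weight"] * forward_vel
    if behavior_event:
        reward += w["behavior_bonus"]
    return reward


# The same step (moving at 1.0 m/s, no terminal event) scored under each stage.
for stage in (1, 2, 3):
    print(stage, step_reward(stage, forward_vel=1.0, behavior_event=False))
```

In this simplified model, standing still and walking score the same in Stage 1, walking pays off from Stage 2 onward, and only Stage 3 rewards the terminal event.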

PPO Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| learning_rate | 3e-4 | Network learning rate (3e-4 → 1e-4 → 5e-5 across stages) |
| learning_rate_end | (optional) | If set, enables a linear decay schedule from learning_rate to this value |
| n_steps | 2048–4096 | Steps per rollout buffer (larger in later stages) |
| batch_size | 64–256 | Minibatch size for gradient updates |
| n_epochs | 10 | Number of epochs per PPO update |
| gamma | 0.99–0.998 | Discount factor (higher in stage 3 for sparse rewards) |
| gae_lambda | 0.95 | GAE lambda for advantage estimation |
| clip_range | 0.2 → 0.1 | PPO surrogate objective clip range (tightened in stage 3) |
| ent_coef | 0.001–0.01 | Entropy bonus (higher early, lower for fine-grained behaviour) |
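The linear decay enabled by learning_rate_end can be sketched as below. SB3 accepts a callable for learning_rate and calls it with `progress_remaining`, which runs from 1.0 at the start of training to 0.0 at the end. How this repo actually builds the schedule is an assumption here, and `linear_schedule` is a hypothetical helper name.

```python
from typing import Callable


def linear_schedule(start: float, end: float) -> Callable[[float], float]:
    """Return a schedule that decays linearly from `start` to `end`.

    SB3 passes progress_remaining, which goes from 1.0 (start) to 0.0 (end).
    """

    def schedule(progress_remaining: float) -> float:
        return end + (start - end) * progress_remaining

    return schedule


# Example: decay from learning_rate=3e-4 to learning_rate_end=3e-5.
lr = linear_schedule(3e-4, 3e-5)
for progress in (1.0, 0.5, 0.0):
    print(progress, lr(progress))
```

The returned callable could then be passed as `learning_rate=lr` when constructing an SB3 model.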

SAC Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| learning_rate | 3e-4 → 1e-4 → 5e-5 | Network learning rate (same per-stage progression as PPO) |
| batch_size | 256 | Training batch size |
| gamma | 0.99–0.995 | Discount factor |
| tau | 0.005 | Soft target update coefficient |
| ent_coef | "auto" | Entropy coefficient (auto-tuned throughout) |

Curriculum Thresholds

Stage transitions are controlled by the [curriculum] section in each config:

| Parameter | Default | Description |
| --- | --- | --- |
| timesteps | 4000000–8000000 | Maximum timesteps per stage before auto-advancing |
| min_avg_reward | 50–100 | Minimum average reward to advance early |
| min_avg_episode_length | 300–800 | Minimum average episode length to advance early |
| required_consecutive | 3 | Consecutive evaluations above threshold |

Both min_avg_reward and min_avg_episode_length must be exceeded for required_consecutive evaluations before a stage advances early. If the timesteps budget runs out first, the stage advances anyway.
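The advancement rule can be sketched as a small gate object. This is an illustration of the logic described above, not the repo's implementation. `StageGate` and its method name are hypothetical, and the reset-on-miss behaviour is an assumption implied by "consecutive".

```python
from dataclasses import dataclass


@dataclass
class StageGate:
    """Hypothetical helper mirroring the [curriculum] thresholds."""

    min_avg_reward: float
    min_avg_episode_length: float
    required_consecutive: int
    timesteps: int
    streak: int = 0  # consecutive evaluations that exceeded both thresholds

    def should_advance(self, avg_reward: float, avg_episode_length: float,
                       timesteps_done: int) -> bool:
        passed = (avg_reward > self.min_avg_reward
                  and avg_episode_length > self.min_avg_episode_length)
        # Assumption: a single failed evaluation resets the streak.
        self.streak = self.streak + 1 if passed else 0
        early = self.streak >= self.required_consecutive
        budget_exhausted = timesteps_done >= self.timesteps
        return early or budget_exhausted


gate = StageGate(min_avg_reward=50, min_avg_episode_length=300,
                 required_consecutive=3, timesteps=4_000_000)
evaluations = [(60, 400), (60, 200), (70, 350), (80, 500), (90, 600)]
for i, (reward, length) in enumerate(evaluations, start=1):
    print(i, gate.should_advance(reward, length, timesteps_done=i * 100_000))
```

In this run the second evaluation misses the episode-length threshold, so the streak restarts and the stage advances only at the fifth evaluation.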

Overriding Hyperparameters from the CLI

Use --override to change TOML values without editing files — useful for hyperparameter sweeps. Keys use dot notation; values are auto-cast to int, float, or str:

# Override learning rate, entropy coefficient, and alive bonus for a stage 1 run
python scripts/train_sb3.py train --stage 1 \
    --override ppo.learning_rate=1e-3 ppo.ent_coef=0.02 env.alive_bonus=5.0

# Works with curriculum too: applies to ALL stages
python scripts/train_sb3.py curriculum \
    --override ppo.learning_rate=2e-4

Supported key prefixes:

| Prefix | Overrides |
| --- | --- |
| ppo.X | ppo_kwargs[X] |
| sac.X | sac_kwargs[X] |
| env.X | env_kwargs[X] (reward weights, episode settings) |

For stage-scoped overrides within a curriculum run, prefix with the stage number: 1.ppo.learning_rate=3e-4 2.ppo.learning_rate=1e-4. Plain section.key=value still applies to all stages.
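The sketch below shows one way such override strings could be interpreted: an optional stage prefix, a `section.key` path, and a value cast to int, then float, then left as str. It only illustrates the rules described above. The actual parser in scripts/train_sb3.py may differ, and `parse_override` and `cast_value` are hypothetical names.

```python
SECTIONS = ("ppo", "sac", "env")


def cast_value(raw: str):
    """Try int, then float, and fall back to the raw string."""
    for caster in (int, float):
        try:
            return caster(raw)
        except ValueError:
            continue
    return raw


def parse_override(item: str):
    """Split 'N.section.key=value' into (stage, section, key, value).

    stage is None when the override applies to all stages.
    """
    path, sep, raw = item.partition("=")
    if not sep:
        raise ValueError(f"expected key=value, got {item!r}")
    parts = path.split(".")
    stage = None
    if parts[0].isdigit():
        stage = int(parts[0])
        parts = parts[1:]
    if len(parts) != 2 or parts[0] not in SECTIONS:
        raise ValueError(f"unsupported override key: {path!r}")
    section, key = parts
    return stage, section, key, cast_value(raw)


for item in ("ppo.learning_rate=1e-3", "2.ppo.learning_rate=1e-4",
             "ppo.n_steps=4096", "sac.ent_coef=auto"):
    print(parse_override(item))
```

Trying int before float keeps values like 4096 as integers, which matters for parameters such as n_steps and batch_size.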

Systematic sweeps: To try many combinations automatically, use the Vertex AI HPT sweep tool. See Hyperparameter Sweeps.

Tips

  1. Start with defaults — The TOML configs are tuned for each species
  2. Use --algorithm sac — SAC achieves higher final reward; PPO trains faster per step
  3. Monitor with W&B — Use --wandb to track per-component rewards across stages
  4. Use GPU — Training is significantly faster with CUDA
  5. Increase timesteps for stage 3 — The sparse terminal reward (strike/bite/food) often needs more samples to converge

Tuning Playbook

Use this section as a symptom-driven reference when a run is misbehaving. Each entry lists the most effective knobs first. Adjust one group at a time so you can attribute outcomes to changes.

Stage 1 — Balance

| Symptom | Most likely cause | What to change |
| --- | --- | --- |
| Agent falls immediately (ep length ~30–80, reward < 50) | Over-aggressive actions or wobbly initial pose | Lower ppo.learning_rate (try 3e-5), raise env.posture_weight, raise env.nosedive_weight |
| Agent hops / drifts across the arena to stay "alive" | alive_bonus too large relative to drift/spin penalties | Lower env.alive_bonus to 1.0–1.75, raise env.drift_penalty_weight and env.speed_penalty_weight |
| Spins in place to maintain upright torso | Spin is cheaper than true balance | Raise env.spin_penalty_weight to 0.1+, add small env.heading_weight (0.1) |
| Reward plateaus at a mediocre value, no further progress | Entropy too low or LR too high for fine-tuning | Lower ppo.learning_rate to 3e-5, raise ppo.ent_coef to 0.01, or enable the linear decay schedule via ppo.learning_rate_end |
| Reward is highly variable across seeds | Init noise dominates | Lower env.reset_noise_scale (default 0.05, try 0.02) |
| Jerky, unstable joint motion | Smoothness under-weighted | Raise env.smoothness_weight and env.energy_penalty_weight |

Stage 2 — Locomotion

| Symptom | Most likely cause | What to change |
| --- | --- | --- |
| Forward velocity stuck near zero | forward_vel_weight too low or alive_bonus dominant | Raise env.forward_vel_weight to 1.0–2.0, reduce env.alive_bonus |
| Crab-walks sideways toward target | No lateral or heading constraint | Set env.lateral_penalty_weight to 0.1, raise env.heading_weight |
| Walks forward but falls frequently | Balance reward zeroed too aggressively | Keep env.posture_weight ≥ 0.3, re-enable mild env.alive_bonus (0.3–0.5) |
| Unrealistic "ice-skating" gait | Symmetry and smoothness under-weighted | Raise env.gait_symmetry_weight, raise env.smoothness_weight |
| Forward speed exceeds what is physically plausible | Reward uncapped | Set env.forward_vel_max (e.g. 3.0 for raptor, 1.5 for brachio) |

Stage 3 — Behavior (Strike / Bite / Food Reach)

| Symptom | Most likely cause | What to change |
| --- | --- | --- |
| Never triggers terminal event (strike/bite/food) | Sparse-reward exploration stalled | Raise ppo.ent_coef to 0.005–0.01, widen ppo.clip_range to 0.15, tighten env.prey_distance_range so the target spawns closer |
| Lingers near target without triggering | Proximity bonuses rewarding hovering | Zero env.strike_proximity_weight (or env.food_proximity_weight), keep env.*_bonus high (1000+) and *_approach_weight as the only gradient |
| Forgets locomotion during Stage 3 warm-up | Reward schedule shift too abrupt | Use curriculum.warmup_timesteps = 300000, curriculum.ramp_timesteps = 500000, curriculum.warmup_clip_range = 0.02 to anneal changes |
| Learns the behavior but then regresses | Over-entropy or critic drift | Lower ppo.ent_coef after convergence, narrow ppo.clip_range to 0.1 |
| strike_bonus signal not dominating | Discounted future alive-reward too large | Ensure env.alive_bonus = 0 in stage 3 and that the discounted bonus, gamma^H · strike_bonus for a strike H steps away, far exceeds the discounted sum of per-step rewards over those H steps |
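The last row comes down to a comparison under discounting: a one-off bonus received H steps from now is worth gamma^H times its face value, while a constant per-step reward collected over those H steps sums to a geometric series. The sketch below computes both sides. The numbers are illustrative and the function names are hypothetical.

```python
def discounted_stream(per_step_reward: float, gamma: float, horizon: int) -> float:
    """Discounted sum of a constant reward received at steps 0 .. horizon-1."""
    return per_step_reward * (1.0 - gamma ** horizon) / (1.0 - gamma)


def discounted_bonus(bonus: float, gamma: float, horizon: int) -> float:
    """Present value of a one-off bonus received `horizon` steps from now."""
    return bonus * gamma ** horizon


gamma, horizon = 0.995, 500  # illustrative: Stage 3 gamma, strike 500 steps away
stream = discounted_stream(per_step_reward=0.1, gamma=gamma, horizon=horizon)
for bonus in (100.0, 1000.0):
    value = discounted_bonus(bonus, gamma, horizon)
    print(f"bonus={bonus:g}: discounted bonus={value:.2f}, "
          f"discounted per-step stream={stream:.2f}")
```

With these illustrative numbers, a per-step reward of 0.1 outweighs a bonus of 100 once both are discounted over 500 steps, whereas a bonus of 1000 clearly dominates. The same arithmetic shows why raising gamma or spawning the target closer (smaller H) also helps.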

Algorithm-specific notes

PPO. The most impactful lever is the learning_rate / learning_rate_end pair: a learning rate roughly 10× lower than the "textbook" 3e-4 (about 3e-5) produces the most consistent runs in this repo. n_steps × n_envs should be divisible by batch_size; otherwise SB3 emits a warning and every epoch ends with a truncated minibatch. clip_range should narrow across the curriculum (0.2 → 0.15 → 0.1) as policies become more specialized.
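A quick way to check the n_steps × n_envs versus batch_size relationship before launching a run is sketched below. `rollout_batching` is a hypothetical helper, not part of SB3 or this repo.

```python
def rollout_batching(n_steps: int, n_envs: int, batch_size: int):
    """Return (rollout size, full minibatches per epoch, leftover samples)."""
    rollout = n_steps * n_envs
    full_batches, leftover = divmod(rollout, batch_size)
    return rollout, full_batches, leftover


for n_steps, n_envs, batch_size in ((2048, 4, 256), (1000, 3, 256)):
    rollout, full, leftover = rollout_batching(n_steps, n_envs, batch_size)
    status = "ok" if leftover == 0 else f"last minibatch has only {leftover} samples"
    print(f"n_steps={n_steps} n_envs={n_envs} batch_size={batch_size}: "
          f"rollout={rollout}, {full} full minibatches, {status}")
```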

SAC. ent_coef = "auto" handles exploration automatically, so prefer it over a fixed float. With train_freq=16 and gradient_steps=8, SB3 runs 8 gradient steps for every 16 environment steps (a 1:2 gradient-to-env ratio), which smooths the learning curve at modest compute cost. Set buffer_size proportional to stage length (300K for Stage 1, 1M for Stage 3).

JAX / MJX. num_envs × rollout_len is the effective rollout buffer. The default 2048 × 64 ≈ 131K is aggressive — reduce to 1024 × 64 if GPU memory is tight. warmup_updates and ramp_updates in [jax] mirror the warmup_timesteps / ramp_timesteps knobs from [curriculum] but in update-count units.
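To translate between the two unit systems, the sketch below assumes that one update consumes exactly num_envs × rollout_len environment steps and that partial updates are rounded up. Both are assumptions about the JAX trainer, and `timesteps_to_updates` is a hypothetical helper.

```python
import math


def timesteps_to_updates(timesteps: int, num_envs: int, rollout_len: int) -> int:
    """Convert a timestep budget to an update count, rounding up (assumed convention)."""
    steps_per_update = num_envs * rollout_len
    return math.ceil(timesteps / steps_per_update)


# Default 2048 x 64 rollout versus the reduced 1024 x 64 setting.
for num_envs in (2048, 1024):
    warmup = timesteps_to_updates(300_000, num_envs, rollout_len=64)
    ramp = timesteps_to_updates(500_000, num_envs, rollout_len=64)
    print(f"num_envs={num_envs}: steps/update={num_envs * 64}, "
          f"warmup_updates={warmup}, ramp_updates={ramp}")
```

Halving num_envs roughly doubles the update counts needed to cover the same number of timesteps, so warmup_updates and ramp_updates are worth revisiting whenever the rollout size changes.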

A minimal tuning workflow

  1. Baseline. Run the stage with committed defaults for 2–3 seeds. Record best reward, mean episode length, and any behavioral metrics (success rate, velocity).
  2. Diagnose. If the run fails, match symptoms against the tables above. Do not change more than one group of knobs per run.
  3. Narrow. For promising directions, launch a Ray Tune sweep over 3–5 candidate values using notebooks/ray_tune_sweep.ipynb. Use the ASHA scheduler to prune early.
  4. Promote. Commit the winning values back to the TOML with a trailing comment explaining why (see existing configs for the house style — e.g. # Setting 4 sweep: ...).
  5. Regress-test. Re-run the full curriculum end-to-end on the winning config before committing. A Stage 1 change often degrades Stage 3.

For systematic multi-parameter sweeps, see Hyperparameter Sweeps.