The experiment loop

Deploy: close the loop

Everything so far prepared data to train a policy. Deployment is where the policy meets the world, and where most teams' tooling quietly falls apart: evaluation results end up in spreadsheets and screen captures, disconnected from the episode recordings that explain them.

The answer is the one you already know: an evaluation rollout is just another episode. Record it like one, tag it like one, query it like one.

Provenance via properties

Tag each rollout at recording time with the things you'll want to slice by later: checkpoint id, scene configuration, software version, outcome. Then every question, like "how does v2 compare to v1 on the long-horizon task?", is a filter instead of an archaeology project.
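A minimal sketch of the idea in plain Python, not any particular recording library's API. The `Rollout` class, the `tag_rollout` and `where` helpers, the property names, and the sample values are all hypothetical. It shows that once properties are attached at recording time, the v1-versus-v2 question reduces to a filter.

```python
from dataclasses import dataclass, field


@dataclass
class Rollout:
    """Hypothetical stand-in for one recorded evaluation episode."""
    segment_id: str
    properties: dict = field(default_factory=dict)


def tag_rollout(segment_id, checkpoint, task, scene, software_version, outcome):
    """Attach provenance at recording time (property names are assumptions)."""
    return Rollout(
        segment_id,
        {
            "checkpoint": checkpoint,
            "task": task,
            "scene": scene,
            "software_version": software_version,
            "outcome": outcome,
        },
    )


def where(rollouts, **conditions):
    """Keep the rollouts whose properties match every given condition."""
    return [
        r for r in rollouts
        if all(r.properties.get(k) == v for k, v in conditions.items())
    ]


# Made-up sample data, for illustration only.
rollouts = [
    tag_rollout("seg-001", "v1", "long_horizon", "kitchen_a", "1.4.0", "failure"),
    tag_rollout("seg-002", "v2", "long_horizon", "kitchen_a", "1.4.0", "success"),
    tag_rollout("seg-003", "v2", "pick_place", "kitchen_b", "1.4.0", "success"),
    tag_rollout("seg-004", "v2", "long_horizon", "kitchen_b", "1.4.0", "failure"),
]

# "How does v2 compare to v1 on the long-horizon task?" becomes two filters.
v1_long = where(rollouts, checkpoint="v1", task="long_horizon")
v2_long = where(rollouts, checkpoint="v2", task="long_horizon")
print([r.segment_id for r in v1_long])
print([r.segment_id for r in v2_long])
```

Keeping the properties as a flat dictionary is a deliberate choice in this sketch: any key you tag today is a key you can filter on later without changing the helper.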

Success-by-anything is an aggregation

Breaking outcomes down by any condition (model version, scene, site, lighting) is the same short group-by you saw in Refine. The crucial property is the split: an overall success number hides exactly the thing you need to see. Success by condition is what tells you a policy is excellent in the demonstrated regime and failing in one specific corner of it, the classic signature of a training-data coverage gap.
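To make the split concrete, here is a small stdlib-only sketch. The rows, the `lighting` key, and the `success_rate_by` helper are invented for illustration and are not the course's actual query. The overall number looks healthy while one condition fails every time.

```python
from collections import defaultdict

# Made-up rollouts: 8 under diffuse light, 2 under glare.
rollouts = (
    [{"lighting": "diffuse", "outcome": "success"}] * 8
    + [{"lighting": "glare", "outcome": "failure"}] * 2
)


def success_rate_by(rows, key):
    """Group rows by `key` and return the success fraction per group."""
    counts = defaultdict(lambda: [0, 0])  # group -> [successes, total]
    for row in rows:
        group = counts[row[key]]
        group[1] += 1
        if row["outcome"] == "success":
            group[0] += 1
    return {name: wins / total for name, (wins, total) in counts.items()}


overall = sum(r["outcome"] == "success" for r in rollouts) / len(rollouts)
by_lighting = success_rate_by(rollouts, "lighting")

print(overall)      # 0.8
print(by_lighting)  # {'diffuse': 1.0, 'glare': 0.0}
```

The same helper works for any tagged property, so swapping `"lighting"` for a scene or checkpoint key changes the axis without changing the code.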

Open the failure

Aggregates locate the problem; recordings explain it. Filter to the failures, take their segment ids, and open those exact episodes, which arrive with the dataset's blueprint, the derived 3D scene, the camera streams, and every timeline intact. You scrub to the moment it went wrong and look at what the robot saw. A metric tells you the gripper closed on air; the recording shows you the glare, the occlusion, or the pose no demonstration ever covered.
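The filter step might look like the sketch below. The row layout and the `failed_segment_ids` helper are hypothetical; how you then open those segments depends on your viewer and is left out on purpose.

```python
# Made-up rollouts, each carrying the segment id of its recording.
rollouts = [
    {"segment_id": "seg-101", "lighting": "diffuse", "outcome": "success"},
    {"segment_id": "seg-102", "lighting": "glare", "outcome": "failure"},
    {"segment_id": "seg-103", "lighting": "diffuse", "outcome": "failure"},
    {"segment_id": "seg-104", "lighting": "glare", "outcome": "failure"},
]


def failed_segment_ids(rows, **conditions):
    """Return segment ids of failed rollouts that match every condition."""
    return [
        row["segment_id"]
        for row in rows
        if row["outcome"] == "failure"
        and all(row.get(k) == v for k, v in conditions.items())
    ]


# All failures, then only the failures in the corner the aggregate flagged.
all_failures = failed_segment_ids(rollouts)
glare_failures = failed_segment_ids(rollouts, lighting="glare")
print(all_failures)    # ['seg-102', 'seg-103', 'seg-104']
print(glare_failures)  # ['seg-102', 'seg-104']
```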

Close the loop

What you find feeds straight back: a coverage gap becomes targeted collection; a quality problem becomes a new layer that the export's filter excludes; a better policy gets re-exported with the same config and evaluated on the same axes, fairly. That cycle (deploy reveals, recordings explain, collection targets, refinement cleans, training absorbs) is the experiment loop this course is named for, and every stage of it was a query over the same data layer.
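One way to read "evaluated on the same axes, fairly" is to compare two checkpoints only on the conditions both were actually run on. The sketch below assumes that reading; the data, the `scene` key, and both helpers are hypothetical.

```python
from collections import defaultdict


def success_by(rows, key):
    """Success fraction per value of `key`."""
    counts = defaultdict(lambda: [0, 0])  # value -> [successes, total]
    for row in rows:
        counts[row[key]][1] += 1
        counts[row[key]][0] += row["outcome"] == "success"
    return {value: wins / total for value, (wins, total) in counts.items()}


def compare_on_shared_axes(rows, old, new, key):
    """Pair old and new success rates, restricted to conditions both cover."""
    old_rates = success_by([r for r in rows if r["checkpoint"] == old], key)
    new_rates = success_by([r for r in rows if r["checkpoint"] == new], key)
    shared = sorted(old_rates.keys() & new_rates.keys())
    return {cond: (old_rates[cond], new_rates[cond]) for cond in shared}


# Made-up rollouts; "bin" was only evaluated for v2, so it is excluded.
rollouts = [
    {"checkpoint": "v1", "scene": "tabletop", "outcome": "success"},
    {"checkpoint": "v1", "scene": "tabletop", "outcome": "failure"},
    {"checkpoint": "v1", "scene": "shelf", "outcome": "failure"},
    {"checkpoint": "v1", "scene": "shelf", "outcome": "failure"},
    {"checkpoint": "v2", "scene": "tabletop", "outcome": "success"},
    {"checkpoint": "v2", "scene": "tabletop", "outcome": "success"},
    {"checkpoint": "v2", "scene": "shelf", "outcome": "success"},
    {"checkpoint": "v2", "scene": "shelf", "outcome": "failure"},
    {"checkpoint": "v2", "scene": "bin", "outcome": "success"},
]

comparison = compare_on_shared_axes(rollouts, "v1", "v2", "scene")
print(comparison)  # {'shelf': (0.0, 0.5), 'tabletop': (0.5, 1.0)}
```

Dropping conditions that only one checkpoint saw keeps an easy extra scene from inflating the newer policy's numbers.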

You've walked the whole loop on one dataset. The last article maps every piece onto your robot, your stack, and your team's actual week.
