Deploy: close the loop
Everything so far prepared data to train a policy. Deployment is where the policy meets the world, and where most teams' tooling quietly falls apart: evaluation results end up in spreadsheets and screen captures, disconnected from the episode recordings that explain them.
The answer is the one you already know: an evaluation rollout is just another episode. Record it like one, tag it like one, query it like one.
Provenance via properties
Tag each rollout at recording time with the things you'll want to slice by later: checkpoint id, scene configuration, software version, outcome. Then every question, like "how does v2 compare to v1 on the long-horizon task?", is a filter instead of an archaeology project.
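As a minimal sketch of that idea in plain Python (not the course's recording or query API): the `RolloutTags` record, the `select` helper, the `task` field, and every value in the toy list are hypothetical, chosen only to show how tags written once at recording time turn a later question into a filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RolloutTags:
    """Provenance attached to one evaluation rollout (hypothetical schema)."""
    segment_id: str
    checkpoint_id: str
    scene_config: str
    software_version: str
    task: str      # assumed extra tag, so "the long-horizon task" can be filtered on
    outcome: str   # "success" or "failure"


def select(rollouts, **conditions):
    """Return the rollouts whose tags equal every key=value condition."""
    return [
        r for r in rollouts
        if all(getattr(r, key) == value for key, value in conditions.items())
    ]


# Made-up rollouts for illustration only.
rollouts = [
    RolloutTags("seg-001", "v1", "tabletop", "1.4.0", "long-horizon", "failure"),
    RolloutTags("seg-002", "v1", "tabletop", "1.4.0", "pick-place", "success"),
    RolloutTags("seg-003", "v2", "tabletop", "1.5.0", "long-horizon", "success"),
    RolloutTags("seg-004", "v2", "cluttered", "1.5.0", "long-horizon", "failure"),
]

# "How does v2 compare to v1 on the long-horizon task?" becomes two filters.
v1_long = select(rollouts, checkpoint_id="v1", task="long-horizon")
v2_long = select(rollouts, checkpoint_id="v2", task="long-horizon")
```

The design point is that the tags are decided before the rollout runs; anything not recorded then has to be reconstructed by hand later.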
Success-by-anything is an aggregation
Breaking success down by any condition (model version, scene, site, lighting) is the same short group-by you saw in Refine. The split is what matters: a single overall success number hides exactly the thing you need to see. Success by condition is what tells you a policy is excellent in the demonstrated regime and failing in one specific corner of it, the classic signature of a training-data coverage gap.
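To make the contrast between the overall number and the split concrete, here is a small stand-in written with the standard library rather than the group-by from Refine. The `success_rate_by` helper, the `lighting` tag, and the eight toy rollouts are assumptions for illustration, not measured results.

```python
from collections import defaultdict


def success_rate_by(rollouts, key):
    """Group rollouts by one tag and return the success rate per group."""
    tally = defaultdict(lambda: [0, 0])  # tag value -> [successes, total]
    for r in rollouts:
        bucket = tally[r[key]]
        bucket[0] += int(r["outcome"] == "success")
        bucket[1] += 1
    return {value: wins / total for value, (wins, total) in tally.items()}


# Made-up rollouts: six in bright light that succeed, two in dim light that fail.
rollouts = (
    [{"lighting": "bright", "outcome": "success"}] * 6
    + [{"lighting": "dim", "outcome": "failure"}] * 2
)

overall = sum(r["outcome"] == "success" for r in rollouts) / len(rollouts)
by_lighting = success_rate_by(rollouts, "lighting")

print(overall)      # 0.75
print(by_lighting)  # {'bright': 1.0, 'dim': 0.0}
```

In this toy case the overall figure looks respectable while one condition fails every time, which is the pattern the split is meant to expose.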
Open the failure
Aggregates locate the problem; recordings explain it. Filter to the failures, take their segment ids, and open those exact episodes, which arrive with the dataset's blueprint, the derived 3D scene, the camera streams, and every timeline intact. You scrub to the moment it went wrong and look at what the robot saw. A metric tells you the gripper closed on air; the recording shows you the glare, the occlusion, or the pose no demonstration ever covered.
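The hand-off from aggregate to recording can be sketched as a filter that returns segment ids. The `failing_segment_ids` helper, the tag names, and the rows are hypothetical; how those ids are then opened depends on your viewer and dataset tooling, so that step is deliberately left out.

```python
def failing_segment_ids(rollouts, **conditions):
    """Return segment ids of failed rollouts that match every key=value condition."""
    return [
        r["segment_id"]
        for r in rollouts
        if r["outcome"] == "failure"
        and all(r.get(key) == value for key, value in conditions.items())
    ]


# Made-up rollout rows for illustration only.
rollouts = [
    {"segment_id": "seg-010", "checkpoint_id": "v2", "lighting": "bright", "outcome": "success"},
    {"segment_id": "seg-011", "checkpoint_id": "v2", "lighting": "dim", "outcome": "failure"},
    {"segment_id": "seg-012", "checkpoint_id": "v2", "lighting": "dim", "outcome": "failure"},
    {"segment_id": "seg-013", "checkpoint_id": "v1", "lighting": "dim", "outcome": "failure"},
]

# The failing corner found by the aggregate becomes a short list of episodes to open.
to_inspect = failing_segment_ids(rollouts, checkpoint_id="v2", lighting="dim")
print(to_inspect)  # ['seg-011', 'seg-012']
```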
Close the loop
What you find feeds straight back: a coverage gap becomes targeted collection; a quality problem becomes a new layer that the export's filter excludes; a better policy gets re-exported with the same config and evaluated on the same axes, fairly. That cycle (deploy reveals, recordings explain, collection targets, refinement cleans, training absorbs) is the experiment loop this course is named for, and every stage of it was a query over the same data layer.
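One way to read "evaluated on the same axes, fairly" is to compare two checkpoints only on the conditions both were actually run under. The sketch below makes that assumption explicit; `success_table`, `compare`, the `lighting` axis, and the toy rollouts are all hypothetical.

```python
from collections import defaultdict


def success_table(rollouts, axis):
    """Map (checkpoint_id, axis value) -> success rate."""
    tally = defaultdict(lambda: [0, 0])  # cell -> [successes, total]
    for r in rollouts:
        cell = tally[(r["checkpoint_id"], r[axis])]
        cell[0] += int(r["outcome"] == "success")
        cell[1] += 1
    return {cell: wins / total for cell, (wins, total) in tally.items()}


def compare(rollouts, axis, old, new):
    """Per-condition change in success rate, limited to conditions both checkpoints ran."""
    table = success_table(rollouts, axis)
    old_conditions = {value for (ckpt, value) in table if ckpt == old}
    new_conditions = {value for (ckpt, value) in table if ckpt == new}
    shared = sorted(old_conditions & new_conditions)
    return {value: table[(new, value)] - table[(old, value)] for value in shared}


def row(checkpoint_id, lighting, outcome):
    return {"checkpoint_id": checkpoint_id, "lighting": lighting, "outcome": outcome}


# Made-up rollouts: v2 was also tried outdoors, where v1 was never evaluated.
rollouts = [
    row("v1", "bright", "success"), row("v1", "bright", "success"),
    row("v1", "dim", "failure"), row("v1", "dim", "failure"),
    row("v2", "bright", "success"), row("v2", "bright", "success"),
    row("v2", "dim", "success"), row("v2", "dim", "failure"),
    row("v2", "outdoor", "success"),
]

delta = compare(rollouts, "lighting", old="v1", new="v2")
print(delta)  # {'bright': 0.0, 'dim': 0.5}
```

Dropping the unshared condition keeps the comparison like for like: the outdoor rollout says something about v2, but nothing about whether v2 improved on v1.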
Next
You've walked the whole loop on one dataset. The last article maps every piece onto your robot, your stack, and your team's actual week.