The experiment loop

Train: the training set is a query

There's an expensive way to feed a model: filter, join, resample, export, and copy a bespoke dataset per experiment. Because your data is already a queryable dataset, there's a cheap way: the training set is a query over the dataset you already have, and you can train straight off that query without materializing a copy.

Curate by query

Remember the derived signals and metadata from Refine? They earn their keep now. The demo's train.py ranks episodes by right-arm motion (the arm_activity query) and takes the most active ones as its train/holdout split. No files move, nothing is deleted, no dataset_v2_final folder appears. Trying a stricter threshold is a re-run, not a re-build. Swap arm motion for any signal your layers carry (blur, task outcome, jerkiness, a reward-model score) and that's your curation knob.
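To make the knob concrete, here is a minimal sketch of threshold-and-rank curation in plain Python. It is not the demo's train.py and does not call the Rerun catalog: the `curate_split` helper, the episode ids, and the choice to hold out the least active tail of the selection are all assumptions made for illustration. The per-episode scores stand in for whatever the arm_activity query returns.

```python
from typing import Dict, List, Tuple


def curate_split(
    activity: Dict[str, float],
    min_activity: float,
    holdout_fraction: float = 0.2,
) -> Tuple[List[str], List[str]]:
    """Rank episodes by a signal, keep those above a threshold, then split.

    `activity` maps a (hypothetical) episode id to its score, standing in
    for the result of a per-episode query.
    """
    # Most active first; ties broken by episode id so the split is deterministic.
    ranked = sorted(activity.items(), key=lambda kv: (-kv[1], kv[0]))

    # The curation knob: raising the threshold shrinks the selection.
    selected = [episode for episode, score in ranked if score >= min_activity]

    # Assumption: hold out the tail of the ranked selection.
    n_holdout = int(len(selected) * holdout_fraction)
    split_at = len(selected) - n_holdout
    return selected[:split_at], selected[split_at:]
```

Calling `curate_split(scores, min_activity=0.5)` and then `curate_split(scores, min_activity=0.75)` only produces two different lists of episode ids; the underlying episodes are never copied or touched.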

Why care this much about what goes in? Data composition is arguably the biggest lever there is: curating which episodes a policy trains on often moves success rates more than algorithmic changes do. You can only turn that knob if your dataset is queryable.

Stream the query straight into training

Once the query picks the episodes, you don't export them: you stream them. The demo wraps the catalog query in Rerun's experimental PyTorch dataloader, RerunIterableDataset, and hands it to a standard DataLoader; the toy MLP then trains on per-joint state pulled live from the catalog. Nothing is drained into memory up front; workers prefetch from the server as the loop asks for batches, decoding scalars (and video, images) on the fly.
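The "nothing is drained up front" property is easy to see in a plain-Python stand-in. This sketch does not use RerunIterableDataset or PyTorch, and it has no worker prefetching; `CountingSource` and `stream_batches` are hypothetical names, with the source playing the role of a server-side query that yields rows on demand.

```python
from typing import Iterable, Iterator, List


class CountingSource:
    """Hypothetical stand-in for a remote query; counts rows actually pulled."""

    def __init__(self, n_rows: int) -> None:
        self.n_rows = n_rows
        self.pulled = 0

    def __iter__(self) -> Iterator[List[float]]:
        for i in range(self.n_rows):
            self.pulled += 1  # a row is only produced when the consumer asks
            yield [float(i), float(i) * 0.5]


def stream_batches(
    rows: Iterable[List[float]], batch_size: int
) -> Iterator[List[List[float]]]:
    """Group a lazy row stream into batches without reading ahead."""
    batch: List[List[float]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        # Emit the final partial batch instead of dropping it.
        yield batch
```

After pulling one batch of four from a ten-row source, `source.pulled` is 4, not 10: the training loop sets the pace, and the source only does as much work as the loop has asked for.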

The multi-rate alignment you've been quietly immune to since Foundations gets settled here, declaratively: each field names an entity path and a fixed sampling rate, and the loader resamples onto one timeline as it streams. Different model wants different cameras or a different rate? Different fields, same data, no re-export. Change the curation query and the training set changes with it; the dataset is the source of truth, not a pile of derived copies that drift out of sync with it.
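As a rough picture of what resampling onto one timeline means, the sketch below aligns streams logged at different rates onto a fixed-rate grid. It assumes a hold-last-value ("latest at") rule; the text above does not say which rule the loader applies, and `resample`, the stream names, and the tuple layout are illustrative, not Rerun's API.

```python
from bisect import bisect_right
from typing import Dict, List, Optional, Sequence, Tuple

# One stream: (timestamp_seconds, value) pairs, sorted by timestamp.
Stream = Sequence[Tuple[float, float]]


def resample(
    streams: Dict[str, Stream],
    rate_hz: float,
    start: float,
    end: float,
) -> List[Dict[str, Optional[float]]]:
    """Sample every stream at a fixed rate using the latest value at each tick."""
    # Precompute timestamp lists once so each lookup is a binary search.
    times = {name: [t for t, _ in stream] for name, stream in streams.items()}
    n_steps = int(round((end - start) * rate_hz)) + 1

    rows: List[Dict[str, Optional[float]]] = []
    for k in range(n_steps):
        t = start + k / rate_hz
        row: Dict[str, Optional[float]] = {"t": t}
        for name, stream in streams.items():
            # Index of the last sample with timestamp <= t, if any.
            i = bisect_right(times[name], t)
            row[name] = stream[i - 1][1] if i > 0 else None
        rows.append(row)
    return rows
```

A 4 Hz joint stream and a 1 Hz gripper stream resampled at 2 Hz come out as rows sharing the same timestamps. Asking for a different rate or a different set of streams is a different call over the same data.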

Watching the run is still a Rerun job

Training itself stays in the loop: log loss curves and sample batches (the actual frames the model saw, predictions next to targets) as recordings. A loss curve tells you that something's wrong; co-logged samples tell you what. Off-by-one labels and double-applied transforms hide in dashboards and are obvious in a sample view. The demo goes one step further and registers each finished run as a segment in a trossen_oss_runs dataset, so runs sit in the catalog right alongside the episodes they trained on: browsable, queryable, comparable.

A model trains on the query. Next comes Deploy: the model meets the world, the world writes back, and evaluation becomes (you guessed it) recordings and a query.
