Catalog object model
This page covers the Data Platform's object model. For logging and recording basics, see Recordings. For API details, see the Catalog SDK reference.
Catalog
We refer to the contents stored in a given instance of the Data Platform as the catalog. The catalog contains top-level objects called entries.
There are currently two types of entries: tables and datasets. Each is described in more detail below.
Entries share a few common properties:
- id: a globally unique identifier
- name: a user-provided name, which must be unique within the catalog
The id is immutable, but the name can be changed provided it remains unique.
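The following sketch shows how catalog entries and their common properties might be inspected from Python. The client class, connection address, and listing method used here (`CatalogClient`, `dataset_entries()`) are assumptions for illustration; check the Catalog SDK reference for the actual names.

```python
# A minimal sketch, assuming a Python Catalog SDK with a `CatalogClient`
# and an entry-listing method -- names and signatures are assumptions.
from rerun.catalog import CatalogClient  # hypothetical import path

client = CatalogClient("rerun+http://localhost:51234")  # example address

for entry in client.dataset_entries():  # hypothetical listing call
    # Each entry carries a globally unique, immutable `id` and a
    # catalog-unique, user-editable `name`.
    print(entry.id, entry.name)
```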
Table entries
Table entries model a single table of data. They use the Arrow data model, so a table is logically equivalent to an Arrow table. As a result, tables possess an Arrow schema.
Tables support the following mutation operations through the Catalog SDK:
- append: add new rows to the table
- overwrite: replace the entire table with new data
- upsert: replace existing rows (based on an index column) with new data
Thanks to DataFusion, tables also support most common database operations, such as querying, filtering, and joining.
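As a hedged illustration, the sketch below appends rows to a table and runs a DataFusion-style query against it. The lookup and mutation method names on the catalog side (`get_table_entry`, `append`, `df`) are assumptions; the query expressions follow the standard `datafusion` Python API.

```python
# A minimal sketch of table mutation and querying -- the catalog-side method
# names (`get_table_entry`, `append`, `df`) are assumptions.
import pyarrow as pa
from datafusion import col, lit
from rerun.catalog import CatalogClient  # hypothetical import path

client = CatalogClient("rerun+http://localhost:51234")
table = client.get_table_entry(name="episode_metadata")  # hypothetical lookup

# `append` adds new rows; the batch must match the table's Arrow schema.
new_rows = pa.RecordBatch.from_pydict({"episode": ["run_042"], "duration_s": [31.7]})
table.append(new_rows)

# DataFusion-backed querying: filter and project like a regular database table.
long_runs = table.df().filter(col("duration_s") > lit(30.0)).select(col("episode"))
print(long_runs.collect())
```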
Datasets
Dataset entries model a collection of Rerun data organized into episodes, such as recorded runs of a given robotic task. Within a dataset, these episodes are called segments, and each is identified by a segment ID.
Segments are added to a dataset by registering a recording (typically stored in an object store such as S3) with the dataset using the Catalog SDK.
The recording ID of the .rrd file is used as its segment ID.
Recordings registered to a given segment are organized by layers, identified by a layer name.
By default, the "base" layer name is used.
Registering two .rrd files with the same recording ID (that is, with the same segment ID) to the same dataset, and using the same layer name, will result in the second .rrd overwriting the first.
Additive registration can be achieved by using different layer names for different .rrds with the same recording ID/segment ID.
Layers are immutable and can only be overwritten by registering a new .rrd file. In other words, datasets support the following mutation operations:
- create segment: by registering a .rrd with a "new" recording ID
- append to segment: by registering a .rrd with a matching recording ID to a new layer name
- overwrite segment layer: by registering a .rrd with a matching recording ID to an existing layer name
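The sketch below shows how these three operations might map onto registration calls. The `register` method name and its `layer` parameter are assumptions used for illustration; only the recording-ID and layer semantics described above are taken from this page.

```python
# A minimal sketch of registering recordings into a dataset -- the `register`
# method and its `layer` parameter are assumptions, not the confirmed API.
from rerun.catalog import CatalogClient  # hypothetical import path

client = CatalogClient("rerun+http://localhost:51234")
dataset = client.get_dataset_entry(name="pick_and_place")  # hypothetical lookup

# Create segment: the recording ID stored in the .rrd becomes the segment ID,
# and the data lands on the default "base" layer.
dataset.register("s3://my-bucket/runs/run_042.rrd")

# Append to segment: same recording ID, but a new layer name.
dataset.register("s3://my-bucket/annotations/run_042_labels.rrd", layer="labels")

# Overwrite segment layer: same recording ID and an existing layer name.
dataset.register("s3://my-bucket/runs/run_042_fixed.rrd", layer="base")
```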
Schema
Datasets are based on the Rerun data model, which consists of a collection of chunks of Arrow data. These chunks hold data for various entities and components across various indexes (or timelines). A given collection of chunks (say, a dataset segment) defines an Arrow schema. We refer to this as schema-on-read, because the schema is derived from the data, not the other way around. This differs from the table model, where the schema is defined upfront (schema-on-write).
In this context, the schema of a dataset is the union of schemas of its segments, which themselves are the union of the schemas of their layers.
Datasets maintain a minimal level of schema self-consistency.
Registering a .rrd whose schema is incompatible with the current dataset schema will result in an error.
In this context, incompatible means that the schema of the new .rrd contains a column for the same entity, archetype, and component, but with a different Arrow type.
Such an occurrence is rare, and practically impossible when using standard Rerun archetypes.
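For illustration, the sketch below inspects the resulting schema-on-read schema of a dataset. The `schema()` accessor and the field attributes are assumptions; the actual accessor may differ.

```python
# A minimal sketch of inspecting a dataset's schema-on-read Arrow schema.
# The `schema()` accessor and iteration style are assumptions.
from rerun.catalog import CatalogClient  # hypothetical import path

client = CatalogClient("rerun+http://localhost:51234")
dataset = client.get_dataset_entry(name="pick_and_place")  # hypothetical lookup

# The dataset schema is the union of its segments' schemas (which are in turn
# the union of their layers' schemas), so it grows as newly registered
# recordings introduce new entities and components.
for column in dataset.schema():
    print(column.name, column.type)
```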
Blueprints
A dataset can be assigned a blueprint.
This is done by registering a .rbl blueprint file (typically stored in object storage) with the dataset.
A dedicated API exists for this in the Catalog SDK: DatasetEntry.register_blueprint().
The blueprint is then applied to all segments of the dataset when they are visualized in the Rerun Viewer.
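A short sketch of blueprint registration follows. The `register_blueprint()` method is named on this page; the client setup and the blueprint URI are illustrative assumptions.

```python
# Sketch of assigning a blueprint via DatasetEntry.register_blueprint().
# Client setup and the blueprint URI are illustrative assumptions.
from rerun.catalog import CatalogClient  # hypothetical import path

client = CatalogClient("rerun+http://localhost:51234")
dataset = client.get_dataset_entry(name="pick_and_place")  # hypothetical lookup

# Register a .rbl blueprint stored in object storage; it is applied to every
# segment of the dataset when visualized in the Rerun Viewer.
dataset.register_blueprint("s3://my-bucket/blueprints/pick_and_place.rbl")
```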