Dataset Resampling

This snippet demonstrates how to resample a dataset onto the index values at which one chosen component has data. This is particularly helpful when your components are logged at very different frequencies.

First, load a sample dataset to query:

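from pathlib import Path

import rerun as rr
from datafusion import col  # used further down to build the filter expression

sample_dataset_path = (
    Path(__file__).parents[4] / "tests" / "assets" / "rrd" / "dataset"
)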
server = rr.server.Server(datasets={"dataset": sample_dataset_path})
# Using OSS server for demonstration but in practice replace with
# the URL of your cloud instance
CATALOG_URL = server.url()
client = rr.catalog.CatalogClient(CATALOG_URL)
dataset = client.get_dataset(name="dataset")

Investigate time ranges

Before resampling, you can examine the dataset's time ranges with the get_index_ranges() method. This is not strictly necessary for the resampling that follows, but it can be helpful while investigating your data. The result has one row per segment, with the start and end values of every index in your dataset.

(
    dataset
    .get_index_ranges()
    .select(
        "rerun_segment_id",
        "time_1:start",
        "time_1:end",
        "time_2:start",
        "time_2:end",
        "time_3:start",
        "time_3:end",
    )
    .sort("rerun_segment_id")
    .show()
)
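To make the shape of that result concrete, here is a minimal plain-Python sketch of what "start and end per index, per segment" means. It does not use any Rerun API; the rows, the segment names, and the `index_ranges` helper are all hypothetical and exist only for illustration.

```python
from collections import defaultdict

# Hypothetical rows: (segment_id, time_1, time_3).
# None means the row has no value on that index.
rows = [
    ("seg_a", 10, 1),
    ("seg_a", 20, None),
    ("seg_a", 30, 7),
    ("seg_b", 5, 2),
    ("seg_b", None, 4),
]
index_names = ["time_1", "time_3"]


def index_ranges(rows, index_names):
    """Return {segment_id: {index_name: (start, end)}} for the given rows."""
    ranges = defaultdict(dict)
    for segment_id, *values in rows:
        for name, value in zip(index_names, values):
            if value is None:
                continue  # this row does not contribute to this index
            start, end = ranges[segment_id].get(name, (value, value))
            ranges[segment_id][name] = (min(start, value), max(end, value))
    return dict(ranges)


ranges = index_ranges(rows, index_names)
for segment_id, per_index in ranges.items():
    print(segment_id, per_index)
```

Each segment ends up with one (start, end) pair per index, which mirrors the `time_N:start` and `time_N:end` columns selected above.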

Prior to resampling

The sample data we have loaded is very basic, but it contains components from three different entities logged at different times. The code below shows what the data looks like before resampling. Note that /obj3:Points3D:positions is null on some of the rows. To analyze this DataFrame as-is, you would likely need to aggregate or window across the time index.

time_index = "time_3"
columns_of_interest = [
    "rerun_segment_id",
    time_index,
    "/obj1:Points3D:positions",
    "/obj2:Points3D:positions",
    "/obj3:Points3D:positions",
]
(
    dataset
    .reader(index=time_index)
    .select(*columns_of_interest)
    .sort("rerun_segment_id", time_index)
    .show()
)

# +----------------------------------+--------+--------------------------+--------------------------+--------------------------+
# | rerun_segment_id                 | time_3 | /obj1:Points3D:positions | /obj2:Points3D:positions | /obj3:Points3D:positions |
# +----------------------------------+--------+--------------------------+--------------------------+--------------------------+
# | 141a866deb2d49f69eb3215e8a404ffc | 1      | [[49.0, 0.0, 0.0]]       | [[44.0, 1.0, 0.0]]       | [[1.0, 2.0, 0.0]]        |
# | 141a866deb2d49f69eb3215e8a404ffc | 2      | [[27.0, 0.0, 0.0]]       | [[42.0, 1.0, 0.0]]       |                          |
# | 141a866deb2d49f69eb3215e8a404ffc | 3      | [[25.0, 0.0, 0.0]]       | [[30.0, 1.0, 0.0]]       | [[3.0, 2.0, 0.0]]        |
# | 141a866deb2d49f69eb3215e8a404ffc | 4      | [[38.0, 0.0, 0.0]]       | [[19.0, 1.0, 0.0]]       |                          |
# | 141a866deb2d49f69eb3215e8a404ffc | 5      | [[17.0, 0.0, 0.0]]       | [[5.0, 1.0, 0.0]]        | [[5.0, 2.0, 0.0]]        |
# | 141a866deb2d49f69eb3215e8a404ffc | 6      | [[2.0, 0.0, 0.0]]        | [[35.0, 1.0, 0.0]]       |                          |
# | 141a866deb2d49f69eb3215e8a404ffc | 7      | [[44.0, 0.0, 0.0]]       | [[4.0, 1.0, 0.0]]        | [[7.0, 2.0, 0.0]]        |

Resampled data

The snippet below resamples the data in two steps. First, we create a new DataFrame that contains the index values we care about per segment. It is very important that you do not set fill_latest_at=True in this first query. With latest-at filling, the component of interest would also be populated on rows where it was not logged, so the not-null filter would no longer restrict the result to rows for which we actually have data. This DataFrame only needs to contain the segment ID and the index value.

Once we have a DataFrame with these index values, we can query the dataset using that DataFrame, this time with fill_latest_at=True so that the other columns carry their most recent values. You can see from the output below that we get one row per index value at which the component of interest is not null.

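resample_column = "/obj3:Points3D:positions"
# Step 1: collect the index values at which the component of interest has data.
# Do not pass fill_latest_at=True here.
times_of_interest = (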
    dataset
    .reader(index=time_index)
    .filter(col(resample_column).is_not_null())
    .select("rerun_segment_id", time_index)
)

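# Step 2: query the dataset at exactly those index values, filling the other
# columns with their latest values.
(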
    dataset
    .reader(
        index=time_index,
        using_index_values=times_of_interest,
        fill_latest_at=True,
    )
    .select(*columns_of_interest)
    .sort("rerun_segment_id", time_index)
    .show()
)

# +----------------------------------+--------+--------------------------+--------------------------+--------------------------+
# | rerun_segment_id                 | time_3 | /obj1:Points3D:positions | /obj2:Points3D:positions | /obj3:Points3D:positions |
# +----------------------------------+--------+--------------------------+--------------------------+--------------------------+
# | 141a866deb2d49f69eb3215e8a404ffc | 1      | [[49.0, 0.0, 0.0]]       | [[44.0, 1.0, 0.0]]       | [[1.0, 2.0, 0.0]]        |
# | 141a866deb2d49f69eb3215e8a404ffc | 3      | [[25.0, 0.0, 0.0]]       | [[30.0, 1.0, 0.0]]       | [[3.0, 2.0, 0.0]]        |
# | 141a866deb2d49f69eb3215e8a404ffc | 5      | [[17.0, 0.0, 0.0]]       | [[5.0, 1.0, 0.0]]        | [[5.0, 2.0, 0.0]]        |
# | 141a866deb2d49f69eb3215e8a404ffc | 7      | [[44.0, 0.0, 0.0]]       | [[4.0, 1.0, 0.0]]        | [[7.0, 2.0, 0.0]]        |
# | 141a866deb2d49f69eb3215e8a404ffc | 10     | [[12.0, 0.0, 0.0]]       | [[6.0, 1.0, 0.0]]        | [[10.0, 2.0, 0.0]]       |
# | 141a866deb2d49f69eb3215e8a404ffc | 12     | [[13.0, 0.0, 0.0]]       | [[17.0, 1.0, 0.0]]       | [[12.0, 2.0, 0.0]]       |
# | 141a866deb2d49f69eb3215e8a404ffc | 13     | [[20.0, 0.0, 0.0]]       | [[32.0, 1.0, 0.0]]       | [[13.0, 2.0, 0.0]]       |
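
The two steps above can also be sketched in plain Python to show why the order matters. This is only an illustration of the latest-at idea under simplified assumptions (a single segment, rows sorted by time, scalar values). It does not call any Rerun API, and the data and the helpers `times_where_present` and `latest_at` are hypothetical.

```python
# Hypothetical sparse rows: (time, obj1, obj3); None means "nothing logged".
rows = [
    (1, 49.0, 1.0),
    (2, 27.0, None),
    (3, None, 3.0),
    (4, 38.0, None),
    (5, 17.0, 5.0),
]
OBJ3 = 2  # position of the column we resample on


def times_where_present(rows, column):
    """Step 1: index values at which the chosen column actually has data."""
    return [row[0] for row in rows if row[column] is not None]


def latest_at(rows, times):
    """Step 2: for each requested time, take the latest non-null value per column."""
    result = []
    for t in times:
        latest = [None] * (len(rows[0]) - 1)
        for row in rows:
            if row[0] > t:
                break  # rows are sorted by time, nothing later is relevant
            for i, value in enumerate(row[1:]):
                if value is not None:
                    latest[i] = value
        result.append((t, *latest))
    return result


# Correct order: find the times without filling, then fill at those times.
resampled = latest_at(rows, times_where_present(rows, OBJ3))

# Wrong order: filling first makes obj3 non-null on every row after its first
# value, so the not-null filter no longer selects anything useful.
filled_first = latest_at(rows, [row[0] for row in rows])
times_after_fill = times_where_present(filled_first, OBJ3)

print(resampled)
print(times_after_fill)
```

In this sketch, `resampled` keeps only times 1, 3, and 5, with obj1 at time 3 filled from its value at time 2, while `times_after_fill` contains every time from 1 to 5. That second result is the effect the warning about fill_latest_at=True in the first query is meant to avoid.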