Datastore compaction
The Rerun datastore continuously compacts data as it comes in, in order to find a sweet spot between ingestion speed, query performance, and memory overhead.
Compaction is triggered by both a row-count threshold and a byte-size threshold, whichever is reached first.
This is closely related to the micro-batching mechanism that runs on the SDK side.
You can configure these thresholds using the following environment variables:
* `RERUN_CHUNK_MAX_BYTES`
  Sets the threshold, in bytes, after which a `Chunk` cannot be compacted any further.
  Defaults to `RERUN_CHUNK_MAX_BYTES=4194304` (4MiB).
* `RERUN_CHUNK_MAX_ROWS`
  Sets the threshold, in rows, after which a `Chunk` cannot be compacted any further.
  Defaults to `RERUN_CHUNK_MAX_ROWS=4096`.
* `RERUN_CHUNK_MAX_ROWS_IF_UNSORTED`
  Sets the threshold, in rows, after which a `Chunk` cannot be compacted any further.
  Applies specifically to chunks that are not sorted by time, which can be slower to query.
  Defaults to `RERUN_CHUNK_MAX_ROWS_IF_UNSORTED=1024`.
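For example, to let chunks grow larger before compaction stops (trading some memory overhead for fewer, bigger chunks), you could raise the thresholds in the environment of the process hosting the datastore. The values below are illustrative only, not recommendations:

```shell
# Double the default compaction thresholds (illustrative values; tune for
# your own workload).
export RERUN_CHUNK_MAX_BYTES=8388608           # 8 MiB (default: 4 MiB)
export RERUN_CHUNK_MAX_ROWS=8192               # default: 4096
export RERUN_CHUNK_MAX_ROWS_IF_UNSORTED=2048   # default: 1024

# The variables must be set in the environment of the process that hosts
# the datastore, e.g. (hypothetical invocation):
#   RERUN_CHUNK_MAX_BYTES=8388608 rerun my_recording.rrd
```

Setting the variables inline on the command line, as in the comment above, scopes the override to a single invocation instead of the whole shell session.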