Add split_key_prefix_len to index config to shard S3 object keys #6529
fulmicoton-dd wants to merge 2 commits into
Recent splits share a ULID timestamp prefix, causing S3 key hotspots under high read load. Setting split_key_prefix_len (e.g. 2) on an index extracts N characters from the ULID random portion (positions 10–25) as a subdirectory prefix, distributing new splits across 32^N S3 partitions.
Move split_key_prefix_len into IndexingSettings
Replace split_key_prefix_len config field with QW_SPLIT_KEY_PREFIX_LEN env var
Remove now-meaningless config changes and stray whitespace
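As an illustration of the extraction described in the commit message, here is a minimal Rust sketch. It assumes a 26-character ULID whose first 10 characters encode the timestamp and whose last 16 are random (0-indexed positions 10–25). The function name key_prefix and the sample split ID are hypothetical and are not taken from the PR.

```rust
/// Number of leading ULID characters that encode the timestamp.
const ULID_TIMESTAMP_LEN: usize = 10;
/// Number of trailing ULID characters that are random.
const ULID_RANDOM_LEN: usize = 16;

/// Hypothetical helper: returns the first `prefix_len` characters of the
/// random portion of `split_id`, or `None` when sharding is disabled
/// (`prefix_len == 0`), the requested length exceeds the random portion,
/// or the split ID is too short to contain the requested range.
fn key_prefix(split_id: &str, prefix_len: usize) -> Option<&str> {
    if prefix_len == 0 || prefix_len > ULID_RANDOM_LEN {
        return None;
    }
    // `str::get` returns `None` instead of panicking on an out-of-range slice.
    split_id.get(ULID_TIMESTAMP_LEN..ULID_TIMESTAMP_LEN + prefix_len)
}
```

With the made-up split ID 01HZX5K3Q7ABCDEFGHJKMNPQRS, key_prefix(id, 2) returns Some("AB"), while a split ID shorter than 12 characters returns None, which a caller could treat as the flat scheme.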
@@ -0,0 +1,115 @@
// Copyright 2021-Present Datadog, Inc.
Having this file located under actors is surprising because this is not an actor.
ok moved it to models.
/// The value is read from the environment once and cached for the lifetime of the process.
fn split_key_prefix_len() -> u8 {
    static SPLIT_KEY_PREFIX_LEN: LazyLock<u8> = LazyLock::new(|| {
        let prefix_len: u8 = quickwit_common::get_from_env("QW_SPLIT_KEY_PREFIX_LEN", 0u8, false);
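The hunk above is cut off by the review view. For readers who want the read-once-and-cache pattern in one piece, here is a self-contained sketch that uses only the standard library instead of quickwit_common::get_from_env. The helper parse_prefix_len, the MAX_PREFIX_LEN constant, and the choice to fall back to 0 on unparsable or out-of-range values are assumptions made for illustration, not the PR's behavior.

```rust
use std::sync::LazyLock;

/// Assumed upper bound: the ULID random portion is 16 characters long.
const MAX_PREFIX_LEN: u8 = 16;

/// Hypothetical pure helper: turns the raw environment value into a prefix
/// length. Missing, unparsable, or out-of-range values fall back to 0
/// (the flat scheme). This fallback policy is an assumption.
fn parse_prefix_len(raw: Option<&str>) -> u8 {
    raw.and_then(|value| value.trim().parse::<u8>().ok())
        .filter(|len| *len <= MAX_PREFIX_LEN)
        .unwrap_or(0)
}

/// Reads QW_SPLIT_KEY_PREFIX_LEN once and caches the result for the
/// lifetime of the process.
fn split_key_prefix_len() -> u8 {
    static SPLIT_KEY_PREFIX_LEN: LazyLock<u8> = LazyLock::new(|| {
        let raw = std::env::var("QW_SPLIT_KEY_PREFIX_LEN").ok();
        parse_prefix_len(raw.as_deref())
    });
    *SPLIT_KEY_PREFIX_LEN
}
```

Keeping the parsing in a pure function makes it testable without touching the process environment; the cached accessor stays a thin wrapper.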
I wonder if we should hardcode the len to a reasonable default and forget about it, reducing the amount of boilerplate in the config. We used to use 4 chars at Airbnb: we would hash the original path and use the first four chars of the hash.
Split ULIDs are base32: 5 bits per char. A prefix of 2 already gives us 1024 buckets (and then the regular ULID).
The reason why I didn't do more was, on the off chance that we need to find all splits within a range, it can be done in 1024 requests. If we don't care then ok for 4. Let me know what you think.
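To make the trade-off in this thread concrete, the bucket count is 32^N, i.e. 2^(5N), since each base32 character carries 5 bits. The sketch below computes it; the name bucket_count is hypothetical, and the 16-character cap is taken from the ULID random portion length mentioned in the PR description.

```rust
/// Hypothetical helper: number of distinct key prefixes for a prefix of
/// `prefix_len` base32 characters. Each character carries 5 bits, so the
/// count is 32^prefix_len = 2^(5 * prefix_len).
/// Returns `None` beyond 16 characters, the length of the ULID random portion.
fn bucket_count(prefix_len: u32) -> Option<u128> {
    if prefix_len > 16 {
        return None;
    }
    // 5 * 16 = 80 bits at most, which fits comfortably in a u128.
    Some(1u128 << (5 * prefix_len))
}
```

bucket_count(2) is 1024 and bucket_count(4) is 1,048,576. Assuming one list request per bucket, enumerating every split would therefore cost roughly a thousand requests with 2 characters and roughly a million with 4.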
    if prefix.is_empty() {
        return format!("{split_id}.split");
    }
    format!("{prefix}/{split_id}.split")
I think the / as a separator is not great because the AWS UI and S3 CLI are going to show them as directories.
ok changing that for - then?
actually I cannot change this without breaking yahoo.
we can delete the splits if needed
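To compare the two separators discussed in this thread, here is a small sketch of the path construction with the separator made explicit. The function name split_storage_path_with_separator and its separator parameter are hypothetical; the snippet under review hardcodes /.

```rust
/// Hypothetical variant of the path builder with a configurable separator.
/// An empty prefix yields the legacy flat layout `{split_id}.split`.
fn split_storage_path_with_separator(split_id: &str, prefix: &str, separator: char) -> String {
    if prefix.is_empty() {
        return format!("{split_id}.split");
    }
    format!("{prefix}{separator}{split_id}.split")
}
```

With the made-up split ID 01HZX5K3Q7ABCDEFGHJKMNPQRS and prefix AB, the / separator gives AB/01HZX5K3Q7ABCDEFGHJKMNPQRS.split, while - gives AB-01HZX5K3Q7ABCDEFGHJKMNPQRS.split. Either way the key starts with the sharding prefix; only the first form is displayed as a directory by tools that treat / as a delimiter.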
Summary
- Add split_key_prefix_len: u8 to IndexConfig (and IndexTemplate) to configure S3 key sharding per index
- Add compute_split_key_prefix(split_id, prefix_len) to quickwit-common: extracts N characters from the ULID random portion (positions 10–25) as a prefix; logs a rate-limited warning and falls back to the flat scheme if the split ID is too short
- Add split_storage_path(split_id, prefix) to quickwit-common: builds the storage path from a precomputed prefix string; empty prefix = legacy flat scheme
- Validate split_key_prefix_len <= 16 (ULID random portion length) at config load time
- SplitMetadata will gain a prefix: String field in a follow-up PR to wire the full pipeline (uploader, leaf search, merge, GC)

Backward compatibility: the field defaults to 0 (serde default), so all existing indexes and splits are unaffected. New splits on indexes with split_key_prefix_len: 2 will land at ND/01ARZ3.../01ARZ3....split paths, distributing across 1024 S3 partitions instead of one.

Test plan
- cargo nextest run -p quickwit-common -p quickwit-config --all-features: 221 tests pass
- cargo clippy --workspace --all-features --tests: no warnings
- cargo +nightly fmt --all -- --check: no issues
- Set split_key_prefix_len: 2 in an index config, verify it round-trips through serde correctly
- Set split_key_prefix_len: 17, verify it is rejected with a clear error message

🤖 Generated with Claude Code