-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Add pluggable table samplers with precomputed broker routing entries and tableSampler query option #17532
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
cd4e1c6 to
19f856b
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #17532 +/- ##
============================================
- Coverage 63.18% 63.15% -0.04%
+ Complexity 1477 1476 -1
============================================
Files 3172 3177 +5
Lines 189773 190067 +294
Branches 29041 29098 +57
============================================
+ Hits 119913 120041 +128
- Misses 60547 60678 +131
- Partials 9313 9348 +35
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
19f856b to
758dda4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR adds a pluggable table sampling feature that enables deterministic sampling of segments at the broker routing layer to reduce query-time overhead for large tables. The implementation precomputes sampler-specific routing entries and allows query-time selection via a tableSampler query option.
Changes:
- Introduced
TableSamplerConfigin table configuration with two built-in sampler types:firstN(lexicographic selection) andnPerDay(temporal bucketing) - Extended broker routing manager to build and cache sampler-specific routing entries alongside default routing
- Added MSQ support by propagating query options to leaf routing requests
- Included ZooKeeper serialization/deserialization for table sampler configurations
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
pinot-tools/src/main/resources/examples/batch/airlineStats/airlineStats_offline_table_config.json |
Added sample tableSamplers configuration to quickstart example |
pinot-spi/src/main/java/org/apache/pinot/spi/utils/builder/TableConfigBuilder.java |
Added builder support for table samplers |
pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java |
Registered tableSampler query option constant |
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/sampler/TableSamplerConfig.java |
New configuration class for table sampler definitions |
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/TableConfig.java |
Extended table config to include table samplers list |
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/TableConfigUtilsTest.java |
Updated test constructors with new table sampler parameter |
pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/creator/CLPForwardIndexCreatorTest.java |
Updated test constructor with new table sampler parameter |
pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java |
Propagated query options to MSQ leaf routing for sampler support |
pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/TableSamplerIntegrationTest.java |
Integration test validating nPerDay sampler behavior |
pinot-connectors/pinot-spark-3-connector/src/main/scala/org/apache/pinot/connector/spark/v3/datasource/PinotDataWriter.scala |
Updated constructor with new table sampler parameter |
pinot-common/src/main/java/org/apache/pinot/common/utils/config/TableConfigSerDeUtils.java |
Added ZK serialization/deserialization for table samplers |
pinot-broker/src/test/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSamplerTest.java |
Unit tests for nPerDay sampler including timezone handling |
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/TableSamplerFactory.java |
Factory for creating table sampler instances |
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/TableSampler.java |
Interface defining table sampler contract |
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSampler.java |
Implementation selecting N segments per day using ZK metadata |
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/FirstNSegmentsTableSampler.java |
Implementation selecting first N segments lexicographically |
pinot-broker/src/main/java/org/apache/pinot/broker/routing/segmentpreselector/TableSamplerSegmentPreSelector.java |
Wrapper applying table sampler to pre-selected segments |
pinot-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java |
Core routing logic to build, cache, and select sampler-specific routing entries |
...t-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java
Outdated
Show resolved
Hide resolved
...t-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSampler.java
Outdated
Show resolved
Hide resolved
...t-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java
Show resolved
Hide resolved
758dda4 to
adfc8ba
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.
...t-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java
Outdated
Show resolved
Hide resolved
.../src/main/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSampler.java
Outdated
Show resolved
Hide resolved
...t-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java
Outdated
Show resolved
Hide resolved
pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java
Show resolved
Hide resolved
adfc8ba to
bed4ff3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.
Comments suppressed due to low confidence (1)
pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSampler.java:1
- Line 158 uses the incorrect constant
Segment.TIME_TIME_UNITinstead ofSegment.TIME_UNIT. This will cause the code to fail to retrieve the time unit field from segment metadata, preventing epoch-zero segments from being correctly sampled.
/**
pinot-spi/src/main/java/org/apache/pinot/spi/config/table/TableConfig.java
Show resolved
Hide resolved
...t-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java
Outdated
Show resolved
Hide resolved
bed4ff3 to
72a336d
Compare
0301e99 to
6241ba7
Compare
6241ba7 to
02b4512
Compare
Motivation
Large-table sampling needs to be deterministic and avoid query-time segment selection overhead. This adds a pluggable “table sampler” definition in table config and precomputes sampler-specific routing entries at the broker.
Key changes
Config: add
tableSamplerstoTableConfig(+ ZK SerDe support) and query optiontableSampler=<name>.Broker routing:
Built-in samplers:
firstN: select first N segments (lexicographic)timeBucket: select up to N segments per day using segment ZK start time metadataMSQ support: propagate query options into MSQ leaf routing requests so
tableSamplerworks with multi-stage engine.Tests:
timeBucketQuickstart: add sample
tableSamplersconfig to batch airlineStats table config.How to use
1. Add samplers to your table config
Example (offline table):
firstN
properties.numSegments(required): number of segments to keep.timeBucket
properties.numSegmentsPerDay(required): number of segments to keep per bucket.properties.bucketDays(optional, default 1): bucket size in days (e.g. 7 = weekly buckets).2. Query with a sampler (via query option)
Examples:
queryOptions: "tableSampler=perDay1"3. Default behavior (no sampler selected)
If you don’t set tableSampler, Pinot uses the default routing entry (full table, no sampling).
Compatibility