Contributor

@xiangfu0 xiangfu0 commented Jan 19, 2026

Motivation

Large-table sampling needs to be deterministic and avoid query-time segment selection overhead. This adds a pluggable “table sampler” definition in table config and precomputes sampler-specific routing entries at the broker.

Key changes

  • Config: add tableSamplers to TableConfig (+ ZK SerDe support) and query option tableSampler=<name>.

  • Broker routing:

    • Build and cache sampler-specific routing entries per table sampler name.
    • Select routing entry at query time by tableSampler option (fallback to default when absent/unknown).
    • Refresh sampler routing entries on Helix assignment changes (IdealState/ExternalView updates).
  • Built-in samplers:

    • firstN: select first N segments (lexicographic)
    • timeBucket: select up to N segments per day using segment ZK start time metadata
  • MSQ support: propagate query options into MSQ leaf routing requests so tableSampler works with multi-stage engine.

  • Tests:

    • Unit test for timeBucket
    • Integration test (shared cluster) validating 10 segments/day × 7 days → sampler returns 1 segment/day and group-by results reflect that
  • Quickstart: add sample tableSamplers config to batch airlineStats table config.

How to use

1. Add samplers to your table config

Example (offline table):

"tableSamplers": [
  {
    "name": "small",
    "type": "firstN",
    "properties": {
      "numSegments": "10"
    }
  },
  {
    "name": "perDay1",
    "type": "timeBucket",
    "properties": {
      "numSegmentsPerDay": "1",
      "bucketDays": "1"
    }
  }
]
  • firstN

    • Purpose: Always pick a small, deterministic subset: “first N segments” by segment name.
    • Config
      • properties.numSegments (required): number of segments to keep.
  • timeBucket

    • Purpose: Pick N segments per time bucket (bucket size in days), deterministically.
    • Config
      • properties.numSegmentsPerDay (required): number of segments to keep per bucket.
      • properties.bucketDays (optional, default 1): bucket size in days (e.g. 7 = weekly buckets).
    • Notes
      • Buckets are computed in UTC.
      • Segments without parsable timestamps are skipped.
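
The two built-in samplers described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the class `SamplerSketch`, its method names, and the input shape (a map from segment name to start time in milliseconds, with `null` standing in for an unparsable timestamp) are assumptions made for the example. Picking segments within a bucket in lexicographic name order is also an assumption; the description only states that selection is deterministic.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the two sampling strategies; not Pinot's actual classes.
final class SamplerSketch {
  private SamplerSketch() {
  }

  // firstN: sort segment names lexicographically and keep the first numSegments.
  static List<String> firstN(Collection<String> segmentNames, int numSegments) {
    List<String> sorted = new ArrayList<>(segmentNames);
    Collections.sort(sorted);
    int keep = Math.max(0, Math.min(numSegments, sorted.size()));
    return new ArrayList<>(sorted.subList(0, keep));
  }

  // timeBucket: group segments into UTC buckets of bucketDays days (by start time)
  // and keep at most numSegmentsPerDay segments per bucket.
  // Assumption: within a bucket, segments are taken in lexicographic name order.
  static List<String> timeBucket(Map<String, Long> startTimeMsBySegment,
      int numSegmentsPerDay, int bucketDays) {
    long bucketMs = TimeUnit.DAYS.toMillis(bucketDays);
    TreeMap<Long, List<String>> buckets = new TreeMap<>();
    // Copying into a TreeMap iterates segment names in sorted order.
    for (Map.Entry<String, Long> entry : new TreeMap<>(startTimeMsBySegment).entrySet()) {
      Long startMs = entry.getValue();
      if (startMs == null) {
        continue; // segments without a parsable timestamp are skipped
      }
      // Epoch millis are UTC-based, so integer division yields UTC buckets.
      long bucket = Math.floorDiv(startMs, bucketMs);
      List<String> picked = buckets.computeIfAbsent(bucket, k -> new ArrayList<>());
      if (picked.size() < numSegmentsPerDay) {
        picked.add(entry.getKey());
      }
    }
    List<String> result = new ArrayList<>();
    buckets.values().forEach(result::addAll);
    return result;
  }
}
```

With `bucketDays` set to 7, the same call produces weekly buckets, which matches the `properties.bucketDays` example above.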

2. Query with a sampler (via query option)

  • Option format: tableSampler=<name>, where <name> is the name of a sampler defined in the table config.

Examples:
  • Pinot SQL:
SET tableSampler=small;SELECT COUNT(*) FROM myTable;
SET tableSampler=perDay1;SELECT DaysSinceEpoch, COUNT(*) FROM myTable GROUP BY DaysSinceEpoch;
  • HTTP queryOptions:
    • queryOptions: "tableSampler=perDay1"
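
For the HTTP case, a request carrying the sampler option might be built along these lines. This is a sketch under assumptions: the class `SamplerQueryRequestSketch` and its helpers are hypothetical, the broker address is a placeholder, and the `/query/sql` endpoint with a JSON body containing `sql` and `queryOptions` is assumed to be the broker query API in use. The code only constructs the request and does not send it.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper that builds (but does not send) a broker query request.
final class SamplerQueryRequestSketch {
  private SamplerQueryRequestSketch() {
  }

  // Minimal JSON string escaping for backslashes and double quotes.
  static String escape(String s) {
    return s.replace("\\", "\\\\").replace("\"", "\\\"");
  }

  // JSON body with the SQL text and the tableSampler query option.
  static String body(String sql, String samplerName) {
    return "{\"sql\":\"" + escape(sql) + "\","
        + "\"queryOptions\":\"tableSampler=" + escape(samplerName) + "\"}";
  }

  // Assumption: the broker accepts POST requests at <brokerUrl>/query/sql.
  static HttpRequest request(String brokerUrl, String sql, String samplerName) {
    return HttpRequest.newBuilder(URI.create(brokerUrl + "/query/sql"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body(sql, samplerName)))
        .build();
  }
}
```

The resulting request can be passed to a `java.net.http.HttpClient` once the broker URL is replaced with a real one.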

3. Default behavior (no sampler selected)
If you don’t set tableSampler, Pinot uses the default routing entry (full table, no sampling).
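
The selection rule (use the sampler's routing entry when the option names a known sampler, otherwise fall back to the default entry) can be illustrated with a small lookup. This is a hedged sketch: `RoutingSelectionSketch` and the generic entry type are hypothetical and do not reflect the actual structure of `BaseBrokerRoutingManager`.

```java
import java.util.Map;

// Hypothetical sketch of picking a precomputed routing entry at query time.
final class RoutingSelectionSketch {
  private RoutingSelectionSketch() {
  }

  // Returns the sampler-specific entry when tableSampler names a known sampler,
  // and the default (full table) entry when the option is absent or unknown.
  static <T> T select(T defaultEntry, Map<String, T> samplerEntries, String tableSampler) {
    if (tableSampler == null || tableSampler.isEmpty()) {
      return defaultEntry;
    }
    T entry = samplerEntries.get(tableSampler);
    return entry != null ? entry : defaultEntry;
  }
}
```

Because the entries are precomputed, the per-query cost in this sketch is a single map lookup rather than a segment selection pass.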

Compatibility

  • Fully backward compatible: if no sampler is configured or selected, routing behavior is unchanged.

@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch from cd4e1c6 to 19f856b Compare January 19, 2026 14:30
@xiangfu0 xiangfu0 marked this pull request as draft January 19, 2026 14:35

codecov-commenter commented Jan 19, 2026

Codecov Report

❌ Patch coverage is 28.66242% with 224 lines in your changes missing coverage. Please review.
✅ Project coverage is 63.15%. Comparing base (d54ec21) to head (02b4512).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...oker/routing/manager/BaseBrokerRoutingManager.java | 22.05% | 98 Missing and 8 partials ⚠️ |
| ...g/tablesampler/TimeBucketSegmentsTableSampler.java | 38.20% | 43 Missing and 12 partials ⚠️ |
| ...uting/tablesampler/FirstNSegmentsTableSampler.java | 0.00% | 15 Missing ⚠️ |
| ...org/apache/pinot/spi/config/table/TableConfig.java | 31.25% | 10 Missing and 1 partial ⚠️ |
| .../org/apache/pinot/query/routing/WorkerManager.java | 64.28% | 8 Missing and 2 partials ⚠️ |
| ...t/spi/config/table/sampler/TableSamplerConfig.java | 0.00% | 8 Missing ⚠️ |
| ...oker/routing/tablesampler/TableSamplerFactory.java | 0.00% | 7 Missing ⚠️ |
| ...entpreselector/TableSamplerSegmentPreSelector.java | 0.00% | 6 Missing ⚠️ |
| ...not/common/utils/config/TableConfigSerDeUtils.java | 42.85% | 2 Missing and 2 partials ⚠️ |
| ...he/pinot/spi/utils/builder/TableConfigBuilder.java | 0.00% | 2 Missing ⚠️ |

Additional details and impacted files
@@             Coverage Diff              @@
##             master   #17532      +/-   ##
============================================
- Coverage     63.18%   63.15%   -0.04%     
+ Complexity     1477     1476       -1     
============================================
  Files          3172     3177       +5     
  Lines        189773   190067     +294     
  Branches      29041    29098      +57     
============================================
+ Hits         119913   120041     +128     
- Misses        60547    60678     +131     
- Partials       9313     9348      +35     

| Flag | Coverage Δ |
|---|---|
| custom-integration1 | 100.00% <ø> (ø) |
| integration | 100.00% <ø> (ø) |
| integration1 | 100.00% <ø> (ø) |
| integration2 | 0.00% <ø> (ø) |
| java-11 | 63.07% <28.66%> (-0.05%) ⬇️ |
| java-21 | 63.12% <28.66%> (-0.05%) ⬇️ |
| temurin | 63.15% <28.66%> (-0.04%) ⬇️ |
| unittests | 63.15% <28.66%> (-0.04%) ⬇️ |
| unittests1 | 55.51% <42.62%> (-0.02%) ⬇️ |
| unittests2 | 34.04% <22.92%> (+0.01%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch from 19f856b to 758dda4 Compare January 20, 2026 08:23
@xiangfu0 xiangfu0 requested a review from Copilot January 20, 2026 08:33

Copilot AI left a comment

Pull request overview

This PR adds a pluggable table sampling feature that enables deterministic sampling of segments at the broker routing layer to reduce query-time overhead for large tables. The implementation precomputes sampler-specific routing entries and allows query-time selection via a tableSampler query option.

Changes:

  • Introduced TableSamplerConfig in table configuration with two built-in sampler types: firstN (lexicographic selection) and nPerDay (temporal bucketing)
  • Extended broker routing manager to build and cache sampler-specific routing entries alongside default routing
  • Added MSQ support by propagating query options to leaf routing requests
  • Included ZooKeeper serialization/deserialization for table sampler configurations

Reviewed changes

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

Show a summary per file

| File | Description |
|---|---|
| pinot-tools/src/main/resources/examples/batch/airlineStats/airlineStats_offline_table_config.json | Added sample tableSamplers configuration to quickstart example |
| pinot-spi/src/main/java/org/apache/pinot/spi/utils/builder/TableConfigBuilder.java | Added builder support for table samplers |
| pinot-spi/src/main/java/org/apache/pinot/spi/utils/CommonConstants.java | Registered tableSampler query option constant |
| pinot-spi/src/main/java/org/apache/pinot/spi/config/table/sampler/TableSamplerConfig.java | New configuration class for table sampler definitions |
| pinot-spi/src/main/java/org/apache/pinot/spi/config/table/TableConfig.java | Extended table config to include table samplers list |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/utils/TableConfigUtilsTest.java | Updated test constructors with new table sampler parameter |
| pinot-segment-local/src/test/java/org/apache/pinot/segment/local/segment/index/creator/CLPForwardIndexCreatorTest.java | Updated test constructor with new table sampler parameter |
| pinot-query-planner/src/main/java/org/apache/pinot/query/routing/WorkerManager.java | Propagated query options to MSQ leaf routing for sampler support |
| pinot-integration-tests/src/test/java/org/apache/pinot/integration/tests/custom/TableSamplerIntegrationTest.java | Integration test validating nPerDay sampler behavior |
| pinot-connectors/pinot-spark-3-connector/src/main/scala/org/apache/pinot/connector/spark/v3/datasource/PinotDataWriter.scala | Updated constructor with new table sampler parameter |
| pinot-common/src/main/java/org/apache/pinot/common/utils/config/TableConfigSerDeUtils.java | Added ZK serialization/deserialization for table samplers |
| pinot-broker/src/test/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSamplerTest.java | Unit tests for nPerDay sampler including timezone handling |
| pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/TableSamplerFactory.java | Factory for creating table sampler instances |
| pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/TableSampler.java | Interface defining table sampler contract |
| pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSampler.java | Implementation selecting N segments per day using ZK metadata |
| pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/FirstNSegmentsTableSampler.java | Implementation selecting first N segments lexicographically |
| pinot-broker/src/main/java/org/apache/pinot/broker/routing/segmentpreselector/TableSamplerSegmentPreSelector.java | Wrapper applying table sampler to pre-selected segments |
| pinot-broker/src/main/java/org/apache/pinot/broker/routing/manager/BaseBrokerRoutingManager.java | Core routing logic to build, cache, and select sampler-specific routing entries |

@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch from 758dda4 to adfc8ba Compare January 20, 2026 11:49
@xiangfu0 xiangfu0 requested a review from Copilot January 20, 2026 15:06

Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 4 comments.

@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch from adfc8ba to bed4ff3 Compare January 21, 2026 18:25
@xiangfu0 xiangfu0 requested a review from Copilot January 21, 2026 18:25

Copilot AI left a comment

Pull request overview

Copilot reviewed 18 out of 18 changed files in this pull request and generated 2 comments.

Comments suppressed due to low confidence (1)

pinot-broker/src/main/java/org/apache/pinot/broker/routing/tablesampler/NPerDaySegmentsTableSampler.java:1

  • Line 158 uses the incorrect constant Segment.TIME_TIME_UNIT instead of Segment.TIME_UNIT. This will cause the code to fail to retrieve the time unit field from segment metadata, preventing epoch-zero segments from being correctly sampled.

@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch from bed4ff3 to 72a336d Compare January 21, 2026 18:42
@xiangfu0 xiangfu0 marked this pull request as ready for review January 21, 2026 18:42
@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch 5 times, most recently from 0301e99 to 6241ba7 Compare January 22, 2026 19:02
@xiangfu0 xiangfu0 changed the title Add table sampler routing entries (precomputed segment subsets) with nPerDay sampler and tableSampler query option Add table sampler routing entries (precomputed segment subsets) with timeBucket sampler and tableSampler query option Jan 22, 2026
@xiangfu0 xiangfu0 changed the title Add table sampler routing entries (precomputed segment subsets) with timeBucket sampler and tableSampler query option Add pluggable table samplers with precomputed broker routing entries and tableSampler query option Jan 22, 2026
@xiangfu0 xiangfu0 force-pushed the feature/table-sampler-routing branch from 6241ba7 to 02b4512 Compare January 23, 2026 10:29