⚠ This page is served via a proxy. Original site: https://github.com
This service does not collect credentials or authentication data.
Skip to content

feat: AoS tuple sketch#476

Open
proost wants to merge 16 commits intoapache:masterfrom
proost:feat-aos-tuple-sketch
Open

feat: AoS tuple sketch#476
proost wants to merge 16 commits intoapache:masterfrom
proost:feat-aos-tuple-sketch

Conversation

@proost
Copy link
Member

@proost proost commented Jan 27, 2026

issue: #475

I change original design. I will follow same design with array of doubles sketch, that allow allocator for container not the element of container. I think using standard allocator for string is fine, for consistency of code following same design is good And even if we use custom allocator, we can't use completely for string allocation because of serialization and deserialization.

even though above reasons, if want to change to orignal design, it is acceptable.

@proost proost self-assigned this Jan 27, 2026
@coveralls
Copy link

coveralls commented Jan 27, 2026

Pull Request Test Coverage Report for Build 21757025873

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 215 of 264 (81.44%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-0.3%) to 98.533%

Changes Missing Coverage Covered Lines Changed/Added Lines %
common/include/third_party/utf8cpp/utf8/core.h 62 111 55.86%
Totals Coverage Status
Change from base Build 20975908107: -0.3%
Covered Lines: 17263
Relevant Lines: 17520

💛 - Coveralls

@proost proost requested a review from jmalkin February 7, 2026 05:59
@proost
Copy link
Member Author

proost commented Feb 9, 2026

@tisonkun
If you have time, can you review this PR?

@proost proost requested a review from tisonkun February 9, 2026 23:04
@leerho
Copy link
Member

leerho commented Feb 11, 2026

@proost,
Could you please explain the purpose of the third-party library by Nemanja Trifunovic, that you have added, and your reasoning as to why it is absolutely necessary.

I would like to discourage adding third-party dependencies in our code base (especially non-standard ones) as it complicates code integration, debugging, and maintenance. In C++, we use the C++ Standard Library, Boost, XxHash64, and public domain MurmurHash3, all of which are extremely well known and tested. I'm not so sure about this code from Nemanja Trifunovic. Is it versioned? Does it follow a rigorous release process? Is it maintained? Do we absolutely have to have it?

His License header is non-standard, we may need to have it approved.
And even if you could convince me that it is required, we would need to update the LICENSE file, at minimum.

@proost
Copy link
Member Author

proost commented Feb 12, 2026

@leerho
The reason I introduced UTF-8 validation is to preserve cross-language portability for the AoS tuple sketch. On the Java side, strings are Unicode and serialization produces UTF-8 bytes; if the C++ implementation accepts arbitrary std::string bytes and serializes them without checks, we can end up writing invalid UTF-8 into the sketch stream, which could fail to deserialize or behave inconsistently across languages. utfcpp library is small and has a feature what i need, so i chose one.

I agree that bringing in a third-party UTF-8 library (and its non-standard license header) is a significant cost: integration, maintenance, debugging, and licensing updates.

I see three possible approaches:

  1. Add a small, self-contained UTF-8 well-formedness validation routine in our codebase (C++11-compatible). This avoids external dependencies and LICENSE updates, with only a small maintenance burden.
  2. Document the API contract clearly: input strings are expected to be valid UTF-8, without performing validation.
  3. Use ICU. Using ICU works correct, however this introduces a large dependency, additional operational cost, and longer compile times.

Based on your feedback, I propose we go with option (1). If you prefer a different direction, I can adjust.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants