-
Notifications
You must be signed in to change notification settings - Fork 2.8k
feat(evaluation): Add CJK tokenizer support for ROUGE-1 evaluation #4143
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Fixes google#4122 The default ROUGE tokenizer only recognizes ASCII alphanumeric characters, causing ROUGE-1 scores to be 0.0 for CJK (Chinese, Japanese, Korean) text. Changes: - Add CJKTokenizer class that handles CJK characters individually while preserving word-based tokenization for ASCII alphanumeric characters - Add RougeScoreCriterion to allow explicit tokenizer specification - Update RougeEvaluator to support custom tokenizers - Add warning when CJK text is detected without CJK tokenizer - Preserve backward compatibility: default behavior unchanged Usage: criterion = RougeScoreCriterion(threshold=0.8, tokenizer='cjk') eval_metric = EvalMetric( metric_name='response_match_score', threshold=0.8, criterion=criterion, ) Limitations: - Fullwidth alphanumeric (A-Z, 0-9) are skipped - Greek, Cyrillic, and other non-CJK scripts are skipped - This is character-based tokenization, not morphological analysis
Summary of ChangesHello @maru0804, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the ROUGE-1 evaluation framework by introducing robust support for CJK (Chinese, Japanese, Korean) languages. Previously, the default ROUGE tokenizer's ASCII-only nature led to inaccurate, often zero, scores for CJK text. The changes address this by providing an opt-in CJK-specific tokenizer and a mechanism to configure it, ensuring that evaluations involving multilingual content yield meaningful results. The update also includes a helpful warning system to alert users when CJK text is present but no appropriate tokenizer is configured, promoting correct usage. Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console. Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request does an excellent job of adding CJK language support for ROUGE-1 evaluation, addressing a significant limitation of the default tokenizer. The introduction of CJKTokenizer and RougeScoreCriterion is well-designed, and the changes are thoughtfully integrated to ensure backward compatibility. The addition of comprehensive unit tests covering the new tokenizer, scoring logic, and warning behavior is particularly commendable. My review includes a couple of suggestions to improve performance and code conciseness, but overall, this is a high-quality contribution.
…instantiation Address Gemini Code Assist review feedback: 1. Simplify CJKTokenizer.tokenize() using re.findall instead of manual loop - More concise and idiomatic Python - Same functionality with fewer lines of code 2. Optimize RougeScorer instantiation for performance - Create RougeScorer once in __init__ instead of per invocation - Reuse self._scorer across all evaluate_invocations calls - Avoids unnecessary object creation in loops
Summary
Fixes #4122
The default ROUGE tokenizer only recognizes ASCII alphanumeric characters (
[a-z0-9]), causing ROUGE-1 scores to be 0.0 for CJK (Chinese, Japanese, Korean) text. This PR adds CJK language support through an opt-in tokenizer.Changes
New Features
CJKTokenizer: A character-based tokenizer for CJK languages that:RougeScoreCriterion: New criterion class to specify tokenizer optionsModified
RougeEvaluator: Updated to support custom tokenizers and log a warning (once) when CJK text is detected without proper tokenizer configurationResponseEvaluator: Updated to passeval_metric(including criterion) toRougeEvaluatorUsage
Backward Compatibility
Limitations (documented in docstrings)
Test Coverage
Added 29 new tests covering: