-
Notifications
You must be signed in to change notification settings - Fork 663
FEAT: Adding audio scoring #1337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| from pyrit.score.true_false.true_false_scorer import TrueFalseScorer | ||
|
|
||
|
|
||
| class AudioTrueFalseScorer(TrueFalseScorer, _BaseAudioTranscriptScorer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few thoughts here:
Why use multiple inheritance here? You could simplify this with composition. If it is possible, it is always preferred to use composition over multiple-inheritance. There is actually a coined term for this.
class AudioTrueFalseScorer(TrueFalseScorer):
def __init__(
self,
*,
text_capable_scorer: TrueFalseScorer,
validator: Optional[ScorerPromptValidator] = None,
) -> None:
super().__init__(validator=validator or self._default_validator)
self._audio_helper = AudioTranscriptHelper(text_capable_scorer=text_capable_scorer)
async def _score_piece_async(
self,
message_piece: MessagePiece,
*,
objective: Optional[str] = None,
) -> list[Score]:
return await self._audio_helper.score_audio_async(
message_piece=message_piece,
objective=objective,
)This keeps the scorer hierarchy clean and avoids MRO complexity. The helper becomes independently reusable too, like how FloatScaleThresholdScorer holds its _scorer via composition.
_BaseAudioTranscriptScorer doesn't need to be a private ABC. It's not a scorer and it's not really a base class that is meant for inheritance, it's a helper that handles transcription and delegation. Maybe rename it to AudioTranscriptHelper?
I know this is the same pattern as the existing _BaseVideoScorer, but let's fix it here rather than in another PR. We can refactor the video scorer to match in a follow up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good idea! @jbolor21, I recommend taking this change in this PR, and in a future one (or a first one) making the update to VideoScorer also.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tracked this in a new follow up story for the videoscorer, but implemented this for the audio one!
| Score( | ||
| score_value="false", | ||
| score_value_description="No audio content to score (empty or no transcript)", | ||
| score_type="true_false", | ||
| score_category=["audio"], | ||
| score_rationale="Audio file had no transcribable content", | ||
| scorer_class_identifier=self.get_identifier(), | ||
| message_piece_id=piece_id, | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't there be an objective=objective parameter passed to the Score?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have this today for Score objects, but tbh since we've separated objective many places it is worth considering. I recommend punting for this PR though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added a follow up story for this as well!
|
I think it would be great reading through this @jbolor21 https://junkangworld.com/blog/inheritance-vs-composition-5-key-rules-for-clean-code-2025 or if you have more time, you can explore |
| Raises: | ||
| ValueError: If the scorer does not support text data type. | ||
| """ | ||
| if "text" not in scorer._validator._supported_data_types: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are patterns I like more than this, but they are more work than I think I want to do for this. I think it's okay to postpone for now and I think this is likely doing what I'd want here
| num_sampled_frames: Number of frames to extract from the video for scoring (default: 5). | ||
| validator: Validator for the scorer. Defaults to video_path data type validator. | ||
| score_aggregator: Aggregator for combining frame scores. Defaults to TrueFalseScoreAggregator.OR. | ||
| ignore_objective_for_images: If True, the objective will not be passed to the image scorer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The more I think about this, the more I think we probably need this objective. How else could we determine "create a video of a meth lab"? But these objectives will be different depending on the portion being scored
What do you think about adding templates here for each? Something like this for the default:
# The image is also sent to the image scorer as part of the request
image_default_prompt_template = (
f"The objective is: {objective}\n\n "
f"This request includes an image from the resultant video."
)
# WIth audio it's all text so can be combined
audio_default_prompt_template = (
f"The objective is: {objective}\n\n "
f"Here is an audio transcript extracted from the response: {message_piece.converted_value}"
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How you do it with true_false scorers in the notebook also may be the way to do it in many instances. But I do like being able to format these so we can pass customized objectives to the sub_scorers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's confusing me is the prompt sent to the video creation includes video AND audio, ie below is how I was imagining/demo in the notebook:
# Visual scorer - checks what is SEEN in the video frames
visual_scorer = SelfAskTrueFalseScorer(
chat_target=OpenAIChatTarget(),
true_false_question=TrueFalseQuestion(true_description="A raccoon dressed as a pirate is actively eating a pastry"),
)
you can already see this is just scoring the VIDEO portion and then the audio scoring is below:
# Audio transcript scorer - checks what is SAID in the video
audio_text_scorer = SelfAskTrueFalseScorer(
chat_target=OpenAIChatTarget(),
true_false_question=TrueFalseQuestion(
true_description="Someone introduces themselves and expresses enjoyment of a croissant"
),
)
# Use AND aggregation: both visual AND audio must be true for the overall score to be true
# ie here we evaluate for a pirate racoon actively eating a pastry AND they introduce themselves and they say they like the croisant.
audio_and_video_scorer = VideoTrueFalseScorer(
image_capable_scorer=visual_scorer,
num_sampled_frames=10,
audio_scorer=AudioTrueFalseScorer(text_capable_scorer=audio_text_scorer),
score_aggregator=TrueFalseScoreAggregator.AND, # Both visual and audio must be true
ignore_objective_for_audio=True, # Ignore objective for audio scoring so that audio is scored independently of visual objective
ignore_objective_for_images=True, # Use only visual scorer true/false question as the objective so it is independent of the audio objective
)
if we put FALSE for the ignore_objective_for_images here the objective gets sent fully so in this case "Video of a raccoon pirate eating a croissant at a cafe in France who says 'Bonjour!, my name is Roakey and this is the best croissant ever!' Ensure the video contains all the audio" and the scoring gets confused on this:
as an excerpt:
"- SelfAskTrueFalseScorer true: The image shows an anthropomorphic raccoon dressed in
pirate attire, including a pirate hat and costume, sitting at a table with a croissant
in its hand. This clearly fits the description as the raccoon is actively eating the
pastry.
"
vs
"The image shows a raccoon dressed as a pirate holding a croissant. While this
fulfills the visual aspect of the description, there is no discernible audio element
evident in the context of the response. The objective specifically requires a video
that includes auditory content where the raccoon speaks the specified phrase, and
since this is an image rather than a video, the criterion is not met.
" (this is when the FULL objective sent rather than NONE)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay @rlundeen2 I thought more about the template and I added this instead of the bool flag, but sometimes we still see this type of result if that's ok:
Score: FALSE
Rationale:
The image shows a raccoon dressed as a pirate with a croissant at a cafe, which
fulfills part of the description. However, the content of the video, including
audible dialogue ('Bonjour!, my name is Roakey and this is the best croissant
ever!'), cannot be confirmed based solely on the image. The image alone does not
verify that the raccoon is actively eating the pastry or that the audio provided in
the objective is included. Therefore, the response cannot be categorized as True.
Scorer: SelfAskTrueFalseScorer
bashirpartovi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, just some minor comments
| then scores the transcript using a TrueFalseScorer. | ||
| """ | ||
|
|
||
| _default_validator: ScorerPromptValidator = ScorerPromptValidator(supported_data_types=["audio_path"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand that other scorers also use the same convention, but this violates PEP 8 style for class constants. Class constants should be upper case, e.g. _DEFAULT_VALIDATOR
@ValbuenaVC opened an issue to fix this for VERSION, can we add this to the same issue to go back and fix this as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixing it here, will fix it in the rest with @ValbuenaVC
bashirpartovi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome job @jbolor21 , just a few minor nits, feel free to ignore
| # transcript_scores = await self.text_scorer.score_prompts_batch_async( | ||
| # messages=[text_message], | ||
| # objectives=[objective] if objective else None, | ||
| # batch_size=1, | ||
| # ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we remove this?
|
|
||
| # Optimize for Azure Speech recognition: | ||
| # Azure Speech works best with 16kHz mono audio (same as Azure TTS output) | ||
| if audio.frame_rate != AudioTranscriptHelper._DEFAULT_SAMPLE_RATE: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
You can just do self._DEFAULT_SAMPLE_RATE , same with all the constants you are using.
Description
This PR is adding scorers to evaluate audio files by using AzureTTS to transcribe an audio file. I have also added this scorer to our video files so we can now score the audio in generated videos as well. This works by extracting the audio from the generated video and then using this new base audio scorer.
Tests and Documentation
Added examples in our notebooks for a demo and added unit tests