
add: inference-time-scaling #1270

Open
yjy415 wants to merge 1 commit into modelscope:main from yjy415:ses

Conversation

@yjy415 (Collaborator) commented Feb 6, 2026

No description provided.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @yjy415, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the Spectral Evolution Search (SES) algorithm, a novel inference-time scaling technique designed to enhance the quality of generated images without requiring model retraining. It integrates SES into several existing image generation pipelines, allowing users to optimize initial noise through an evolutionary search guided by reward models. This feature provides a trade-off between computational cost and image quality, offering greater control over the generation process and improving the fidelity of generated outputs.

Highlights

  • New Feature: Inference-Time Scaling (SES): Introduced the Spectral Evolution Search (SES) algorithm, allowing users to enhance image generation quality during inference by optimizing initial noise without retraining models. This feature trades computational cost for improved output quality.
  • Core SES Implementation: Added a new utility module diffsynth/utils/inference_time_scaling/ses.py which includes functions for wavelet transforms (split_dwt, reconstruct_dwt), a SESRewardScorer class to integrate various reward models (PickScore, CLIP, HPSv2), and the run_ses_cem function for Cross-Entropy Method-based latent optimization.
  • Pipeline Integration: Integrated SES functionality into several key image generation pipelines, including Flux2ImagePipeline, FluxImagePipeline, QwenImagePipeline, and ZImagePipeline. This involves adding new parameters (enable_ses, ses_reward_model, ses_eval_budget, ses_inference_steps) to their __call__ methods and incorporating the SES optimization logic.
  • Documentation and Examples: Provided comprehensive documentation in both English and Chinese (docs/en/Research_Tutorial/inference_time_scaling.md, docs/zh/Research_Tutorial/inference_time_scaling.md) explaining SES principles, usage, parameters, and supported models. Additionally, new example scripts demonstrate how to utilize SES with various models.
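The Cross-Entropy Method mentioned in the highlights can be illustrated with a small self-contained sketch: sample candidate latents around a mean, score each with a reward function, refit the sampling distribution on the top-k elites, and keep the best candidate seen. This toy (NumPy, synthetic quadratic reward, all names local to the example) only mirrors the shape of `run_ses_cem`; it is not the PR's implementation.

```python
import numpy as np

def cem_search(base, score_fn, total_budget=200, popsize=20, k_elites=5, sigma=1.0, seed=0):
    """Toy Cross-Entropy Method: sample a population around the current mean,
    refit mean/std on the top-k elites, and track the best sample seen."""
    rng = np.random.default_rng(seed)
    mu = base.astype(float).copy()
    best, best_score = base, score_fn(base)
    for _ in range(total_budget // popsize):
        trials = mu + sigma * rng.standard_normal((popsize,) + mu.shape)
        scores = np.array([score_fn(t) for t in trials])
        elite_idx = np.argsort(scores)[-k_elites:]   # indices of the k best trials
        elites = trials[elite_idx]
        mu = elites.mean(axis=0)                     # refit the Gaussian on elites
        sigma = float(elites.std(axis=0).mean()) + 1e-3
        if scores[elite_idx[-1]] > best_score:
            best, best_score = trials[elite_idx[-1]], scores[elite_idx[-1]]
    return best

# Synthetic reward: higher when the "latent" is closer to a hidden target.
target = np.full(8, 1.5)
reward = lambda z: -float(np.sum((z - target) ** 2))
optimized = cem_search(np.zeros(8), reward)
```

In the actual feature, `score_fn` would be a reward model (PickScore, CLIP, HPSv2) scoring the image decoded from the trial noise, which is why each evaluation costs a short denoising run.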


Changelog
  • diffsynth/pipelines/flux2_image.py
    • Imported run_ses_cem and SESRewardScorer for SES functionality.
    • Added enable_ses, ses_reward_model, ses_eval_budget, and ses_inference_steps parameters to the __call__ method.
    • Implemented conditional SES logic within __call__ to optimize initial latents using a defined callback and reward scorer.
  • diffsynth/pipelines/flux_image.py
    • Imported run_ses_cem and SESRewardScorer for SES functionality.
    • Added enable_ses, ses_reward_model, ses_eval_budget, and ses_inference_steps parameters to the __call__ method.
    • Implemented conditional SES logic within __call__ to optimize initial latents using a defined callback and reward scorer.
  • diffsynth/pipelines/qwen_image.py
    • Imported run_ses_cem and SESRewardScorer for SES functionality.
    • Added enable_ses, ses_reward_model, ses_eval_budget, and ses_inference_steps parameters to the __call__ method.
    • Implemented conditional SES logic within __call__ to optimize initial latents using a defined callback and reward scorer.
  • diffsynth/pipelines/z_image.py
    • Imported run_ses_cem and SESRewardScorer for SES functionality.
    • Added enable_ses, ses_reward_model, ses_eval_budget, and ses_inference_steps parameters to the __call__ method.
    • Implemented conditional SES logic within __call__ to optimize initial latents using a defined callback and reward scorer.
  • diffsynth/utils/inference_time_scaling/ses.py
    • New file created to house the core Spectral Evolution Search (SES) logic.
    • Includes split_dwt and reconstruct_dwt functions for discrete wavelet transforms.
    • Defines SESRewardScorer class to load and utilize various reward models (PickScore, CLIP, HPSv2) for image evaluation.
    • Contains the run_ses_cem function, which implements the Cross-Entropy Method for optimizing latent representations based on reward scores.
  • docs/en/Research_Tutorial/inference_time_scaling.md
    • New English documentation detailing the Inference-Time Scaling (SES) feature.
    • Explains the basic principles of SES, its search space, evolutionary strategy, and reward feedback mechanism.
    • Provides quick start instructions, core parameter details, a list of supported models, and effect demonstrations.
  • docs/zh/Research_Tutorial/inference_time_scaling.md
    • New Chinese documentation, a direct translation of the English SES tutorial.
    • Covers the same content as the English version, including principles, usage, parameters, supported models, and demonstrations.
  • examples/flux/model_inference/FLUX.1-dev-SES.py
    • New example script demonstrating how to use SES with the FLUX.1-dev pipeline.
  • examples/flux2/model_inference/FLUX.2-dev-SES.py
    • New example script demonstrating how to use SES with the FLUX.2-dev pipeline.
  • examples/qwen_image/model_inference/Qwen-Image-SES.py
    • New example script demonstrating how to use SES with the Qwen-Image pipeline.
  • examples/z_image/model_inference/Z-Image-SES.py
    • New example script demonstrating how to use SES with the Z-Image pipeline.
  • examples/z_image/model_inference/Z-Image-Turbo-SES.py
    • New example script demonstrating how to use SES with the Z-Image-Turbo pipeline.
Activity
  • The user yjy415 initiated this pull request to introduce inference-time scaling capabilities.
  • A new utility module diffsynth/utils/inference_time_scaling/ses.py was added, containing the core logic for Spectral Evolution Search (SES), including wavelet transforms, reward scoring, and Cross-Entropy Method optimization.
  • The SES feature has been integrated into multiple existing image generation pipelines (Flux, Qwen, Z-Image), allowing for dynamic quality improvement during inference.
  • Comprehensive documentation in both English and Chinese, along with several example scripts, were added to guide users on how to leverage this new functionality.

@gemini-code-assist (Contributor) left a review comment

Code Review

This pull request introduces an exciting new feature: Inference-Time Scaling using Spectral Evolution Search (SES). The implementation looks solid and is accompanied by new documentation and examples, which is great. However, there is a significant amount of code duplication across the different pipeline files (flux_image.py, flux2_image.py, qwen_image.py, z_image.py). This duplicated logic for SES should be refactored into a shared helper function or a method in the BasePipeline to improve maintainability. I've also noted a few minor areas for improvement in the ses.py utility file regarding logging and exception handling, as well as some broken links and typos in the new documentation files.

Comment on lines +125 to +174
```python
if enable_ses:
    print(f"[SES] Starting optimization with budget={ses_eval_budget}, steps={ses_inference_steps}")
    scorer = SESRewardScorer(ses_reward_model, device=self.device, dtype=self.torch_dtype)
    self.load_models_to_device(list(self.in_iteration_models) + ['vae'])
    models = {name: getattr(self, name) for name in self.in_iteration_models}

    h_latent = height // 16
    w_latent = width // 16

    def ses_generate_callback(trial_latents_spatial):
        trial_inputs = inputs_shared.copy()

        self.scheduler.set_timesteps(ses_inference_steps, denoising_strength=denoising_strength, dynamic_shift_len=h_latent*w_latent)
        eval_timesteps = self.scheduler.timesteps

        curr_latents_seq = rearrange(trial_latents_spatial, "b c h w -> b (h w) c")

        for progress_id, timestep in enumerate(eval_timesteps):
            timestep = timestep.unsqueeze(0).to(dtype=self.torch_dtype, device=self.device)

            trial_inputs["latents"] = curr_latents_seq

            noise_pred = self.cfg_guided_model_fn(
                self.model_fn, cfg_scale,
                trial_inputs, inputs_posi, inputs_nega,
                **models, timestep=timestep, progress_id=progress_id
            )
            curr_latents_seq = self.step(self.scheduler, progress_id=progress_id, noise_pred=noise_pred, **trial_inputs)

        curr_latents_spatial = rearrange(curr_latents_seq, "b (h w) c -> b c h w", h=h_latent, w=w_latent)

        decoded_img = self.vae.decode(curr_latents_spatial)
        return self.vae_output_to_image(decoded_img)

    initial_noise_seq = inputs_shared["latents"]
    initial_noise_spatial = rearrange(initial_noise_seq, "b (h w) c -> b c h w", h=h_latent, w=w_latent)

    optimized_latents_spatial = run_ses_cem(
        base_latents=initial_noise_spatial,
        pipeline_callback=ses_generate_callback,
        prompt=prompt,
        scorer=scorer,
        total_eval_budget=ses_eval_budget,
        popsize=10,
        k_elites=5
    )
    optimized_latents_seq = rearrange(optimized_latents_spatial, "b c h w -> b (h w) c")
    inputs_shared["latents"] = optimized_latents_seq
    self.scheduler.set_timesteps(num_inference_steps, denoising_strength=denoising_strength, dynamic_shift_len=h_latent*w_latent)
    del scorer
    torch.cuda.empty_cache()
```
Severity: high

This block of code for Inference-Time Scaling (SES) is largely duplicated across multiple pipeline files (flux_image.py, flux2_image.py, qwen_image.py, z_image.py). This duplication makes the code harder to maintain and update. Consider refactoring this logic into a shared helper function or a method in the BasePipeline class. This would centralize the SES implementation, making it easier to manage and reducing the risk of inconsistencies between pipelines. A base method could accept pipeline-specific parameters (like scheduler settings, VAE model name, and latent shape handling) to accommodate the variations between models.
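One way the suggested refactor could look is a template-method sketch: the shared SES driver lives once in the base class, and each pipeline supplies only the pieces that differ (latent layout conversion, the denoise-and-decode callback, and the schedule reset). All names here (`_run_ses`, the hook methods, `fake_search`, `DemoPipeline`) are illustrative, not existing DiffSynth-Studio API.

```python
class BasePipeline:
    """Sketch: hoist the duplicated SES block into a shared base-class method."""

    def _run_ses(self, inputs_shared, prompt, scorer, search_fn,
                 to_spatial=lambda z: z, to_seq=lambda z: z):
        # to_spatial/to_seq absorb the rearrange() calls that only some
        # pipelines (e.g. "b (h w) c" sequence layouts) need.
        initial = to_spatial(inputs_shared["latents"])
        optimized = search_fn(base_latents=initial,
                              pipeline_callback=self._ses_generate,
                              prompt=prompt, scorer=scorer)
        inputs_shared["latents"] = to_seq(optimized)
        self._reset_schedule()  # restore the full num_inference_steps schedule

    def _ses_generate(self, trial_latents):
        raise NotImplementedError  # per-pipeline denoise loop + VAE decode

    def _reset_schedule(self):
        raise NotImplementedError  # per-pipeline scheduler.set_timesteps(...)


# Stub subclass showing the override surface:
class DemoPipeline(BasePipeline):
    def __init__(self):
        self.reset_calls = 0
    def _ses_generate(self, trial_latents):
        return trial_latents  # would return a decoded PIL image in practice
    def _reset_schedule(self):
        self.reset_calls += 1

def fake_search(base_latents, pipeline_callback, prompt, scorer):
    pipeline_callback(base_latents)       # evaluate one trial
    return [v + 1 for v in base_latents]  # pretend-optimized latents

pipe = DemoPipeline()
inputs = {"latents": [0, 0]}
pipe._run_ses(inputs, "a cat", scorer=None, search_fn=fake_search)
print(inputs["latents"])  # [1, 1]
```

With this shape, the four pipelines would each shrink to a handful of hook overrides instead of ~40 duplicated lines.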

Comment on lines +285 to +325
```python
if enable_ses:
    print(f"[SES] Starting optimization with budget={ses_eval_budget}, steps={ses_inference_steps}")
    scorer = SESRewardScorer(ses_reward_model, device=self.device, dtype=self.torch_dtype)
    self.load_models_to_device(list(self.in_iteration_models) + ['vae_decoder'])
    models = {name: getattr(self, name) for name in self.in_iteration_models}

    def ses_generate_callback(trial_latents):
        trial_inputs = inputs_shared.copy()

        self.scheduler.set_timesteps(ses_inference_steps, denoising_strength=denoising_strength, shift=sigma_shift)
        eval_timesteps = self.scheduler.timesteps
        curr_latents = trial_latents

        for progress_id, timestep in enumerate(eval_timesteps):
            timestep = timestep.unsqueeze(0).to(dtype=self.torch_dtype, device=self.device)

            trial_inputs["latents"] = curr_latents
            noise_pred = self.cfg_guided_model_fn(
                self.model_fn, cfg_scale,
                trial_inputs, inputs_posi, inputs_nega,
                **models, timestep=timestep, progress_id=progress_id
            )
            curr_latents = self.step(self.scheduler, progress_id=progress_id, noise_pred=noise_pred, **trial_inputs)

        decoded_img = self.vae_decoder(curr_latents, device=self.device, tiled=tiled, tile_size=tile_size, tile_stride=tile_stride)
        return self.vae_output_to_image(decoded_img)

    initial_noise = inputs_shared["latents"]
    optimized_latents = run_ses_cem(
        base_latents=initial_noise,
        pipeline_callback=ses_generate_callback,
        prompt=prompt,
        scorer=scorer,
        total_eval_budget=ses_eval_budget,
        popsize=10,
        k_elites=5
    )
    inputs_shared["latents"] = optimized_latents
    self.scheduler.set_timesteps(num_inference_steps, denoising_strength=denoising_strength, shift=sigma_shift)
    del scorer
    torch.cuda.empty_cache()
```
Severity: high

This block of code for Inference-Time Scaling (SES) is largely duplicated across multiple pipeline files. This duplication makes the code harder to maintain and update. Consider refactoring this logic into a shared helper function or a method in the BasePipeline class. This would centralize the SES implementation, making it easier to manage and reducing the risk of inconsistencies between pipelines.

Comment on lines +181 to +223
```python
if enable_ses:
    print(f"[SES] Starting optimization with budget={ses_eval_budget}, steps={ses_inference_steps}")
    scorer = SESRewardScorer(ses_reward_model, device=self.device, dtype=self.torch_dtype)

    self.load_models_to_device(list(self.in_iteration_models) + ['vae'])
    models = {name: getattr(self, name) for name in self.in_iteration_models}

    def ses_generate_callback(trial_latents):
        trial_inputs = inputs_shared.copy()
        self.scheduler.set_timesteps(ses_inference_steps, denoising_strength=denoising_strength, dynamic_shift_len=(height // 16) * (width // 16), exponential_shift_mu=exponential_shift_mu)
        eval_timesteps = self.scheduler.timesteps
        curr_latents = trial_latents

        for progress_id, timestep in enumerate(eval_timesteps):
            timestep = timestep.unsqueeze(0).to(dtype=self.torch_dtype, device=self.device)

            trial_inputs["latents"] = curr_latents

            noise_pred = self.cfg_guided_model_fn(
                self.model_fn, cfg_scale,
                trial_inputs, inputs_posi, inputs_nega,
                **models, timestep=timestep, progress_id=progress_id
            )
            curr_latents = self.step(self.scheduler, progress_id=progress_id, noise_pred=noise_pred, **trial_inputs)

        decoded_img = self.vae.decode(curr_latents, device=self.device, tiled=tiled, tile_size=tile_size, tile_stride=tile_stride)
        return self.vae_output_to_image(decoded_img)

    initial_noise = inputs_shared["latents"]

    optimized_latents = run_ses_cem(
        base_latents=initial_noise,
        pipeline_callback=ses_generate_callback,
        prompt=prompt,
        scorer=scorer,
        total_eval_budget=ses_eval_budget,
        popsize=10,
        k_elites=5
    )
    inputs_shared["latents"] = optimized_latents
    self.scheduler.set_timesteps(num_inference_steps, denoising_strength=denoising_strength, dynamic_shift_len=(height // 16) * (width // 16), exponential_shift_mu=exponential_shift_mu)
    del scorer
    torch.cuda.empty_cache()
```
Severity: high

This block of code for Inference-Time Scaling (SES) is largely duplicated across multiple pipeline files. This duplication makes the code harder to maintain and update. Consider refactoring this logic into a shared helper function or a method in the BasePipeline class. This would centralize the SES implementation, making it easier to manage and reducing the risk of inconsistencies between pipelines.

Comment on lines +149 to +187
```python
if enable_ses:
    print(f"[SES] Starting optimization with budget={ses_eval_budget}, steps={ses_inference_steps}")
    scorer = SESRewardScorer(ses_reward_model, device=self.device, dtype=self.torch_dtype)
    self.load_models_to_device(list(self.in_iteration_models) + ['vae_decoder'])
    models = {name: getattr(self, name) for name in self.in_iteration_models}

    def ses_generate_callback(trial_latents):
        trial_inputs = inputs_shared.copy()
        self.scheduler.set_timesteps(ses_inference_steps, denoising_strength=denoising_strength, shift=sigma_shift)
        eval_timesteps = self.scheduler.timesteps
        curr_latents = trial_latents

        for progress_id, timestep in enumerate(eval_timesteps):
            timestep = timestep.unsqueeze(0).to(dtype=self.torch_dtype, device=self.device)
            trial_inputs["latents"] = curr_latents
            noise_pred = self.cfg_guided_model_fn(
                self.model_fn, cfg_scale,
                trial_inputs, inputs_posi, inputs_nega,
                **models, timestep=timestep, progress_id=progress_id
            )
            curr_latents = self.step(self.scheduler, progress_id=progress_id, noise_pred=noise_pred, **trial_inputs)

        decoded_img = self.vae_decoder(curr_latents)
        return self.vae_output_to_image(decoded_img)

    initial_noise = inputs_shared["latents"]

    optimized_latents = run_ses_cem(
        base_latents=initial_noise,
        pipeline_callback=ses_generate_callback,
        prompt=prompt,
        scorer=scorer,
        total_eval_budget=ses_eval_budget,
        popsize=10,
        k_elites=5
    )
    inputs_shared["latents"] = optimized_latents
    self.scheduler.set_timesteps(num_inference_steps, denoising_strength=denoising_strength, shift=sigma_shift)

    del scorer
    torch.cuda.empty_cache()
```
Severity: high

This block of code for Inference-Time Scaling (SES) is largely duplicated across multiple pipeline files. This duplication makes the code harder to maintain and update. Consider refactoring this logic into a shared helper function or a method in the BasePipeline class. This would centralize the SES implementation, making it easier to manage and reducing the risk of inconsistencies between pipelines.

```python
        self._load_model()

    def _load_model(self):
        print(f"[SES] Loading Reward Model: {self.reward_name}...")
```
Severity: medium

Using print for logging is generally discouraged in library code as it offers less control over verbosity and output streams. It's recommended to use Python's standard logging module instead. This would allow consumers of your library to configure logging as needed. This applies to all print statements in this file.

To do this, add import logging and logger = logging.getLogger(__name__) at the top of the file, then replace the print statements with logger.info, logger.warning, etc.

Suggested change
```diff
-print(f"[SES] Loading Reward Model: {self.reward_name}...")
+logger.info(f"[SES] Loading Reward Model: {self.reward_name}...")
```
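A self-contained sketch of that pattern: library code defines a module-level logger and the application decides where output goes. The logger name and the in-example handler below are illustrative only; real library code would add just `logging.getLogger(__name__)` and leave handler configuration to the consumer.

```python
import io
import logging

# Module-level logger replaces print(), so callers control verbosity/destination.
logger = logging.getLogger("diffsynth.utils.inference_time_scaling.ses")

# Handler exists only to make this example self-contained and capture output.
buf = io.StringIO()
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

reward_name = "PickScore"  # placeholder; in ses.py this would be self.reward_name
# %-style lazy formatting defers string building until the record is emitted:
logger.info("[SES] Loading Reward Model: %s...", reward_name)

print(buf.getvalue().strip())
```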

Comment on lines +89 to +91
```python
except Exception as e:
    print(f"Error computing score: {e}")
    return 0.0
```
Severity: medium

Catching a broad Exception and simply printing the message can hide important errors and make debugging difficult. It's better to log the full traceback to get more context on what went wrong.

```python
        except Exception:
            import logging
            logger = logging.getLogger(__name__)
            logger.error("Error computing score", exc_info=True)
            return 0.0
```


This method essentially **trades inference computation time for generation quality**.

For more technical details on this method, please refer to the paper: **[Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation](https://arxiv.org/abs/2602.03208)**.
Severity: medium

There seems to be a typo in the arXiv link. The year 2602 should likely be 2402.

Suggested change
```diff
-For more technical details on this method, please refer to the paper: **[Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation](https://arxiv.org/abs/2602.03208)**.
+For more technical details on this method, please refer to the paper: **[Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation](https://arxiv.org/abs/2402.03208)**.
```


In DiffSynth-Studio, SES has been integrated into the pipelines of mainstream text-to-image models. You only need to set `enable_ses=True` when calling `pipe()` to enable it.

The following is [quick start code](https://www.google.com/search?q=../../../examples/z_image/model_inference/Z-Image-Turbo-SES.py) using **Z-Image-Turbo** as an example:
Severity: medium

The link to the quick start code appears to be broken. It points to a Google search query instead of the example file. It should be a relative link to the file within the repository.

Suggested change
```diff
-The following is [quick start code](https://www.google.com/search?q=../../../examples/z_image/model_inference/Z-Image-Turbo-SES.py) using **Z-Image-Turbo** as an example:
+The following is [quick start code](../../../examples/z_image/model_inference/Z-Image-Turbo-SES.py) using **Z-Image-Turbo** as an example:
```


This method essentially **trades inference computation time for generation quality**.

For more technical details on this method, please refer to the paper: **[Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation](https://arxiv.org/abs/2602.03208)**.
Severity: medium

The year 2602 in the arXiv link seems to be a typo. It should probably be 2402.

Suggested change
```diff
-For more technical details on this method, please refer to the paper: **[Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation](https://arxiv.org/abs/2602.03208)**
+For more technical details on this method, please refer to the paper: **[Spectral Evolution Search: Efficient Inference-Time Scaling for Reward-Aligned Image Generation](https://arxiv.org/abs/2402.03208)**
```
