PartiPrompts benchmark (overall)
Start here. These are the headline numbers from PartiPrompts (one image per prompt, shared seeds across models). Below the table you’ll find the story: why ES, what EGGROLL changes, and how the run was set up.
| Model | aesthetic ↑ | clip text similarity ↑ | no artifacts ↑ | pickscore ↑ |
|---|---|---|---|---|
| One-step (same backbone, different post-training) | | | | |
| SanaOneStep_Base | 0.5978 (baseline) | 0.6592 (baseline) | 0.3859 (baseline) | 22.3220 (baseline) |
| SanaOneStep_eggroll | 0.5975 (-0.0003) | 0.6611 (+0.0019) | 0.3899 (+0.0040) | 22.5013 (+0.1793) |
| Two-step (stronger sampler / more compute) | | | | |
| SanaTwoStep_Base | 0.5965 | 0.6614 | 0.3926 | 22.8059 |
Notes: for the one-step rows, deltas are shown relative to SanaOneStep_Base (all four metrics are higher-is-better, so a positive delta is an improvement). SanaOneStep_eggroll is the ES-trained LoRA on the one-step Sana backbone. SanaTwoStep_Base is a two-step Sana pipeline baseline (stronger sampler, more compute), shown without deltas to avoid confusion.
Base vs ES-LoRA (drag the slider)
Fast intuition first: swipe through the examples, then jump to the logs below for the exact objective and stability knobs (the benchmark table is above). Left is the frozen base model; right reveals the ES-trained LoRA.
RL-style post-training for T2I… without the usual RL pain
Post-training in text-to-image usually means: start from a strong pretrained generator, then push it toward a reward (human preference, PickScore, aesthetics, safety, “no artifacts”, downstream task success). In principle, you can do this with PPO / GRPO-like updates. In practice, T2I makes the loop fragile and expensive.
Why RL post-training is tricky in text-to-image
- Backprop is expensive: diffusion-style models are heavy, and RL updates often need many samples plus large backprop graphs.
- Rewards are black-box: preference models, heuristics, and humans are not differentiable signals.
- Reward hacking is real: optimizing a learned scorer can produce “high score / low quality” shortcuts.
- High-variance credit assignment: prompt sensitivity + sampling noise makes stable learning harder.
Hypothesis: in this regime, evolution strategies can be a robust optimizer wrapper: treat the generator as a black box, evaluate a population in parallel, and update parameters toward the reward while keeping the base model frozen.
What EGGROLL adds to Evolution Strategies
The paper’s core argument: naïve ES doesn’t scale to billion-parameter models because full-rank perturbations are too expensive to store and apply. EGGROLL makes ES practical at scale by using low-rank perturbations (per-layer), enabling large populations at near inference throughput.
The idea
- Instead of sampling a full perturbation E ∈ R^{m×n}, sample A ∈ R^{m×r} and B ∈ R^{n×r}, and use ABᵀ (sketched in code below).
- Memory drops from mn to r(m+n), and compute becomes O(r(m+n)) instead of O(mn).
- Even if each perturbation is low-rank, the averaged update across many samples can still represent rich directions.
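To make the memory/compute claim concrete, here is a minimal numpy sketch of the low-rank trick. It is an illustration under my own naming, not the paper's exact parameterization or code; the `sigma` scale and the toy check are assumptions.

```python
import numpy as np

def sample_lowrank_perturbation(m, n, r, rng):
    """Sample a rank-r perturbation of an m x n weight matrix.

    Instead of materializing E in R^{m x n} (m*n numbers), store only the
    factors A in R^{m x r} and B in R^{n x r}: r*(m+n) numbers.
    """
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    return A, B

def perturbed_matvec(W, A, B, x, sigma=0.01):
    """Apply (W + sigma * A @ B.T) @ x without ever forming A @ B.T.

    The extra cost is O(r*(m+n)): first B.T @ x (r*n ops), then A @ (...)
    (m*r ops), instead of the O(m*n) a dense perturbation would need.
    """
    return W @ x + sigma * (A @ (B.T @ x))

# Toy check: the factored form matches the explicit dense perturbation.
rng = np.random.default_rng(0)
m, n, r = 64, 32, 4
W, x = rng.standard_normal((m, n)), rng.standard_normal(n)
A, B = sample_lowrank_perturbation(m, n, r, rng)
assert np.allclose(perturbed_matvec(W, A, B, x),
                   (W + 0.01 * A @ B.T) @ x)
```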
Why ES is interesting for T2I “alignment”
Treat each image generation as a rollout: sample noise → generate image → score it with a reward model. RL tries to push gradients through parts of that pipeline. ES doesn’t need gradients at all.
Why this fit is compelling
- Black-box rewards by default: PickScore, CLIP heuristics, human votes, filters—no differentiation required.
- Embarrassingly parallel: evaluate many perturbed candidates in parallel, aggregate rewards, update.
- Stability knobs: antithetic sampling, shared seeds, and norm caps keep “reward chasing” from blowing up the generator (all three show up in the sketch below).
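As a sketch of how those pieces fit together, here is a generic, self-contained ES step on a flat parameter vector. It is not the EGGROLL update or my training code; `reward_fn`, the norm-cap value, and the toy reward are placeholders standing in for the generate-then-score rollout.

```python
import numpy as np

def es_step(theta, reward_fn, sigma=0.02, lr=0.01, pop=8, seed=0, max_norm=0.1):
    """One black-box ES update: no gradients through the generator or reward.

    - reward_fn(theta, seed) can wrap anything (PickScore, CLIP heuristics,
      human votes); it only has to return a scalar.
    - Antithetic sampling: each noise vector is scored at +eps and -eps,
      which reduces the variance of the estimated update direction.
    - Shared seed: all candidates are evaluated on the same generation seed,
      so reward differences come from the parameters, not sampling noise.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((pop, theta.size))
    diffs = np.array([
        reward_fn(theta + sigma * e, seed) - reward_fn(theta - sigma * e, seed)
        for e in eps
    ])
    # Normalize so one lucky rollout cannot dominate the update.
    diffs = (diffs - diffs.mean()) / (diffs.std() + 1e-8)
    update = lr * (diffs[:, None] * eps).mean(axis=0) / sigma
    # Norm cap: keep "reward chasing" from blowing up the parameters.
    norm = np.linalg.norm(update)
    if norm > max_norm:
        update *= max_norm / norm
    return theta + update

# Toy usage; a real run would generate images and score them with a reward model.
toy_reward = lambda theta, seed: -float(np.sum((theta - 1.0) ** 2))
theta = np.zeros(16)
for step in range(200):
    theta = es_step(theta, toy_reward, seed=step)
```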
One-step Sana Sprint + EGGROLL on LoRA
First real test: optimize a LoRA adapter (keeping the backbone frozen) on top of Sana Sprint in a one-step setup. Training uses a population of 128 candidates and PickScore v1 as the target reward for 350 training steps.
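For reference, that run can be summarized in a small config sketch. Field names are illustrative, not the actual training script's config keys; only the values come from the description above.

```python
from dataclasses import dataclass

@dataclass
class ESRunConfig:
    population_size: int = 128          # candidates evaluated per ES step
    training_steps: int = 350           # total ES updates
    reward_model: str = "PickScore_v1"  # the only signal the update uses
    train_lora_only: bool = True        # Sana Sprint backbone stays frozen
    generation_steps: int = 1           # one-step setup
    num_prompts: int = 4                # same 4 prompts reused every step (see "Prompt setup" below)
```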
Prompt setup (early prototype)
- I started with a very simple setup: the same 4 prompts are used for every ES step and for every individual in the population.
- This is intentionally constrained: it’s a sanity-check phase to validate that the ES loop is stable and actually moves reward.
Training signals and stability
During training I log multiple signals for visibility: PickScore (the objective), plus CLIP text alignment, aesthetic, and no-artifacts as diagnostics. After a short hyperparameter search, runs were generally stable and smooth—but I did encounter regimes where exploration caused the process to explode.
What is actually optimized?
The optimizer step is computed from mean_reward where:
mean_reward = mean(PickScore_v1).
Other metrics are logged only for monitoring and debugging.
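A tiny sketch of that scalar (the function name and signature are illustrative):

```python
import numpy as np

def candidate_fitness(pickscores):
    """The scalar the ES update consumes: mean PickScore v1 over the images a
    candidate generated this step. CLIP alignment, aesthetic, and no-artifacts
    are logged separately and never enter this value."""
    return float(np.mean(pickscores))

# e.g. one candidate scored on the 4 fixed prompts
candidate_fitness([22.1, 22.6, 22.4, 22.3])
```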
What I watch for “explosions”
Sudden reward spikes + visual collapse, LoRA norm blow-up, or a drift where PickScore rises while images become obviously worse (classic reward-hacking smell).
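Those symptoms can be turned into cheap automated checks; here is a sketch with made-up thresholds (the function and limits are illustrative, not from the actual run).

```python
import numpy as np

def explosion_checks(lora_params, reward_history, norm_limit=10.0, spike_factor=3.0):
    """Cheap post-step health checks (thresholds are illustrative).

    - LoRA norm blow-up: the global L2 norm of the adapter weights drifting far
      beyond its usual range.
    - Reward spike: the latest mean reward jumping by much more than the recent
      step-to-step changes, which often precedes visual collapse or reward hacking.
    """
    flags = {}
    total_norm = float(np.sqrt(sum(float(np.sum(p ** 2)) for p in lora_params)))
    flags["lora_norm_blowup"] = total_norm > norm_limit
    if len(reward_history) >= 3:
        recent = np.abs(np.diff(reward_history[:-1]))
        last_jump = abs(reward_history[-1] - reward_history[-2])
        flags["reward_spike"] = last_jump > spike_factor * (float(recent.mean()) + 1e-8)
    return flags

# e.g. flags = explosion_checks([lora_A, lora_B], mean_reward_log)  # hypothetical names
```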
Graphs
The ES update optimizes mean reward (PickScore v1). The other curves are diagnostics to spot reward hacking / tradeoffs.