PartiPrompts benchmark (overall)
Start here. These are the headline numbers from PartiPrompts (one image per prompt, shared seeds across models). Below the table you’ll find the story: why ES, what EGGROLL changes, and how the run was set up.
| Model | aesthetic ↑ | clip text similarity ↑ | no artifacts ↑ | pickscore ↑ |
|---|---|---|---|---|
| One-step (same backbone, different post-training) | | | | |
| SanaOneStep_Base | 0.5978 (baseline) | 0.6592 (baseline) | 0.3859 (baseline) | 22.3220 (baseline) |
| SanaOneStep_eggroll | 0.5975 (-0.0003) | 0.6611 (+0.0019) | 0.3899 (+0.0040) | 22.5013 (+0.1793) |
| Two-step (stronger sampler / more compute) | | | | |
| SanaTwoStep_Base | 0.5965 | 0.6614 | 0.3926 | 22.8059 |
Notes: for the one-step rows, deltas are shown relative to SanaOneStep_Base (all four metrics are higher-is-better, so a positive delta is an improvement). SanaOneStep_eggroll is the ES-trained LoRA on the one-step Sana backbone. SanaTwoStep_Base is a two-step Sana pipeline baseline (stronger sampler, more compute), shown without deltas to avoid confusion.
Base vs ES-LoRA (drag the slider)
Fast intuition first: swipe through the examples, then jump to the logs below for the exact objective and stability knobs (the benchmark table is above). Left is the frozen base model; right reveals the ES-trained LoRA.
RL-style post-training for T2I… without the usual RL pain
Post-training in text-to-image usually means: start from a strong pretrained generator, then push it toward a reward (human preference, PickScore, aesthetics, safety, “no artifacts”, downstream task success). In principle, you can do this with PPO / GRPO-like updates. In practice, T2I makes the loop fragile and expensive.
Why RL post-training is tricky in text-to-image
- Backprop is expensive: diffusion-style models are heavy, and RL updates often need many samples plus large backprop graphs.
- Rewards are black-box: preference models, heuristics, and humans are not differentiable signals.
- Reward hacking is real: optimizing a learned scorer can produce “high score / low quality” shortcuts.
- High-variance credit assignment: prompt sensitivity + sampling noise makes stable learning harder.
Hypothesis: in this regime, evolution strategies can be a robust optimizer wrapper: treat the generator as a black box, evaluate a population in parallel, and update parameters toward the reward while keeping the base model frozen.
What EGGROLL adds to Evolution Strategies
The paper’s core argument: naïve ES doesn’t scale to billion-parameter models because full-rank perturbations are too expensive to store and apply. EGGROLL makes ES practical at scale by using low-rank perturbations (per-layer), enabling large populations at near inference throughput.
The idea
- Instead of sampling a full perturbation E ∈ R^{m×n}, sample A ∈ R^{m×r} and B ∈ R^{n×r}, and use ABᵀ (sketched in code below).
- Memory drops from mn to r(m+n), and compute becomes O(r(m+n)) instead of O(mn).
- Even if each perturbation is low-rank, the averaged update across many samples can still represent rich directions.
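To make the memory/compute claim concrete, here is a minimal numpy sketch of the low-rank trick. It is an illustration under my own naming, not the paper's exact parameterization or code; the `sigma` scale and the toy check are assumptions.

```python
import numpy as np

def sample_lowrank_perturbation(m, n, r, rng):
    """Sample a rank-r perturbation of an m x n weight matrix.

    Instead of materializing E in R^{m x n} (m*n numbers), store only the
    factors A in R^{m x r} and B in R^{n x r}: r*(m+n) numbers.
    """
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r))
    return A, B

def perturbed_matvec(W, A, B, x, sigma=0.01):
    """Apply (W + sigma * A @ B.T) @ x without ever forming A @ B.T.

    The extra cost is O(r*(m+n)): first B.T @ x (r*n ops), then A @ (...)
    (m*r ops), instead of the O(m*n) a dense perturbation would need.
    """
    return W @ x + sigma * (A @ (B.T @ x))

# Toy check: the factored form matches the explicit dense perturbation.
rng = np.random.default_rng(0)
m, n, r = 64, 32, 4
W, x = rng.standard_normal((m, n)), rng.standard_normal(n)
A, B = sample_lowrank_perturbation(m, n, r, rng)
assert np.allclose(perturbed_matvec(W, A, B, x),
                   (W + 0.01 * A @ B.T) @ x)
```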
Why ES is interesting for T2I “alignment”
Treat each image generation as a rollout: sample noise → generate image → score it with a reward model. RL tries to push gradients through parts of that pipeline. ES doesn’t need gradients at all.
Why this fit is compelling
- Black-box rewards by default: PickScore, CLIP heuristics, human votes, filters—no differentiation required.
- Embarrassingly parallel: evaluate many perturbed candidates in parallel, aggregate rewards, update.
- Stability knobs: antithetic sampling, shared seeds, and norm caps keep “reward chasing” from blowing up the generator (all three show up in the sketch below).
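As a sketch of how those pieces fit together, here is a generic, self-contained ES step on a flat parameter vector. It is not the EGGROLL update or my training code; `reward_fn`, the norm-cap value, and the toy reward are placeholders standing in for the generate-then-score rollout.

```python
import numpy as np

def es_step(theta, reward_fn, sigma=0.02, lr=0.01, pop=8, seed=0, max_norm=0.1):
    """One black-box ES update: no gradients through the generator or reward.

    - reward_fn(theta, seed) can wrap anything (PickScore, CLIP heuristics,
      human votes); it only has to return a scalar.
    - Antithetic sampling: each noise vector is scored at +eps and -eps,
      which reduces the variance of the estimated update direction.
    - Shared seed: all candidates are evaluated on the same generation seed,
      so reward differences come from the parameters, not sampling noise.
    """
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal((pop, theta.size))
    diffs = np.array([
        reward_fn(theta + sigma * e, seed) - reward_fn(theta - sigma * e, seed)
        for e in eps
    ])
    # Normalize so one lucky rollout cannot dominate the update.
    diffs = (diffs - diffs.mean()) / (diffs.std() + 1e-8)
    update = lr * (diffs[:, None] * eps).mean(axis=0) / sigma
    # Norm cap: keep "reward chasing" from blowing up the parameters.
    norm = np.linalg.norm(update)
    if norm > max_norm:
        update *= max_norm / norm
    return theta + update

# Toy usage; a real run would generate images and score them with a reward model.
toy_reward = lambda theta, seed: -float(np.sum((theta - 1.0) ** 2))
theta = np.zeros(16)
for step in range(200):
    theta = es_step(theta, toy_reward, seed=step)
```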
One-step Sana Sprint + EGGROLL on LoRA
First real test: optimize a LoRA adapter (keeping the backbone frozen) on top of Sana Sprint in a one-step setup. Training uses a population of 128 candidates and PickScore v1 as the target reward for 350 training steps.
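For reference, that run can be summarized in a small config sketch. Field names are illustrative, not the actual training script's config keys; only the values come from the description above.

```python
from dataclasses import dataclass

@dataclass
class ESRunConfig:
    population_size: int = 128          # candidates evaluated per ES step
    training_steps: int = 350           # total ES updates
    reward_model: str = "PickScore_v1"  # the only signal the update uses
    train_lora_only: bool = True        # Sana Sprint backbone stays frozen
    generation_steps: int = 1           # one-step setup
    num_prompts: int = 4                # same 4 prompts reused every step (see "Prompt setup" below)
```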
Prompt setup (early prototype)
- I started with a very simple setup: the same 4 prompts are used for every ES step and for every individual in the population.
- This is intentionally constrained: it’s a sanity-check phase to validate that the ES loop is stable and actually moves reward.
Training signals and stability
During training I log multiple signals for visibility: PickScore (the objective), plus CLIP text alignment, aesthetic, and no-artifacts as diagnostics. After a short hyperparameter search, runs were generally stable and smooth—but I did encounter regimes where exploration caused the process to explode.
What is actually optimized?
The optimizer step is computed from mean_reward where:
mean_reward = mean(PickScore_v1).
Other metrics are logged only for monitoring and debugging.
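A tiny sketch of that scalar (the function name and signature are illustrative):

```python
import numpy as np

def candidate_fitness(pickscores):
    """The scalar the ES update consumes: mean PickScore v1 over the images a
    candidate generated this step. CLIP alignment, aesthetic, and no-artifacts
    are logged separately and never enter this value."""
    return float(np.mean(pickscores))

# e.g. one candidate scored on the 4 fixed prompts
candidate_fitness([22.1, 22.6, 22.4, 22.3])
```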
What I watch for “explosions”
Sudden reward spikes + visual collapse, LoRA norm blow-up, or a drift where PickScore rises while images become obviously worse (classic reward-hacking smell).
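Those symptoms can be turned into cheap automated checks; here is a sketch with made-up thresholds (the function and limits are illustrative, not from the actual run).

```python
import numpy as np

def explosion_checks(lora_params, reward_history, norm_limit=10.0, spike_factor=3.0):
    """Cheap post-step health checks (thresholds are illustrative).

    - LoRA norm blow-up: the global L2 norm of the adapter weights drifting far
      beyond its usual range.
    - Reward spike: the latest mean reward jumping by much more than the recent
      step-to-step changes, which often precedes visual collapse or reward hacking.
    """
    flags = {}
    total_norm = float(np.sqrt(sum(float(np.sum(p ** 2)) for p in lora_params)))
    flags["lora_norm_blowup"] = total_norm > norm_limit
    if len(reward_history) >= 3:
        recent = np.abs(np.diff(reward_history[:-1]))
        last_jump = abs(reward_history[-1] - reward_history[-2])
        flags["reward_spike"] = last_jump > spike_factor * (float(recent.mean()) + 1e-8)
    return flags

# e.g. flags = explosion_checks([lora_A, lora_B], mean_reward_log)  # hypothetical names
```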
Graphs
The ES update optimizes mean reward (PickScore v1). The other curves are diagnostics to spot reward hacking / tradeoffs.