Research log · Post-training T2I via EGGROLL

ES-EGGROLL for Text-to-Image post-training

This project adapts the EGGROLL idea (low-rank, hyperscale evolution strategies) into an RL-like post-training stage for text-to-image: optimize LoRA adapters on top of a frozen generator using black-box rewards, without diffusion backprop or policy-gradient plumbing.

LoRA-only updates · single-step Sana Sprint · PickScore / CLIP diagnostics
On this page: start with results · then scroll for the logs behind them
Log 2 · Benchmark results

PartiPrompts benchmark (overall)

Start here. These are the headline numbers from PartiPrompts (one image per prompt, shared seeds across models). Below the table you’ll find the story: why ES, what EGGROLL changes, and how the run was set up.

Eval: overall means · same seeds across models · one image per prompt

| Model | Aesthetic ↑ | CLIP text similarity ↑ | No artifacts ↑ | PickScore ↑ |
| --- | --- | --- | --- | --- |
| One-step (same backbone, different post-training) | | | | |
| SanaOneStep_Base | 0.5978 (baseline) | 0.6592 (baseline) | 0.3859 (baseline) | 22.3220 (baseline) |
| SanaOneStep_eggroll | 0.5975 (-0.0003) | 0.6611 (+0.0019) | 0.3899 (+0.0040) | 22.5013 (+0.1793) |
| Two-step (stronger sampler / more compute) | | | | |
| SanaTwoStep_Base | 0.5965 | 0.6614 | 0.3926 | 22.8059 |

Notes: deltas in parentheses are relative to SanaOneStep_Base (positive = improvement on these higher-is-better metrics) and are shown only for the one-step rows. SanaOneStep_eggroll is the ES-trained LoRA on the one-step Sana backbone. SanaTwoStep_Base is a two-step Sana pipeline baseline (stronger sampler, more compute), shown without deltas to avoid confusion.
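For concreteness, here is a minimal sketch of the eval protocol behind these numbers as I understand it from the setup above (one image per prompt, the same seed list reused for every model). The `generate` callable and the scorer functions are placeholders, not the actual benchmark code.

```python
from statistics import mean

def evaluate_model(generate, scorers, prompts, seeds):
    """PartiPrompts-style eval: one image per prompt, with the seed list shared
    across models so every model sees identical sampling noise.
    `generate(prompt, seed)` and each `scorer(image, prompt)` are placeholders for
    the real pipeline and metric models (PickScore, CLIP similarity, etc.)."""
    per_metric = {name: [] for name in scorers}
    for prompt, seed in zip(prompts, seeds):
        image = generate(prompt, seed=seed)
        for name, score_fn in scorers.items():
            per_metric[name].append(score_fn(image, prompt))
    # overall means: one row of the table per model
    return {name: mean(values) for name, values in per_metric.items()}
```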

Log 1.2 · Qualitative examples

Base vs ES-LoRA (drag the slider)

Fast intuition first: swipe through examples, then jump to the logs below for the exact objective, stability knobs, and benchmark table. Left is the frozen base model, right reveals the ES-trained LoRA.

[Interactive before/after slider, one pair per prompt: base model output on the left, EGGROLL-LoRA output on the right.]
Log 0 · The real goal

RL-style post-training for T2I… without the usual RL pain

Post-training in text-to-image usually means: start from a strong pretrained generator, then push it toward a reward (human preference, PickScore, aesthetics, safety, “no artifacts”, downstream task success). In principle, you can do this with PPO / GRPO-like updates. In practice, T2I makes the loop fragile and expensive.

Why RL post-training is tricky in text-to-image

  • Backprop is expensive: diffusion-style models are heavy; RL updates often require many samples + big graphs.
  • Rewards are black-box: preference models, heuristics, and humans are not differentiable signals.
  • Reward hacking is real: optimizing a learned scorer can produce “high score / low quality” shortcuts.
  • High-variance credit assignment: prompt sensitivity + sampling noise makes stable learning harder.

Hypothesis: in this regime, evolution strategies can serve as a robust optimizer wrapper that treats the generator as a black box, evaluates a population in parallel, and updates the parameters toward the reward while keeping the base model frozen.
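For orientation, the update rule behind that hypothesis is the standard ES estimator (textbook form, not the paper's exact notation): perturb the parameters with Gaussian noise, score each candidate with the black-box reward, and move toward the reward-weighted average of the perturbations.

```latex
% Vanilla ES update (standard form; EGGROLL changes how \epsilon_i is represented, not this rule)
\theta_{t+1} = \theta_t + \frac{\alpha}{N\sigma} \sum_{i=1}^{N} R\!\left(\theta_t + \sigma\,\epsilon_i\right)\epsilon_i,
\qquad \epsilon_i \sim \mathcal{N}(0, I)
```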

Log 0.1 · Paper in 90 seconds

What EGGROLL adds to Evolution Strategies

The paper’s core argument: naïve ES doesn’t scale to billion-parameter models because full-rank perturbations are too expensive to store and apply. EGGROLL makes ES practical at scale by using low-rank perturbations (per-layer), enabling large populations at near inference throughput.

The idea

  • Instead of sampling a full perturbation E ∈ R^{m×n}, sample A ∈ R^{m×r} and B ∈ R^{n×r}, and use ABᵀ.
  • Memory per perturbation drops from mn to r(m+n), and applying it to an activation costs O(r(m+n)) instead of O(mn).
  • Even if each perturbation is low-rank, the averaged update across many samples can still represent rich directions.
Low-rank ES per layer · Near inference throughput · Large populations
Why this matters for T2I: if optimization runs close to inference speed, then “post-training” becomes something you can iterate on quickly—even when rewards are non-differentiable or easy to game.
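To make the bullets concrete, here is a minimal NumPy sketch of the low-rank trick; the noise scaling and the final update rule are my reading of the idea, not the authors' exact recipe.

```python
import numpy as np

def sample_lowrank_noise(m, n, r, rng):
    """Store A (m x r) and B (n x r) instead of a full m x n perturbation.
    The implied perturbation is A @ B.T, but it never has to be materialized:
    (A @ B.T) @ x == A @ (B.T @ x), costing O(r(m+n)) per matvec instead of O(mn)."""
    A = rng.standard_normal((m, r))
    B = rng.standard_normal((n, r)) / np.sqrt(r)  # illustrative scaling choice
    return A, B

def perturbed_forward(W, A, B, sigma, x):
    """y = (W + sigma * A B^T) x, computed without forming the m x n perturbation."""
    return W @ x + sigma * (A @ (B.T @ x))

def lowrank_es_update(W, noise_pairs, rewards, lr, sigma):
    """Reward-weighted average of the low-rank perturbations.
    Each candidate's perturbation has rank <= r, but the sum over N candidates
    can reach rank N*r, which is why the aggregate update stays expressive."""
    R = np.asarray(rewards, dtype=float)
    R = (R - R.mean()) / (R.std() + 1e-8)  # center/normalize rewards
    update = np.zeros_like(W)
    for (A, B), Ri in zip(noise_pairs, R):
        update += Ri * (A @ B.T)  # materialized here only for readability
    return W + (lr / (len(noise_pairs) * sigma)) * update
```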
Log 0.2 · Mapping EGGROLL to T2I

Why ES is interesting for T2I “alignment”

Treat each image generation as a rollout: sample noise → generate image → score it with a reward model. RL tries to push gradients through parts of that pipeline. ES doesn’t need gradients at all.

Why this fit is compelling

  • Black-box rewards by default: PickScore, CLIP heuristics, human votes, filters—no differentiation required.
  • Embarrassingly parallel: evaluate many perturbed candidates in parallel, aggregate rewards, update.
  • Stability knobs: antithetic sampling, shared seeds, norm caps—keep “reward chasing” from blowing up the generator.
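Putting those three bullets together, a single ES step looks roughly like the sketch below. `generate` (the frozen generator plus the candidate's parameter offset) and `reward_fn` (PickScore or any other scorer) are placeholder callables, and the full-rank noise here stands in for the low-rank version above.

```python
import numpy as np

def es_step(theta, generate, reward_fn, prompts, n_pairs, sigma, lr, seed):
    """One black-box ES step: sample antithetic perturbations, render the same
    prompts with the same seed for every candidate, score, and aggregate.
    `generate(params, prompt, seed)` -> image and `reward_fn(image, prompt)` -> float
    are stand-ins for the generator and the (non-differentiable) reward model."""
    rng = np.random.default_rng(seed)
    noises, rewards = [], []
    for _ in range(n_pairs):
        eps = rng.standard_normal(theta.shape)
        for sign in (+1.0, -1.0):  # antithetic pair: +eps and -eps share the same noise
            candidate = theta + sign * sigma * eps
            # shared seeds: reward differences come from parameters, not the sampler
            r = np.mean([reward_fn(generate(candidate, p, seed=seed), p) for p in prompts])
            noises.append(sign * eps)
            rewards.append(r)
    R = np.asarray(rewards)
    R = (R - R.mean()) / (R.std() + 1e-8)
    grad_est = sum(r * e for r, e in zip(R, noises)) / (len(noises) * sigma)
    return theta + lr * grad_est
```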
Log 1 · Sana Sprint 4-prompt experiment

One-step Sana Sprint + EGGROLL on LoRA

First real test: optimize a LoRA adapter (and keep the backbone frozen) on top of Sana Sprint in a one-step setup. Training uses a population of 128 candidates and PickScore v1 as the target reward, for 350 training steps.

Prompt setup (early prototype)

  • I started with a very simple setup: the same 4 prompts are used for every ES step and for every individual in the population.
  • This is intentionally constrained: it’s a sanity-check phase to validate that the ES loop is stable and actually moves reward (the run configuration is sketched below).
Target reward: PickScore v1 · Population: 128 · One-step generation
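For reference, here is the run setup above written out as a config sketch; the field names are mine and the prompt strings are placeholders, only the values mirror what this log reports.

```python
from dataclasses import dataclass

@dataclass
class EsLoraRunConfig:
    """Prototype run described in Log 1 (hypothetical field names, reported values)."""
    reward: str = "PickScore v1"      # the optimized objective
    population: int = 128             # candidates per ES step
    es_steps: int = 350               # number of ES training steps
    num_inference_steps: int = 1      # one-step Sana Sprint generation
    train_lora_only: bool = True      # backbone stays frozen; only LoRA is perturbed
    prompts: tuple = (                # same 4 prompts for every step and every individual
        "prompt_1", "prompt_2", "prompt_3", "prompt_4",  # placeholders
    )
```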
Log 1.1 · Sana Sprint training logs

Training signals and stability

During training I log multiple signals for visibility: PickScore (the objective), plus CLIP text alignment, aesthetic, and no-artifacts as diagnostics. After a short hyperparameter search, runs were generally stable and smooth—but I did encounter regimes where exploration caused the process to explode.

What is actually optimized?

The optimizer step is computed from a single scalar, mean_reward = mean(PickScore_v1); the other metrics are logged only for monitoring and debugging.
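In code terms (names assumed, not taken from the actual training script), the split between the objective and the diagnostics is just:

```python
import numpy as np

def aggregate(scores: dict) -> tuple:
    """`scores` maps metric name -> per-image scores for one candidate.
    Only PickScore v1 feeds the ES update; the rest is logged for monitoring."""
    mean_reward = float(np.mean(scores["pickscore_v1"]))                    # optimized
    diagnostics = {
        "clip_text_alignment_mean": float(np.mean(scores["clip_text"])),    # logged only
        "aesthetic_mean": float(np.mean(scores["aesthetic"])),              # logged only
        "no_artifacts_mean": float(np.mean(scores["no_artifacts"])),        # logged only
    }
    return mean_reward, diagnostics
```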

What I watch for “explosions”

Sudden reward spikes + visual collapse, LoRA norm blow-up, or a drift where PickScore rises while images become obviously worse (classic reward-hacking smell).
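Two of these checks can be automated cheaply; the sketch below (PyTorch, names and thresholds are mine, not the run's actual code) tracks the LoRA parameter norm and caps an ES update before it is applied.

```python
import torch

def lora_global_norm(lora_params) -> float:
    """Global L2 norm over all LoRA tensors -- a cheap signal for norm blow-up."""
    sq = sum(float((p.detach() ** 2).sum()) for p in lora_params)
    return sq ** 0.5

def cap_update(update: torch.Tensor, max_norm: float) -> torch.Tensor:
    """Norm cap (one of the stability knobs above): rescale the proposed ES update
    if its L2 norm exceeds `max_norm`, leaving its direction unchanged."""
    norm = update.norm()
    if norm > max_norm:
        update = update * (max_norm / norm)
    return update
```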

Graphs

The ES update optimizes mean reward (PickScore v1). The other curves are diagnostics to spot reward hacking / tradeoffs.

  • Objective: mean_reward / pickscore_mean vs ES step (optimized)
  • Diagnostic: CLIP text-alignment mean vs ES step
  • Diagnostic: aesthetic_mean vs ES step
  • Diagnostic: no_artifacts_mean vs ES step