Summary
Built an edge app that performs GAN inversion and text-conditioned editing directly
on device. It implements Bridging CLIP and StyleGAN through Latent Alignment to
connect language and vision, preserving identity while applying prompt-guided
edits. The work won the Samsung Next MobileXGenAI Hackathon and was optimized
for ~30fps mobile inference.
Deep dive
The pipeline uses MobileStyleNet (a StyleGAN2 distillation) as the generator.
For inversion, inference is G(f(E_I(x))), where a few mapper layers f remain
trainable and E_I is an image encoder distilled from OpenCLIP to
EfficientFormer-Large (per the latent-alignment setup in the paper/README).
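A minimal sketch of that inversion path in PyTorch, assuming placeholder modules for
the distilled image encoder, the trainable mapper f, and the frozen generator (the
class and argument names are illustrative, not the app's actual code):

    import torch
    import torch.nn as nn

    class InversionPipeline(nn.Module):
        """Sketch of the inversion path: x -> E_I(x) -> f -> W+ -> G(W+)."""

        def __init__(self, image_encoder: nn.Module, mapper: nn.Module, generator: nn.Module):
            super().__init__()
            self.image_encoder = image_encoder  # frozen E_I, distilled from OpenCLIP
            self.mapper = mapper                # small trainable mapper layers f
            self.generator = generator          # frozen MobileStyleNet generator G

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            with torch.no_grad():
                e = self.image_encoder(x)       # image embedding E_I(x)
            w_plus = self.mapper(e)             # project the embedding into W+ space
            return self.generator(w_plus)       # reconstruction G(f(E_I(x)))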
To connect text to the model, a mapper is trained to align CLIP representations
with the W+ latent: the OpenCLIP text encoder E_T produces an embedding that the
mapper converts into a ΔW+. Starting from the mean latent (text→image generation)
or from an inverted latent (text-guided editing of a real image), we add the scaled
ΔW+ to drive edits like “blonde woman with sunglasses,” “man with a hat and beard,”
or head-pose changes.
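A hedged sketch of that edit step; the function and argument names are illustrative,
not the app's API:

    import torch

    def apply_text_edit(w_src: torch.Tensor, text_emb: torch.Tensor,
                        text_mapper, scale: torch.Tensor) -> torch.Tensor:
        """w_src is either the mean latent (text→image) or an inverted W+ latent
        (editing a real image); text_emb is the OpenCLIP E_T embedding of the prompt."""
        delta_w = text_mapper(text_emb)   # mapper output: a ΔW+ aligned with W+ space
        return w_src + scale * delta_w    # edited latent, fed to the generator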
In practice, the right scale factor C for ΔW+ depends on each (W+, text) pair, so a
small projection layer is trained to predict the optimal C with a CLIP-based loss
that balances fidelity to the source face against alignment to the prompt.
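A rough sketch of how such a scale predictor and CLIP-based objective could look;
the layer sizes, pooling, and loss weighting are assumptions, not the trained
configuration:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ScalePredictor(nn.Module):
        """Small projection layer predicting the scale C for a (W+, text) pair."""

        def __init__(self, w_dim: int = 512, text_dim: int = 512):
            super().__init__()
            self.proj = nn.Linear(w_dim + text_dim, 1)

        def forward(self, w_plus: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
            # Pool W+ over its style layers, concatenate with the text embedding,
            # and regress a single positive scale C per sample.
            w_pooled = w_plus.mean(dim=1)                                  # (B, w_dim)
            return F.softplus(self.proj(torch.cat([w_pooled, text_emb], dim=-1)))

    def clip_guided_loss(img_edit, img_src, text_emb, clip_image_encoder,
                         lambda_fidelity: float = 0.5) -> torch.Tensor:
        """Pull the edited image toward the prompt in CLIP space while keeping it
        close to the source image (fidelity term)."""
        e_edit = F.normalize(clip_image_encoder(img_edit), dim=-1)
        e_src = F.normalize(clip_image_encoder(img_src), dim=-1)
        e_text = F.normalize(text_emb, dim=-1)
        align = 1 - (e_edit * e_text).sum(-1).mean()      # alignment to the prompt
        fidelity = 1 - (e_edit * e_src).sum(-1).mean()    # fidelity to the source face
        return align + lambda_fidelity * fidelity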
The app demonstrates strong attribute control (e.g., glasses, hair, pose), while
acknowledging that inversion identity preservation trails SOTA methods. Export and
mobile optimizations deliver smooth, on-device performance (~30fps).
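The section does not name the export toolchain; as one hedged illustration, a
TorchScript trace of the generator is a common first step before handing the model
to a mobile runtime (the function, input shape, and file name below are assumptions):

    import torch

    def export_generator(generator, path: str = "generator.pt",
                         num_layers: int = 18, w_dim: int = 512) -> None:
        """Trace the frozen generator on a dummy W+ latent and save the artifact."""
        generator.eval()
        example_w = torch.randn(1, num_layers, w_dim)   # dummy W+ input
        traced = torch.jit.trace(generator, example_w)  # TorchScript trace
        traced.save(path)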