Summary

Built an edge app that performs GAN inversion and text-conditioned editing directly on device. It implements "Bridging CLIP and StyleGAN through Latent Alignment" to connect language and vision, preserving identity while applying prompt-guided changes. The work won the Samsung Next MobileXGenAI Hackathon and was optimized for ~30fps mobile inference.
                                Deep dive
                                
The pipeline uses MobileStyleNet (a StyleGAN2 distillation) as the generator. Inversion is computed as G(f(E_I(x))), where a few mapper layers f remain trainable and E_I is an image encoder distilled from OpenCLIP to EfficientFormer-Large, following the latent-alignment setup in the paper/README.
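As a rough sketch of that inversion path (class names, feature dimensions, and the 18x512 W+ shape below are illustrative assumptions, not the project's actual code):

    import torch
    import torch.nn as nn

    class LatentMapper(nn.Module):
        """Trainable mapper f: image-encoder features -> W+ latent (assumed 18 x 512)."""
        def __init__(self, feat_dim: int = 768, n_styles: int = 18, style_dim: int = 512):
            super().__init__()
            self.n_styles, self.style_dim = n_styles, style_dim
            self.net = nn.Sequential(
                nn.Linear(feat_dim, 1024),
                nn.LeakyReLU(0.2),
                nn.Linear(1024, n_styles * style_dim),
            )

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            return self.net(feats).view(-1, self.n_styles, self.style_dim)

    @torch.no_grad()
    def invert(x: torch.Tensor, E_I: nn.Module, f: LatentMapper, G: nn.Module) -> torch.Tensor:
        """Inversion as G(f(E_I(x))): encode the image, map features to W+, decode."""
        w_plus = f(E_I(x))   # E_I: image encoder distilled from OpenCLIP (frozen)
        return G(w_plus)     # G: MobileStyleNet-style generator consuming W+ directly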
To connect text to the model, a mapper is trained to align CLIP representations with the W+ latent space: the OpenCLIP text encoder E_T produces an embedding that the mapper converts into a latent offset ΔW+. Starting from the mean latent (text-to-image) or from an inverted latent (text-guided image manipulation), the scaled ΔW+ is added to drive edits like "blonde woman with sunglasses," "man with a hat and beard," or head-pose changes.
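A minimal sketch of that text-driven edit, assuming the text-side mapper mirrors the image-side one; names such as TextToDeltaW and edit_latent are placeholders rather than the project's API:

    import torch
    import torch.nn as nn

    class TextToDeltaW(nn.Module):
        """Maps a CLIP text embedding to an edit direction ΔW+ in the generator's latent space."""
        def __init__(self, text_dim: int = 512, n_styles: int = 18, style_dim: int = 512):
            super().__init__()
            self.n_styles, self.style_dim = n_styles, style_dim
            self.net = nn.Sequential(
                nn.Linear(text_dim, 1024),
                nn.LeakyReLU(0.2),
                nn.Linear(1024, n_styles * style_dim),
            )

        def forward(self, text_emb: torch.Tensor) -> torch.Tensor:
            return self.net(text_emb).view(-1, self.n_styles, self.style_dim)

    def edit_latent(text_emb: torch.Tensor, mapper: TextToDeltaW,
                    w_start: torch.Tensor, scale: float = 0.1) -> torch.Tensor:
        """Apply a prompt-guided edit: w_start is either the mean latent (text-to-image)
        or an inverted W+ (image editing); the scaled ΔW+ is added on top.
        Here scale plays the role of the factor C discussed below."""
        delta_w = mapper(text_emb)
        return w_start + scale * delta_w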
In practice, the right scale factor C for ΔW+ depends on each (W+, text) pair, so a small projection layer is trained to predict the optimal C with a CLIP-based loss, balancing fidelity to the source face against alignment with the prompt.
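The scale prediction could look roughly like the following; the predictor architecture and loss weighting are assumptions, with one CLIP cosine term for prompt alignment and a second cosine term standing in for the fidelity objective:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ScalePredictor(nn.Module):
        """Predicts the edit strength C for a given (W+ latent, text embedding) pair."""
        def __init__(self, n_styles: int = 18, style_dim: int = 512, text_dim: int = 512):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(n_styles * style_dim + text_dim, 256),
                nn.LeakyReLU(0.2),
                nn.Linear(256, 1),
                nn.Softplus(),   # keep C positive
            )

        def forward(self, w_plus: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
            x = torch.cat([w_plus.flatten(1), text_emb], dim=1)
            return self.net(x)

    def clip_edit_loss(img_emb_edit: torch.Tensor, text_emb: torch.Tensor,
                       img_emb_src: torch.Tensor, lambda_id: float = 0.5) -> torch.Tensor:
        """CLIP-style objective: match the edited image to the prompt while staying close to the source."""
        align = 1 - F.cosine_similarity(img_emb_edit, text_emb, dim=-1).mean()
        fidelity = 1 - F.cosine_similarity(img_emb_edit, img_emb_src, dim=-1).mean()
        return align + lambda_id * fidelity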
The app demonstrates strong attribute control (e.g., glasses, hair, pose), while acknowledging that inversion identity preservation still trails SOTA methods. Export and mobile optimizations deliver smooth, on-device performance at roughly 30fps.
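The write-up does not name the export toolchain; one plausible route, assumed here rather than confirmed, is exporting the traced generator to ONNX and running it in a mobile runtime (e.g., ONNX Runtime Mobile or a TFLite conversion), with FP16/quantization helping reach the ~30fps target:

    import torch

    def export_generator(G: torch.nn.Module, out_path: str = "generator.onnx") -> None:
        """Trace the distilled generator on a fixed W+ shape and export to ONNX."""
        G.eval()
        example_w = torch.randn(1, 18, 512)      # assumed W+ shape (18 styles x 512 dims)
        torch.onnx.export(
            G, example_w, out_path,
            input_names=["w_plus"],
            output_names=["image"],
            opset_version=17,
        )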