AI Research Scientist · Wix

Generative AI & multimodal systems.

I'm an AI Research Scientist at Wix, working on diffusion, multimodal models, and the engineering required to take ideas from paper to production. Previously at Reality Defender, NLPearl, and Pashoot Robotics.

About

Profile

Amit Israeli

I'm an AI Research Scientist at Wix, working across computer vision and NLP with a focus on multimodal models and text-to-image / text-to-video diffusion. My work bridges research and engineering: building data and training pipelines, running rigorous evaluations, and getting models to run on a single GPU — and, when needed, on the edge.

Experience

Timeline

2023 — Present
  1. Mar 2026 — Present

    AI Research Scientist

    Wix

    Making websites beautiful — not like this one ):

    Generative AI Multimodal Research
  2. Jan 2025 — Jan 2026

    Computer Vision Research Engineer

    Reality Defender

    Developed and optimized deep-learning computer-vision systems for deepfake and synthetic-media detection across image, video, and video+audio. Built multi-model architectures that combine specialist detectors with shared backbones for production-scale inference.

    Deepfake detection Multimodal Production CV
  3. Dec 2024 — Apr 2025

    Computer Vision Research Engineer

    LuckyLab · Freelance

    Edge-optimized segmentation and detection for production in a few-shot domain. Converted research models to production graphs (JAX, TensorRT) with quality gates using pruning and quantization.

    Edge deployment Segmentation Few-shot JAX / TensorRT
  4. Jun 2024 — Jan 2025 Tel Aviv

    Deep Learning Research Engineer

    NLPearl

    Built compact language models for multi-task outputs, audio tokenizers and LLMs for audio tasks, and real-time conversational pause detection with fine-tuned LLMs. Used LoRA and multi-stage training under tight latency budgets.

    LLMs Real-time LoRA Audio
  5. May 2023 — Jun 2024 Rehovot

    Computer Vision & Deep Learning Research Engineer

    Pashoot Robotics

    Improved zero/few-shot perception with SAM, YOLO-World, Grounding DINO, and CLIP. Built multi-object tracking and 6-DoF pose estimation using synthetic data and domain randomization, plus 3D reconstruction (NeRF, Gaussian Splatting, image-to-3D) for simulation.

    Zero / few-shot SAM / YOLO-World 6-DoF Sim / Domain rand. Robotics
Selected work

Projects

Research and engineering work, filterable by domain.

SANA-Video Tom & Jerry LoRA sample

Tom & Jerry — SANA-Video LoRA

Fine-tuning SANA-Video 2B with LoRA to generate Tom & Jerry-style animations at 224×224 — from class-only LoRA to scene-aware text conditioning, all on a single consumer GPU.

T2V LoRA VLM + vLLM Diffusion video
Read more

Versioned training roadmap

  • V1 — Class-only LoRA @ 224×224. Freeze SANA-Video 2B and train LoRA on all linear layers using a single class prompt over ~16k cached 5s clips (81-frame latents) with flow-matching.
  • V2 — Scene-aware LoRA with VLM labels. Use Qwen3-VL via vLLM to generate structured scene descriptions (ENVIRONMENT / CHARACTERS / PROPS / ACTION / CAMERA), then condition SANA-Video on per-clip text prompts.
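The V1 setup above (frozen backbone, LoRA, flow-matching over cached latents) can be sketched as follows. This is a minimal illustration assuming the common rectified-flow parameterization, where the model is trained to predict the velocity `noise - x0`; the actual SANA-Video training code and latent shapes may differ.

```python
import numpy as np

def flow_matching_pair(x0, noise, t):
    """Rectified-flow interpolation as used in many flow-matching trainers.

    x0:    clean video latents, e.g. shape (frames, channels, h, w)
    noise: Gaussian sample of the same shape
    t:     scalar timestep in [0, 1]
    Returns the noisy input x_t and the velocity target (noise - x0)
    that the LoRA-adapted transformer is trained to predict.
    """
    x_t = (1.0 - t) * x0 + t * noise
    v_target = noise - x0
    return x_t, v_target

# toy example: 81-frame latents, as in the cached 5 s clips
rng = np.random.default_rng(0)
x0 = rng.normal(size=(81, 4, 8, 8))
noise = rng.normal(size=(81, 4, 8, 8))
x_t, v = flow_matching_pair(x0, noise, t=0.25)
```

In practice the loss is the MSE between the transformer's prediction on `x_t` and `v_target`, with gradients flowing only into the LoRA parameters on the linear layers.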

V1 checkpoint progression

Same seed and class prompt; only the LoRA checkpoint changes.

Tom & Jerry checkpoint preview

Base model: off-style 224×224 sample before LoRA fine-tuning.

Future directions

Next: distillation, pruning, and quantization for deployment, plus more systematic evaluation (identity consistency, motion quality) and RL-style post-training to better align generations with scene descriptions.

SigLIP–zoom method overview

Sana Simplified — Image & Zoom Control

Research playground around Sana 1.5 and Sana Sprint: ControlNet implementation and a SigLIP-driven zoom controller that smoothly changes camera distance while keeping identity and style consistent.

T2I Diffusion ControlNet 3D rendering
Read more

Method overview

A zero-zoom reference image is encoded with SigLIP into an object token, while a scalar zoom value z ∈ [0, 1] is mapped through a small MLP to a zoom token. Both tokens are appended to the cached text encoding, and the Sana 1.5 transformer is fine-tuned with LoRA.
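The token plumbing described above can be sketched in a few lines. This is a toy numpy version with made-up dimensions (`d`, `hidden`) and random stand-ins for the SigLIP object token and the cached text encoding; the real model uses Sana 1.5's hidden size and a trained MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32        # token width (hypothetical; the real model uses Sana's hidden size)
hidden = 64

# tiny MLP mapping a scalar zoom z in [0, 1] to one "zoom token"
W1 = rng.normal(scale=0.1, size=(1, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, d))

def zoom_token(z):
    h = np.maximum(0.0, np.array([[z]]) @ W1)  # ReLU hidden layer
    return h @ W2                              # (1, d)

# stand-ins for the cached text encoding and the SigLIP object token
text_tokens = rng.normal(size=(77, d))
object_token = rng.normal(size=(1, d))

def conditioning(z):
    # append object and zoom tokens to the text token sequence
    return np.concatenate([text_tokens, object_token, zoom_token(z)], axis=0)

cond = conditioning(0.5)  # (79, d): 77 text + 1 object + 1 zoom token
```

The transformer then attends over this extended sequence, so the LoRA fine-tune learns to read camera distance from the last token.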

Reference & zoom sweep

Kokoro coefficient trajectory

Kokoro — Tiny TTS, Big Voices

Small-footprint TTS with Kokoro-82M: fast mixture-of-voices coefficients, full-embedding optimization, and visual tooling (trajectory visualizer + export).

TTS Embedding opt Visualizer
Read more

Highlights

  • Mixture-of-voices coefficient optimizer with temperature-annealed softmax.
  • Optimizing voice embedding and LoRA in few-shot scenarios.
  • GUI visualizer: circular mixer + per-voice bars + MP4 export.
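The temperature-annealed mixture in the first highlight can be sketched like this. A minimal numpy version under the assumption that each Kokoro voice is a fixed embedding vector and the optimizer learns logits over the bank; embedding sizes and the annealing schedule here are illustrative.

```python
import numpy as np

def voice_mixture(logits, voices, temperature):
    """Softmax-weighted mix of voice embeddings.

    logits:      learnable scores, shape (n_voices,)
    voices:      bank of voice embeddings, shape (n_voices, dim)
    temperature: annealed toward 0 during optimization so the mix
                 sharpens into a (near) one-hot voice selection.
    """
    z = logits / temperature
    z = z - z.max()                     # numerical stability
    w = np.exp(z) / np.exp(z).sum()
    return w, w @ voices

rng = np.random.default_rng(0)
voices = rng.normal(size=(10, 256))     # e.g. a bank of 10 voice packs
logits = np.array([0.2, 1.5, 0.1, 0.0, 0.3, 0.9, 0.0, 0.0, 0.4, 0.1])

w_hot, _ = voice_mixture(logits, voices, temperature=2.0)    # early: soft blend
w_cold, _ = voice_mixture(logits, voices, temperature=0.05)  # late: near one-hot
```

Annealing lets the optimizer explore blends early and commit to a dominant voice (or a sparse mix) as training converges.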
Token Budget example

Token-Budget-Aware Reasoning for VLMs

Builds on Token-Budget-Aware LLM Reasoning, extending it to multimodal setups by combining a frozen SigLIP image encoder with a LoRA-tuned LLM that predicts a reasoning budget before decoding.

CoT Multimodal Paper impl.
Read more

Summary

The original paper asks whether we can control chain-of-thought with an adaptive token budget. This variant predicts the budget for multimodal inputs and constrains decoding accordingly.

Deep dive

  • Architecture: frozen SigLIP + budget head fused with LLM states.
  • Training: LoRA fine-tuning with oracle CoT lengths; KL regularization.
  • Evaluation: accuracy vs. average tokens; ablations on freezing / head depth.
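The two stages above can be sketched end to end. This is a toy version: the budget head is reduced to a single sigmoid-squashed linear layer on pooled features, and the decoder is a stub; the names, ranges, and the greedy loop are illustrative, not the project's actual implementation.

```python
import numpy as np

def predict_budget(pooled_features, w, b, min_budget=16, max_budget=256):
    """Hypothetical budget head: a linear layer on fused image/text
    features, squashed into a token budget for the reasoning segment."""
    raw = 1.0 / (1.0 + np.exp(-(pooled_features @ w + b)))  # sigmoid in (0, 1)
    return int(min_budget + raw * (max_budget - min_budget))

def decode_with_budget(step_fn, budget, eos_token=0):
    """Greedy decode that hard-stops the chain of thought at `budget` tokens."""
    out = []
    for _ in range(budget):
        tok = step_fn(out)       # next-token stub standing in for the LLM
        if tok == eos_token:
            break
        out.append(tok)
    return out

rng = np.random.default_rng(0)
feats = rng.normal(size=(64,))                 # pooled SigLIP + LLM features
w = rng.normal(scale=0.1, size=(64,))
budget = predict_budget(feats, w, b=0.0)

# toy decoder that never emits EOS, so the budget is the binding constraint
tokens = decode_with_budget(lambda prefix: 7, budget)
```

The evaluation axis in the bullets above falls out of this directly: sweeping the predicted budget trades average token count against answer accuracy.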
Token budget predictor diagram
VAR Explained

PopYou2 — VAR Text-to-Image

Adapted Visual AutoRegressive (VAR) for Funko Pop! generation with a custom "doll" embedding and a SigLIP→text adapter, enabling I→I and T→I paths.

VAR T2I Autoregressive
Try the builder

Live preview

Pick a name, character style, and action — the figure updates instantly.

A Funko Pop figure of [name] styled as a [character style] performing [action].
Funko preview
SAM LoRA figure

Few-Shot SAM + LoRA

Adapted SAM and efficient variants (EdgeSAM, TinySAM) with LoRA for class-aware few-shot segmentation, with improvements over PerSAM and deployment to the edge.

Few-shot LoRA Pruning Quantization Edge
Read more

Deep dive

LoRA-adapted SAM variants with compression for edge deployment, with experimentation in extreme few-shot scenarios.

COCO Cityscapes Soccer SAM variants
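The adapter pattern behind this project is the standard LoRA update: keep the pretrained weight frozen and add a trainable low-rank correction. A minimal numpy sketch (hypothetical shapes and hyperparameters; in the real project the adapters sit on SAM's attention projections):

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update B @ A."""

    def __init__(self, W, rank=4, alpha=8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                  # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))            # zero-init: no-op at start
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))
layer = LoRALinear(W)
x = rng.normal(size=(2, 16))

y0 = layer(x)                        # before training: identical to frozen W
layer.B = rng.normal(size=(8, 4))    # pretend a training step updated B
y1 = layer(x)                        # now includes the low-rank adaptation
```

Because only `A` and `B` train, the per-class adapter is tiny, which is what makes few-shot specialization and edge-friendly compression (pruning, quantization of the merged weight) practical.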

CelebrityLook — Mobile Face Transform

On-device face stylization with a distilled StyleGAN2 and CLIP-to-StyleGAN latent alignment; ~30 fps on modern phones.

Edge GAN CoreML Latent alignment
FastGAN sample

PopYou — FastGAN + CLIP

Multi-stage pipeline with FastGAN and CLIP inversion for promptable Funko-style synthesis.

GAN CLIP inversion Synthetic data
MusicGen visualization

MusicGen — Genre LoRA

Adapts MusicGen with LoRA for genre-specific generation while keeping prompt controllability.

LoRA Music generation

KoalaReadingAI — Papers as Podcast

Pipeline that turns AI research papers into short audio episodes published to Spotify and YouTube.

TTS Pipeline
Reading list

Recent papers

Auto-synced from my Zotero library — papers I'm currently working through.

Updated continuously
Contact

Get in touch

Open to research collaborations, side projects, and interesting problems.