AI Research Scientist · Wix

Generative AI & multimodal systems.

I'm an AI Research Scientist at Wix, working on diffusion, multimodal models, and the engineering required to take ideas from paper to production. Previously at Reality Defender, NLPearl, and Pashoot Robotics.

About

Profile

Amit Israeli

I'm an AI Research Scientist at Wix, working across computer vision and NLP with a focus on multimodal models and text-to-image / text-to-video diffusion. My work bridges research and engineering: building data and training pipelines, running rigorous evaluations, and getting models to run on a single GPU — and, when needed, on the edge.

Experience

Timeline

2023 — Present
  1. Mar 2026 — Present

    AI Research Scientist

    Wix

    Making websites beautiful — not like this one ):

    Generative AI Multimodal Research
  2. Jan 2025 — Jan 2026

    Computer Vision Research Engineer

    Reality Defender

    Developed and optimized deep-learning computer-vision systems for deepfake and synthetic-media detection across image, video, and video+audio. Built multi-model architectures that combine specialist detectors with shared backbones for production-scale inference.

    Deepfake detection Multimodal Production CV
  3. Dec 2024 — Apr 2025

    Computer Vision Research Engineer

    LuckyLab · Freelance

    Edge-optimized segmentation and detection for production in a few-shot domain. Converted research models to production graphs (JAX, TensorRT) with quality gates using pruning and quantization.

    Edge deployment Segmentation Few-shot JAX / TensorRT
  4. Jun 2024 — Jan 2025 Tel Aviv

    Deep Learning Research Engineer

    NLPearl

    Built compact language models for multi-task outputs, audio tokenizers and LLMs for audio tasks, and real-time conversational pause detection with fine-tuned LLMs. Used LoRA and multi-stage training under tight latency budgets.

    LLMs Real-time LoRA Audio
  5. May 2023 — Jun 2024 Rehovot

    Computer Vision & Deep Learning Research Engineer

    Pashoot Robotics

    Improved zero/few-shot perception with SAM, YOLO-World, Grounding DINO, and CLIP. Built multi-object tracking and 6-DoF pose estimation using synthetic data and domain randomization, plus 3D reconstruction (NeRF, Gaussian Splatting, image-to-3D) for simulation.

    Zero / few-shot SAM / YOLO-World 6-DoF Sim / Domain rand. Robotics
Selected work

Projects

Research and engineering work, filterable by domain.

SANA-Video Tom & Jerry LoRA sample

Tom & Jerry — SANA-Video LoRA

Fine-tuning SANA-Video 2B with LoRA to generate Tom & Jerry-style animations at 224×224 — from class-only LoRA to scene-aware text conditioning, all on a single consumer GPU.

T2V LoRA VLM + vLLM Diffusion video
Read more

Versioned training roadmap

  • V1 — Class-only LoRA @ 224×224. Freeze SANA-Video 2B and train LoRA on all linear layers using a single class prompt over ~16k cached 5s clips (81-frame latents) with flow-matching.
  • V2 — Scene-aware LoRA with VLM labels. Use Qwen3-VL via vLLM to generate structured scene descriptions (ENVIRONMENT / CHARACTERS / PROPS / ACTION / CAMERA), then condition SANA-Video on per-clip text prompts.
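The V1 setup above (frozen backbone, LoRA, flow-matching over cached latents) can be sketched as follows. This is a minimal illustration assuming the common rectified-flow parameterization, where the model is trained to predict the velocity `noise - x0`; the actual SANA-Video training code and latent shapes may differ.

```python
import numpy as np

def flow_matching_pair(x0, noise, t):
    """Rectified-flow interpolation as used in many flow-matching trainers.

    x0:    clean video latents, e.g. shape (frames, channels, h, w)
    noise: Gaussian sample of the same shape
    t:     scalar timestep in [0, 1]
    Returns the noisy input x_t and the velocity target (noise - x0)
    that the LoRA-adapted transformer is trained to predict.
    """
    x_t = (1.0 - t) * x0 + t * noise
    v_target = noise - x0
    return x_t, v_target

# toy example: 81-frame latents, as in the cached 5 s clips
rng = np.random.default_rng(0)
x0 = rng.normal(size=(81, 4, 8, 8))
noise = rng.normal(size=(81, 4, 8, 8))
x_t, v = flow_matching_pair(x0, noise, t=0.25)
```

In practice the loss is the MSE between the transformer's prediction on `x_t` and `v_target`, with gradients flowing only into the LoRA parameters on the linear layers.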

V1 checkpoint progression

Same seed and class prompt; only the LoRA checkpoint changes.

Tom & Jerry checkpoint preview

Base model: off-style 224×224 sample before LoRA fine-tuning.

Future directions

Next: distillation, pruning, and quantization for deployment, plus more systematic evaluation (identity consistency, motion quality) and RL-style post-training to better align generations with scene descriptions.

SigLIP–zoom method overview

Sana Simplified — Image & Zoom Control

Research playground around Sana 1.5 and Sana Sprint: ControlNet implementation and a SigLIP-driven zoom controller that smoothly changes camera distance while keeping identity and style consistent.

T2I Diffusion ControlNet 3D rendering
Read more

Method overview

A zero-zoom reference image is encoded with SigLIP into an object token, while a scalar zoom value z ∈ [0, 1] is mapped through a small MLP to a zoom token. Both tokens are appended to the cached text encoding, and the Sana 1.5 transformer is fine-tuned with LoRA.
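The token plumbing described above can be sketched in a few lines. This is a toy numpy version with made-up dimensions (`d`, `hidden`) and random stand-ins for the SigLIP object token and the cached text encoding; the real model uses Sana 1.5's hidden size and a trained MLP.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 32        # token width (hypothetical; the real model uses Sana's hidden size)
hidden = 64

# tiny MLP mapping a scalar zoom z in [0, 1] to one "zoom token"
W1 = rng.normal(scale=0.1, size=(1, hidden))
W2 = rng.normal(scale=0.1, size=(hidden, d))

def zoom_token(z):
    h = np.maximum(0.0, np.array([[z]]) @ W1)  # ReLU hidden layer
    return h @ W2                              # (1, d)

# stand-ins for the cached text encoding and the SigLIP object token
text_tokens = rng.normal(size=(77, d))
object_token = rng.normal(size=(1, d))

def conditioning(z):
    # append object and zoom tokens to the text token sequence
    return np.concatenate([text_tokens, object_token, zoom_token(z)], axis=0)

cond = conditioning(0.5)  # (79, d): 77 text + 1 object + 1 zoom token
```

The transformer then attends over this extended sequence, so the LoRA fine-tune learns to read camera distance from the last token.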

Reference & zoom sweep

Kokoro coefficient trajectory

Kokoro — Tiny TTS, Big Voices

Small-footprint TTS with Kokoro-82M: fast mixture-of-voices coefficients, full-embedding optimization, and visual tooling (trajectory visualizer + export).

TTS Embedding opt Visualizer
Read more

Highlights

  • Mixture-of-voices coefficient optimizer with temperature-annealed softmax.
  • Optimizing voice embedding and LoRA in few-shot scenarios.
  • GUI visualizer: circular mixer + per-voice bars + MP4 export.
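The temperature-annealed mixture in the first highlight can be sketched like this. A minimal numpy version under the assumption that each Kokoro voice is a fixed embedding vector and the optimizer learns logits over the bank; embedding sizes and the annealing schedule here are illustrative.

```python
import numpy as np

def voice_mixture(logits, voices, temperature):
    """Softmax-weighted mix of voice embeddings.

    logits:      learnable scores, shape (n_voices,)
    voices:      bank of voice embeddings, shape (n_voices, dim)
    temperature: annealed toward 0 during optimization so the mix
                 sharpens into a (near) one-hot voice selection.
    """
    z = logits / temperature
    z = z - z.max()                     # numerical stability
    w = np.exp(z) / np.exp(z).sum()
    return w, w @ voices

rng = np.random.default_rng(0)
voices = rng.normal(size=(10, 256))     # e.g. a bank of 10 voice packs
logits = np.array([0.2, 1.5, 0.1, 0.0, 0.3, 0.9, 0.0, 0.0, 0.4, 0.1])

w_hot, _ = voice_mixture(logits, voices, temperature=2.0)    # early: soft blend
w_cold, _ = voice_mixture(logits, voices, temperature=0.05)  # late: near one-hot
```

Annealing lets the optimizer explore blends early and commit to a dominant voice (or a sparse mix) as training converges.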
Token Budget example

Token-Budget-Aware Reasoning for VLMs

Builds on Token-Budget-Aware LLM Reasoning, extending it to multimodal setups by combining a frozen SigLIP image encoder with a LoRA-tuned LLM that predicts a reasoning budget before decoding.

CoT Multimodal Paper impl.
Read more

Summary

The original paper asks whether we can control chain-of-thought with an adaptive token budget. This variant predicts the budget for multimodal inputs and constrains decoding accordingly.

Deep dive

  • Architecture: frozen SigLIP + budget head fused with LLM states.
  • Training: LoRA fine-tuning with oracle CoT lengths; KL regularization.
  • Evaluation: accuracy vs. average tokens; ablations on freezing / head depth.
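The two stages above can be sketched end to end. This is a toy version: the budget head is reduced to a single sigmoid-squashed linear layer on pooled features, and the decoder is a stub; the names, ranges, and the greedy loop are illustrative, not the project's actual implementation.

```python
import numpy as np

def predict_budget(pooled_features, w, b, min_budget=16, max_budget=256):
    """Hypothetical budget head: a linear layer on fused image/text
    features, squashed into a token budget for the reasoning segment."""
    raw = 1.0 / (1.0 + np.exp(-(pooled_features @ w + b)))  # sigmoid in (0, 1)
    return int(min_budget + raw * (max_budget - min_budget))

def decode_with_budget(step_fn, budget, eos_token=0):
    """Greedy decode that hard-stops the chain of thought at `budget` tokens."""
    out = []
    for _ in range(budget):
        tok = step_fn(out)       # next-token stub standing in for the LLM
        if tok == eos_token:
            break
        out.append(tok)
    return out

rng = np.random.default_rng(0)
feats = rng.normal(size=(64,))                 # pooled SigLIP + LLM features
w = rng.normal(scale=0.1, size=(64,))
budget = predict_budget(feats, w, b=0.0)

# toy decoder that never emits EOS, so the budget is the binding constraint
tokens = decode_with_budget(lambda prefix: 7, budget)
```

The evaluation axis in the bullets above falls out of this directly: sweeping the predicted budget trades average token count against answer accuracy.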
Token budget predictor diagram
VAR Explained

PopYou2 — VAR Text-to-Image

Adapted Visual AutoRegressive (VAR) for Funko Pop! generation with a custom "doll" embedding and a SigLIP→text adapter, enabling I→I and T→I paths.

VAR T2I Autoregressive
Try the builder

Live preview

Pick a name, character style, and action — the figure updates instantly.

A Funko Pop figure of [name] styled as a [character style] performing [action].
Funko preview
SAM LoRA figure

Few-Shot SAM + LoRA

Adapted SAM and efficient variants (EdgeSAM, TinySAM) with LoRA for class-aware few-shot segmentation, with improvements over PerSAM and deployment to the edge.

Few-shot LoRA Pruning Quantization Edge
Read more

Deep dive

LoRA-adapted SAM variants with compression for edge deployment, with experimentation in extreme few-shot scenarios.

COCO Cityscapes Soccer SAM variants
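The adapter pattern behind this project is the standard LoRA update: keep the pretrained weight frozen and add a trainable low-rank correction. A minimal numpy sketch (hypothetical shapes and hyperparameters; in the real project the adapters sit on SAM's attention projections):

```python
import numpy as np

class LoRALinear:
    """Frozen linear layer W plus a trainable low-rank update B @ A."""

    def __init__(self, W, rank=4, alpha=8, rng=None):
        if rng is None:
            rng = np.random.default_rng(0)
        d_out, d_in = W.shape
        self.W = W                                  # frozen pretrained weight
        self.A = rng.normal(scale=0.01, size=(rank, d_in))
        self.B = np.zeros((d_out, rank))            # zero-init: no-op at start
        self.scale = alpha / rank

    def __call__(self, x):
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

rng = np.random.default_rng(1)
W = rng.normal(size=(8, 16))
layer = LoRALinear(W)
x = rng.normal(size=(2, 16))

y0 = layer(x)                        # before training: identical to frozen W
layer.B = rng.normal(size=(8, 4))    # pretend a training step updated B
y1 = layer(x)                        # now includes the low-rank adaptation
```

Because only `A` and `B` train, the per-class adapter is tiny, which is what makes few-shot specialization and edge-friendly compression (pruning, quantization of the merged weight) practical.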

CelebrityLook — Mobile Face Transform

On-device face stylization with a distilled StyleGAN2 and CLIP-to-StyleGAN latent alignment; ~30 fps on modern phones.

Edge GAN CoreML Latent alignment
FastGAN sample

PopYou — FastGAN + CLIP

Multi-stage pipeline with FastGAN and CLIP inversion for promptable Funko-style synthesis.

GAN CLIP inversion Synthetic data
MusicGen visualization

MusicGen — Genre LoRA

Adapts MusicGen with LoRA for genre-specific generation while keeping prompt controllability.

LoRA Music generation

KoalaReadingAI — Papers as Podcast

Pipeline that turns AI research papers into short audio episodes published to Spotify and YouTube.

TTS Pipeline
Reading list

Recent papers

Auto-synced from my Zotero library — papers I'm currently working through.

Updated continuously
Contact

Get in touch

Open to research collaborations, side projects, and interesting problems.