Working at the intersection of AI research and engineering

I enjoy exploring ideas, building systems, and turning complex problems into something practical. My focus is on thoughtful research and solid engineering.

About

Profile

Amit Israeli

I’m a hands-on Generative AI Research Engineer working across computer vision and NLP, with a soft spot for multimodal models and text-to-image/text-to-video diffusion. I like taking ideas from papers to real systems: building the data + training loops, running clean evals, and getting things to run on a single GPU (and sometimes on the edge).

Timeline

Experience

Computer Vision Research Engineer @ Reality Defender

Jan 2024 – Present
  • Developing and optimizing deep learning computer vision solutions to detect deepfakes and fraudulent media across image, video, and audio-visual domains, including multimodal detection capabilities
Deepfake Detection

Computer Vision Research Engineer @ LuckyLab (Freelance)

Dec 2024 – Apr 2025
  • Edge-optimized segmentation and detection models for production in a few-shot domain.
  • Converted research models to production graphs (JAX/TensorRT) with quality gates, using pruning and quantization.
Edge Deployment · Segmentation and Detection · Few-Shot · JAX/TensorRT

Deep Learning Research Engineer @ NLPearl

Jun 2024 – Jan 2025 · Tel Aviv, Israel
  • Compact language model for multi-task outputs; audio tokenizers and LLMs for audio tasks.
  • LoRA and multi-stage training to boost performance.
  • Real-time conversational pause detection and response generation with fine-tuned LLMs.
LLMs · Real-time · LoRA · Audio

Computer Vision & Deep Learning Research Engineer @ Pashoot Robotics

May 2023 – Jun 2024 · Rehovot, Israel
  • Zero/few-shot improvements with SAM, YOLO-World, Grounding DINO, CLIP, and other SOTA models.
  • Multi-object tracking and 6-DoF pose estimation with synthetic data.
  • Blender + 3D reconstruction (NeRF, Gaussian Splatting, image-to-3D) for simulation and domain randomization.
Zero/Few-Shot · SAM/YOLO-World · 6-DoF · Sim/Domain Rand. · Robotics
Now

Current Project

ES-EGGROLL — Post-training Text-to-Image with Evolution Strategies

An EGGROLL-style implementation of low-rank Evolution Strategies from “Evolution Strategies at the Hyperscale” (Sarkar et al., 2025), adapted for text-to-image alignment post-training. The base generator stays frozen, and we optimize only a LoRA adapter using black-box reward signals (e.g., PickScore as the objective), with CLIP / aesthetic / no-artifacts used as diagnostics. This enables fast iteration on alignment objectives at near inference throughput, without diffusion backprop. For extra details, results, and examples, visit the project website.

Evolution Strategies · RL · Text-to-Image
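Roughly, each update is plain ES on the flattened LoRA parameters: perturb, generate, score with the black-box reward, and take a reward-weighted step. The sketch below is a simplification that collapses EGGROLL's low-rank perturbations into vanilla antithetic ES; generate_images and reward_fn are hypothetical stand-ins for the frozen generator and the PickScore objective.

```python
import torch

# Simplified ES update over a flat LoRA parameter vector (illustrative only).
# generate_images(params, prompts) and reward_fn(images, prompts) are hypothetical
# stand-ins for the frozen text-to-image generator and the PickScore reward.

def es_step(lora_params, prompts, generate_images, reward_fn,
            pop_size=16, sigma=0.02, lr=0.5):
    # lora_params: 1-D tensor holding all trainable LoRA weights
    half = torch.randn(pop_size // 2, lora_params.numel(), device=lora_params.device)
    eps = torch.cat([half, -half], dim=0)                       # antithetic pairs reduce variance
    rewards = []
    for e in eps:                                               # black-box evaluations, no backprop
        candidate = lora_params + sigma * e
        images = generate_images(candidate, prompts)
        rewards.append(reward_fn(images, prompts).mean())
    rewards = torch.stack(rewards)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # normalized fitness
    grad_est = (adv[:, None] * eps).mean(dim=0) / sigma         # ES gradient estimate
    return lora_params + lr * grad_est
```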
Benchmark snapshot (PartiPrompts · overall means)
One image per prompt · same seeds across models · deltas relative to SanaOneStep_Base.
Model                 aesthetic ↑   CLIP text ↑        no artifacts ↑     PickScore ↑
SanaOneStep_Base      0.5978        0.6592             0.3859             22.3220
SanaOneStep_eggroll   0.5975        0.6611 (+0.0019)   0.3899 (+0.0040)   22.5013 (+0.1793)
SanaTwoStep_Base      0.5965        0.6614             0.3926             22.8059
Selected

Other Projects

Click to expand
SANA-Video Tom & Jerry LoRA sample

Tom & Jerry Generation — SANA-Video

Fine-tuning SANA-Video 2B with LoRA to generate Tom & Jerry–style animations at 224×224, starting from a class-only LoRA and moving toward scene-aware, text-conditioned control. The focus is on making long, readable cartoons run comfortably on a single consumer GPU rather than a lab-scale setup.

T2V · LoRA · VLM + vLLM · Diffusion Video Model
Read more

Versioned training roadmap

  • V1 – Class-only LoRA @ 224×224. Freeze the base SANA-Video 2B backbone and train LoRA on all linear layers, using a single Tom & Jerry class prompt over ~16k cached 5s clips (81-frame latents) with a flow-matching objective (a minimal training-step sketch follows this list).
  • V2 – Scene-aware LoRA with VLM labels. Use Qwen3-VL served via vLLM to generate structured scene descriptions (ENVIRONMENT / CHARACTERS / PROPS / ACTION / CAMERA) per clip, encode them, and condition SANA-Video on per-clip text prompts to recover more fine-grained control while staying at 224×224.
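For reference, the heart of the V1 objective is a standard flow-matching step on the cached latents, sketched below. lora_transformer, class_prompt_emb, and the call signature are assumptions for illustration rather than the actual SANA-Video API.

```python
import torch
import torch.nn.functional as F

# Sketch of a V1-style flow-matching step on cached video latents.
# Only the LoRA parameters are in the optimizer; the 2B backbone stays frozen.

def flow_matching_step(lora_transformer, latents, class_prompt_emb, optimizer):
    # latents: (B, C, T, H, W) cached VAE latents for ~5 s, 81-frame clips
    noise = torch.randn_like(latents)
    t = torch.rand(latents.shape[0], device=latents.device)      # per-sample timestep in [0, 1]
    t_ = t.view(-1, 1, 1, 1, 1)
    x_t = (1.0 - t_) * latents + t_ * noise                      # linear interpolation path
    target_v = noise - latents                                   # flow-matching velocity target
    pred_v = lora_transformer(x_t, t, class_prompt_emb)          # assumed call signature
    loss = F.mse_loss(pred_v, target_v)
    optimizer.zero_grad()
    loss.backward()                                              # gradients reach only LoRA weights
    optimizer.step()
    return loss.item()
```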

V1 checkpoint progression

Same seed and class prompt; only the LoRA checkpoint changes. This makes it easy to see how the model gradually snaps into a consistent Tom & Jerry look.

Tom & Jerry checkpoint preview

Base model: off-style 224×224 sample before LoRA fine-tuning.

Future directions

Next steps include distillation, pruning, and quantization to make the model more deployable, plus more systematic evaluation (identity consistency, motion quality) and exploring RL-style post-training to better align generations with scene descriptions.

SigLIP–zoom method overview

Sana Simplified — Image & Zoom Control

Research playground around Sana 1.5 and Sana Sprint for image generation, a ControlNet implementation, and a SigLIP-driven zoom controller. GLB objects are rendered offline at multiple zoom levels, and object + zoom embeddings are appended as tokens to the text encoding so the model can smoothly change the camera distance while keeping identity and style consistent.

T2I Diffusion · ControlNet · Img2Img Condition · 3D Rendering
Read more

Method Overview

A zero-zoom reference image is encoded with SigLIP to form an object token, while a scalar zoom value z ∈ [0, 1] is mapped through a small MLP to a zoom token. Both tokens are appended to the cached text encoding, and the Sana 1.5 transformer is then fine-tuned with LoRA.
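In code, the conditioning path looks roughly like the sketch below; the module names and dimensions are placeholders (e.g., a 1152-d SigLIP embedding), not the project's actual values.

```python
import torch
import torch.nn as nn

# Sketch of the zoom conditioning: a SigLIP object token and an MLP-projected
# zoom token are appended to the cached text encoding fed to the Sana transformer.

class ZoomConditioner(nn.Module):
    def __init__(self, siglip_dim=1152, text_dim=2304):
        super().__init__()
        self.obj_proj = nn.Linear(siglip_dim, text_dim)          # SigLIP embedding -> object token
        self.zoom_mlp = nn.Sequential(                           # scalar zoom z in [0, 1] -> zoom token
            nn.Linear(1, text_dim), nn.SiLU(), nn.Linear(text_dim, text_dim)
        )

    def forward(self, text_tokens, siglip_emb, zoom):
        # text_tokens: (B, L, D) cached text encoding; siglip_emb: (B, siglip_dim); zoom: (B, 1)
        obj_tok = self.obj_proj(siglip_emb).unsqueeze(1)         # (B, 1, D)
        zoom_tok = self.zoom_mlp(zoom).unsqueeze(1)              # (B, 1, D)
        return torch.cat([text_tokens, obj_tok, zoom_tok], dim=1)
```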

Reference & Zoom Sweep

Kokoro coefficient trajectory (GIF)

Kokoro — Tiny TTS, Big Voices

Small-footprint TTS with Kokoro-82M: fast mixture-of-voices coefficients, full-embedding optimization, and streamlined visual tooling (trajectory visualizer + export).

TTS · Embedding Opt · Visualizer
Read more

Highlights

  • Mixture-of-voices coefficient optimizer with temperature-annealed softmax (sketched after this list)
  • Voice-embedding and LoRA optimization in few-shot and other scenarios
  • GUI visualizer: circular mixer + per-voice bars + MP4 export
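The coefficient optimizer is conceptually simple; here is a minimal sketch, assuming a stacked voice_bank of Kokoro voice embeddings and a hypothetical voice_loss that scores the mixed voice (e.g., similarity to a few reference clips).

```python
import torch

# Mixture-of-voices sketch: learn logits over a bank of voice embeddings,
# mix with a temperature-annealed softmax, and optimize against voice_loss.

def optimize_mixture(voice_bank, voice_loss, steps=300, t_start=2.0, t_end=0.1):
    # voice_bank: (N, ...) stacked voice style embeddings
    logits = torch.zeros(voice_bank.shape[0], requires_grad=True)
    opt = torch.optim.Adam([logits], lr=0.05)
    for step in range(steps):
        temp = t_start + (t_end - t_start) * step / max(steps - 1, 1)   # linear anneal
        weights = torch.softmax(logits / temp, dim=0)                   # sharpens toward few voices
        mixed = torch.einsum("n,n...->...", weights, voice_bank)        # convex combination
        loss = voice_loss(mixed)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.softmax(logits / t_end, dim=0).detach()                # final mixture coefficients
```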
Token Budget example

Token-Budget-Aware Reasoning for VLMs

Built on Token-Budget-Aware LLM Reasoning, extended to multimodal setups by combining a frozen SigLIP image encoder with a LoRA-tuned LLM to predict a reasoning budget prior to decoding.

CoT · Multimodal · Paper Implementation
Read more

Summary

The original paper asks whether we can control chain-of-thought with an adaptive token budget. This variant predicts the budget for multimodal inputs and constrains decoding accordingly.
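A minimal sketch of that idea, with assumed module names: pooled SigLIP features feed a small head that predicts a token budget, and decoding is then capped at that budget.

```python
import torch
import torch.nn as nn

# Sketch of the budget path: frozen SigLIP features -> predicted budget -> capped decoding.
# BudgetHead, vis_dim, and the generate() wiring are illustrative assumptions.

class BudgetHead(nn.Module):
    def __init__(self, vis_dim=1152, hidden=512, max_budget=512):
        super().__init__()
        self.max_budget = max_budget
        self.mlp = nn.Sequential(nn.Linear(vis_dim, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, siglip_emb):
        # Map pooled image features to a budget in (0, max_budget).
        return torch.sigmoid(self.mlp(siglip_emb)).squeeze(-1) * self.max_budget

def budgeted_generate(llm, tokenizer, prompt, siglip_emb, budget_head):
    budget = max(int(budget_head(siglip_emb).round().item()), 1)   # predicted CoT budget
    inputs = tokenizer(prompt, return_tensors="pt")
    return llm.generate(**inputs, max_new_tokens=budget)           # constrain chain-of-thought length
```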

Deep dive

  • Architecture: Frozen SigLIP + budget head fused with LLM states.
  • Training: LoRA fine-tuning with oracle CoT lengths; KL regularization.
  • Evaluation: Accuracy vs. average tokens; ablations on freezing/head depth.
Token budget predictor diagram
VAR Explained

PopYou2 — VAR Text

Adapted Visual AutoRegressive (VAR) modeling for Funko Pop! generation with a custom “doll” embedding and a SigLIP→text adapter, enabling both I→I and T→I paths; also implemented text-to-image training for VAR.

VAR · T2I · Autoregressive Image Generation
Read more

Summary

Text-to-image with VAR using a custom “doll” embedding and a SigLIP→CLIP text space adapter.
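The adapter itself is small; below is a sketch under assumed dimensions that maps SigLIP image embeddings into the CLIP text space conditioning VAR, trained here with a simple cosine objective against paired caption embeddings (the actual training setup may differ).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# SigLIP -> CLIP-text-space adapter sketch, so an input image can drive
# the same conditioning path as a text prompt (I -> I through the T -> I model).

class SiglipToClipAdapter(nn.Module):
    def __init__(self, siglip_dim=1152, clip_dim=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(siglip_dim, hidden), nn.GELU(), nn.Linear(hidden, clip_dim)
        )

    def forward(self, siglip_emb):
        return self.net(siglip_emb)

def adapter_loss(adapter, siglip_emb, clip_text_emb):
    # Align the adapted image embedding with the paired caption's CLIP text embedding.
    pred = F.normalize(adapter(siglip_emb), dim=-1)
    target = F.normalize(clip_text_emb, dim=-1)
    return 1.0 - (pred * target).sum(dim=-1).mean()                 # cosine distance
```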

A Funko Pop figure of […], styled as a […], performing […].
Funko preview
SAM LoRA figure

Few-Shot Segmentation with SAM + LoRA

Adapted SAM and efficient variants (e.g., EdgeSAM, TinySAM) with LoRA for class-aware few-shot segmentation, with improvements over PerSAM.

Few-Shot · LoRA · Pruning · Quantization · Edge Device
Read more

Deep dive

LoRA-adapted SAM variants with compression for edge deployment, including experimentation with extreme few-shot edge cases using few-shot methods.
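The adaptation follows the usual LoRA pattern of wrapping selected linear layers in the image encoder; a generic sketch is below (target layer names and ranks are illustrative, not the exact ones used per SAM variant).

```python
import torch.nn as nn

# Generic LoRA wrapper for linear layers, of the kind used to adapt SAM-style encoders.

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank=4, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # pretrained weight stays frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                  # LoRA path starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

def inject_lora(module: nn.Module, targets=("q_proj", "v_proj"), rank=4):
    # Recursively replace matching nn.Linear submodules with LoRA-wrapped versions.
    for name, child in module.named_children():
        if isinstance(child, nn.Linear) and name in targets:
            setattr(module, name, LoRALinear(child, rank=rank))
        else:
            inject_lora(child, targets, rank)
```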

COCO few-shot examples · Cityscapes examples · Soccer examples · Segment Anything variants

CelebrityLook — Mobile Face Transform

On-device face stylization with a distilled StyleGAN2 and CLIP-to-StyleGAN latent alignment; ~30fps on modern phones.

Edge Device · GAN · Latent Alignment · CoreML · Image Generation · T2I
Read more

Summary

Edge app that performs GAN inversion and text-conditioned editing directly on device, using latent alignment between CLIP and the StyleGAN W+ space.
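A rough sketch of the alignment component, with assumed names and dimensions: a small mapper turns a CLIP text embedding into an offset in the StyleGAN2 W+ latent obtained from the on-device inversion.

```python
import torch
import torch.nn as nn

# CLIP -> W+ mapper sketch: the predicted offset edits the inverted face latent,
# which is then rendered by the distilled StyleGAN2 generator.

class ClipToWPlus(nn.Module):
    def __init__(self, clip_dim=512, n_styles=18, w_dim=512):
        super().__init__()
        self.n_styles, self.w_dim = n_styles, w_dim
        self.mapper = nn.Sequential(
            nn.Linear(clip_dim, 1024), nn.LeakyReLU(0.2), nn.Linear(1024, n_styles * w_dim)
        )

    def forward(self, w_plus, clip_text_emb, strength=1.0):
        # w_plus: (B, n_styles, w_dim) inverted latent; clip_text_emb: (B, clip_dim)
        delta = self.mapper(clip_text_emb).view(-1, self.n_styles, self.w_dim)
        return w_plus + strength * delta                # edited latent for the generator
```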

FastGAN sample

PopYou — FastGAN + CLIP

Multi-stage pipeline with FastGAN and CLIP inversion for promptable Funko-style synthesis.

GAN · CLIP Inversion · Synthetic Data · 3D Generation · T2I
Read more

Summary

GAN-based Funko synthesis with CLIP-guided text/image editing paths.

Audio waveform placeholder

MusicGen — Genre LoRA

Adapts MusicGen with LoRA for genre-specific generation while keeping prompt controllability.

LoRA · Music Generation
Read more

Summary

Low-rank adapters bias generation toward a target genre, and alignment with that genre is evaluated distributionally.

KoalaReadingAI — AI Papers as a Podcast

Pipeline that turns AI research papers into short audio episodes and publishes to Spotify and YouTube.

TTS
Read more
Reading

What I’ve been reading lately