Working at the intersection of AI, research, and engineering.

I enjoy exploring ideas, building systems, and turning complex problems into something practical. My focus is on thoughtful research and solid engineering that lasts.

About

Profile

Amit Israeli

I work on AI research across computer vision and natural language processing, focusing on training and evaluating models for segmentation, detection, generation, and multimodal reasoning. My background spans deepfake detection, few-shot learning, efficient model design, and language–vision integration, with an emphasis on turning research ideas into working systems.

Selected Projects

Kokoro coefficient trajectory (GIF)

Kokoro — Tiny TTS, Big Voices (WIP)

Actively building now — voice embedding optimization + tools

Researching small-footprint TTS with Kokoro-82M: fast mixture-of-voices coefficients, full-embedding optimization, and clean visual tooling (live trajectory + video export).

Tiny TTS · Mixture-of-Voices · Embedding Opt · Speaker Adaptation · Visualizer
What I’m building

Now

  • Mixture-of-voices optimizer with temperature-annealed softmax (see the sketch after this list)
  • Stable MR-STFT loss (MPS-friendly), clean W&B logging
  • GUI visualizer: circular mixer + per-voice bars + MP4 export
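
A minimal sketch of the first item above, assuming a PyTorch setup: learnable logits over a bank of voice embeddings, mixed via a softmax whose temperature is annealed so the mixture starts soft and gradually commits. The bank size, the synthesis step, and the loss below are placeholders rather than the actual Kokoro-82M API.

```python
# Sketch only: mixture-of-voices coefficients with a temperature-annealed softmax.
# voice_bank, the synthesis step, and the loss are stand-ins, not Kokoro-82M code.
import torch

num_voices, embed_dim = 10, 256                     # assumed bank size / embedding width
voice_bank = torch.randn(num_voices, embed_dim)     # placeholder for real voice embeddings

logits = torch.zeros(num_voices, requires_grad=True)
opt = torch.optim.AdamW([logits], lr=1e-2)

steps, t_start, t_end = 500, 5.0, 0.1
for step in range(steps):
    # linear annealing: high temperature = soft blend, low temperature = near one-hot
    t = t_start + (t_end - t_start) * step / (steps - 1)
    coeffs = torch.softmax(logits / t, dim=0)       # mixture-of-voices coefficients
    emb = coeffs @ voice_bank                       # blended voice embedding

    # in the real project the loss is an MR-STFT distance between audio synthesized
    # from `emb` and the target voice; a dummy proxy keeps this sketch self-contained
    loss = emb.pow(2).mean()

    opt.zero_grad()
    loss.backward()
    opt.step()
```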

Next

  • Full-embedding optimization & partial model fine-tuning
  • Optimizer ablations (AdamW, SAM, cosine/plateau/one-cycle)
  • “Speaker description → voice embedding” mapping
Token Budget example

Token-Budget-Aware Reasoning for VLMs

Built on the paper Token-Budget-Aware LLM Reasoning, this project extends the idea to multimodal setups by combining a frozen SigLIP image encoder with a LoRA-tuned LLM. The goal is to predict the reasoning budget before decoding, making chain-of-thought both efficient and controllable.

CoT Budgeting · LoRA · SigLIP · Multimodal

Summary

The original paper asked whether we can control chain-of-thought (CoT) reasoning with an adaptive token budget. Their method trains a predictor model to estimate how many tokens the ground-truth CoT would require, and uses this as a budget for the main LLM.

My contribution was to extend this idea to multimodal environments. The system takes both text and image inputs, predicts the reasoning budget, and then guides the LLM’s CoT length accordingly — balancing efficiency and accuracy.

Deep dive

  • Architecture: Frozen SigLIP encoder extracts visual features; a custom budget head combines these with LLM hidden states to predict the token budget (a sketch follows this list).
  • Training: LoRA fine-tuning on text-only and multimodal batches, supervised with oracle CoT lengths; KL regularization for stable budget predictions.
  • Evaluation: Compared accuracy vs. average token usage on text vs. multimodal inputs; ablations on encoder freezing and head depth.
  • Outcome: Early results show notable token savings while maintaining competitive accuracy — highlighting the trade-offs of token-aware reasoning for VLMs.
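
As a rough illustration of the architecture bullet, here is what a budget head of this kind could look like in PyTorch; the feature dimensions, mean pooling, and softplus output are assumptions for the sketch, not the project's exact module.

```python
# Hypothetical budget head: pooled SigLIP image features and pooled LLM hidden
# states are regressed to a positive scalar token budget. Dimensions are illustrative.
import torch
import torch.nn as nn

class BudgetHead(nn.Module):
    def __init__(self, vis_dim=1152, llm_dim=2048, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vis_dim + llm_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, vis_feats, llm_hidden):
        # vis_feats: (B, N_patches, vis_dim) from the frozen SigLIP encoder
        # llm_hidden: (B, T, llm_dim) from the LoRA-tuned LLM
        pooled = torch.cat([vis_feats.mean(dim=1), llm_hidden.mean(dim=1)], dim=-1)
        return nn.functional.softplus(self.mlp(pooled)).squeeze(-1)  # budget > 0

head = BudgetHead()
budget = head(torch.randn(2, 196, 1152), torch.randn(2, 64, 2048))  # shape (B,)
```
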
Token budget predictor diagram
VAR Explained

PopYou2 — VAR Text

Adapted the Visual AutoRegressive (VAR) model for Funko Pop! generation. Fine-tuned with a custom “doll” embedding and an adapter mapping SigLIP image embeddings into the model’s text space, enabling both I→I and T→I generation paths.

VAR · Adapter · SigLIP→Text

Summary

Inspired by the Visual Autoregressive Modeling paper, this project explores text-to-image generation with VAR. By injecting a custom “doll” embedding and training a lightweight adapter, the model can synthesize Funko Pop! figures with controllable styles and actions. The system supports both image-to-image and text-to-image generation, extending the VAR paradigm to creative domains.

Deep dive

  • Dataset: ~100k Funko Pop! images generated with SDXL-Turbo prompts. Images were filtered and upscaled to ensure high quality and diversity.
  • Architecture: BLIP-2 was used for captioning to create initial descriptions. The VAR and VAE were kept frozen. An adapter layer was trained to map SigLIP image embeddings into the CLIP text space, conditioning the VAR generator (see the adapter sketch after this list).
  • Training: Fine-tuned the adapter and a lightweight LoRA module while leaving the generator and VAE frozen for efficiency. A custom “doll” embedding was injected to specialize the model for Funko Pop! synthesis.
  • Generation paths:
    • Image → Image: use SigLIP image embeddings through the adapter to influence VAR output.
    • Text → Image: swap the SigLIP image encoder for a text encoder at inference, enabling controlled text-to-image synthesis.
  • Controls: Style and action modifiers (e.g., “Alien”, “Playing guitar”) allow flexible customization. Optional textual inversion can be applied for specific franchises.
  • Next steps: Extend to 3D outputs with multi-view diffusion and NeRF/Depth priors, enabling mesh export and richer customization.
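
To make the adapter idea concrete, a minimal sketch is below: a small MLP that maps a SigLIP image embedding into the text-embedding space conditioning the frozen VAR generator, with a text encoder swapped in at inference for the T→I path. Layer sizes and dimensions are assumptions, not the trained adapter.

```python
# Sketch of the SigLIP -> text-space adapter that conditions the frozen VAR generator.
# siglip_dim / text_dim / hidden are assumed values, not the project's configuration.
import torch
import torch.nn as nn

class SigLIPToTextAdapter(nn.Module):
    def __init__(self, siglip_dim=1152, text_dim=768, hidden=1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(siglip_dim, hidden),
            nn.GELU(),
            nn.Linear(hidden, text_dim),
        )

    def forward(self, emb):
        # emb: SigLIP image embedding (I -> I path) or, at inference,
        # a text encoder's embedding (T -> I path) fed through the same interface
        return self.net(emb)   # conditioning vector for the frozen VAR generator

adapter = SigLIPToTextAdapter()
cond = adapter(torch.randn(4, 1152))   # (4, 768) conditioning vectors
```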

Interactive Demo

A Funko Pop figure of [character], styled as a [style], performing [action].
Funko preview
SAM LoRA figure

Few-Shot Segmentation with SAM + LoRA

Adapted SAM and its lightweight variants with LoRA adapters for few-shot segmentation, showing improvements over PerSAM.

Few‑Shot · LoRA · Pruning · Quantization · Edge

Deep dive

This project extends SAM and its efficient variants (FastSAM, EfficientSAM, MobileSAM) with LoRA adapters to handle class-aware few-shot segmentation. I ran systematic experiments on COCO, Cityscapes, and Soccer datasets, carefully varying the number of shots per class, loss functions, and augmentation strategies to evaluate generalization. Beyond accuracy gains, the focus was on practical deployment: I applied pruning and quantization to reduce model size by up to 80%, then fine-tuned LoRA modules to recover accuracy without sacrificing speed. This compression pipeline enabled real-time inference on edge devices, demonstrating how research-grade models can be adapted for production. Results showed that these LoRA-adapted SAM models consistently surpassed PerSAM in class-specific IoU, while remaining lightweight and efficient enough for real-world use.
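
The LoRA side of this can be pictured with a generic low-rank wrapper around a frozen linear layer, of the kind injected into SAM's attention projections; this is a sketch under assumed ranks and dimensions, not the segment-anything codebase.

```python
# Generic LoRA wrapper: freeze the pretrained projection, add a trainable
# low-rank update. Rank, alpha, and the 768-dim projection are assumptions.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 4, alpha: int = 8):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)                   # start as a no-op update
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# usage sketch: wrap the qkv projection of one ViT block in SAM's image encoder
qkv = nn.Linear(768, 768 * 3)
lora_qkv = LoRALinear(qkv, rank=4)
out = lora_qkv(torch.randn(1, 196, 768))                # (1, 196, 2304)
```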

COCO few-shot examples · Cityscapes examples · Soccer examples · Segment Anything variants

CelebrityLook — Mobile Face Transform

On-device face stylization with a distilled StyleGAN2 (MobileStyleNet) and CLIP-to-StyleGAN latent alignment for text-driven edits; ~30fps on modern phones.

On‑Device 30fps · StyleGAN2 Distill · Latent Alignment · CoreML

Summary

Built an edge app that performs GAN inversion and text-conditioned editing directly on device. It implements Bridging CLIP and StyleGAN through Latent Alignment to connect language and vision, keeping identity while applying prompt-guided changes — work that won the Samsung Next MobileXGenAI Hackathon and was optimized for ~30fps mobile inference.

Deep dive

The pipeline uses MobileStyleNet (a StyleGAN2 distillation) as the generator. For inversion, inference is G(f(E_I(x))), where a few mapper layers f remain trainable and E_I is an image encoder distilled from OpenCLIP to EfficientFormer-Large (per the latent-alignment setup in the paper/README). To connect text to the model, a mapper is trained to align CLIP representations with the W+ latent: the OpenCLIP text encoder E_T produces an embedding that the mapper converts into a ΔW+. Starting from the mean latent (text→image) or from an inverted latent (image→text manipulation), the scaled ΔW+ is added to drive edits like “blonde woman with sunglasses,” “man with a hat and beard,” or head-pose changes. In practice, the right scale factor C for ΔW+ depends on each (W+, text) pair, so a small projection layer is trained to predict the optimal C with a CLIP-based loss, balancing fidelity to the source face and alignment to the prompt. The app demonstrates strong attribute control (e.g., glasses, hair, pose), while acknowledging the identity-preservation limits of the inversion versus SOTA methods. Export and mobile optimizations deliver smooth, on-device performance (~30fps).
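
A compact sketch of the text-edit path described above, assuming an 18×512 W+ layout and illustrative layer sizes: a mapper turns the CLIP text embedding into a ΔW+ direction, and a small head predicts the per-(W+, text) scale C.

```python
# Sketch only: CLIP text embedding -> delta W+ plus a learned edit-strength C.
# The 18 x 512 W+ shape and layer widths are assumptions, not the app's code.
import torch
import torch.nn as nn

N_LAYERS, W_DIM, CLIP_DIM = 18, 512, 512

class DeltaWMapper(nn.Module):
    def __init__(self):
        super().__init__()
        self.mapper = nn.Sequential(
            nn.Linear(CLIP_DIM, 1024), nn.GELU(),
            nn.Linear(1024, N_LAYERS * W_DIM),
        )
        self.scale_head = nn.Linear(CLIP_DIM + N_LAYERS * W_DIM, 1)

    def forward(self, text_emb, w_plus):
        delta = self.mapper(text_emb).view(-1, N_LAYERS, W_DIM)      # edit direction
        feat = torch.cat([text_emb, w_plus.flatten(1)], dim=-1)
        c = self.scale_head(feat).unsqueeze(-1)                      # predicted scale C
        return w_plus + c * delta                                    # edited latent for the generator

mapper = DeltaWMapper()
edited = mapper(torch.randn(2, CLIP_DIM), torch.randn(2, N_LAYERS, W_DIM))
```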

Mapper training diagram · Hackathon win · Text edit: blond woman with sunglasses · Pose edit: looking left
FastGAN sample

PopYou — FastGAN + CLIP

Multi-stage pipeline: scrape + super-resolve real Funko images, synthesize ~30k with DeciDiffusion, train FastGAN, then add a frozen-CLIP inversion mapper for text- and image-conditioned edits; optional 3D lifting.

FastGAN · CLIP Inversion · Synthetic Data · 3D Lifting

Summary

PopYou! targets Funko-style synthesis with GAN-level latency and memory. The data stage combines scraped Funko Pop images (upscaled with a super-resolution model) with a synthetic corpus of ~30,000 renders produced by DeciDiffusion. A FastGAN generator is trained on this semi-synthetic set and then frozen. On top, a lightweight inversion/mapper conditioned by frozen CLIP enables two paths: text→image generation via CLIP text embeddings (using a prompt template like “funko pop figure of … on a white background”) and image→image stylization via CLIP image embeddings of real faces/characters. Side-by-side examples show prompt alignment comparable to diffusion baselines at a fraction of runtime and memory. For 3D exploration, outputs can be lifted with SyncDreamer / DreamGaussian.

Deep dive

  • Stage 1 — Data: Scrape real Funko images and upsample them with a super-resolution model; generate an additional ~30k synthetic Funko renders using DeciDiffusion.
  • Stage 2 — GAN training: Train FastGAN on the combined (semi-synthetic) corpus to capture the Funko style; freeze the generator for downstream editing.
  • Stage 3 — Inversion & editing: Train a mapper on top of frozen CLIP encoders to produce ΔW+ edits in W+. For text-conditioned synthesis, start from the mean latent and add a ΔW+ derived from the CLIP text embedding; for image-conditioned stylization, derive ΔW+ from the CLIP image embedding of a reference photo. Attributes like glasses, hair color, clothing, and coarse pose are controllable via the prompt/reference.
  • Stage 4 — Evaluation & 3D: Using the templated prompts, aggregate metrics report CLIP similarity ≈ 0.31 for PopYou! vs. 0.33 for DeciDiffusion, and FID ≈ 562 vs. 258 against real Funko images (see the evaluation sketch below). For 3D, generate multi-view/mesh outputs via SyncDreamer/DreamGaussian.

The result is a practical trade-off: GAN-based Funko synthesis with strong promptability and vastly lower latency/memory than diffusion, plus an editing path that generalizes to real images via CLIP-guided inversion.
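
For the Stage 4 numbers, the CLIP-similarity side of the evaluation can be sketched with the Hugging Face transformers CLIP API as below; the checkpoint choice and the filled-in prompt subject are assumptions for illustration.

```python
# Sketch of the CLIP-similarity check: embed a generated image and its templated
# prompt, then take cosine similarity. Checkpoint and prompt subject are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

prompt = "funko pop figure of Obama on a white background"   # templated prompt, subject filled in
image = Image.new("RGB", (256, 256))                          # stand-in for a generated sample

inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    img_feat = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_feat = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
score = torch.cosine_similarity(img_feat, txt_feat).item()    # averaged over the test set in practice
```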

FastGAN Obama sample · 3D lifting: Alan Turing (DreamGaussian) · 3D lifting: Ras (DreamGaussian)
Audio waveform placeholder

MusicGen — Genre LoRA

LoRA-adapts Meta’s MusicGen for genre-specific generation (e.g., MapleStory-style BGM) while keeping prompt controllability.

LoRA · Genre Transfer · 32k EnCodec · Fréchet

Summary

This project fine-tunes MusicGen with LoRA to steer the model toward specific genres using genre-related text prompts, without retraining the full network. MusicGen is a single-stage autoregressive Transformer trained over a 32 kHz EnCodec tokenizer with four codebooks @ 50 Hz; the LoRA approach keeps its built-in controllability while specializing style efficiently.

Deep dive

The training setup adds low-rank adapters to MusicGen and conditions on genre descriptions to bias generation toward target styles. To evaluate genre adaptation for a distinctive target, we compare generations against MapleStory background music. We generate 150 audio clips (each 8 s) and measure distributional shift with a Fréchet distance metric. As a reference, example zero-shot prompts yield distances of 1.0153 for “maplestory background music”, 0.7420 for an upbeat orchestral/jazz prompt, and 0.6013 for an electronic/playful prompt. The same protocol is then applied to the fine-tuned model to quantify genre alignment while preserving MusicGen’s prompt controls.
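
To make the protocol concrete, a sketch of the generation side using Meta's audiocraft package follows; the checkpoint name and batch size are assumptions, and loading the LoRA-adapted weights is omitted.

```python
# Sketch: generate 150 eight-second clips for one prompt, to be scored later with a
# Frechet audio distance against the MapleStory reference set. Checkpoint is assumed.
from audiocraft.models import MusicGen
from audiocraft.data.audio import audio_write

model = MusicGen.get_pretrained("facebook/musicgen-small")   # LoRA-adapted weights would be loaded here
model.set_generation_params(duration=8)                      # 8-second clips

prompt = "maplestory background music"
n_clips, batch = 150, 10
idx = 0
for _ in range(n_clips // batch):
    wavs = model.generate([prompt] * batch)                  # (batch, channels, samples)
    for wav in wavs:
        audio_write(f"clip_{idx:03d}", wav.cpu(), model.sample_rate, strategy="loudness")
        idx += 1
```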

Audio samples

KoalaReadingAI — AI Papers as a Podcast

A pipeline that turns AI research papers into short audio episodes and publishes them to Spotify and YouTube. Automates paper fetching, summarization, and TTS.

TTS · LLMs · Audio · Automation

Summary

The system fetches recent papers (Hugging Face daily papers), summarizes them, and converts the text to speech, producing episodes for distribution. It supports paid TTS via ElevenLabs as well as a free local option via Tortoise-TTS.

Deep dive

  • Ingestion: pull papers by date range from Hugging Face daily papers (see the sketch after this list).
  • Summarization: generate episode scripts from PDFs using an API workflow (ChatPDF, per the repo README).
  • Speech: text-to-speech with ElevenLabs; optional Tortoise‑TTS for a free local pipeline.
  • Output: publishable audio files and episode metadata for podcast platforms.
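
A minimal sketch of the ingestion step is below, assuming Hugging Face's public daily-papers endpoint and its date query parameter; both are assumptions about the current API, not the repo's code.

```python
# Sketch only: fetch one day's Hugging Face daily papers. The endpoint URL and the
# `date` parameter are assumptions; the response is passed through untouched.
import requests

def fetch_daily_papers(date: str):
    """Return the raw daily-papers entries for one day, e.g. date='2024-06-01'."""
    resp = requests.get(
        "https://huggingface.co/api/daily_papers",
        params={"date": date},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()   # list of paper entries handed to the summarization step

papers = fetch_daily_papers("2024-06-01")
```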

Listen

Koala Reading AI cover
Timeline

Experience

Computer Vision Research Engineer @ Reality Defender

Jan 2024 – Present
  • Developing and optimizing deep learning computer vision solutions to detect deepfakes and fraudulent media (video/image/audio).
  • Built synchronization/thresholding pipelines and dataset alignment utilities for robust evaluation.
  • Contributed model compression & latency tuning for faster screening.
Deepfake Detection · Latency/Compression · Screening Pipelines

Computer Vision Research Engineer @ LuckyLab (Freelance)

Dec 2024 – Apr 2025
  • Built and deployed edge‑optimized segmentation and object‑detection solutions for production.
  • Converted research models to production graphs (CoreML/TensorRT) with quality gates.
Edge Deployment · Seg/Det · Few‑Shot · CoreML/TensorRT

Deep Learning Research Engineer @ NLPearl

Jun 2024 – Jan 2025 · Tel Aviv, Israel
  • Developed real‑time systems for conversational pause detection and response generation using fine‑tuned LLMs.
  • Explored architectures with LoRA and multi‑stage training to boost performance.
  • Built a compact language model for multi‑task outputs; worked with SOTA audio tokenizers and LLMs for audio‑focused tasks.
LLMs · Real‑time · LoRA · Audio

Computer Vision & Deep Learning Research Engineer @ Pashoot Robotics

May 2023 – Jun 2024 · Rehovot, Israel
  • Improved object detection and segmentation in zero/few‑shot settings using foundation models (SAM, YOLO‑World, Grounding DINO, CLIP).
  • Worked on multi‑object tracking and 6‑DoF pose estimation with synthetic data.
  • Used Blender and 3D reconstruction (NeRF, Gaussian Splatting, image‑to‑3D) for simulation and domain randomization.
Zero/Few‑Shot · SAM/YOLO‑World · 6‑DoF · Sim/Domain Rand.