SANA-Video × Tom & Jerry Research Log

Fine-tuning SANA-Video 2B toward efficient, low-budget Tom & Jerry–style videos at low resolution on consumer GPUs.

Project Overview

Goal

The core goal of this project is to generate controlled Tom & Jerry–style videos on consumer GPUs in a reasonable amount of time, instead of needing giant lab hardware or overnight sampling for every clip.

The model of choice is SANA-Video 2B, a recent efficient text-to-video diffusion model that uses flow-matching training, a relatively compact 2B-parameter transformer, and a video-optimized VAE. This experiment bends that general model toward a very specific corner of style space: classic Tom & Jerry cartoons, with readable motion and consistent characters, while still keeping text control alive.

Resolution vs Efficiency Tradeoff

Even though SANA-Video is relatively efficient, running it at its native 480p-ish resolution is still heavy on a single 12–24 GB GPU, especially if we want long clips, no distilled or pruned model (for now :) ), and ~50 inference steps.

To make the project actually usable on normal hardware and in acceptable training compute, the first accepted trade-off is resolution for efficiency and memory. Instead of generating full 480p (the base training resolution), this experiment locks the model to 224×224 and focuses on style, motion, and controllability rather than pure sharpness.

Reference timings and memory per setting:

| Setting | Resolution | Latent shape (C × T × H × W) | VAE encode + decode (per 81-frame clip) | Diffusion step time (per step) | Peak VRAM (transformer only in memory) |
| --- | --- | --- | --- | --- | --- |
| Native SANA-Video | ~480p (e.g. 832×480) | 16 × 21 × 60 × 104 | OOM | ~1577 ms | 6373.37 MiB |
| Tom & Jerry setup | 224×224 | 16 × 21 × 28 × 28 | 63 ms + 110 ms | ~243 ms | 4578.80 MiB |

Benchmarks: these reference numbers were measured on an L40 AWS instance with the model in bf16.
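For intuition, the latent shapes above follow directly from the video VAE's compression factors. The sketch below assumes 8× spatial and 4× temporal compression (with the first frame kept) and 16 latent channels; these factors are inferred from the shapes in the table rather than read from the SANA-Video config.

```python
# Sketch: reproduce the latent shapes from the table above.
# Assumed compression factors (inferred from the table, not from the model config):
#   16 latent channels, 8x spatial downsampling, 4x temporal downsampling
#   with the first frame kept, i.e. T_latent = (T - 1) // 4 + 1.
def latent_shape(num_frames: int, height: int, width: int,
                 channels: int = 16, spatial_ds: int = 8, temporal_ds: int = 4):
    t_latent = (num_frames - 1) // temporal_ds + 1
    return (channels, t_latent, height // spatial_ds, width // spatial_ds)

print(latent_shape(81, 480, 832))  # (16, 21, 60, 104) -> native ~480p setup
print(latent_shape(81, 224, 224))  # (16, 21, 28, 28)  -> Tom & Jerry setup
```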
Tom & Jerry LoRA sample at 10k steps (seed 69420)
Sample frame from the LoRA-tuned model at 10k steps. Same seed and prompt as the base model; only the LoRA weights change.

V1: Class-Only Training Run

This first pass is a V1 baseline: treat the entire Tom & Jerry world as a single class, and adapt only the diffusion transformer with LoRA while the rest of SANA-Video stays frozen (and in practice is never loaded into GPU memory). The goal is to see how far we can push the style at 224×224 while still keeping some of the original text-conditioning behaviour alive.

LoRA is used instead of full fine-tuning for practicality and speed: it is parameter-efficient, easy to swap in and out, and often reaches quality that is comparable to a fully fine-tuned model. For a deeper dive into how LoRA compares to full fine-tuning and why “LoRA everywhere” (not just on attention layers) can work well, see the “LoRA Without Regret” blog post.

  • Start from base SANA-Video 2B (SANA-Video_2B_480p_diffusers).
  • Freeze everything except LoRA adapters on all linear layers in the diffusion transformer.
  • Train with a single class prompt describing the Tom & Jerry universe.
  • Resolution: 224×224.
  • Dataset: curated Tom & Jerry–style clips (resized + cleaned), with ~16k precomputed latents corresponding to roughly 5-second clips.
Why all linear layers? Inspired by “LoRA Without Regret”, LoRA applied to all linear layers in the transformer often gives better adaptation than limiting it only to attention projections, while still keeping the number of trainable parameters small.
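As a rough sketch of what this adapter setup can look like with peft (the transformer below is a tiny stand-in module just to keep the example runnable; the real training code may wire things differently, and recent peft versions also accept `target_modules="all-linear"` as a shortcut):

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Tiny stand-in for the SANA-Video 2B diffusion transformer; in practice this
# is the real transformer loaded from the SANA-Video_2B_480p_diffusers checkpoint.
transformer = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64))

# Collect every linear layer so LoRA is applied everywhere, not only to the
# attention projections (per "LoRA Without Regret").
linear_names = [name for name, module in transformer.named_modules()
                if isinstance(module, nn.Linear)]

lora_cfg = LoraConfig(
    r=16,                      # rank, matches the hyperparameter table below
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=linear_names,
)

lora_model = get_peft_model(transformer, lora_cfg)
lora_model.print_trainable_parameters()  # only the LoRA adapters are trainable
```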

Key Hyperparameters

| Component | Value |
| --- | --- |
| Base model | SANA-Video_2B_480p_diffusers |
| Resolution | 224 × 224 |
| Clip length | 81 frames |
| Batch size | 8 |
| Optimizer | AdamW8bit (LoRA parameters only) |
| Learning rate | 2e-4 |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| Training objective | Flow matching (velocity prediction) |
| Steps | 10k |
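For reference, the flow-matching (velocity-prediction) objective boils down to something like the sketch below. It assumes the common rectified-flow convention (x_t = (1 − t)·x0 + t·noise, target velocity = noise − x0); the actual SANA-Video training loop adds details such as timestep shifting and loss weighting that are omitted here.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, text_emb):
    """One flow-matching training step with velocity prediction (sketch).

    x0: clean video latents, e.g. shape (8, 16, 21, 28, 28) for this setup.
    Assumed convention: x_t = (1 - t) * x0 + t * noise, target v = noise - x0.
    """
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)      # one timestep per sample in [0, 1]
    t_ = t.view(b, 1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = (1 - t_) * x0 + t_ * noise         # point on the straight noise-to-data path
    target_v = noise - x0                     # velocity along that path
    pred_v = model(x_t, t, text_emb)          # transformer predicts the velocity
    return F.mse_loss(pred_v, target_v)
```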

Class Prompt

The first run uses a single class prompt as the main style anchor for the whole training:

A vintage slapstick 2D cartoon scene of a grey cat chasing a small brown mouse in a colorful house, Tom and Jerry style, bold outlines, limited color palette, exaggerated expressions, smooth character motion.
Why a single prompt? It deliberately conflates many things (character identities, environment, motion, and style) so the LoRA learns a dense “Tom & Jerry prior” before we start testing text-conditioning flexibility; it also sidesteps the fact that the Tom & Jerry clips are unlabeled at this stage.

Checkpoint Progression (Same Seed & Prompt)

At regular intervals during training, a clip is generated with a fixed seed and CFG scale. This makes it easy to see how the LoRA gradually steers SANA-Video from its generic prior into the Tom & Jerry regime.

Base SANA-Video sample (no LoRA)
Base SANA-Video 2B @ 224×224 (seed 69420)
LoRA step 100
Step 100
Still messy; style only slightly nudged away from the base.
LoRA step 1,000
Step 1k
Clearer outlines, more cartoony color palette starting to appear.
LoRA step 5,000
Step 5k
Motion and silhouettes look much closer to classic TV-era animation; Tom is a lot easier to read.
LoRA step 10,000
Step 10k
Style mostly locked in; further training starts to trade diversity for fidelity.
All clips above: same seed (69420), same prompt (class prompt), same CFG (6). The only thing that changes across columns is the LoRA checkpoint. Worth noting: the baseline SANA-Video samples at this resolution are often even weaker than the one shown here — the model was never trained natively at 224×224, and it shows.
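A rough sketch of how these fixed-seed comparisons can be produced, assuming the checkpoint loads as a standard diffusers pipeline with the usual LoRA helpers; the model path, the local LoRA checkpoint paths, and the exact generation kwargs (height/width/num_frames/fps/etc.) are assumptions rather than the project's actual script:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

CLASS_PROMPT = ("A vintage slapstick 2D cartoon scene of a grey cat chasing a small "
                "brown mouse in a colorful house, Tom and Jerry style, bold outlines, "
                "limited color palette, exaggerated expressions, smooth character motion.")

pipe = DiffusionPipeline.from_pretrained(
    "SANA-Video_2B_480p_diffusers",   # local path or hub id, as used in the project
    torch_dtype=torch.bfloat16,
).to("cuda")

for step in [100, 1_000, 5_000, 10_000]:
    pipe.unload_lora_weights()                          # drop the previous adapter
    pipe.load_lora_weights(f"lora_ckpts/step_{step}")   # hypothetical checkpoint path
    out = pipe(
        prompt=CLASS_PROMPT,
        height=224, width=224, num_frames=81,
        num_inference_steps=50, guidance_scale=6.0,
        generator=torch.Generator("cuda").manual_seed(69420),  # same seed for every column
    )
    export_to_video(out.frames[0], f"step_{step}.mp4", fps=16)  # 81 frames ≈ 5 s at 16 fps
```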

Text-Conditioned Experiments

After the class-LoRA run converges, the next question is: how much text-conditioning power is left? To test this, we reuse the same LoRA weights but vary the prompts.

All of the clips below share:

  • The same LoRA checkpoint.
  • The same seed (32420).
  • Identical sampling settings (steps, CFG, etc.).
Class baseline
Class baseline
Same style anchor as the training prompt; sanity check for the LoRA behavior.
Night living room sneaking
Night living room
Tests lighting changes and sneaking motion in a cozy living room.
Vertical chase in old mansion
Vertical staircase chase
Same style, different geometry and motion (up/down motion in an old mansion stairwell).
Tom eating pizza
Tom eating pizza
Adds an explicit object + action (pizza) on top of the base Tom & Jerry prompt setup.
Prank outside
Backyard prank
Moves the scene outdoors to a sunny backyard with slapstick prank dynamics.
Takeaway: Even after strong class-style tuning, the model still responds to changes in lighting, location, and simple action cues while keeping the Tom & Jerry look. More systematic evaluation is still TODO; full prompts for each video are documented in the GitHub repo.

V0: CFG Sweep Evaluation

I know what you're thinking... what are those flashes and that blurriness? Until now, every generation used a fixed CFG of 6.0, which causes those weird distortions. Using 100 ground-truth Tom & Jerry segments, we generate 40 corresponding clips with the V1 LoRA at different CFG scales (same list of seeds each time) and compare each generated clip to its ground-truth counterpart.

For each CFG setting, we compute:

  • SSIM ↑ – structural similarity to the ground-truth frames.
  • PSNR ↑ – per-frame signal-to-noise ratio.
  • LPIPS ↓ – perceptual distance (AlexNet or VGG backbone).
  • FID ↓ – frame-wise FID over all frames (Inception-V3 features).
  • tLPIPS ↓ – temporal LPIPS between consecutive frames (motion smoothness).
| Model | CFG scale | # eval clips | SSIM ↑ | PSNR ↑ (dB) | LPIPS ↓ | FID ↓ | tLPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SANA-Video 2B base | 6.0 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 2.0 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 2.5 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 3.0 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 4.0 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 5.0 | 40 | TBD | TBD | TBD | TBD | TBD |
Protocol: all clips are generated at 224×224 with the same seed set and prompts as the held-out ground-truth segments. Metrics are averaged over all frames and all evaluated clips per CFG setting. Once the sweep is run, this table will be updated with the actual scores.
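A sketch of the per-clip metric computation (SSIM/PSNR via scikit-image, LPIPS and tLPIPS via the lpips package; FID is accumulated over all frames separately, e.g. with torchmetrics, and is left out here). The tLPIPS definition below, LPIPS between consecutive generated frames, is one common variant and an assumption on my part:

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # or net="vgg"

def to_lpips_tensor(frames: np.ndarray) -> torch.Tensor:
    # frames: (T, H, W, 3) uint8 -> (T, 3, H, W) float in [-1, 1]
    return torch.from_numpy(frames).permute(0, 3, 1, 2).float() / 127.5 - 1.0

def clip_metrics(gen: np.ndarray, gt: np.ndarray) -> dict:
    """Per-clip scores for one generated clip vs its ground-truth counterpart."""
    ssim = np.mean([structural_similarity(g, r, channel_axis=-1)
                    for g, r in zip(gen, gt)])
    psnr = np.mean([peak_signal_noise_ratio(r, g) for g, r in zip(gen, gt)])
    g_t, r_t = to_lpips_tensor(gen), to_lpips_tensor(gt)
    with torch.no_grad():
        lp = lpips_fn(g_t, r_t).mean().item()            # frame-wise perceptual distance
        tlp = lpips_fn(g_t[1:], g_t[:-1]).mean().item()  # consecutive-frame (temporal) LPIPS
    return {"ssim": float(ssim), "psnr": float(psnr), "lpips": lp, "tlpips": tlp}
```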

V2: Scene Dataset & Training

V1 was intentionally simple: one global class prompt and a LoRA that “pulls” SANA-Video into the Tom & Jerry style at 224×224. In V2, the goal is to move beyond a single prompt while preserving as much control from the foundation model as possible, and to build a richer, text-aware scene dataset that the model can actually react to.

Dataset creation

The idea is to extend the existing Tom & Jerry latent cache with per-clip scene descriptions. Using recent advances in affordable multi-modal models and efficient inference pipelines (continuous batching, KV caching; vLLM in practice; for more details explore this blog), labeling a large dataset with detailed scene descriptions becomes realistic with reasonable engineering effort (a sketch of the labeling loop follows the list below):

  • Start from the same curated clips used in V1 (already resized & cached as latents).
  • Use Qwen3-VL-8B served via vLLM to auto-label each segment.
  • Add a CSV/JSON file that stores one text_prompt per clip or segment.
  • Encode these text prompts once and cache the embeddings alongside the video latents.
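A minimal sketch of the auto-labeling loop, assuming Qwen3-VL-8B is served behind vLLM's OpenAI-compatible endpoint and that a handful of frames per segment are sent as base64 images; the endpoint URL, served model name, and frame-sampling scheme are placeholders, not the exact setup used here:

```python
import base64
from openai import OpenAI

# vLLM's OpenAI-compatible server, e.g. started with `vllm serve <model>`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

INSTRUCTION = ("Describe this Tom & Jerry segment as a structured scene card with the "
               "sections ENVIRONMENT, CHARACTERS, PROPS, ACTION, CAMERA & FRAMING.")

def label_segment(frame_paths: list[str], model: str = "Qwen3-VL-8B") -> str:
    """frame_paths: a few JPEG frames sampled from one video segment (placeholder scheme)."""
    content = [{"type": "text", "text": INSTRUCTION}]
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    resp = client.chat.completions.create(
        model=model,                               # name as served by vLLM (assumed)
        messages=[{"role": "user", "content": content}],
        max_tokens=512,
    )
    return resp.choices[0].message.content
```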

Scene description template & labeling prompt

Each segment is described with a structured scene card (environment, characters, props, action, camera). The same template is used as the base for all prompts fed into the VLM.

Idea: one Tom & Jerry video segment → one rich scene card. Later, these cards become the text backbone for V2 training and for video → text → video experiments, and eventually for finer scene control, where a “constructor” LLM generates many scene cards and chains them together.
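For illustration, a plausible version of that scene-card template as a labeling prompt, reconstructed from the section headings visible in the failure cases below (the actual prompt used in the project may differ):

```python
SCENE_CARD_TEMPLATE = """You are labeling a short Tom & Jerry video segment.
Fill in the scene card below using only what is visible in the frames.

ENVIRONMENT
- Setting, background elements, lighting, overall mood.

CHARACTERS
- Each character on screen: appearance, colors, personality vibe.

PROPS
- Objects that matter for the action.

ACTION
- What happens over the segment, in order.

CAMERA & FRAMING
- Shot type, camera angle, and any camera motion.
"""
```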

V2: Dataset Labeling Failure Modes

Qwen3-VL is a very strong VLM, but when I used it to label the Tom & Jerry dataset, several systematic failure modes kept showing up (it still labeled most scenes correctly, but these failure cases were noticeable). Below are four representative cases, each showing the generated scene description on the left and the corresponding video on the right.

Failure case 1 — out of distribution

ENVIRONMENT

- A simple, dark blue background with no detailed environment.
- Mood: suspenseful and minimalist, with a focus on the characters and their actions.

CHARACTERS

- Tom: a white-outlined cat with an angry expression, large ears, and whiskers. He is the main character, appearing in all frames.
- Jerry: a small brown mouse with a red nose and ears, running away from Tom. He is the secondary character.

PROPS

- A green rectangular object (possibly a mouse trap or block) that is initially held by Jerry and then falls to the ground.

ACTION

- Jerry runs away from Tom, who is holding a green object. Jerry drops the object, which falls to the ground. Tom then looks down at the object, and the scene ends with Tom listening intently.

CAMERA & FRAMING

- Static camera, medium shot, focused on the characters and their interaction. The camera does not move, maintaining a consistent view of the action.

Out-of-distribution clip: the scene is abstract and sparse, but the model hallucinates a full Tom & Jerry chase with extra props and a wrong interpretation of the motion.

Failure case 2 — cat = Tom

ENVIRONMENT

- A brick wall next to a window ledge with a framed picture of roses.
- The scene is set indoors, likely in a house, with warm, natural lighting suggesting daytime.

CHARACTERS

- Jerry: a small brown mouse, wearing a red bowtie, with a mischievous and sly personality vibe.
- Tom: a black and white cat, sitting atop a trash can, looking up with a wide, grinning expression.

PROPS

- A window ledge with a framed picture of roses.
- A trash can.
- Cardboard boxes and other debris around the trash can.

ACTION

- Jerry is perched on the window ledge, looking down with a sly expression.
- Tom is sitting on top of a trash can, looking up at Jerry with a wide, grinning expression, seemingly amused or about to pounce.

CAMERA & FRAMING

- The camera is positioned at a low angle, looking up at the characters as they move through the air.
- The camera follows the action, panning slightly to keep both characters in frame as they move across the scene.

“Every cat is Tom”: any vaguely cat-shaped character is labeled as Tom, sometimes with Jerry added on top, even when the original design clearly differs.

Failure case 3 — missed main action

ENVIRONMENT

- A sunny, open golf course with vibrant green grass.
- Background elements include a tree with a wooden bench under it, distant trees, and a clear blue sky.
- Mood: Bright, cheerful, cartoonish midday lighting.

CHARACTERS

- Tom: A grey cat with white paws and chest, wearing a white golf glove on his right paw. Personality vibe: Energetic, focused, and slightly comical in his attempt at golf.

PROPS

- Golf clubs (one with a red grip, another with a blue grip), a white golf ball, and a wooden bench.

ACTION

- Tom is attempting to play golf, swinging a club at a golf ball.
- He loses his balance and falls backward, then quickly recovers and stands up, still holding the club.

CAMERA & FRAMING

- Static side view, medium shot.
- The camera remains fixed, capturing Tom’s full body and the immediate surroundings.

Missed main gag: the model gets the setting and characters right and notes that Tom falls once, but completely skips the second hit and the golf club "job" that is the core of the slapstick joke.

Failure case 4 — inserting Tom and Jerry

ENVIRONMENT

- Inside a simple room with light-colored walls and wooden floor.
- A plain wall with a framed picture or mirror is visible in the background.
- Warm, even lighting typical of classic cartoon animation.

CHARACTERS

- Tom: a grey cat with white paws, wearing a red shirt, white shorts, and a large straw hat. He appears to be dancing or struggling with glee.
- Jerry: a small brown mouse, wearing a red shirt and white shorts, also wearing a small red bow on his head. He is being carried or held by Tom.

PROPS

- A large straw hat worn by Tom.
- A long, thin stick or pole held by Tom, which he is using to balance or dance with.

ACTION

- Tom is energetically dancing or stumbling while holding Jerry in his arms.
- Jerry is laughing and appears to be enjoying the chaos.
- The scene transitions to Tom and Jerry suddenly breaking apart, with Jerry running away as Tom stumbles.

CAMERA & FRAMING

- Static medium shot, keeping both characters in frame throughout the sequence.
- The camera remains fixed, allowing the characters' movements to drive the action.

Hallucinated Tom & Jerry: the original clip has different characters, but the model forces Tom and Jerry into the description and then invents matching actions to stay self-consistent. LLMs and their self-consistency ):
Why this matters: these failure modes show where automatic labels need better prompts, light post-filtering, a stronger model, or more frames per video.

Future Work

This is a living project and there are a bunch of directions I want to explore next. The bullets below are placeholders and will be updated as the project evolves.

  • Resolution × CFG benchmarks: refine the evaluation protocol and run a full sweep across CFG scales, comparing V1 vs V2 side-by-side on SSIM, FID, LPIPS, and temporal metrics.
  • Motion-score conditioning: experiment with adding a motion score (e.g. Unimatch + VMAF, similar to how SANA-Video scores motion) as an extra token appended to the text conditioning, and train the model to respond to “more / less motion” controls.
  • RL-style reward model: take inspiration from “Video Generation Models Are Good Latent Reward Models”, clone a pruned copy of the video transformer as a latent reward model that scores “real vs fake” Tom & Jerry clips, and use those scores in a lightweight RL loop to sharpen dynamics and reduce artifacts.
Placeholder: all items above are draft ideas. I will replace them with the real roadmap once more experiments are completed.