SANA-Video × Tom & Jerry Research Log

Fine-tuning SANA-Video 2B toward efficient, low-budget Tom & Jerry–style videos at low resolution on consumer GPUs.

Project Overview

Goal

The core goal of this project is to generate controlled Tom & Jerry–style videos on consumer GPUs in a reasonable amount of time, instead of needing giant lab hardware or overnight sampling for every clip.

The model of choice is SANA-Video 2B, a recent efficient text-to-video diffusion model that uses flow-matching training, a relatively compact 2B-parameter transformer, and a video-optimized VAE. This experiment bends that general model toward a very specific corner of style space: classic Tom & Jerry cartoons, with readable motion and consistent characters, while still keeping text control alive.

Resolution vs Efficiency Tradeoff

Even though SANA-Video is relatively efficient, running it at its native 480p-ish resolution is still heavy on a single 12–24 GB GPU, especially if we want long clips, no distilled or pruned model (for now :) ), and ~50 inference steps.

To make the project actually usable on normal hardware and in acceptable training compute, the first accepted trade-off is resolution for efficiency and memory. Instead of generating full 480p (the base training resolution), this experiment locks the model to 224×224 and focuses on style, motion, and controllability rather than pure sharpness.

Reference timings and memory per setting:

| Setting | Resolution | Latent shape (C × T × H × W) | VAE encode + decode (per 81-frame clip) | Diffusion step time (per step) | Peak VRAM (transformer only in memory) |
| --- | --- | --- | --- | --- | --- |
| Native SANA-Video | ~480p (e.g. 832×480) | 16 × 21 × 60 × 104 | OOM | ~1577 ms | 6373.37 MiB |
| Tom & Jerry setup | 224×224 | 16 × 21 × 28 × 28 | 63 ms + 110 ms | ~243 ms | 4578.80 MiB |

Benchmarks: these reference numbers were measured on an L40 AWS instance with the model in bf16.
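For intuition, the latent shapes above follow directly from the video VAE's compression factors. The sketch below assumes 8× spatial and 4× temporal compression (with the first frame kept) and 16 latent channels; these factors are inferred from the shapes in the table rather than read from the SANA-Video config.

```python
# Sketch: reproduce the latent shapes from the table above.
# Assumed compression factors (inferred from the table, not from the model config):
#   16 latent channels, 8x spatial downsampling, 4x temporal downsampling
#   with the first frame kept, i.e. T_latent = (T - 1) // 4 + 1.
def latent_shape(num_frames: int, height: int, width: int,
                 channels: int = 16, spatial_ds: int = 8, temporal_ds: int = 4):
    t_latent = (num_frames - 1) // temporal_ds + 1
    return (channels, t_latent, height // spatial_ds, width // spatial_ds)

print(latent_shape(81, 480, 832))  # (16, 21, 60, 104) -> native ~480p setup
print(latent_shape(81, 224, 224))  # (16, 21, 28, 28)  -> Tom & Jerry setup
```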
Tom & Jerry LoRA sample at 10k steps (seed 69420)
Sample frame from the LoRA-tuned model at 10k steps. Same seed and prompt as the base model; only the LoRA weights change.

V1: Class-Only Training Run

This first pass is a V1 baseline: treat the entire Tom & Jerry world as a single class, and adapt only the diffusion transformer with LoRA while the rest of SANA-Video stays frozen (and in practice is never loaded into GPU memory). The goal is to see how far we can push the style at 224×224 while still keeping some of the original text-conditioning behaviour alive.

LoRA is used instead of full fine-tuning for practicality and speed: it is parameter-efficient, easy to swap in and out, and often reaches quality that is comparable to a fully fine-tuned model. For a deeper dive into how LoRA compares to full fine-tuning and why “LoRA everywhere” (not just on attention layers) can work well, see the “LoRA Without Regret” blog post.

  • Start from base SANA-Video 2B (SANA-Video_2B_480p_diffusers).
  • Freeze everything except LoRA adapters on all linear layers in the diffusion transformer.
  • Train with a single class prompt describing the Tom & Jerry universe.
  • Resolution: 224×224.
  • Dataset: curated Tom & Jerry–style clips (resized + cleaned), with ~16k precomputed latents corresponding to roughly 5-second clips.
Why all linear layers? Inspired by “LoRA Without Regret”, LoRA applied to all linear layers in the transformer often gives better adaptation than limiting it only to attention projections, while still keeping the number of trainable parameters small.
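As a rough sketch of what this adapter setup can look like with peft (the transformer below is a tiny stand-in module just to keep the example runnable; the real training code may wire things differently, and recent peft versions also accept `target_modules="all-linear"` as a shortcut):

```python
import torch.nn as nn
from peft import LoraConfig, get_peft_model

# Tiny stand-in for the SANA-Video 2B diffusion transformer; in practice this
# is the real transformer loaded from the SANA-Video_2B_480p_diffusers checkpoint.
transformer = nn.Sequential(nn.Linear(64, 128), nn.GELU(), nn.Linear(128, 64))

# Collect every linear layer so LoRA is applied everywhere, not only to the
# attention projections (per "LoRA Without Regret").
linear_names = [name for name, module in transformer.named_modules()
                if isinstance(module, nn.Linear)]

lora_cfg = LoraConfig(
    r=16,                      # rank, matches the hyperparameter table below
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=linear_names,
)

lora_model = get_peft_model(transformer, lora_cfg)
lora_model.print_trainable_parameters()  # only the LoRA adapters are trainable
```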

Key Hyperparameters

| Component | Value |
| --- | --- |
| Base model | SANA-Video_2B_480p_diffusers |
| Resolution | 224 × 224 |
| Clip length | 81 frames |
| Batch size | 8 |
| Optimizer | AdamW8bit (LoRA parameters only) |
| Learning rate | 2e-4 |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| Training objective | Flow matching (velocity prediction) |
| Steps | 10k |
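For reference, the flow-matching (velocity-prediction) objective boils down to something like the sketch below. It assumes the common rectified-flow convention (x_t = (1 − t)·x0 + t·noise, target velocity = noise − x0); the actual SANA-Video training loop adds details such as timestep shifting and loss weighting that are omitted here.

```python
import torch
import torch.nn.functional as F

def flow_matching_loss(model, x0, text_emb):
    """One flow-matching training step with velocity prediction (sketch).

    x0: clean video latents, e.g. shape (8, 16, 21, 28, 28) for this setup.
    Assumed convention: x_t = (1 - t) * x0 + t * noise, target v = noise - x0.
    """
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)      # one timestep per sample in [0, 1]
    t_ = t.view(b, 1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = (1 - t_) * x0 + t_ * noise         # point on the straight noise-to-data path
    target_v = noise - x0                     # velocity along that path
    pred_v = model(x_t, t, text_emb)          # transformer predicts the velocity
    return F.mse_loss(pred_v, target_v)
```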

Class Prompt

The first run uses a single class prompt as the main style anchor for the whole training:

A vintage slapstick 2D cartoon scene of a grey cat chasing a small brown mouse in a colorful house, Tom and Jerry style, bold outlines, limited color palette, exaggerated expressions, smooth character motion.
Why a single prompt? It deliberately conflates many things (character identities, environment, motion, and style) so the LoRA learns a dense “Tom & Jerry prior” before we start testing text-conditioning flexibility; it also sidesteps the fact that the Tom & Jerry clips are unlabeled at this stage.

Checkpoint Progression (Same Seed & Prompt)

At regular intervals during training, a clip is generated with a fixed seed and CFG scale. This makes it easy to see how the LoRA gradually steers SANA-Video from its generic prior into the Tom & Jerry regime.

Base SANA-Video sample (no LoRA)
Base SANA-Video 2B @ 224×224 (seed 69420)
LoRA step 100
Step 100
Still messy; style only slightly nudged away from the base.
LoRA step 1,000
Step 1k
Clearer outlines, more cartoony color palette starting to appear.
LoRA step 5,000
Step 5k
Motion and silhouettes look much closer to classic TV-era animation; Tom is a lot easier to read.
LoRA step 10,000
Step 10k
Style mostly locked in; further training starts to trade diversity for fidelity.
All clips above: same seed (69420), same prompt (class prompt), same CFG (6). The only thing that changes across columns is the LoRA checkpoint. Worth noting: the baseline SANA-Video samples at this resolution are often even weaker than the one shown here — the model was never trained natively at 224×224, and it shows.
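A rough sketch of how these fixed-seed comparisons can be produced, assuming the checkpoint loads as a standard diffusers pipeline with the usual LoRA helpers; the model path, the local LoRA checkpoint paths, and the exact generation kwargs (height/width/num_frames/fps/etc.) are assumptions rather than the project's actual script:

```python
import torch
from diffusers import DiffusionPipeline
from diffusers.utils import export_to_video

CLASS_PROMPT = ("A vintage slapstick 2D cartoon scene of a grey cat chasing a small "
                "brown mouse in a colorful house, Tom and Jerry style, bold outlines, "
                "limited color palette, exaggerated expressions, smooth character motion.")

pipe = DiffusionPipeline.from_pretrained(
    "SANA-Video_2B_480p_diffusers",   # local path or hub id, as used in the project
    torch_dtype=torch.bfloat16,
).to("cuda")

for step in [100, 1_000, 5_000, 10_000]:
    pipe.unload_lora_weights()                          # drop the previous adapter
    pipe.load_lora_weights(f"lora_ckpts/step_{step}")   # hypothetical checkpoint path
    out = pipe(
        prompt=CLASS_PROMPT,
        height=224, width=224, num_frames=81,
        num_inference_steps=50, guidance_scale=6.0,
        generator=torch.Generator("cuda").manual_seed(69420),  # same seed for every column
    )
    export_to_video(out.frames[0], f"step_{step}.mp4", fps=16)  # 81 frames ≈ 5 s at 16 fps
```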

Text-Conditioned Experiments

After the class-LoRA run converges, the next question is: how much text-conditioning power is left? To test this, we reuse the same LoRA weights but vary the prompts.

All of the clips below share:

  • The same LoRA checkpoint.
  • The same seed (32420).
  • Identical sampling settings (steps, CFG, etc.).
Class baseline
Class baseline
Same style anchor as the training prompt; sanity check for the LoRA behavior.
Night living room sneaking
Night living room
Tests lighting changes and sneaking motion in a cozy living room.
Vertical chase in old mansion
Vertical staircase chase
Same style, different geometry and motion (up/down motion in an old mansion stairwell).
Tom eating pizza
Tom eating pizza
Adds an explicit object + action (pizza) on top of the base Tom & Jerry prompt setup.
Prank outside
Backyard prank
Moves the scene outdoors to a sunny backyard with slapstick prank dynamics.
Takeaway: Even after strong class-style tuning, the model still responds to changes in lighting, location, and simple action cues while keeping the Tom & Jerry look. More systematic evaluation is still TODO; full prompts for each video are documented in the GitHub repo.

V0: CFG Sweep Evaluation

I know what you're thinking... what are those flashes and that blurriness? Until now, every generation used a fixed CFG of 6.0, which causes those weird distortions. Using 100 ground-truth Tom & Jerry segments, we generate 40 corresponding clips with the V1 LoRA at different CFG scales (same list of seeds each time) and compare each generated clip to its ground-truth counterpart.

For each CFG setting, we compute:

  • SSIM ↑ – structural similarity to the ground-truth frames.
  • PSNR ↑ – per-frame signal-to-noise ratio.
  • LPIPS ↓ – perceptual distance (AlexNet or VGG backbone).
  • FID ↓ – frame-wise FID over all frames (Inception-V3 features).
  • tLPIPS ↓ – temporal LPIPS between consecutive frames (motion smoothness).
| Model | CFG scale | # eval clips | SSIM ↑ | PSNR ↑ (dB) | LPIPS ↓ | FID ↓ | tLPIPS ↓ |
| --- | --- | --- | --- | --- | --- | --- | --- |
| SANA-Video 2B base | 6.0 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 2.0 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 2.5 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 3.0 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 4.0 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 5.0 | 40 | TBD | TBD | TBD | TBD | TBD |
Protocol: all clips are generated at 224×224 with the same seed set and prompts as the held-out ground-truth segments. Metrics are averaged over all frames and all evaluated clips per CFG setting. Once the sweep is run, this table will be updated with the actual scores.
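A sketch of the per-clip metric computation (SSIM/PSNR via scikit-image, LPIPS and tLPIPS via the lpips package; FID is accumulated over all frames separately, e.g. with torchmetrics, and is left out here). The tLPIPS definition below, LPIPS between consecutive generated frames, is one common variant and an assumption on my part:

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")  # or net="vgg"

def to_lpips_tensor(frames: np.ndarray) -> torch.Tensor:
    # frames: (T, H, W, 3) uint8 -> (T, 3, H, W) float in [-1, 1]
    return torch.from_numpy(frames).permute(0, 3, 1, 2).float() / 127.5 - 1.0

def clip_metrics(gen: np.ndarray, gt: np.ndarray) -> dict:
    """Per-clip scores for one generated clip vs its ground-truth counterpart."""
    ssim = np.mean([structural_similarity(g, r, channel_axis=-1)
                    for g, r in zip(gen, gt)])
    psnr = np.mean([peak_signal_noise_ratio(r, g) for g, r in zip(gen, gt)])
    g_t, r_t = to_lpips_tensor(gen), to_lpips_tensor(gt)
    with torch.no_grad():
        lp = lpips_fn(g_t, r_t).mean().item()            # frame-wise perceptual distance
        tlp = lpips_fn(g_t[1:], g_t[:-1]).mean().item()  # consecutive-frame (temporal) LPIPS
    return {"ssim": float(ssim), "psnr": float(psnr), "lpips": lp, "tlpips": tlp}
```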

V2: Scene Dataset & Training

V1 was intentionally simple: one global class prompt and a LoRA that “pulls” SANA-Video into the Tom & Jerry style at 224×224. In V2, the goal is to move beyond a single prompt while preserving as much control from the foundation model as possible, and to build a richer, text-aware scene dataset that the model can actually react to.

Dataset creation

The idea is to extend the existing Tom & Jerry latent cache with per-clip scene descriptions. Using recent advances in affordable multi-modal models and efficient inference pipelines (continuous batching, KV caching; vLLM in practice; for more details explore this blog), labeling a large dataset with detailed scene descriptions becomes realistic with reasonable engineering effort (a sketch of the labeling loop follows the list below):

  • Start from the same curated clips used in V1 (already resized & cached as latents).
  • Use Qwen3-VL-8B served via vLLM to auto-label each segment.
  • Add a CSV/JSON file that stores one text_prompt per clip or segment.
  • Encode these text prompts once and cache the embeddings alongside the video latents.
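A minimal sketch of the auto-labeling loop, assuming Qwen3-VL-8B is served behind vLLM's OpenAI-compatible endpoint and that a handful of frames per segment are sent as base64 images; the endpoint URL, served model name, and frame-sampling scheme are placeholders, not the exact setup used here:

```python
import base64
from openai import OpenAI

# vLLM's OpenAI-compatible server, e.g. started with `vllm serve <model>`.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

INSTRUCTION = ("Describe this Tom & Jerry segment as a structured scene card with the "
               "sections ENVIRONMENT, CHARACTERS, PROPS, ACTION, CAMERA & FRAMING.")

def label_segment(frame_paths: list[str], model: str = "Qwen3-VL-8B") -> str:
    """frame_paths: a few JPEG frames sampled from one video segment (placeholder scheme)."""
    content = [{"type": "text", "text": INSTRUCTION}]
    for path in frame_paths:
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"}})
    resp = client.chat.completions.create(
        model=model,                               # name as served by vLLM (assumed)
        messages=[{"role": "user", "content": content}],
        max_tokens=512,
    )
    return resp.choices[0].message.content
```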

Scene description template & labeling prompt

Each segment is described with a structured scene card (environment, characters, props, action, camera). The same template is used as the base for all prompts fed into the VLM.

Idea: one Tom & Jerry video segment → one rich scene card. Later, these cards become the text backbone for V2 training and for video → text → video experiments, and eventually for finer scene control, where a “constructor” LLM generates many scene cards and chains them together.
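For illustration, a plausible version of that scene-card template as a labeling prompt, reconstructed from the section headings visible in the failure cases below (the actual prompt used in the project may differ):

```python
SCENE_CARD_TEMPLATE = """You are labeling a short Tom & Jerry video segment.
Fill in the scene card below using only what is visible in the frames.

ENVIRONMENT
- Setting, background elements, lighting, overall mood.

CHARACTERS
- Each character on screen: appearance, colors, personality vibe.

PROPS
- Objects that matter for the action.

ACTION
- What happens over the segment, in order.

CAMERA & FRAMING
- Shot type, camera angle, and any camera motion.
"""
```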

V2: Dataset Labeling Failure Modes

Qwen3-VL is a very strong VLM, but when I used it to label the Tom & Jerry dataset, several systematic failure modes kept showing up (it still labeled most scenes correctly, but these failure cases were noticeable). Below are four representative cases, each showing the generated scene description on the left and the corresponding video on the right.

Failure case 1 — out of distribution

ENVIRONMENT

- A simple, dark blue background with no detailed environment.
- Mood: suspenseful and minimalist, with a focus on the characters and their actions.

CHARACTERS

- Tom: a white-outlined cat with an angry expression, large ears, and whiskers. He is the main character, appearing in all frames.
- Jerry: a small brown mouse with a red nose and ears, running away from Tom. He is the secondary character.

PROPS

- A green rectangular object (possibly a mouse trap or block) that is initially held by Jerry and then falls to the ground.

ACTION

- Jerry runs away from Tom, who is holding a green object. Jerry drops the object, which falls to the ground. Tom then looks down at the object, and the scene ends with Tom listening intently.

CAMERA & FRAMING

- Static camera, medium shot, focused on the characters and their interaction. The camera does not move, maintaining a consistent view of the action.

Out-of-distribution clip: the scene is abstract and sparse, but the model hallucinates a full Tom & Jerry chase with extra props and a wrong interpretation of the motion.

Failure case 2 — cat = Tom

ENVIRONMENT

- A brick wall next to a window ledge with a framed picture of roses.
- The scene is set indoors, likely in a house, with warm, natural lighting suggesting daytime.

CHARACTERS

- Jerry: a small brown mouse, wearing a red bowtie, with a mischievous and sly personality vibe.
- Tom: a black and white cat, sitting atop a trash can, looking up with a wide, grinning expression.

PROPS

- A window ledge with a framed picture of roses.
- A trash can.
- Cardboard boxes and other debris around the trash can.

ACTION

- Jerry is perched on the window ledge, looking down with a sly expression.
- Tom is sitting on top of a trash can, looking up at Jerry with a wide, grinning expression, seemingly amused or about to pounce.

CAMERA & FRAMING

- The camera is positioned at a low angle, looking up at the characters as they move through the air.
- The camera follows the action, panning slightly to keep both characters in frame as they move across the scene.

“Every cat is Tom”: any vaguely cat-shaped character is labeled as Tom, sometimes with Jerry added on top, even when the original design clearly differs.

Failure case 3 — missed main action

ENVIRONMENT

- A sunny, open golf course with vibrant green grass.
- Background elements include a tree with a wooden bench under it, distant trees, and a clear blue sky.
- Mood: Bright, cheerful, cartoonish midday lighting.

CHARACTERS

- Tom: A grey cat with white paws and chest, wearing a white golf glove on his right paw. Personality vibe: Energetic, focused, and slightly comical in his attempt at golf.

PROPS

- Golf clubs (one with a red grip, another with a blue grip), a white golf ball, and a wooden bench.

ACTION

- Tom is attempting to play golf, swinging a club at a golf ball.
- He loses his balance and falls backward, then quickly recovers and stands up, still holding the club.

CAMERA & FRAMING

- Static side view, medium shot.
- The camera remains fixed, capturing Tom’s full body and the immediate surroundings.

Missed main gag: the model gets the setting and characters right and notes that Tom falls once, but completely skips the second hit and the golf club "job" that is the core of the slapstick joke.

Failure case 4 — inserting Tom and Jerry

ENVIRONMENT

- Inside a simple room with light-colored walls and wooden floor.
- A plain wall with a framed picture or mirror is visible in the background.
- Warm, even lighting typical of classic cartoon animation.

CHARACTERS

- Tom: a grey cat with white paws, wearing a red shirt, white shorts, and a large straw hat. He appears to be dancing or struggling with glee.
- Jerry: a small brown mouse, wearing a red shirt and white shorts, also wearing a small red bow on his head. He is being carried or held by Tom.

PROPS

- A large straw hat worn by Tom.
- A long, thin stick or pole held by Tom, which he is using to balance or dance with.

ACTION

- Tom is energetically dancing or stumbling while holding Jerry in his arms.
- Jerry is laughing and appears to be enjoying the chaos.
- The scene transitions to Tom and Jerry suddenly breaking apart, with Jerry running away as Tom stumbles.

CAMERA & FRAMING

- Static medium shot, keeping both characters in frame throughout the sequence.
- The camera remains fixed, allowing the characters' movements to drive the action.

Hallucinated Tom & Jerry: the original clip has different characters, but the model forces Tom and Jerry into the description and then invents matching actions to stay self-consistent. LLMs and their self-consistency ):
Why this matters: these failure modes show where automatic labels need better prompts, light post-filtering, a stronger model, or more frames per video.

Future Work

This is a living project and there are a bunch of directions I want to explore next. The bullets below are placeholders and will be updated as the project evolves.

  • Resolution × CFG benchmarks: refine the evaluation protocol and run a full sweep across CFG scales, comparing V1 vs V2 side-by-side on SSIM, FID, LPIPS, and temporal metrics.
  • Motion-score conditioning: experiment with adding a motion score (e.g. Unimatch + VMAF, similar to how SANA-Video scores motion) as an extra token appended to the text conditioning, and train the model to respond to “more / less motion” controls.
  • RL-style reward model: take inspiration from “Video Generation Models Are Good Latent Reward Models”, clone a pruned copy of the video transformer as a latent reward model that scores “real vs fake” Tom & Jerry clips, and use those scores in a lightweight RL loop to sharpen dynamics and reduce artifacts.
Placeholder: all items above are draft ideas. I will replace them with the real roadmap once more experiments are completed.