Project Overview
Goal
The core goal of this project is to generate controlled Tom & Jerry–style videos on consumer GPUs in a reasonable amount of time, instead of needing giant lab hardware or overnight sampling for every clip.
The model of choice is SANA-Video 2B, a recent
efficient text-to-video diffusion model that uses
flow-matching training, a relatively compact 2B-parameter transformer,
and a video-optimized VAE. This experiment bends that general model toward a very specific
corner of style space:
classic Tom & Jerry cartoons, with readable motion and
consistent characters, while still keeping text control alive.
Resolution vs Efficiency Tradeoff
Even though SANA-Video is relatively efficient, running it at its native 480p-ish resolution is still heavy on a single 12–24 GB GPU, especially if we want long clips, no distilled or pruned model (for now), and ~50 inference steps.
To make the project actually usable on normal hardware with acceptable training compute, the first trade-off accepted is resolution for efficiency and memory. Instead of generating at full 480p (the base training resolution), this experiment locks the model to 224×224 and focuses on style, motion, and controllability rather than pure sharpness.
| Setting | Resolution | Latent shape (C × T × H × W) | VAE encode + decode (per 81-frame clip) | Diffusion step time (per step) | Peak VRAM (transformer only in memory) |
|---|---|---|---|---|---|
| Native SANA-Video | ~480p (e.g. 832×480) | 16 × 21 × 60 × 104 | OOM | ~1577 ms | 6373.37 MiB |
| Tom & Jerry setup | 224×224 | 16 × 21 × 28 × 28 | 63 ms + 110 ms | ~243 ms | 4578.80 MiB |
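As a quick sanity check on the shapes above, the latent dimensions follow directly from the VAE's compression factors. This is a minimal sketch assuming 8× spatial and 4× temporal compression with 16 latent channels, which is what the measured shapes imply; the real VAE configuration may differ.

```python
def latent_shape(frames: int, height: int, width: int,
                 channels: int = 16, spatial: int = 8, temporal: int = 4):
    """Latent shape (C, T, H, W) implied by the compression factors above."""
    t = (frames - 1) // temporal + 1   # 81 frames -> 21 latent frames
    return (channels, t, height // spatial, width // spatial)

print(latent_shape(81, 480, 832))  # (16, 21, 60, 104) -- native SANA-Video
print(latent_shape(81, 224, 224))  # (16, 21, 28, 28)  -- Tom & Jerry setup
```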
V1: Class-Only Training Run
This first pass is a V1 baseline: treat the entire Tom & Jerry world as a
single class, and adapt only the diffusion transformer with LoRA while the rest
of SANA-Video stays frozen (and in practice is never loaded into GPU memory). The goal is to see
how far we can push the style at 224×224 while still keeping some of the original
text-conditioning behaviour alive.
LoRA is used instead of full fine-tuning for practicality and speed: it is parameter-efficient, easy to swap in and out, and often reaches quality that is comparable to a fully fine-tuned model. For a deeper dive into how LoRA compares to full fine-tuning and why “LoRA everywhere” (not just on attention layers) can work well, see the “LoRA Without Regret” blog post.
- Start from base SANA-Video 2B (`SANA-Video_2B_480p_diffusers`).
- Freeze everything except LoRA adapters on all linear layers in the diffusion transformer (a minimal setup sketch follows this list).
- Train with a single class prompt describing the Tom & Jerry universe.
- Resolution: 224×224.
- Dataset: curated Tom & Jerry-style clips (resized + cleaned), with ~16k precomputed latents corresponding to roughly 5-second clips.
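A minimal sketch of this setup with PEFT, assuming the SANA-Video diffusion transformer has already been loaded as a plain PyTorch module (the `transformer` variable is a placeholder; hyperparameter values come from the table below):

```python
import torch
import bitsandbytes as bnb
from peft import LoraConfig, get_peft_model

# transformer = ...  # SANA-Video 2B diffusion transformer, loaded separately (placeholder)
transformer.requires_grad_(False)  # freeze every base weight

# "LoRA everywhere": target every nn.Linear in the transformer, not just attention.
target_modules = sorted({
    name for name, module in transformer.named_modules()
    if isinstance(module, torch.nn.Linear)
})

lora_config = LoraConfig(
    r=16,             # rank
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=target_modules,
)
transformer = get_peft_model(transformer, lora_config)
transformer.print_trainable_parameters()  # only the LoRA adapters should be trainable

# 8-bit AdamW over the LoRA parameters only.
optimizer = bnb.optim.AdamW8bit(
    [p for p in transformer.parameters() if p.requires_grad], lr=2e-4
)
```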
Key Hyperparameters
| Component | Value |
|---|---|
| Base model | SANA-Video_2B_480p_diffusers |
| Resolution | 224 × 224 |
| Clip length | 81 frames |
| Batch size | 8 |
| Optimizer | AdamW8bit (LoRA parameters only) |
| Learning rate | 2e-4 |
| LoRA rank | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.1 |
| Training objective | Flow Matching (velocity prediction) |
| Steps | 10k |
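For context, one training step under the flow-matching (velocity prediction) objective looks roughly like the sketch below. It operates on the precomputed video latents and cached text embeddings; the transformer call signature is illustrative rather than the exact SANA-Video API, and the timestep distribution is simplified to uniform.

```python
import torch
import torch.nn.functional as F

def flow_matching_step(transformer, optimizer, latents, text_emb):
    """One simplified rectified-flow training step.

    latents:  clean video latents, shape (B, C, T, H, W)
    text_emb: cached text-encoder embeddings for the class prompt
    """
    b = latents.shape[0]
    noise = torch.randn_like(latents)            # x_1 ~ N(0, I)
    t = torch.rand(b, device=latents.device)     # timesteps in [0, 1] (uniform for simplicity)
    t_ = t.view(b, 1, 1, 1, 1)

    x_t = (1.0 - t_) * latents + t_ * noise      # straight-line path between data and noise
    target = noise - latents                     # velocity: d x_t / d t

    pred = transformer(x_t, timestep=t, encoder_hidden_states=text_emb)  # illustrative signature
    loss = F.mse_loss(pred, target)

    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.detach()
```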
Class Prompt
The first run uses a single class prompt as the main style anchor for the whole training:
A vintage slapstick 2D cartoon scene of a grey cat chasing a small brown mouse in a colorful house, Tom and Jerry style, bold outlines, limited color palette, exaggerated expressions, smooth character motion.
Checkpoint Progression (Same Seed & Prompt)
Every few steps, a clip is generated with a fixed seed and CFG. This makes it easy to see how the LoRA gradually steers SANA-Video from its generic prior into the Tom & Jerry regime.
All progression clips use the same seed (69420), the same prompt (the class prompt), and the same CFG (6). The only thing that changes across columns is the LoRA checkpoint.
Worth noting: the baseline SANA-Video samples at this
resolution are often even weaker than the one shown here — the model
was never trained natively at 224×224, and it shows.
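The progression grid is produced with a loop like the one below. The pipeline object, its LoRA-loading methods, and the checkpoint paths are assumptions (any diffusers-style video pipeline with LoRA support would look similar); the important part is that the seed (69420), prompt, resolution, and CFG stay fixed while only the LoRA checkpoint changes.

```python
import torch

# pipe = ...  # SANA-Video pipeline at 224x224, assumed to expose diffusers-style LoRA loading

CLASS_PROMPT = (
    "A vintage slapstick 2D cartoon scene of a grey cat chasing a small brown mouse "
    "in a colorful house, Tom and Jerry style, ..."  # the class prompt from above
)
checkpoints = ["lora_step_01000", "lora_step_05000", "lora_step_10000"]  # hypothetical paths

for ckpt in checkpoints:
    pipe.load_lora_weights(ckpt)
    generator = torch.Generator(device="cuda").manual_seed(69420)  # fixed seed
    video = pipe(
        prompt=CLASS_PROMPT,
        height=224, width=224, num_frames=81,
        guidance_scale=6.0,           # same CFG for every checkpoint
        generator=generator,
    ).frames[0]
    # save `video` as one column of the comparison grid
    pipe.unload_lora_weights()        # drop this adapter before loading the next one
```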
Text-Conditioned Experiments
After the class-LoRA run converges, the next question is: how much text-conditioning power is left? To test this, we reuse the same LoRA weights but vary the prompts.
All of the clips below share:
- The same LoRA checkpoint.
- The same seed (32420).
- Identical sampling settings (steps, CFG, etc.).
V0: CFG Sweep Evaluation
I know what you're thinking... what are those flashes and blurry patches? Until now we have been generating with a fixed CFG of 6.0, which causes those weird distortions. Using 100 ground-truth Tom & Jerry segments, we generate 40 corresponding clips with the V1 LoRA at different CFG scales (but with the same list of seeds) and compare each generated clip to its ground-truth counterpart.
For each CFG setting, we compute:
- SSIM ↑ – structural similarity to the ground-truth frames.
- PSNR ↑ – per-frame signal-to-noise ratio.
- LPIPS ↓ – perceptual distance (AlexNet or VGG backbone).
- FID ↓ – frame-wise FID over all frames (Inception-V3 features).
- tLPIPS ↓ – temporal LPIPS between consecutive frames (motion smoothness); a short sketch of this metric appears after the results table.
| Model | CFG scale | # eval clips | SSIM ↑ | PSNR ↑ (dB) | LPIPS ↓ | FID ↓ | tLPIPS ↓ |
|---|---|---|---|---|---|---|---|
| Sana Video 2B Base | 6.0 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 2.0 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 2.5 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 3 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 4 | 40 | TBD | TBD | TBD | TBD | TBD |
| Tom & Jerry LoRA (V1, r16) | 5 | 40 | TBD | TBD | TBD | TBD | TBD |
All evaluation clips are generated at 224×224 with the same seed set and prompts as the held-out ground-truth segments. Metrics are averaged over all frames and all evaluation clips per CFG setting. Once the sweep is run, this table will be updated with the actual scores.
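SSIM, PSNR, and FID come from standard implementations (e.g. torchmetrics / pytorch-fid). The less common one is tLPIPS, so here is a minimal sketch of LPIPS and temporal LPIPS using the `lpips` package, assuming frames are float tensors in [-1, 1] with shape (T, 3, H, W):

```python
import torch
import lpips

lpips_fn = lpips.LPIPS(net="alex")  # AlexNet backbone

@torch.no_grad()
def clip_lpips(gen: torch.Tensor, gt: torch.Tensor) -> float:
    """Mean perceptual distance between generated and ground-truth frames."""
    return lpips_fn(gen, gt).mean().item()

@torch.no_grad()
def clip_tlpips(gen: torch.Tensor) -> float:
    """Temporal LPIPS: perceptual distance between consecutive generated frames.
    Lower values indicate smoother frame-to-frame motion."""
    return lpips_fn(gen[:-1], gen[1:]).mean().item()
```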
V2: Scene Dataset & Training
V1 was intentionally simple: one global class prompt and a LoRA that
“pulls” SANA-Video into the Tom & Jerry style at 224×224.
In V2, the goal is to move beyond a single prompt while preserving
as much control from the foundation model as possible, and to build a
richer, text-aware scene dataset that the model can actually react to.
Dataset creation
The idea is to extend the existing Tom & Jerry latent cache with per-clip
scene descriptions. Using recent advances in affordable multi-modal models and efficient inference pipelines (continuous batching, KV-cache, and vLLM in practice; for more details, explore this blog), labeling a large dataset with detailed scene descriptions becomes realistic with reasonable engineering effort:
- Start from the same curated clips used in V1 (already resized & cached as latents).
- Use Qwen3-VL-8B served via vLLM to auto-label each segment.
- Add a CSV/JSON file that stores one `text_prompt` per clip or segment.
- Encode these text prompts once and cache the embeddings alongside the video latents.
Scene description template & labeling prompt
Each segment is described with a structured scene card (environment, characters, props, action, camera). The same template is used as the base for all prompts fed into the VLM.
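A minimal sketch of the labeling loop against a vLLM server exposing the OpenAI-compatible API. The model id, port, and the condensed scene-card prompt here are placeholders for the actual template described above; each segment is represented by a handful of sampled frames sent as base64 images.

```python
import base64
from openai import OpenAI

# Server side (illustrative): vllm serve Qwen/Qwen3-VL-8B-Instruct --port 8000
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

SCENE_CARD_PROMPT = (  # condensed stand-in for the real scene-card template
    "Describe this Tom & Jerry segment as a structured scene card: "
    "environment, characters, props, main action, camera."
)

def encode_frame(path: str) -> dict:
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def label_segment(frame_paths: list[str]) -> str:
    """Label one video segment from a few sampled frames."""
    content = [{"type": "text", "text": SCENE_CARD_PROMPT}]
    content += [encode_frame(p) for p in frame_paths]
    resp = client.chat.completions.create(
        model="Qwen/Qwen3-VL-8B-Instruct",   # assumed model id on the server
        messages=[{"role": "user", "content": content}],
        max_tokens=512,
    )
    return resp.choices[0].message.content
```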
V2: Dataset Labeling Failure Modes
Qwen3-VL is a very strong VLM, but when I used it to label the Tom & Jerry dataset, several systematic failure modes kept showing up (it still labeled most scenes correctly, but I did notice these failure cases). Below are four representative cases, each showing the generated scene description on the left and the corresponding video on the right.
Failure case 1 — out of distribution
- A simple, dark blue background with no detailed environment.
- Mood: suspenseful and minimalist, with a focus on the characters and their
actions.
- Tom: a white-outlined cat with an angry expression, large ears, and whiskers.
He
is the main character, appearing in all frames.
- Jerry: a small brown mouse with a red nose and ears, running away from Tom. He
is
the secondary character.
- A green rectangular object (possibly a mouse trap or block) that is initially held by Jerry and then falls to the ground.
- Jerry runs away from Tom, who is holding a green object. Jerry drops the object, which falls to the ground. Tom then looks down at the object, and the scene ends with Tom listening intently.
- Static camera, medium shot, focused on the characters and their interaction. The camera does not move, maintaining a consistent view of the action.
Failure case 2 — cat = Tom
- A brick wall next to a window ledge with a framed picture of roses.
- The scene is set indoors, likely in a house, with warm, natural lighting
suggesting daytime.
- Jerry: a small brown mouse, wearing a red bowtie, with a mischievous and sly
personality vibe.
- Tom: a black and white cat, sitting atop a trash can, looking up with a wide,
grinning expression.
- A window ledge with a framed picture of roses.
- A trash can.
- Cardboard boxes and other debris around the trash can.
- Jerry is perched on the window ledge, looking down with a sly expression.
- Tom is sitting on top of a trash can, looking up at Jerry with a wide,
grinning
expression, seemingly amused or about to pounce.
- The camera is positioned at a low angle, looking up at the characters as they
move
through the air.
- The camera follows the action, panning slightly to keep both characters in
frame
as they move across the scene.
Failure case 3 — missed main action
- A sunny, open golf course with vibrant green grass.
- Background elements include a tree with a wooden bench under it, distant
trees,
and a clear blue sky.
- Mood: Bright, cheerful, cartoonish midday lighting.
- Tom: A grey cat with white paws and chest, wearing a white golf glove on his right paw. Personality vibe: Energetic, focused, and slightly comical in his attempt at golf.
- Golf clubs (one with a red grip, another with a blue grip), a white golf ball, and a wooden bench.
- Tom is attempting to play golf, swinging a club at a golf ball.
- He loses his balance and falls backward, then quickly recovers and stands up,
still holding the club.
- Static side view, medium shot.
- The camera remains fixed, capturing Tom’s full body and the immediate
surroundings.
Failure case 4 — inserting Tom and Jerry
- Inside a simple room with light-colored walls and wooden floor.
- A plain wall with a framed picture or mirror is visible in the background.
- Warm, even lighting typical of classic cartoon animation.
- Tom: a grey cat with white paws, wearing a red shirt, white shorts, and a
large
straw hat. He appears to be dancing or struggling with glee.
- Jerry: a small brown mouse, wearing a red shirt and white shorts, also wearing
a
small red bow on his head. He is being carried or held by Tom.
- A large straw hat worn by Tom.
- A long, thin stick or pole held by Tom, which he is using to balance or dance
with.
- Tom is energetically dancing or stumbling while holding Jerry in his arms.
- Jerry is laughing and appears to be enjoying the chaos.
- The scene transitions to Tom and Jerry suddenly breaking apart, with Jerry
running
away as Tom stumbles.
- Static medium shot, keeping both characters in frame throughout the
sequence.
- The camera remains fixed, allowing the characters' movements to drive the
action.
Future Work
This is a living project and there are a bunch of directions I want to explore next. The bullets below are placeholders and will be updated as the project evolves.
- Resolution × CFG benchmarks: refine the evaluation protocol and run a full sweep across CFG scales, comparing V1 vs V2 side-by-side on SSIM, FID, LPIPS, and temporal metrics.
- Motion-score conditioning: experiment with adding a motion score (e.g. Unimatch + VMAF, similar to how SANA-Video scores motion) as an extra token appended to the text conditioning, and train the model to respond to “more / less motion” controls.
- RL-style reward model: take inspiration from Video Generation Models Are Good Latent Reward Models, clone a pruned copy of the video transformer as a latent reward model that scores “real vs fake” Tom & Jerry clips, and use those scores in a lightweight RL loop to sharpen dynamics and reduce artifacts.