The Complete AI Video Glossary

203+ defined AI video terms

A

Adapter

#

An Adapter is a small, lightweight module added to a large AI model (such as a diffusion model or transformer) that allows the model to learn new abilities without retraining all of its original weights. Instead of modifying the entire model—which would be slow, expensive, and could degrade its original capabilities—an adapter introduces a tiny set of additional parameters that “plug into” the model at key layers.
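
As a rough sketch, a bottleneck adapter in PyTorch might look like the following; the BottleneckAdapter name and dimensions are illustrative rather than taken from any particular model:

    import torch
    import torch.nn as nn

    class BottleneckAdapter(nn.Module):
        """Minimal bottleneck adapter: down-project, nonlinearity, up-project,
        added residually to a frozen layer's output."""
        def __init__(self, dim: int, bottleneck: int = 64):
            super().__init__()
            self.down = nn.Linear(dim, bottleneck)
            self.act = nn.GELU()
            self.up = nn.Linear(bottleneck, dim)
            nn.init.zeros_(self.up.weight)  # start as an identity so the base model is unchanged
            nn.init.zeros_(self.up.bias)

        def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
            return hidden_states + self.up(self.act(self.down(hidden_states)))

    # Only the adapter's parameters are trained; the base model stays frozen.
    adapter = BottleneckAdapter(dim=768)
    x = torch.randn(2, 16, 768)        # (batch, tokens, channels) from a frozen layer
    out = adapter(x)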

Advanced Motion Tracking

#

Advanced Motion Tracking refers to techniques that analyze a sequence of video frames to detect, follow, and model the movement of objects, people, or the camera itself over time. The goal is to understand how things move, not just what they are.

AI Video Matting

#

AI Video Matting is the process of separating the foreground (such as a person, object, or animal) from the background in each frame of a video using machine learning. The output is usually a high-quality alpha matte—a grayscale transparency map that tells you exactly which pixels belong to the subject and which belong to the background.

Alpha Masking

#

Alpha Masking is the use of a transparency map (the alpha channel) to control which parts of an image or video are visible, partially visible, or hidden, allowing clean compositing and selective editing.
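
A minimal NumPy sketch of alpha compositing, assuming float images in the 0–1 range:

    import numpy as np

    def composite(foreground, background, alpha):
        """Blend a foreground over a background using an alpha matte.
        All arrays are float32 in [0, 1]; alpha has shape (H, W, 1)."""
        return alpha * foreground + (1.0 - alpha) * background

    h, w = 4, 4
    fg = np.ones((h, w, 3), dtype=np.float32)          # white subject
    bg = np.zeros((h, w, 3), dtype=np.float32)         # black background
    alpha = np.full((h, w, 1), 0.5, dtype=np.float32)  # 50% transparent mask
    frame = composite(fg, bg, alpha)                   # mid-gray where alpha = 0.5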

Animation LoRA

#

An Animation LoRA is a lightweight add-on module that teaches a diffusion or video model to generate a specific animation style or motion pattern without retraining the entire model.

Attention Maps

#

Attention Maps are visual or numerical representations showing which parts of an input (such as text, an image, or a video frame) the model focuses on when generating or analyzing content.

Audio Alignment

#

Audio Alignment is the process of synchronizing generated video content with an audio track by matching timing, rhythm, speech patterns, or other audio cues.

Audio Embeddings

#

Audio Embeddings are numerical representations of audio signals—such as speech, music, or sound effects—that capture their meaning or characteristics so an AI model can use them for conditioning or synchronization.

Audio Guidance

#

Audio Guidance is a technique where an audio source influences or controls how a video or image generation model behaves, such as shaping motion, timing, expressions, or overall visual rhythm. When generating videos with audio, the audio guidance value determines how strongly the audio affects the final output. Typical ranges are 0–3 or 0–5, where higher numbers mean the audio has a stronger impact on the visuals.

Audio-to-Motion

#

Audio-to-Motion refers to generating movement—such as dancing, lip motions, or expressive body gestures—based on an audio input, often by analyzing rhythm, pitch, phonemes, or energy levels.

Automatic1111

#

Automatic1111 is a web-based interface for running Stable Diffusion models, known for its wide range of extensions, customization options, and accessible UI. Although still widely used, it is no longer actively updated or maintained, and most new diffusion and video-generation features are now supported primarily by more modern tools such as ComfyUI.

Autoregressive Models

#

Autoregressive Models generate data one step at a time, with each new output depending on all previously generated outputs. In video generation, they can produce frames sequentially so that each frame builds on the last.

B

Background Removal

#

Background Removal is the process of isolating the subject of an image or video by removing or replacing the background using segmentation or matting techniques.

Base Model

#

A Base Model is the original, unmodified AI model trained on broad data. All fine-tuned versions, adapters, or LoRA modules build on top of this foundational model.

Batch Processing

#

Batch Processing refers to generating or processing multiple images or video segments at once instead of handling them individually, improving efficiency and throughput.

Blend Modes

#

Blend Modes are mathematical rules that determine how two layers combine visually, such as Multiply, Screen, Overlay, or Add, commonly used in compositing and video editing.

Bounding Box Conditioning

#

Bounding Box Conditioning involves guiding a model by specifying rectangular regions where certain objects or actions should appear, helping control layout and composition.

C

Canny Edge Maps

#

Canny Edge Maps are outlines generated using the Canny edge-detection algorithm, often used in ControlNet or conditioning workflows to guide the structure of generated images or videos.
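
For illustration, a typical way to produce a Canny edge map with OpenCV; the file names are placeholders:

    import cv2

    frame = cv2.imread("frame.png")                  # "frame.png" is a placeholder file name
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    edges = cv2.Canny(gray, 100, 200)                # low/high gradient thresholds
    cv2.imwrite("frame_canny.png", edges)            # white outlines on black, ready to use as conditioning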

CausVid Sampler

#

The CausVid Sampler is a sampling method designed specifically for video diffusion models that generates frames in a causal, time-aware order. Rather than treating each frame independently, it ensures that earlier frames influence later frames in a stable, consistent way.

Use CausVid when you want:

  • Higher temporal consistency across frames
  • Reduced flickering or drifting in long sequences
  • Better motion continuity, especially in scenes with moving subjects or cameras

Compared to other samplers:

  • Unlike Euler or DPM++, which treat each frame mostly independently, CausVid is built for sequential coherence.
  • Unlike UniPC, which prioritizes speed and smoothness, CausVid prioritizes maintaining consistent temporal relationships.
  • It is especially useful for text-to-video, image-to-video, and motion-heavy outputs where maintaining continuity matters more than generation speed.

Checkpoints

#

Checkpoints are saved versions of a model’s weights at a specific point in training or fine-tuning. They allow users to load, share, or continue training a model from an exact state rather than starting from scratch.

Classifier-Free Guidance

#

Classifier-Free Guidance is a technique in diffusion models where the model blends an “unconditional” prediction with a “prompt-conditioned” prediction. A guidance scale controls how strongly the model follows the prompt.

Typical guidance values (1–12):

  • Lower values (1–3) produce more natural, smooth results but may follow the prompt less closely.
  • Moderate values (4–7) strike a balance, giving strong prompt fidelity with fewer artifacts.
  • High values (8–12+) force the model to strongly obey the prompt but may introduce noise, distortions, or unnatural details.

Video workflows: In video generation, guidance values often stay on the lower side to avoid flicker and maintain temporal consistency.
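
In code, the guidance step usually reduces to a single blend of the two predictions. Below is a minimal PyTorch sketch, with random tensors standing in for real model outputs:

    import torch

    def classifier_free_guidance(noise_uncond, noise_cond, guidance_scale):
        """Combine unconditional and prompt-conditioned noise predictions.
        guidance_scale = 1.0 ignores the prompt direction; higher values follow it more strongly."""
        return noise_uncond + guidance_scale * (noise_cond - noise_uncond)

    # Toy latents standing in for two UNet/DiT predictions at one denoising step.
    noise_uncond = torch.randn(1, 4, 64, 64)
    noise_cond = torch.randn(1, 4, 64, 64)
    guided = classifier_free_guidance(noise_uncond, noise_cond, guidance_scale=5.0)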

CLIP

#

CLIP (Contrastive Language–Image Pretraining) is a model that learns relationships between text and images, enabling diffusion and video models to understand prompts and match visuals to textual descriptions.

Color Grading Maps

#

Color Grading Maps are reference images or lookup data used to shift the colors, brightness, or contrast of generated images or video frames so they follow a specific visual style or atmosphere.

ComfyUI

#

ComfyUI is a modern, actively maintained node-based interface for diffusion and video generation workflows. It allows granular control over models, samplers, conditioning inputs, and advanced pipelines through a visual graph system.

Consistency Models

#

Consistency Models are generative models that learn to produce high-quality outputs in very few steps by directly mapping noisy inputs to clean results, reducing or eliminating the need for multi-step diffusion sampling.

ControlNet

#

ControlNet is a technique that adds an extra neural network alongside a diffusion model, allowing the model to follow structural guidance such as poses, depth maps, edges, or segmentation masks while still responding to a text prompt.

Cross-Attention

#

Cross-Attention is an attention mechanism where one set of tokens (such as text embeddings) guides or influences another set of tokens (such as image or video latents). It allows the model to integrate external conditioning—like prompts, reference images, or audio—into the generation process.

Cutout Augmentation

#

Cutout Augmentation is a training technique where random parts of an image are masked or removed during training to help the model learn to generalize better and become less sensitive to missing or occluded information.

D

DDIM (Denoising Diffusion Implicit Models)

#

DDIM is a deterministic version of diffusion sampling that allows you to skip many steps while still following the model’s noise-removal trajectory.

  • Much faster than DDPM
  • Typically 20–100 steps
  • Good balance of speed and quality
  • Common default sampler in many tools

Think of DDIM as a "shortcut" sampler that keeps decent fidelity without doing the full diffusion process.

DDPM (Denoising Diffusion Probabilistic Models)

#

DDPM refers to the original diffusion sampling process, where an image or latent is gradually denoised one tiny step at a time using a stochastic (random) process.

  • Requires hundreds to thousands of steps
  • Produces smooth, stable results
  • Slowest of all common samplers
  • Mostly of historical importance as a sampling method—modern tools rarely use it directly

DDPMSolver

#

DDPMSolver is a family of specialized numerical solvers designed to approximate the diffusion denoising process much more accurately per step. It uses higher-order mathematical techniques to follow the noise trajectory more precisely.

Use DDPMSolver when you want:

  • Good image/video quality at very low step counts (often 5–30)
  • Better fine detail preservation than DDIM
  • Stable sampling for both images and video

Key distinctions from DPM++:

  • DDPMSolver is not the same as DPM++.
  • It is its own solver family, created specifically to improve speed and accuracy for diffusion models.
  • It is mathematically different and uses unique multi-step and single-step solvers.

DEIS Sampler (Diffusion Exponential Integrator Sampler)

#

DEIS is an advanced sampler that uses exponential integrator methods to more accurately follow the diffusion curve.

Use DEIS when you want:

  • Smooth transitions (especially important in video)
  • Higher accuracy than DDIM at similar speeds
  • Good results with moderate step counts (10–50)

Compared to DDIM, DEIS is:

  • More accurate
  • Slightly slower
  • Better for temporal consistency in video

Denoising

#

Denoising is the core step of the diffusion process where the model gradually removes noise from a latent or image representation, transforming random noise into a coherent visual result.

DensePose

#

DensePose is a system that maps every pixel of a person in an image to a 3D surface model of the human body. It provides detailed pose and body-surface information that can guide character animation and video generation.

Depth Conditioning

#

Depth Conditioning is a technique where a diffusion or video model uses a depth map as guidance, helping it maintain accurate spatial structure, perspective, and 3D relationships in the generated output.

Depth Maps

#

Depth Maps are images where each pixel represents the distance from the camera to the object in the scene. They are used to control perspective, layering, and camera movement in AI image and video generation.

Depth-Aware Compositing

#

Depth-Aware Compositing is a method of combining visual elements using depth information so that foreground and background layers interact correctly based on distance, occlusion, and perspective.

Diffusion Models

#

Diffusion Models are generative models that create images or videos by reversing a gradual noising process. They start from pure noise and iteratively denoise it, using learned patterns to produce realistic outputs.

Diffusion Transformers (DiTs)

#

Diffusion Transformers are diffusion models that replace the traditional UNet architecture with a transformer architecture—the same type used in large language models. Transformers use self-attention to analyze relationships between all parts of an image or video at once, allowing them to model global structure more effectively.

Using a transformer architecture means:

  • The model can understand long-range dependencies, such as how distant objects relate in a scene.
  • It scales well with size, allowing extremely large, high-performance models.
  • It can unify text, image, and video processing under similar structures.

Why DiTs power modern models (Sora, SD3, etc.):

  • Sharper detail
  • More coherent scenes
  • Better multimodal alignment

Direct Preference Optimization (DPO)

#

Direct Preference Optimization is a training method where a model learns directly from human or automated preference data. Instead of maximizing likelihood, the model is optimized to generate outputs users prefer, improving alignment and realism.

DreamBooth

#

DreamBooth is a fine-tuning technique originally developed by Google Research that teaches a model to recognize a specific subject—such as a person, object, or style—using only a small number of example images. Although the original project is now archived and no longer actively maintained, the method is still widely used in the AI community because it enables personalized generation with very limited data.

Driving Video

#

A Driving Video is a reference video used to drive the motion, timing, and sometimes facial expressions of a generated video. Instead of relying only on prompts or noise, the model extracts motion cues—such as body pose, head movement, camera movement, or facial dynamics—from the driving video and applies them to a new subject or style.

Common uses:

  • Motion transfer: apply dances, gestures, or actions to another character.
  • Talking-head animation: sync facial expression and head motion to a target identity.
  • Camera-motion reproduction: recreate zooms, pans, or shakes from the driving clip.
  • Character reenactment: apply one actor’s performance to another identity.

Why it matters: Driving videos provide precise motion control, ensuring the generated sequence matches the rhythm and movement of the reference while allowing the visual content (identity, style, environment) to be completely different.

Dynamic Thresholding

#

Dynamic Thresholding is a method that prevents extreme pixel values during diffusion sampling by adaptively limiting (or “clipping”) the model’s outputs, helping to reduce over-saturation, noise, or blown-out highlights.
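
A small PyTorch sketch in the spirit of the technique, clipping each sample to a high percentile of its absolute values and rescaling; the exact percentile and details vary between implementations:

    import torch

    def dynamic_threshold(x0, percentile=0.995):
        # Per-sample threshold: a high percentile of the absolute predicted pixel values.
        flat = x0.reshape(x0.shape[0], -1).abs()
        s = torch.quantile(flat, percentile, dim=1)
        s = torch.clamp(s, min=1.0).view(-1, 1, 1, 1)   # never shrink values already inside [-1, 1]
        x0 = torch.minimum(torch.maximum(x0, -s), s)    # clip to [-s, s]
        return x0 / s                                   # rescale back into [-1, 1]

    x0_pred = torch.randn(2, 3, 64, 64) * 2.0   # deliberately over-saturated prediction
    x0_safe = dynamic_threshold(x0_pred)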

E

Edge Maps

#

Edge Maps are images that highlight the outlines and boundaries of objects. They are commonly used to guide diffusion models by providing a structural blueprint for the composition.

EDS (Euler Discrete Scheduler)

#

EDS is a version of the Euler sampler that uses a discrete scheduling method to control how noise is removed at each step, often improving stability and contrast in the generated outputs.

EI (Euler Integrated)

#

EI (Euler Integrated) is a sampler that extends Euler with an integrated update step designed to improve numerical stability and reduce artifacts, especially in complex or high-contrast images.

Embeddings

#

Embeddings are numerical representations of text, images, audio, or video that capture their meaning or features in a format a model can understand and use for conditioning or generation.

Euler a (Euler Ancestral)

#

Euler a is an ancestral variant of Euler that introduces controlled randomness into each sampling step, producing more stylized, creative, or varied outputs.

Euler Sampler

#

The Euler Sampler is a fast and simple diffusion sampling method that produces sharp, well-defined outputs by iteratively stepping along the noise-reversal trajectory. This means the sampler repeatedly moves the image one step closer to its final form by estimating how much noise to remove at each stage, gradually transforming random noise into a coherent image or video frame.
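
A minimal sketch of one Euler step, assuming the k-diffusion convention where the model returns an estimate of the clean sample; the stand-in denoiser here is purely illustrative:

    import torch

    def euler_step(x, sigma, sigma_next, denoiser):
        # denoiser(x, sigma) is assumed to return the model's estimate of the clean sample.
        d = (x - denoiser(x, sigma)) / sigma      # local slope of the noise-removal trajectory
        return x + d * (sigma_next - sigma)       # step to the next, lower noise level

    denoiser = lambda x, sigma: 0.9 * x           # toy stand-in for a real diffusion model
    x = 10.0 * torch.randn(1, 4, 32, 32)          # start from pure noise at a high sigma
    for sigma, sigma_next in [(10.0, 5.0), (5.0, 1.0), (1.0, 0.0)]:
        x = euler_step(x, sigma, sigma_next, denoiser)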

Exponential Moving Average (EMA)

#

EMA is a smoothing technique applied to model weights during training to stabilize learning. By averaging past weight values, EMA often improves the model’s final quality and reduces noise.
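
A minimal PyTorch sketch of the update, applied after each optimizer step during training:

    import copy
    import torch

    def update_ema(ema_model, model, decay=0.999):
        """Exponential moving average of weights: ema <- decay * ema + (1 - decay) * current."""
        with torch.no_grad():
            for ema_p, p in zip(ema_model.parameters(), model.parameters()):
                ema_p.mul_(decay).add_(p, alpha=1.0 - decay)

    model = torch.nn.Linear(8, 8)
    ema_model = copy.deepcopy(model)     # the EMA copy is typically used for evaluation or inference
    # ... after each optimizer step:
    update_ema(ema_model, model)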

F

Face Mesh (MediaPipe)

#

Face Mesh is a system that identifies and tracks detailed face landmarks—such as eyes, lips, and facial contours—providing fine-grained facial information for animation, lip-sync, or character control. It is part of MediaPipe, an open-source framework from Google that provides fast, real-time solutions for tasks like pose detection, hand tracking, and face analysis.

FID (Fréchet Inception Distance)

#

FID is a metric that measures how close generated images or video frames are to real ones by comparing feature statistics. Lower FID scores mean more realistic outputs.

Fine-Tuning

#

Fine-Tuning is the process of taking a pre-trained model and training it further on specialized data so it can learn new styles, subjects, or behaviors without starting from scratch.

Flash Attention

#

Flash Attention is a highly optimized attention algorithm designed to make transformer models faster and more memory-efficient during inference. It computes attention in GPU-friendly blocks (“tiling”) so it avoids creating large intermediate matrices that usually consume a lot of memory.

Why it matters:

  • Enables faster inference, especially for large models.
  • Reduces GPU memory usage substantially.
  • Allows higher resolutions or longer sequences to fit in memory.
  • Widely used in diffusion and video transformers to speed up attention layers.
  • Common across image/video frameworks because it accelerates attention without changing model weights or quality.

Flow Maps

#

Flow Maps (often optical flow maps) represent pixel-by-pixel motion between video frames. They help models understand how objects move over time and can guide motion-consistent video generation.

Flow Matching

#

Flow Matching is a training method where the model learns a smooth transformation (“flow”) that turns noise into an image or video in a single continuous process. Unlike traditional Stable Diffusion—which relies on many discrete denoising steps—flow-matching models learn the velocity field that moves data from noise to the final output more directly and efficiently. Several modern open-source models use Flow Matching, including Stable Diffusion 3, Wan 2.1, Wan 2.2, and many emerging “rectified flow” or “RF” models in the community.

FOV Control (Field of View Control)

#

FOV Control adjusts the camera’s perceived zoom or lens width in a generated scene, allowing wider, narrower, or more cinematic framing.

Frame Interpolation

#

Frame Interpolation generates new frames between existing ones by analyzing motion, making a video smoother or increasing its frame rate.

Frame Rate Conversion

#

Frame Rate Conversion changes a video’s frames-per-second (FPS) by adding or removing frames. AI-based methods often use interpolation to keep motion smooth.

FVD (Fréchet Video Distance)

#

FVD is a metric that evaluates the realism and temporal consistency of generated video by comparing both appearance and motion to real-world video samples.

G

Gaussian Splatting

#

Gaussian Splatting is a 3D reconstruction method that represents a scene using thousands or millions of small 3D Gaussian points instead of polygons or meshes. These Gaussians are rendered extremely quickly, making it possible to create photo-realistic, navigable 3D scenes from video or image captures. In practice, it behaves like a fast, real-time alternative to NeRF for building 3D environments.

Generator Networks

#

Generator Networks are the components of generative models that create new data—such as images, audio, or video—from noise or latent representations. In diffusion and flow-based models, they guide the transformation from noise to the final output.

GPU Acceleration

#

GPU Acceleration refers to using a graphics processing unit to speed up AI computations. GPUs handle many operations in parallel, dramatically improving performance for diffusion and video generation tasks.

Gradient Checkpointing

#

Gradient Checkpointing is a memory-saving technique where the model stores fewer intermediate values during training and recomputes them when needed, allowing larger models or higher resolutions to run on limited GPU memory.

Green Screen / Chroma Keying

#

Green Screen (Chroma Keying) is the technique of removing a uniformly colored background—typically green or blue—so the subject can be composited onto a new background.

Guidance Scale

#

Guidance Scale is the value that controls how strongly a diffusion model follows the text prompt. Low values give natural but less directed results, while high values enforce prompt accuracy but may introduce artifacts.

H

Heun Sampler

#

The Heun Sampler is a diffusion sampler that uses a two-step correction method to improve accuracy. It balances stability and detail, often producing more consistent results than Euler at similar speeds.

Heun V2

#

Heun V2 is an updated version of the Heun sampling method that improves numerical stability and smoothness, making it useful for both images and video sequences where consistent detail is important.

Human Parsing Maps

#

Human Parsing Maps label different regions of a person—such as face, torso, arms, legs, or clothing—as separate segments. This structured information can guide AI models in editing or animating human subjects.

Hybrid Conditioning

#

Hybrid Conditioning refers to using multiple conditioning inputs at once—such as text, depth maps, pose maps, and segmentation masks—to guide a diffusion or video model with more control and precision.

I

I2V (Image-to-Video)

#

I2V (Image-to-Video) is the process of generating a video sequence starting from a single input image. The model expands the scene over time by predicting motion, camera movement, or animation based on prompts.

Identity Preservation

#

Identity Preservation ensures that the appearance of a person or character remains consistent across generated images or video frames, preventing changes in facial structure, features, or style.

Image Conditioning

#

Image Conditioning is the process of guiding a generation model using one or more input images, allowing the model to follow their style, structure, layout, or visual features while producing new content.

Inference Steps

#

Inference Steps are the number of denoising or solver iterations the model uses to generate the final output. More steps usually mean higher quality but slower results; fewer steps are faster but may reduce detail or stability.

Inpainting

#

Inpainting is the technique of filling in missing, masked, or damaged regions of an image or video using AI, allowing objects to be added, removed, or replaced seamlessly.

Instance Segmentation Maps

#

Instance Segmentation Maps label each object in an image as its own separate region (unlike semantic segmentation, which only labels categories). This allows precise targeting of individual objects for editing or control.

Instant-NGP

#

Instant-NGP (Instant Neural Graphics Primitives) is a fast 3D neural rendering technique that reconstructs scenes from images or video using efficient data structures. It enables quick creation of navigable 3D environments.

IP Adapter

#

An IP Adapter—short for Image Prompt Adapter—is a lightweight module that lets a model use reference images to guide the output’s identity, style, or composition without fully fine-tuning the model. It preserves the base model while adding visual influence from the reference.

Iterative Refinement

#

Iterative Refinement is the process of improving an image or video over multiple passes, where each pass corrects or sharpens the previous output to reach a cleaner final result.

J

Joint Embedding Space

#

A Joint Embedding Space is a shared numerical representation where different types of data—such as text, images, or audio—are mapped into the same space so a model can compare and relate them meaningfully.

K

Karras Sampling Schedules

#

Karras Sampling Schedules are noise schedules designed to improve stability and detail during diffusion sampling by changing how noise levels decrease across steps.
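
A small NumPy sketch of the schedule from Karras et al. (2022); the sigma range and rho value shown are common choices, not fixed requirements:

    import numpy as np

    def karras_sigmas(n_steps, sigma_min=0.02, sigma_max=80.0, rho=7.0):
        """Karras noise schedule: spends relatively more steps at low noise levels,
        which tends to improve fine detail at a given step count."""
        ramp = np.linspace(0.0, 1.0, n_steps)
        inv_rho_min = sigma_min ** (1.0 / rho)
        inv_rho_max = sigma_max ** (1.0 / rho)
        sigmas = (inv_rho_max + ramp * (inv_rho_min - inv_rho_max)) ** rho
        return np.append(sigmas, 0.0)   # final entry of 0 marks the fully denoised sample

    print(karras_sigmas(10))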

KDM Sampler

#

The KDM Sampler is a diffusion sampler that combines Karras noise scheduling with higher-order denoising techniques to achieve sharp and consistent results at relatively low step counts.

Keypoint Tracking

#

Keypoint Tracking identifies and follows specific landmarks—such as joints, facial features, or object corners—across video frames to understand motion and maintain consistency during editing or generation.

KID (Kernel Inception Distance)

#

KID is a metric used to evaluate generative models by measuring how close their outputs are to real images. Lower KID scores indicate more realistic results, similar to FID but often more reliable on smaller datasets.

L

Lama Inpainting

#

Lama Inpainting is a deep-learning model designed to fill in missing or masked areas of images using high-quality, context-aware synthesis. It is widely used for object removal and background restoration.

Latent Consistency

#

Latent Consistency refers to maintaining stable and coherent latent representations across frames in video generation, helping reduce flicker and keep objects consistent over time.

Latent Diffusion

#

Latent Diffusion is a technique where the model generates images or video within a compressed latent space instead of pixel space. This makes the process faster and more efficient while preserving detail.

Latent Space

#

Latent Space is a compressed numerical representation where the model encodes abstract features of images or videos. It allows efficient manipulation, blending, and generation of visual content.

Lens Encoding

#

Lens Encoding is a method of guiding a model by specifying camera characteristics such as focal length, field of view, or lens style, helping generate images or videos with a consistent cinematic look.

Line-Art Conditioning

#

Line-Art Conditioning uses line drawings or outline maps to control the structure of generated images or videos, ensuring the output follows the shapes and contours of the provided sketch.

LoRA (Low-Rank Adaptation)

#

LoRA (Low-Rank Adaptation) is a training method that adds a small number of extra parameters to a model so it can learn new skills or styles without modifying the full set of weights. LoRA modules capture the difference between the original model and the desired new behavior, making them lightweight, fast to train, and easy to combine with other LoRAs.

Why creators use LoRA:

  • Personalization (faces, characters, products, art styles) with very little data
  • Stacking or mixing multiple LoRAs to blend concepts
  • Safe experimentation without altering or degrading the base model
  • LoRAs provide many of the benefits of full fine-tuning with only a tiny fraction of the compute and storage.
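
A minimal PyTorch sketch of the idea behind a LoRA layer (frozen base weights plus a trainable low-rank update); real implementations such as PEFT wrap this in considerably more machinery:

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """Frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B(A x)."""
        def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
            super().__init__()
            self.base = base
            for p in self.base.parameters():
                p.requires_grad_(False)                 # the original weights are never modified
            self.lora_a = nn.Linear(base.in_features, r, bias=False)
            self.lora_b = nn.Linear(r, base.out_features, bias=False)
            nn.init.zeros_(self.lora_b.weight)          # the update starts as a no-op
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

    layer = LoRALinear(nn.Linear(768, 768))
    out = layer(torch.randn(2, 768))    # only lora_a and lora_b receive gradients during training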

Lumen Models

#

Lumen Models are AI models designed to simulate lighting and illumination effects, helping generate scenes with realistic or stylized lighting that matches the desired artistic or cinematic look.

M

MagCache (Magnitude-aware Cache)

#

MagCache (Magnitude-aware Cache) is a training-free acceleration technique for video and image diffusion models that significantly speeds up the inference process (by 2x-3x) by intelligently skipping redundant computation steps. It is based on the discovery of a "unified magnitude law" in how diffusion models work internally. MagCache improves upon TeaCache's methodology by discovering a more universal and reliable pattern for residual magnitudes, resulting in a more efficient, robust, and easier-to-implement acceleration technique with less quality trade-off.

Mask R-CNN

#

Mask R-CNN is a neural network that performs both object detection and pixel-level segmentation, allowing each detected object to be outlined with an accurate mask for editing or conditioning.

Masked Conditioning

#

Masked Conditioning is the technique of guiding a generation model using a mask to specify which regions should change and which should stay the same, enabling targeted edits without affecting the whole image or frame.

Matting (Alpha Matting)

#

Matting is the process of extracting a foreground object—including fine details like hair or fur—by producing an alpha matte that separates it from the background for clean compositing or editing.

MediaPipe Face Mesh

#

MediaPipe Face Mesh is a Google system that detects and tracks hundreds of facial landmarks, allowing precise control of expressions, lip-sync, or character animation in AI video workflows.

MediaPipe Pose

#

MediaPipe Pose is a Google-developed, real-time system that identifies and tracks human body keypoints—such as shoulders, elbows, hips, and knees—providing detailed skeletal information for pose-based animation or conditioning.

MiDaS Depth

#

MiDaS is a depth-estimation model that generates depth maps from single images. Its outputs help AI systems understand 3D structure and perspective when generating or editing visual content.

Motion LoRA

#

Motion LoRA is a type of LoRA specifically trained to introduce certain motion behaviors—such as walking, head turning, or camera panning—into video generation models without retraining the entire model.

Motion Prior

#

A Motion Prior is pre-learned information that helps a model predict realistic or consistent movement patterns, improving temporal stability and preventing jitter or unnatural motion in generated video.

Motion Vectors

#

Motion Vectors represent the direction and speed of movement for pixels or objects between video frames. They are used for tasks such as interpolation, stabilization, and motion-aware generation.

Multimodal Models

#

Multimodal Models are AI systems that can understand or generate multiple types of data—such as text, images, audio, or video—by processing them in a shared representation.

MultiView Diffusion

#

MultiView Diffusion refers to models or workflows that generate multiple camera views of a scene simultaneously, helping achieve consistent 3D structure and supporting tasks like novel-view synthesis or multi-angle video generation.

N

NeRF (Neural Radiance Fields)

#

NeRF is a 3D representation method that learns how light behaves in a scene so it can render new viewpoints from images or video. It is used for reconstructing realistic 3D environments.

Noise Schedule

#

A Noise Schedule defines how noise levels change during diffusion sampling. Different schedules control how quickly noise is removed, affecting sharpness, stability, and convergence.

Normal Maps

#

Normal Maps encode the orientation of surfaces at each pixel. They help models understand shape and lighting, guiding more accurate shading or 3D-aware generation.

Normalized Attention Guidance (NAG)

#

A recent, training-free mechanism proposed to improve negative guidance in diffusion models, especially under aggressive sampling conditions.

  • Function: NAG suppresses unwanted attributes in generated content.
  • Advantage: It restores effective negative guidance in situations where the widely used Classifier-Free Guidance (CFG) method fails, all with minimal computational overhead.
  • Application: NAG is presented as a universal plug-in that works across various architectures (UNet, DiT) and modalities (image, video), demonstrating improved text alignment, fidelity, and perceived quality.

Novel View Synthesis

#

Novel View Synthesis is the process of generating new camera angles of a scene based on available images or video, creating viewpoints that were never actually recorded.

NVDiffusion

#

NVDiffusion is an NVIDIA-developed framework for generating or optimizing 3D content using diffusion techniques, often used in rendering, reconstruction, and neural graphics workflows.

O

ODE Euler

#

ODE Euler is the simplest ODE-based sampler, using a single-step numerical update. It is easy to compute but less accurate than more advanced ODE solvers.

ODE Heun

#

ODE Heun is a two-step ODE solver that improves on Euler by making an initial prediction and then correcting it, offering more stability and less noise in the final output.

ODE Samplers

#

ODE Samplers use ordinary differential equation (ODE) solvers to precisely follow the denoising path a diffusion model defines. Instead of adding randomness like stochastic samplers, ODE samplers treat generation as a smooth, deterministic trajectory from noise to the final image or video. This makes them more stable and predictable, especially when high accuracy is needed.

What makes ODE Samplers unique:

  • Follow the exact mathematical path of the model’s noise-removal process
  • Avoid randomness, leading to highly consistent results
  • Provide smoother transitions, which can help in video contexts
  • Often produce cleaner, more detailed images, though typically at slower speeds
  • ODE-based methods (like ODE Euler, ODE Heun, or advanced RK solvers) are preferred when visual stability and precision matter more than generation speed.

OpenPose

#

OpenPose is a computer vision system that detects and tracks human body, hand, and face keypoints. It provides detailed pose information for conditioning animation or video generation.

Optical Flow

#

Optical Flow represents the motion of pixels between video frames by estimating how each pixel moves from one frame to the next. It helps models maintain consistent motion, reduce flicker, and guide animation or stabilization.
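
For illustration, dense optical flow between two consecutive frames using OpenCV's Farneback method; the file names are placeholders:

    import cv2

    # Two consecutive frames, converted to grayscale.
    prev = cv2.cvtColor(cv2.imread("frame_000.png"), cv2.COLOR_BGR2GRAY)
    curr = cv2.cvtColor(cv2.imread("frame_001.png"), cv2.COLOR_BGR2GRAY)

    # Farneback parameters: pyr_scale, levels, winsize, iterations, poly_n, poly_sigma, flags.
    flow = cv2.calcOpticalFlowFarneback(prev, curr, None, 0.5, 3, 15, 3, 5, 1.2, 0)

    dx, dy = flow[..., 0], flow[..., 1]   # per-pixel horizontal and vertical motion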

Outpainting

#

Outpainting is the process of extending an image or video beyond its original borders, allowing new background or scene elements to be generated while keeping the existing content intact.

Overpaint / Mask Passes

#

Overpaint or Mask Passes are workflows where specific regions of an image or video are repeatedly masked and regenerated to fix details, improve quality, or refine consistency without altering the whole frame.

P

Palette Control

#

Palette Control guides the colors used in generation by providing a palette or reference image. The model follows the chosen color themes to create consistent visual style or mood.

Parameter-Efficient Fine-Tuning

#

Parameter-Efficient Fine-Tuning refers to methods—such as LoRA or Adapters—that modify only a small portion of a model’s weights instead of retraining the entire model, making fine-tuning faster and more resource-efficient.

Perceptual Loss

#

Perceptual Loss is a loss function that compares high-level visual features rather than raw pixels. It encourages the model to match the look and feel of target images more closely during training.

Phoneme Alignment

#

Phoneme Alignment maps spoken sounds (phonemes) in audio to their correct positions in time. It helps models generate accurate lip-sync and facial movements in talking-head or character animation.

Pixel Diffusion Models

#

Pixel Diffusion Models perform diffusion directly in pixel space rather than in latent space. They can produce high-fidelity results but are slower and more computationally expensive.

Pose Maps

#

Pose Maps are images or data structures showing the positions of key body points (joints or landmarks). They are used to control the posture, movement, or animation of characters in generated video.

Pose-Free NeRF

#

Pose-Free NeRF is a variation of Neural Radiance Fields that can reconstruct 3D scenes without requiring precise camera pose information, making scene capture and reconstruction more flexible.

Prompt Embedding

#

Prompt Embedding is the numerical representation of a text prompt created by a language–vision model such as CLIP. It allows the generation model to understand the meaning and intent of the prompt when creating images or video.

Prompt Weighting

#

Prompt Weighting adjusts the influence of specific words, phrases, or concepts within a prompt. Higher weights push the model to emphasize those elements more strongly in the final output.

Q

Q-LoRA

#

Q-LoRA (Quantized Low-Rank Adaptation) is a training method that allows you to fine-tune a very large model by loading the base model in a quantized (typically 4-bit) format, while training LoRA adapters in normal higher precision.

Key idea:

  • Load the base model in 4-bit precision so it fits into far smaller GPU memory footprints.
  • Train LoRA adapters in higher precision (16-bit/32-bit) so the adapter still learns accurately.
  • During training, temporarily dequantize the needed weights so gradients flow without ever storing a full-precision copy of the entire model.

Why this matters:

  • You can fine-tune huge models (billions of parameters) on consumer GPUs.
  • You never alter the base model’s weights—only the lightweight LoRA learns.
  • The resulting LoRA behaves like a normal adapter module that can be shared or merged elsewhere.

Quantization

#

Quantization reduces the precision of a model’s numerical weights—such as converting 32-bit values to 8-bit—so it uses less memory and runs faster, often with minimal loss in accuracy.
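
A toy NumPy sketch of symmetric 8-bit quantization; production libraries use more sophisticated per-channel or 4-bit schemes:

    import numpy as np

    def quantize_int8(weights: np.ndarray):
        """Symmetric 8-bit quantization: store int8 values plus one float scale."""
        scale = np.abs(weights).max() / 127.0
        q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
        return q, scale

    def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
        return q.astype(np.float32) * scale

    w = np.random.randn(256, 256).astype(np.float32)
    q, scale = quantize_int8(w)
    w_approx = dequantize(q, scale)
    error = np.abs(w - w_approx).mean()   # small reconstruction error, roughly 4x less memory than float32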

R

Rectified Flow

#

Rectified Flow is a flow-based training method where the model learns a simplified, straightened path from noise to the final image or video. It often produces faster generation and greater stability than traditional diffusion.

Render Consistency

#

Render Consistency refers to maintaining stable appearance, lighting, and details across all frames in a generated video so objects do not flicker, distort, or shift unpredictably.

Reprojection

#

Reprojection maps pixels or 3D points from one frame or camera view into another. It helps maintain spatial consistency when generating multi-view or 3D-aware content.

RIFE Interpolation

#

RIFE (Real-Time Intermediate Flow Estimation) is an AI method for generating intermediate video frames. It creates smooth slow motion or higher frame rates by predicting motion between frames quickly and accurately.

RK (Runge–Kutta) Sampler

#

An RK (Runge–Kutta) Sampler uses higher-order numerical methods—meaning it takes multiple intermediate calculations within a single sampling step—to more accurately estimate the denoising trajectory in diffusion or flow models.

"Higher-order" means:

  • The sampler evaluates the model’s predictions several times per step.
  • It uses these evaluations to refine the direction and magnitude of the update.
  • This produces a more precise and stable result than simpler one-step methods (like Euler).
  • RK samplers are typically more accurate and smoother but also slower because each step requires extra computations.

Robust Video Matting (RVM)

#

Robust Video Matting (RVM) is a deep-learning model that performs real-time, high-quality video matting, extracting foreground subjects—including fine details like hair—with minimal flicker or artifacts.

S

Sage Attention

#

Sage Attention is an inference-optimized attention mechanism focused on stability and efficiency for large, high-resolution, or long-sequence models. It restructures how attention scores are normalized and aggregated so the model remains stable without excessive memory use.

Key strengths:

  • Designed for faster, more stable inference.
  • Handles longer sequences and deeper attention stacks.
  • Reduces memory overhead compared to standard attention.
  • Often used in large video or multimodal transformers where long temporal context is needed.

Sampling Schedule

#

A Sampling Schedule defines how a sampler distributes its noise-removal steps over time. Different schedules (linear, exponential, Karras, etc.) affect sharpness, stability, and how fast detail emerges during generation.

SDE Samplers

#

SDE Samplers (Stochastic Differential Equation samplers) introduce controlled randomness into the sampling process. This can produce more natural variation and improve robustness but may reduce deterministic consistency.

SDPA (Scaled Dot-Product Attention)

#

Scaled Dot-Product Attention (SDPA) is the standard attention mechanism used in transformer models. It computes attention scores by taking the dot product between queries and keys, scaling the result to maintain numerical stability, and then applying a softmax to determine how much each token should attend to others.

Why it matters:

  • Baseline attention method for most diffusion, video, and multimodal transformers.
  • The scaling term (divide by √dₖ) prevents extremely large dot products, improving stability.
  • Provides high-quality attention but can be computationally expensive for long sequences or high-resolution video—hence optimizations like Flash Attention or Sage Attention.
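
A reference implementation in PyTorch for clarity; in practice one would call an optimized kernel such as torch.nn.functional.scaled_dot_product_attention instead:

    import math
    import torch

    def scaled_dot_product_attention(q, k, v):
        d_k = q.shape[-1]
        scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)   # similarity of every query with every key
        weights = scores.softmax(dim=-1)                    # attention distribution per query
        return weights @ v                                  # weighted sum of values

    q = torch.randn(1, 8, 128, 64)   # (batch, heads, tokens, head_dim)
    k = torch.randn(1, 8, 128, 64)
    v = torch.randn(1, 8, 128, 64)
    out = scaled_dot_product_attention(q, k, v)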

Seamless Tiling

#

Seamless Tiling is the process of generating images that can repeat infinitely without visible borders. It is used for textures, backgrounds, and looping visual patterns.

Seed

#

A Seed is a numerical value that controls the randomness in generation. Using the same seed with the same settings produces the same result, enabling reproducibility.
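
A minimal PyTorch example of seeded, reproducible noise; many generation pipelines (for example in diffusers) accept a generator like this to fix the seed:

    import torch

    generator = torch.Generator(device="cpu").manual_seed(42)
    noise_a = torch.randn(1, 4, 64, 64, generator=generator)

    generator = torch.Generator(device="cpu").manual_seed(42)
    noise_b = torch.randn(1, 4, 64, 64, generator=generator)

    assert torch.equal(noise_a, noise_b)   # same seed + same settings -> identical starting noise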

Self-Attention

#

Self-Attention is a mechanism that allows a model to decide which parts of an input (image patches, video frames, or tokens) should pay attention to each other. It enables global reasoning about context and relationships.

Self-Forcing

#

Self-Forcing is a video generation technique where a model uses its own previously generated frames as conditioning inputs for generating the next frames. Instead of conditioning only on the original source image or prompt, the model repeatedly “feeds back” its own output to guide future frames.

Why it’s used / what it preserves:

  • Character identity
  • Scene layout
  • Color and lighting consistency
  • Overall temporal stability (each new frame depends on the last, so sequences stay coherent)

Trade-offs:

  • Pros: Reduces flicker, drifting shapes, identity shifts, and unwanted changes across frames.
  • Cons: Errors can accumulate—if one frame develops an artifact, later frames may amplify it (“feedback drift”).

Where it shows up: Image-to-video, video-to-video, and consistency-focused samplers often expose a “self-feedback” or “history strength” slider controlling how much the model relies on its previous outputs versus fresh conditioning.

Semantic Segmentation Maps

#

Semantic Segmentation Maps label each pixel in an image by category (e.g., “sky,” “person,” “car”) rather than by individual instance. These maps guide models in understanding scene structure.

Sequence-to-Sequence Models (Seq2Seq)

#

Seq2Seq models take a sequence as input and generate a sequence as output. In video workflows, they can predict frame sequences, motions, or transformations over time.

Shift / Model Shift

#

Shift (often called Model Shift) is a scheduler parameter that changes how long the sampler remains at higher noise levels during generation by offsetting the noise—or sigma—schedule. Instead of altering the amount of noise added, shift changes which noise levels the sampler visits and how long it spends in the early, very-noisy phase of sampling.

How shift works:

  • A normal noise schedule might decrease steadily, like 1.0 → 0.75 → 0.5 → 0.25 → 0.0.
  • A higher shift “pushes upward,” so the sampler lingers at larger sigma values: 1.0 → 0.95 → 0.8 → 0.4 → 0.0.
  • Practically, the model spends more time exploring global structure and less time committing too early to details.

Effects on output:

  • Lower shift values:
    • Exit the high-noise regime sooner.
    • Produce cleaner, smoother, more stable frames.
    • Reduce hallucinations but sometimes lose richness or complexity.
  • Higher shift values:
    • Stay longer in the high-noise regime.
    • Increase detail, contrast, and perceived complexity.
    • Can introduce more noise, artifacts, or chaotic behavior.

Why it matters for video: Adjusting shift helps balance stability versus visual punch when animating sequences; many workflows expose it as a lever for flicker reduction.

Video-specific impact:

  • Temporal consistency (how stable details are across frames).
  • Degree of creative freedom versus structural stability.
  • Trade-off between detail and flicker.
  • Higher shift encourages bolder global changes and stronger details, while lower shift keeps motion and structure more controlled.

Slerp (Spherical Linear Interpolation)

#

Slerp is a method of smoothly interpolating between two points in latent space. It is commonly used to blend between images, characters, or styles with consistent transitions.
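
A small NumPy sketch of slerp between two latent vectors:

    import numpy as np

    def slerp(a, b, t, eps=1e-7):
        """Spherical linear interpolation between vectors a and b, with t in [0, 1]."""
        a_n = a / (np.linalg.norm(a) + eps)
        b_n = b / (np.linalg.norm(b) + eps)
        dot = np.clip(np.dot(a_n, b_n), -1.0, 1.0)
        theta = np.arccos(dot)                      # angle between the two directions
        if theta < 1e-4:                            # nearly parallel: fall back to plain lerp
            return (1.0 - t) * a + t * b
        return (np.sin((1.0 - t) * theta) * a + np.sin(t * theta) * b) / np.sin(theta)

    a = np.random.randn(512)
    b = np.random.randn(512)
    midpoint = slerp(a, b, 0.5)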

Spatial Conditioning

#

Spatial Conditioning uses maps such as depth, segmentation, pose, or edge maps to guide where objects, structures, or motion should appear in an image or video. It anchors the generated content to a specific spatial layout.

Spectrogram Conditioning

#

Spectrogram Conditioning uses a visual representation of audio—showing frequency over time—to guide motion, rhythm, or timing in video generation, enabling music-driven or speech-driven animation.

Spline Interpolation

#

Spline Interpolation smooths transitions between keyframes or values using curved mathematical functions. In video generation, it helps create smoother camera paths or motion sequences.

SR (Super-Resolution)

#

Super-Resolution is the process of increasing the resolution of an image or video while enhancing sharpness and detail. AI-based SR models can reconstruct textures and edges that were not present in the original.

Stable Video Diffusion

#

Stable Video Diffusion (SVD) is a family of video generation models released by Stability AI, built primarily around image-to-video generation and designed to extend the Stable Diffusion ecosystem to motion. It focuses on temporal consistency and controllable video generation.

Style LoRA

#

A Style LoRA is a LoRA trained to apply a specific artistic style—such as watercolor, anime, cinematic film look, or painterly strokes—to generated images or video frames without altering subject identity.

Style Reference Embedding

#

A Style Reference Embedding is a numerical encoding of an example image’s visual attributes—such as color palette, texture, lighting, or brushwork—which the model uses to emulate that style in new outputs.

Surfaces (Surface Normals / Geometry Surfaces)

#

Surfaces describe the 3D orientation of objects in a scene. When encoded as surface normal maps, they guide a model in producing consistent lighting, shading, and depth-aware generations.

T

TeaCache (Timestep Embedding Aware Cache)

#

TeaCache accelerates the inference (generation) process in diffusion models, such as those used for image, video, and audio creation. It works by observing that consecutive steps in the denoising process are often very similar, so it identifies when to reuse a previous, cached result instead of performing the full computation again. This speeds up generation by skipping unnecessary steps, though it can result in a slight loss of quality, making it ideal for getting quick previews before generating a final, high-quality output.

Temporal Attention

#

Temporal Attention is a mechanism that lets a video model analyze relationships across multiple frames instead of just within a single frame. It helps the model keep identities, shapes, lighting, and details consistent over time by allowing information from previous and future frames to influence the current one.

Temporal Consistency Loss

#

Temporal Consistency Loss is a training objective that penalizes flicker, jitter, and sudden frame-to-frame changes in video generation. It encourages the model to keep colors, shapes, and motion coherent across time, improving stability and reducing visual artifacts.
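
As a deliberately simplified sketch, the loss below penalizes raw frame-to-frame differences; real implementations typically warp the previous frame with optical flow before comparing, so that legitimate motion is not penalized:

    import torch
    import torch.nn.functional as F

    def temporal_consistency_loss(frames):
        # frames: (batch, time, channels, height, width); penalize change between adjacent frames.
        return F.mse_loss(frames[:, 1:], frames[:, :-1])

    video = torch.rand(2, 16, 3, 64, 64)
    loss = temporal_consistency_loss(video)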

Temporal Consistency Models

#

Temporal Consistency Models are specialized AI models or modules designed to keep visual details steady across frames—reducing flicker, drifting objects, or changes in appearance during video generation.

Temporal Diffusion

#

Temporal Diffusion extends diffusion models to operate across time, not just single images. It learns how frames evolve, allowing the model to generate smooth, coherent video sequences.

Temporal Latent Alignment

#

Temporal Latent Alignment ensures that the latent representations of consecutive frames remain aligned as the model generates a video. This alignment helps maintain identity, pose, and structural consistency, preventing drifting details or morphing artifacts over time.

Temporal Noise Injection

#

Temporal Noise Injection adds controlled noise across frames during generation to prevent the model from overfitting to previous outputs. It helps reduce “frozen” frames and encourages natural motion.

Temporal Propagation

#

Temporal Propagation is the process of carrying information—such as identity, color, poses, or features—from one frame to the next, helping maintain continuity and avoid flicker.

Temporal Super-Resolution

#

Temporal Super-Resolution increases the frame rate or smoothness of a video by predicting intermediate frames while preserving motion accuracy.

Text Conditioning

#

Text Conditioning uses embeddings from a language model or CLIP to guide the visual output based on a user’s prompt. It determines how strongly the text influences the generated content.

Text Embedding Models

#

Text Embedding Models convert prompts into numerical vectors that represent meaning, style, or intent. These vectors are used to control the direction of image or video generation.

Text Embeddings

#

Text Embeddings are numerical vectors that represent the meaning of a prompt. Generated by models like CLIP or T5, they allow diffusion and video models to interpret concepts, styles, and instructions encoded in natural language.

Textual Inversion

#

Textual Inversion is a method for teaching a model new concepts by creating custom text tokens that represent a person, object, or style without modifying the model’s weights.

Timestep Embeddings

#

Timestep Embeddings encode the current diffusion step into a numerical vector. The model uses this information to know how much noise remains and how aggressively to denoise at each stage.

Token Attention Maps

#

Token Attention Maps visualize which parts of an image or video the model is focusing on in response to certain prompt tokens, helping users understand how text influences the generated content.

Token Merging

#

Token Merging combines similar attention tokens inside a transformer to reduce computation. In video, it can speed up generation while keeping visual quality close to full-resolution attention.

Token-Weighted Prompts

#

Token-Weighted Prompts use syntax or special notation (such as parentheses or numerical weights) to emphasize or de-emphasize specific words, concepts, or attributes in a prompt.

Top-K / Top-P Sampling

#

Top-K and Top-P are sampling techniques used in autoregressive models (and sometimes hybrid video systems) to control randomness.

  • Top-K: Limit generation to the K most likely outputs at each step.
  • Top-P (nucleus sampling): Choose the smallest set of outputs whose cumulative probability exceeds threshold P.
  • Tuning tip: Higher values allow more creativity and surprise; lower values produce steadier, more predictable sequences.
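
A minimal PyTorch sketch of Top-K and Top-P filtering applied to one step's logits before sampling:

    import torch

    def top_k_top_p_filter(logits, k=50, p=0.9):
        """Mask logits so sampling only considers the Top-K tokens and, within those,
        the smallest nucleus whose cumulative probability exceeds p."""
        # Top-K: drop everything below the k-th largest logit.
        kth = torch.topk(logits, k).values[..., -1, None]
        logits = logits.masked_fill(logits < kth, float("-inf"))

        # Top-P (nucleus): drop the low-probability tail in sorted order.
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = sorted_logits.softmax(dim=-1).cumsum(dim=-1)
        tail = cum_probs > p
        tail[..., 1:] = tail[..., :-1].clone()   # shift so the token that crosses p is still kept
        tail[..., 0] = False                     # never mask the single most likely token
        sorted_logits = sorted_logits.masked_fill(tail, float("-inf"))
        return torch.full_like(logits, float("-inf")).scatter(-1, sorted_idx, sorted_logits)

    logits = torch.randn(1, 32000)                          # one step of an autoregressive model
    probs = top_k_top_p_filter(logits).softmax(dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)    # sample from the filtered distribution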

Tracking Stabilisers

#

Tracking Stabilisers use motion estimation to smooth out jittery or inconsistent movement in generated or real video. They help maintain stable camera paths and prevent unwanted shaking across frames.

Transformer Blocks

#

Transformer Blocks are the core units of transformer architectures. They contain attention mechanisms and feed-forward layers that let the model analyze relationships across an image or sequence, enabling global understanding of structure, context, and motion.

Transparency Maps

#

Transparency Maps (alpha maps) indicate how visible each pixel should be. They are used for compositing, masking, and matting workflows, allowing clean separation of foreground and background elements.

Two-Stage Video Models

#

Two-Stage Video Models generate video in two steps:

  • Produce a low-resolution or low-frame-rate base video.
  • Run a refinement or super-resolution pass to enhance detail, motion, and quality.

This structure improves temporal consistency and reduces compute costs by reserving expensive processing for the second stage.

U

U-Net

#

U-Net is a neural network architecture commonly used in diffusion models. It processes images using a downsampling “encoder” path and an upsampling “decoder” path with skip connections that preserve detail.

UniDiffuser

#

UniDiffuser is a unified generative model capable of handling multiple tasks—such as text-to-image, image-to-image, and image variation—within a single diffusion framework using shared representations.

UniPC Multisolver

#

UniPC Multisolver is an extended version of the UniPC sampler that dynamically switches or blends between multiple solver behaviors during sampling. This improves robustness across varied prompts, styles, and conditioning inputs, especially in more complex video pipelines.

UniPC Sampler

#

The UniPC Sampler (Unified Predictor-Corrector) is a fast, stable diffusion sampler that uses both prediction and correction steps to follow the denoising trajectory more accurately. It performs well at low step counts and is often more stable than DPM++ in video workflows.

Upscaling Models

#

Upscaling Models increase the resolution of images or video frames while adding or reconstructing fine details. They can be AI-based (e.g., ESRGAN, Real-ESRGAN, SwinIR) for higher-quality results than traditional resizing.

V

VACE (Video Attention-Conditioned Editing / All-in-One Video Creation and Editing)

#

VACE is an open-source video creation and editing system that supports a wide range of tasks—such as reference-to-video (R2V), video-to-video (V2V), masked video editing (MV2V), object swap, motion transfer, and expansion. It uses attention maps and structured conditioning (e.g., masks, images, prompts) to guide edits while preserving temporal consistency and motion in videos.

Key points:

  • Not limited to one model architecture; it supports base models like WAN2.1 or LTX-Video and is designed for extensible workflows.
  • Enables Move-Anything, Swap-Anything, Reference-Anything, and Animate-Anything styles of edits within one pipeline.
  • Emphasizes temporal coherence while handling multiple input forms (video, masks, reference images, prompts).
  • Integrates editing, generation, and conditioning into one toolkit rather than separate modules.

VAE / VAE-XL

#

A VAE (Variational Autoencoder) is a model that compresses images or video frames into latent space and then reconstructs them. Diffusion models rely on VAEs to work efficiently at lower resolutions while preserving detail. VAE-XL is a higher-capacity version designed for sharper detail, better color accuracy, and fewer reconstruction artifacts, especially in high-resolution or video workflows.

Velocity Prediction (v-Prediction)

#

Velocity Prediction is a diffusion training approach where the model predicts the velocity between the noisy and clean image rather than predicting noise directly. This can produce more stable sampling and faster convergence, especially in modern diffusion and flow-matching models.

Video ControlNet

#

Video ControlNet extends the ControlNet concept to video by applying structural guidance—such as pose, depth, edges, or segmentation—to each frame while maintaining cross-frame consistency. It gives users fine control over motion, composition, and structure throughout the video.

Video Depth Estimation

#

Video Depth Estimation predicts depth maps for every frame in a video, allowing models to understand 3D structure, maintain consistent perspective, or apply 3D-aware effects across time.

Video Diffusion Models

#

Video Diffusion Models are diffusion-based generative models designed to create video rather than single images. They learn how visual content changes over time, enabling smooth motion and frame-to-frame consistency.

Video Inpainting

#

Video Inpainting fills in missing or masked regions across an entire sequence of frames. It ensures that replaced or removed objects remain consistent from frame to frame.

Video Latents

#

Video Latents are compressed representations of video frames used by diffusion or flow models. Working in latent space makes video generation faster and more memory-efficient than operating on raw pixel videos.

Video Looping

#

Video Looping is the process of generating or editing a video so the final frame transitions seamlessly back to the first frame, creating a continuous, infinite loop.

Video Motion Transfer

#

Video Motion Transfer applies the motion from a source video—such as a dance or gesture sequence—to a target character or subject. It requires accurate pose or motion extraction.

Video Reference Conditioning

#

Video Reference Conditioning uses one or more reference frames or a short reference clip to guide the model’s style, identity, or motion across the generated video.

Video-to-Video (V2V)

#

Video-to-Video generation transforms an input video into a new output video by applying style changes, motion modifications, or prompt-driven reinterpretations while keeping the original structure or timing.

Viseme Generation

#

Viseme Generation creates the visual mouth shapes that correspond to spoken phonemes in an audio track. Video models use visemes to produce accurate lip-sync for characters, matching speech timing and mouth movements naturally.

VQ-VAE (Vector Quantized Variational Autoencoder)

#

VQ-VAE is a model that compresses images or videos into discrete codebook vectors and then reconstructs them. It enables efficient storage and generation while preserving structure and detail.

W

Warping (Frame Warping)

#

Warping adjusts or repositions pixels in a video frame based on motion estimates, allowing alignment between frames or helping propagate structure during video generation.

Weighted Prompts

#

Weighted Prompts allow specific words, concepts, or tokens in a prompt to have more or less influence on the generated output. Higher weights strengthen the emphasis on a term, while lower weights reduce it. This helps guide composition, style, or subject priority with fine-grained control.

Workflow YAML

#

A Workflow YAML is a structured configuration file (in YAML format) that defines an entire generation pipeline—including model loading, nodes, settings, conditionings, and data flow. It is used by systems like ComfyUI to save, share, and reproduce complex video generation workflows.

World Coordinate Tracking

#

World Coordinate Tracking estimates motion and object positions in a fixed 3D world space rather than in screen (2D) coordinates. This allows consistent motion, stable camera paths, and accurate object interactions across frames—useful for video editing, AR, and 3D-aware generation.

X

X-Pose

#

X-Pose is a pose estimation system that extracts detailed body keypoints from video or images. It provides high-fidelity motion and body-joint information for conditioning video generation, animation, and motion transfer pipelines.

Xformers Acceleration

#

Xformers Acceleration uses the xFormers library (a set of optimized attention and transformer operations) to speed up diffusion and video models. It reduces memory usage, increases attention-layer throughput, and allows larger models or batch sizes to run efficiently on GPUs.

Z

Zero-Shot Generation

#

Zero-Shot Generation is the ability of a model to create images or videos of new concepts or tasks without being explicitly trained on them. It relies on broad generalization from large training datasets, enabling models to respond to prompts they’ve never seen before.

Zoom Stabilization

#

Zoom Stabilization corrects unwanted zoom-in or zoom-out fluctuations in video. In AI video generation, it ensures that subjects maintain consistent size and distance from the camera across frames, reducing “breathing” artifacts or accidental scale drift.

Need sampler intel?

Compare every diffusion sampler on its own page

We carved the sampler comparison table into a dedicated resource with speed, fidelity, and best-use notes for Euler, UniPC, CausVid, DPM++, and more. Bookmark it whenever you are tuning motion blur or trying to move fast without flicker.

Diffusion Sampler Guide

View the sampler guide

Put the terms into action

Animate scroll-stopping videos with Fuzz Puppy

Upload a character, record a driving video, and let our pipeline handle motion transfer, lip sync, and sampler tuning for you.