Active Speaker Detection

flagship

NVIDIA · released 2024-06-01 · text

currently routing · 4.2k rpm

Context: 1K tokens

Input: — / 1M

Output: — / 1M

Speed: — t/s

License: open

/ ABOUT

NVIDIA's Active Speaker Detection model identifies which person is currently speaking in multi-person video or audio streams. Using computer vision and audio analysis, it determines which visible face corresponds to the active audio source, enabling automated speaker attribution in meetings, interviews, and multi-camera productions.

The model processes video frames alongside audio to correlate lip movements and facial expressions with the audio signal. It handles scenarios with overlapping speakers, off-screen speakers, and varying camera angles.
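As a rough illustration of the audio-visual correlation idea described above, the sketch below scores each visible face by how closely a per-frame mouth-motion signal tracks the audio energy, and reports no speaker when nothing correlates strongly (the off-screen case). This is not NVIDIA's method: a production model would use learned audio and visual features rather than a plain Pearson correlation. The function names, the input signals, and the 0.5 threshold are all assumptions made for the example, and overlapping speakers are not handled.

```python
from math import sqrt
from typing import Dict, List, Optional


def pearson(xs: List[float], ys: List[float]) -> float:
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def pick_active_speaker(
    audio_energy: List[float],
    mouth_motion: Dict[str, List[float]],
    threshold: float = 0.5,  # hypothetical cut-off, not a documented value
) -> Optional[str]:
    """Return the face whose mouth motion best tracks the audio, or None.

    None stands for the off-screen case: no visible face correlates
    strongly enough with the audio.
    """
    best_face, best_score = None, threshold
    for face_id, motion in mouth_motion.items():
        score = pearson(audio_energy, motion)
        if score > best_score:
            best_face, best_score = face_id, score
    return best_face


# Toy per-frame signals (made up for illustration).
audio = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1]
faces = {
    "face_a": [0.0, 0.7, 0.8, 0.1, 0.6, 0.0],  # moves with the audio
    "face_b": [0.5, 0.4, 0.5, 0.5, 0.4, 0.5],  # mostly still
}
print(pick_active_speaker(audio, faces))  # prints "face_a"
```

Returning None instead of always picking the top-scoring face keeps the off-screen speaker case from being attributed to whoever happens to be on camera.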

Active Speaker Detection is essential for automated meeting transcription, video editing, and media processing pipelines where accurate speaker identification is required.

Providers for Active Speaker Detection

1 route · sorted by uptime

ClosedRouter routes requests to the providers best able to handle your prompt size and parameters, with automatic fallbacks to maximize uptime.
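The fallback behaviour can be pictured with a small sketch: try providers in order of 30-day uptime and move on to the next one when a call fails. The page does not document how ClosedRouter actually selects or retries providers, so the Provider class, the route function, and the ordering by uptime alone (ignoring prompt size and parameters) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Provider:
    name: str
    uptime_30d: float  # percent over the last 30 days
    call: Callable[[bytes], str]  # hypothetical request function


def route(providers: List[Provider], payload: bytes) -> Tuple[str, str]:
    """Try providers from highest to lowest uptime; return (name, result)."""
    errors = []
    for provider in sorted(providers, key=lambda p: p.uptime_30d, reverse=True):
        try:
            return provider.name, provider.call(payload)
        except Exception as exc:  # any failure triggers the next fallback
            errors.append(f"{provider.name}: {exc}")
    raise RuntimeError("all providers failed: " + "; ".join(errors))


def failing_call(payload: bytes) -> str:
    raise TimeoutError("provider unavailable")


def working_call(payload: bytes) -> str:
    return "ok"


providers = [
    Provider("primary", 99.9, failing_call),
    Provider("backup", 95.0, working_call),
]
print(route(providers, b"frame-and-audio-bytes"))
```

With a single listed route, as on this page, there is nothing to fall back to, so a failure of that one provider surfaces directly to the caller.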

Provider · Context · Quant · Uptime (30d)

NVIDIA NIM · — · bf16 · 0.00%