Virtual avatars optimized for AMD GPUs in video games

Written by

Izan Leal & Aleix Figueres

September 10, 2025

Table of contents

  1. Next-Gen virtual 3D avatars for AMD GPUs in video games
  2. New features
  3. Architecture
  4. Conclusion

Next-Gen virtual 3D avatars for AMD GPUs in video games

Fluendo collaborated with a company on a cutting-edge real-time application that showcases AI-powered multimedia processing optimized for AMD hardware. Building on our 2023 work on a virtual background and video game mixer, this new solution introduces live 3D avatar animation driven by facial expression recognition. It integrates directly with Twitch for seamless streaming.

Designed for content creators, gamers, and broadcasters, the app combines real-time AI inference, advanced media composition, and video game rendering, efficiently distributed across CPU/NPU and dedicated GPUs. Thanks to an optimized dual-GPU architecture, it enables accurate facial feature extraction and 3D avatar rendering without affecting gameplay performance.

New features

The original concept, removing the background from a webcam image and overlaying it onto a game capture, has evolved into a multi-layered pipeline. This newer version includes:

  • AI-animated 3D avatar rendering, driven by facial expression tracking.

  • Live Twitch streaming, replacing traditional local video recording.

  • Upgraded UI controls via a modern application.

This demo not only offers a richer user experience, but also highlights Fluendo’s expertise in real-time edge AI multimedia processing.


Architecture

Overview

This project demonstrates how to parallelize game rendering and AI-driven media processing by efficiently splitting GPU resources. The dedicated GPU (AMD Radeon™ RX 7600S) is reserved exclusively for rendering the video game content intended for streaming, ensuring optimal game performance. Simultaneously, the integrated GPU (AMD Radeon™ 780M) handles the real-time animation of the virtual 3D avatars and executes all AI-related tasks, such as facial expression tracking and avatar control.

This architecture showcases the benefits of offloading AI inference and media composition to the integrated GPU. It enables a smooth, low-latency experience without compromising gaming performance, ideal for modern content creation and live streaming scenarios.


Figure 1: Demo app architecture

Our virtual 3D avatar application is built around a modular architecture that combines real-time AI inference, advanced media composition, and direct-to-stream output.

  • Media capture: Using GSTSource, the application captures the webcam, microphone, and default system audio output.

  • Game capture: A proprietary AMD library captures the gameplay feed, ensuring minimal latency and high performance.

  • AI background removal: The background removal AI module processes webcam frames, generating a GST D3D11 texture in which the avatar is rendered over the user’s face while the background is removed. This is possible thanks to Fluendo’s Raven AI Engine, which lets us combine all the different AI tasks behind a single API and deploy them as a GStreamer plugin.

  • Preview rendering: A previewer component receives video/x-raw samples, enabling the output video to be drawn directly onto the preview window.

  • Streaming pipeline: GStreamer encodes and muxes the processed audio and video streams directly to Twitch for live broadcasting.

Thanks to GStreamer’s modular pipeline design, hardware acceleration, and seamless integration with custom AI and AMD components, the system delivers smooth, real-time performance even under GPU-intensive workloads.
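
As a rough, self-contained illustration of that kind of graph (not the production pipeline: test sources stand in for the game capture, webcam, and AI elements, and the encoder choices and stream key are assumptions), a simplified compositing-and-streaming pipeline can be sketched in Python like this:

```python
#!/usr/bin/env python3
"""Minimal compositing + RTMP streaming sketch (assumes GStreamer >= 1.16)."""
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

STREAM_KEY = "live_xxxxxxxx"  # hypothetical Twitch stream key

# Test sources stand in for the game capture (sink_0) and the AI-processed
# webcam/avatar feed (sink_1); the overlay is placed in the bottom-right corner.
PIPELINE = f"""
  compositor name=mix sink_1::xpos=1280 sink_1::ypos=720 sink_1::zorder=1
    ! videoconvert ! x264enc tune=zerolatency bitrate=4500 key-int-max=60
    ! queue ! flvmux name=mux streamable=true
    ! rtmp2sink location=rtmp://live.twitch.tv/app/{STREAM_KEY}
  videotestsrc is-live=true pattern=ball ! video/x-raw,width=1920,height=1080,framerate=30/1 ! mix.sink_0
  videotestsrc is-live=true pattern=snow ! video/x-raw,width=640,height=360,framerate=30/1 ! mix.sink_1
  audiotestsrc is-live=true wave=sine ! audioconvert ! avenc_aac ! queue ! mux.
"""

pipeline = Gst.parse_launch(PIPELINE)
pipeline.set_state(Gst.State.PLAYING)
try:
    GLib.MainLoop().run()          # stream until interrupted
finally:
    pipeline.set_state(Gst.State.NULL)
```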


Figure 2: GStreamer pipeline


AI-powered facial expression tracking for avatar rendering

A key innovation in this demo is the five-stage facial AI pipeline, fully implemented in Python and integrated via our backend.


Figure 3: Input frame


Stage 1: Face detection

The first stage of the pipeline uses a lightweight convolutional neural network (CNN) optimized specifically for real-time face detection on edge devices.

This model processes each incoming webcam frame and accurately identifies human faces by generating a tight bounding box around the detected facial region. It’s designed to be highly efficient, delivering fast inference speeds even on low-power GPUs, without compromising detection quality.

The bounding box is then expanded by 25% in all directions to ensure facial features like ears and jawline are fully captured in the next stage.
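
A minimal sketch of that expansion step (our own illustration rather than the production code): grow the detector’s box by 25% on each side and clamp it to the frame boundaries.

```python
def expand_bbox(x, y, w, h, frame_w, frame_h, margin=0.25):
    """Grow a face bounding box by `margin` on every side, clamped to the frame."""
    dx, dy = int(w * margin), int(h * margin)
    x0, y0 = max(0, x - dx), max(0, y - dy)
    x1, y1 = min(frame_w, x + w + dx), min(frame_h, y + h + dy)
    return x0, y0, x1 - x0, y1 - y0

# Example: a 200x240 detection near the right edge of a 1280x720 frame
print(expand_bbox(1050, 100, 200, 240, 1280, 720))  # -> (1000, 40, 280, 360)
```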


Figure 4: Face detection resultant bounding box


Stage 2: FaceMesh extraction

This stage uses the extracted face bounding box and a face mesh extraction model to analyze facial geometry in fine detail. This deep learning model predicts a dense set of 478 3D landmarks, representing key facial features with high spatial precision. Each landmark corresponds to specific anatomical regions: eyes, nose, lips, jawline, ears, eyelids, pupils, cheekbones, forehead, etc.

The model operates within a normalized 256×256×256 virtual coordinate space, where X and Y represent horizontal and vertical facial positions, and Z captures depth information (how far each point is from the camera).
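
To make that coordinate space concrete, here is a small illustrative helper (our own sketch, assuming the mesh model was run on the expanded face crop from Stage 1) that maps landmarks from the 256×256×256 space back into frame pixels:

```python
import numpy as np

def landmarks_to_frame(landmarks_256, crop_x, crop_y, crop_w, crop_h):
    """Map a (478, 3) landmark array from the model's 256x256x256 space
    into frame coordinates. X/Y are rescaled to the face crop; Z is kept
    proportional to the crop width so depth scales with face size."""
    pts = np.asarray(landmarks_256, dtype=np.float32) / 256.0   # -> [0, 1]
    out = np.empty_like(pts)
    out[:, 0] = crop_x + pts[:, 0] * crop_w    # X in frame pixels
    out[:, 1] = crop_y + pts[:, 1] * crop_h    # Y in frame pixels
    out[:, 2] = pts[:, 2] * crop_w             # relative depth
    return out
```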


Figure 5: FaceMesh estimation


Stage 3: Model rotation estimation (Procrustes analysis)

Once the 3D facial landmarks have been extracted, the system must determine how the user’s head is oriented in space. To achieve this, we apply Orthogonal Procrustes Analysis, a mathematical technique commonly used in 3D vision and biometric alignment.

This step compares the current live landmark set to a canonical reference mesh (a neutral face in a known position and orientation) and computes the similarity transformation required to align them: rotation, translation, and scale.

Without this step, even highly accurate expressions would look unnatural if the avatar’s head remained static. Procrustes-based rotation bridges the user’s real-world movement with digital animation, anchoring the avatar in a believable 3D space.

Sample rotation, position, and scale values for the input image:

ROTATION:
 - Yaw: -0.51°
 - Pitch: 2.67°
 - Roll: -0.65°

POSITION:
 - X: 472 mm
 - Y: 103 mm
 - Z: 1230 mm

SCALE:
 - Scale X: 0.88
 - Scale Y: 0.93
 - Scale Z: 0.97
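
For illustration only (not Fluendo’s implementation), a compact Umeyama-style Procrustes fit that produces this kind of output can be sketched as follows; the axis convention for yaw/pitch/roll is an assumption that depends on how the canonical mesh is oriented:

```python
import numpy as np

def procrustes_pose(live, canonical):
    """Fit the similarity transform (rotation, isotropic scale, translation)
    that maps the canonical mesh onto the live (478, 3) landmark set."""
    L = np.asarray(live, dtype=np.float64)
    C = np.asarray(canonical, dtype=np.float64)
    mu_l, mu_c = L.mean(axis=0), C.mean(axis=0)
    Lc, Cc = L - mu_l, C - mu_c

    U, S, Vt = np.linalg.svd(Cc.T @ Lc)                     # cross-covariance SVD
    d = np.sign(np.linalg.det(U @ Vt))                      # guard against reflections
    D = np.diag([1.0, 1.0, d])
    R = (U @ D @ Vt).T                                      # rotation: canonical -> live
    scale = (S * np.diag(D)).sum() / (Cc ** 2).sum()        # isotropic scale
    t = mu_l - scale * (R @ mu_c)                           # translation

    # Euler angles in degrees (assumed axes: X right, Y vertical, Z toward camera)
    yaw = np.degrees(np.arctan2(-R[2, 0], np.hypot(R[0, 0], R[1, 0])))
    pitch = np.degrees(np.arctan2(R[2, 1], R[2, 2]))
    roll = np.degrees(np.arctan2(R[1, 0], R[0, 0]))
    return R, scale, t, (yaw, pitch, roll)
```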

Stage 4: Blendshapes calculation

With a dense 3D landmark mesh and accurate head pose now available, the next step is to interpret facial expressions. This is done through blendshape extraction—a process that translates subtle facial movements into numeric weights used for animation.

A dedicated model analyzes the landmarks’ spatial configuration to compute 52 standard blendshape weights, each representing a unique facial deformation.

These blendshapes correspond to key expressive actions such as smiling, blinking, and opening the jaw. Their values are not binary triggers; they reflect gradual facial changes, such as a smile forming or an eye beginning to squint. This level of nuance is essential for creating avatars that feel expressive and human rather than robotic or overly exaggerated.
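
As a hedged sketch of what this stage’s interface might look like (assuming the blendshape regressor is exported to ONNX; the model file name, input shape, and ARKit-style weight names are illustrative assumptions, not the actual model):

```python
import numpy as np
import onnxruntime as ort   # assumption: the blendshape model is available as ONNX

# Illustrative names following the common 52-blendshape (ARKit-style) convention
BLENDSHAPE_NAMES = ["eyeBlinkLeft", "eyeBlinkRight", "jawOpen", "mouthSmileLeft",
                    "mouthSmileRight", "browInnerUp"]   # ... 52 entries in total

session = ort.InferenceSession("face_blendshapes.onnx")  # hypothetical model file

def compute_blendshapes(landmarks):
    """Turn the (478, 3) landmark mesh into named blendshape weights in [0, 1]."""
    inp = np.asarray(landmarks, dtype=np.float32)[None, ...]        # add batch dim
    raw = session.run(None, {session.get_inputs()[0].name: inp})[0].ravel()
    return dict(zip(BLENDSHAPE_NAMES, np.clip(raw, 0.0, 1.0).tolist()))
```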


Figure 6: Blendshapes setting


Stage 5: Avatar rendering

The final stage of the AI pipeline brings everything together: landmarks, rotation, and blendshape weights are applied to a fully rigged 3D avatar model to generate the final visual output. This rendering process uses Blender’s Python API, allowing for frame-accurate, high-quality avatar animation directly from real-time webcam input.

This process is optimized to be headless and efficient, enabling real-time frame generation without opening Blender’s GUI. Rendered avatar frames can then be overlaid on live video, composited into a gameplay stream, or substituted in place of the webcam feed entirely.
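
As a rough sketch of that headless step (the object name, shape-key names, and rotation convention are assumptions, not the production rig), driving and rendering a single frame through bpy could look like this, typically run with `blender --background scene.blend --python script.py`:

```python
import math
import bpy   # Blender's Python API

def render_avatar_frame(blendshapes, yaw, pitch, roll, out_path):
    """Apply blendshape weights and head pose to a rigged avatar, then render one frame."""
    avatar = bpy.data.objects["Avatar"]          # hypothetical avatar object name

    # Drive facial expressions through shape keys (one per blendshape weight)
    for name, weight in blendshapes.items():
        key = avatar.data.shape_keys.key_blocks.get(name)
        if key is not None:
            key.value = float(weight)

    # Apply head rotation from the Procrustes stage (XYZ Euler, radians)
    avatar.rotation_euler = (math.radians(pitch), math.radians(yaw), math.radians(roll))

    # Headless render of a single frame to disk
    bpy.context.scene.render.filepath = out_path
    bpy.ops.render.render(write_still=True)
```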

This final stage completes the pipeline, delivering a rendered, expressive digital persona ready for real-time interaction, streaming, or virtual production.


Figure 7: Avatar render


UI Application

The final application offers an all-in-one interactive user interface for managing AI-powered streaming, avatar rendering, and real-time background removal.


Figure 8: Demo app screenshot.

Key features:

  • Live preview: Real-time output with webcam, avatar, and game feed.
  • Broadcast: Start/stop Twitch stream, set resolution, adjust overlay and avatar.
  • Settings: Select GPU, camera, microphone, and manage audio levels.
  • Security: Add Twitch stream key.

A system tray menu gives quick access to start/stop streaming, app info, and exit.

This intuitive UI empowers users to stream professionally with full control—no post-production or green screen needed.


Conclusion

This project marks a significant step forward in real-time, AI-powered multimedia applications. We successfully demonstrated the simultaneous use of both a dedicated and an integrated AMD GPU—splitting workloads between game rendering and AI-driven avatar animation. This allowed us to maintain smooth gameplay while achieving real-time avatar performance with a maximum end-to-end latency under 40 ms.

The animated avatars accurately replicated not just facial expressions, but also head rotation and position, resulting in a highly immersive and responsive user experience. Additionally, we delivered a modern and intuitive GUI that lets users stream directly to Twitch, switch between different 3D avatar models (e.g., cat, bear, dog), and control the pipeline in real time.

By combining deep learning, facial tracking, 3D rendering, and live broadcasting within a user-friendly interface, this application showcases the potential of Fluendo’s AI and GStreamer-based infrastructure for streamers, developers, educators, and content creators alike.

Whether you’re a streamer, educator, developer, or broadcaster, this technology opens up new creative possibilities.

Want to integrate it into your workflow? Contact us to explore how we can help elevate your multimedia pipeline.