
Real-time 4K face anonymization benchmark (GStreamer plugin)

Written by
Aleix Figueres, September 25, 2025
The Big Picture
At Fluendo, we’re excited to unveil Fluendo AI Plugins (FAIP), a new suite of AI-powered GStreamer plugins designed to deliver exceptional performance on virtually any hardware, from NVIDIA and AMD to Intel GPUs.
Built to handle multiple AI models simultaneously at ultra-fast speeds (currently around 400–500 FPS), FAIP is optimized for edge devices and currently available for Windows, with Linux and Embedded Linux support coming soon.
But how did we get there?
We began our AI development line in 2022 by creating a series of AI applications based on Deep Learning models for detecting, segmenting, tracking, and clustering players in recordings of football matches and training sessions, aimed at tactical analysis for coaches and analysts.
During the project’s inception, we explored different ways to implement and deploy each of these solutions to integrate them into the project’s native environment. To achieve this, we primarily required each solution to run within GStreamer pipelines, so we ultimately decided to deploy them as independent GStreamer plugins. Depending on the use case, these plugins would include different elements and properties (e.g., player detection, ball detection, etc.).
Although similar solutions based on Python with OnnxRuntime (AI) and OpenCV (rendering) already existed, we found that achieving real-time performance at standard resolutions (HD, Full HD, 2K, etc.) was not feasible due to fundamental limitations:
- Non-parallelizable architecture
- Multiple CPU–GPU transfers
- Suboptimal GStreamer integration
- Non-accelerated rendering tasks
Considering this, the Raven AI Engine project was born. We decided to develop our own cross-platform (Windows, Linux & macOS) and multi-adapter (NVIDIA, AMD, Intel, NXP, etc.) engine, with fully customizable inference (AI) and rendering (pre- & post-processing) capabilities. This engine also included a GStreamer integration layer, enabling AI algorithms to run in a fully GPU-driven mode (no CPU–GPU transfers or synchronization), meaning the entire pipeline runs on the GPU.

Figure 1: Full-GPU Driven AI pipeline performance improvement demonstration
Thus, each solution (detection, tracking, etc.) was treated as an independent engine task, which in turn was deployed as separate GStreamer plugins. These plugins were ultimately integrated and released as a new Fluendo product known as Fluendo AI Plugins, currently composed of:
- Anonymization plugin
- Background removal plugin
- Super-resolution plugin
These plugins, based on an old version of Raven AI Engine, were deployed in October 2024 as part of the Fluendo AI Plugins first release (v0.1). After the launch of the stable version v0.1.5, several limitations in the engine’s architecture were identified that prevented real-time processing, especially at higher resolutions (Full HD, 2K, 4K, 4K HDR, etc.). These limitations were primarily related to:
- Lack of task parallelization
- Processing of certain post-processing algorithms on the CPU
- Non-optimized rendering engine
As a result, Fluendo committed to developing a new version of the engine (Raven AI Engine v0.3) designed to enable 100% full-GPU driven processing with parallelization capabilities. This allowed our plugins to achieve real-time processing even in constrained environments, including embedded systems, and at high resolutions (4K, 4K HDR, etc.).
In June 2025, Fluendo released the first version of its Fluendo AI Plugins (v1.0.4) based on the optimized Raven v0.3 engine, achieving 500 FPS at 4K (3840x2160) input resolutions on consumer hardware such as the NVIDIA RTX 4060.

Figure 2: Fluendo’s Anonymizer GStreamer plugin sample
But why run AI at 4K?
For some Deep Learning models based on architectures such as convolutional networks, the input tensor size is crucial for achieving high accuracy metrics — especially when detecting small-pixel-area targets or generating more precise profiles in segmentation tasks.
In other words, AI accuracy depends not only on the quality of the training but also on the quality of the input image. This is why, at Fluendo, we believe that a solution capable of running AI in real time, with maximum precision, and on any device, gives us a true competitive advantage over other alternatives.
Benchmark
Test case 1: Performance comparison
Overview
In this section, we present a comparative performance analysis across different hardware platforms (GPUs) to quantify the performance improvements achieved in the latest version of our anonymizer plugin.
The following implementations of the face anonymization plugin are compared:
- Fluendo AI Plugins (FAIP) v1.0.4 (latest version, June 2025)
- Fluendo AI Plugins (FAIP) v0.1.5 (non-optimized version, October 2024)
- Plugin built with Python, OnnxRuntime & OpenCV (standard alternative)
The study is conducted on the following machines (Windows 11):
| GPU | Platform | Architecture | Boost Clock (MHz) | Memory (Type & Size) | Power/TDP (W) | FP32 TFLOPS |
|---|---|---|---|---|---|---|
| AMD RX 7600S | Laptop Mobile | RDNA 3 | 1865 | 8 GB GDDR6, 128-bit | 80 (peak 88) | 15.7 |
| NVIDIA MX550 | Laptop Mobile | Turing | 1320 | 2 GB GDDR6, 64-bit | 25 | 2.7 |
| NVIDIA RTX 4060 | Laptop Mobile | Ada Lovelace | 2370 | 8 GB GDDR6, 128-bit | 35–115 (configurable) | 11.6–15.1 |
| NVIDIA RTX 3060 | Desktop | Ampere | 1777 | 12 GB GDDR6, 192-bit | 170 | 12.7 |
| NVIDIA RTX 3090 | Desktop | Ampere | 1700 | 24 GB GDDR6X, 384-bit | 350 | 35.6 |
Table 1: Benchmark hardware configurations
Video streaming
In this first analysis, we compare the performance of the anonymization plugin across the previously mentioned hardware within a video streaming pipeline. For this purpose, the reference metric used is the framerate (FPS) that the pipeline can deliver while running the plugin.

Figure 3: AI-based Video Streaming GStreamer pipeline sample (4K input)
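For readers familiar with gst-launch syntax, a 4K streaming pipeline of this shape could be sketched as follows. This is an illustrative configuration, not the exact benchmarked command: the anonymizer element name (flanonymize) is a placeholder, and the Windows capture and render elements shown (mfvideosrc, d3d12upload, d3d12videosink) are one plausible choice.

```
gst-launch-1.0 mfvideosrc ! video/x-raw,width=3840,height=2160,framerate=30/1 ! \
    d3d12upload ! flanonymize ! d3d12videosink
```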
Table 2 presents a comparison of performance (FPS) across different hardware configurations for various input resolution permutations. The purpose of this test is to evaluate how the plugin’s performance scales as the AI workload increases with higher input tensor resolutions.
For real-time processing, a minimum of 25–30 FPS is required to avoid stuttering that is visible to the human eye. It is also worth noting that a webcam capable of streaming at 4K @ 30 FPS was used, replicating the functional environment in which a single person is anonymized.
Taking this into account, the next table highlights all permutations that achieve real-time performance (>25 FPS) as well as the maximum frame rate for each test case.
| GPU | 4K – FAIP v1.0.4 | 4K – FAIP v0.1.5 | 4K – Python | 2K – FAIP v1.0.4 | 2K – FAIP v0.1.5 | 2K – Python | FullHD – FAIP v1.0.4 | FullHD – FAIP v0.1.5 | FullHD – Python | HD – FAIP v1.0.4 | HD – FAIP v0.1.5 | HD – Python |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AMD RX 7600S | 30.4 | 5.03 | 3.41 | 29.35 | 8.3 | 6.38 | 60.56 | 11.69 | 11.04 | 89.86 | 25.03 | 18.88 |
| NVIDIA MX550 | 30.42 | 6.38 | 0.91 | 30.21 | 9.25 | 1.26 | 58.7 | 13.1 | 3.38 | 90.51 | 17.44 | 4.62 |
| NVIDIA RTX 4060 | 29.18 | 10.13 | 3.92 | 30.29 | 19.68 | 6.68 | 60.5 | 13.92 | 12.14 | 91.41 | 20.53 | 24.18 |
| NVIDIA RTX 3060 | 30.55 | 12.15 | 1.85 | 30.25 | 19.97 | 3.29 | 60.61 | 26.21 | 6.68 | 60.57 | 29.91 | 13.32 |
| NVIDIA RTX 3090 | 30.34 | 17.28 | 1.91 | 30.24 | 23.08 | 5.49 | 60.46 | 28.95 | 8.11 | 90.98 | 31.95 | 14.64 |
Table 2: Video streaming benchmark results (FPS)
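As a quick sanity check, the real-time criterion (>25 FPS) can be applied to the table programmatically. The numbers below are copied verbatim from the 4K columns of Table 2.

```python
# Real-time criterion from the text: more than 25 FPS.
REALTIME_FPS = 25.0

# 4K results copied from Table 2: (FAIP v1.0.4, FAIP v0.1.5, Python).
results_4k = {
    "AMD RX 7600S":    (30.4, 5.03, 3.41),
    "NVIDIA MX550":    (30.42, 6.38, 0.91),
    "NVIDIA RTX 4060": (29.18, 10.13, 3.92),
    "NVIDIA RTX 3060": (30.55, 12.15, 1.85),
    "NVIDIA RTX 3090": (30.34, 17.28, 1.91),
}

def realtime(fps: float) -> bool:
    """True when a measured framerate clears the real-time threshold."""
    return fps > REALTIME_FPS

for gpu, (v1, v01, py) in results_4k.items():
    print(f"{gpu}: v1.0.4={realtime(v1)}, v0.1.5={realtime(v01)}, Python={realtime(py)}")
```

Running this shows that, at 4K, only FAIP v1.0.4 clears the threshold on every tested GPU.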
The benchmark results demonstrate a substantial leap in performance with FAIP v1.0.4, enabling consistent real-time processing without frame drops, even at 4K resolutions. By contrast, the older FAIP v0.1.5, which lacked task parallelization and had a less optimized rendering pipeline, can only sustain real-time performance up to Full HD (half the 4K resolution in each dimension).
This achievement is mainly due to the graph-based architecture refactoring, which allowed us to parallelize and optimize both rendering and AI modules. In this case, the model consists of a CNN-based architecture with minimal preprocessing, so most of the workload is spent on inference and post-processing, where our newest NMS (Non-Maximum Suppression) module stands out thanks to its render-based innovative approach.
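The plugin's NMS runs on the GPU through the render pipeline. As a point of reference only, a minimal CPU implementation of standard greedy NMS (not the render-based variant used by FAIP) looks like this:

```python
def iou(a, b):
    # Boxes as (x1, y1, x2, y2); returns intersection-over-union.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, iou_thresh=0.5):
    # Greedy NMS: repeatedly keep the highest-scoring box and drop
    # every remaining box that overlaps it above the IoU threshold.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        order = [i for i in order if iou(boxes[best], boxes[i]) < iou_thresh]
    return keep
```

The render-based approach reformulates this sequential loop so overlap suppression can be resolved in parallel on the GPU, which is where the post-processing speedup comes from.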
As expected, the Python-based approach shows the worst performance. Although its AI inference backend is similar to our FAIP implementation, it uses neither a full-GPU-driven pipeline nor parallelization. Due to CPU–GPU memory transfers for preprocessing, inference, and post-processing, combined with the lack of parallelization, this approach is not suitable for real-time processing at high resolutions on desktop or laptop hardware.
It’s also important to note that the reported results are influenced by two environmental limitations:
- Webcam capture constraints: The webcam used in testing can only deliver frames at 30 FPS for 4K and 60 FPS for lower resolutions, as dictated by its sensor, resolution, and lighting conditions. This caps the sampling frequency for analysis and the upper bound of throughput.
- Display-related latency: Using a videosink element adds latency because, unlike a fakesink, it must render the output to screen, requiring synchronization with the pipeline clock and often with the monitor’s VSync (typically 60 Hz). When the frame generation rate exceeds the display refresh rate, frames must be held or dropped, introducing additional latency. By comparison, a fakesink discards buffers entirely, avoiding these delays.
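The display cap can be illustrated with a toy model: a rendering sink's effective output rate is bounded by the display refresh rate, while a fakesink-style sink passes the generation rate through. This simplification ignores clock jitter and buffering, but captures why the streaming tables plateau at 30/60 FPS.

```python
from typing import Optional

def effective_fps(generation_fps: float, refresh_hz: Optional[float]) -> float:
    """Effective output rate: capped by the display refresh when rendering,
    uncapped when the sink simply discards buffers (refresh_hz=None)."""
    if refresh_hz is None:  # fakesink-like: no display synchronization
        return generation_fps
    return min(generation_fps, refresh_hz)

print(effective_fps(90.5, 60.0))   # videosink on a 60 Hz monitor -> 60.0
print(effective_fps(90.5, None))   # fakesink -> 90.5
```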
Video transcoding
In this scenario, we compare the performance of the anonymization plugin across the previously mentioned hardware within a video transcoding pipeline, where an .MP4 file is processed to run the AI and display the result on screen.

Figure 4: AI-based Video Transcoding GStreamer pipeline sample (4K input)
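In gst-launch terms, a transcoding pipeline of this kind might look like the sketch below. As before, flanonymize is a placeholder for the actual anonymizer element name, and decodebin stands in for whichever platform decoder is selected.

```
gst-launch-1.0 filesrc location=input.mp4 ! decodebin ! \
    d3d12upload ! flanonymize ! d3d12videosink
```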
As seen in Table 3, performance results in the video transcoding pipeline are significantly higher than those measured in the video streaming scenarios, reaching up to 236 FPS at 4K resolution with the latest FAIP v1.0.4 release. This improvement is largely due to the absence of real-time capture constraints (such as the 30/60 FPS webcam limit) and the fact that the input MP4 file can be read and buffered at full speed, maximizing GPU utilization.
The benchmark clearly highlights the massive performance leap delivered by the new FAIP architecture, with improvements of up to 20× over the older FAIP v0.1.5, primarily thanks to the parallelized, full GPU-driven pipeline design and optimized rendering/AI engines introduced in v1.0.4.
Interestingly, when using FAIP v1.0.4, performance remains relatively stable across input resolutions (4K vs. HD).
This consistency is a result of two factors:
- The graph-based execution pipeline, which keeps inference and post-processing stages fully parallelized, reducing bottlenecks as resolution increases.
- The render-accelerated Non-Maximum Suppression (NMS) and other post-processing routines, which offload most of the scaling cost to the GPU rather than the CPU.
By contrast, the Python + OnnxRuntime + OpenCV approach once again delivers the lowest performance, unable to achieve real-time speeds at any tested resolution due to the absence of optimizations such as parallelization and the heavy CPU–GPU data transfers involved in preprocessing, inference, and rendering.
| GPU | 4K – FAIP v1.0.4 | 4K – FAIP v0.1.5 | 4K – Python | HD – FAIP v1.0.4 | HD – FAIP v0.1.5 | HD – Python |
|---|---|---|---|---|---|---|
| AMD RX 7600S | 162.77 | 4.49 | 3.3 | 164.95 | 12.91 | 24.68 |
| NVIDIA MX550 | 152.2 | 1.98 | 0.91 | 119.77 | 19.35 | 5.97 |
| NVIDIA RTX 4060 | 236.04 | 10.65 | 3.62 | 239.79 | 26.09 | 25.68 |
| NVIDIA RTX 3060 | 162.68 | 6.12 | 1.87 | 161.43 | 13.9 | 11.87 |
| NVIDIA RTX 3090 | 216.64 | 10.83 | 2.04 | 218.03 | 13.55 | 13.05 |
Table 3: Video transcoding benchmark results (FPS)
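The per-GPU speedup of v1.0.4 over v0.1.5 can be derived directly from the 4K columns of Table 3; the values below are copied from that table.

```python
# 4K transcoding results from Table 3: (FAIP v1.0.4, FAIP v0.1.5) in FPS.
results_4k = {
    "AMD RX 7600S":    (162.77, 4.49),
    "NVIDIA MX550":    (152.2, 1.98),
    "NVIDIA RTX 4060": (236.04, 10.65),
    "NVIDIA RTX 3060": (162.68, 6.12),
    "NVIDIA RTX 3090": (216.64, 10.83),
}

for gpu, (new, old) in results_4k.items():
    print(f"{gpu}: {new / old:.1f}x speedup")
```

This yields roughly 20x on the RTX 3090 and considerably larger ratios on the other cards.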
These results confirm that the new FAIP v1.0.4 architecture not only achieves real-time processing even at 4K but can also far exceed real-time thresholds in transcoding scenarios, making it suitable for high-throughput workloads on both desktop and laptop-class hardware.
Finally, in this scenario we faced the additional latency introduced by the decoder, which becomes particularly noticeable in transcoding scenarios like this, where the goal is to maximize performance.
The decoding process entails a significant processing load, as the compressed stream (H.264 in this case) must be decompressed frame by frame, undergo format conversion if necessary, and, in some cases, store frames in buffers to maintain synchronization.
All of this results in accumulated delay before the data can be processed or re-encoded by the rest of the pipeline.
So, the question we must ask is:
“What is the actual latency added by our anonymization plugin to the pipeline?”
Test case 2: Real latency calculation
To accurately measure the anonymization plugin’s latency contribution, we performed a differential framerate analysis by progressively disabling the primary external elements in the video transcoding pipeline for a 4K input video.
For each hardware platform, the following latency-generating components were isolated:
- Decoder: replaced by the dummy source element d3d12testsrc, which generates raw buffers directly in GPU memory, eliminating decompression and format conversion overhead.
- Video sink: replaced by fakesink, which discards incoming frames without rendering, bypassing display synchronization (pipeline clock and VSync).
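Combining both substitutions, the fully stripped measurement pipeline reduces to something like the following sketch, where flanonymize remains a placeholder for the actual anonymizer element name:

```
gst-launch-1.0 d3d12testsrc ! flanonymize ! fakesink sync=false
```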
We also measured the baseline nominal pipeline without the anonymization plugin, using the same source and sink configuration.
By comparing equivalent runs with and without the AI plugin, we can directly compute the plugin-specific latency overhead.
Table 4 summarizes the measured framerates for all configurations, along with the derived plugin latency (in ms) for each GPU.
| GPU | Nominal | d3d12testsrc | fakesink | d3d12testsrc + fakesink | Nominal (No anonym) | d3d12testsrc + fakesink + No anonym | Real Plugin Latency [ms] |
|---|---|---|---|---|---|---|---|
| AMD RX 7600S | 162.77 | 165.37 | 175.13 | 180.91 | 164.7 | 14721 | 0.0720 |
| NVIDIA MX550 | 173.47 | 237.24 | 143.33 | 284.42 | 206.27 | 15596 | 0.9167 |
| NVIDIA RTX 4060 | 236.04 | 245.63 | 272.54 | 499.75 | 230.74 | 26004.92 | 0.0973 |
| NVIDIA RTX 3060 | 162.68 | 145.77 | 210.79 | 261.67 | 199.42 | 18111.81 | 1.1325 |
| NVIDIA RTX 3090 | 216.64 | 215.05 | 221.97 | 343.01 | 218.09 | 14897.46 | 0.0307 |
Table 4: Differential latency analysis results (FPS)
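The differential idea itself is simple: per-frame time is the inverse of the framerate, and the plugin's added latency is the difference between per-frame times with and without it. The helper below is a sketch of that idea with hypothetical framerates; the exact measurement methodology behind Table 4 (averaging, warm-up, which runs are paired) is not detailed here, so this is illustrative rather than a reproduction of those figures.

```python
def frame_time_ms(fps: float) -> float:
    """Per-frame processing time in milliseconds for a given framerate."""
    return 1000.0 / fps

def plugin_latency_ms(fps_with_plugin: float, fps_without_plugin: float) -> float:
    """Differential latency: extra per-frame time attributable to the plugin."""
    return frame_time_ms(fps_with_plugin) - frame_time_ms(fps_without_plugin)

# Hypothetical example: 480 FPS with the plugin vs 500 FPS without it.
print(f"{plugin_latency_ms(480.0, 500.0):.4f} ms")  # -> 0.0833 ms
```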
As observed, there is notable latency added by both the decoder and the videosink elements in all test cases.
With the decoder and videosink completely removed, FAIP v1.0.4 reaches its maximum raw processing throughput, achieving nearly 500 FPS on the NVIDIA RTX 4060 (4K input).
These results show the upper performance limit of the anonymization pipeline when all external bottlenecks are stripped away, demonstrating that the engine itself is capable of extreme throughput far beyond real-time requirements.
This level of performance highlights the efficiency of the graph-based design and confirms that most of the overhead in nominal runs originates from decoding and display stages, rather than the AI plugin itself.
On the other hand, the anonymization plugin introduces varying latency levels depending on the hardware.
Across the tested GPUs, its overhead ranges from as little as approximately 0.03 ms (milliseconds) on the RTX 3090 to as much as 1.13 ms on the RTX 3060, relative to a pipeline without AI.
These figures highlight that the plugin’s impact is not uniform and depends strongly on the processing capabilities and architecture of each GPU.
This variation in latency may be largely due to hardware differences and GPU utilization.
High-end GPUs like the RTX 3090 can process some AI inferences and render-driven post-processing without saturating compute or memory resources, so the plugin has minimal impact on throughput.
Mid-range GPUs such as the RTX 3060 and MX550, however, operate near their performance limits, with bottlenecks in memory-bound post-processing (mask rendering and compositing) and contention with decoding and rendering tasks.
Disabling these stages reveals the plugin’s relative cost.
Overall, the anonymization plugin’s latency overhead ranges from about 0.03 ms on high-end GPUs to 1.13 ms on mid-tier devices.
Even so, FAIP v1.0.4 consistently achieves real-time performance across all tested systems.
Conclusions
The benchmarking results clearly demonstrate the substantial performance gains introduced by FAIP v1.0.4 compared to the earlier v0.1.5 and the Python-based OnnxRuntime + OpenCV implementation.
Thanks to its graph-based architecture with full parallelization, full GPU-driven data handling, and optimized render-driven post-processing, the latest version achieves consistent real-time performance across all tested resolutions, including 4K, while the older FAIP release can only sustain real-time performance at Full HD.
By contrast, the Python approach—limited by CPU–GPU data transfers and lack of parallelization—is unsuitable for real-time processing beyond low resolutions.
In the video streaming benchmarks, overall throughput is capped by external constraints rather than the plugin itself.
The webcam used for testing can only deliver 30 FPS at 4K and 60 FPS at lower resolutions, while the use of a videosink element introduces additional synchronization delays due to pipeline clocking and VSync (typically 60 Hz).
These factors limit the upper framerate, even though the plugin itself can sustain far higher processing rates when these bottlenecks are removed, as seen in Table 2, where we achieved 91 FPS for HD input.
In video transcoding scenarios, where webcam capture and display-related bottlenecks are removed, FAIP v1.0.4 demonstrates its full processing potential, achieving throughputs of up to 236 FPS at 4K on consumer hardware and nearly 500 FPS on the RTX 4060 when all external pipeline elements are stripped away.
These figures highlight the efficiency of the engine’s parallelized AI and rendering modules, which maintain stable performance even as resolution scales from HD to 4K.
By comparison, the older FAIP architecture suffers performance drops of more than an order of magnitude, underscoring how the architectural overhaul (full graph-based execution, fully GPU-driven data handling, and render-driven post-processing) has fundamentally improved scalability and throughput.
Finally, the latency analysis confirms that the anonymization plugin’s processing cost is highly hardware-dependent, ranging from 0.03 ms on high-end GPUs like the RTX 3090 to 1.13 ms on mid-tier devices like the RTX 3060, primarily due to differences in memory bandwidth, GPU scheduling, and how each card handles concurrent AI and rendering workloads.
Even so, FAIP v1.0.4 consistently maintains real-time processing thresholds (25–30 FPS and far beyond) across all tested environments, proving its suitability for both live streaming and high-throughput transcoding workflows on a wide range of hardware.
👉 Contact us today and let’s bring performance and precision to your next project.