Reducing video latency for GStreamer filters in digital microscopes

Written by:

Diego Nieto

June 19, 2024

At Fluendo, we are always excited to receive projects that push our boundaries. As part of our Consulting Services, we recently collaborated with a well-known hardware design company to enhance their cutting-edge digital microscopes, which are integral to their new generation of premium devices. The project addressed video latency issues by applying advanced GStreamer filters and leveraging NVIDIA technologies such as CUDA and Jetson.

This article explains how we analyzed each filter's impact on the whole pipeline and what we achieved.

Addressing latency in real-time video processing with GStreamer tools

We are working with a real-time system that processes the input camera image and must display the processed image as fast as possible. However, a noticeable delay appears when all the video filters are enabled. In our case, the system comprised three different video filters that can be enabled and combined at any time.

When we want to improve something complex, the first question to address is where the critical spots are. To answer it, we used a GStreamer tool: tracers. Tracers provide, among other things, latency measurements per pipeline and per element. With those metrics, we could identify which elements take the most time.
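For reference, the latency tracer is enabled through environment variables. A minimal invocation looks like this (the full commands we used appear later in this article; the pipeline here is simplified for illustration):

GST_DEBUG="GST_TRACER:7" GST_TRACERS="latency(flags=pipeline+element)" GST_DEBUG_FILE=trace.log gst-launch-1.0 nvarguscamerasrc ! nvvidconv ! xvimagesink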

Handling the first performance issue

We spotted the CUDA filters taking considerably more time than the prototype measurements had suggested. To work with CUDA filters, we needed to perform a memory transformation.

Using the nvarguscamerasrc element, we were retrieving NVMM memory (GPU memory) from the camera. However, CUDA kernels need to work with an EGL mapping of that memory. To achieve this, it is necessary to use EGLImageKHR and CUgraphicsResource objects, the latter being an NVIDIA GPU object closely related to the former. Together, they represent a mapping of the NVMM memory.

That mapping took up to 7 ms per filter on 4K frames when several filters were enabled, so reducing that time was critical. Let's look at the situation in depth: for each frame, we take NVMM memory, convert it to a CUDA-compatible resource through a mapping, and finally undo that mapping before the next frame.
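To make that cost concrete, here is a minimal sketch of the per-frame cycle, assuming a Jetson-style setup where each NVMM buffer is exposed as a DMA file descriptor. NvEGLImageFromFd and NvDestroyEGLImage come from NVIDIA's Jetson multimedia stack (nvbuf_utils); error checks are omitted and a current CUDA context is assumed:

#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <cuda.h>
#include <cudaEGL.h>       /* cuGraphicsEGLRegisterImage, CUeglFrame */
#include "nvbuf_utils.h"   /* NvEGLImageFromFd, NvDestroyEGLImage (Jetson) */

/* One frame's worth of work: map the NVMM buffer (via its DMA FD) into a
 * CUDA-accessible frame, run the filter, then unmap. */
static void process_frame(EGLDisplay display, int dmabuf_fd)
{
    /* Wrap the DMA buffer in an EGLImage. */
    EGLImageKHR image = NvEGLImageFromFd(display, dmabuf_fd);

    /* Register the EGLImage with CUDA and fetch the mapped frame. */
    CUgraphicsResource resource = NULL;
    cuGraphicsEGLRegisterImage(&resource, image,
                               CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
    CUeglFrame frame;
    cuGraphicsResourceGetMappedEglFrame(&frame, resource, 0, 0);

    /* ... launch the CUDA filter kernels on frame.frame.pPitch[0] ... */

    /* Undo the mapping before the next frame: this round trip is where
     * the ~7 ms per filter was going. */
    cuGraphicsUnregisterResource(resource);
    NvDestroyEGLImage(display, image);
}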

What if we build an EGL-CUgraphicsResource buffer pool?

Those mappings consume a lot of time. We want to perform all our frame modifications within 33 ms of pipeline time, roughly one frame period at 30 fps, so we need to cut as much as possible.

We wondered whether we really needed to redo those mappings for each frame. After analyzing the DMA buffers' file descriptors (FDs), we found that the same few FDs kept reappearing: the source draws from a pool of buffers. This means that memory was not constantly created and destroyed but reused efficiently.

Once we understood this, we created a cache of EGLImageKHR, CUgraphicsResource, and CUeglFrame objects, all three associated with a unique file descriptor. In this way, we built a pool of CUDA resources where the mapping is done only the first time a new DMA FD is received; subsequent frames simply reuse the mapping of a previously seen DMA FD. nvarguscamerasrc reuses the buffers, and we can keep the mappings alive the whole time, since we don't touch that memory while the source element is working with it and vice versa, so the handling is safe. A sketch of the cache follows.
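This is a minimal sketch of such an FD-keyed cache, under the same Jetson-style assumptions as above; the structure and function names are illustrative, not our production code:

#include <unordered_map>
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <cuda.h>
#include <cudaEGL.h>
#include "nvbuf_utils.h"

/* Everything needed to reuse one buffer's mapping, keyed by its DMA FD. */
struct MappedBuffer {
    EGLImageKHR        image;
    CUgraphicsResource resource;
    CUeglFrame         frame;
};

static std::unordered_map<int, MappedBuffer> mapping_cache;

/* Map a buffer the first time its FD is seen; afterwards just return the
 * cached CUDA frame. Mappings are released once, when the element stops. */
static const CUeglFrame &get_mapped_frame(EGLDisplay display, int dmabuf_fd)
{
    auto it = mapping_cache.find(dmabuf_fd);
    if (it == mapping_cache.end()) {
        MappedBuffer buf = {};
        buf.image = NvEGLImageFromFd(display, dmabuf_fd);
        cuGraphicsEGLRegisterImage(&buf.resource, buf.image,
                                   CU_GRAPHICS_MAP_RESOURCE_FLAGS_NONE);
        cuGraphicsResourceGetMappedEglFrame(&buf.frame, buf.resource, 0, 0);
        it = mapping_cache.emplace(dmabuf_fd, buf).first;
    }
    return it->second.frame;
}

Here are the results: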

Without the new buffer pool:

Latency Statistics:
        0x55acb580e0.nvarguscamerasrc0.src|0x55acc12120.xvimagesink0.sink: mean=0:00:00.028224954 min=0:00:00.026479589 max=0:00:00.093086526

With the new buffer pool:

Latency Statistics:
        0x55cc28cd70.nvarguscamerasrc0.src|0x55cc356120.xvimagesink0.sink: mean=0:00:00.021942232 min=0:00:00.020832060 max=0:00:00.083427083

The mean end-to-end latency dropped from roughly 28 ms to 22 ms, saving about 6 ms per frame.

Memory latency: our second main issue

The memory layout matters when working with CUDA. Accessing 8-bit char data is much more efficient than 32-bit integers, and the speedup is noticeable. The same applies to color formats when we talk about frames.

The video filters we developed need to work with channel-specific data, so we eventually need to access a pixel's R, G, or B value unambiguously. However, that does not imply we must pass the data to the kernel as RGB.

NV12 is a 4:2:0 color format, occupying the same memory as planar YUV 4:2:0. Although the kernel cannot read an R, G, or B channel value directly from it, the format is still worth using. We developed a CUDA kernel helper that fetches the required color value from NV12 memory, applies the transformation, and stores the result back in NV12. This in-place color computation is cheap compared with the extra memory traffic that raw RGB data would require. Furthermore, in 4:2:0 subsampling, one pair of chrominance values is shared by a 2×2 block of luminance samples, so we can fetch the chrominance once, cache it, and then only access the luminance values.
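As an illustration of that helper's access pattern, here is a minimal CUDA sketch, assuming pitch-linear NV12 planes like those exposed by CUeglFrame; the kernel name and the placeholder transform are ours, not the product code. Note the bandwidth stakes at the pipeline's 3264×2464 resolution: NV12 needs 3264 × 2464 × 1.5 ≈ 12 MB per frame versus ≈ 32 MB for RGBA.

#include <stdint.h>

/* Illustrative in-place NV12 transform. Each thread owns a 2x2 luminance
 * block, so the shared chrominance pair is read and written exactly once.
 * Launch with one thread per chroma sample, e.g.:
 *   dim3 t(16, 16), b((width/2 + 15) / 16, (height/2 + 15) / 16);
 *   nv12_transform<<<b, t>>>(y, uv, width, height, y_pitch, uv_pitch);
 */
__global__ void nv12_transform(uint8_t *y_plane, uint8_t *uv_plane,
                               int width, int height,
                               int y_pitch, int uv_pitch)
{
    const int cx = blockIdx.x * blockDim.x + threadIdx.x; /* chroma column */
    const int cy = blockIdx.y * blockDim.y + threadIdx.y; /* chroma row    */
    if (cx >= width / 2 || cy >= height / 2)
        return;

    /* Fetch the chrominance pair once and cache it for the 2x2 luma block. */
    uint8_t *uv = uv_plane + cy * uv_pitch + 2 * cx;
    const int u = uv[0], v = uv[1];

    for (int dy = 0; dy < 2; ++dy)
        for (int dx = 0; dx < 2; ++dx) {
            uint8_t *y = y_plane + (2 * cy + dy) * y_pitch + (2 * cx + dx);
            /* Placeholder transform: the real filters derive the channel
             * they need from Y/U/V at this point and write it back. */
            int val = *y + (v - u) / 8;
            *y = (uint8_t)max(0, min(255, val));
        }

    /* Chrominance written back in place (identity in this sketch). */
    uv[0] = (uint8_t)u;
    uv[1] = (uint8_t)v;
}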

Here are the results:

Multiple filters RGBA:

GST_DEBUG="GST_TRACER:7" GST_TRACERS="latency(flags=pipeline+element)" GST_DEBUG_FILE=trace.log gst-launch-1.0 nvarguscamerasrc ! "video/x-raw(memory:NVMM),width=3264,height=2464,framerate=21/1" ! nvvidconv bl-output=false ! "video/x-raw(memory:NVMM),format=NV12" ! flufilter1correct ! nvvidconv ! "video/x-raw(memory:NVMM),format=RGBA" ! flufilter2correct ! nvvidconv ! xvimagesink
...
Latency Statistics:
        0x559fb4b390.nvarguscamerasrc0.src|0x559fc17e20.xvimagesink0.sink: mean=0:00:00.037013052 min=0:00:00.031175391 max=0:00:00.351914664

Using the RGBA format requires a color conversion from the camera's native NV12 output to RGBA.

Multiple filters NV12:

GST_DEBUG="GST_TRACER:7" GST_TRACERS="latency(flags=pipeline+element)" GST_DEBUG_FILE=trace.log gst-launch-1.0 nvarguscamerasrc ! "video/x-raw(memory:NVMM),width=3264,height=2464,framerate=21/1" ! nvvidconv bl-output=false ! "video/x-raw(memory:NVMM),format=NV12" ! flufilter1correct ! flufilter2correct ! nvvidconv ! xvimagesink

Latency Statistics:
        0x557014d370.nvarguscamerasrc0.src|0x55702199b0.xvimagesink0.sink: mean=0:00:00.030696960 min=0:00:00.028436184 max=0:00:00.274659614

The mean pipeline latency was reduced from 37 ms to 30 ms, bringing multi-filter processing under the 33 ms per-frame budget.

Fluendo's solutions for complex hardware challenges

Ultimately, we reduced the latency of the custom CUDA filters integrated as GStreamer plugins. This improvement enhanced the digital microscopes' performance and underscored our commitment to delivering high-quality, efficient solutions for complex, real-time systems.

Our expertise in handling projects like this demonstrates our ability to innovate in the hardware manufacturing and computer vision industries. If you are facing similar challenges or need expert assistance with your next project, don't hesitate to contact our team at Fluendo for more information and support.
