A few months ago, an exciting consultancy project arrived at the door of Fluendo. The customer ran a GStreamer-based application on an NVIDIA Jetson, an SBC (Single Board Computer) with GPU acceleration. The application was responsible for encoding the video from a camera, performing live streaming, and storing video clips and image captures on demand, all of them using the hardware acceleration provided by the Jetson. Two problems were reported:
- Sometimes the image capture takes too long, up to 4 seconds
- Sometimes there are some glitches in the recorded video
In this article, we will discuss how we helped to solve them.
Slow image captures
We analyzed the GStreamer logs provided by the client, and although we could identify these slow captures, we could not determine the cause. In any case, we came up with two hypotheses:
- It is a "nvjpegenc" fault: this is the NVIDIA element responsible for generating a JPEG image from the incoming stream. It performs the conversion using the GPU, and since the incoming stream is also allocated in GPU memory ("Memory: NVMM"), this element saves us a copy to the CPU.
- It is a storage fault: if the storage device performs poorly, this could be reflected in the image capture time, as the capture is done synchronously.
We asked the client to run some tests to confirm or rule out these hypotheses. Using "jpegenc" (conversion done on the CPU) instead of "nvjpegenc" did not change the behavior, so it did not seem to be a "nvjpegenc" problem.
The customer mentioned in passing that they were using an SD card as storage for the captures, and this turned out to be decisive. NAND flash memories need a lot of logic to maintain data integrity, and SD cards have a microcontroller that takes care of it. As a result, the write/read time is not always the same: sometimes the microcontroller needs time to reallocate blocks inside the NAND flash. Further benchmarks run on the SD card confirmed that this was the issue.
OK, we know the problem, but how do we fix it? The client told us that using a different storage system was not an option. Fortunately, they could deal with asynchronous captures as long as they were notified once the image had actually been written to the SD card.
We modified the application so that it does not wait for the image to be written, adding a notification for when the write finally completes. We also added a queue to absorb these write-time peaks, keeping the captures in RAM until they can all be written to the SD card.
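The pattern above can be sketched with the Python standard library. This is a minimal, hypothetical illustration (the class and callback names are ours, not the client's code): captures go into an in-memory queue, a background thread drains it toward the slow storage, and a callback fires once each file is really on disk.

```python
import queue
import threading

class AsyncCaptureWriter:
    """Keeps captures in RAM and writes them to slow storage in the
    background, notifying a callback once each one has been written."""

    def __init__(self, write_func, on_written):
        self._queue = queue.Queue()        # absorbs write-time peaks
        self._write_func = write_func      # e.g. writes bytes to the SD card
        self._on_written = on_written      # notification callback
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def capture(self, filename, data):
        # Returns immediately; the actual write happens asynchronously.
        self._queue.put((filename, data))

    def _drain(self):
        while True:
            filename, data = self._queue.get()
            if filename is None:           # sentinel pushed by close()
                break
            self._write_func(filename, data)  # may take seconds on a slow SD
            self._on_written(filename)        # tell the application it is safe

    def close(self):
        # The sentinel is queued after any pending captures, so all of
        # them are flushed before the worker exits.
        self._queue.put((None, None))
        self._worker.join()
```

The key design point is that `capture()` never blocks on storage latency; the occasional multi-second SD write only delays the background thread.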
One problem down, one to go :)
Glitches in the recorded video
The customer shared some videos with us in which we could observe the glitches. Most of them were missed frames, but some caught our attention because the frames seemed to be out of order.
The same issue with SD write times could explain the frame dropping, so we added a queue, which made the problem almost disappear.
On the other hand, the out-of-order glitch was more interesting. Our first hypothesis was that it was related to B-frames.
What B-frames are
A very common video compression technique is to compress not only each frame of the video individually but also the differences between frames. Why is this extra information important? Because it allows us to achieve better compression rates.
For example, one can send a full frame first and, after that, only the things that changed with respect to it. This works very well because, in a typical video, many things do not change between frames, like the background.
B-frames, in particular, use not only differences from previous frames but also from future ones. Frames can be categorized into three types:
- I-frames: these are full images. They can be compressed, but they do not use information from other frames.
- P-frames: use information from previous frames. They can achieve more compression than I-frames.
- B-frames: use information from both previous and future frames. These achieve the highest compression rate.
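To make the idea concrete, here is a toy sketch of inter-frame compression (our own illustration, not a real codec): the first frame is stored whole, like an I-frame, and each following frame stores only the pixels that changed relative to the previous one, like a P-frame. B-frames, which also reference future frames, are not modeled here.

```python
def delta_encode(frames):
    """Store the first frame whole ("I") and every later frame as the
    list of (index, value) pixel changes from its predecessor ("P")."""
    encoded = [("I", list(frames[0]))]
    for prev, cur in zip(frames, frames[1:]):
        changes = [(i, p) for i, (q, p) in enumerate(zip(prev, cur)) if q != p]
        encoded.append(("P", changes))
    return encoded

def delta_decode(encoded):
    """Rebuild every frame by applying each frame's changes to the
    previously decoded frame."""
    frames = [list(encoded[0][1])]
    for kind, changes in encoded[1:]:
        frame = list(frames[-1])
        for i, p in changes:
            frame[i] = p
        frames.append(frame)
    return frames
```

With a mostly static "video" of four-pixel frames, each P entry holds a single changed pixel instead of a full frame, which is where the compression win comes from.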
The image above is derived from Leandro Moreira's really cool digital video introduction.
However, after inspecting the encoder configuration ("omxh264enc"), we determined that B-frames were not being used.
We have an NVIDIA Jetson Nano in our lab, so we tried to replicate the issue in-house, as that would let us debug the problem more comfortably. We could not reproduce the exact same scenario because we did not have all the required hardware, but we built a simplified version of the client application, which, after some attempts, allowed us to reproduce a very similar issue.
We managed to isolate the issue, which only happened when using GPU memory in some parts of the pipeline.
This part of the pipeline was a "pad probe": a callback that is invoked whenever a buffer arrives at a pad. In this pad probe, a copy of the incoming buffer (the video frame) was made and pushed into an "appsrc," which injected the stream into the pipeline responsible for the recording.
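The structure can be modeled without GStreamer at all. In this deliberately simplified sketch (all names are ours), a `Pad` calls its probes for every buffer pushed through it, and the probe forwards a copy of the frame into a queue standing in for the "appsrc" of the recording pipeline.

```python
import queue

class Pad:
    """A minimal stand-in for a GStreamer pad with probe support."""

    def __init__(self):
        self._probes = []

    def add_probe(self, callback):
        self._probes.append(callback)

    def push(self, buffer):
        # Every probe sees the buffer before it continues downstream.
        for probe in self._probes:
            probe(buffer)
        return buffer

# Stand-in for the "appsrc" feeding the secondary (recording) pipeline.
recording_queue = queue.Queue()

def capture_probe(buffer):
    # The client's probe copied each incoming frame and pushed the copy
    # toward the recording pipeline. bytes() makes a genuine copy here;
    # the whole story below is about what happens when the "copy" isn't one.
    recording_queue.put(bytes(buffer))
```

In the real application the probe ran on the streaming thread, so whatever the "copy" operation actually does there matters a lot.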
We found this buffer copy suspicious: we are dealing with GPU memory, so its implementation is more complex than it would be with CPU memory. We inspected the code of NVIDIA's implementation and found that there is no implementation of "gst_buffer_copy()" for these buffers.
Therefore, GStreamer was using its fallback implementation, which just does a "memcpy()" of the surface handle (we call the GPU memory a "surface," and its handle is like a pointer to that surface). This means there is no real copy of the buffer, only a copy of a reference to the original one.
GStreamer has mechanisms to handle several references to the same buffer, avoiding unnecessary memory copies and preventing some parts of the pipeline from modifying buffers in use by others. The problem is that GStreamer thought it had a copy, not a reference, so none of those mechanisms were in effect: our buffer could be modified or even freed without us noticing.
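This "copy that is not a copy" is easy to demonstrate in miniature. In the sketch below (our own illustration, not the NVIDIA code), a buffer holds only a handle into GPU surface memory, and the fallback copy duplicates the handle rather than the surface contents, so the "copy" silently follows any later change to the surface.

```python
# GPU surface memory: a buffer never holds the pixels, only a handle.
surfaces = {0: "frame-A"}

class Buffer:
    def __init__(self, handle):
        self.handle = handle

    def read(self):
        return surfaces[self.handle]

def fallback_copy(buf):
    # What the fallback effectively did: memcpy() the buffer metadata,
    # i.e. the surface *handle*, not the surface contents.
    return Buffer(buf.handle)

original = Buffer(0)
fake_copy = fallback_copy(original)

# Upstream later reuses surface 0 for a newer frame...
surfaces[0] = "frame-B"
# ...and the "copy" now reads the new frame: it was never a real copy.
```

A real copy would have duplicated the surface contents (or, better, taken a tracked reference so nobody could reuse the surface underneath it).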
Here are some animations to visually understand what could happen when the references to a buffer are not handled properly:
What is happening:
What should be:
What should be (option 2, using GStreamer references):
Furthermore, when we inspected the code of the NVIDIA upstream elements (those on the left), we realized they use "surface pools" for the buffers they generate: they allocate a fixed number of surfaces in GPU memory and reuse them once the downstream elements (those on the right) are done with them.
What was actually happening in our case is the following:
- A buffer arrives at the pad probe.
- A "fake copy" is sent into the secondary pipeline and enqueued there for a while.
- The original buffer travels to the end of its pipeline, where it gets destroyed. Now the surface in the surface pool associated with this buffer can be used again.
- The previous steps can be repeated several times, leaving multiple "fake copies" enqueued in the secondary pipeline.
- At some point, the upstream element reuses one of those surfaces. It thinks it is doing the right thing, because the surfaces have already been released, but some of them are, in fact, still enqueued in the secondary pipeline.
- Now one of the enqueued surfaces has been modified and contains a newer frame than it should. The frame order has been altered.
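The steps above can be simulated in a few lines. This is a deliberately simplified model (pool of two surfaces, all fake copies enqueued before the recording branch drains them, names invented for the illustration), but it reproduces the same effect: by the time the recording branch reads its queue, the pool has wrapped around and the recorded frames no longer match the capture order.

```python
from collections import deque

POOL_SIZE = 2
pool = ["" for _ in range(POOL_SIZE)]   # fixed set of reusable GPU surfaces

pending_copies = deque()                # "fake copies" parked in pipeline 2

# Upstream produces frames, reusing surfaces round-robin once it believes
# downstream has released them; it never sees the fake copies.
for frame_number in range(4):
    surface_handle = frame_number % POOL_SIZE
    pool[surface_handle] = f"frame-{frame_number}"
    pending_copies.append(surface_handle)   # a fake copy is just the handle

# When the recording branch finally reads its queue, the early surfaces
# have already been overwritten with newer frames.
recorded = [pool[handle] for handle in pending_copies]
```

Here `recorded` comes out as `frame-2, frame-3, frame-2, frame-3`: the first two frames are gone, replaced by newer content, which is exactly the kind of out-of-order glitch seen in the client's recordings.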
Finally, we provided a solution to the client, which basically consists of changing "gst_buffer_copy()" to "gst_buffer_ref()," plus some additional minor changes to make everything work. We also reported the issue with NVIDIA's copy implementation for their plugins.
A multimedia application is a complex creature, with multiple threads sharing resources. Keeping good track of how these resources are used is essential for proper operation, and when one of these essential pieces fails, identifying the problem can be challenging.
This project was an excellent example of all of this and a success story for Fluendo's Consulting Services. Whether it is bug fixing, training, or the optimization of any multimedia product, our team of experts will gladly take your project to the next level. Contact us!