Raven: A 100% GPU-driven AI inference framework for real-time video and graphics

Written by Nacho Garcia

November 4, 2025

GPUs for AI: How graphics cards became the backbone of deep learning

Although neural networks have existed for many decades (with ideas going back to the 1950s and 60s), it wasn’t until the 1980s that the first deep architectures with several layers were developed. For decades, their practical use was limited due to two main barriers: a lack of computational power and the scarcity of large labeled datasets.

Everything changed in the late 2000s with the arrival of large datasets like ImageNet, and especially in 2012 with the publication of the famous AlexNet paper, which showed that a deep convolutional neural network trained on GPUs could outperform previous methods in image classification. It was at that moment that the community realized something crucial.

The key insight was that the operations dominating neural network training and inference (such as matrix multiplications and convolutions) fit surprisingly well with the architecture of graphics processing units (GPUs).

Originally designed for rendering 3D video game graphics, GPUs had quietly evolved into powerful parallel computing engines. NVIDIA’s launch of CUDA in 2007 allowed, for the first time, these GPUs to be programmed with a general-purpose computing language.

Soon, the community realized that training a convolutional neural network on a CUDA-enabled NVIDIA GPU took days instead of the weeks or months required on CPUs. This discovery reshaped the field of artificial intelligence, and modern AI frameworks quickly began to build around GPU acceleration. While CUDA remains the most widely adopted standard, other vendors have introduced alternatives: ROCm (AMD), OpenVINO (Intel), and DirectML (Microsoft), the latter implementing AI operators directly on top of DirectX 12 compute shaders.

… and GPUs for graphics

Even though modern GPUs are increasingly exploited for AI and deep learning, their original purpose — and still one of their strongest domains — is 3D graphics rendering. While more transistors in today’s GPUs are now explicitly dedicated to AI acceleration, the graphics side of the GPU continues to evolve rapidly. The visual power of modern GPUs is enormous.

Yet, a surprising barrier remains between AI and graphics. It is not a hardware limitation but a conceptual and framework-level separation: the “world of AI” and the “world of graphics” remain more isolated than they technically need to be.

When new technologies become popular, experts from adjacent fields are usually the first to adopt them. The pioneers of programming came from the world of mathematics, and it wasn’t until later that new generations arrived who were trained directly in programming without a prior mathematical background.

In the evolution of computer science itself, with increasingly specialized profiles, this is something we see constantly: first the new niche is “colonized” by experts from adjacent technologies, and later, by professionals trained directly in that new niche, who consolidate its development.

Early AI adopters often came from mathematics, statistics, and data science, using languages like MATLAB or Python. This explains why Python-based numerical libraries dominate today’s AI frameworks (TensorFlow, PyTorch), while graphics APIs like DirectX or Vulkan are rarely integrated.

It’s no surprise, then, that frameworks based on Python and focused on intensive numerical computing dominate the AI space today, with little presence of the GPU libraries more typical of graphics programming.

My path, however, was different. My background in mathematics and statistics is limited, if not nonexistent; instead, I spent years in the video game industry, developing graphics engines ranging from 2D systems for the Game Boy to advanced 3D engines for the Nintendo 3DS and Wii, using APIs like OpenGL and DirectX. That gave me deep experience with textures, polygons, shaders, and low-level GPU programming. Later, at Fluendo, I expanded into video and multimedia pipelines using the GStreamer framework.

With this background, I noticed something odd when we started working on real-time AI inference on video: even though video pipelines and AI inference often run on the same GPU, they felt like two completely separate worlds.

Not just in terminology (the AI world speaks of an “NCHW tensor” where the video world speaks of a “planar RGB buffer,” revealing how one side thinks of data as matrices and the other as visual representations), but in the fact that existing frameworks work in very isolated ways, as black boxes: there is no obvious way to, say, use a DirectX 12 or Vulkan texture directly as input to a model, nor any facility for building asynchronous hybrid systems with shared command lists and fences.

Some AI practices also struck me as odd. For example, rescaling an image with a compute shader, while perfectly possible and not necessarily slow, is at the very least strange from a graphics programmer’s point of view, because in graphics, rescaling is a native GPU task performed thousands of times, constantly.

There’s not a single texture in a 3D scene that isn’t being rescaled, rotated, and filtered in some way, because rescaling and projecting textures is the backbone of 3D graphics. GPUs have been ultra-optimized for this kind of operation for years.

More concerning to me was the constant flow of data between the GPU and the CPU, as well as the frequent active waits or blocks while waiting for the inference result.

In the world of 3D graphics engines, readback from the GPU (except in very special cases like picking, screenshot capture or occlusion queries) is practically nonexistent. Simply put, in a typical graphics engine, the CPU just fills the command list with the tasks the GPU should perform, launches it for execution, and forgets about it. Then, the CPU immediately starts preparing the next frame without waiting for the previous one to finish.

The GPU executes its work completely asynchronously with respect to the CPU. When it finishes, it flips the buffers to display the result on screen without the CPU intervening in that final phase.

This work pattern is in the DNA of graphics programmers: focus on preparing the command list as efficiently as possible, send it as quickly as possible, and immediately move on to preparing the next frame. This approach makes sense because GPUs, originally designed for real-time video game graphics, are especially optimized for this model of parallel, asynchronous, one-way execution.
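
To make that pattern concrete, here is a minimal Direct3D 12-style sketch of the “record, submit, move on” loop. It is illustrative only: RecordFrameCommands is a hypothetical placeholder, and real engines ring-buffer several command allocators instead of reusing a single one.

```cpp
// Illustrative "record, submit, move on" loop in the style of a graphics
// engine. RecordFrameCommands() is a hypothetical placeholder; real engines
// ring-buffer several allocators/command lists instead of reusing one.
#include <d3d12.h>

void RecordFrameCommands(ID3D12GraphicsCommandList* cmdList);  // hypothetical

void RenderLoop(ID3D12CommandQueue* queue,
                ID3D12CommandAllocator* cmdAlloc,
                ID3D12GraphicsCommandList* cmdList,
                ID3D12Fence* frameFence)
{
    UINT64 frameIndex = 0;
    for (;;)
    {
        cmdAlloc->Reset();                 // safe only once the GPU is done with it
        cmdList->Reset(cmdAlloc, nullptr);

        RecordFrameCommands(cmdList);      // fill the command list for this frame
        cmdList->Close();

        ID3D12CommandList* lists[] = { cmdList };
        queue->ExecuteCommandLists(1, lists);
        queue->Signal(frameFence, ++frameIndex);  // mark completion for later reuse

        // No CPU-side wait: the CPU immediately starts preparing the next
        // frame while the GPU works through this one asynchronously.
    }
}
```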

By contrast, most AI frameworks act like black boxes. They focus on running the neural network inference as fast as possible, but often ignore what happens before and after the model execution, especially in video + AI pipelines. As a result, it’s common to see inefficient practices such as:

  • GPU-to-CPU readbacks to perform post-processing tasks.
  • CPU-to-GPU uploads to return the processed data.
  • Forced GPU waits that stall the pipeline.

These practices may seem “harmless” under the assumption that most of the total compute time is spent on the AI model, but they are far from harmless when the goal is the most efficient video+AI pipeline possible. These waits and roundtrips come at a real cost, quite apart from making a graphics programmer, accustomed to the “unidirectional orthodoxy” of GPU programming, frown.

Why is it so bad for the GPU to stop and perform a readback?

GPUs are designed as massively parallel processors, optimized to scan through huge volumes of data in a continuous, uninterrupted flow. Their architecture thrives on keeping busy, filling execution units with work and never stopping. When the CPU requests data back from the GPU (such as inference results or a processed texture), a process called readback occurs. While sometimes necessary, readbacks break the natural GPU workflow and introduce major bottlenecks:

  1. Loss of parallelism: While the GPU waits for the readback to complete, it cannot continue executing commands that depend on the previous data. This introduces a bubble of inactivity: compute units and buses that could be processing data are now blocked waiting for earlier results to be consumed.

  2. Explicit synchronization: The CPU cannot access GPU memory directly. Data must first be copied into a CPU-accessible buffer. This forces both CPU and GPU to synchronize, breaking the ideal asynchronous execution model. The CPU waits for the GPU to finish, and the GPU waits for the CPU to read.

  3. Inherent bus latency: Interconnects like PCI Express are optimized for large, one-way transfers (CPU → GPU). Transfers in the opposite direction (GPU → CPU) are slower, less efficient, and have higher latency. Reading a small amount of data back can cost more time than writing hundreds of times that amount forward.

  4. Pipeline destruction: Modern GPUs operate like a deep pipeline (with tens or hundreds of stages), and that pipeline only works if the flow isn’t interrupted. Reading an intermediate result halts that flow: the pipeline is emptied, and then has to be refilled, a costly process in performance terms.
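
For reference, this is roughly what a blocking readback looks like in Direct3D 12 (an illustrative sketch with assumed resource names, not Raven code): the result is copied into a readback heap, the CPU blocks on a fence, and only then can it map and read the data.

```cpp
// Illustrative blocking readback in Direct3D 12. 'resultBuffer' lives in
// GPU-local memory; 'readbackBuffer' is assumed to be a buffer of the same
// size created on a D3D12_HEAP_TYPE_READBACK heap. Resource state
// transitions are omitted for brevity.
#include <windows.h>
#include <d3d12.h>

void BlockingReadback(ID3D12CommandQueue* queue,
                      ID3D12GraphicsCommandList* cmdList,
                      ID3D12Resource* resultBuffer,
                      ID3D12Resource* readbackBuffer,
                      ID3D12Fence* fence, UINT64 fenceValue,
                      HANDLE fenceEvent, SIZE_T byteSize)
{
    // 1. Ask the GPU to copy the result into CPU-visible memory.
    cmdList->CopyResource(readbackBuffer, resultBuffer);
    cmdList->Close();
    ID3D12CommandList* lists[] = { cmdList };
    queue->ExecuteCommandLists(1, lists);
    queue->Signal(fence, fenceValue);

    // 2. Block the CPU until the GPU reaches the fence: this is the stall
    //    (and the GPU pipeline bubble) described above.
    if (fence->GetCompletedValue() < fenceValue)
    {
        fence->SetEventOnCompletion(fenceValue, fenceEvent);
        WaitForSingleObject(fenceEvent, INFINITE);
    }

    // 3. Only now can the CPU map and read the data.
    void* data = nullptr;
    D3D12_RANGE readRange{ 0, byteSize };
    readbackBuffer->Map(0, &readRange, &data);
    // ... consume 'data' on the CPU, possibly re-upload a processed result ...
    readbackBuffer->Unmap(0, nullptr);
}
```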

The design of Raven as a 100% GPU-Driven framework

After years of iterations and development, we created Raven, our own AI inference framework designed to be 100% GPU-driven. Being GPU-driven is not just about accelerating graphics and AI with GPU hardware. It means the entire system is architected like a modern 3D graphics engine:

  • GPU-first memory management: Memory is allocated directly on the GPU.
  • Command list execution: All operations (graphics or AI) are translated into command lists and dispatched to the GPU.
  • Asynchronous pipelines: The GPU executes tasks autonomously, without CPU intervention or waiting for intermediate results.

This approach completely eliminates unnecessary transfers between GPU and CPU and costly synchronizations between pipeline stages.

It’s as if inference were naturally integrated into the same chain of operations that already make up rendering: just another step in the pipeline, as if it were an additional compute shader. Just like a Gaussian filter, a motion blur, or a color space conversion can be applied via compute shaders, in Raven, the AI model becomes yet another GPU processing block, operating on the same buffers, never leaving the graphics domain.

Because, after all, from the hardware’s point of view, an AI model is nothing more than a sequence of parallel compute operations, not so different from those used in complex shaders: matrices, convolutions, normalizations, activations.
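
In practice, this means preprocessing, inference, and postprocessing can all be recorded into one command list and submitted together. The sketch below illustrates the idea; the Raven* helpers are hypothetical placeholders, not Raven’s public API.

```cpp
// Conceptual sketch: inference recorded as just another GPU step in the same
// command list as the graphics work around it. The Raven* helpers below are
// hypothetical placeholders used only for illustration.
#include <d3d12.h>

void RavenRecordTextureToTensor(ID3D12GraphicsCommandList*, ID3D12Resource* src, ID3D12Resource* dst);
void RavenRecordInference(ID3D12GraphicsCommandList*, ID3D12Resource* in, ID3D12Resource* out);
void RavenRecordDrawDetections(ID3D12GraphicsCommandList*, ID3D12Resource* dets, ID3D12Resource* target);

void RecordVideoAiFrame(ID3D12GraphicsCommandList* cmdList,
                        ID3D12Resource* videoTexture,
                        ID3D12Resource* inputTensor,
                        ID3D12Resource* outputTensor,
                        ID3D12Resource* displayTarget)
{
    // 1. Preprocess: draw the video frame into the input tensor buffer
    //    (rescale, drop alpha, normalize) with a vertex/pixel shader pair.
    RavenRecordTextureToTensor(cmdList, videoTexture, inputTensor);

    // 2. Inference: the model's operators are dispatched on the same queue,
    //    reading inputTensor and writing outputTensor. No CPU round-trip.
    RavenRecordInference(cmdList, inputTensor, outputTensor);

    // 3. Postprocess: draw the detections (boxes, masks, labels) straight
    //    from the output tensor onto the display target.
    RavenRecordDrawDetections(cmdList, outputTensor, displayTarget);

    // The CPU never sees the intermediate results; it only submits this list.
}
```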

The result is a fully autonomous pipeline, with no blocking points, no readbacks, and no CPU → GPU → CPU bottlenecks, where inference happens within the same continuous flow of video or image processing. This vision, inherited from traditional graphics design, makes it possible to reach levels of performance and efficiency that would be impossible with a more conventional architecture, and it is the core of Raven’s philosophy.

Raven is a hybrid engine

Internally, Raven manages specialized devices that abstract rendering capabilities (via APIs like DirectX 12 or Vulkan) and others that abstract compute capabilities for AI (via CUDA, DirectML, ROCm, etc.).

This distinction is technical and reflects how modern GPUs combine graphics units (rasterization, texturing, blending) with general-purpose compute units designed for neural networks. Raven decides which device to use for each operation through an internal heuristic.

In most cases, both the render device and the AI device run on the same GPU, often even sharing GPU queues. When that’s not possible (for example, in multi-GPU systems or when inference runs on an external NPU), Raven uses efficient inter-device transfer mechanisms, such as DirectX 12 cross-adapter resources, to share memory asynchronously without CPU intervention.

Raven’s render devices provide full rasterization capabilities, similar to those of a 3D graphics engine, but focused on data transformation rather than realistic scene rendering. The objective is not to simulate complex materials, but to exploit the GPU in what it does best: transform and visualize data at massive parallel scale. Raven is unique in this: its render devices “understand” tensors.

A tensor for AI is nothing more than a matrix of data (for example, float32[N, C, H, W]), which can naturally be represented as a buffer in the render world. Modern APIs like DirectX 12 and Vulkan allow aliasing between buffers and textures, so the same memory block can be used both as a texture (for reading in a shader) and as an unordered access view (for writing from another shader or pipeline stage). This capability is critical for Raven and is exploited to the fullest.
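
As a simple illustration of treating a tensor as an ordinary GPU resource (assumed shape and flags, not Raven’s internals), a float32[1, 3, H, W] tensor can be allocated as a plain Direct3D 12 buffer that allows unordered access, so shaders can later read it as a shader resource or write to it through a UAV:

```cpp
// Sketch: allocating a float32[1, 3, H, W] tensor as a plain GPU buffer that
// can be read as a shader resource and written as a UAV. Illustrative only.
#include <windows.h>
#include <d3d12.h>
#include <cstdint>

ID3D12Resource* CreateTensorBuffer(ID3D12Device* device, uint32_t height, uint32_t width)
{
    const uint64_t byteSize = 1ull * 3 * height * width * sizeof(float);

    D3D12_HEAP_PROPERTIES heapProps = {};
    heapProps.Type = D3D12_HEAP_TYPE_DEFAULT;        // GPU-local memory

    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension        = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width            = byteSize;
    desc.Height           = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels        = 1;
    desc.Format           = DXGI_FORMAT_UNKNOWN;
    desc.SampleDesc.Count = 1;
    desc.Layout           = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;
    // Allow UAV access so compute or pixel shaders can write into the tensor.
    desc.Flags            = D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS;

    ID3D12Resource* buffer = nullptr;
    device->CreateCommittedResource(&heapProps, D3D12_HEAP_FLAG_NONE, &desc,
                                    D3D12_RESOURCE_STATE_COMMON, nullptr,
                                    IID_PPV_ARGS(&buffer));
    return buffer;
}
```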

For example, imagine a video texture as input to a model. A traditional approach would involve:

  1. Downloading the video texture to the CPU.

  2. Converting it into a matrix.

  3. Uploading it back to the GPU as an RGB float32 tensor.

  4. Rescaling it, normalizing it, removing the alpha channel…

This type of pipeline is surprisingly common, even in commercial solutions, and even when steps 2 and 3 are performed within the GPU, the problem remains: conceptually, the “two worlds” are still separated, AI on one side and graphics on the other.

Raven, on the other hand, draws the texture directly onto the destination tensor, which is simply a GPU buffer.

And it does this in a single DrawCall, fully leveraging the GPU’s graphics capabilities:

  • Applies rescaling
  • Removes the alpha channel if necessary
  • Applies bilinear filtering
  • Adds letterboxing (padding) if needed

Everything happens inside the GPU itself, with no round-trip to the CPU and no intermediate conversions. A simple vertex and pixel shader pair performs the entire operation.
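
A hedged sketch of how such a pass could be recorded on the host side is shown below. Pipeline state, root signature, and render-target setup are omitted, and the binding slots are assumptions made for illustration; the point is that a single draw over a fullscreen triangle performs the whole conversion.

```cpp
// Sketch: recording a single draw that samples the video texture and writes
// the preprocessed pixels into the tensor buffer. The PSO is assumed to pair
// a fullscreen-triangle vertex shader with a pixel shader that rescales,
// drops alpha, normalizes, and writes float values through a UAV over the
// tensor buffer. Binding layout is an assumption; render-target setup
// (UAV-only rendering) is omitted.
#include <windows.h>
#include <d3d12.h>

void RecordTextureToTensorDraw(ID3D12GraphicsCommandList* cmdList,
                               ID3D12PipelineState* preprocessPso,
                               ID3D12RootSignature* rootSig,
                               D3D12_GPU_DESCRIPTOR_HANDLE videoTextureSrv,
                               D3D12_GPU_DESCRIPTOR_HANDLE tensorUav,
                               UINT modelWidth, UINT modelHeight)
{
    cmdList->SetPipelineState(preprocessPso);
    cmdList->SetGraphicsRootSignature(rootSig);

    // Slot assignments are assumptions for this sketch.
    cmdList->SetGraphicsRootDescriptorTable(0, videoTextureSrv); // video frame (SRV)
    cmdList->SetGraphicsRootDescriptorTable(1, tensorUav);       // tensor buffer (UAV)

    // The viewport matches the model's input resolution; bilinear sampling in
    // the pixel shader performs the rescale, and letterboxing can be handled
    // by adjusting the viewport or the shader's UV mapping.
    D3D12_VIEWPORT viewport{ 0.0f, 0.0f,
                             static_cast<float>(modelWidth),
                             static_cast<float>(modelHeight),
                             0.0f, 1.0f };
    D3D12_RECT scissor{ 0, 0, static_cast<LONG>(modelWidth), static_cast<LONG>(modelHeight) };
    cmdList->RSSetViewports(1, &viewport);
    cmdList->RSSetScissorRects(1, &scissor);

    cmdList->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
    cmdList->DrawInstanced(3, 1, 0, 0);   // one fullscreen triangle, one draw call
}
```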

This philosophy of using the GPU’s rendering and compute capabilities interchangeably is applied everywhere: converting color spaces, going from tensor to texture and from texture to tensor (HWC or CHW formats), all kinds of filters, drawing bounding boxes, suppression algorithms, and much more.

Dynamic tensors on GPU

In many AI frameworks, tensors are sent to the GPU, but the shape information (which defines how the tensor should be interpreted) often remains on the CPU. At first glance, this seems like a minor detail, but it creates an unnecessary synchronization point. The problem becomes evident with dynamic tensors whose shape is only known after the AI model has executed.

This creates a synchronization point because when an AI model with dynamic output is chained with another task that depends on that output, these frameworks are forced to synchronize the tensor with the CPU just to obtain its shape and continue the pipeline.

Raven, on the other hand, implements “pure” GPU tensors: the current shape information is encoded within the same buffer that contains the data. This allows several operations to be chained in a row without synchronizing any data with the CPU. For example, an inference detects objects in an image, the NMS algorithm filters the valid boxes, and the dynamic tensor with the final boxes is passed directly to other shaders, which read the resulting matrix themselves and process it without any CPU involvement.
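
A minimal sketch of what such a self-describing GPU tensor might look like (the exact layout here is an assumption for illustration, not Raven’s actual format): a small header carrying the rank and dimensions lives in the same buffer as the data, so any consumer shader can read the current shape without CPU help.

```cpp
// Sketch of a self-describing "pure GPU" tensor layout: shape metadata lives
// in the same buffer as the data, so downstream shaders can read the current
// dimensions without a CPU round-trip. The exact layout is an assumption.
#include <cstdint>

constexpr uint32_t kMaxRank = 4;

struct GpuTensorHeader
{
    uint32_t rank;             // e.g. 2 for an [N, 5] box tensor
    uint32_t dims[kMaxRank];   // current (possibly dynamic) shape, e.g. {37, 5}
    uint32_t elementSize;      // bytes per element, e.g. 4 for float32
    uint32_t dataOffset;       // byte offset of the data region in the buffer
};
// Buffer contents: [GpuTensorHeader][element data...].
//
// A consumer shader (NMS, box drawing, etc.) reads dims[0] from the header to
// know how many boxes exist, instead of asking the CPU. In HLSL this maps to
// a ByteAddressBuffer/StructuredBuffer load of the same layout.
```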

Exploitation of modern capabilities: DrawInstanced, ExecuteIndirect, Mesh Shaders, and GPU predication

One of the most striking differences between CPU and GPU programming is the execution paradigm. On the CPU we are used to loops that iterate over sequential structures (for example, a for (int i = 0; i < N; ++i) to process N boxes); on the GPU, you don’t use a loop to process N boxes at all.

Instead, a massive wave of threads is launched in parallel, where each thread knows its index and directly processes the corresponding element i. This execution model is what allows the GPU to process thousands of entities in parallel efficiently.
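
The contrast looks roughly like this (a sketch assuming a compute shader declared with [numthreads(64, 1, 1)] that processes one box per thread):

```cpp
#include <d3d12.h>

// CPU-style: a single thread walks the boxes sequentially.
void ProcessBoxesCpu(const float* boxes, int numBoxes)
{
    for (int i = 0; i < numBoxes; ++i)
    {
        // process boxes[i * 5 .. i * 5 + 4]  (x1, y1, x2, y2, score)
    }
}

// GPU-style: no loop on the host. One Dispatch launches enough thread groups
// to cover all boxes; inside the shader, each thread uses its global thread
// ID as the index 'i'. The 64 threads per group is an assumption that must
// match the shader's [numthreads(64, 1, 1)] declaration.
void ProcessBoxesGpu(ID3D12GraphicsCommandList* cmdList, unsigned numBoxes)
{
    const unsigned groups = (numBoxes + 63) / 64;   // ceil(numBoxes / 64)
    cmdList->Dispatch(groups, 1, 1);
}
```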

Raven relies on the modern capabilities of the graphics pipeline, such as:

  • DrawInstanced / DrawIndexedInstanced: to launch thousands of repeated geometries (for example, boxes or sprites), varying their parameters through buffers
  • ExecuteIndirect: so that the GPU itself decides in real time how many instances to draw, without CPU intervention
  • Mesh Shaders: to generate geometry on the fly directly from structured buffers (like a box tensor)
  • GPU Predication: to enable or disable the drawing of an instance based on conditions stored in GPU memory, useful for visibility, filtering, or score thresholds

Thanks to the fact that Raven keeps the full tensor information in GPU memory, including its shape, the render system can directly access the detection buffer and do the following:

  1. Read the number of boxes to draw from the tensor’s shape

  2. For each box, access its coordinates and score (x1, y1, x2, y2, score)

  3. Draw each one as a rectangle, an outline, a mask, or even a text annotation, all without ever going through the CPU

This allows, for example:

  • When an inference finishes, the list of detections can be passed directly to a DrawInstanced or DispatchMesh, without CPU intervention
  • If you want to visually filter out boxes with a low score, you can do it via predication or by writing to the Z-buffer to leverage early-z rejection, all from shaders
  • If you want to obscure part of the detected content (such as in face anonymization), it can be done directly using masks in DrawInstanced, just like with particles or 2D sprites
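
As an illustration of the ExecuteIndirect path (a hedged sketch: buffer creation, root signature, and resource-state transitions are omitted, and the argument record is assumed to be filled by a prior GPU pass), the number of boxes drawn never has to touch the CPU:

```cpp
// Sketch: letting the GPU decide how many box instances to draw. A previous
// GPU pass (e.g. right after NMS) is assumed to have written a single
// D3D12_DRAW_ARGUMENTS record into 'argsBuffer', with InstanceCount set to
// the number of surviving boxes.
#include <windows.h>
#include <d3d12.h>

ID3D12CommandSignature* CreateDrawSignature(ID3D12Device* device)
{
    D3D12_INDIRECT_ARGUMENT_DESC arg = {};
    arg.Type = D3D12_INDIRECT_ARGUMENT_TYPE_DRAW;    // plain DrawInstanced arguments

    D3D12_COMMAND_SIGNATURE_DESC desc = {};
    desc.ByteStride       = sizeof(D3D12_DRAW_ARGUMENTS);
    desc.NumArgumentDescs = 1;
    desc.pArgumentDescs   = &arg;

    ID3D12CommandSignature* signature = nullptr;
    device->CreateCommandSignature(&desc, nullptr, IID_PPV_ARGS(&signature));
    return signature;
}

void DrawDetectedBoxes(ID3D12GraphicsCommandList* cmdList,
                       ID3D12CommandSignature* drawSignature,
                       ID3D12Resource* argsBuffer)   // GPU-written draw arguments
{
    // The GPU wrote, for example:
    //   { VertexCountPerInstance = 6,          // two triangles per box
    //     InstanceCount          = numBoxes,   // decided on the GPU
    //     StartVertexLocation    = 0,
    //     StartInstanceLocation  = 0 }
    // so the instance count never travels back to the CPU.
    cmdList->ExecuteIndirect(drawSignature, 1, argsBuffer, 0, nullptr, 0);
}
```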

GStreamer integration

As we’ve seen, the entire design of Raven leads to a 100% GPU-Driven pipeline, where the CPU first fills the Command Lists for the GPU as efficiently as possible, launches the execution, and then “forgets about it,” just like in a typical video game engine.

But how is this realized in GStreamer?

GStreamer is fundamental for us: as pioneers of this multimedia framework, it is naturally our preferred solution for building video pipelines. Raven includes a module for integration with GStreamer, plus a plugin framework module that lets us conveniently develop Raven-based AI plugins for GStreamer.

But how does it work?

With fences, of course. When a GstBuffer enters Raven, the engine quickly processes it to extract the texture handle, allocates the resulting texture, and returns a new GstBuffer wrapping that texture, which, at that point, is not yet ready.

That new GstBuffer includes a fence that the engine has preemptively inserted as a semaphore signaling its future completion.

From GStreamer’s point of view, however, this new GstBuffer is emitted almost immediately, typically in under 70 microseconds.

That is, the GStreamer pipeline can continue almost instantly, with near-zero latency measured from the moment the GstBuffer to be processed arrives at the element’s sink pad to the moment a new GstBuffer is emitted on the element’s src pad.

This near-zero latency is what we’ve decided to call shadow performance.

Shadow performance

The GStreamer pipeline instantly receives each new GstBuffer. While the buffer is “in flight,” Raven runs several parallel threads to generate GPU command lists. Within a few microseconds, the CPU steps aside and the GPU takes over the entire workload. On the GStreamer side, the system doesn’t block either: it simply enqueues the fence into its own command list (e.g., in the sink stage), without requiring any CPU-side waiting.

Because the AI inference runs completely asynchronously with respect to the rest of the video pipeline, a surprising effect can occur: depending on the hardware, model complexity, and pipeline structure, the end-to-end latency can be effectively zero. By the time GStreamer “needs” the buffer, Raven has already finished processing it in parallel. That’s why we say “shadow performance”: it creates scenarios in which it’s not possible to directly measure the AI plugin’s processing cost just by observing the latency of the whole pipeline.
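
Assuming a Direct3D 12 backend, the key mechanism is the GPU-side fence wait: instead of blocking a CPU thread until the fence is signaled, the consuming queue is told to wait for it, so the dependency is resolved entirely on the GPU. The sketch below contrasts the two forms; it is illustrative, not Raven’s or GStreamer’s actual code.

```cpp
// Sketch of the difference between blocking the CPU on a fence and enqueueing
// a GPU-side wait. The fence-based integration described above relies on the
// second form: the downstream queue waits on the fence, the CPU never does.
#include <windows.h>
#include <d3d12.h>

// CPU-side wait: stalls the calling thread until the GPU reaches fenceValue.
void WaitOnCpu(ID3D12Fence* fence, UINT64 fenceValue, HANDLE event)
{
    if (fence->GetCompletedValue() < fenceValue)
    {
        fence->SetEventOnCompletion(fenceValue, event);
        WaitForSingleObject(event, INFINITE);   // CPU blocked here
    }
}

// GPU-side wait: the consumer queue will not start its submitted work until
// the fence reaches fenceValue, but this call returns immediately and the CPU
// keeps feeding the pipeline. This is the "enqueue the fence" step above.
void WaitOnGpu(ID3D12CommandQueue* consumerQueue, ID3D12Fence* fence, UINT64 fenceValue)
{
    consumerQueue->Wait(fence, fenceValue);     // non-blocking for the CPU
    // ... ExecuteCommandLists() for the work that consumes the buffer ...
}
```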

Conclusion

Designing Raven has been a fascinating journey. It combined the worlds of graphics and AI, blending them into a single, fluid system that runs entirely on the GPU. What began as an effort to eliminate inefficiencies and avoid CPU-GPU roundtrips became a complete rethinking of how real-time AI inference can and should work.

Raven is 100% GPU-driven. That means no blocking, no readbacks, no staging between CPU and GPU. Instead, inference becomes just another step in a unified, graphics-style pipeline. From video input to model execution to drawing results, everything happens in GPU memory, using GPU execution units, without leaving the GPU domain at any point.

This design not only maximizes performance, but it also changes the mindset. The GPU is no longer split into separate “graphics” and “AI” roles. It becomes a single, powerful engine capable of handling both domains with the same efficiency, the same language, and the same execution model.

Raven is what happens when you stop treating graphics and AI as separate problems and start building pipelines where they truly work as one. If you’d like to see what Raven can do for your workflow, get in touch with us today.