
NMS-Raster: Post-processing bounding boxes using the “G” in GPU

Written by Nacho Garcia, October 28, 2025
Table of contents
- NMS-Raster: Post-processing bounding boxes using the “G” in GPU
- The bounding box problem
- Classical NMS
- NMS-Raster: Z-Aware rasterized suppression
- Advantages of the approach
- Implementation
- Strengths of the design
- Performance
NMS-Raster: Post-processing bounding boxes using the “G” in GPU
In this article, we explore the challenges of bounding box post-processing in AI-powered object detection and demonstrate how our high-performance inference engine, Raven, addresses them from a novel angle. By leveraging the GPU’s rendering capabilities, Raven eliminates costly CPU/GPU synchronization, achieving faster and more efficient object detection without sacrificing accuracy.
The bounding box problem in object detection
Modern object detection models do not produce a single bounding box per object. Instead, they generate dozens, hundreds, or even thousands of candidate boxes per image. This apparent redundancy is not a flaw: it reflects how these models evaluate the probability of an object being present across the image, whether via fixed grid-based predictions or advanced adaptive architectures.
This means that a single object, slightly shifted or scaled, can generate multiple detections, all with reasonably high scores. If multiple objects are present in the scene, the number of boxes increases rapidly.

In a trivial case with only one object, retaining the bounding box with the highest score would be enough. However, real-world scenarios are much more complex: most images contain several simultaneous objects, often overlapping or close together, and the model doesn’t know how many there are in advance. This introduces a key challenge: deciding which boxes belong to which object.
This is where post-processing algorithms come into play. Techniques such as Non-Maximum Suppression (NMS) and its variants are widely used to filter, group, or suppress redundant bounding boxes, leaving only the most accurate and unambiguous representation of each detected object.
Classical NMS: Limitations and challenges
The traditional Non-Maximum Suppression (NMS) algorithm is one of the most widely adopted methods for filtering redundant bounding boxes in AI object detection. Its core logic is simple yet effective: detections are sorted by confidence score, and boxes that overlap too much with already selected ones are discarded. Standard NMS works as follows:
- All bounding boxes are sorted by score, descending
- The first box (highest score) is selected and added to the survivor list
- All other boxes with a high overlap (IoU) with the selected one are removed
- Repeat with the remaining boxes until none are left
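The greedy procedure above can be sketched in a few lines of NumPy. This is a minimal CPU illustration of classical NMS, not Raven's code; the name `classical_nms` and its `iou_threshold` parameter are ours:

```python
import numpy as np

def classical_nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: keep the best box, drop heavy overlaps, repeat.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    Returns the indices of surviving boxes.
    """
    order = np.argsort(scores)[::-1]          # sort by score, descending
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(int(best))
        rest = order[1:]
        # IoU of the selected box against every remaining candidate
        x1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_best = (boxes[best, 2] - boxes[best, 0]) * (boxes[best, 3] - boxes[best, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_best + area_rest - inter)
        order = rest[iou <= iou_threshold]    # suppress heavy overlaps
    return keep
```

Note how each surviving box still has to be compared against every remaining candidate inside the loop: this pairwise structure is exactly what makes the algorithm quadratic in the worst case and awkward to parallelize.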
While straightforward, classical NMS has a critical limitation: its computational complexity is quadratic in the worst-case scenario, as each box may need to be compared with every other box. This slows down processing and makes parallel GPU execution challenging.
Some GPU-based implementations use compute kernels to accelerate NMS but still rely on the same sequential logic. These approaches require managing shared data structures, atomic operations, and thread synchronization, which can become bottlenecks and run counter to the massively parallel, contention-free execution model the GPU's rendering pipeline is built for.
This is where Raven takes a completely different path. If GPUs can efficiently render millions of triangles without overlap using the Z-buffer, why not leverage the same principle to determine which bounding boxes survive?
This is the core idea behind our NMS-Raster algorithm: reframing the problem as one that can be resolved graphically.
NMS-Raster: Z-Aware rasterized suppression
Traditionally, GPU-based post-processing algorithms are implemented as compute shaders or CUDA kernels. While “GPU-accelerated AI” is widely used, a crucial detail is often overlooked: the “G” in GPU stands for Graphics.
GPUs are not just designed to multiply matrices or execute parallel kernels. In fact, their architecture is highly optimized for drawing: vertices, triangles, pixels, depth testing, blending, and more. Operations that, while seemingly unrelated to AI, can be incredibly powerful if the problem is reframed from a graphical perspective.
This is where Raven’s hybrid engine provides a major advantage. Raven can seamlessly leverage AI compute and rendering devices, efficiently managing data flow between them. In many modern GPUs, these are physically the same device (e.g., supporting DirectX 12 and DirectML or CUDA), eliminating unnecessary data transfer overhead.
Building on this principle, Raven reimagines bounding box suppression as a rendering task. Instead of looping over and comparing boxes manually, we draw them (literally) into a framebuffer. Each box is rendered as a quad (two triangles), with a depth (Z) value based on its score: the higher the score, the closer it appears in the depth buffer.
The GPU’s Z-buffer resolves visibility naturally: when boxes overlap, the one with the highest score “wins” each pixel, without any explicit loops or comparisons. This Z-aware rasterization enables highly parallel, ultra-fast suppression, fully exploiting the GPU’s native graphics capabilities for real-time AI object detection.
Advantages of NMS-Raster
This approach offers several major benefits:
- No CPU/GPU transfers or sync: Everything happens within the GPU’s rendering pipeline.
- No dynamic memory, mutexes, or shared lists: The Z-buffer acts as a perfect arbiter.
- Better performance: It turns a quadratic-time algorithm (pairwise comparisons) into a linear-time one.
- Hardware-accelerated: Modern GPUs can draw millions of primitives per second, and Z-culling is done in hardware at near-zero cost.
Although NMS-Raster is not mathematically identical to classical NMS, its functional effect is the same: keep the box with the highest score in each overlapping region and discard the rest. It also opens the door to new visual suppression criteria (like per-pixel intensity or covered area) that are hard to implement in traditional algorithms but come naturally in a graphical context.
Implementation of NMS-Raster in Raven
The NMS-Raster implementation in Raven has been designed to be modular, flexible, and reusable. It enables seamless integration with tensors produced by different AI object detection models without requiring custom adaptation.
This is achieved through a “view” abstraction over the detection tensor, which interprets its layout to extract the bounding box coordinates, associated score, and any required normalization.
This allows us to support outputs in various formats (e.g., normalized [0,1] or absolute pixel coordinates) and unify them without copying or transforming the data—just interpreting it in place.
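As a rough illustration of the view idea, the helper below reinterprets a raw (N, 5) detection tensor as pixel-space boxes plus scores. The name `box_view`, the assumed `[x1, y1, x2, y2, score]` column layout, and the eager rescaling are ours for the sketch; Raven's actual view interprets the tensor in place without materializing a converted copy:

```python
import numpy as np

def box_view(tensor, normalized=True, image_size=(640, 640)):
    """Interpret an (N, 5) detection tensor as pixel-space boxes + scores.

    Assumes columns [x1, y1, x2, y2, score]. If the model emits normalized
    [0, 1] coordinates, rescale them to absolute pixels; otherwise pass
    the box columns through untouched.
    """
    boxes, scores = tensor[:, :4], tensor[:, 4]
    if normalized:
        w, h = image_size
        boxes = boxes * np.array([w, h, w, h], dtype=tensor.dtype)
    return boxes, scores
```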
Once the view is established, the GPU-driven algorithm follows these steps:
1. Instanced drawing into integer framebuffer
In NMS-Raster, each bounding box is rendered as an instanced 2D quad into an integer framebuffer. Importantly, this framebuffer is used for data processing rather than visualization, enabling GPU-native bounding box suppression. Key points of this step:
- Each instance writes its unique ID (tensor index) into the framebuffer
- The score is encoded as the Z coordinate, used for depth testing.
- The Z-buffer is set to GREATER, so the box with the highest score overwrites others when overlapping.
At the end of the render pass, each pixel in the framebuffer holds the ID of the most visible box for that pixel: in other words, the one with the highest score. No manual comparison took place: the GPU handled everything natively.
Fig. 1: Boxes are rendered into an INT framebuffer, using the score as the Z value. The surviving pixels in the framebuffer directly indicate the surviving box ID. The framebuffer is pre-cleared with -1 to mark invalid IDs
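The effect of this render pass can be emulated on the CPU with an explicit ID buffer and depth buffer. This is only a reference model of the GREATER depth test (`rasterize_ids` is an illustrative name); on the GPU there is no sequential loop, since all instances are drawn in parallel and the hardware depth test arbitrates each pixel:

```python
import numpy as np

def rasterize_ids(boxes, scores, fb_size=(64, 64)):
    """CPU emulation of the instanced-draw pass.

    Each box is a quad whose Z is its score; the depth test is GREATER,
    so the highest-scoring box wins every contested pixel.
    boxes: integer pixel rects [x1, y1, x2, y2].
    Returns the integer framebuffer: -1 = no box, else the winning box ID.
    """
    h, w = fb_size
    id_buffer = np.full((h, w), -1, dtype=np.int32)   # cleared to invalid ID
    z_buffer = np.full((h, w), -np.inf)               # depth = score
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        region = np.s_[y1:y2, x1:x2]
        wins = scores[i] > z_buffer[region]           # depth test: GREATER
        id_buffer[region][wins] = i                   # write the box ID
        z_buffer[region][wins] = scores[i]            # update the depth
    return id_buffer
```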
2. Contribution histogram
After instanced rendering, the integer framebuffer is processed using a parallel compute shader to quantify the effective visibility of each bounding box.
This produces a histogram tensor that measures effective visibility per box, expressed as the number of pixels it dominated.
Fig. 2: Per-box histogram showing the total count of pixels where each box ID was the final winner in the framebuffer
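In NumPy terms, this reduction collapses to a single call: `np.bincount` plays the role of the compute shader's atomic-add histogram (the helper name is ours, not Raven's API):

```python
import numpy as np

def contribution_histogram(id_buffer, num_boxes):
    """Count, per box ID, the pixels where that box won the depth test."""
    ids = id_buffer[id_buffer >= 0]                  # drop the -1 background
    return np.bincount(ids, minlength=num_boxes)
```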
3. Threshold filtering
Once the contribution histogram is computed, a GPU compute shader applies a configurable threshold to determine which bounding boxes are retained. For example, this threshold might specify the minimum percentage of a box’s area that must be visible to survive suppression.
The result is a boolean tensor indicating which bounding boxes should be kept. This tensor can be passed directly to the next stage in the inference pipeline.
Fig. 3: Final boolean “survivors” tensor, indicating 0 or 1 for each box ID depending on whether the exposed surface exceeds a user-defined threshold
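The area-percentage example from the text can be sketched as follows; `min_visible_ratio` is an illustrative name for the configurable threshold, not a Raven parameter:

```python
import numpy as np

def filter_survivors(histogram, boxes, min_visible_ratio=0.3):
    """Keep boxes whose winning-pixel count covers enough of their area.

    histogram: per-box count of pixels won in the framebuffer.
    boxes: (N, 4) pixel rects [x1, y1, x2, y2].
    Returns the boolean 'survivors' tensor.
    """
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return histogram >= min_visible_ratio * areas
```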
Strengths of the design
The NMS-Raster pipeline is not only highly efficient and fully parallel, but it also offers several key advantages for GPU-accelerated object detection:
- No CPU/GPU synchronization: No data ever leaves the GPU.
- Linear scalability: No nested loops or pairwise comparisons.
- Reusability across models: Thanks to the abstract tensor view.
- Determinism and debuggability: The framebuffer can be visualized for inspection at any time.
Additionally, because the algorithm relies on instanced drawing, the cost of processing 1,000 or 10,000 boxes is nearly identical. Performance is primarily determined by the framebuffer resolution rather than the number of detections, making NMS-Raster particularly well-suited for dense scenes, real-time video pipelines, and high-performance AI inference applications.
Performance
One of the standout advantages of NMS-Raster is its exceptional performance scalability, both in terms of execution time and architectural efficiency. We benchmarked the algorithm across multiple hardware configurations, including a discrete NVIDIA RTX 3060, a budget Radeon RX 6400, an integrated Intel UHD 770, and a parallelized CPU implementation on an Intel Core i9.
The chart below shows execution time as a function of the number of bounding boxes processed, grouped in batches of 250. Each configuration includes all steps of the algorithm: instanced rendering of bounding boxes with Z-buffer prioritization, histogram construction, and survivor filtering.

As the graph demonstrates, GPU-based execution scales almost linearly with the number of bounding boxes, while the CPU implementation exhibits quadratic growth, quickly becoming impractical for more than a few thousand boxes. The RTX 3060 maintains sub-millisecond latency even with 4,000 boxes, making real-time processing feasible in high-throughput applications.
However, the most critical performance gain is not just speed, but where the computation takes place. Unlike traditional NMS methods that rely on CPU post-processing (requiring tensor downloads, synchronization barriers, and often memory format conversions) NMS-Raster runs entirely inside the GPU, fully GPU-driven from input to output. This architectural advantage eliminates one of the most significant bottlenecks in modern inference pipelines: the GPU↔CPU synchronization gap.
Avoiding CPU roundtrips not only saves transfer time, but also removes latency and contention introduced by synchronization primitives or thread coordination. In scenarios like edge devices, video pipelines, or multi-stream inference, this benefit becomes even more important than raw throughput: it enables true GPU-driven post-processing, where detection, suppression, and rendering all remain inside the GPU domain.
In short, NMS-Raster is not just faster, it’s architecturally superior for real-time, high-efficiency AI pipelines. If you have questions about NMS-Raster, GPU-accelerated object detection, or integrating Raven into your AI pipeline, our team is here to help. We’d love to hear from you, whether it’s technical support, performance benchmarking, or collaboration inquiries.
Contact us via our contact page, and one of our AI and GPU experts will respond promptly.
