Advertisement detection in multimedia content with artificial intelligence

Written by

Izan Leal

October 7, 2025

At Fluendo, we specialize in multimedia processing and have extensive experience working with GStreamer and other industry-leading technologies. One of the challenges we’ve tackled is detecting advertisements in multimedia content, a critical need for industries such as media monitoring, ad verification, digital rights management, and content streaming.

To address this, we have developed an AI-powered solution that automatically detects advertisements in video and audio files. Our system runs AI on the edge, ensuring privacy, cost efficiency, and adaptability across different environments.

The challenge of advertisement detection
Our AI-powered advertisement detection system
- Key features of our solution
- How does it work?
Performance & evaluation
Conclusion

The challenge of advertisement detection

Detecting advertisements in multimedia content presents several challenges. Ad formats vary widely across languages, styles, and media types, which makes reliable detection difficult. Detection systems must therefore be fast, accurate, and adaptable to different formats. While some solutions use cloud processing, others require offline capabilities to address privacy concerns or infrastructure limitations.

Ad detection is essential for content platforms, broadcasters, and regulators to manage and monitor commercial material effectively. It enables key features like ad skipping, content monetization, and personalized ad insertion on streaming platforms, while also supporting media analytics by measuring ad reach and effectiveness. Additionally, it ensures regulatory compliance and helps maintain a clear distinction between editorial and promotional content, contributing to a fairer user experience.

Many solutions rely on cloud-based processing; however, some businesses and organizations also require offline processing to maintain data privacy and avoid cloud computing costs. Our solution is optimized for detecting explicitly structured advertisements, ensuring high precision in identifying clearly defined commercial segments. As AI technology advances, we continue to enhance detection capabilities to keep pace with the evolving nature of multimedia advertising.

Our AI-powered advertisement detection system

We designed our ad detection solution to be simple, robust, scalable, and adaptable to diverse business needs. By combining advanced signal processing techniques, audio fingerprinting, and AI-based content analysis, our system quickly and accurately identifies advertisements across a wide range of multimedia formats. Its modular architecture ensures seamless integration into existing workflows, whether deployed in the cloud or on-premises. It is suitable for everything from large-scale media operations to privacy-sensitive environments.

For the client, this translates into a powerful tool that enhances content management, streamlines monetization strategies, and improves user experience. Whether enabling automated ad skipping, supporting targeted ad insertion, or providing detailed analytics for campaign performance, our solution empowers clients to extract greater value from their content while maintaining full control over how ads are handled and presented.

Key features of our solution

On-the-edge processing: No internet connection is required, ensuring privacy and eliminating cloud costs.
Multilingual support: We support any language, making our system adaptable for diverse markets.
Audio fingerprinting-based detection: Our technology extracts unique audio fingerprints from content and compares them to a stored database of known ads.
Command-line interface (CLI): Our system runs efficiently via CLI, making it easy to integrate into automated workflows without a graphical user interface.
JSON output format: We generate structured JSON files containing precise timestamps of detected ads, making it simple for businesses to process and analyze results.

The overall design and structure of the solution are illustrated in Figure 1, which provides a visual representation of the key components, their interactions, and how they collectively support the system’s functionality. This diagram helps to clarify the architecture by outlining the relationships between different modules, data flows, and integration points within the solution.

Figure 1: System workflow

How does it work?

Below, we explain how our system detects and classifies advertisements in video and audio content, as shown in Figure 1.

1. Input processing & media compatibility

Our system is designed to handle a wide range of multimedia formats, ensuring compatibility with various content sources. It can process video and audio files in formats such as:

Video formats: .mp4, .mkv, .avi, .mov, .flv, .wmv, .webm, .m4v
Audio formats: .wav, .mp3, .ogg, .flac, .m4a, .wma, .aac, .aiff, .au, .raw

When we receive a media file for analysis, our system automatically extracts and converts the audio track into an optimized proprietary format. If the input is a video file, we isolate the audio component since advertisements are primarily identified through their unique sound patterns. This ensures that all content is uniformly prepared before segmentation and detection.
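
As a rough illustration of this preparation step, the sketch below classifies an input by extension and builds a command that extracts a mono audio track. It is not our actual pipeline: using the ffmpeg command-line tool, a 16 kHz mono WAV target, and the helper names media_kind and extraction_command are all assumptions made for the example, standing in for the proprietary format mentioned above.

```python
from pathlib import Path

# Extensions taken from the format lists above.
VIDEO_EXTS = {".mp4", ".mkv", ".avi", ".mov", ".flv", ".wmv", ".webm", ".m4v"}
AUDIO_EXTS = {".wav", ".mp3", ".ogg", ".flac", ".m4a", ".wma", ".aac", ".aiff", ".au", ".raw"}


def media_kind(path: str) -> str:
    """Classify a file as 'video' or 'audio' from its extension."""
    ext = Path(path).suffix.lower()
    if ext in VIDEO_EXTS:
        return "video"
    if ext in AUDIO_EXTS:
        return "audio"
    raise ValueError(f"unsupported media format: {ext or path}")


def extraction_command(src: str, dst: str, sample_rate: int = 16000) -> list[str]:
    """Build (but do not run) an ffmpeg command that writes a mono audio file.

    Assumption: ffmpeg and 16 kHz mono output stand in for the real converter.
    -vn drops any video stream, -ac 1 downmixes to mono, -ar sets the sample rate.
    """
    media_kind(src)  # reject unsupported inputs early
    return ["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", str(sample_rate), dst]
```

The returned list can be passed to subprocess.run on a machine where ffmpeg is installed. Headerless inputs such as .raw would additionally need explicit format parameters, which this sketch leaves out.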

2. Noise filtering – enhancing audio quality for accurate detection

Before proceeding with segmentation and fingerprinting, we apply advanced noise filtering techniques to improve audio clarity and reduce background interference. This step is crucial for ensuring high detection accuracy, especially when working with content that includes background noise, distortions, or overlapping dialogues.

Our system uses adaptive noise reduction algorithms to remove unwanted audio artifacts while preserving key advertisement features. The filtering process includes:

Spectral subtraction, reducing background hums and static noise without distorting speech or music.
Voice activity detection (VAD), isolating spoken words and eliminating irrelevant sounds.
Echo and reverb reduction, minimizing distortions that might otherwise alter an advertisement’s unique audio signature.
Equalization and gain control, ensuring consistent volume levels across different recordings.

By cleaning up the audio before fingerprinting, we enhance the system’s ability to recognize advertisements accurately, even in noisy environments such as TV broadcasts, radio streams, and crowded event recordings.
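
To make one of these filters concrete, here is a minimal energy-based voice activity detector. It is a deliberately simple stand-in, not the adaptive algorithm used in our system: the frame length, the threshold ratio, and the idea of taking the quietest frame as the noise floor are all assumptions chosen for illustration.

```python
import math


def frame_rms(samples: list[float], frame_len: int) -> list[float]:
    """Root-mean-square energy of consecutive, non-overlapping frames."""
    return [
        math.sqrt(sum(s * s for s in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]


def active_frames(samples: list[float], frame_len: int = 160, ratio: float = 3.0) -> list[bool]:
    """Mark frames whose energy exceeds `ratio` times the estimated noise floor.

    Assumption: the quietest frame in the clip approximates the noise floor.
    """
    energies = frame_rms(samples, frame_len)
    if not energies:
        return []
    floor = max(min(energies), 1e-9)  # avoid a zero threshold on digital silence
    return [e > ratio * floor for e in energies]
```

Frames flagged False can be attenuated or skipped before fingerprinting. A production detector would typically add smoothing over time and frequency-domain features, which are omitted here.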

3. Intelligent file segmentation – sliding window technique

Instead of analyzing an entire media file at once, which can be computationally demanding, we use a sliding window segmentation approach. This method allows us to break audio into smaller, overlapping segments for precise ad detection without missing key content transitions.

Each file is divided into 5-second segments, with an overlap of 1 second to ensure seamless detection. This overlapping approach prevents important ad content from being cut off between segments. Additionally, we utilize parallel processing, meaning multiple segments are analyzed simultaneously, dramatically improving efficiency and processing speed.

By breaking media files into smaller, manageable units, we can increase detection accuracy while maintaining performance, even for long-form content such as TV shows, podcasts, and live recordings.

Figure 2: Sliding window diagram
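
The segmentation described above can be sketched as follows. The 5-second window and 1-second overlap come from the text; the function names, the thread pool, and the worker count are assumptions made for the example rather than a description of our implementation.

```python
from concurrent.futures import ThreadPoolExecutor


def sliding_windows(duration: float, window: float = 5.0, overlap: float = 1.0) -> list[tuple[float, float]]:
    """Return (start, end) times in seconds covering [0, duration]."""
    if not 0 <= overlap < window:
        raise ValueError("overlap must be non-negative and smaller than the window")
    step = window - overlap  # each window starts 4 s after the previous one by default
    windows = []
    start = 0.0
    while start < duration:
        windows.append((start, min(start + window, duration)))
        if start + window >= duration:
            break  # the file end is covered; avoid a redundant trailing window
        start += step
    return windows


def analyze_in_parallel(windows, analyze, max_workers: int = 4) -> list:
    """Apply `analyze` to every window concurrently, keeping the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(analyze, windows))
```

For a 13-second clip, sliding_windows(13) yields the windows 0 to 5, 4 to 9, and 8 to 13 seconds, so every boundary is seen by two neighbouring segments.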

4. Audio fingerprinting – generating unique signatures

Once the audio has been segmented, each clip undergoes fingerprinting, where we extract its unique acoustic signature. This is a crucial step, allowing us to identify advertisements even if they have been modified, compressed, or contain background noise.

To create an accurate fingerprint, we analyze several key sound features, including:

Frequency spectrograms, converting audio into a visual time-frequency representation to highlight patterns unique to advertisements.
Tempo & rhythm patterns, allowing us to capture repetitive sound sequences that distinguish ads from regular content.
Amplitude modulation & volume changes, ensuring the system remains effective even when an ad’s loudness or intensity is slightly altered.
Speech & music differentiation, helping us classify whether an advertisement consists primarily of spoken dialogue, background music, or both.

Figure 3: Audio clip spectrogram

We then convert these extracted features into a fingerprint hash, a unique identifier for each segment. This fingerprint is compared against our advertisement database, allowing us to detect known ads accurately.
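
One common way to turn spectrogram features into hashes is to pair nearby spectral peaks and hash their frequencies together with the time gap between them. The sketch below shows that general idea only; it is an assumption about how such a hash could be built, not our exact feature set. The peak list, the fan_out parameter, and the 10-character SHA-1 prefix are all hypothetical choices.

```python
import hashlib


def fingerprint_hashes(peaks: list[tuple[int, int]], fan_out: int = 3, hash_len: int = 10) -> list[tuple[str, int]]:
    """Turn (time_index, frequency_bin) peaks into (hash, offset) pairs.

    Each peak is paired with up to `fan_out` later peaks. The two frequency
    bins and the time gap are hashed together, so the hash does not depend on
    where in the file the pair occurs; the offset is stored separately.
    """
    peaks = sorted(peaks)
    result = []
    for i, (t1, f1) in enumerate(peaks):
        for t2, f2 in peaks[i + 1:i + 1 + fan_out]:
            key = f"{f1}|{f2}|{t2 - t1}".encode()
            digest = hashlib.sha1(key).hexdigest()[:hash_len]
            result.append((digest, t1))
    return result
```

Because only the time gap enters the hash, the same ad produces the same hashes wherever it appears in a broadcast, and the stored offsets are what later allow the match to be placed in time.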

5. Matching against the advertisement database

Once we generate an audio fingerprint, our system compares it against our structured MySQL database, which contains thousands of known advertisement fingerprints. This step is where the actual detection takes place.

The matching process follows a structured approach:

Exact match identification: If a fingerprint perfectly matches one in our database, we immediately flag it as a detected ad.
Partial match & confidence scoring: If a segment only partially matches an ad, we assign a confidence score based on similarity.
Cluster-based verification: If multiple segments match the same ad within a short time frame, we cluster them together and refine the detection to prevent duplicate timestamps.

By structuring our system this way, we ensure robust ad detection, even when commercials have been slightly altered. This means that ads with minor modifications, compression artifacts, or volume changes can still be recognized with high confidence.
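
A simplified view of the matching logic: every hash shared between a segment and a stored ad votes for a time shift, and the ad whose hashes agree on a single shift wins. This is a sketch under stated assumptions. The in-memory dictionaries stand in for the database, and scoring confidence as matched hashes divided by the ad's total hashes is our illustrative choice here, not a statement of the production formula.

```python
from collections import Counter


def best_match(segment_hashes, database, totals):
    """Return (ad_id, offset_delta, confidence) for the best-aligned ad, or None.

    segment_hashes: list of (hash, offset_in_segment)
    database: dict mapping hash -> list of (ad_id, offset_in_ad)
    totals: dict mapping ad_id -> total number of stored hashes for that ad
    """
    votes = Counter()
    for h, seg_offset in segment_hashes:
        for ad_id, ad_offset in database.get(h, []):
            # Hashes that truly belong to the same ad agree on one time shift.
            votes[(ad_id, ad_offset - seg_offset)] += 1
    if not votes:
        return None
    (ad_id, delta), count = votes.most_common(1)[0]
    confidence = min(100.0, 100.0 * count / totals[ad_id])
    return ad_id, delta, confidence
```

Requiring agreement on the time shift is what keeps accidental hash collisions from producing a detection: random matches scatter across many shifts, while a real ad concentrates its votes on one.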

6. Timestamp refinement & confidence thresholding

After detecting an advertisement, we refine its start and end timestamps to ensure accuracy. Since our system processes overlapping segments, it often detects the same ad multiple times. We account for this by clustering detections and refining their time boundaries to eliminate inconsistencies and redundancy.

Additionally, we apply confidence thresholding, meaning we only report advertisements that exceed a predefined confidence level—for example, 80% similarity or higher. This ensures our final results are highly reliable, minimizing false positives while capturing the exact placement of ads in the content.
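
The clustering and thresholding described above could look roughly like this. The 80% threshold is the example value from the text; the max_gap parameter, the choice to keep the highest confidence within a cluster, and the dictionary layout are assumptions made for the sketch.

```python
def merge_detections(detections, min_confidence=80.0, max_gap=1.0):
    """Merge per-segment hits of the same ad, then drop low-confidence results.

    detections: list of dicts with name, start_time, end_time, confidence.
    Hits of the same ad that overlap or lie within `max_gap` seconds are merged.
    """
    merged = []
    for det in sorted(detections, key=lambda d: (d["name"], d["start_time"])):
        last = merged[-1] if merged else None
        if last and last["name"] == det["name"] and det["start_time"] - last["end_time"] <= max_gap:
            # Extend the existing cluster instead of reporting a duplicate.
            last["end_time"] = max(last["end_time"], det["end_time"])
            last["confidence"] = max(last["confidence"], det["confidence"])
        else:
            merged.append(dict(det))  # copy so the caller's data is not mutated
    kept = [d for d in merged if d["confidence"] >= min_confidence]
    return sorted(kept, key=lambda d: d["start_time"])
```

With this approach, three overlapping hits of the same ad collapse into one entry spanning the earliest start and the latest end, and isolated low-confidence hits are discarded.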

Beyond standard advertisements, we are also exploring future enhancements, such as detecting subliminal and locuted ads that are subtly embedded within a program’s dialogue or background.

7. JSON report generation & output

Once all advertisements have been identified, timestamped, and verified, we generate a structured JSON report containing all relevant details.

Each detected advertisement is logged with the following information:

Advertisement name → Identifies the matched ad.
Language → Specifies the primary language of the detected ad.
Start & end time → Provides precise timestamps.
Confidence score → Indicates detection accuracy (higher values mean better matches).

Example JSON Output:

{
    "ads": [
        {
            "name": "Brand X",
            "language": "Spanish",
            "start_time": 12.0,
            "end_time": 24.5,
            "confidence": 92.3
        }
    ]
}

This structured format allows for easy integration with external analytics tools, advertising compliance systems, and content verification platforms. Businesses can use this data to automate ad detection reports, verify ad placements, and ensure compliance with advertising regulations.
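
As an example of downstream use, the following sketch reads a report in the format shown above and derives two things: a summary of ad time, and the content ranges that remain once ads are skipped. The field names match the example output; the helper names and the 60-second duration in the usage note are hypothetical.

```python
import json

REPORT = """
{
    "ads": [
        {"name": "Brand X", "language": "Spanish",
         "start_time": 12.0, "end_time": 24.5, "confidence": 92.3}
    ]
}
"""


def ad_summary(report_text: str) -> dict:
    """Summarise a report: number of ads and total ad time in seconds."""
    ads = json.loads(report_text)["ads"]
    total = sum(ad["end_time"] - ad["start_time"] for ad in ads)
    return {"count": len(ads), "total_seconds": total}


def content_ranges(report_text: str, duration: float) -> list[tuple[float, float]]:
    """Return the (start, end) ranges that are not ads, e.g. for ad skipping."""
    ads = sorted(json.loads(report_text)["ads"], key=lambda ad: ad["start_time"])
    ranges, cursor = [], 0.0
    for ad in ads:
        if ad["start_time"] > cursor:
            ranges.append((cursor, ad["start_time"]))
        cursor = max(cursor, ad["end_time"])
    if cursor < duration:
        ranges.append((cursor, duration))
    return ranges
```

For a 60-second file, content_ranges(REPORT, 60.0) gives the ranges 0 to 12 seconds and 24.5 to 60 seconds, which a player could use to jump over the detected ad.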

8. Expanding the database – adding new advertisement fingerprints

One of our system’s advantages is that we allow for continuous database expansion. If an advertisement has not yet been registered in our system, users can generate new fingerprints and store them in our MySQL database for future detections.

To add a new advertisement fingerprint, we follow the same input processing (Step 1) and fingerprinting (Step 4) steps. Once an ad is fingerprinted, it can be stored in the database, ensuring it will be detected in future analyses.

Our MySQL database structure is designed to be efficient and scalable, storing essential data such as:

Advertisements table

AD Id	AD Name	AD Duration	AD Language	File SHA1	Total HASHES	Date Created	Date Modified
1	Ad_1	30.5	English	b5c4d7d34a	128	2024-03-13 12:30:00	2024-03-13 12:30:00
2	Ad_2	15.0	Spanish	a3f2b13f2c	95	2024-03-12 09:15:00	2024-03-13 10:05:00

Fingerprint table

HASH	AD Id	AD Offset	Date Created	Date Modified
9c5b7d2e8f	1	5	2024-03-13 12:30:00	2024-03-13 12:30:00
9c5b7d2e8f	1	10	2024-03-13 12:30:10	2024-03-13 12:30:10
a2c8d61f5b	2	7	2024-03-12 09:15:00	2024-03-12 09:15:00

By continuously updating our fingerprint database, we ensure that our system remains accurate and effective in detecting new and evolving advertisements.
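
To illustrate how the two tables relate, here is a small sketch that mirrors their columns and registers a new ad. It uses Python's built-in sqlite3 module with an in-memory database purely so the example is self-contained; our system stores this data in MySQL, and the column types, constraints, index, and helper names below are assumptions.

```python
import sqlite3

SCHEMA = """
CREATE TABLE advertisements (
    ad_id         INTEGER PRIMARY KEY,
    ad_name       TEXT NOT NULL,
    ad_duration   REAL NOT NULL,
    ad_language   TEXT NOT NULL,
    file_sha1     TEXT NOT NULL UNIQUE,
    total_hashes  INTEGER NOT NULL,
    date_created  TEXT DEFAULT CURRENT_TIMESTAMP,
    date_modified TEXT DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE fingerprints (
    hash          TEXT NOT NULL,
    ad_id         INTEGER NOT NULL REFERENCES advertisements(ad_id),
    ad_offset     INTEGER NOT NULL,
    date_created  TEXT DEFAULT CURRENT_TIMESTAMP,
    date_modified TEXT DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (hash, ad_id, ad_offset)
);
CREATE INDEX idx_fingerprints_hash ON fingerprints(hash);
"""


def register_ad(conn, name, duration, language, file_sha1, hashes):
    """Store one ad and its (hash, offset) fingerprints; return the new ad id."""
    with conn:  # commits on success, rolls back if any insert fails
        cur = conn.execute(
            "INSERT INTO advertisements"
            " (ad_name, ad_duration, ad_language, file_sha1, total_hashes)"
            " VALUES (?, ?, ?, ?, ?)",
            (name, duration, language, file_sha1, len(hashes)),
        )
        ad_id = cur.lastrowid
        conn.executemany(
            "INSERT INTO fingerprints (hash, ad_id, ad_offset) VALUES (?, ?, ?)",
            [(h, ad_id, offset) for h, offset in hashes],
        )
    return ad_id


def lookup(conn, hash_value):
    """Return the (ad_id, ad_offset) rows stored for one hash."""
    return conn.execute(
        "SELECT ad_id, ad_offset FROM fingerprints WHERE hash = ? ORDER BY ad_offset",
        (hash_value,),
    ).fetchall()
```

The index on the hash column matters most in practice, since every analyzed segment triggers many lookups by hash. The unique file SHA-1 is one possible way to avoid registering the same ad file twice.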

Performance & evaluation

Our AI-powered advertisement detection system has been thoroughly tested, demonstrating high precision across different languages and formats. The results highlight exceptional accuracy, reliable confidence scoring, and minimal temporal offset errors, ensuring robust performance in real-world applications.

To evaluate our advertisement detection system, we tested it on 10 videos and 8 audio recordings sourced from television and radio broadcasts. These recordings have an approximate duration of 30 to 40 minutes and contain multiple advertisement segments in multiple languages, totaling around 165 ads.

Detection accuracy by language

The following table summarizes the detection accuracy across different languages:

Language	Accuracy (%)
All	96.36%
Basque	100.00%
Catalan	97.50%
English	100.00%
Galician	100.00%
Spanish	93.98%

These results confirm that our system performs exceptionally well across all tested languages. However, minor differences in accuracy suggest potential areas for optimization, particularly in Spanish-language content, where variability in ad formats may require further tuning.

Confidence score distribution

Our system maintains a strong confidence score in ad detection, with an average confidence level of 44.60%.

Important note: The confidence score is not linear because fingerprint generation is time-dependent. This means longer advertisements have more unique fingerprints, increasing the chance of a match. Conversely, shorter ads generate fewer fingerprints, slightly lowering the confidence score even if detection is accurate.

As a rule of thumb, any score above 10% indicates a solid detection, as the fingerprinting process ensures that even lower-confidence matches are highly reliable.

The boxplot below illustrates the distribution of confidence scores across various test cases, showing how detection certainty varies across different advertisements.

Figure 4: Confidence score boxplot

Temporal offset error by language

To evaluate temporal precision, we measured the time offset error between the actual timestamps of detected ads and the predicted timestamps. The average absolute error represents how much the system’s detected start and end times differ from reality.

Figure 5: Temporal offset error boxplot
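
The average absolute error mentioned above can be computed as in the sketch below. The function name is hypothetical and any timestamps passed to it in examples are invented for illustration; they are not taken from our test set.

```python
from statistics import mean


def offset_errors(truth, predicted):
    """Mean absolute start and end errors, in seconds, over paired ad segments.

    truth, predicted: equally long lists of (start, end) tuples, where the
    i-th predicted segment corresponds to the i-th annotated segment.
    """
    if len(truth) != len(predicted) or not truth:
        raise ValueError("need two non-empty lists of equal length")
    start_err = mean(abs(p[0] - t[0]) for t, p in zip(truth, predicted))
    end_err = mean(abs(p[1] - t[1]) for t, p in zip(truth, predicted))
    return start_err, end_err
```

Taking absolute values before averaging prevents early and late detections from cancelling each other out, so the result reflects the typical size of the deviation rather than its direction.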

The obtained temporal offset deviation is acceptable in most cases, as it does not affect the overall ability to detect and mark ad segments accurately. In the majority of test cases, the deviation remains within ±2-3 seconds, which is well within the industry standard for ad verification and content tracking.

In practical applications, this level of accuracy ensures that detected ad segments can still be effectively reviewed, skipped, or monitored, making the system highly reliable for real-world usage.

Conclusion

We are dedicated to advancing AI-driven solutions that address real-world multimedia challenges. Our on-the-edge advertisement detection system results from extensive research, rigorous testing, and deep expertise in AI, multimedia processing, and content analysis. By offering a fully offline, highly accurate, and scalable solution, we empower businesses to monitor, verify, and analyze advertisements efficiently while ensuring data privacy, reducing operational costs, and enhancing regulatory compliance.

Looking ahead, we remain committed to continuous innovation and improvement. Some key areas of future development include:

Real-time video stream detection: Enhancing our system for faster processing to enable real-time detection of ads within video streams, allowing immediate actions.
Database-free ad recognition: Advancing machine learning techniques to identify new advertisements without requiring a predefined database.
Subliminal & locuted advertisement detection: Developing AI models capable of detecting hidden branding cues, subtle product mentions, and embedded advertisements through speech and video analysis.
Cloud & edge deployment: Expanding our solution to cloud-based environments for greater flexibility while maintaining offline capabilities.

With our expertise, commitment to innovation, and deep understanding of multimedia technologies, we are shaping the future of advertisement detection and content intelligence.

Are you ready to take your multimedia projects to the next level? Let’s make it happen! Contact us here and discover how Fluendo can help you bring your ideas to life. Together, we’ll continue shaping the future of multimedia.