Case Studies

Ethical AI and high-fidelity speech synthesis for the Catalan audiovisual sector

The Catalan audiovisual sector needed professional-grade synthetic voices, but existing open-source models lacked the naturalness and prosody required for high-end production.

Furthermore, AI voice cloning posed significant legal and ethical risks, including intellectual property infringement and unauthorized use of biometric data.

A secure, end-to-end multimedia solution was required to close this technological gap while ensuring compliance with the EU AI Act and protecting voice talent. The objective was to build a highly scalable, ethical platform that provided industry-leading phonetic precision without compromising data sovereignty.

Meeting Specifications

Adaptability to the phonetic nuances that professional dubbing and media production demand in regional languages such as Catalan.

Product Quality

Achieved a 0.95% Phoneme Error Rate (PER) and the top ranking in Mean Opinion Score (MOS), ensuring superior linguistic precision for standard Catalan.

Legal Compliance

Focus on intellectual property rights and compliance with the European Union’s Artificial Intelligence Act (EU AI Act), supported by C2PA traceability.

Proposed Solutions

Architecting an ethical and high-fidelity voice cloning platform

Delivering Professional TTS through MLOps, Transfer Learning, and the C2PA Standard

To address the project’s complex requirements, the solution was divided into two distinct technological pillars: a highly optimized TTS model and a secure, enterprise-grade web application.

The Core AI Engine: Linguistic Precision and High-Fidelity Synthesis

To achieve professional-grade naturalness, the system applied transfer learning to existing foundation models (Proyecto AINA), fine-tuning them on more than 45 hours of studio-mastered recordings. A critical differentiator was the manual linguistic correction of the phonetizer engine, which drastically reduced pronunciation errors and improved dialectal accuracy. The architecture used a modular TTS engine combining eSpeak-NG for phonemization, Matcha-TTS as the acoustic model, and Vocos as the vocoder. This acoustic and linguistic optimization reduced the Phoneme Error Rate (PER) to just 0.95%. The application and model were presented to a focus group from the audiovisual industry, whose experts rated it the best Catalan TTS available.
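
For readers unfamiliar with the metric, PER is conventionally computed as the edit distance between the reference and predicted phoneme sequences, divided by the number of reference phonemes. The following is a minimal sketch of that conventional definition, not the project's evaluation code; the function names and the toy phoneme symbols are assumptions made for illustration.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from reference prefix to empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference phoneme
                curr[j - 1] + 1,     # insertion of a hypothesis phoneme
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def phoneme_error_rate(
    refs: Sequence[Sequence[str]], hyps: Sequence[Sequence[str]]
) -> float:
    """Corpus-level PER: total edit errors divided by total reference phonemes."""
    total_ref = sum(len(r) for r in refs)
    if total_ref == 0:
        raise ValueError("reference corpus contains no phonemes")
    total_errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_errors / total_ref


# Toy symbols (hypothetical, not real Catalan transcriptions): one substitution in six.
reference = ["p1", "p2", "p3", "p4", "p5", "p6"]
predicted = ["p1", "p2", "p3", "px", "p5", "p6"]
print(f"{phoneme_error_rate([reference], [predicted]):.2%}")  # prints 16.67%
```

Under this definition, a PER of 0.95% corresponds to roughly one phoneme error per 105 reference phonemes.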

The Enterprise Architecture: Ethical Governance and Scalability

Around the AI engine, a robust application was built to handle contract-based management of voice actors’ intellectual property, GDPR compliance, and high-concurrency requests. The pipeline was orchestrated by a backend built on Django, Redis, and Celery, guaranteeing data sovereignty end to end. Crucially, the platform addressed the legal challenges of generative AI by embedding C2PA cryptographic metadata into every generated audio file, making the origin of the synthetic media verifiable and tamper-evident and helping protect it against unauthorized data mining.
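
The idea behind provenance metadata can be illustrated with a deliberately simplified sketch: a manifest is bound to the audio through a content hash and then signed, so any later change to the audio or to the manifest is detectable. This is not the C2PA format and not the platform's implementation. Real C2PA manifests use certificate-based signatures and a standardized container, whereas the HMAC key, field names, and function names below are assumptions chosen to keep the example standard-library only.

```python
import hashlib
import hmac
import json


def build_manifest(audio: bytes, voice_id: str, generator: str) -> dict:
    """Describe a generated clip and bind the description to its exact bytes."""
    return {
        "content_sha256": hashlib.sha256(audio).hexdigest(),
        "voice_id": voice_id,    # hypothetical identifier of the licensed voice
        "generator": generator,  # hypothetical name of the synthesis system
        "synthetic": True,
    }


def sign_manifest(manifest: dict, key: bytes) -> str:
    """Sign a canonical JSON encoding of the manifest (HMAC stands in for a real signature)."""
    payload = json.dumps(manifest, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(key, payload, hashlib.sha256).hexdigest()


def verify(audio: bytes, manifest: dict, signature: str, key: bytes) -> bool:
    """Accept only if the manifest is untampered and still matches the audio."""
    if not hmac.compare_digest(sign_manifest(manifest, key), signature):
        return False
    return manifest.get("content_sha256") == hashlib.sha256(audio).hexdigest()


key = b"demo-signing-key"            # placeholder secret for the example
clip = b"\x00\x01fake-audio-bytes"   # placeholder for generated audio
manifest = build_manifest(clip, voice_id="voice-001", generator="demo-tts")
signature = sign_manifest(manifest, key)

print(verify(clip, manifest, signature, key))               # prints True
print(verify(clip + b"edited", manifest, signature, key))   # prints False
```

The design point is that verification fails both when the audio is altered and when the manifest itself is edited, which is the property a production system would obtain from a C2PA toolchain rather than from this sketch.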

Supported by real-time ASR validation on the frontend, this scalable architecture achieved an internal synthesis latency of just 0.12 seconds, a Real-Time Factor (RTF) of 0.17, and a 100% success rate for backend inference requests. Ultimately, the project delivered a transparent, ethical, and high-performance solution ready for the professional dubbing industry.
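
As a reading aid for the figures above, the Real-Time Factor is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 mean synthesis runs faster than playback. The sketch below shows that arithmetic only; the function names and the 10-second clip are assumptions for illustration and do not describe how the project measured its results.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio duration must be positive")
    return synthesis_seconds / audio_seconds


def expected_synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimated compute time needed to generate a clip at a given RTF."""
    return audio_seconds * rtf


# At an RTF of 0.17, a hypothetical 10-second line would take about 1.7 seconds to generate.
print(round(expected_synthesis_seconds(10.0, 0.17), 2))  # prints 1.7
```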