Improving AI-based Football Detection and Tracking with Transfer Learning

Written by

Sergio Sánchez

April 17, 2024

Motivation: Multimedia Edge AI for Enhanced Sports Analysis

At Fluendo, we strive to pioneer the future of multimedia. One of our leading initiatives involves creating AI-powered multimedia solutions that are ready for deployment. These solutions are designed to run efficiently on a wide range of devices, from personal computers to specialized equipment embedded in cameras or robots, ensuring optimized performance for on-the-spot data processing. We focus on performance and efficiency, enabling seamless multimedia experiences on resource-constrained devices.

Sports, particularly football, present dynamic and challenging scenarios for deep learning. Our use case is a telestration tool for football match videos: video analysis software that provides insights into the game, helping coaches, analysts, and athletes.

Our solution is multifaceted, encompassing detection, segmentation, and camera calibration. In this blog post, we share our experience developing an algorithm for player and ball detection and tracking.

Our solution: Domain adaptation with Transfer Learning

We started with a YOLOv5 [1] model pre-trained on the COCO dataset [2], which includes 80 classes such as ‘person’ and ‘sports ball’. This gives us a strong foundation to build on and tailor to our specific needs using transfer learning.

For the target domain, we used SoccerNet-Tracking [3], a dataset of soccer match videos taken from the main camera, along with the players’ tracklet annotations. The tracking task is particularly useful in football analysis, as it allows the performance of individual players to be evaluated. The dataset also provides annotations for classes beyond players, such as referees, staff, and the ball.

As a first step, we converted the SoccerNet dataset to the YOLO format [4], so that the annotations were in the layout the model expects for training.
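As an illustration of this conversion step, here is a minimal sketch, not the code we used. It assumes the source annotations are MOTChallenge-style comma-separated rows (frame, track ID, left, top, width, height, ...) in pixels, and that the caller already knows the image size and the class index; the function name `mot_row_to_yolo` and the class mapping are hypothetical. The YOLO format stores one line per object as `class x_center y_center width height`, normalized to the range 0 to 1.

```python
from typing import Optional


def mot_row_to_yolo(row: str, img_w: int, img_h: int, class_id: int) -> Optional[str]:
    """Convert one MOT-style annotation row into a YOLO label line.

    Assumed row layout: frame, track_id, left, top, width, height, ...
    Returns None when the box lies completely outside the image.
    """
    fields = row.strip().split(",")
    left, top, width, height = (float(v) for v in fields[2:6])

    # Clip the box to the image so normalized values stay within [0, 1].
    x1 = max(0.0, left)
    y1 = max(0.0, top)
    x2 = min(float(img_w), left + width)
    y2 = min(float(img_h), top + height)
    if x2 <= x1 or y2 <= y1:
        return None

    # YOLO expects the box center and size, normalized by the image size.
    x_center = (x1 + x2) / 2 / img_w
    y_center = (y1 + y2) / 2 / img_h
    norm_w = (x2 - x1) / img_w
    norm_h = (y2 - y1) / img_h
    return f"{class_id} {x_center:.6f} {y_center:.6f} {norm_w:.6f} {norm_h:.6f}"
```

Each converted line would be written to a `.txt` file that shares its name with the corresponding frame image, which is how the YOLO format pairs labels with images.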

Then, using transfer learning, we took the pre-trained model and fine-tuned it to our specific dataset. This technique allows us to benefit from the features already learned by the model, optimizing and adapting it to accurately detect and track the players and the ball.
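To make the fine-tuning step more concrete, the sketch below assembles a command line for the `train.py` script in the YOLOv5 repository, starting from COCO-pretrained weights. Everything specific here is an assumption for illustration: the `yolov5s.pt` checkpoint, the dataset YAML path, and the hyperparameter values are placeholders, not the settings used in our experiments.

```python
import sys
from typing import List


def build_finetune_command(
    data_yaml: str,
    weights: str = "yolov5s.pt",  # assumed checkpoint; the model size is a placeholder
    epochs: int = 50,             # placeholder value
    img_size: int = 640,          # placeholder value
    batch_size: int = 16,         # placeholder value
) -> List[str]:
    """Build the argument list for a YOLOv5 fine-tuning run.

    Passing pretrained weights via --weights makes training start from the
    features learned on COCO instead of from random initialization.
    """
    return [
        sys.executable, "train.py",
        "--data", data_yaml,
        "--weights", weights,
        "--epochs", str(epochs),
        "--img", str(img_size),
        "--batch-size", str(batch_size),
    ]
```

The resulting list could be passed to `subprocess.run(cmd, cwd=<path to a yolov5 checkout>, check=True)`; the dataset YAML would list the training and validation image folders and the class names for the adapted dataset.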

1 YOLOv5 is a model developed by Ultralytics. It is being used solely for research and testing purposes in this project.

Results and Analysis: Improved Players and Ball Detections

Fine-tuning has produced notable improvements in the detection and classification of the ‘player’ and ‘ball’ classes, as shown in Figure 1. The improvement is evident when comparing the original YOLOv5 model with our fine-tuned model. In particular, we have significantly reduced false positives, so that detections are more precise and restricted to the relevant context of the football field.

Figure 1: Performance comparison between the baseline model and the fine-tuned model in object detection, showing significant improvements in all evaluated metrics, particularly in the 'player' and 'ball' categories.

The ‘player’ category showcases remarkable improvements across all evaluated metrics, indicating a high level of precision achieved through the fine-tuning process. The ‘ball’ category has also seen considerable advancements, with increased accuracy. Despite these achievements, there remains potential for further refinement in the detection of the ‘ball’ category. This suggests an opportunity for ongoing adjustments and training to further enhance the model’s overall performance.

To more clearly illustrate these advancements, we present two comparative images below (Figures 2 and 3). These figures contrast the results obtained from the original YOLOv5 model and our fine-tuned model. In them, it can be observed how our model effectively focuses on detecting only players and referees within the field of play, while the original model generates numerous false positives, incorrectly including individuals outside of the playing field and other inconsistencies, like detecting a tennis racket and a potted plant.

Figure 2: Representative image of the detections made by the base model trained with COCO. A significant number of false positives and inaccurate detections are observed, including elements outside of the playing field. The confidence threshold used is 0.1.

Results and Analysis: Players Tracking

Object tracking is a crucial phase in video analysis focused on monitoring the movement and trajectory of objects over time in video sequences. This phase is intrinsically linked to object detection, where objects are initially identified and categorized within images or frames. Information acquired during detection, such as object location and category, is subsequently used to facilitate continuous tracking of objects as they evolve in the video in the form of tracklet IDs that encapsulate the objects’ historical position data.

Exploring current literature reveals various methods to enhance the efficacy of object tracking. Purely associative methods rely on associating detections from consecutive frames, using metrics such as bounding box overlap and movement continuity. On the other hand, more sophisticated methods integrate deep learning re-identification layers. In addition to associating objects, these methods are equipped with the ability to “re-identify” objects even after periods of occlusion or absence, using deeply and automatically learned features to maintain tracking consistency and accuracy.
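The purely associative idea can be sketched in a few lines. The example below matches detections in the current frame to existing tracks by bounding box overlap (IoU) using a simple greedy strategy. This is a simplification for illustration only: SORT-style trackers typically add motion prediction and an optimal assignment step, and the function names and the `min_iou` default here are hypothetical.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def associate(
    tracks: Dict[int, Box], detections: List[Box], min_iou: float = 0.3
) -> Dict[int, int]:
    """Greedily match detections to tracks by descending IoU.

    Returns a mapping from detection index to track ID. Detections that
    are missing from the result would start new tracks.
    """
    candidates = sorted(
        (
            (iou(box, det), track_id, det_idx)
            for track_id, box in tracks.items()
            for det_idx, det in enumerate(detections)
        ),
        reverse=True,
    )
    matches: Dict[int, int] = {}
    used_tracks = set()
    for score, track_id, det_idx in candidates:
        if score < min_iou:
            break  # remaining pairs overlap too little to be the same object
        if track_id in used_tracks or det_idx in matches:
            continue
        matches[det_idx] = track_id
        used_tracks.add(track_id)
    return matches
```

A tracker built only on this kind of overlap loses an identity as soon as a player is occluded for a few frames, which is the gap that re-identification features are meant to close.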

We have chosen a SORT-based architecture [5] as the tracking algorithm for our work. It combines associative strategies with deep learning-based re-identification layers, making it a robust and powerful option for precise object tracking throughout video sequences.

We evaluated the tracking algorithm with the MOTChallenge Official Evaluation Kit [6], using the HOTA metric [7]. HOTA combines detection, association, and localization quality into a single score.
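For readers unfamiliar with the metric, the sketch below follows the definitions in the HOTA paper [7]: at a given localization threshold α, detection accuracy is DetA = TP / (TP + FN + FP), and the score is HOTA_α = sqrt(DetA_α · AssA_α), where AssA_α is the association accuracy averaged over the true positives. The final HOTA is the mean over the α values 0.05, 0.10, ..., 0.95 (sometimes written as percentages). The function names are ours, and the association scores are taken as given inputs rather than computed from tracks.

```python
import math
from typing import Sequence


def detection_accuracy(tp: int, fn: int, fp: int) -> float:
    """DetA at one threshold: TP / (TP + FN + FP)."""
    total = tp + fn + fp
    return tp / total if total else 0.0


def hota_at_alpha(det_a: float, ass_a: float) -> float:
    """HOTA at one localization threshold: geometric mean of DetA and AssA."""
    return math.sqrt(det_a * ass_a)


def hota(det_scores: Sequence[float], ass_scores: Sequence[float]) -> float:
    """Final HOTA: mean of the per-threshold scores over all alpha values."""
    if len(det_scores) != len(ass_scores) or not det_scores:
        raise ValueError("expected one DetA and one AssA value per alpha threshold")
    per_alpha = [hota_at_alpha(d, a) for d, a in zip(det_scores, ass_scores)]
    return sum(per_alpha) / len(per_alpha)
```

Because the per-threshold score is a geometric mean, a tracker cannot reach a high HOTA by excelling at detection alone or at association alone.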

As shown in Figure 4, our tracking algorithm exhibits solid performance, maintaining a HOTA score close to 0.7, suggesting good accuracy in detecting and associating objects in video sequences. The error bars, representing the standard deviation, provide a view of the variability and consistency of the tracker’s performance across different sequences.

The algorithm’s robustness is also reflected in its ability to maintain high accuracy across a range of α values, where α sets how accurately an object must be localized for a detection to count as a match. However, the HOTA score starts to decrease when α increases beyond 50. As the algorithm is subjected to more stringent localization criteria (higher α values), its performance in detecting and associating objects drops, which shows that it struggles to maintain the same level of performance under stricter localization requirements. This makes α = 50 a good threshold for our use case, balancing localization accuracy against track association.

Figure 4: The graph displays the HOTA performance against α thresholds for the 'COMBINED' set that groups the results of all evaluated sequences. The error bars indicate the variability of the HOTA values among different sequences.

To Wrap Up

At Fluendo, we have developed an effective and reliable machine-learning system for detection and tracking in sports environments. Transfer learning has played a significant role in addressing the unique challenges of sports tracking, contributing to our achievement of precise and valuable results.

Our work demonstrates our ongoing commitment to innovation and constant improvement, creating customized solutions that meet the specific needs of our clients’ use cases.

Next Steps

Currently, we are focused on advancing our research and development in strategic areas such as transfer learning to efficiently adapt foundational models to new challenges in sports and any other multimedia realm. We are also delving into MLOps to optimize the lifecycle of our Machine Learning models, ensuring agile and efficient operations.

In addition, we prioritize a data-centric approach to AI, improving the quality and management of data to train more robust and accurate models. Throughout, we continue to drive innovation in Multimedia Edge AI, seeking solutions that offer powerful and efficient processing at the edge and respond to our clients’ current and future needs.

If you want to learn more or are interested in collaborating, please feel free to contact us here.

References

[1] YOLOv5 GitHub Repository. Available at: https://github.com/ultralytics/yolov5
[2] COCO Dataset. Available at: https://cocodataset.org
[3] SoccerNet Tracking Task. Available at: https://www.soccer-net.org/tasks/tracking
[4] Ultralytics YOLO Format Documentation. Available at: https://docs.ultralytics.com/datasets/detect/
[5] Du, Y., et al. “StrongSORT: Make DeepSORT Great Again.” IEEE Transactions on Multimedia (2023). https://doi.org/10.48550/arXiv.2202.13514
[6] MOTChallenge Evaluation Kit GitHub Repository. Available at: https://github.com/dendorferpatrick/MOTChallengeEvalKit
[7] Luiten, J., Os̆ep, A., Dendorfer, P., et al. “HOTA: A Higher Order Metric for Evaluating Multi-object Tracking.” Int J Comput Vis 129, 548–578 (2021). https://doi.org/10.1007/s11263-020-01375-2