1. Introduction
This study leverages Deep Learning (DL) models and stereoscopic vision techniques to develop tools that automatically estimate fish size in near real-time, using a non-intrusive approach. Accurate and efficient length measurement in intensive aquaculture systems requires advanced object detection algorithms capable of extracting highly discriminative features specific to fish. However, practical fish detection still presents challenges, such as complex underwater backgrounds, varied fish orientations, and a lack of large-scale annotated datasets.
Monitoring fish growth is essential in aquaculture for optimizing feed quantity, oxygen supply, and overall animal welfare. Typically, length estimation relies on manual sample collection, a process that can induce stress, negatively affect fish growth and, in some cases, result in mortality. This highlights the urgent need for automated, non-invasive alternatives to manual handling. Stereoscopic vision techniques in computer vision have proven effective for estimating the size of freely swimming fish.
The primary objective of this work is to automatically process underwater stereo videos of freely swimming fish to obtain accurate measurements of many individuals in near real-time. The main contributions of the proposed approach are as follows:
2. Stereoscopic vision system and video acquisition
A stereoscopic camera system was prototyped using a watertight enclosure with sealed wiring, ensuring durability in aquatic environments. Its design required careful consideration of parameters such as focal length, inter-camera distance, shutter speed, and other optical and mechanical factors. Video recordings of sea bream at different growth stages were conducted at the Institute of Aquaculture Torre de la Sal (IATS-CSIC) facilities, resulting in six hours of footage at varying resolutions and frame rates, with the configuration detailed in Figure 1.
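Given a calibrated, rectified stereo pair, fish length can in principle be recovered by triangulating the same keypoint in both views and measuring the snout-to-tail distance in 3D. The following is a minimal sketch under a pinhole rectified-stereo model; the function names, parameter values, and keypoint format are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

def triangulate(xl, xr, y, focal_px, baseline_m, cx, cy):
    """Triangulate one keypoint (camera frame, metres) from a rectified pair.

    xl, xr: x-coordinates (px) of the same keypoint in the left/right images;
    y: its image row (identical in both views after rectification);
    focal_px: focal length in pixels; baseline_m: inter-camera distance (m);
    (cx, cy): principal point (px).
    """
    disparity = xl - xr                     # px; larger disparity = closer point
    z = focal_px * baseline_m / disparity   # depth along the optical axis
    x = (xl - cx) * z / focal_px
    y3 = (y - cy) * z / focal_px
    return np.array([x, y3, z])

def fish_length(snout_l, snout_r, tail_l, tail_r,
                focal_px, baseline_m, cx, cy):
    """Euclidean distance between the triangulated snout and tail keypoints."""
    p_snout = triangulate(snout_l[0], snout_r[0], snout_l[1],
                          focal_px, baseline_m, cx, cy)
    p_tail = triangulate(tail_l[0], tail_r[0], tail_l[1],
                         focal_px, baseline_m, cx, cy)
    return float(np.linalg.norm(p_snout - p_tail))
```

For example, with a hypothetical 800 px focal length and 0.12 m baseline, matched snout/tail keypoints at a common disparity of 96 px place the fish at 1 m depth, and the length follows directly from the two triangulated points.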
3. Results: Applying DL models to automatic fish length estimation
A dataset was generated by manually annotating the snout and tail keypoints of individual fish, resulting in 250 images and 700 annotated individuals. Two DL-based object detection models were trained: Keypoint R-CNN and YOLO. Keypoint R-CNN extends the Mask R-CNN architecture for keypoint detection. It first uses a Region Proposal Network (RPN) to generate candidate regions and then refines these proposals with a CNN-based object detection head. YOLO (You Only Look Once) is a fast, single-stage object detection algorithm that treats detection as a regression problem. It divides an image into a grid and simultaneously predicts bounding boxes and keypoints per cell in a single pass through the network, making it well-suited for real-time applications.
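Whichever detector is used, each detection ultimately yields a confidence score and a snout/tail keypoint pair, from which a per-fish pixel length can be read off. A minimal sketch of consuming such output, assuming a simplified detection format (the dictionary layout and confidence threshold are illustrative, not either model's actual API):

```python
import math

def extract_lengths_px(detections, min_score=0.5):
    """Return the snout-to-tail pixel length of each confident detection.

    `detections` is assumed to be a list of dicts of the form
    {"score": float, "keypoints": {"snout": (x, y), "tail": (x, y)}},
    i.e. one entry per detected fish. Low-confidence detections are dropped.
    """
    lengths = []
    for det in detections:
        if det["score"] < min_score:
            continue  # skip uncertain detections
        sx, sy = det["keypoints"]["snout"]
        tx, ty = det["keypoints"]["tail"]
        lengths.append(math.hypot(tx - sx, ty - sy))
    return lengths
```

In a stereo setting, these 2D keypoints would be matched across the left and right views and triangulated before the metric length is computed.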
Table 1 compares the average fish lengths obtained from manual measurements and the two DL models. Both models yielded accurate results, with YOLO performing slightly better, particularly in tank 25. The average processing times per frame were 0.347 ms for Keypoint R-CNN and 0.068 ms for YOLO. An example of automatic fish detection and length estimation is shown in Figure 2.
4. Conclusion
A dataset of sea bream was developed with manually annotated snout and tail keypoints to train Keypoint R-CNN and YOLO networks for automatic fish length estimation. Both models demonstrated high accuracy across different growth stages when compared to manual measurements. These results support the potential of Deep Learning and stereoscopic vision as a robust, non-invasive, and automated method for fish length estimation in aquaculture.
Funding
The work was funded by MCIN NextGenerationEU (PRTR-C17.I1) and by Generalitat Valenciana (THINKINAZUL/2021/007).