Abstract:
Both traditional machine learning algorithms (linear discriminant analysis,
support vector machine, decision tree, to name a few) and deep learning
algorithms such as Convolutional Neural Network (CNN), Long Short-Term
Memory (LSTM), and Recurrent Neural Network (RNN) have been used
in bioacoustics research in general and bird species identification in
particular. However, data are often limited in bioacoustic research,
including bird vocalizations. Training a deep neural network with such
a small amount of data most often leads to overfitting. Many researchers
have used various techniques, for instance data augmentation and transfer
learning, to overcome this problem, but no research has yet been conducted
on pre-training neural networks on public repositories which contain bird
vocalizations, such as Xeno-canto and eBird for bioacoustic classification
models. In this dissertation, we pre-trained CNNs for bioacoustic classification models using two public bird vocalization repositories (Xeno-canto
and eBird) and fine-tuned them on locally collected bird audio recordings obtained from Intaka Island Nature Reserve, Cape Town, South Africa. First, we used bird vocalizations from the public repositories to pre-train three CNN models using different sample sizes.
We pre-trained the three CNN models using 9000, 12000, and 15000 spectrograms (obtained by converting the audio using Fourier transforms).
Next, we trained five baseline models using different sample sizes (the entire training set, 6150, 9000, 12000, 16000, and 21000 spectrograms) from the collected data. Then, we used the same sample sizes as those employed in training the baseline models to fine-tune the pre-trained models. We used the baseline models as reference models to evaluate the performance of the fine-tuned models. The best baseline model had a test accuracy of 91.70%, and the best fine-tuned model achieved 91.73%. The AUC for the best baseline was 96.9% against 96.3% for the best fine-tuned model.
Three findings were observed. First, the performance of the model improved as the size of the training data increased. Second, the performance also improved when using the time-shift augmentation technique.
Finally, the results revealed that the baseline models outperformed the fine-tuned models. A possible reason is that the data used in pre-training was not large enough; a combination of CNN and RNN could produce better results, and using much larger data to pre-train the model might also improve the performance of the fine-tuned models. Despite these results, this
research is the first attempt at pre-training models on publicly available bird vocalization data, an approach that has not been investigated in the existing literature.
Keywords: Data augmentation; Bioacoustics; Deep learning; Pre-training.
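
The abstract mentions two signal-processing steps without giving details: converting audio to spectrograms using Fourier transforms, and time-shift augmentation. The following is a minimal sketch of both, not the dissertation's actual pipeline. The frame length, hop size, Hann window, circular (wrap-around) shift, and the synthetic tone used as input are all assumptions made for illustration.

```python
import numpy as np


def spectrogram(signal, frame_len=256, hop=128):
    """Magnitude spectrogram via a short-time Fourier transform (STFT).

    frame_len and hop are illustrative defaults, not values from the source.
    """
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps the non-negative frequencies: frame_len // 2 + 1 bins.
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, n_frames)


def time_shift(signal, shift):
    """Circularly shift the waveform by `shift` samples (assumed variant)."""
    return np.roll(signal, shift)


# Synthetic 1-second, 1 kHz tone standing in for a bird recording.
sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
audio = np.sin(2 * np.pi * 1000 * t)

spec = spectrogram(audio)
shifted_audio = time_shift(audio, 400)
shifted_spec = spectrogram(shifted_audio)
```

Each augmented waveform yields a spectrogram of the same shape as the original, so shifted copies can be added to a training set without changing the model's input size.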