Manual labour is changing as automation grows by the day. Self-driving vehicles are one of the better-known examples of automation (the vehicles in this thesis being those found in the construction industry), and they rely on a machine learning network to recognize their surroundings. To achieve this, the network needs a dataset. A dataset consists of two things: data, which usually comes in the form of images, and annotated labels, from which the network learns what it sees. A label describes which objects exist in an image, where in the image these objects are located, and the area they occupy.
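As an illustration (the exact label format used in this thesis is not spelled out here), a label in the plain-text format used by the YOLO family of detectors contains one line per object: a class index followed by the normalized centre coordinates, width, and height of the object's bounding box:

    0 0.512 0.430 0.310 0.275

Here, the hypothetical class index 0 could stand for a dump truck, and the remaining four values place the box within the image, with all values normalized to the range [0, 1].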
As data is collected, it needs to be manually annotated, which can take several months to finish. With this in mind, is it possible to set up some form of semi-automatic annotation step that does the majority of the work? If so, what techniques can be used to achieve this? How does the result compare to a dataset that has been annotated entirely by a human? And is it even worth implementing in the first place?
For this research, a dataset was collected in which a remote-controlled wheel loader approached a stationary dump truck from various angles and under different conditions. Four videos were used for the training set, containing 679 images and their respective labels. Two other videos were used for the validation set, consisting of 120 images and their respective labels.
The chosen object detector was YOLOv3, which has a low inference time and high accuracy. This helped with gathering results at a faster rate than would have been possible with an older version.
The method chosen for the automatic annotations was linear interpolation, implemented to work in conjunction with the labels of the training set: the label values of intermediate frames were approximated from the manually annotated frames surrounding them.
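A minimal sketch, in Python, of what such an interpolation step could look like, assuming that each label is stored as an (x, y, w, h) bounding box, that there is a single annotated object per frame, and that the function and variable names are hypothetical:

    def lerp(a, b, t):
        """Linearly interpolate between scalars a and b for t in [0, 1]."""
        return a + (b - a) * t

    def interpolate_labels(box_start, box_end, gap):
        """Approximate bounding boxes for the frames between two manually
        annotated keyframes that lie `gap` frames apart.

        box_start, box_end: (x, y, w, h) tuples from the manual annotations.
        Returns a list of `gap - 1` interpolated boxes, one per skipped frame.
        """
        boxes = []
        for i in range(1, gap):
            t = i / gap  # fraction of the way from the start to the end keyframe
            boxes.append(tuple(lerp(s, e, t) for s, e in zip(box_start, box_end)))
        return boxes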
The interpolation was done at different frame gaps: a gap of 10 frames, a gap of 20 frames, and so on, up to a gap of 60 frames. This was done to help locate a sweet spot where the model performed similarly to one trained on the manually annotated dataset.
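Under the same assumptions as the sketch above, sweeping the frame gap could then look like this, where only every Nth frame keeps its manual label and the frames in between receive interpolated ones:

    for gap in range(10, 70, 10):  # gaps of 10, 20, ..., 60 frames
        for k in range(0, len(manual_boxes) - gap, gap):
            # manual_boxes holds one (x, y, w, h) box per frame of the video
            between = interpolate_labels(manual_boxes[k], manual_boxes[k + gap], gap)
            # `between` now covers frames k+1 .. k+gap-1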
The results showed that the model trained on the fully manually annotated dataset approached a precision of 0.8, a recall of 0.96, and a mean average precision (mAP) of 0.95. Some of the models trained with interpolated frames between a set gap achieved similar results in these metrics, where interpolating between every 10th, every 20th, and every 30th frame showed the most promise. These all approached precision values of around 0.8, recall values of around 0.94, and mAP values of around 0.9.
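For reference, precision and recall follow their standard definitions, where TP, FP, and FN denote true positives, false positives, and false negatives:

    precision = TP / (TP + FP)
    recall    = TP / (TP + FN)

mAP is then the mean, taken over all object classes, of the average precision per class (the area under that class's precision-recall curve).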