This paper proposes a method to facilitate the labelling of music performance videos with automatic methods (3D convolutional neural networks) instead of tedious manual labelling by human experts. In particular, we are interested in the detection of 17 musical performance gestures produced during video-recorded guitar performances of musical pieces. In earlier work, these videos were annotated manually by a human expert according to the labels of a musical analysis methodology. Such manual labelling is time-consuming and does not scale to large collections of video recordings. In this paper, we take a 3D-CNN model pre-trained on activity recognition tasks and adapt it to the music performance dataset following a transfer learning approach: the weights of the first blocks are kept fixed, and only the later layers, together with additional classification layers, are re-trained. The model was evaluated on a set of 17 music performance gestures and achieves an average accuracy of 97.9% (F1: 77.8%) on the training set and 85.7% (F1: 38.6%) on the test set. An additional analysis shows which gestures are particularly difficult and suggests improvements for future work.
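
To make the transfer learning setup concrete, the sketch below shows one possible way to keep the early blocks of a pre-trained 3D-CNN fixed and re-train only the later layers together with a new 17-class classification head. It uses torchvision's r3d_18 video model (pre-trained on Kinetics-400) as a stand-in backbone; the specific architecture, the choice of frozen blocks, the added head, and all hyper-parameters are illustrative assumptions, not the configuration used in the paper.

```python
# Minimal sketch (assumptions: r3d_18 backbone, stem + first two residual
# blocks frozen; the paper's actual model and freezing point may differ).
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

NUM_GESTURES = 17  # number of music performance gesture classes

# Load a 3D-CNN pre-trained on an activity recognition dataset (Kinetics-400).
model = r3d_18(weights="KINETICS400_V1")

# Keep the weights of the early blocks fixed ...
for module in (model.stem, model.layer1, model.layer2):
    for param in module.parameters():
        param.requires_grad = False

# ... and replace the classifier with additional layers for the 17 gestures.
model.fc = nn.Sequential(
    nn.Linear(model.fc.in_features, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, NUM_GESTURES),
)

# Only the unfrozen later layers and the new head are optimised.
trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable, lr=1e-4)

# Example forward pass: batch of 2 clips, 3 channels, 16 frames, 112x112 pixels.
clips = torch.randn(2, 3, 16, 112, 112)
logits = model(clips)  # shape: (2, 17)
```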