Translated Abstract
Human action recognition is one of the most important research topics in computer vision; its goal is to extract, analyze, and recognize human action information from video. The performance of traditional action recognition methods depends largely on the quality of hand-crafted features, which are computationally expensive and limited in their ability to extract high-level semantic information from video. Action recognition based on deep learning simulates how the biological brain processes visual information, learning representations autonomously from images, from shallow features up to high-level ones. Based on deep learning, this thesis studies the modeling of long-range temporal structure in video and investigates a spatial-temporal two-stream convolutional neural network action recognition algorithm based on video segmentation. The main work of the thesis includes:
To address the limited action information captured by the traditional two-stream convolutional neural network, a spatial-temporal two-stream convolutional neural network action recognition method based on video segmentation is studied. The algorithm first divides the video into several non-overlapping segments of equal length, and from each segment randomly samples an RGB image representing the static appearance of the video and a stack of optical flow images representing its motion. These are fed into the spatial network stream and the temporal network stream, respectively, for feature learning. In each stream, the single-frame features output by the network are aggregated into video-level action features by several feature fusion methods. Finally, the video-level action recognition result is obtained by combining the recognition results of the spatial and temporal streams with an ensemble learning method.
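The segment-sampling, score-aggregation, and two-stream fusion steps described above can be sketched as follows. This is a minimal illustration, not the thesis's implementation; the function names, the choice of average/max fusion, and the stream weights are illustrative assumptions.

```python
import numpy as np

def sample_segments(num_frames, num_segments, rng=None):
    """Split the frame range into equal, non-overlapping segments and
    randomly pick one frame index per segment (hypothetical helper)."""
    rng = rng or np.random.default_rng(0)
    bounds = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    return [int(rng.integers(bounds[i], bounds[i + 1]))
            for i in range(num_segments)]

def segmental_consensus(snippet_scores, fusion="avg"):
    """Aggregate per-snippet class scores into a video-level score.
    Average and max fusion stand in for the 'variety of feature
    fusion methods' mentioned in the text."""
    scores = np.asarray(snippet_scores)
    if fusion == "avg":
        return scores.mean(axis=0)
    if fusion == "max":
        return scores.max(axis=0)
    raise ValueError(f"unknown fusion: {fusion}")

def fuse_two_streams(spatial, temporal, w_spatial=1.0, w_temporal=1.5):
    """Weighted late fusion of the spatial and temporal stream
    predictions; the weights here are assumed, not from the thesis."""
    spatial, temporal = np.asarray(spatial), np.asarray(temporal)
    return (w_spatial * spatial + w_temporal * temporal) / (w_spatial + w_temporal)
```

For example, a 300-frame video with 3 segments yields one sampled frame index in [0, 100), one in [100, 200), and one in [200, 300); the per-snippet scores from each stream are averaged, then the two streams are fused.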
The main difficulties in human action recognition are analyzed and addressed. Two problems are common. First, action recognition datasets are extremely small compared with the ImageNet dataset. Second, it is hard to design an effective network for feature extraction and expression. For the raw data, this thesis studies dense and sparse sampling, and investigates a variety of data augmentation strategies in the spatial stream, such as horizontal flipping, rotation, shift transformation, shear transformation, and random cropping. Spatial action recognition with convolutional neural network models of different structures and depths is analyzed. To prevent over-fitting, two transfer learning strategies are adopted: in the network training stage, a model pre-trained on ImageNet is used and the network parameters are fine-tuned at different levels; in addition, cross-modality pre-training is adopted to effectively improve network learning. The effect of the number of video segments on long-range temporal action recognition is also studied.
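Two of the ideas above can be sketched in a few lines: spatial-stream augmentation (horizontal flipping and random cropping) and cross-modality pre-training, which adapts first-layer weights pre-trained on RGB images to a stacked-optical-flow input. All names and tensor shapes here are illustrative assumptions, not the thesis's code.

```python
import numpy as np

def horizontal_flip(img):
    """Mirror an H x W x C image left-right (one of the augmentations above)."""
    return img[:, ::-1]

def random_crop(img, crop_size, rng=None):
    """Cut a random (ch, cw) window out of the image; a sketch of the
    random-cropping augmentation."""
    rng = rng or np.random.default_rng(0)
    h, w = img.shape[:2]
    ch, cw = crop_size
    top = int(rng.integers(0, h - ch + 1))
    left = int(rng.integers(0, w - cw + 1))
    return img[top:top + ch, left:left + cw]

def cross_modality_init(rgb_kernel, flow_channels):
    """Adapt ImageNet-pretrained first-layer weights of shape
    (out_channels, 3, kh, kw) to a flow input: average over the RGB
    channel axis, then replicate the mean `flow_channels` times."""
    mean_kernel = rgb_kernel.mean(axis=1, keepdims=True)
    return np.repeat(mean_kernel, flow_channels, axis=1)
```

With a stack of, say, 10 flow fields, `cross_modality_init` turns an `(out, 3, 7, 7)` RGB kernel into an `(out, 10, 7, 7)` kernel, so the temporal stream can reuse ImageNet initialization rather than train from scratch on a small dataset.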
Translated Keywords
[Convolutional neural network, Deep learning, Human action recognition, Segmental two-stream network]
Corresponding author's email