Translated Abstract
Visual human action recognition is a fundamental technology for intelligent visual surveillance, human-machine interfaces, video retrieval, and many other applications. Action recognition remains a challenging task owing to illumination variation, viewpoint variation, background clutter, and similar factors. As video has become one of the main carriers of information, and with the recent development of computer vision and artificial intelligence, efficiently analyzing the semantic content of videos and classifying the actions they contain has become a hot topic in both academia and industry. Deep learning techniques have attracted wide attention because they can automatically extract highly abstract semantic features from large amounts of data. As a representative deep learning architecture, the Convolutional Neural Network (CNN) has the characteristics of local connectivity and weight sharing, which enable effective high-level representations of the data. Furthermore, the Recurrent Neural Network (RNN) can exploit information along the time dimension, and a video can be viewed as a time series of image frames with exploitable context information.
Firstly, a fall detection method based on a 3D convolutional neural network (3D CNN) is proposed in this thesis. Training a deep model for fall detection is difficult because the available training data are very limited. In addition, a 2D CNN encodes only spatial information rather than motion information, so it is ill-suited to fall detection in videos, which are temporal sequences. Since three-dimensional convolution can effectively extract spatio-temporal information, the 3D CNN is introduced into the fall detection task. The large-scale sports dataset Sports-1M is employed to train the 3D CNN to extract features that effectively represent motion in videos. The trained 3D CNN is then used directly as an automatic feature extractor for short clips, and its features are combined with a linear SVM classifier to detect falls.
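As a rough illustration of this pipeline, the sketch below (assuming PyTorch and scikit-learn) shows a frozen 3D CNN producing clip descriptors that feed a linear SVM. The C3D-style backbone, layer sizes, and toy labels are illustrative assumptions, not the exact network or data of the thesis.

```python
import torch
import torch.nn as nn
from sklearn.svm import LinearSVC

class C3DBackbone(nn.Module):
    """Simplified 3D CNN: stacked 3x3x3 convolutions over 16-frame clips.
    Illustrative stand-in for a Sports-1M-pretrained network."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),
            nn.AdaptiveAvgPool3d(1),          # global spatio-temporal pooling
        )

    def forward(self, clip):                  # clip: (N, 3, 16, H, W)
        return self.features(clip).flatten(1) # (N, 128) clip descriptor

backbone = C3DBackbone().eval()               # weights would come from Sports-1M pretraining
with torch.no_grad():
    clips = torch.randn(8, 3, 16, 112, 112)   # batch of eight 16-frame clips
    feats = backbone(clips).numpy()

labels = [0, 1, 0, 1, 0, 1, 0, 1]             # hypothetical fall / no-fall labels
svm = LinearSVC().fit(feats, labels)          # linear SVM on frozen 3D CNN features
print(svm.predict(feats[:2]))
```

Keeping the 3D CNN frozen and training only the linear SVM is what makes the approach workable with very limited fall-detection data, as argued above.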
Secondly, by combining the 3D convolutional neural network with an LSTM-based visual attention model, a visual-attention-guided 3D convolutional neural network is proposed for action recognition in this thesis. In general, only a small region of each frame provides action-related information, while the rest is largely irrelevant to action recognition and therefore redundant. The LSTM-based visual attention model automatically learns to focus on the regions of interest and to aggregate information over long sequences from local short-term dynamic features. The visual-attention-guided 3D convolutional neural network is trained on the UCF-11, HMDB-51, and Multiple Cameras Fall datasets to classify actions.
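A minimal sketch of such an LSTM-based soft-attention mechanism is given below, operating on a sequence of spatial feature maps (e.g. produced by the 3D CNN for successive short clips). All dimensions, names, and the single-layer attention scorer are illustrative assumptions rather than the thesis's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLSTM(nn.Module):
    """Soft attention over spatial locations, recurrently guided by an LSTM."""
    def __init__(self, feat_dim=128, hidden_dim=256, num_classes=11):
        super().__init__()
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)  # scores one spatial location
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, feat_maps):
        # feat_maps: (T, N, L, D) -- T timesteps, L = H*W spatial locations,
        # a D-dimensional feature at each location
        T, N, L, D = feat_maps.shape
        h = feat_maps.new_zeros(N, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        for t in range(T):
            x = feat_maps[t]                                 # (N, L, D)
            # attention scores from the current hidden state and each local feature
            scores = self.attn(torch.cat(
                [h.unsqueeze(1).expand(-1, L, -1), x], dim=-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=1)                 # (N, L) attention map
            context = (alpha.unsqueeze(-1) * x).sum(dim=1)   # (N, D) attended feature
            h, c = self.lstm(context, (h, c))                # recurrent update
        return self.classifier(h)                            # class scores

model = AttentionLSTM()
logits = model(torch.randn(10, 4, 49, 128))  # 10 steps, 7x7 grid, 128-d features
```

The attention map `alpha` is the quantity inspected later in the abstract: visualizing it shows which spatial regions the model deems most informative in each frame.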
On the Multiple Cameras Fall dataset, the linear SVM classifier with the 3D CNN feature extractor achieves a high average classification accuracy. Multiple experiments have been carried out on the action classification benchmarks UCF-11 and HMDB-51 using the visual-attention-guided 3D CNN, and the classification accuracy of the proposed method is comparable to state-of-the-art performance. Further experiments with different feature extraction methods and different feature sampling methods verify the effectiveness of the proposed method. By analyzing where the model focuses its attention, it can be seen that the key regions that are most informative in each frame are correctly discovered.