Translated Abstract
Visual human action recognition is a fundamental technology for intelligent visual surveillance, human-machine interfaces, video retrieval, and many other applications. Action recognition remains a challenging task owing to illumination variation, viewpoint variation, background clutter, and similar factors. As video has become one of the main carriers of information, and with the recent development of computer vision and artificial intelligence, efficiently analyzing the semantic content of videos and classifying the actions they contain has become a hot topic in both academia and industry. Deep learning techniques have attracted wide attention because they can automatically extract highly abstract semantic features from large amounts of data. As a representative deep learning architecture, the Convolutional Neural Network (CNN) has the characteristics of local connectivity and weight sharing, which enable effective high-level representations of the data. Furthermore, the Recurrent Neural Network (RNN) can exploit information along the time dimension, and a video can be viewed as a time series of image frames with exploitable context information.
Firstly, a fall detection method based on a 3D convolutional neural network (3D CNN) is proposed in this thesis. Training a deep model for fall detection is difficult because the available training data are very limited. In addition, a 2D CNN encodes only spatial information rather than motion information, so it is ill-suited to fall detection in videos, which are temporal sequences. Since three-dimensional convolution can effectively extract spatio-temporal information, the 3D CNN is introduced into the fall detection task. The large-scale sports dataset Sports-1M is employed to train the 3D CNN to extract features that effectively represent motion in videos. The trained 3D CNN is then used directly as an automatic feature extractor for short clips, and its features are combined with a linear SVM classifier to detect falls.
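As a rough illustration of this pipeline, the sketch below (assuming PyTorch and scikit-learn) shows a frozen 3D CNN producing clip descriptors that feed a linear SVM. The C3D-style backbone, layer sizes, and toy labels are illustrative assumptions, not the exact network or data of the thesis.

```python
import torch
import torch.nn as nn
from sklearn.svm import LinearSVC

class C3DBackbone(nn.Module):
    """Simplified 3D CNN: stacked 3x3x3 convolutions over 16-frame clips.
    Illustrative stand-in for a Sports-1M-pretrained network."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv3d(3, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),
            nn.Conv3d(64, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((2, 2, 2)),
            nn.AdaptiveAvgPool3d(1),          # global spatio-temporal pooling
        )

    def forward(self, clip):                  # clip: (N, 3, 16, H, W)
        return self.features(clip).flatten(1) # (N, 128) clip descriptor

backbone = C3DBackbone().eval()               # weights would come from Sports-1M pretraining
with torch.no_grad():
    clips = torch.randn(8, 3, 16, 112, 112)   # batch of eight 16-frame clips
    feats = backbone(clips).numpy()

labels = [0, 1, 0, 1, 0, 1, 0, 1]             # hypothetical fall / no-fall labels
svm = LinearSVC().fit(feats, labels)          # linear SVM on frozen 3D CNN features
print(svm.predict(feats[:2]))
```

Keeping the 3D CNN frozen and training only the linear SVM is what makes the approach workable with very limited fall-detection data, as argued above.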
Secondly, by combining the 3D convolutional neural network with an LSTM-based visual attention model, a visual-attention-guided 3D convolutional neural network is proposed for action recognition in this thesis. In general, only a small region of each frame provides action-related information, while the rest is largely irrelevant to action recognition and therefore redundant. The LSTM-based visual attention model automatically learns to focus on the regions of interest and to aggregate information over long sequences from local short-term dynamic features. The visual-attention-guided 3D convolutional neural network is trained on the UCF-11, HMDB-51, and Multiple Cameras Fall datasets to classify actions.
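A minimal sketch of such an LSTM-based soft-attention mechanism is given below, operating on a sequence of spatial feature maps (e.g. produced by the 3D CNN for successive short clips). All dimensions, names, and the single-layer attention scorer are illustrative assumptions rather than the thesis's exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLSTM(nn.Module):
    """Soft attention over spatial locations, recurrently guided by an LSTM."""
    def __init__(self, feat_dim=128, hidden_dim=256, num_classes=11):
        super().__init__()
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)  # scores one spatial location
        self.lstm = nn.LSTMCell(feat_dim, hidden_dim)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, feat_maps):
        # feat_maps: (T, N, L, D) -- T timesteps, L = H*W spatial locations,
        # a D-dimensional feature at each location
        T, N, L, D = feat_maps.shape
        h = feat_maps.new_zeros(N, self.lstm.hidden_size)
        c = torch.zeros_like(h)
        for t in range(T):
            x = feat_maps[t]                                 # (N, L, D)
            # attention scores from the current hidden state and each local feature
            scores = self.attn(torch.cat(
                [h.unsqueeze(1).expand(-1, L, -1), x], dim=-1)).squeeze(-1)
            alpha = F.softmax(scores, dim=1)                 # (N, L) attention map
            context = (alpha.unsqueeze(-1) * x).sum(dim=1)   # (N, D) attended feature
            h, c = self.lstm(context, (h, c))                # recurrent update
        return self.classifier(h)                            # class scores

model = AttentionLSTM()
logits = model(torch.randn(10, 4, 49, 128))  # 10 steps, 7x7 grid, 128-d features
```

The attention map `alpha` is the quantity inspected later in the abstract: visualizing it shows which spatial regions the model deems most informative in each frame.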
On the Multiple Cameras Fall dataset, the linear SVM classifier with the 3D CNN feature extractor achieves a high average classification accuracy. Multiple experiments have been carried out on the action classification benchmarks UCF-11 and HMDB-51 using the visual-attention-guided 3D CNN, and the classification accuracy of the proposed method is comparable to state-of-the-art performance. Further experiments with different feature extraction methods and different feature sampling methods verify the effectiveness of the proposed method. By analyzing where the model focuses its attention, it can be seen that the key regions that are most informative in each frame are correctly discovered.