Book Description
This thesis contributes to the literature on understanding and recognizing human activities in videos. More specifically, the thesis draws a line between short-range atomic actions and long-range complex activities. For classifying the latter, the mainstream approach in the literature is to divide the activity into a handful of short segments, called atomic actions. A neural model, such as a 3D CNN, is then trained to represent and classify each segment independently, and the activity-level classification probability scores are obtained by pooling over the segment-level scores. In contrast, this work argues that long-range activities are better classified in full. That is to say, the neural model has to reason about the long-range activity all at once to better recognize it. Based on this argument, the thesis proposes different methods and neural network models for recognizing these complex activities.