Understanding human actions in videos captured by UAVs is a challenging computer vision task because of the unusual viewpoints on individuals and the variations in their apparent size caused by the camera's position and motion.
This work proposes DroneCaps, a capsule network architecture for multi-label human action recognition (HAR) in videos captured by UAVs. DroneCaps combines features computed by 3D convolutional neural networks with a new set of features computed by a novel Binary Volume Comparison layer.
Together with the learning power of CapsNets, these features enable the network to understand and abstract the different viewpoints and poses of the depicted individuals efficiently, thus improving multi-label HAR.
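To illustrate what "multi-label" means here, the sketch below shows a generic multi-label output stage: each action class is scored independently with a sigmoid, so several actions can be active for the same clip. This is a minimal, hypothetical illustration of multi-label classification in general, not the paper's capsule routing or Binary Volume Comparison layer; the feature dimensions and class count are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def multilabel_predict(features, weights, bias, threshold=0.5):
    """Score each action class independently and keep every class whose
    sigmoid score exceeds the threshold (multi-label output)."""
    scores = sigmoid(features @ weights + bias)
    return scores, scores >= threshold

rng = np.random.default_rng(0)
clip_features = rng.normal(size=8)   # pooled spatio-temporal features (hypothetical size)
weights = rng.normal(size=(8, 3))    # 3 hypothetical action classes
bias = np.zeros(3)

scores, labels = multilabel_predict(clip_features, weights, bias)
print(labels)  # multiple classes can be True at once
```

In contrast to single-label softmax classification, the per-class sigmoid scores are not forced to sum to one, which is what lets a clip be tagged with several simultaneous actions.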
The evaluation of the DroneCaps architecture’s performance for multi-label classification shows that it outperforms state-of-the-art methods on the Okutama-Action dataset.
Read more at: https://ieeexplore.ieee.org/document/9190864