Recognizing Human Actions from Images with Attention Mechanism
Date
2022-05
Author
Baş, Çağdaş
Open access
Abstract
Human action recognition in still images is a challenging problem because a still image lacks the motion information available in video. A single snapshot of an ongoing action cannot match the rich information provided by a video. In this thesis, we explore combining the surrounding information with two different attentional multiple-instance mechanisms.
The surrounding objects and scene clues are essential in still-image action recognition. However, detecting every object is not feasible. For this reason, we apply two different attention mechanisms to candidate action-related regions proposed by a region proposal network.
The first attention layer is the bottom-up attention layer. It learns a spatial attention map to refine each proposal according to the ongoing action. It suppresses the background and highlights only the foreground and the pixels related to the action. Our experiments show that the bottom-up attention layer increases the model's accuracy. Visual analysis of the highlighted areas shows that it successfully finds action-related objects, scene clues and poselets.
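As a rough illustration of what a spatial attention map over a single proposal could look like, the following sketch scores every location of a region's feature map and gates the features with the result. It is not the thesis implementation: the function name `spatial_attention`, the linear scoring parameters `weights` and `bias`, and the choice of a sigmoid gate are all assumptions made for this example.

```python
import math


def _sigmoid(score):
    # Numerically stable sigmoid, returns a value in (0, 1).
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    e = math.exp(score)
    return e / (1.0 + e)


def spatial_attention(feature_map, weights, bias=0.0):
    """Gate each spatial location of one region's feature map.

    feature_map: H x W grid of C-dimensional feature vectors (nested lists).
    weights, bias: hypothetical learned parameters of a per-location scorer.
    Returns (attention_map, refined_map).
    """
    attention_map, refined_map = [], []
    for row in feature_map:
        att_row, ref_row = [], []
        for vec in row:
            # Linear score for this location, squashed into a gate.
            score = sum(w * x for w, x in zip(weights, vec)) + bias
            gate = _sigmoid(score)
            att_row.append(gate)
            # Scale the feature vector by its gate value.
            ref_row.append([gate * x for x in vec])
        attention_map.append(att_row)
        refined_map.append(ref_row)
    return attention_map, refined_map


# Toy 2 x 2 feature map with 2 channels per location.
fmap = [[[4.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [-4.0, 0.0]]]
att, refined = spatial_attention(fmap, weights=[1.0, 0.0])
```

In this toy input, the location with a strongly positive score receives a gate close to one, while the location with a strongly negative score is pushed toward zero, which mirrors the idea of highlighting action-related pixels and suppressing the rest.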
The second attention layer is the top-down attention layer. It learns to select which region proposals are related to the ongoing action. There may be multiple action-related clues in an image, and the bottom-up attention layer can highlight multiple image regions. However, selecting the related proposals is the top-down attention layer's task. It learns to select regions and combines their features to create a single image-level descriptor. Our experiments show that the top-down attention layer successfully selects the related regions and boosts the overall performance.
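Selecting regions and combining them into one image-level descriptor can be sketched as attention-weighted pooling over region features, a common form of attention-based multiple instance pooling. The version below is only a minimal sketch under stated assumptions: the name `attention_pool`, the linear scorer given by `weights` and `bias`, and the softmax normalization are illustrative choices, not details confirmed by the thesis.

```python
import math


def attention_pool(region_features, weights, bias=0.0):
    """Combine per-region feature vectors into one image-level descriptor.

    region_features: list of N feature vectors, each of length D.
    weights, bias: hypothetical learned parameters of a linear region scorer.
    Returns (alphas, descriptor), where alphas are the N attention weights.
    """
    # One relevance score per region proposal.
    scores = [sum(w * x for w, x in zip(weights, feat)) + bias
              for feat in region_features]
    # Softmax over regions, shifted by the max score for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Attention-weighted sum of region features.
    dim = len(region_features[0])
    descriptor = [sum(a * feat[d] for a, feat in zip(alphas, region_features))
                  for d in range(dim)]
    return alphas, descriptor


# Three toy region features; the third scores far higher than the others.
regions = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
alphas, descriptor = attention_pool(regions, weights=[1.0, 1.0])
```

Because the weights sum to one, the descriptor stays on the same scale as the region features regardless of how many proposals there are, and regions with low scores contribute almost nothing to it.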
Our proposed model can be plugged in after any region proposal network and allows end-to-end learning. This way, the network simultaneously learns to propose action-related regions, weight each region with an action attention map, and select and combine these regions into an image feature vector. As a result, our model improved the state-of-the-art average precision on four different datasets.