Moving in Sync: Self-supervised learning of n-human interactions
Speaker: Sonal Sannigrahi
Track: Data Science
Room: Video Stream 2
Time: Oct 08 (Fri): 13:45
When watching a TV show or movie, we can easily tell which interactions are taking place on screen and between whom. For computers, however, this is far from trivial and remains a challenging task, as it involves both tracking humans and learning the semantics of the interaction taking place. This is an important problem in computer vision: solving it would allow easy annotation of videos (paving the way for unsupervised learning!), automated surveillance, and fast content-based video retrieval (just imagine finding a video by describing its actions in one go!). Attendees will get the most out of this talk if they are interested in computer vision and/or have some prior knowledge of convolutional architectures in machine learning; familiarity with action recognition is an additional benefit, but not essential.
In this talk, we introduce a novel self-supervised method (termed "Sync-3D") that learns spatio-temporal video embeddings to enable the detection of human interactions. Our work combines the I3D architecture used for action localisation with the Siamese SyncNet architecture for video-audio synchronisation, casting human interaction detection as a problem of motion synchronisation in both space and time. This talk will cover the motivations behind this architectural choice, our learning framework, a new data sampling strategy for curriculum learning, and finally how our architecture compares to others on the downstream task of interaction classification on the challenging TV-HID dataset. We will also motivate some future research directions and point out possible improvements to this system.
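To give a flavour of the synchronisation idea, the following is a minimal sketch of a SyncNet-style contrastive loss applied to the embeddings of two people's motion clips. All names, dimensions, and the margin value are illustrative assumptions, not the actual Sync-3D implementation; in the talk's setting the embeddings would come from an I3D-based Siamese network rather than random arrays.

```python
import numpy as np

def l2_normalise(x, eps=1e-8):
    # Normalise each embedding vector to unit length.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def contrastive_sync_loss(emb_a, emb_b, synced, margin=2.0):
    """Hedged sketch of a margin-based synchronisation loss.

    emb_a, emb_b: (T, D) per-timestep embeddings for two people's clips.
    synced: True if the clips are temporally aligned (positive pair).
    Positive pairs are pulled together; negative pairs are pushed
    apart until their distance exceeds the margin.
    """
    emb_a, emb_b = l2_normalise(emb_a), l2_normalise(emb_b)
    d = np.linalg.norm(emb_a - emb_b, axis=-1)  # per-timestep distance
    if synced:
        return float(np.mean(d ** 2))
    return float(np.mean(np.maximum(margin - d, 0.0) ** 2))

# Toy usage: nearly aligned embeddings give a small positive-pair loss,
# while unrelated embeddings give a larger negative-pair loss.
rng = np.random.default_rng(0)
a = rng.normal(size=(8, 16))
pos = contrastive_sync_loss(a, a + 0.01 * rng.normal(size=(8, 16)), synced=True)
neg = contrastive_sync_loss(a, rng.normal(size=(8, 16)), synced=False)
```

Casting interaction detection this way means no manual labels are needed: aligned and misaligned clip pairs can be mined directly from the video, which is what makes the method self-supervised.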