Sample retrieval results. Courtesy of Graham W. Taylor, Ian Spiro, Chris Bregler and Rob Fergus
Computer vision has hit the mainstream, with applications such as cars that detect pedestrians, motion capture for animation, and apps that let you cash a cheque by snapping a picture with your mobile phone. A great example of computer vision in the consumer market is Microsoft's Kinect gaming system, which can accurately detect the pose of one or more players, allowing gameplay to be controlled using just the body. Such a system must detect pose reliably under a wide variety of conditions: different players, unusual clothing, poor lighting, cluttered backgrounds, and other sources of variation.

One way to perform pose estimation is to keep a large database of examples of people in a variety of poses, along with labels indicating the configuration of the body in 2D or 3D. When presented with a new, unlabeled example, we can compare it against the database to find the best match, then assign the labels of that match to the new example. The matching (or similarity) problem is a very tough one, however, largely because of the input variability introduced by the factors described above. If we had many examples of people in similar poses but under differing conditions, we could use machine learning to construct an algorithm that matches based on the important information (e.g. pose) and ignores the distracting information (e.g. lighting, clothing, background).

But how do we collect such data? In a somewhat unusual move for computer scientists, we turned to the Dutch progressive-electro band C-Mon and Kypski. Their music video/crowdsourcing project "One Frame of Fame" asks people on the web to replace one frame of the band's music video for the song "More or Less" with a capture from a webcam. A visitor to the band's website is shown a single frame of the video and asked to imitate it in front of the camera. The new contribution is spliced into the video, which updates once an hour.
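The database-matching idea described above is essentially a nearest-neighbour lookup with label transfer. Here is a minimal, purely illustrative sketch: the feature vectors and pose labels are random stand-ins for real image descriptors and joint coordinates, and all names and sizes are invented for this example.

```python
import math
import random

# Hypothetical database: one feature vector per example image, paired
# with a pose label (here, 2-D coordinates for 14 body joints).
random.seed(0)
N, D, JOINTS = 100, 32, 14
database_features = [[random.gauss(0, 1) for _ in range(D)] for _ in range(N)]
database_poses = [[(random.gauss(0, 1), random.gauss(0, 1))
                   for _ in range(JOINTS)] for _ in range(N)]

def transfer_pose(query_features):
    """Return the pose label of the database example nearest the query."""
    best = min(range(N),
               key=lambda i: math.dist(database_features[i], query_features))
    return database_poses[best]

# A new, unlabeled example inherits the pose of its best match.
query = [random.gauss(0, 1) for _ in range(D)]
estimated_pose = transfer_pose(query)
```

The hard part, as noted above, is computing features for which this distance reflects pose rather than lighting, clothing, or background; that is exactly what the learned similarity provides.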
This turns out to be the perfect data source for learning an algorithm to compute similarity based on pose. Armed with the band's data and a few machine learning tricks up our sleeves, we built a system that is highly effective at matching people in similar pose but under widely different settings.
|Schematic of the approach. It assumes that, for each frame of video, there exists an unobserved low-dimensional representation of pose, Z. A seed image is generated by mapping from pose space to pixels, X, through an unobserved interpretation function. The method learns a nonlinear embedding, f(X|θ), which approximates Z with a low-dimensional vector. In the example above, users are asked to imitate seed images taken from a music video (http://oneframeoffame.com).|
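One way such an embedding could be trained is with a contrastive objective: pairs known to show the same pose (e.g. a seed frame and its webcam imitation) are pulled together in the embedding space, while mismatched pairs are pushed apart up to a margin. The sketch below is only illustrative, not the paper's method: it substitutes a linear map for the nonlinear f(X|θ), and all sizes, names, and the learning rate are invented for this example.

```python
import random

random.seed(1)
D_IN, D_OUT = 16, 2   # toy sizes: image features -> low-dim pose space

# Linear embedding f(x | theta) = W x, a stand-in for a nonlinear
# embedding; here the parameters theta are just the matrix W.
W = [[random.gauss(0, 0.1) for _ in range(D_IN)] for _ in range(D_OUT)]

def embed(x):
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def contrastive_step(x1, x2, same_pose, lr=0.05, margin=1.0):
    """One SGD step on a contrastive loss: pull matched pairs (same pose,
    different appearance) together; push mismatched pairs apart until
    they are at least `margin` apart."""
    z1, z2 = embed(x1), embed(x2)
    d = sq_dist(z1, z2)
    for i in range(D_OUT):
        for j in range(D_IN):
            # gradient of d(z1, z2) with respect to W[i][j]
            g = 2 * (z1[i] - z2[i]) * (x1[j] - x2[j])
            if same_pose:
                W[i][j] -= lr * g        # minimize embedding distance
            elif d < margin:
                W[i][j] += lr * g        # maximize it, up to the margin
```

After training on many imitation pairs, the nearest-neighbour lookup is done in the embedding space rather than pixel space, so retrieval is driven by pose rather than appearance.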
Watch the video clip from C-Mon and Kypski.
This project appeared in CVPR 2011.
If you would like to showcase your project on our front page, please contact us. We'd love to hear from you.