Pavel Tokmakov

I am currently a Research Scientist at Toyota Research Institute, where I study object perception in videos. Prior to that I was a postdoc at CMU working with Martial Hebert and Deva Ramanan.
I completed my PhD at Inria, France, under the supervision of Cordelia Schmid and Karteek Alahari, studying the role of motion in object recognition. Prior to my PhD, I worked on statistical relational learning and interactive knowledge discovery. You can find my CV here. A full list of publications is available on Google Scholar.


Selected publications

Tao: A large-scale benchmark for tracking any object

For many years, multi-object tracking benchmarks have focused on a handful of categories. To address this limitation, we introduce a diverse dataset for Tracking Any Object (TAO). It consists of 2,907 high-resolution videos covering 833 categories. We perform an extensive evaluation of state-of-the-art trackers and make a number of important discoveries regarding large-vocabulary tracking in an open-world setting.

Learning compositional representations for few-shot recognition

Deep learning representations lack compositionality, a property that is instrumental to the human ability to learn novel concepts from a few examples. In this work we investigate several approaches to enforcing this property during training. The resulting models demonstrate significant improvements in the few-shot setting.

A structured model for action detection

A dominant paradigm for learning-based approaches in computer vision is to train generic models on large datasets and allow them to discover the optimal representation for the problem at hand. In this work we propose instead to integrate some domain knowledge into the architecture of an action detection model. This allows us to achieve significant improvements over the state of the art without much parameter tuning.

Towards segmenting anything that moves

Detecting and segmenting all the objects in a scene is a key requirement for agents operating in the world. However, even defining what constitutes an object is ambiguous. In this work we use motion as a bottom-up cue and propose a learning-based method for category-agnostic instance segmentation in videos.

Learning to segment moving objects

Motion segmentation is the classical problem of separating moving objects in a video from the background. In this work we propose the first learning-based approach to this problem. We then extend the model with an appearance stream and a visual memory module, allowing it to segment objects before they start and after they stop moving.

Weakly-supervised semantic segmentation using motion cues

Semantic segmentation models require a large number of expensive, pixel-level annotations to train. We propose to reduce the annotation burden by training the models on weakly labeled videos and obtaining information about the precise shape of objects from motion for free. Our model integrates motion cues into a label inference framework in a soft way, which allows it to automatically improve the quality of the masks during training.