Pavel Tokmakov

I am currently a Senior Research Scientist at Toyota Research Institute, where I work on video representation learning. Prior to that, I was a postdoc at CMU working with Martial Hebert and Deva Ramanan.
I completed my PhD at Inria, France, under the supervision of Cordelia Schmid and Karteek Alahari, studying the role of motion in object recognition. Before my PhD, I worked on statistical relational learning and interactive knowledge discovery. You can find my CV here. A full list of publications is available on Google Scholar.


Selected publications

Zero-1-to-3: Zero-shot One Image to 3D Object

We introduce Zero-1-to-3, a framework for changing the camera viewpoint of an object given just a single RGB image. To perform novel view synthesis in this under-constrained setting, we capitalize on the geometric priors that large-scale diffusion models learn about natural images. Our viewpoint-conditioned diffusion approach can further be used for the task of 3D reconstruction from a single image.

Breaking the “Object” in Video Object Segmentation

VOST is a semi-supervised video object segmentation benchmark that focuses on complex object transformations. Unlike in existing datasets, objects in VOST are broken, torn and molded into new shapes, dramatically changing their overall appearance. As our experiments demonstrate, this presents a major challenge for mainstream, appearance-centric VOS methods.

Object Permanence Emerges in a Random Walk along Memory

We propose a self-supervised objective for learning representations that localize objects under occlusion, a property known as object permanence. A central question is the choice of learning signal in cases of total occlusion. Rather than directly supervising the locations of invisible objects, our objective requires neither human annotation nor assumptions about object dynamics.

Discovering Objects that Can Move

Existing approaches for object discovery rely on appearance cues such as color, texture and location. However, by relying on appearance alone, these methods fail to reliably separate objects from the background in cluttered scenes. To resolve this ambiguity, in this work we focus on dynamic objects: entities that are capable of moving independently in the world.

TAO: A large-scale benchmark for tracking any object

For many years, multi-object tracking benchmarks have focused on a handful of categories. To address this limitation, we introduce a diverse dataset for Tracking Any Object (TAO). It consists of 2,907 high-resolution videos covering 833 categories. We perform an extensive evaluation of state-of-the-art trackers and make a number of important discoveries about large-vocabulary tracking in an open world.

Learning compositional representations for few-shot recognition

Deep learning representations lack compositionality, a property that is instrumental to the human ability to learn novel concepts from a few examples. In this work we investigate several approaches to enforcing this property during training. The resulting models demonstrate significant improvements in the few-shot setting.

A structured model for action detection

A dominant paradigm for learning-based approaches in computer vision is training generic models on large datasets and allowing them to discover the optimal representation for the problem at hand. In this work we propose instead to integrate domain knowledge into the architecture of an action detection model. This allows us to achieve significant improvements over the state of the art without much parameter tuning.

Towards segmenting anything that moves

Detecting and segmenting all the objects in a scene is a key requirement for agents operating in the world. However, even defining what an object is can be ambiguous. In this work we use motion as a bottom-up cue and propose a learning-based method for category-agnostic instance segmentation in videos.

Learning to segment moving objects

Motion segmentation is the classical problem of separating moving objects in a video from the background. In this work we propose the first learning-based approach to this problem. We then extend the model with an appearance stream and a visual memory module, allowing it to segment objects before they start and after they stop moving.

Weakly-supervised semantic segmentation using motion cues

Semantic segmentation models require a large amount of expensive, pixel-level annotations to train. We propose to reduce the annotation burden by training the models on weakly-labeled videos and obtaining information about the precise shape of objects from motion for free. Our model integrates motion cues into a label inference framework in a soft way, which allows it to automatically improve the quality of the masks during training.