Current Research

I am currently undertaking research in four areas: (a) 3D scene reconstruction and semantic segmentation, (b) 2D shape detection, (c) segmentation hierarchy editing, and (d) visual object tracking.

3D Scene Reconstruction and Semantic Segmentation

Real-Time RGB-D Camera Pose Estimation in Novel Scenes using a Relocalisation Cascade
Submitted September 2018
arXiv Link, Preprint Link
Tommaso Cavallari*, Stuart Golodetz*, Nicholas A. Lord*, Julien Valentin*, Victor A. Prisacariu, Luigi Di Stefano and Philip H. S. Torr
Camera pose estimation is an important problem in computer vision, with applications as diverse as simultaneous localisation and mapping, virtual/augmented reality and navigation. Common techniques match the current image against keyframes with known poses coming from a tracker, directly regress the pose, or establish correspondences between keypoints in the current image and points in the scene in order to estimate the pose. In recent years, regression forests have become a popular alternative to establish such correspondences. They achieve accurate results, but have traditionally needed to be trained offline on the target scene, preventing relocalisation in new environments. Recently, we showed how to circumvent this limitation by adapting a pre-trained forest to a new scene on the fly. The adapted forests achieved relocalisation performance that was on par with that of offline forests, and our approach was able to estimate the camera pose in close to real time, which made it desirable for systems that require online relocalisation. In this paper, we present an extension of this work that achieves significantly better relocalisation performance whilst running fully in real time. To achieve this, we make several changes to the original approach: (i) instead of simply accepting the camera pose hypothesis produced by RANSAC without question, we make it possible to score the final few hypotheses it considers using a geometric approach and select the most promising one; (ii) we chain several instantiations of our relocaliser (with different parameter settings) together in a cascade, allowing us to try faster but less accurate relocalisation first, only falling back to slower, more accurate relocalisation as necessary; and (iii) we tune the parameters of our cascade, and the individual relocalisers it contains, to achieve effective overall performance. 
Taken together, these changes allow us to significantly improve upon the performance our original state-of-the-art method was able to achieve on the well-known 7-Scenes and Stanford 4 Scenes benchmarks. As additional contributions, we present a novel way of visualising the internal behaviour of our forests, and use the insights gleaned from this to show how to entirely circumvent the need to pre-train a forest on a generic scene.
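As a rough illustration of the cascade idea described above (and only that: the function names, the quality function and the threshold below are illustrative assumptions, not the actual implementation), the fallback logic can be sketched as:

```python
def relocalise_with_cascade(frame, relocalisers, quality_fn, threshold=0.8):
    """Try each relocaliser in order (fastest first); accept the first pose
    whose quality score clears the threshold, else fall back to the best
    hypothesis seen across the whole cascade. All names here are hypothetical."""
    best_pose, best_q = None, float('-inf')
    for reloc in relocalisers:
        pose = reloc(frame)
        if pose is None:
            continue  # this relocaliser failed outright; try the next one
        q = quality_fn(pose, frame)  # e.g. a geometric verification score
        if q >= threshold:
            return pose  # good enough: stop early and stay fast
        if q > best_q:
            best_pose, best_q = pose, q
    return best_pose
```

The design point is that the slower, more accurate relocalisers are only ever invoked when the faster ones fail the quality check, so the common case stays fast.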
Collaborative Large-Scale Dense 3D Reconstruction with Online Inter-Agent Pose Optimisation
TVCG, November 2018
Torr Vision Group Link, Preprint Link, Supplementary Material Link, arXiv Preprint Link
Stuart Golodetz*, Tommaso Cavallari*, Nicholas A. Lord*, Victor A. Prisacariu, David W. Murray and Philip H. S. Torr
Reconstructing dense, volumetric models of real-world 3D scenes is important for many tasks, but capturing large scenes can take significant time, and the risk of transient changes to the scene goes up as the capture time increases. These are good reasons to want instead to capture several smaller sub-scenes that can be joined to make the whole scene. Achieving this has traditionally been difficult: joining sub-scenes that may never have been viewed from the same angle requires a high-quality relocaliser that can cope with novel poses, and tracking drift in each sub-scene can prevent them from being joined to make a consistent overall scene. Recent advances in mobile hardware, however, have significantly improved our ability to capture medium-sized sub-scenes with little to no tracking drift. Moreover, high-quality regression forest-based relocalisers have recently been made more practical by the introduction of a method to allow them to be trained and used online. In this paper, we leverage these advances to present what to our knowledge is the first system to allow multiple users to collaborate interactively to reconstruct dense, voxel-based models of whole buildings. Using our system, an entire house or lab can be captured and reconstructed in under half an hour using only consumer-grade hardware.

Probabilistic Object Reconstruction with Online Global Model Correction
3DV (spotlight), October 2017
Jack Hunt, Victor A. Prisacariu, Stuart Golodetz, Tommaso Cavallari, Nicholas A. Lord and Philip H. S. Torr
In recent years, major advances have been made in 3D scene reconstruction, with a number of approaches now able to yield dense, globally-consistent models at scale. However, much less progress has been made for objects, which can exhibit far fewer unambiguous geometric/texture cues than a full scene, and thus are much harder to track against. In this work, we present a novel probabilistic object reconstruction framework that simultaneously allows for online, implicit deformation of the object’s surface to reduce tracking drift and handle loop closure events. Coupled with our probabilistic formulation is the use of a multiple sub-segment representation of the object, used to enforce global consistency, with segmentation of the object built in to the formulation. Finally, we employ a CRF framework to refine the overall segmentation, defined by a probability field over the object. We present compelling improvements over the current state-of-the-art reconstruction work and demonstrate robustness and consistency w.r.t. established dense SLAM frameworks.
On-the-Fly Adaptation of Regression Forests for Online Camera Relocalisation
CVPR (oral), July 2017
Torr Vision Group Link, Publication Link, Preprint Link
Tommaso Cavallari, Stuart Golodetz*, Nicholas A. Lord*, Julien Valentin, Luigi Di Stefano and Philip H. S. Torr
Camera relocalisation is a key problem in computer vision, with applications as diverse as simultaneous localisation and mapping, virtual/augmented reality and navigation. Common techniques either match the current image against keyframes with known poses coming from a tracker, or establish 2D-to-3D correspondences between keypoints in the current image and points in the scene in order to estimate the camera pose. Recently, regression forests have become a popular alternative to establish such correspondences. They achieve accurate results, but must be trained offline on the target scene, preventing relocalisation in new environments. In this paper, we show how to circumvent this limitation by adapting a pre-trained forest to a new scene on the fly. Our adapted forests achieve relocalisation performance that is on par with that of offline forests, and our approach runs in under 150ms, making it desirable for real-time systems that require online relocalisation.
Joint Object-Material Category Segmentation from Audio-Visual Cues
BMVC, September 2015 (Link)
Anurag Arnab, Michael Sapienza, Stuart Golodetz, Julien Valentin, Ondrej Miksik, Shahram Izadi and Philip H. S. Torr
It is not always possible to recognise objects and infer material properties for a scene from visual cues alone, since objects can look visually similar whilst being made of very different materials. In this paper, we therefore present an approach that augments the available dense visual cues with sparse auditory cues in order to estimate dense object and material labels. Since estimates of object class and material properties are mutually-informative, we optimise our multi-output labelling jointly using a random-field framework. We evaluate our system on a new dataset with paired visual and auditory data that we make publicly available. We demonstrate that this joint estimation of object and material labels significantly outperforms the estimation of either category in isolation.
SemanticPaint: Interactive Segmentation and Learning of 3D Worlds
SIGGRAPH Emerging Technologies, August 2015
Torr Vision Group Link, Abstract Link, Conference Link, Technical Report Link
Stuart Golodetz*, Michael Sapienza*, Julien P. C. Valentin, Vibhav Vineet, Ming-Ming Cheng, Victor A. Prisacariu, Olaf Kahler, Carl Yuheng Ren, Anurag Arnab, Stephen L. Hicks, David W. Murray, Shahram Izadi and Philip H. S. Torr
We present a real-time, interactive system for the geometric reconstruction, object-class segmentation and learning of 3D scenes. Using our system, a user can walk into a room wearing a depth camera and a virtual reality headset, and both densely reconstruct the 3D scene and interactively segment the environment into object classes such as ‘chair’, ‘floor’ and ‘table’. The user interacts physically with the real-world scene, touching objects and using voice commands to assign them appropriate labels. These user-generated labels are leveraged by an online random forest-based machine learning algorithm, which is used to predict labels for previously unseen parts of the scene. The predicted labels, together with those provided directly by the user, are incorporated into a dense 3D conditional random field model, over which we perform mean-field inference to filter out label inconsistencies. The entire pipeline runs in real time, and the user stays ‘in the loop’ throughout the process, receiving immediate feedback about the progress of the labelling and interacting with the scene as necessary to refine the predicted segmentation.
Incremental Dense Semantic Stereo Fusion for Large-Scale Semantic Scene Reconstruction
ICRA, May 2015 (Link)
Robot Vision Best Paper Award – Finalist
Vibhav Vineet*, Ondrej Miksik*, Morten Lidegaard, Matthias Niessner, Stuart Golodetz, Victor A Prisacariu, Olaf Kahler, David Murray, Shahram Izadi, Patrick Pérez and Philip H. S. Torr
Our abilities in scene understanding, which allow us to perceive the 3D structure of our surroundings and intuitively recognise the objects we see, are things that we largely take for granted, but for robots, the task of understanding large scenes quickly remains extremely challenging. Recently, scene understanding approaches based on 3D reconstruction and semantic segmentation have become popular, but existing methods either do not scale, fail outdoors, provide only sparse reconstructions or are rather slow. In this paper, we build on a recent hash-based technique for large-scale fusion and an efficient mean-field inference algorithm for densely-connected CRFs to present what to our knowledge is the first system that can perform dense, large-scale, outdoor semantic reconstruction of a scene in (near) real time. We also present a ‘semantic fusion’ approach that allows us to handle dynamic objects more effectively than previous approaches. We demonstrate the effectiveness of our approach on the KITTI dataset, and provide qualitative and quantitative results showing high-quality dense reconstruction and labelling of a number of scenes.
The Semantic Paintbrush: Interactive 3D Mapping and Recognition in Large Outdoor Spaces
CHI, April 2015 (Link)
Ondrej Miksik*, Vibhav Vineet*, Morten Lidegaard, Matthias Niessner, Ramprasaath Selvaraju, Stuart Golodetz, Shahram Izadi, Stephen L Hicks, Patrick Pérez and Philip H. S. Torr
We present an augmented reality system for large-scale 3D reconstruction and recognition in outdoor scenes. Unlike prior work, which tries to reconstruct scenes using active depth cameras, we use a purely passive stereo setup, allowing for outdoor use and extended sensing range. Our system not only produces a map of the 3D environment in real-time, it also allows the user to draw (or ‘paint’) with a laser pointer directly onto the reconstruction to segment the model into objects. Given these examples, our system then learns to segment other parts of the 3D map during online acquisition. Unlike typical object recognition systems, ours therefore very much places the user ‘in the loop’ to segment particular objects of interest, rather than learning from predefined databases. The laser pointer additionally helps to interactively ‘clean up’ the stereo reconstruction and final 3D map. Using our system, within minutes, a user can capture a full 3D map, segment it into objects of interest, and refine parts of the model during capture. We provide full technical details of our system to aid replication, as well as quantitative evaluation of system components. We demonstrate the possibility of using our system for helping the visually impaired navigate through spaces.

2D Shape Detection

Straight to Shapes: Real-time Detection of Encoded Shapes
CVPR, July 2017
Torr Vision Group Link, Publication Link, Preprint Link
Saumya Jetley*, Michael Sapienza*, Stuart Golodetz and Philip H. S. Torr
Current object detection approaches predict bounding boxes, but these provide little instance-specific information beyond location, scale and aspect ratio. In this work, we propose to directly regress to objects’ shapes in addition to their bounding boxes and categories. It is crucial to find an appropriate shape representation that is compact and decodable, and in which objects can be compared for higher-order concepts such as view similarity, pose variation and occlusion. To achieve this, we use a denoising convolutional auto-encoder to establish an embedding space, and place the decoder after a fast end-to-end network trained to regress directly to the encoded shape vectors. This yields what to the best of our knowledge is the first real-time shape prediction network, running at 35 FPS on a high-end desktop. With higher-order shape reasoning well-integrated into the network pipeline, the network shows the useful practical quality of generalising to unseen categories that are similar to the ones in the training set, something that most existing approaches fail to handle.
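For intuition about what a shape embedding is, here is the simplest possible one: downsample a binary mask to a fixed low-dimensional grid and use the flattened grid as the shape vector, decoding by upsampling. The learned denoising auto-encoder used in the paper replaces this with a far more structured embedding space; this sketch only illustrates the encode/decode round trip, and its sizes are arbitrary:

```python
import numpy as np

def encode_shape(mask, size=4):
    """Embed a square binary mask as a flattened size x size occupancy grid,
    by averaging over non-overlapping blocks."""
    h, w = mask.shape
    code = mask.reshape(size, h // size, size, w // size).mean(axis=(1, 3))
    return code.flatten()

def decode_shape(code, out_size=16, size=4):
    """Decode a shape vector back to a binary mask by nearest-neighbour
    upsampling and thresholding."""
    grid = code.reshape(size, size)
    scale = out_size // size
    return np.kron(grid, np.ones((scale, scale))) > 0.5
```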

Segmentation Hierarchy Editing

Simplifying TugGraph using Zipping Algorithms
Submitted August 2018
Stuart Golodetz, Anurag Arnab, Irina Voiculescu and Stephen Cameron
Graphs are an invaluable and widely-used modelling tool in numerous problem domains because of their ability to represent both the entities relevant to a problem and the relationships between them. For many graph-based problems it can be helpful to visualise the corresponding graphs to highlight important information, but as the size of the problem increases the graphs can become too large to visualise in their entirety. For this reason, hierarchical graph visualisation, in which a graph’s nodes are recursively clustered to allow the graph to be understood at different levels of abstraction, has become popular. However, this sometimes results in the hiding of information that may be important to the user. For example, in networking, it can be helpful to see which nodes are connected to a given node, so as to better understand the graph’s topology. This is the problem addressed by the popular TugGraph algorithm, which works by ‘separating out’ the neighbours of a given node from their ancestors in the hierarchy to allow them to be visualised independently. The original implementation of TugGraph, whilst easy to understand, was comparatively slow because it unnecessarily copied parts of the hierarchy; as a result, it was later optimised to avoid the copying, but at a significant cost in clarity. Optimising TugGraph without sacrificing clarity is challenging because of the need to ensure that every node in the resulting hierarchy remains connected, a common problem shared by many graph hierarchy editing algorithms. Recently, however, this problem has been solved by the introduction of ‘zipping’ algorithms: multi-level split and merge algorithms for hierarchies that preserve the connectedness of the hierarchy nodes and can serve as building blocks for higher-level editing algorithms.
In this paper, we show how the existing multi-node unzipping algorithm can be generalised to implement a much simpler version of TugGraph that is more modular, easier to understand, easier to implement, and faster. We formally prove the equivalence of our method, which we call SimpleTug, to the original approaches, and further demonstrate its effectiveness experimentally using our millipede hierarchical image segmentation system. We also discuss the interesting theoretical connections between TugGraph and zipping, and suggest ideas for further work.
Simpler Editing of Graph-Based Segmentation Hierarchies using Zipping Algorithms
Pattern Recognition, October 2017
Publication Link, Preprint Link, Supplementary Material Link
Stuart Golodetz, Irina Voiculescu and Stephen Cameron
Graph-based image segmentation is popular, because graphs can naturally represent image parts and the relationships between them. Whilst many single-scale approaches exist, significant interest has been shown in segmentation hierarchies, which represent image objects at different scales. However, segmenting arbitrary images automatically remains elusive: segmentation is under-specified, with different users expecting different outcomes. Hierarchical segmentation compounds this, since it is unclear where in the hierarchy objects should appear. Users can easily edit flat segmentations to influence the outcome, but editing hierarchical segmentations is harder: indeed, many existing interactive editing techniques make only small, local hierarchy changes. In this paper, we address this by introducing ‘zipping’ operations for segmentation hierarchies to facilitate user interaction. We use these operations to implement algorithms for non-sibling node merging and parent switching, and perform experiments on both 2D and 3D images to show that these latter algorithms can significantly reduce the interaction burden on the user.
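To give a flavour of one of the editing operations mentioned above, here is a toy sketch of parent switching on a hierarchy represented as a child-to-parent map. It captures only the flat-pointer bookkeeping (including the cycle check); the zipping algorithms in the paper additionally preserve the connectedness of the regions at every level, which this toy does not model:

```python
def is_descendant(parent, node, candidate):
    """True if candidate lies in the subtree rooted at node (walking up the
    child -> parent map from candidate until the root is reached)."""
    while candidate is not None:
        if candidate == node:
            return True
        candidate = parent.get(candidate)
    return False

def switch_parent(parent, node, new_parent):
    """Return a new hierarchy in which node is reparented under new_parent,
    refusing moves that would create a cycle."""
    if is_descendant(parent, node, new_parent):
        raise ValueError("reparenting would create a cycle")
    parent = dict(parent)  # leave the original hierarchy untouched
    parent[node] = new_parent
    return parent
```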

Visual Object Tracking

Staple: Complementary Learners for Real-Time Tracking
CVPR, June 2016
Publication Link, Preprint Link, arXiv Link, Project Page Link
Luca Bertinetto, Jack Valmadre, Stuart Golodetz, Ondrej Miksik and Philip H. S. Torr
Correlation Filter-based trackers have recently achieved excellent performance, showing great robustness to challenging situations such as motion blur and illumination changes. However, since the model that they learn depends strongly on the spatial layout of the tracked object, they are notoriously sensitive to deformation. Models based on colour statistics have complementary traits: they cope well with variation in shape, but suffer when illumination is not consistent throughout a sequence. Moreover, colour distributions alone can be insufficiently discriminative. In this paper, we show that a simple tracker combining complementary cues in a ridge regression framework can operate faster than 90 FPS and outperform not only all entries in the popular VOT14 competition, but also recent and far more sophisticated trackers according to multiple benchmarks.
Struck: Structured Output Tracking with Kernels
TPAMI, October 2016 (accepted for publication December 2015)
Publication Link, Author’s Version Link, Supplementary Material Link, YouTube Channel
Sam Hare*, Stuart Golodetz*, Amir Saffari*, Vibhav Vineet, Ming-Ming Cheng, Stephen L. Hicks and Philip H. S. Torr
Adaptive tracking-by-detection methods are widely used in computer vision for tracking arbitrary objects. Current approaches treat the tracking problem as a classification task and use online learning techniques to update the object model. However, for these updates to happen one needs to convert the estimated object position into a set of labelled training examples, and it is not clear how best to perform this intermediate step. Furthermore, the objective for the classifier (label prediction) is not explicitly coupled to the objective for the tracker (estimation of object position). In this paper, we present a framework for adaptive visual object tracking based on structured output prediction. By explicitly allowing the output space to express the needs of the tracker, we avoid the need for an intermediate classification step. Our method uses a kernelised structured output support vector machine (SVM), which is learned online to provide adaptive tracking. To allow our tracker to run at high frame rates, we (a) introduce a budgeting mechanism that prevents the unbounded growth in the number of support vectors that would otherwise occur during tracking, and (b) show how to implement tracking on the GPU. Experimentally, we show that our algorithm is able to outperform state-of-the-art trackers on various benchmark videos. Additionally, we show that we can easily incorporate additional features and kernels into our framework, which results in increased tracking performance.
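The budgeting idea in (a) can be sketched as follows for a linear kernel: when the support vector set exceeds the budget, discard the vector whose removal perturbs the weight vector w = Σᵢ βᵢ xᵢ the least. This is a simplification of the actual mechanism, which works with arbitrary kernels and reassigns coefficients rather than simply dropping them:

```python
import numpy as np

def enforce_budget(xs, betas, budget):
    """xs: list of support-vector feature arrays; betas: their coefficients.
    Repeatedly remove the support vector with the smallest contribution to w
    (||beta_i * x_i|| under a linear kernel) until the budget is respected."""
    while len(xs) > budget:
        costs = [abs(b) * np.linalg.norm(x) for x, b in zip(xs, betas)]
        i = int(np.argmin(costs))
        xs = xs[:i] + xs[i + 1:]
        betas = betas[:i] + betas[i + 1:]
    return xs, betas
```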
