Tom Drummond

I am the Melbourne Connect Chair of Digital Innovation for Society
in the School of Computing and Information Systems at the
University of Melbourne

email: tom.drummond@unimelb.edu.au


Research Topics:

Improving Denoising Diffusion Models via Simultaneous Estimation of Image and Noise (with Zhenkai Zhang and Krista Ehinger)

This paper shows how to improve denoising diffusion models by having the network predict the image and the noise jointly, rather than predicting just one and recovering the other algebraically. The dual prediction provides a richer training signal and a more stable sampling trajectory. Reformulating the noise schedule in terms of the arc on the unit circle between the pure-image and pure-noise states removes singularities and enables the use of higher-order ODE solvers such as RK4.
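The arc parameterisation is simple to sketch: the clean image sits at angle 0, pure noise at angle π/2, and intermediate states lie on the arc between them. This is a minimal illustration of the idea, not the paper's exact formulation, and the function name `mix` is mine:

```python
import numpy as np

def mix(x0, eps, theta):
    """Point on the unit-circle arc between the clean image (theta = 0)
    and pure Gaussian noise (theta = pi/2)."""
    return np.cos(theta) * x0 + np.sin(theta) * eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)    # stand-in for a clean image
eps = rng.standard_normal(8)   # the noise sample
x_t = mix(x0, eps, np.pi / 4)  # halfway along the arc
```

Because cos²θ + sin²θ = 1, unit-variance inputs give a unit-variance mixture at every θ, and a network that predicts both the image and the noise can re-synthesise the state at any angle without the algebraic inversion a single-output model needs.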

Knowledge Combination to Learn Rotated Detection without Rotated Annotation (with Tianyu Zhu, Bryce Ferenczi, Pulak Purkait, Hamid Rezatofighi and Anton van den Hengel)

This paper shows how to train a rotated-object detector without ever having rotated bounding-box annotations. The key insight is that axis-aligned annotations from one dataset can be combined with per-pixel segmentation masks from another to bootstrap an oriented detection signal, with the rotation knowledge transferred into the final detector through a combination loss. This removes the need for expensive rotated annotation on every new domain where orientation matters, such as aerial imagery or industrial inspection.

Multimorbidity Content-Based Medical Image Retrieval and Disease Recognition Using Multi-Label Proxy Metric Learning (with Yunyan Xing, Ben Meyer, Mehrtash Harandi and Zongyuan Ge)

This paper shows how to do content-based retrieval on medical images where each image can have many simultaneous diagnoses (multimorbidity), in contrast to standard metric learning methods that assume a single class per sample. A multi-label proxy-based metric learning framework learns a single embedding in which images sharing any subset of labels are pulled together in a structured way, supporting both retrieval of similar cases and direct multi-label disease recognition on the same embedding.
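One simple form a multi-label proxy objective can take is a per-class sigmoid loss on similarity to learned class proxies, pulling an embedding toward the proxies of every active label at once. This is a hedged sketch of the general idea, not the paper's loss; the function name and temperature value are illustrative:

```python
import numpy as np

def multilabel_proxy_loss(z, proxies, labels, temperature=0.1):
    """z: (d,) embedding; proxies: (C, d) one learned proxy per class;
    labels: (C,) multi-hot in {0, 1}. Binary cross-entropy on cosine
    similarity to each proxy: attracts z to all active-label proxies
    and repels it from the rest."""
    z = z / np.linalg.norm(z)
    p = proxies / np.linalg.norm(proxies, axis=1, keepdims=True)
    sim = p @ z / temperature                  # (C,) scaled cosine similarities
    prob = 1.0 / (1.0 + np.exp(-sim))          # per-class sigmoid
    eps = 1e-12
    return -np.mean(labels * np.log(prob + eps)
                    + (1 - labels) * np.log(1 - prob + eps))
```

Treating each class independently is what lets one embedding serve images with any subset of simultaneous labels, rather than forcing a single-class assignment.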

Looking Beyond Two Frames: End-to-End Multi-Object Tracking Using Spatial and Temporal Transformers (with Tianyu Zhu, Markus Hiller, Mahsa Ehsanpour, Rongkai Ma, Ian Reid and Hamid Rezatofighi)

This paper shows how to do multi-object tracking end-to-end with transformers that reason over both space and time, rather than treating tracking as a two-stage detect-then-associate pipeline over pairs of frames. A spatial transformer encodes per-frame object features while a temporal transformer links them across a longer window, enabling the network to recover identities through long occlusions that frame-pair methods fail on. The approach sets a new state of the art on MOT17 and MOT20.

Flynet: Max it, Excite it, Quantize it (with Luis Guerra)

This paper shows how to make MobileNet-style networks much more efficient by combining multi-head depthwise convolution with maxout, mean-variance-aware channel-wise attention, and dense residual connections. The resulting Flynet achieves a twofold reduction in parameter count compared to MobileNetV3 while maintaining accuracy on ImageNet.
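The maxout component, at least, is easy to sketch: with several depthwise heads computed in parallel, maxout keeps the elementwise maximum across them. The head-major channel layout and the function name here are my assumptions, not the paper's code:

```python
import numpy as np

def maxout_heads(features, num_heads):
    """features: (num_heads * C, H, W), channels grouped head-major.
    Returns (C, H, W): the elementwise max across heads, which acts as
    a learned piecewise-linear activation over the parallel branches."""
    hc, H, W = features.shape
    C = hc // num_heads
    return features.reshape(num_heads, C, H, W).max(axis=0)
```

Replacing a pointwise nonlinearity with a max over branches is what lets the network spend its depthwise capacity on several candidate filters per output channel while keeping the output tensor small.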

A Differentiable Distance Approximation for Fairer Image Classification (with Nicholas Rosa and Mehrtash Harandi)

This paper shows how to train image classifiers that are more equitable across demographic or sensitive subgroups, by adding a differentiable approximation of the distance between per-group accuracy distributions to the training objective. Because the approximation is smooth and end-to-end differentiable, standard SGD can directly push the network toward equalised performance rather than just high average accuracy, without the discrete/bi-level optimisation that earlier fairness methods needed.
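A common way to make such a distance differentiable is to replace the hard correct/incorrect indicator with a sigmoid of the classification margin. The sketch below penalises pairwise gaps between smoothed per-group accuracies; it illustrates the principle rather than reproducing the paper's specific approximation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def smooth_group_accuracy(margins):
    """Differentiable surrogate for accuracy: the sigmoid of the
    classification margin (positive margin = correct) replaces the
    non-differentiable 0/1 indicator."""
    return sigmoid(margins).mean()

def fairness_penalty(margins, groups):
    """Sum of squared gaps between the smoothed accuracies of each
    sensitive group -- a simple differentiable stand-in for a
    distance between per-group accuracy distributions."""
    accs = np.array([smooth_group_accuracy(margins[groups == g])
                     for g in np.unique(groups)])
    return np.sum((accs[:, None] - accs[None, :]) ** 2) / 2
```

Because every operation is smooth, this term can simply be added to the cross-entropy loss and minimised by SGD, which is the property the paper exploits to avoid discrete or bi-level optimisation.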

Deep Laparoscopic Stereo Matching with Transformers (with Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Zhiyong Wang and Zongyuan Ge)

This paper shows how to do stereo matching in the particularly difficult domain of laparoscopic surgery, where the classical stereo assumptions break down: tissue is textureless or specular, illumination changes rapidly with camera motion, and the scene contains thin instruments with sharp depth discontinuities. A transformer-based matching module is used to aggregate context across the whole image, combined with a new laparoscopic stereo dataset, producing noticeably better depth estimates than CNN-based stereo networks designed for driving scenes.

Rethinking Generalization in Few-Shot Classification (with Markus Hiller, Rongkai Ma and Mehrtash Harandi)

This paper shows that the standard few-shot classification pipeline systematically underfits the rich structure of images. By framing few-shot classification in terms of patches rather than whole images and rethinking the sampling strategy over patches, the method obtains better-generalising features that close the gap to upper-bound oracle performance on standard miniImageNet and tieredImageNet benchmarks.

On Enforcing Better Conditioned Meta-Learning for Rapid Few-Shot Adaptation (with Markus Hiller and Mehrtash Harandi)

This paper shows how to make gradient-based meta-learners adapt faster by actively controlling the conditioning of their inner-loop optimisation problem. By recasting meta-learning as a non-linear least-squares problem, the method can place a loss on the condition number (local curvature) of the adaptation landscape, enforcing a well-conditioned parameter space at meta-train time. The result is substantially faster adaptation in the first few inner-loop steps, opening the door to dynamically choosing the number of steps at inference based on task difficulty.
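The quantity being controlled has a direct definition: the condition number of a local linearisation is the ratio of its extreme singular values, and it equals 1 exactly when gradient steps are uniformly effective in all directions. A minimal sketch of such a penalty, using a plain SVD rather than whatever estimator the paper employs:

```python
import numpy as np

def condition_penalty(J):
    """Penalty on the condition number of a (local) Jacobian J:
    ratio of largest to smallest singular value, shifted so the
    penalty is zero iff J is perfectly conditioned."""
    s = np.linalg.svd(J, compute_uv=False)  # singular values, descending
    return s[0] / s[-1] - 1.0
```

Minimising this at meta-train time shapes the adaptation landscape so that the first few inner-loop steps make rapid progress regardless of the direction the task-specific gradient points.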

Learning Instance and Task-Aware Dynamic Kernels for Few-Shot Learning (with Rongkai Ma, Pengfei Fang, Gil Avraham, Yan Zuo, Tianyu Zhu and Mehrtash Harandi)

This paper shows how to adapt a convolutional network to a new few-shot task at inference time by making the convolution kernels themselves functions of the task. Dynamic kernels are generated conditioned on both the entire task (episode-level context) and each individual sample, with further per-channel and per-spatial-location adaptation, and frequency-domain information is injected to enrich the adaptation signal. This gives the network a principled way to specialise to a new task without full retraining.

Implicit Motion Handling for Video Camouflaged Object Detection (with Xuelian Cheng, Huan Xiong, Deng-Ping Fan, Yiran Zhong, Mehrtash Harandi and Zongyuan Ge)

This paper shows how to detect camouflaged objects in video by exploiting motion as an implicit cue. When an object is visually indistinguishable from its background in any single frame, its motion relative to the background often reveals it. Rather than using explicit optical flow as an input, the method implicitly encodes motion information through a short-term and long-term feature interaction module, producing the first dedicated video camouflaged object detection system and a new large-scale benchmark (MoCA-Mask).

Adaptive Poincaré Point to Set Distance for Few-Shot Classification (with Rongkai Ma, Pengfei Fang and Mehrtash Harandi)

This paper shows how to do few-shot classification by embedding both query points and class support sets into hyperbolic (Poincaré ball) space. The geometry is a natural fit because hyperbolic space expands exponentially with radius, giving more room for fine-grained distinctions between classes than Euclidean space. The proposed adaptive point-to-set distance learns a per-task notion of how far a query is from each support set, yielding consistent gains on standard few-shot benchmarks.

[AAAI 2022 paper]
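The geodesic distance in the Poincaré ball that this method builds on has a closed form, and the simplest (non-adaptive) point-to-set reduction is a minimum over the support embeddings. The adaptive, per-task version is the paper's contribution and is not reproduced here:

```python
import numpy as np

def poincare_distance(x, y):
    """Geodesic distance between points inside the unit ball
    (requires ||x|| < 1 and ||y|| < 1)."""
    sq = np.sum((x - y) ** 2)
    denom = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2))
    return np.arccosh(1 + 2 * sq / denom)

def point_to_set(query, support):
    """Baseline point-to-set distance: minimum geodesic distance from
    the query to any support embedding (the paper learns an adaptive,
    task-dependent alternative to this fixed reduction)."""
    return min(poincare_distance(query, s) for s in support)
```

Distances blow up near the boundary (from the origin, d(0, r) = 2 artanh r), which is what gives hyperbolic space its exponentially growing room for fine-grained class structure.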

Improved Training of Generative Adversarial Networks Using Decision Forests (with Yan Zuo and Gil Avraham)

This paper shows how to use decision forests as the discriminator in a Generative Adversarial Network. The forest's piecewise structure provides a stronger, more stable gradient signal to the generator than a standard CNN discriminator, improving sample diversity and training stability across CIFAR-10, STL-10 and CelebA without any change to the generator architecture.

[WACV 2021 paper]

Hierarchical Neural Architecture Search for Deep Stereo Matching (with Xuelian Cheng, Yiran Zhong, Mehrtash Harandi, Yuchao Dai, Xiaojun Chang, Hongdong Li and Zongyuan Ge)

This paper shows how to apply neural architecture search to the structured problem of stereo matching, where the network has to both extract features and compare them across two views. Rather than searching at a single level, the method searches simultaneously over cell structure (the local feature operations) and network structure (how the cells are connected for matching and aggregation), producing architectures that outperform hand-designed stereo networks like PSMNet on KITTI and Middlebury while using far fewer parameters.

Reducing the Sim-to-Real Gap for Event Cameras (with Timo Stoffregen, Cedric Scheerlinck, Davide Scaramuzza, Nick Barnes, Lindsay Kleeman and Robert Mahony)

This paper shows how to close the sim-to-real gap for event cameras: networks trained on event streams from simulators systematically underperform on real data because the simulator's contrast threshold and scene dynamics don't match what real sensors produce. By carefully matching simulator statistics to the target use case and releasing a new High Quality Frames (HQF) dataset of well-exposed ground-truth frames, the paper delivers a 20–40% boost in video reconstruction quality and up to 15% on optical flow with no change to the network architecture. [ECCV 2020 paper]
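The contrast threshold at the heart of this mismatch is easy to state: an idealised event pixel fires one event each time its log intensity moves by the threshold C, positive for brightening and negative for darkening. A toy per-pixel version of that generation model (my simplification; real simulators also integrate over time and model sensor noise):

```python
import numpy as np

def event_counts(log_I_prev, log_I_next, C):
    """Signed number of events each pixel fires between two frames
    under the idealised model: one event per contrast-threshold (C)
    crossing of log intensity."""
    return np.trunc((log_I_next - log_I_prev) / C).astype(int)
```

If the simulator's C does not match the real sensor's, the event rate and spatial statistics of the training data drift away from the test data, which is the gap the paper measures and closes.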