This paper shows how to do multi-object tracking end-to-end with transformers that reason over both space and time, rather than treating tracking as a two-stage detect-then-associate pipeline over pairs of frames. A spatial transformer encodes per-frame object features while a temporal transformer links them across a longer window, letting the network recover identities through long occlusions that frame-pair methods cannot bridge. The approach sets a new state of the art on MOT17 and MOT20.
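The factorized spatial-then-temporal attention pattern can be sketched roughly as below. This is a minimal illustration, not the paper's implementation: it uses single-head attention with no learned projections, and all shapes and variable names (`T` frames, `N` object slots, `d`-dim features) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    """Scaled dot-product self-attention over the second-to-last axis.
    x: (..., n, d) -> (..., n, d). Simplified: queries, keys, and values
    are all the raw features (no learned projections)."""
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)
    return softmax(scores, axis=-1) @ x

# Toy features: T frames, each with N object embeddings of dimension d.
T, N, d = 8, 4, 16
feats = np.random.default_rng(0).normal(size=(T, N, d))

# Spatial pass: objects attend to each other within each frame.
spatial = self_attention(feats)                       # (T, N, d)

# Temporal pass: swap axes so attention runs along time for each slot,
# linking the same object across the whole window.
temporal = self_attention(np.swapaxes(spatial, 0, 1)) # (N, T, d)
out = np.swapaxes(temporal, 0, 1)                     # (T, N, d)
```

Because the temporal attention spans the entire window rather than adjacent frame pairs, an object occluded for several frames can still attend to its earlier appearances.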
