MuST: Multi-Scale Transformers for Surgical Phase Recognition

Universidad de los Andes, BCV-Uniandes

Multi-Scale Transformers for Surgical Phase Recognition in videos using short-, mid-, and long-term context.

Abstract

Phase recognition in surgical videos is crucial for enhancing computer-aided surgical systems as it enables automated understanding of sequential procedural stages. Existing methods often rely on fixed temporal windows for video analysis to identify dynamic surgical phases. Thus, they struggle to simultaneously capture the short-, mid-, and long-term information necessary to fully understand complex surgical procedures. To address these limitations, we propose Multi-Scale Transformers for Surgical Phase Recognition (MuST), a novel Transformer-based approach that combines a Multi-Term Frame Encoder with a Temporal Consistency Module to capture information across multiple temporal scales of a surgical video. Our Multi-Term Frame Encoder computes interdependencies across a hierarchy of temporal scales by sampling sequences at increasing strides around the frame of interest. Furthermore, we employ a long-term Transformer encoder over the frame embeddings to further enhance long-term reasoning. MuST achieves higher performance than previous state-of-the-art methods on three different public benchmarks.

MuST Architecture

MuST employs a Multi-Term Frame Encoder to generate rich embeddings containing short- and mid-term dependencies for a long-term sequence of F′ frames. The Temporal Consistency Module then performs long-term analysis by modeling relationships among these frame embeddings, enforcing coherence in the predictions.

MuST main architecture.
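
The two-stage pipeline described above can be summarized with the following PyTorch-style sketch. The module names, embedding size, and number of phases are illustrative assumptions for exposition, not the released implementation.

import torch
import torch.nn as nn

class MuSTSketch(nn.Module):
    """Hypothetical two-stage pipeline: per-frame multi-term embeddings
    followed by a long-term Transformer over the F' frame embeddings."""

    def __init__(self, frame_encoder: nn.Module, embed_dim: int = 768,
                 num_phases: int = 7, num_layers: int = 2):
        super().__init__()
        self.frame_encoder = frame_encoder  # Multi-Term Frame Encoder (see below)
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        # Temporal Consistency Module: long-term reasoning over frame embeddings
        self.temporal_consistency = nn.TransformerEncoder(layer, num_layers)
        self.classifier = nn.Linear(embed_dim, num_phases)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (B, F', ...) one multi-scale clip per frame of interest
        embeddings = self.frame_encoder(clips)             # (B, F', embed_dim)
        embeddings = self.temporal_consistency(embeddings) # long-term context
        return self.classifier(embeddings)                 # (B, F', num_phases) phase logits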

The MuST Multi-Term Frame Encoder captures temporal context in surgical videos by constructing a temporal pyramid around a keyframe. This allows it to extract features across short-, mid-, and long-term time scales, providing a rich understanding of the surgical procedure's dynamics. The encoder generates spatio-temporal embeddings for each time scale and combines them with attention mechanisms to model relationships between different phases. This flexibility enables accurate phase recognition, handling both rapid transitions and longer-term changes within the video.

MuST frame encoder architecture.
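
The sketch below illustrates one way such an encoder could be structured: each level of the pyramid is embedded by a shared spatio-temporal backbone, and the resulting scale tokens are fused with attention. The backbone choice, positional encoding, and fusion details are assumptions made for illustration rather than the exact MuST design.

from typing import List

import torch
import torch.nn as nn

class MultiTermFrameEncoderSketch(nn.Module):
    """Embeds each temporal scale with a shared backbone and fuses the
    resulting scale tokens with attention (illustrative assumption)."""

    def __init__(self, backbone: nn.Module, embed_dim: int = 768,
                 num_scales: int = 4):
        super().__init__()
        self.backbone = backbone                  # clip (B, C, T, H, W) -> (B, embed_dim)
        self.scale_pos = nn.Parameter(torch.zeros(num_scales, embed_dim))
        fusion_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                                  batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=1)

    def forward(self, pyramid: List[torch.Tensor]) -> torch.Tensor:
        # pyramid: one clip per temporal scale, all centred on the same keyframe
        tokens = torch.stack([self.backbone(clip) for clip in pyramid], dim=1)
        tokens = tokens + self.scale_pos          # (B, num_scales, embed_dim)
        fused = self.fusion(tokens)               # attention across scales
        return fused.mean(dim=1)                  # single embedding for the keyframe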

Quantitative Results

MuST consistently outperforms previous state-of-the-art methods on three public benchmarks.

MuST benchmark results.

Per-class results are consistent across phases of diverse duration in all benchmarks. This underscores MuST's proficiency in identifying phases of varying length, which we attribute to its multi-temporal reasoning: leveraging information from distinct temporal contexts improves predictions for both short and long phases. The best results are highlighted in bold.

MuST per-class results.

Ablation Experiments

MuST ablation results.

Temporal Multisequence Pyramid

This figure illustrates the temporal multisequence pyramid centered on the keyframe (in red). The frames sampled at each level are highlighted. The top levels of the pyramid capture short-term temporal information, while the lower levels cover a broader temporal context.

MuST temporal pyramid.
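
A small helper can make the sampling concrete: each level draws the same number of frames around the keyframe, but with a stride that grows per level, so lower levels span a wider window. The doubling stride schedule, clip length, and function name below are illustrative assumptions.

from typing import List, Optional

def pyramid_indices(keyframe: int, num_frames: int = 16, num_levels: int = 4,
                    base_stride: int = 1,
                    video_length: Optional[int] = None) -> List[List[int]]:
    """Sample the same number of frames at every level, increasing the stride
    per level so that lower levels cover a broader temporal context."""
    levels = []
    for level in range(num_levels):
        stride = base_stride * (2 ** level)        # 1, 2, 4, 8, ...
        half = (num_frames // 2) * stride
        idx = list(range(keyframe - half, keyframe + half, stride))
        if video_length is not None:               # clamp to valid frame range
            idx = [min(max(i, 0), video_length - 1) for i in idx]
        levels.append(idx)
    return levels

# Top level: 16 consecutive frames around frame 500; bottom level: 16 frames
# spread over roughly 128 frames around the same keyframe.
print(pyramid_indices(500, video_length=2000)[0][:4])   # [492, 493, 494, 495]
print(pyramid_indices(500, video_length=2000)[3][:4])   # [436, 444, 452, 460]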

Attention Heads Visualization

The Cross-Attention Module in MuST enhances temporal understanding by computing relationships between different time scales. It enables the model to align and compare information from different temporal windows, capturing both short-term and long-term dependencies. This mechanism helps the model comprehensively capture phase transitions and interactions across the entire surgical video.

MuST attention heads visualization.
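
As a simple illustration of this idea, the sketch below shows one way such a cross-attention step could be written in PyTorch: keyframe-centred tokens attend to tokens from the remaining temporal scales, and the returned attention weights can be visualized per head as in the figure above. The class name, dimensions, and residual/normalization details are assumptions, not the exact MuST module.

import torch
import torch.nn as nn

class ScaleCrossAttentionSketch(nn.Module):
    """Short-term (keyframe) tokens query tokens from the other temporal
    scales; attention weights are returned for per-head visualization."""

    def __init__(self, embed_dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, query_tokens, context_tokens):
        # query_tokens:   (B, Nq, D) short-term / keyframe tokens
        # context_tokens: (B, Nc, D) tokens from the other temporal scales
        attended, attn_weights = self.attn(query_tokens, context_tokens,
                                           context_tokens)
        return self.norm(query_tokens + attended), attn_weights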

BibTeX

@article{perez2024must,
  author    = {Alejandra Pérez and Santiago Rodríguez and Nicolás Ayobi and Nicolás Aparicio and Eugénie Dessevres and Pablo Arbeláez},
  title     = {MuST: Multi-Scale Transformers for Surgical Phase Recognition},
  journal   = {arXiv},
  year      = {2024},
  url       = {https://arxiv.org/abs/2407.17361}
}