
MusteR-FM

Motivation

Minimally invasive procedures in surgery and therapeutic endoscopy generate large amounts of high-resolution video data. These videos show not only individual anatomical structures or instruments; they also document the course of an intervention over time. For AI methods, this poses a particular challenge: surgical workflows are long, variable and shaped by clinical decisions. Transitions between phases can be subtle, while blood, smoke, occlusions or motion blur may further complicate reliable analysis.

Current AI approaches in surgical image analysis are often designed for narrowly defined individual tasks. However, research, clinical development and future assistance systems require models that can be used more broadly and that are able to integrate different data sources, modalities and types of procedures. This is where MusteR-FM comes in: the project develops a multimodal AI foundation model for surgical and endoscopic video data. It aims to learn a shared representation of clinical workflows and thereby provide a robust basis for further applications.

As a connecting project within the Bavarian AI Foundation Model Initiative, MusteR-FM contributes a clinically demanding and socially relevant application area. The project combines medical image analysis, multimodal learning and temporal video understanding with concrete questions from the health domain and, prospectively, from robotics and perception. In this way, MusteR-FM helps not only to advance AI foundation models methodologically, but also to make them testable and usable in a sensitive, practice-oriented medical context.

Goals and approach

The objective of MusteR-FM is to develop a reusable foundation model for minimally invasive surgical and endoscopic workflows. The model is intended to jointly capture two different clinical imaging domains: laparoscopic videos with rigid optics and a wide field of view, and flexible gastrointestinal endoscopy in narrow luminal structures. This creates a shared model basis that can generalize beyond individual datasets and types of interventions.

The project builds a multimodal spatio-temporal model architecture. Existing clinical video data and annotations are used to link visual information with medical-procedural descriptions, phase information and indicators of visibility quality. As a result, the model should not only capture image content, but also model the temporal course of an intervention and infer cues about upcoming workflow transitions.

Evaluation is carried out using clinically validated laparoscopic data as well as endoscopic submucosal dissection (ESD) videos from the participating clinical collaborations. In addition, publicly available datasets are included to investigate the robustness and transferability of the learned representations. The results are to be documented transparently and made usable for non-commercial research. Planned outputs include model and data cards, reproducible evaluation protocols and the open provision of suitable artefacts.

The resulting foundation model thus provides a basis for various downstream tasks such as workflow recognition, phase classification, prediction of next process steps, object detection, segmentation, image retrieval, quality assurance and training support. At the same time, MusteR-FM strengthens the connection between university-based AI research, clinical application and medical technology transfer in Bavaria.

 

Funding for

OTH Regensburg

with clinical cooperation partners

Funding by

Bayerisches Staatsministerium für Wissenschaft und Kunst (Bavarian State Ministry of Science and the Arts) as part of the Bavarian AI Foundation Model Initiative.

Project page: https://www.ai-bay.eu/#

 

Period and volume

Overall project (phases 1 and 2): May 2026 to April 2029

Project period of MusteR-FM (phase 1): August 2026 to October 2027

Volume (phase 1): €105,000