Accepted Tutorials for SIGMOD
An (Updated) Overview of Data Provenance: Concepts, Challenges and Opportunities
Chrysanthi Kosyfaki (Hong Kong University of Science and Technology), Paul Groth (University of Amsterdam)
Abstract: Data provenance (or data lineage) tracks where data comes from and how it is transformed throughout its lifecycle. Introduced in the early 2000s to explain and verify data changes, it has since expanded from relational databases to areas such as big data, machine learning pipelines, streaming systems, and explainable AI. Driven by regulatory and industrial needs, the field has grown significantly. This tutorial surveys recent advances in data provenance, focusing on provenance models, querying, databases, and the role of provenance in explainability.
Vector Search for the Future: From Memory-Resident, Static Heterogeneous Storage, to Cloud-Native Architectures
Yitong Song (Hong Kong Baptist University), Xuanhe Zhou (Shanghai Jiao Tong University), Christian S. Jensen (Aalborg University), Jianliang Xu (Hong Kong Baptist University)
Abstract: Vector search (VS) is a fundamental component of modern AI applications, enabling multimodal retrieval over images, videos, and text. As vector data rapidly scales, VS faces increasing challenges in balancing latency, scalability, and cost, largely driven by the evolution of storage architectures.
This tutorial presents a structured overview of VS techniques from a storage-centric perspective. We first cover classical in-memory approaches, including IVF-, hash-, quantization-, and graph-based methods, and discuss their strengths and limitations. We then examine heterogeneous memory–SSD designs that enable billion-scale search through I/O-efficient indexing, layout, and query strategies. Finally, we discuss emerging cloud-native, multi-tiered architectures for trillion-scale vector search and highlight key challenges and future research directions.
A Tutorial on Relational Language Design
Wolfgang Gatterbauer (Northeastern University)
Abstract: Relational query languages have been studied and used for more than 50 years, with SQL dominant in practice. Yet that dominance is now being questioned from several directions at once: higher-level abstractions such as entity-relationship and functional data models, application languages that integrate querying with application logic, algebraic intermediate representations that blur the boundary between logical and physical specification, and large language models (LLMs) that both generate and explain queries. These developments highlight that relational languages differ not only in expressive power, but also in what relational structure they make explicit and in how effectively they support humans and machines in writing, understanding, revising, and reasoning about queries.
This tutorial uses this moment to give the data management community a framework for comparing and redesigning relational languages in an era of AI-assisted query generation, explanation, verification, and revision. Rather than beginning from formal definitions, we start from a fixed set of representative SQL queries and compare how classical alternatives and a range of recent languages express the same relational intent. From these examples, we derive a common vocabulary of recurring design dimensions and trade-offs in relational language design. This vocabulary distinguishes three parent aspects (query intent, relational intent, and notation), together with several subaspects (such as relational pattern, semantic conventions, and modality). Participants will leave with clearer mental models for comparing existing and proposed languages, evaluating their usability for humans and AI systems, and articulating open problems in the design and evaluation of relational query languages.
The tutorial webpage is at: https://northeastern-datalab.github.io/relational-language-tutorial.
Data Agents: Levels, State of the Art, and Open Problems
Yuyu Luo (Hong Kong University of Science and Technology), Guoliang Li (Tsinghua University), Ju Fan (Renmin University of China), Nan Tang (Hong Kong University of Science and Technology)
Abstract: Data agents are emerging as a new paradigm for automating data management, preparation, and analysis through large language models and foundation agents. However, the term “data agent” is often used broadly, ranging from simple query assistants to ambitious autonomous data scientists, making it difficult to reason about their capabilities, limitations, and responsibilities. This tutorial introduces a level-based taxonomy of data agents, from Level 0 with no autonomy to Level 5 with full autonomy, and uses it to organize the state of the art across the data lifecycle. We will review representative systems in data management, data preparation, and data analysis, discuss the transition from assistant-style systems to workflow-orchestrating agents, and highlight open challenges toward proactive and generative data agents. The tutorial aims to provide SIGMOD attendees with a practical map of today’s systems and a research roadmap for future data-agent development.
Relational Database Engines on Quantum Platforms: Concepts, Algorithms, and Implementations
Manish Kesarwani (IBM Research), Jayant R. Haritsa (Indian Institute of Science)
Abstract: Recent advances in quantum computing have led to early-stage platforms with over 1,000 qubits, alongside roadmaps targeting fault-tolerant, industry-scale systems within this decade. To harness the immense potential of these emerging systems, it is essential to explore the feasibility of hosting relational database engines on quantum platforms. Encouragingly, several early studies have already investigated the applicability of quantum computing to core DBMS components such as multi-query optimization, join-order optimization, and index-configuration selection, offering initial insights into key challenges and practical implementation considerations.
In this tutorial, we provide an in-depth exploration of how quantum computing can be harnessed to advance database technologies. We begin by introducing the foundational principles of quantum computation and then outline the distinct technical challenges that emerge when operating on quantum hardware. The discussion proceeds to quantum-driven optimization methods, applicable to various DBMS components such as query optimization and physical schema design. Subsequently, we examine quantum-based query execution for core relational operators, emphasizing the architectural strategies proposed to work around the probabilistic computational framework. We conclude by outlining open research problems that must be solved to realize practical quantum databases.
In the concluding segment, participants will engage in a comprehensive demonstration, followed by a practical session where they will design quantum database algorithms and deploy them on both quantum simulators and real quantum hardware.