Speakers
Abstract coming soon.
Piera Riccio is a postdoctoral researcher in the Multimedia Analytics Lab of the University of Amsterdam, working on the project "Visual Imaginaries of Gender in Generative AI". In 2025, she earned a PhD from ELLIS Alicante/University of Alicante, with a thesis focusing on the influence of AI technologies on the representation of human bodies in contemporary visual culture. Her research lies at the intersection of computer vision, algorithmic fairness, gender studies, media studies and art.
On the challenges of capturing nuanced context-rich information for computational artwork analysis & the need for humans in the loop.
Selina Khan is a PhD student in the Multimedia Analytics Lab at the University of Amsterdam. Her research focuses on the use of interactive technologies for discovering biases in cultural heritage collections. Her research interests include the challenges of capturing nuanced, context-rich information from the artistic domain in AI systems and exploring how such technologies can deepen our understanding of art while respecting its subjectivity and cultural significance.
Building on the previous presentation, this tutorial extends multimodal representation learning for fine art analysis toward relational multimodal vision-language models. We frame artworks and captions not as isolated image-text pairs, but as entities embedded in structured networks of visual, textual, historical, and contextual relations. We explain strategies for integrating graph-based representations into inductive settings and for combining multimodal models with contextual learning.
A central motivation is the distinction between standard VLM tasks and art-historical analysis. Art history scholarship, like VLMs, relies on textual analysis of visual content. However, its analytical process is shaped by context, interpretative frameworks, archival research, and competing readings rather than by direct image-caption alignment alone. The tutorial, therefore, introduces key principles of art-historical research and interpretation, and shows how these can inform the design of multimodal models that are more sensitive to context, relation, and meaning. Overall, the session aims to connect recent advances in computer vision, vision-language modeling, and graph-based learning with the methodological needs of fine art analysis.
Ludovica Schaerf is a PhD student in Digital Visual Studies between the Max Planck Society (MPG) and the University of Zurich (UZH). Her interests lie in interdisciplinary research at the intersection of the Arts, Artificial Intelligence, and Philosophy. Her research focuses on understanding representations in generative vision models from a technical and media-theoretic perspective.
Understanding fine art is a knowledge-intensive, multimodal task that requires reasoning over visual content, cultural history, artistic style, and symbolic meaning, which goes far beyond what standard image captioning or retrieval systems can provide. This talk presents two complementary frameworks that tackle this challenge through structured knowledge retrieval and agentic reasoning.
The first framework, ArtRAG (ACM MM 2025), integrates a domain-specific art knowledge graph into a retrieval-augmented generation pipeline, enabling contextually grounded, multi-perspective artwork explanations without task-specific training. The second framework, A-MAR (ICMR 2026), extends this with an explicit reasoning and planning stage that conditions retrieval on structured, step-wise evidence requirements, moving beyond static retrieval toward interpretable, goal-driven multimodal reasoning.
Together, these works chart a path toward explainable AI systems for cultural industry applications, with relevance to digital humanities, museum technology, and multimodal AI research.
Shuai Wang is a PhD candidate at the University of Amsterdam, jointly in the MultiX lab (Informatics Institute) and Amsterdam Business School. His research specializes in multimodal agents and agentic AI, with a focus on applied research to solve complex, real-world business and cultural industry challenges.
Chinese calligraphy, also known as Shufa, the way of writing characters, presents a rich case for multimodal art understanding. It encodes visual structure, linguistic content, and traces of the bodily movement that produced it, yet these modalities are rarely studied together. This tutorial examines computational approaches to calligraphy along three progressive layers. The first layer addresses the finished work, showing how structural aesthetic features of regular script and seal carving can be quantified and connected to human perceptual preferences. The second layer moves beneath the surface to the writing process, exploring how cross-modal methods can recover implicit micro-actions such as brush pressure, pause, and wrist angle from the visual trace alone. The third layer considers the machine as a participant in calligraphic practice, surveying style transfer and generative approaches from early pix2pix methods to recent diffusion models, and discussing emotion-driven generation that translates poetic sentiment into expressive strokes. Together, these layers illustrate how calligraphy can serve as a testbed for multimodal representation, cross-modal translation, and human-AI co-creation in the cultural domain.
Tiancheng Liu is a PhD candidate in Computational Media and Arts at the Hong Kong University of Science and Technology (Guangzhou) and a visiting researcher in the MultiX Lab at the University of Amsterdam. His research lies at the intersection of computational aesthetics, embodied interaction, and creative AI, with a focus on computational approaches to aesthetic and multimedia analysis in visual and cultural works, particularly Chinese calligraphy. His work takes the form of both published research and exhibited artworks.
Multimodal representation learning offers a principled approach to fine art analysis, where artworks are associated with visual, textual, and relational information. This tutorial presents a line of work that develops representations of artworks by jointly modeling these complementary sources, moving beyond approaches based solely on visual features. We examine how such representations capture both visual characteristics and relationships between artworks and artists, enabling tasks such as similarity modeling, retrieval, and the analysis of artistic production over time. By integrating multiple modalities into a unified representation, this approach allows artworks to be analyzed not only in terms of their visual properties, but also in relation to other works and within the broader structure of artistic production. The tutorial provides a technical perspective on learning and using such representations for fine art analysis.
Athanasios Efthymiou is a postdoctoral researcher at the University of Amsterdam. His research lies at the intersection of multimodal representation learning and computational fine art analysis, focusing on methods for integrating visual, textual, and relational information to represent artworks and analyze patterns of artistic production.
Instance-level recognition (ILR) distinguishes individual objects rather than just broad categories, offering powerful applications for understanding art and culture. However, ILR faces two fundamental challenges: the scarcity of large-scale annotated datasets due to the fine-grained nature of the task, and the difficulty of handling distribution shifts between controlled gallery images and unstructured visitor photographs. This talk addresses these challenges by examining recent advances along two complementary fronts. First, we explore a large-scale benchmark built from The Met museum's open-access collection, featuring over 224k individual exhibits for training under studio conditions, while testing on noisy visitor-taken photos and out-of-distribution images. The benchmark shows that combining self-supervised and supervised contrastive learning for non-parametric classification is a promising direction. Second, we examine a novel synthetic data generation approach that bypasses manual annotation entirely: given only target domain names, it synthesizes diverse object instances without any real images. Fine-tuning foundation vision models on this synthetic data improves retrieval across multiple ILR benchmarks, offering an efficient and scalable alternative to extensive data collection. Together, these perspectives show how the field is moving from curated benchmarks toward flexible, domain-independent ILR systems, unlocking real-world applications for cultural heritage and beyond.
Noa Garcia is an Associate Professor at the Institute for Advanced Co-Creation Studies and D3 Center, The University of Osaka (Japan). She earned her Ph.D. in Computer Science from Aston University (UK), specializing in multimodal retrieval and instance-level recognition. She moved to Japan in 2018 as a postdoc, and has been conducting research at The University of Osaka since then. Her research sits at the intersection of computer vision, natural language processing, fairness, and art. Her recent work includes investigating demographic bias in computer vision, analyzing visual datasets, and exploring how generative models can reinforce social stereotypes.
Within Visual Culture, the notion of iconic images refers to particular images that hold a unique position in collective memory: they are often very widely reproduced and recalled, and strongly connected to a particular (often emotional) historical event. Because these images are widely circulated, they are frequently found in large-scale datasets. Yet from a technical perspective they are often not treated any differently, which raises questions about a-culturality in how we learn from multimedia collections.
In this talk, the notion of iconic images is introduced, including methods for studying them as well as open research questions. A central thesis is that iconic images form an inviting research topic that can, and should, be studied through multimedia research, which naturally opens a direction toward understanding that is better situated in art and culture.
Dr. Nanne van Noord is an Assistant Professor at the Multimedia Analytics Lab of the University of Amsterdam. His research lies at the intersection of Multimodal AI and Visual Culture, with the aim of integrating equitable visual cultural understanding into AI models to bridge the gap between humanistic and algorithmic inquiry.