Multi Modal Representation Learning and Cross-Modal Semantic Matching
Humans perceive the real world through their sensory organs: vision, taste, hearing, smell, and touch. In terms of information, we consider these different modes, also referred to as different channels of information or modalities.
- Wang, X.
- 24 June 2022
- Thesis in Leiden Repository
Considering multiple channels of information at the same time is referred to as multimodal processing, and the combined input as multimedia. By their very nature, multimedia data are complex and often involve intertwined instances of different kinds of information. We can leverage this multimodal perspective to extract meaning and understanding of the world, comparable to how our brain processes these multiple channels: we learn how to combine them and extract meaningful information from them. In this thesis, the learning is done by computer programs and smart algorithms, which is referred to as artificial intelligence. To that end, we have studied multimedia information, with a focus on representing vision and language for semantic mapping. The aims of the semantic mapping learning in this thesis are: (1) visually supervised word embedding learning; (2) fine-grained label learning for vision representation; (3) kernel-based transformation for image and text association; (4) visual representation learning via a cross-modal contrastive learning framework.
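To give a flavour of aim (4), the sketch below shows a symmetric cross-modal contrastive (InfoNCE-style) objective over a batch of paired image and text embeddings, in the spirit popularized by CLIP. This is an illustrative, hypothetical implementation in NumPy, not the thesis's exact formulation; the function name, temperature value, and embedding shapes are assumptions for the example.

```python
# Hypothetical sketch of a cross-modal contrastive (InfoNCE-style) loss:
# matched image/text pairs sit on the diagonal of a similarity matrix and
# are pulled together, while mismatched pairs are pushed apart.
import numpy as np

def l2_normalize(x, axis=-1):
    """Project embeddings onto the unit sphere so dot products are cosines."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_modal_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for a batch of paired (image, text) embeddings.

    image_emb, text_emb: arrays of shape (batch, dim), row i of each matrix
    describing the same underlying sample.
    """
    img = l2_normalize(image_emb)
    txt = l2_normalize(text_emb)
    logits = img @ txt.T / temperature      # (batch, batch) cosine similarities
    batch = logits.shape[0]
    diag = np.arange(batch)                 # matching pairs lie on the diagonal

    def xent(l):
        # softmax cross-entropy with the diagonal as the target class
        l = l - l.max(axis=1, keepdims=True)            # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[diag, diag].mean()

    # average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2
```

As a quick sanity check, identical embeddings in both modalities should yield a much lower loss than unrelated random embeddings, since the diagonal similarities then dominate each row.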