Supervisor: Pietro Falco
Creation Date: 08/10/2025 17:28
This thesis explores how robots can learn to perceive and understand objects through both vision and touch by developing a cross-modal representation based on diffusion autoencoders. The goal is to train a model that learns from visual data and can generate the corresponding haptic (tactile) features, enabling a robot to predict how an object would feel based only on its appearance. The diffusion autoencoder will align the visual and tactile modalities within a shared latent space, leveraging the denoising and generative power of diffusion models to improve robustness and generalization. The approach will be tested in simulation and, when possible, with real tactile sensors, evaluating its ability to enhance object recognition and grasp planning in robotic manipulation tasks.
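At its core, such a cross-modal diffusion autoencoder pairs a visual encoder, which compresses an image into a semantic latent code, with a conditional denoising network that generates tactile features from noise guided by that code. The following minimal sketch illustrates one possible training step under simplifying assumptions (PyTorch, tactile data represented as fixed-length feature vectors, a DDPM-style noise-prediction objective); all class and function names are illustrative placeholders, not the thesis implementation.

    # Minimal sketch, assuming PyTorch and tactile features as fixed-length vectors.
    # VisualEncoder, TactileDenoiser and diffusion_loss are hypothetical names.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class VisualEncoder(nn.Module):
        """CNN that maps an RGB image to a semantic latent code z."""
        def __init__(self, latent_dim=128):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(128, latent_dim),
            )
        def forward(self, img):
            return self.net(img)

    class TactileDenoiser(nn.Module):
        """Predicts the noise added to a tactile feature vector, conditioned on
        the diffusion timestep and the visual latent z."""
        def __init__(self, tactile_dim=32, latent_dim=128, hidden=256):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(tactile_dim + latent_dim + 1, hidden), nn.SiLU(),
                nn.Linear(hidden, hidden), nn.SiLU(),
                nn.Linear(hidden, tactile_dim),
            )
        def forward(self, x_t, t, z):
            # t is normalised to [0, 1] and used as a scalar conditioning input.
            return self.net(torch.cat([x_t, z, t[:, None]], dim=-1))

    def diffusion_loss(encoder, denoiser, img, tactile, T=1000):
        """DDPM-style training step: noise the tactile vector, then ask the
        denoiser to recover the noise given the visual latent."""
        z = encoder(img)
        t = torch.randint(0, T, (img.shape[0],), device=img.device)
        beta = torch.linspace(1e-4, 0.02, T, device=img.device)
        alpha_bar = torch.cumprod(1.0 - beta, dim=0)[t]            # (B,)
        noise = torch.randn_like(tactile)
        x_t = alpha_bar.sqrt()[:, None] * tactile + (1 - alpha_bar).sqrt()[:, None] * noise
        pred = denoiser(x_t, t.float() / T, z)
        return F.mse_loss(pred, noise)

    # Usage on random stand-in data (real RGB-D / tactile pairs still to be acquired):
    enc, den = VisualEncoder(), TactileDenoiser()
    img = torch.randn(8, 3, 64, 64)       # batch of RGB images
    tactile = torch.randn(8, 32)          # batch of tactile feature vectors
    loss = diffusion_loss(enc, den, img, tactile)
    loss.backward()

At inference time, the same latent code would condition the reverse diffusion process, so that tactile features for an unseen object can be synthesized from its image alone.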
Dataset type: Data to be acquired
Dataset description: A simulated and real dataset including RGB-D images and tactile signals (from force sensors) collected during the exploration of objects of various shapes and materials. The data will be used for training and validation of cross-modal perception models.
List of Methods: Autoencoders and Diffusion Autoencoders for visual–tactile alignment; convolutional neural networks for feature extraction; contrastive learning; evaluation based on reconstruction error and object recognition accuracy.
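As one concrete reading of the contrastive learning component listed above, a CLIP-style InfoNCE objective can pull the embeddings of matching visual and tactile samples together while pushing non-matching pairs in the batch apart. The sketch below assumes PyTorch and paired embeddings already produced by the two encoders; the function name is illustrative, not from the thesis.

    # Minimal sketch of a symmetric InfoNCE loss for visual-tactile alignment.
    import torch
    import torch.nn.functional as F

    def info_nce(visual_emb, tactile_emb, temperature=0.07):
        """Matching visual/tactile pairs sit on the diagonal of the similarity
        matrix; cross-entropy in both directions aligns the two modalities."""
        v = F.normalize(visual_emb, dim=-1)
        t = F.normalize(tactile_emb, dim=-1)
        logits = v @ t.T / temperature                  # (B, B) similarities
        labels = torch.arange(v.shape[0], device=v.device)
        return 0.5 * (F.cross_entropy(logits, labels) +
                      F.cross_entropy(logits.T, labels))

    # Example with random embeddings standing in for encoder outputs:
    loss = info_nce(torch.randn(16, 128), torch.randn(16, 128))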