Autonomous Embodied Agents: When Robotics Meets Deep Learning Reasoning
Roberto Bigazzi Ph.D. Thesis, 2023
pdf / cover / bibtex
In this thesis, I present the research carried out during my Ph.D., which focused on the challenges defined by the field of Embodied Artificial Intelligence. I would like to thank my family, friends, and supervisor once again for their support during the last three years.
We propose a new task, Personalized Instance-based Navigation (PIN), in which an embodied agent is tasked with locating and reaching a specific personal object by distinguishing it among multiple instances of the same category. The task is accompanied by PInNED, a new dedicated dataset composed of photo-realistic scenes augmented with additional 3D objects.
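For concreteness, the sketch below shows what a PIN episode record could look like as a data structure; every field name here is hypothetical and does not reflect the released PInNED format.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of a PIN episode record; field names are
# illustrative and not taken from the released PInNED format.
@dataclass
class PINEpisode:
    scene_id: str                 # photo-realistic scene the episode runs in
    target_object_id: str         # the specific personal instance to reach
    target_category: str          # e.g. "backpack"
    reference_images: list = field(default_factory=list)   # visual refs of the target
    reference_captions: list = field(default_factory=list) # textual refs of the target
    distractor_ids: list = field(default_factory=list)     # same-category instances to reject
    start_position: tuple = (0.0, 0.0, 0.0)
    goal_position: tuple = (0.0, 0.0, 0.0)

episode = PINEpisode(
    scene_id="scene_0042",
    target_object_id="backpack_red_03",
    target_category="backpack",
    reference_images=["ref_0.png", "ref_1.png"],
    reference_captions=["a red backpack with a white logo"],
    distractor_ids=["backpack_blue_01", "backpack_black_07"],
)
print(episode.target_object_id)
```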
We propose a novel architecture inspired by Generative Adversarial Networks (GANs) that produces meaningful and well-formed synthetic instructions to improve navigation agents' performance. Validation on REVERIE and R2R highlights the promise of our approach, which achieves state-of-the-art results.
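The toy sketch below illustrates the general adversarial recipe behind this idea: a generator maps a path encoding to instruction tokens while a discriminator learns to tell them apart from human-written ones. All modules, dimensions, and the soft-token relaxation are illustrative assumptions, not the thesis architecture.

```python
import torch
import torch.nn as nn

# Toy GAN-style instruction generation: a generator maps a path encoding
# to instruction-token logits; a discriminator scores whether an
# instruction embedding looks human-written. All sizes are illustrative.
PATH_DIM, VOCAB, SEQ_LEN, EMB = 64, 1000, 12, 32

generator = nn.Sequential(nn.Linear(PATH_DIM, 256), nn.ReLU(),
                          nn.Linear(256, SEQ_LEN * VOCAB))
discriminator = nn.Sequential(nn.Linear(SEQ_LEN * EMB, 128), nn.ReLU(),
                              nn.Linear(128, 1))
token_emb = nn.Embedding(VOCAB, EMB)

path = torch.randn(8, PATH_DIM)                      # batch of path encodings
logits = generator(path).view(8, SEQ_LEN, VOCAB)
# Soft token embeddings keep the pipeline differentiable (Gumbel-style relaxation).
soft_tokens = torch.softmax(logits, dim=-1) @ token_emb.weight
fake_score = discriminator(soft_tokens.flatten(1))

real_tokens = torch.randint(0, VOCAB, (8, SEQ_LEN))  # human instructions (placeholder)
real_score = discriminator(token_emb(real_tokens).flatten(1))

bce = nn.BCEWithLogitsLoss()
d_loss = bce(real_score, torch.ones_like(real_score)) + \
         bce(fake_score.detach(), torch.zeros_like(fake_score))
g_loss = bce(fake_score, torch.ones_like(fake_score))  # fool the discriminator
print(d_loss.item(), g_loss.item())
```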
We propose a novel approach that combines visual and fine-tuned CLIP features to generate grounded
language-visual features for mapping. These region classification features are seamlessly integrated
into a global grid map using an exploration-based navigation policy.
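As a rough illustration of the mapping step, the sketch below accumulates per-region features into a global grid with a running mean and then queries the map with a text embedding. The grid size, feature dimension, and update rule are assumptions for illustration, not the exact procedure of the thesis.

```python
import torch

# Minimal sketch of accumulating grounded region features into a global
# grid map; resolution, feature dim, and running-mean update are assumed.
GRID, FEAT = 128, 512                      # map resolution, CLIP feature dim
feature_map = torch.zeros(GRID, GRID, FEAT)
counts = torch.zeros(GRID, GRID, 1)

def integrate(cells: torch.Tensor, feats: torch.Tensor) -> None:
    """cells: (N, 2) grid coordinates of observed regions (e.g. from depth
    and agent pose); feats: (N, FEAT) CLIP region features for those cells."""
    for (r, c), f in zip(cells.tolist(), feats):
        counts[r, c] += 1
        # Running mean keeps the map stable as cells are revisited.
        feature_map[r, c] += (f - feature_map[r, c]) / counts[r, c]

# Fake observation: 5 regions with random CLIP-like features.
integrate(torch.randint(0, GRID, (5, 2)), torch.randn(5, FEAT))

# The map can then be queried with a text embedding by cosine similarity.
text = torch.randn(FEAT)
sim = torch.nn.functional.cosine_similarity(feature_map.view(-1, FEAT),
                                            text.unsqueeze(0), dim=-1)
print(sim.view(GRID, GRID).max().item())
```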
We propose and evaluate an approach that combines recent advances in visual robotic exploration
and image captioning on images generated through agent-environment interaction. Our approach can
generate smart scene descriptions that maximize semantic knowledge of the environment and avoid
repetitions.
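A minimal sketch of the repetition-avoidance idea follows: emit a caption only if it is sufficiently novel with respect to the captions produced so far. The Jaccard word-overlap measure and the 0.5 threshold are illustrative stand-ins for the learned criteria of the actual system.

```python
# Emit a caption only if it is sufficiently novel w.r.t. past captions.
def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def maybe_describe(caption: str, history: list[str], thr: float = 0.5) -> bool:
    if all(jaccard(caption, past) < thr for past in history):
        history.append(caption)
        return True
    return False

history: list[str] = []
for cap in ["a kitchen with a wooden table",
            "a kitchen with a table and chairs",   # near-duplicate, dropped
            "a hallway leading to a bathroom"]:
    if maybe_describe(cap, history):
        print("described:", cap)
```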
We use explainable maps to visualize model predictions and highlight the correlation between
observed entities and words generated by a captioning model during the embodied exploration.
We propose Spot the Difference: a novel task for Embodied AI in which the agent has access to an outdated map of the environment and needs to recover the correct layout within a fixed time budget.
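To make the evaluation intuition concrete, the sketch below scores a hypothetical episode by the fraction of changed map cells the agent managed to correct; the binary occupancy encoding and the metric itself are assumptions for illustration.

```python
import numpy as np

# Illustrative scoring for a Spot the Difference-style episode: compare the
# agent's corrected map against the true layout, restricted to the cells
# where the outdated map was wrong. Encoding and metric are assumed.
rng = np.random.default_rng(0)
true_map = rng.integers(0, 2, (64, 64))    # current occupancy layout
outdated = true_map.copy()
changed = rng.random((64, 64)) < 0.1       # 10% of cells changed over time
outdated[changed] = 1 - outdated[changed]

agent_map = outdated.copy()
# Suppose the agent, within its time budget, re-observed half of the scene.
seen = rng.random((64, 64)) < 0.5
agent_map[seen] = true_map[seen]

recovered = (agent_map == true_map) & changed
print("fraction of changes recovered:", recovered.sum() / changed.sum())
```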
We propose to train a navigation model with a purely intrinsic reward signal to guide exploration, which is based on the impact of the robot's actions on its internal representation of the environment.
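The sketch below captures the core of an impact-style intrinsic reward: reward the agent in proportion to how much its latest action changed an encoding of its observation. The encoder and the L2 distance are illustrative choices, a minimal sketch rather than the trained model.

```python
import torch
import torch.nn as nn

# Minimal impact-style intrinsic reward: the agent is rewarded in
# proportion to how much its latest action changed the encoding of what
# it observes. Encoder architecture and L2 distance are assumed here.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))

def impact_reward(obs_t: torch.Tensor, obs_next: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        phi_t, phi_next = encoder(obs_t), encoder(obs_next)
    # Large change in the internal representation -> large intrinsic reward.
    return torch.norm(phi_next - phi_t, p=2, dim=-1)

obs_t = torch.rand(1, 3, 64, 64)      # RGB observation before the action
obs_next = torch.rand(1, 3, 64, 64)   # observation after the action
print(impact_reward(obs_t, obs_next).item())
```

In practice, rewards of this kind are often additionally modulated (e.g., by visitation counts) so that exploration does not vanish as the representation stabilizes; the sketch omits that for brevity.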
We build and release a new 3D space for embodied navigation with a unique characteristic: it reproduces a complete art museum. We name this environment ArtGallery3D (AG3D).
We detail how to transfer the knowledge acquired in simulation to the real world, describing the architectural discrepancies that hinder the Sim2Real adaptation of models trained on the Habitat simulator, and we propose a novel solution tailored to deployment in real-world scenarios.
We devise an embodied setting in which an agent needs to explore a previously unknown environment while recounting what it sees along the path. The agent needs to navigate the environment driven by an exploration goal, select proper moments for description, and output natural language descriptions of relevant objects and scenes.
Reviewing Committees
Journals:
IEEE Robotics and Automation Letters (RA‑L)
IEEE Geoscience and Remote Sensing Letters (GRSL)
Pattern Recognition Letters (PRL)
ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM)
Conferences:
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)
European Conference on Computer Vision (ECCV) [Outstanding Reviewer @ ECCV 2024]
IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)
IEEE International Conference on Robotics and Automation (ICRA)
IAPR International Conference on Pattern Recognition (ICPR)
ACM International Conference on Multimedia (ACMMM)
International Conference on Image Analysis and Processing (ICIAP)