Written By
Dr. Ir. Thierry Deruyttere
Dr. Ir. Dusan Grujicic
Written on 18/08/2023

Interacting with autonomous vehicles through commands

Keywords: Computer Vision, Natural Language Processing, Uncertainty Detection, Self-Driving Cars, Deep Learning

The You & AI team worked on a project called Talk2Car. The goal of Talk2Car was to create a system that allows humans to interact with an autonomous vehicle through verbal commands. This matters because, while multiple companies are competing to be the first to develop a fully functional self-driving car that can bring passengers from point A to point B, multiple surveys indicate that people are afraid to step into such cars because the lack of control frightens them. This is where Talk2Car comes in. By allowing people to interact with the car through verbal commands, the system lets them feel more in control of the car and thus safer.
The Talk2Car system consists of multiple parts. First, the car needs to understand the commands that are given to it: what is the goal of the command, which object does it refer to, and so on. After the car has understood the command, it needs to be able to execute it. Additionally, the car needs to be able to communicate with the passenger about uncertainties or problems that it encounters: because it operates in a real-world setting, it needs to be fairly certain about the actions that it takes. In the following sections, we discuss the different parts of the Talk2Car system.
Understanding the command
In the Talk2Car project, the command always referred to a specific object in the visual scene. Hence, the goal of the first step is to detect which object the command is referring to. To do this, all objects in the scene are first detected using a monocular 3D object detector. Then, the AI needs to determine which of these objects is the one that the command refers to. This is done by using a Natural Language Processing (NLP) model to embed the command and a Computer Vision (CV) model to embed the detected objects. The command embedding is then compared with each object embedding by taking the dot product between the two, which yields a matching score per object. During our work, we also observed that the computer vision model focused mostly on the shape of an object and ignored its color. Hence, we also added a color embedding to the object embeddings. A schematic overview of this process can be seen in the figure below.
While the above approach seems rather simple, the method proposed by the You & AI team is, at the time of writing, the state-of-the-art method for this task. It even beats other well-known alternatives such as VL-BERT, ViLBERT, and LXMERT. It is also the fastest method.
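The matching step described above can be illustrated with a minimal sketch. It is not the team's implementation: the embeddings are hand-picked toy vectors rather than outputs of real NLP and CV models, the function name `ground_command` is hypothetical, and two details are assumptions made only for illustration, namely that the color embedding is fused by element-wise addition and that the scores are normalised with a softmax.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ground_command(command_emb, object_embs, color_embs):
    """Score every detected object against the command embedding.

    Assumption: color is fused into each object embedding by
    element-wise addition; the source only states that a color
    embedding was added to the object embeddings.
    """
    fused = [
        [o + c for o, c in zip(obj, col)]
        for obj, col in zip(object_embs, color_embs)
    ]
    scores = [dot(command_emb, f) for f in fused]
    probs = softmax(scores)  # assumption: softmax normalisation
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs

# Toy data: objects 0 and 1 have a similar shape, but only object 1
# has the color that the command mentions (third dimension).
command_emb = [1.0, 0.0, 1.0]
object_embs = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 1.0, 0.0]]
color_embs = [[0.0, 0.0, 0.1], [0.0, 0.0, 0.9], [0.0, 0.0, 0.5]]

best, probs = ground_command(command_emb, object_embs, color_embs)
print("referred object index:", best)
```

In this toy setup the shape-only embeddings would slightly favour object 0, while the added color information tips the score towards object 1, which mirrors the motivation for the color embedding given above.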
Detecting Uncertainties
Because a self-driving car that executes a wrong command can have life-threatening consequences, it is important to be certain about the actions that the car takes. Hence, the Talk2Car system needs to be able to detect uncertainties and communicate these to the passenger. The uncertainty that we are interested in is the uncertainty about the object that the command is referring to. We wish to know how certain the AI is that it has detected the correct object. If it is not certain enough, we need to communicate this to the passenger and ask for clarification.
Many different approaches for detecting uncertainty exist in the literature. However, most of these approaches are not suitable for real-world applications, as they have been developed on toy datasets and are not robust enough to be used in real-world settings. The current method for detecting uncertainty that works on multiple datasets is to use an ensemble of models that all vote for the referred object. If the models do not agree on the referred object, we can assume that the AI is uncertain about it. The main drawback of this method is that it is computationally very expensive, and making it more efficient is still an open research topic.
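The voting idea can be sketched in a few lines. This is only an illustration under stated assumptions: each ensemble member is represented by the index of the object it predicts, and both the function name `ensemble_vote` and the agreement threshold `min_agreement` are hypothetical choices, not values from the project.

```python
from collections import Counter

def ensemble_vote(predictions, min_agreement=0.75):
    """Majority vote over the object indices predicted by each model.

    Returns the winning object, the fraction of models that voted
    for it, and a flag that is True when that fraction falls below
    `min_agreement` (a hypothetical threshold).
    """
    counts = Counter(predictions)
    winner, votes = counts.most_common(1)[0]
    agreement = votes / len(predictions)
    return winner, agreement, agreement < min_agreement

# Four of five models agree: the system can act on the command.
confident = ensemble_vote([3, 3, 3, 3, 1])

# The models disagree: the system should ask the passenger to clarify.
doubtful = ensemble_vote([3, 1, 2, 3, 1])

print(confident, doubtful)
```

The cost mentioned above is visible in the input: every entry in `predictions` requires a full forward pass of a separate model.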
Executing the command
After understanding the command and detecting uncertainties, the car needs to execute the command. We do this by predicting the destination and the trajectory that the vehicle has to follow to reach this destination. To facilitate this, we make these predictions in a top-down view of the scene instead of in the camera images from the vehicle. This makes it easier for the AI to predict the trajectory, as it does not have to deal with the perspective distortion of the camera.
For this task, the You & AI team proposed a novel method that uses a Density Mixture approach to predict the trajectory. Concretely, we divide the top-down view of the scene into a grid of cells. Then, for each cell, we predict the probability that the vehicle will be in that cell at a certain time in the future. We do this for certain key points of the trajectory. Once we have these predictions, we use a neural network to combine them into a single trajectory. This method is also the current state-of-the-art method for this task.
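To make the grid representation concrete, the following sketch turns one probability grid per key point into a sequence of top-down positions. Note the swapped-in technique: the method described above uses a neural network to combine the per-cell predictions, whereas this sketch simply takes the probability-weighted mean cell centre of each grid. The function names, the `cell_size` parameter, and the toy grids are all hypothetical.

```python
def expected_position(heatmap, cell_size=1.0):
    """Probability-weighted mean (x, y) of the cell centres of one grid.

    `heatmap[row][col]` holds the predicted probability that the
    vehicle is in that cell; values need not be normalised.
    """
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        raise ValueError("heatmap must contain positive probability mass")
    x = sum(p * (col + 0.5) for row in heatmap for col, p in enumerate(row))
    y = sum(p * (r + 0.5) for r, row in enumerate(heatmap) for p in row)
    return (x * cell_size / total, y * cell_size / total)

def heatmaps_to_trajectory(heatmaps, cell_size=1.0):
    """One expected position per key point, in key-point order."""
    return [expected_position(h, cell_size) for h in heatmaps]

# Three key points on a 2x2 top-down grid (toy values).
heatmaps = [
    [[1.0, 0.0], [0.0, 0.0]],  # all mass in the top-left cell
    [[0.0, 0.5], [0.0, 0.5]],  # mass split over the right column
    [[0.0, 0.0], [0.0, 1.0]],  # all mass in the bottom-right cell
]

trajectory = heatmaps_to_trajectory(heatmaps)
print(trajectory)
```

A weighted mean is a deliberately simple stand-in: it can land between two plausible destinations when the distribution is multi-modal, which is one reason a learned combination step can be preferable.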
Below, we show the simplified architectural layout of our proposed system for the Talk2Car project.

Want to have a chat about this project? Feel free to contact us here.
Interested readers might also want to check out the following publications:
  • Othman, Kareem. "Public acceptance and perception of autonomous vehicles: a comprehensive review." AI and Ethics 1.3 (2021): 355-387.
  • Deruyttere, Thierry*, Victor Milewski*, and Marie-Francine Moens. "Giving commands to a self-driving car: How to deal with uncertain situations?." Engineering Applications of Artificial Intelligence 103 (2021): 104257.
  • Schoettle, Brandon, and Michael Sivak. A survey of public opinion about autonomous and self-driving vehicles in the US, the UK, and Australia. University of Michigan, Ann Arbor, Transportation Research Institute, 2014.
  • Das, Subasish, Anandi Dutta, and Kay Fitzpatrick. "Technological perception on autonomous vehicles: perspectives of the non-motorists." Technology Analysis & Strategic Management 32.11 (2020): 1335-1352.
  • Deruyttere, Thierry, et al. "Talk2Car: Taking Control of Your Self-Driving Car." Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 2019.
  • Grujicic, Dusan*, Deruyttere, Thierry* et al. "Predicting Physical World Destinations for Commands Given to Self-Driving Cars." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 36. No. 1. 2022.
  • Deruyttere, Thierry*, Grujicic, Dusan* et al. “My Virtual Taxi Driver: From Command To Planning”. Submitted.
  • Caesar, Holger, et al. "nuScenes: A multimodal dataset for autonomous driving." Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020.
  • Deruyttere, Thierry, Guillem Collell Talleda, and Marie Francine Moens. "Giving Commands to a Self-driving Car: A Multimodal Reasoner for Visual Grounding." AAAI Conference on Artificial Intelligence (RCQA Workshop). AAAI, 2020.
  • Vandenhende, Simon, Thierry Deruyttere, and Dusan Grujicic. "A baseline for the commands for autonomous vehicles challenge." arXiv preprint arXiv:2004.13822 (2020).
  • Rufus, Nivedita, et al. "Cosine meets softmax: A tough-to-beat baseline for visual grounding." European Conference on Computer Vision. Springer, Cham, 2020.
  • Havasi, Marton, et al. "Training independent subnetworks for robust prediction." arXiv preprint arXiv:2010.06610 (2020).
  • Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017, July). On calibration of modern neural networks. In International conference on machine learning (pp. 1321-1330). PMLR.
  • Gal, Yarin, and Z. Ghahramani. "Dropout as a Bayesian approximation: Representing model uncertainty in deep learning." arXiv preprint arXiv:1506.02142 (2015).
  • Lakshminarayanan, B., Pritzel, A., & Blundell, C. (2017). Simple and scalable predictive uncertainty estimation using deep ensembles. Advances in neural information processing systems, 30.