We present a method for detecting driver frustration from both video and audio streams captured during the driver’s interaction with an in-vehicle voice-based navigation system. The video is of the driver’s face when the machine is speaking, and the audio is of the driver’s voice when he or she is speaking. We analyze a dataset of 20 drivers that contains 596 audio epochs (audio clips, with duration from 1 sec to 15 sec) and 615 video epochs (video clips, with duration from 1 sec to 45 sec). The dataset is balanced across 2 age groups, 2 vehicle systems, and both genders. The model was subject-independently trained and tested using 4-fold cross-validation. We achieve an accuracy of 77.4% for detecting frustration from a single audio epoch and 81.2% for detecting frustration from a single video epoch. We then treat the video and audio epochs as a sequence of interactions and use decision fusion to characterize the trade-off between decision time and classification accuracy, which improved the prediction accuracy to 88.5% after 9 epochs.
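Subject-independent evaluation means that no driver contributes epochs to both the training and the test side of a fold. The following is a minimal sketch of one way to build such folds, not the authors' code. The function name, the round-robin assignment of subjects to folds, and the toy subject labels are all assumptions made for illustration.

```python
from collections import defaultdict


def subject_independent_folds(subject_ids, n_folds=4):
    """Split epoch indices into folds so that no subject spans train and test.

    subject_ids: one subject label per epoch, in epoch order.
    Returns a list of (train_indices, test_indices) pairs, one per fold.
    """
    # Group epoch indices by the subject they came from.
    epochs_by_subject = defaultdict(list)
    for index, subject in enumerate(subject_ids):
        epochs_by_subject[subject].append(index)

    # Deal subjects round-robin into folds (sorted for reproducibility).
    subjects = sorted(epochs_by_subject)
    fold_subjects = [subjects[k::n_folds] for k in range(n_folds)]

    folds = []
    for held_out in fold_subjects:
        held_out_set = set(held_out)
        # Test epochs: everything from the held-out subjects.
        test = [i for s in held_out for i in epochs_by_subject[s]]
        # Training epochs: everything from the remaining subjects.
        train = [i for s in subjects if s not in held_out_set
                 for i in epochs_by_subject[s]]
        folds.append((sorted(train), sorted(test)))
    return folds


# Toy data (hypothetical): 20 subjects with 3 epochs each.
subject_ids = [f"driver_{n:02d}" for n in range(20) for _ in range(3)]
folds = subject_independent_folds(subject_ids, n_folds=4)
```

With 20 subjects and 4 folds, each fold holds out all epochs from 5 drivers and trains on the other 15.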
This video accompanies our paper presented at IJCAI, the International Joint Conference on Artificial Intelligence, where we propose a system for detecting driver frustration from the fusion of two data streams: first, the audio of the driver's voice, and second, the video of the driver's face.

Let's ask an illustrative question. These are video snapshots of two drivers using the in-car voice-based navigation system. Which one of them looks more frustrated with the interaction? To help answer that question, let's take a look at an example interaction involving the driver on the right. Our proposed approach uses the audio of the driver's voice when the, quote-unquote, human is speaking, and the video of the driver's face when he's listening to the machine speak. What you're seeing and hearing is the driver attempting to instruct the car's voice-based navigation system to navigate to 177 Massachusetts Avenue, Cambridge, Massachusetts.

Driver: "177 [unclear] Cambridge, Massachusetts. [unclear] Massachusetts Avenue, Cambridge, Massachusetts. Cambridge, Massachusetts."

So there's your answer. On a scale of one to ten, with one being completely satisfied and ten being completely frustrated, the smiling driver reported his frustration level with this interaction to be a nine. We use the self-reported level of frustration as the ground truth for the binary classification of satisfied versus frustrated.

When the driver is speaking, we extract the Geneva Minimalistic Acoustic Parameter Set (GeMAPS) features from their voice, which measure basic physiological changes in voice production. When the driver is listening, we extract 14 facial actions using the AFFDEX system from the video of the driver's face. The classifier decisions are fused together to produce an accuracy of 88.5% on an on-road dataset of twenty subjects.

There are two takeaways from this work that may go beyond just detecting driver frustration. First, self-reported emotional state may be very different from one assigned by a group of external annotators, so we have to be careful when using such annotations as ground truth for other affective computing experiments. Second, detection of emotion may require considering not just facial actions or voice acoustics, but also the context of the interaction and the target of the affective communication.

For more information or to contact the authors, please visit the following website.
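The fusion of per-epoch classifier decisions mentioned in the talk can be illustrated with a small sketch. The talk does not say which fusion rule was used, so the running average of per-epoch frustration probabilities below is an assumption, as are the function name, the 0.5 threshold, and the example values.

```python
def fuse_epoch_decisions(frustration_probs, threshold=0.5):
    """Return the fused decision after each successive epoch.

    frustration_probs: per-epoch classifier outputs in [0, 1], in the order
    the epochs occurred (audio and video epochs interleaved).
    The fusion rule here (running mean, assumed) is illustrative only.
    """
    fused = []
    running_sum = 0.0
    for count, prob in enumerate(frustration_probs, start=1):
        running_sum += prob
        mean_prob = running_sum / count
        # Decide "frustrated" once the average evidence reaches the threshold.
        fused.append("frustrated" if mean_prob >= threshold else "satisfied")
    return fused


# Hypothetical per-epoch outputs for one interaction.
probs = [0.25, 0.5, 0.75, 1.0, 0.75]
decisions = fuse_epoch_decisions(probs)
```

Each entry of `decisions` is the call the system would make if it had to decide after that many epochs, which is the decision-time versus accuracy trade-off in miniature: waiting for more epochs gives the average more evidence to work with.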