A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction

Li, Yue; Kunneman, Florian A.; Hindriks, Koen V.

doi:https://doi.org/10.1109/RO-MAN60168.2024.10731262

Files

A_Near-Real-Time_Processing_Ego_Speech_Filtering_Pipeli... (1.36 MB)

Publication date

2024-10-30

Authors

Li, Yue

Kunneman, Florian

Hindriks, Koen V.

DOI

https://doi.org/10.1109/RO-MAN60168.2024.10731262

Document Type

Part of book

Collections

Utrecht University Repository

License

taverne

Abstract

With current state-of-the-art (SOTA) automatic speech recognition (ASR) systems, it is not possible to transcribe overlapping speech audio streams separately. Consequently, when these ASR systems are used as part of a social robot like Pepper for interaction with a human, it is common practice to close the robot's microphone while the robot itself is talking. This prevents human users from interrupting the robot, which limits speech-based human-robot interaction. To enable a more natural interaction that allows for such interruptions, we propose an audio processing pipeline for filtering out the robot's ego speech using only a single-channel microphone. This pipeline takes advantage of the possibility of feeding the robot's ego speech signal, generated by a text-to-speech API, as training data into a machine learning model. The proposed pipeline combines a convolutional neural network and spectral subtraction to extract overlapping human speech from the audio recorded by the robot-embedded microphone. When evaluating on a held-out test set, we find that this pipeline outperforms our previous approach to this task, as well as SOTA target speech extraction systems that were retrained on the same dataset. We have also integrated the proposed pipeline into a lightweight robot software development framework to make it available for broader use. As a step towards demonstrating the feasibility of deploying our pipeline, we use this framework to evaluate the effectiveness of the pipeline in a small lab-based feasibility pilot using the social robot Pepper. Our results show that when participants interrupt the robot, the pipeline can extract the participant's speech from one-second streaming audio buffers received by the robot-embedded single-channel microphone, hence in near-real time.
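
The spectral subtraction step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an estimate of the ego speech as heard by the microphone is already available (in the paper that role belongs to the convolutional neural network, which is not reproduced here), and the function name `spectral_subtract`, its parameters, the 16 kHz sample rate, and the synthetic sine-wave signals are all hypothetical choices made for illustration.

```python
import numpy as np

def spectral_subtract(mixture, ego_estimate, frame_len=512, hop=256, floor=0.0):
    """Subtract the ego-speech magnitude spectrum from the mixture, frame by frame.

    Illustrative only: `ego_estimate` stands in for whatever model predicts the
    robot's own speech at the microphone.
    """
    # Periodic Hann window: with 50% overlap the shifted windows sum to 1.
    window = np.hanning(frame_len + 1)[:-1]
    out = np.zeros(len(mixture))
    for start in range(0, len(mixture) - frame_len + 1, hop):
        seg = slice(start, start + frame_len)
        mix_spec = np.fft.rfft(mixture[seg] * window)
        ego_mag = np.abs(np.fft.rfft(ego_estimate[seg] * window))
        # Subtract magnitudes, clamp at a spectral floor, reuse the mixture phase.
        mag = np.maximum(np.abs(mix_spec) - ego_mag, floor * np.abs(mix_spec))
        frame = np.fft.irfft(mag * np.exp(1j * np.angle(mix_spec)), n=frame_len)
        out[seg] += frame  # overlap-add
    return out

# One-second buffer at an assumed 16 kHz sample rate, with toy signals.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
human = 0.5 * np.sin(2 * np.pi * 440 * t)   # stand-in for the interrupting speaker
ego = 0.8 * np.sin(2 * np.pi * 1000 * t)    # stand-in for the robot's own speech
recovered = spectral_subtract(human + ego, ego)
```

Processing one buffer at a time, as in this sketch, mirrors the one-second streaming buffers described in the abstract; real speech overlaps in frequency far more than these two tones do, so the separation here is much cleaner than it would be in practice.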

Keywords

Taverne, Artificial Intelligence, Computer Vision and Pattern Recognition, Human-Computer Interaction, Software

Citation

Li, Y, Kunneman, F A & Hindriks, K V 2024, A Near-Real-Time Processing Ego Speech Filtering Pipeline Designed for Speech Interruption During Human-Robot Interaction. in 33rd IEEE International Conference on Robot and Human Interactive Communication, ROMAN 2024. IEEE International Workshop on Robot and Human Communication, RO-MAN, IEEE, pp. 1370-1377, 33rd IEEE International Conference on Robot and Human Interactive Communication, ROMAN 2024, Pasadena, United States, 26/08/24. https://doi.org/10.1109/RO-MAN60168.2024.10731262, conference

URI

https://dspace.library.uu.nl/handle/1874/482501
