Super-Audible Voice Activity Detection

EXPLANATION: A small loudspeaker located close to a person's mouth outputs a very high-frequency chirp. The sound is reflected from the person's face and captured by a small microphone. The chirp is low-energy, with a frequency that lies above the limit of human hearing, yet it can be generated (and received) by ordinary smartphone hardware and software.

By carefully analysing how the reflected chirp varies in time, it is possible to very accurately discern the 'mouth state' of the person using the device. At the very least, we can tell if they are currently speaking or not (which is useful information in high background noise environments). In future, we hope to begin to determine what they are saying.
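As a rough illustration of the kind of excitation signal described above, the following sketch generates a short linear chirp just above the audible range. The function name make_chirp and all parameter values (20 kHz to 23 kHz sweep, 20 ms duration, 48 kHz sample rate, low amplitude) are assumptions chosen for the example; they are not taken from the papers.

```python
import math


def make_chirp(f_start_hz, f_end_hz, duration_s, sample_rate_hz, amplitude=0.1):
    """Return a linear chirp as a list of float samples.

    The instantaneous frequency sweeps linearly from f_start_hz to f_end_hz.
    """
    nyquist_hz = sample_rate_hz / 2
    if max(f_start_hz, f_end_hz) >= nyquist_hz:
        raise ValueError("chirp must stay below the Nyquist frequency")
    n_samples = int(round(duration_s * sample_rate_hz))
    sweep_rate = (f_end_hz - f_start_hz) / duration_s  # Hz per second
    samples = []
    for i in range(n_samples):
        t = i / sample_rate_hz
        # Phase is the integral of the instantaneous frequency.
        phase = 2 * math.pi * (f_start_hz * t + 0.5 * sweep_rate * t * t)
        samples.append(amplitude * math.sin(phase))
    return samples


# Assumed values for illustration only (not taken from the papers):
# a 20 ms sweep from 20 kHz to 23 kHz at a 48 kHz sample rate.
chirp = make_chirp(20000.0, 23000.0, 0.02, 48000)
```

A 48 kHz sample rate is assumed here because it is common in phone audio hardware; it caps the usable band at 24 kHz, which is why the sweep stops short of that.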

(1) Ian McLoughlin, “Super-audible Voice Activity Detection”, IEEE Trans. Audio, Speech and Language Processing, vol. 22, no. 8, Sept. 2014, pp. 1424-1433

Abstract—In this paper, reflected sound of frequency just above the audible range is used to detect speech activity. The active signal used is inaudible to humans, readily generated by the typical audio circuitry and components found in mobile telephones, and is robust to background sounds such as nearby voices. In use, the system relies upon a wideband excitation signal emitted from a loudspeaker located near the lips, which reflects from the mouth region and is then captured by a nearby microphone. The state of the lip opening is evaluated periodically by tracking the resonance patterns in the reflected excitation signal. When the lips are open, deep and complex resonances are formed as energy propagates into and then reflects out from the open mouth and vocal tract, with resonance depth being related to the open lip area. When the lips are closed, these resonance patterns are absent. The presence of the resonances can thus serve as a low complexity detection measure.

The technique is evaluated for multiple users in terms of sensitivity to source placement and sensor placement. Voice activity detection performance using this measure is further evaluated in the presence of realistic wideband acoustic background noise, as well as artificially added noise. The system is shown to be relatively insensitive to sensor placement, highly insensitive to background noise, and able to achieve greater than 90% voice activity detection accuracy. The technique is even suitable when a subject is whispering in the presence of much louder multi-speaker babble. The technique has potential for speech-based systems operating in high noise environments as well as in silent speech interfaces, whisper-input systems and voice prostheses for speech-impaired users.

Notes: this paper investigates the voice activity detection (VAD) potential of the technique at very high noise levels.
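The abstract describes using the presence of resonances in the reflected signal as a low-complexity detection measure. One possible way to sketch that idea is shown below: measure the spectrum of a received frame across the excitation band, then check how far the deepest notch falls below the typical band level. This is not the algorithm from the paper. The function names, the use of the median as the reference level, the number of analysis bins and the 10 dB threshold are all assumptions made for illustration.

```python
import cmath
import math


def band_spectrum_db(frame, sample_rate_hz, f_lo_hz, f_hi_hz, n_bins=32):
    """Magnitude in dB of `frame` at n_bins evenly spaced band frequencies."""
    n = len(frame)
    levels = []
    for b in range(n_bins):
        f = f_lo_hz + (f_hi_hz - f_lo_hz) * b / (n_bins - 1)
        # Direct evaluation of the discrete-time Fourier transform at f.
        w = -2j * math.pi * f / sample_rate_hz
        acc = sum(x * cmath.exp(w * i) for i, x in enumerate(frame))
        # Small offset avoids log10(0) for an empty bin.
        levels.append(20 * math.log10(abs(acc) / n + 1e-12))
    return levels


def resonance_depth_db(levels_db):
    """Depth of the deepest spectral notch below the median band level."""
    ordered = sorted(levels_db)
    median = ordered[len(ordered) // 2]
    return median - ordered[0]


def lips_open(frame, sample_rate_hz, f_lo_hz, f_hi_hz, threshold_db=10.0):
    """Guess 'open' when a deep notch is present (threshold is an assumption)."""
    levels = band_spectrum_db(frame, sample_rate_hz, f_lo_hz, f_hi_hz)
    return resonance_depth_db(levels) > threshold_db
```

In practice the received frame would first need the direct loudspeaker-to-microphone path and the excitation's own spectral shape accounted for, and the threshold would need tuning per device; neither is attempted here.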

(2) Ian McLoughlin, Yan Song, “Mouth State Detection From Low-Frequency Ultrasonic Reflection”, Journal of Circuits, Systems & Signal Processing, Oct. 2014

Abstract--This paper develops, simulates and experimentally evaluates a novel method based on non-contact low frequency (LF) ultrasound which can determine, from airborne reflection, whether the lips of a subject are open or closed. The method is capable of accurately distinguishing between open and closed lip states through the use of a low-complexity detection algorithm, and is highly robust to interfering audible noise. A novel voice activity detector is implemented and evaluated using the proposed method and shown to detect voice activity with high accuracy, even in the presence of high levels of background noise. The lip state detector is evaluated at a number of angles of incidence to the mouth and under various conditions of background noise. The underlying mouth state detection technique relies upon an inaudible LF ultrasonic excitation, generated in front of the face of a user, either reflecting back from their face as a simple echo in the closed mouth state or resonating inside the open mouth and vocal tract, affecting the spectral response of the reflected wave when the mouth is open. The difference between echo and resonance behaviours is used as the basis for automated lip opening detection, which implies determining whether the mouth is open or closed at the lips. Apart from this, potential applications include use in voice generation prosthesis for speech impaired patients, or as a hands-free control for electrolarynx and similar rehabilitation devices. It is also applicable to silent speech interfaces and may have use for speech authentication.

Notes: this paper investigates the science behind the technique, including the use of a simulation to explore the physics and the physical methods used.

History and Motivation

This research is a relatively recent culmination of ideas that grew from the Bionic Voice Project from 2006/2007 onwards, which aims to return the power of speech to post-laryngectomised patients. The original idea was to fill their vocal tract with high frequency excitation (in place of pitch) and then see if we could determine what the remainder of their vocal tract (VT) modulators were doing by 'listening' to the 'echo response' from out of their mouth.

It didn't work at that time, unfortunately, but we did find that we could VERY CLEARLY identify whether their mouth was open or closed!

The first papers I published about this were using lower frequency sounds to 'scan' the vocal tract, with the able assistance of my PhD student Farzaneh Ahmadi:

This paper was a validation of a technique that had grown in my mind much earlier. Basically the idea was to use signals at the lower end of the ultrasonic range to scan the VT... which led to quite a few investigations, including the following:

Having worked on this for a while, I have made quite a few more discoveries and breakthroughs. But first, we had to prove that the underlying ultrasonic signals were safe for daily use. Hence the following work, again with my PhD student Farzaneh Ahmadi:

Once the safety was assured, I also patented the basic ultrasound scanning technique!

Finally, after even more work, I had enough to write and then submit a paper which was able to show this promising new technique capable of operating with normal speaking humans to detect the state of their mouth (open or closed):

The system in the paper is called a “low-frequency ultrasound” or LFUS mouth state detection system.

Since this paper was written, significant work has been done to advance on the very basic initial system it describes. The newer technique is much faster, much more accurate, less computationally complex, and has been demonstrated for continuous speech (the paper above only handled sustained vowel-like sounds)... The following year, working on my own, I extended the technique and did quite a bit of experimentation to refine the ideas:

Finally, I extended the analysis to the two journal papers mentioned above:

This area is currently a work in (very slow) progress, but please expect more soon!

© 2013, 2014, 2015 by Professor Ian McLoughlin of NELSLIP and USTC.