Mitsubishi Electric Separates Simultaneous Speech of Multiple Unknown Speakers Recorded with One Microphone

Speech-separation technology achieved with proprietary "Deep Clustering" AI method


TOKYO, May 24, 2017 - Mitsubishi Electric Corporation (TOKYO: 6503) announced today that it has created the world's first technology that separates in real time, and then reconstructs with high quality, the simultaneous speech of multiple unknown speakers recorded with a single microphone. In tests, the simultaneous speech of two and three people was separated with up to 90 and 80 percent accuracy, respectively, which the company believes are world's firsts as of this announcement. The novel technology, which was realized with Mitsubishi Electric's proprietary "Deep Clustering" method based on artificial intelligence (AI), is expected to contribute to more intelligible voice communications and more accurate automatic speech recognition.

In the case of two simultaneous speakers, accuracy exceeded 90 percent, sufficient for commercial applications, compared with 51 percent accuracy using conventional technology. The new technology is able to distinguish speakers across combinations of several spoken languages and genders. The above results are based on ideal recording conditions, including low ambient noise and speakers talking at roughly similar volume.

The Deep Clustering technology uses Mitsubishi Electric's proprietary deep-learning method to learn how to encode signal components of the original speech data of multiple people so that signal components belonging to each individual speaker can be easily distinguished by their encodings. To accomplish this, the encodings are optimized such that different signal components belonging to the same speaker have similar encodings, and those belonging to different speakers have dissimilar encodings. The learned encoding transformation is applied to the input speech, and the encodings of the signal components of each speaker are identified using a clustering algorithm, which groups data points according to their similarity. Each person's speech is then reconstructed by resynthesizing their separated speech components.
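
The encode, cluster, and reconstruct steps described above can be illustrated with a minimal Python sketch. It is not Mitsubishi Electric's implementation: the deep network that produces the encodings is omitted and replaced with idealized encodings, the signal components are assumed to be time-frequency bins of a spectrogram, and the function names (affinity_loss, kmeans_labels, separate) are hypothetical. The loss shown is one common way to express "similar encodings for the same speaker, dissimilar for different speakers"; the release does not state the exact objective used.

```python
import numpy as np

def affinity_loss(embeddings, labels):
    """Assumed training objective: ||V V^T - Y Y^T||_F^2.

    V holds one encoding per signal component, Y is the one-hot speaker
    label of each component. The loss is zero when components of the same
    speaker have matching encodings and components of different speakers
    have orthogonal ones.
    """
    V = np.asarray(embeddings, dtype=float)
    Y = np.eye(int(labels.max()) + 1)[labels]
    return float(np.sum((V @ V.T - Y @ Y.T) ** 2))

def kmeans_labels(points, k, n_iter=20):
    """Plain k-means with a deterministic farthest-point initialization."""
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.linalg.norm(points[:, None, :] - np.array(centers)[None, :, :], axis=2)
        centers.append(points[d.min(axis=1).argmax()])
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        assign = dists.argmin(axis=1)
        for j in range(k):
            if np.any(assign == j):
                centers[j] = points[assign == j].mean(axis=0)
    return assign

def separate(mixture, embeddings, n_speakers):
    """Cluster the encodings, then mask the mixture once per cluster."""
    assign = kmeans_labels(embeddings, n_speakers).reshape(mixture.shape)
    return [mixture * (assign == j) for j in range(n_speakers)]

# Toy data (illustrative only): a 3x5 grid of components, each dominated
# by speaker 0 or speaker 1.
true_owner = np.array([[0, 0, 1, 1, 0],
                       [0, 1, 1, 0, 0],
                       [1, 1, 0, 0, 1]])
mixture = np.arange(1.0, 16.0).reshape(3, 5)   # stand-in magnitudes
embeddings = np.eye(2)[true_owner.ravel()]     # idealized learned encodings
sources = separate(mixture, embeddings, n_speakers=2)
```

In a real system the encodings would come from the trained network applied to the recorded mixture, and each masked spectrogram would be converted back to a waveform; both steps are outside the scope of this sketch.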

Accuracy in Separating Simultaneous Speech of Multiple Speakers*

                          Two speakers          Three speakers
                          (single microphone)   (single microphone)
New technology            >90% (world's first)  >80% (world's first)
Conventional technology   51%

*Based on ideal recording conditions

Note that the releases are accurate at the time of publication but may be subject to change without notice.