Realistic ultrasound tongue image synthesis using Generative Adversarial Networks

Nadia Hajjej


Many current linguistic studies analyze 2D ultrasound images, and articulatory data in particular. Objectively assessing the quality of the methods developed for this purpose requires a large dataset of images. Generating ultrasound images can therefore facilitate the use of higher-level representations of the tongue in a variety of applications in speech research [1].

To build the dataset, we chose Generative Adversarial Networks (GANs) [2], a family of unsupervised machine learning techniques that can learn to mimic a data distribution and generate new samples resembling it. A GAN is trained as a game between two players. The generator is the player that produces samples intended to follow the distribution of the training data. The other player, the discriminator, tries to distinguish real images from fake ones. The discriminator is usually trained with traditional supervised learning to divide its inputs into two classes (real or fake), while the generator is trained to fool the discriminator.
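The two-player game described above can be illustrated with the losses each player minimizes. The following is a minimal sketch, not the implementation used in this work: it assumes binary cross-entropy losses and the non-saturating generator loss discussed in [2], and the function names (`bce`, `discriminator_loss`, `generator_loss`) are hypothetical. The inputs are the discriminator's estimated probabilities that an image is real.

```python
import numpy as np

def bce(probs, target):
    """Mean binary cross-entropy between predicted probabilities and one constant label."""
    # Clip to avoid log(0) for fully confident predictions.
    probs = np.clip(np.asarray(probs, dtype=float), 1e-7, 1 - 1e-7)
    return float(np.mean(-(target * np.log(probs) + (1 - target) * np.log(1 - probs))))

def discriminator_loss(d_real, d_fake):
    """Supervised two-class loss: real images are labeled 1, generated images 0."""
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    """Assumed non-saturating loss: the generator wants its fakes to be scored as real."""
    return bce(d_fake, 1.0)

# An undecided discriminator (0.5 everywhere) versus a confident, correct one.
print(discriminator_loss([0.5, 0.5], [0.5, 0.5]))
print(discriminator_loss([0.99, 0.98], [0.01, 0.02]))
# The generator's loss drops as the discriminator is fooled more often.
print(generator_loss([0.1]), generator_loss([0.9]))
```

In an actual training loop, the two losses would be minimized in alternation, each with respect to the parameters of its own network.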

The dataset contains tongue ultrasound images, which we collected from a file generated by the ultrasound software [3]. This file contains raw ultrasound images stored consecutively, with each pixel stored as a 1-byte unsigned integer representing a grayscale intensity. The extracted raw images can be used to build ultrasound frames, which in turn can be assembled into a video illustrating the movement of the tongue. In our case, we are interested in the raw scanline ultrasound images themselves. After extracting them, we built a dataset of 27925 raw scanline images of size 64 × 842. For our project, we chose deep convolutional neural networks for both the generator and the discriminator, and fixed the number of hidden layers to nine in each. As hyperparameters, we used a batch size of 64 and 25 epochs. After every 100 iterations, we generated 64 ultrasound images.
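The extraction step and the training schedule above can be sketched as follows. This is an illustration under stated assumptions rather than the code used in this work: the file is assumed to hold nothing but consecutive frames of 64 × 842 one-byte pixels in row-major order, the scaling to [-1, 1] is a common convention for convolutional GANs and is not stated in the text, and the last incomplete batch of each epoch is assumed to be kept. All function names are hypothetical.

```python
import math
import numpy as np

N_SCANLINES = 64   # first image dimension given in the text
N_SAMPLES = 842    # second image dimension given in the text

def load_raw_frames(raw_bytes, n_scanlines=N_SCANLINES, n_samples=N_SAMPLES):
    """Split a byte string of consecutive 8-bit frames into an (n_frames, 64, 842) array."""
    frame_size = n_scanlines * n_samples
    n_frames = len(raw_bytes) // frame_size          # ignore a trailing partial frame
    data = np.frombuffer(raw_bytes, dtype=np.uint8)[: n_frames * frame_size]
    # Assumption: pixels are stored scanline by scanline (row-major order).
    return data.reshape(n_frames, n_scanlines, n_samples)

def to_gan_range(frames):
    """Map grayscale intensities 0..255 to floats in [-1, 1] (assumed preprocessing)."""
    return frames.astype(np.float32) / 127.5 - 1.0

def training_schedule(n_images, batch_size, n_epochs, sample_every):
    """Return (iterations per epoch, total iterations, number of sampling points)."""
    iters_per_epoch = math.ceil(n_images / batch_size)   # keeps the last partial batch
    total = iters_per_epoch * n_epochs
    return iters_per_epoch, total, total // sample_every

# Synthetic stand-in for the recorded file: two frames plus a few stray bytes.
fake_file = (np.arange(2 * 64 * 842 + 10) % 256).astype(np.uint8).tobytes()
frames = load_raw_frames(fake_file)
print(frames.shape)                            # (2, 64, 842)
print(training_schedule(27925, 64, 25, 100))   # (437, 10925, 109)
```

With the figures from the text, 27925 images and a batch size of 64 give 437 iterations per epoch under these assumptions, so 25 epochs correspond to 10925 iterations and 109 points at which 64 images are sampled.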

To assess the generated raw images, we developed an internet-based test with 100 samples consisting of real and generated images. The participants' task was to assign a score (between 0 and 100%) to each image without knowing whether it was real or generated. The ‘bad’ generated images scored 36.1%, the ‘good’ generated images (from a full training) scored 70.73%, and the real ultrasound images scored 90.1%. These results suggest that GANs can generate ultrasound images that are realistic enough to deceive human observers to a considerable extent.
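The aggregation behind such a test can be sketched as below, assuming that the reported percentages are mean scores per image category. The function name and the category labels are hypothetical, and the scores in the usage example are invented placeholders that only show the data layout; they are not the responses collected in this study.

```python
from collections import defaultdict

def mean_score_by_category(responses):
    """Average the 0-100 scores of (category, score) pairs for each category."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, score in responses:
        if not 0 <= score <= 100:
            raise ValueError(f"score out of range: {score}")
        sums[category] += score
        counts[category] += 1
    return {category: sums[category] / counts[category] for category in sums}

# Placeholder responses only; NOT data from the listening-style test described above.
toy_responses = [
    ("real", 90), ("real", 80),
    ("good", 70),
    ("bad", 30), ("bad", 40),
]
print(mean_score_by_category(toy_responses))
```

Because participants do not know the origin of each image, the category label is attached only at analysis time, as in the pairs above.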

The results can be useful for predicting the next frame in an ultrasound image sequence or for motion detection of tongue contours within images [1].


[1] C. Wu, S. Chen, G. Sheng, P. Roussel, and B. Denby, “Predicting tongue motion in unlabeled ultrasound video using 3D convolutional neural networks,” in Proc. ICASSP, 2018.

[2] I. Goodfellow, “NIPS 2016 Tutorial: Generative Adversarial Networks,” Dec. 2016.

[3] T. G. Csapó, A. Deme, T. E. Gráczi, A. Markó, and G. Varjasi, “Synchronized speech, tongue ultrasound and lip movement video recordings with the ‘Micro’ system,” in Challenges in analysis and processing of spontaneous speech, 2017.


  • Nadia Hajjej Id.
    Electrical Engineering, master's program
    Master's program (MA/MSc)


  • Dr. Csapó Tamás Gábor
    Research fellow, Department of Telecommunications and Media Informatics