A Wavenet for Speech Denoising

This week I started by reading some research papers. The one I read today was "A Wavenet for Speech Denoising" by Dario Rethage, Jordi Pons and Xavier Serra.

Currently, most speech denoising techniques use the magnitude spectrogram as a front-end, discarding the phase. To overcome this limitation, the paper builds on the WaveNet framework, using its acoustic modeling capabilities directly on raw audio while discarding its autoregressive nature, which significantly reduces inference time.
The model also uses non-causal, dilated convolutions. Non-causal means that the output at a given time step can also depend on future input samples. Dilated convolutions make the receptive field grow faster by introducing a factor called dilation in the convolution layer: the kernel taps are spaced apart, leaving gaps between the samples each tap reads.
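As a rough sketch of why dilation helps, the receptive field of a stack of kernel-size-3 convolutions can be computed directly (plain Python, illustrative only):

```python
# Sketch: receptive field of stacked dilated convolutions with kernel size 3.
# A layer with dilation d spaces its kernel taps d samples apart, so it
# widens the receptive field by (kernel_size - 1) * d samples.

def receptive_field(dilations, kernel_size=3):
    field = 1
    for d in dilations:
        field += (kernel_size - 1) * d
    return field

# Doubling the dilation each layer makes the field grow exponentially
# with depth, while the parameter count grows only linearly.
print(receptive_field([1, 2, 4, 8]))  # dilated stack: 31 samples
print(receptive_field([1] * 4))       # plain convolutions: 9 samples
```

With the same four layers, doubling dilations more than triple the context each output sample can see.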

Raw audio waveforms (which do not contain any header information) have been used successfully for generative tasks, but these models are usually autoregressive; an exception is the SEGAN model, which uses an adversarial network. In SEGAN, a discriminative network learns to distinguish generated signals from real ones, which pushes the generator to produce signals that closely resemble real ones. The two networks are trained together: as the generator improves, so does the discriminator, until the generated signals are almost indistinguishable from real ones.

In the next section of the paper, they discuss the WaveNet architecture. Each layer uses a gated activation unit: a tanh "filter" branch and a sigmoid "gate" branch whose outputs are combined with the operator drawn as a circle with a dot inside, which stands for the Hadamard product. In a Hadamard product the elements of two matrices are multiplied element-wise, so the dimensions of both matrices must be the same. The activation functions used are thus the sigmoid and the hyperbolic tangent.

The receptive field grows exponentially because the dilation is increased exponentially with depth. An 8-bit quantization is performed whenever a discrete softmax output distribution is used. A softmax maps the non-normalized outputs of a network to a probability distribution over the predicted output classes: the values are normalized into the range 0 to 1, with the sum of all values equal to 1.

Skip connections allow information to reach the final layers directly, bypassing the intermediate layers. This has several advantages: it makes training a deep model easier, and it propagates the information at each layer to the final layers. Another way WaveNet deepens the network is context stacks: blocks of layers, with the dilation growing up to a maximum factor within each block, are stacked on top of each other, as many times as desired.
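The gated activation unit and the Hadamard product can be sketched in NumPy (the shapes and random weights here are illustrative, not the paper's):

```python
import numpy as np

# Sketch of the gated activation unit: z = tanh(W_f x) ⊙ sigmoid(W_g x),
# where ⊙ is the Hadamard (element-wise) product.

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_unit(x, w_filter, w_gate):
    filter_out = np.tanh(w_filter @ x)  # "what" to pass, in (-1, 1)
    gate_out = sigmoid(w_gate @ x)      # "how much" to pass, in (0, 1)
    # Element-wise product: both branches must produce the same shape.
    return filter_out * gate_out

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
z = gated_unit(x, rng.standard_normal((8, 8)), rng.standard_normal((8, 8)))
print(z.shape)  # (8,)
```

Because tanh is bounded by (-1, 1) and the sigmoid gate by (0, 1), every output of the unit has magnitude below 1.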

The first modification they introduced: instead of the asymmetric receptive field of WaveNet, a symmetric receptive field is used, with the sample of interest placed at the center, achieved by symmetric padding at each dilated layer. This eliminates the causality constraint, giving the model access to as many samples in the future as in the past. They also found that the multimodal softmax output distribution allowed artifacts in the denoised signal, which meant that real-valued predictions were more appropriate for this problem.
Unlike WaveNet (generative), the model used here is discriminative. Since the proposed architecture is no longer autoregressive, the final 2 layers can use 3x1 filters instead of 1x1 filters. After observing the output of the model, they also made a change to the training data: they added noise-only inputs, whose target output is silence. This was done to ensure that the model produces silence when given only background noise.
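The symmetric padding that centers the target sample can be sketched as follows (a minimal NumPy illustration; the helper name and kernel size are my assumptions):

```python
import numpy as np

# Sketch: symmetric (non-causal) zero-padding before a dilated layer, so
# the target sample sits in the centre of its receptive field. A layer
# with kernel size 3 and dilation d needs d samples of context per side.

def pad_symmetric(signal, dilation, kernel_size=3):
    pad = (kernel_size - 1) // 2 * dilation  # equal past and future context
    return np.pad(signal, (pad, pad), mode="constant")

x = np.arange(10.0)
print(pad_symmetric(x, dilation=4).shape)  # (18,): 4 zeros on each side
```

A causal layer would instead place all the padding on the left, so the output could never see future samples.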

To test the network, they used data from 2 sources: speech data from the Voice Bank corpus, while environmental sounds were provided by the Diverse Environments Multichannel Acoustic Noise Database (DEMAND). The recordings were sampled at 48 kHz and downsampled to 16 kHz for this study. Samples are on average 3 seconds long, with a standard deviation of 1 second.

The model has 30 residual layers, with the dilation factor doubling in each layer (1, 2, 4, 8, ..., 256, 512). This pattern is repeated 3 times, with some additional changes which can be found in section 4.2 of the paper.
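Taking that schedule at face value (kernel size 3, dilations doubling from 1 to 512, the pattern repeated 3 times, and ignoring the final 3x1 output layers), the receptive field works out to:

```python
# Sketch: dilation schedule of the 30 residual layers and the receptive
# field it implies. Kernel size 3 is assumed, as in WaveNet-style models.

dilations = [2 ** i for i in range(10)] * 3   # 1..512, repeated 3 times
field = 1 + sum((3 - 1) * d for d in dilations)
print(len(dilations), field)  # 30 layers, 6139 samples
print(field / 16000)          # ~0.38 s of context at 16 kHz
```

So each denoised sample is predicted from roughly a third of a second of surrounding audio.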

The quality of the denoised signal was measured along 3 dimensions: signal distortion, background-noise interference, and overall quality. On the mean opinion score (MOS) scale, the proposed WaveNet scored 3.60, higher than Wiener filtering (2.92), which confirms that it is possible to learn multi-scale hierarchical representations directly from raw audio, instead of using magnitude spectrograms as a front-end, for the task of speech denoising.



