Special on Transformer Models for Natural Language Processing
Deep Transformer Models for speech and language processing using RETURNN
“In this talk, we will present our recent development of Transformer-based speech and language models at Prof. Hermann Ney’s Chair of Computer Science 6 at RWTH Aachen University.
The first part of the talk, given by Kazuki Irie, will focus on language modeling with deep Transformers, with application to automatic speech recognition. We show how the Transformer architecture, originally proposed for machine translation, can be scaled up to accommodate the large training data of the language modeling task, and achieves excellent performance for automatic speech recognition.
The second part of the talk, given by Albert Zeyer, will focus on our software RETURNN, RWTH’s TensorFlow-based framework for neural networks. Its flexible implementation, which allows researchers to experiment with various model architectures as well as different tasks, will be described. This flexibility will be illustrated by an example of end-to-end speech recognition entirely based on the Transformer.”
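As background for the first part of the talk, the sketch below shows, in plain TensorFlow/Keras, the shape of a decoder-only Transformer language model: causal self-attention plus a position-wise feed-forward network, stacked on top of token and position embeddings. This is a generic, minimal illustration with placeholder sizes; it is not RETURNN code and does not reproduce the models discussed in the talk.

```python
# Minimal sketch of a decoder-only Transformer language model in plain
# TensorFlow/Keras. Generic illustration only: not RETURNN, placeholder sizes.
import tensorflow as tf


class TransformerLMBlock(tf.keras.layers.Layer):
    """One decoder-style block: causal self-attention + feed-forward,
    with residual connections and layer normalization."""

    def __init__(self, d_model=512, num_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = tf.keras.layers.MultiHeadAttention(
            num_heads=num_heads, key_dim=d_model // num_heads, dropout=dropout)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(d_ff, activation="relu"),
            tf.keras.layers.Dense(d_model),
        ])
        self.norm1 = tf.keras.layers.LayerNormalization()
        self.norm2 = tf.keras.layers.LayerNormalization()
        self.drop = tf.keras.layers.Dropout(dropout)

    def call(self, x, training=False):
        # Causal mask: position i may only attend to positions j <= i.
        seq_len = tf.shape(x)[1]
        i = tf.range(seq_len)[:, None]
        j = tf.range(seq_len)[None, :]
        causal_mask = (j <= i)[None, :, :]  # (1, T, T), broadcast over batch
        attn_out = self.attn(query=x, value=x, key=x,
                             attention_mask=causal_mask, training=training)
        x = self.norm1(x + self.drop(attn_out, training=training))
        ffn_out = self.ffn(x)
        return self.norm2(x + self.drop(ffn_out, training=training))


class TinyTransformerLM(tf.keras.Model):
    """Token + position embeddings, a stack of blocks, and an output projection."""

    def __init__(self, vocab_size=32000, d_model=512, num_layers=6, max_len=512):
        super().__init__()
        self.tok_emb = tf.keras.layers.Embedding(vocab_size, d_model)
        self.pos_emb = tf.keras.layers.Embedding(max_len, d_model)
        self.blocks = [TransformerLMBlock(d_model=d_model) for _ in range(num_layers)]
        self.out_proj = tf.keras.layers.Dense(vocab_size)

    def call(self, tokens, training=False):
        positions = tf.range(tf.shape(tokens)[1])
        x = self.tok_emb(tokens) + self.pos_emb(positions)
        for block in self.blocks:
            x = block(x, training=training)
        return self.out_proj(x)  # next-token logits, shape (batch, time, vocab)


# Example: next-word prediction on a dummy batch of token ids.
lm = TinyTransformerLM()
dummy_tokens = tf.random.uniform((2, 16), maxval=32000, dtype=tf.int32)
logits = lm(dummy_tokens)  # (2, 16, 32000)
```

Scaling such a model up for speech recognition, as described in the talk, is mainly a matter of deeper stacks, larger layer sizes, and training on much more text than this toy configuration suggests.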
Bidirectional Encoder Representations from Transformers (BERT)
“BERT is a state-of-the-art natural language processing (NLP) model that allows pretraining on unlabelled text data and later transfer learning to a variety of NLP tasks. Due to its promising novel ideas and impressive performance, we chose it as a core component for a new natural language generation product. Reading a paper and maybe following a tutorial with example code is, however, a totally different thing from putting a working piece of software into production.
In this talk we will tell you how we trained a custom version of the BERT network and included it in a natural language generation (NLG) application. You will hear how we arrived at the decision to use BERT and what other approaches we tried. A number of changes relative to the vanilla BERT paper will be discussed that allowed us to train and deploy the network on consumer-grade GPUs and make it highly cost-effective, including a morph-based input encoding to reduce dimensionality and add side-channel knowledge, and of course a lot of hyperparameter tuning.
We will tell you about the failures and the mistakes we made so you do not have to repeat them, but also about the surprises, successes and lessons learned.”
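To make the pretraining idea from the abstract concrete: BERT is pretrained with a masked-language-model objective, where a fraction of the input tokens is hidden and the network must reconstruct them from context. The sketch below shows the standard masking recipe from the BERT paper (mask 15% of tokens; of those, 80% become [MASK], 10% become a random token, 10% stay unchanged). It is an illustrative reconstruction of that recipe, not the speakers’ code; the token-id constants are placeholder assumptions.

```python
# Sketch of BERT-style masked-LM example creation (illustrative only, not the
# speakers' code). Token-id constants below are placeholder assumptions.
import random

MASK_ID = 103          # assumed id of the [MASK] token
VOCAB_SIZE = 30000     # assumed vocabulary size
IGNORE_INDEX = -100    # label value for positions that are not predicted


def create_masked_lm_example(token_ids, mask_prob=0.15, rng=random):
    inputs = list(token_ids)
    labels = [IGNORE_INDEX] * len(inputs)
    for pos, tok in enumerate(token_ids):
        if rng.random() >= mask_prob:
            continue                     # leave ~85% of positions untouched
        labels[pos] = tok                # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            inputs[pos] = MASK_ID                   # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[pos] = rng.randrange(VOCAB_SIZE)  # 10%: random token
        # remaining 10%: keep the original token unchanged
    return inputs, labels


# Example usage with a toy sequence of token ids.
masked_inputs, lm_labels = create_masked_lm_example([7, 42, 99, 512, 8, 3])
```

The morph-based input encoding mentioned in the abstract replaces the usual subword vocabulary at exactly this tokenization stage, which is what shrinks the embedding and output layers enough to fit consumer-grade GPUs.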