Journey through Sequence-to-Sequence models, Attention and Transformer

Kuldeep Sangwan
Jul 13, 2021 · 7 min read


Note: This post assumes that you already have some experience with recurrent networks and TensorFlow.

Why this blog

This blog gives you a brief idea of Autoencoders, the Sequence-to-sequence model, the Attention layer and the Transformer: what their highlighting factors are, and how ideas from multiple papers motivated a model like the Transformer, which is the state-of-the-art model at the time of writing.

I will give links along the way so that you can get a thorough understanding of each topic; please do read those.

Table of Contents

  1. Introduction
  2. Autoencoders
  3. Sequence-to-sequence model
  4. Sequence-to-sequence model with Context vector
  5. Sequence-to-sequence model with attention
  6. Transformer
  7. References

1. Introduction

I just wanted to trace where the idea of encoder-decoder models started. From my perspective, it started with autoencoders. Both share the same idea: the initial part encodes something, and the following part is a decoder.

2. Autoencoders

The idea of an autoencoder is that we have an input layer, one or more hidden layers and an output layer. The input and output layers have the same number of neurons, while the hidden layer has a smaller number of neurons and acts as a bottleneck layer.

Autoencoder Model

Why autoencoder?

We use an autoencoder to represent information in a smaller dimension. Say our inputs are word vectors with 100 dimensions, or binary (black-and-white) images of digits, that go through the network; the idea is to get the same 100-dimensional vector, or the same image, back at the output while having the bottleneck layer in the middle. So we train something like this, and at inference time we only use the encoder part, whose output comes from the bottleneck layer and has a smaller number of dimensions.
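To make this concrete, here is a minimal Keras sketch of such an autoencoder (the 100-dimensional input and the 32-neuron bottleneck are just illustrative choices, not from any particular paper):

```python
from tensorflow.keras import layers, Model

input_dim = 100      # illustrative: dimension of the input vectors
bottleneck_dim = 32  # illustrative: smaller bottleneck dimension

# Encoder: compresses the input down to the bottleneck representation
inputs = layers.Input(shape=(input_dim,))
encoded = layers.Dense(64, activation="relu")(inputs)
encoded = layers.Dense(bottleneck_dim, activation="relu")(encoded)

# Decoder: reconstructs the original input from the bottleneck
decoded = layers.Dense(64, activation="relu")(encoded)
decoded = layers.Dense(input_dim, activation="sigmoid")(decoded)

autoencoder = Model(inputs, decoded)
autoencoder.compile(optimizer="adam", loss="mse")

# At inference time we keep only the encoder, whose output is the
# smaller-dimensional representation from the bottleneck layer.
encoder = Model(inputs, encoded)
```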

For thorough explanation visit https://www.jeremyjordan.me/autoencoders/

3. Sequence-to-sequence model

Sequence-to-sequence models are encoder-decoder models; like autoencoders, their initial part encodes and the following part is a decoder. The encoder-decoder model is built from RNN units, where the basic unit is an LSTM or a GRU.

For understanding RNN — http://colah.github.io/posts/2015-08-Understanding-LSTMs/

There are different types of RNN models:

  • One to one
  • One to many
  • Many to one
  • Many to Many (same input and output length)
  • Many to Many (different input, output length)

The one we will focus on today is Many to Many (different input and output lengths), which is used for machine translation and is called the Sequence-to-sequence model.

What are Sequence-to-sequence models?

Sequence-to-sequence learning (Seq2Seq) is about training models to convert sequences from one domain (e.g., sentences in English) to sequences in another domain (e.g., the same sentences translated to French).

So, I will try to explain what the encoder and decoder look like.

We have an RNN layer (or a stack of them) acting as the “encoder”: it processes the input sequence and returns its own internal state. Note that we discard the outputs of the encoder RNN and only keep the state. This state will serve as the “context”, or “conditioning”, of the decoder in the next step.

Another RNN layer (or stack) acts as the “decoder”: it is trained to predict the next characters of the target sequence, given the previous characters of the target sequence. Specifically, it is trained to turn the target sequences into the same sequences but offset by one timestep in the future, a training process called “teacher forcing” in this context. Importantly, the decoder uses the state vectors from the encoder as its initial state, which is how the decoder obtains information about what it is supposed to generate. Effectively, the decoder learns to generate targets [t+1…] given targets […t], conditioned on the input sequence.
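The Keras blog linked below walks through this in detail; the following is a condensed sketch of that training setup, with illustrative token counts and an illustrative latent dimension:

```python
from tensorflow.keras import layers, Model

num_encoder_tokens = 71   # illustrative vocabulary sizes
num_decoder_tokens = 93
latent_dim = 256          # illustrative size of the LSTM state

# Encoder: we discard its outputs and keep only its internal states
encoder_inputs = layers.Input(shape=(None, num_encoder_tokens))
_, state_h, state_c = layers.LSTM(latent_dim, return_state=True)(encoder_inputs)
encoder_states = [state_h, state_c]

# Decoder: initialised with the encoder states and trained with teacher
# forcing, i.e. it sees the target sequence shifted by one timestep.
decoder_inputs = layers.Input(shape=(None, num_decoder_tokens))
decoder_lstm = layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(decoder_inputs, initial_state=encoder_states)
decoder_dense = layers.Dense(num_decoder_tokens, activation="softmax")
decoder_outputs = decoder_dense(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer="rmsprop", loss="categorical_crossentropy")
```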

For more understanding, visit: https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html

Disadvantage: it is not effective when sentences are long.

4. Sequence-to-sequence model with Context vector

Sequence-to-sequence model

In 2014, Ilya Sutskever and his team at Google introduced the concept of the context vector, where the output of the encoder RNN's last state is passed into the decoder as its input. This context vector is considered the essence of the whole input sentence.

In short, there are two RNNs/LSTMs. One we call the encoder: it reads the input sentence and tries to make sense of it before summarizing it. It passes the summary (the context vector) to the decoder, which translates the sentence from that summary alone.

Disadvantage: the main drawback of this approach is evident. If the encoder makes a bad summary, the translation will also be bad, and indeed it has been observed that the encoder creates a bad summary when it tries to understand longer sentences. Still, this performed better than the classic Sequence-to-sequence model.
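As a rough sketch of what this looks like at inference time, and reusing the names from the sketch in section 3 (encoder_inputs, encoder_states, decoder_inputs, decoder_lstm, decoder_dense, latent_dim), the encoder is run once to produce the context vector, and the decoder then generates the translation one token at a time from that summary alone:

```python
from tensorflow.keras import layers, Model

# Encoder model: maps an input sentence to its context vector (the LSTM states).
encoder_model = Model(encoder_inputs, encoder_states)

# Decoder model: given the previously generated token and the current states,
# it returns the next-token probabilities and the updated states.
state_h_in = layers.Input(shape=(latent_dim,))
state_c_in = layers.Input(shape=(latent_dim,))
dec_out, h, c = decoder_lstm(decoder_inputs, initial_state=[state_h_in, state_c_in])
dec_out = decoder_dense(dec_out)
decoder_model = Model([decoder_inputs, state_h_in, state_c_in], [dec_out, h, c])

# Greedy decoding loop (not shown): start from the context vector and a start
# token, then feed each predicted token back in until an end token appears.
```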

If you want to study the progression of context vector to attention models you can visit — https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/

5. Sequence-to-sequence model with attention

The attention mechanism is one of the most valuable breakthroughs in Deep Learning research in the last decade. It has spawned the rise of so many recent breakthroughs in natural language processing (NLP), including the Transformer architecture and Google’s BERT.

In 2015, Dzmitry Bahdanau introduced a machine-translation approach that aims at building a single neural network that can be jointly tuned to maximize translation performance. He argued that the use of a fixed-length vector (the context vector) is a bottleneck in improving the performance of the basic encoder-decoder architecture, and proposed to extend it by allowing the model to automatically (soft-)search for parts of the source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. This is the attention mechanism.

“A neural network is considered to be an effort to mimic human brain actions in a simplified manner. Attention Mechanism is also an attempt to implement the same action of selectively concentrating on a few relevant things, while ignoring others in deep neural networks.”

How the Bahdanau attention mechanism works

Bahdanau attention

The main difference between Sutskever's sequence-to-sequence model and Bahdanau attention is that Sutskever's model passes only the output of the last encoder RNN state to the decoder, whereas Bahdanau's model takes the outputs of n encoder states (n is a hyperparameter, but most of the time it equals the number of encoder states), combines them, and feeds the result to the 1st decoder state; it then does the same for the 2nd decoder unit, and so on.
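As a rough sketch of what this weighted combination looks like in code, here is a Bahdanau-style (additive) attention layer, similar in spirit to the one used in the TensorFlow tutorial linked below; the class name and the units argument are illustrative:

```python
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    """Additive attention: scores every encoder state against the current
    decoder state and returns their weighted sum (the context vector)."""
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects the decoder state
        self.W2 = tf.keras.layers.Dense(units)  # projects the encoder states
        self.V = tf.keras.layers.Dense(1)       # collapses each score to a scalar

    def call(self, query, values):
        # query:  current decoder hidden state, shape (batch, hidden)
        # values: all encoder hidden states,    shape (batch, src_len, hidden)
        query_with_time_axis = tf.expand_dims(query, 1)
        score = self.V(tf.nn.tanh(self.W1(query_with_time_axis) + self.W2(values)))
        attention_weights = tf.nn.softmax(score, axis=1)   # one weight per source word
        context_vector = tf.reduce_sum(attention_weights * values, axis=1)
        return context_vector, attention_weights
```

The attention weights are recomputed at every decoder step, so the decoder can focus on different source words while generating each target word.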

In English sentence structure, the subject and the predicate are the main parts: the subject of a sentence is the person, place, or thing performing the action (i.e., a noun or pronoun), whereas the predicate expresses the action or state within the sentence. The simple predicate contains the verb.

Example 1: Man builds a house.

So, when the decoder wants to predict the word “Man” (i.e., the subject), it needs to put more focus on the predicate.

Example 2: Despite originally being from Uttar Pradesh, as he was brought up in Bengal, he is more comfortable in Bengali.

In this group of sentences, if we want to predict the word “Bengali”, the phrases “brought up” and “Bengal” should be given more weight while predicting it, and although Uttar Pradesh is also a state's name, it should be ignored. So, by training the weights in the model, we want it to understand these characteristics of the sentence; this is what is called attention.

For better understanding of this — https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/

Implementation: https://www.tensorflow.org/text/tutorials/nmt_with_attention

Disadvantages:

  • Dealing with long-range dependencies is still challenging
  • The sequential nature (RNN units) of the model architecture prevents parallelization. These challenges are addressed by Google Brain's Transformer concept.

6. Transformer

RNN cells were made for working on sequence data; their big advantage is that they maintain the sequence information of the data, and an RNN layer can take a sentence of any length as input. But RNN cells can't be parallelized. So, if not RNNs, then how would you maintain the sequence information?

These were big challenges, but the Transformer's authors came up with a great solution.

“The Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.”

Here, “transduction” means the conversion of input sequences into output sequences. I will briefly explain a few topics that stand out and that are similar to the other models:

  • To preserve sequence information, they introduce an array whose values depend on sine and cosine waves and on the position of the word in the sentence, called the Positional Encoding (see the sketch after this list).
  • No RNN cells, just self-attention layers (the concept is similar to attention).
  • Instead of using only one self-attention head to capture a relation, it uses multiple, called Multi-Head Attention, to capture multiple relations.
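As a rough sketch of the first point, here is the sinusoidal positional encoding from the “Attention Is All You Need” paper, where even dimensions use sin(pos / 10000^(2i/d_model)) and odd dimensions use the matching cosine; the max_len and d_model values below are illustrative:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    # One row per position in the sentence, one column per embedding dimension.
    positions = np.arange(max_len)[:, np.newaxis]             # (max_len, 1)
    dims = np.arange(d_model)[np.newaxis, :]                  # (1, d_model)
    angle_rates = 1.0 / np.power(10000.0, (2 * (dims // 2)) / np.float32(d_model))
    angles = positions * angle_rates                          # (max_len, d_model)
    angles[:, 0::2] = np.sin(angles[:, 0::2])                 # sine on even dimensions
    angles[:, 1::2] = np.cos(angles[:, 1::2])                 # cosine on odd dimensions
    return angles

# The encoding is simply added to the word embeddings, so every position
# in the sentence gets a unique signature independent of sentence length.
pe = positional_encoding(max_len=50, d_model=512)
print(pe.shape)  # (50, 512)
```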

Please visit below links to see how they overcome these challenges and how the overall transformer works.

This is a phenomenal explanation with diagram and everything: http://jalammar.github.io/illustrated-transformer/

For a video explanation you can visit: https://www.youtube.com/playlist?list=PL86uXYUJ7999zE8u2-97i4KG_2Zpufkfb

If you want to continue with this and learn about a machine translation example, visit .

7. References

  1. https://www.jeremyjordan.me/autoencoders/
  2. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
  3. https://blog.keras.io/a-ten-minute-introduction-to-sequence-to-sequence-learning-in-keras.html
  4. https://www.analyticsvidhya.com/blog/2019/11/comprehensive-guide-attention-mechanism-deep-learning/
  5. https://www.tensorflow.org/text/tutorials/nmt_with_attention
  6. http://jalammar.github.io/illustrated-transformer/
  7. https://www.youtube.com/playlist?list=PL86uXYUJ7999zE8u2-97i4KG_2Zpufkfb
  8. https://www.appliedaicourse.com/

Contact me: Email, LinkedIn, GitHub
