Machine Translation of Indian Languages using Transformer

Kuldeep Sangwan
7 min read · Jul 13, 2021


Note: This blog assumes you have a theoretical understanding of the Transformer architecture; if not, please go through an introduction to Transformers first.

Table of Contents

  1. Problem Statement
  2. Tokenization and Embedding
  3. Dataset
  4. Data cleaning
  5. Creating Vocab file from our corpus
  6. Custom Tokenizer
  7. Transformer Model
  8. Results
  9. References

1. Problem Statement

There are many languages in the world, but good translators exist for only a few of them. For example, there are no good translators from English to low-resourced Indian languages like Tamil, Malayalam, Telugu or Bengali. There are multiple reasons for that:

  • The obvious one: these languages are less represented in digital text, so only small parallel datasets are available.
  • English to Indian language translation poses the challenge of morphological and structural divergence: parallel corpora are scarce, the Indian languages are morphologically rich, and word order differs due to syntactic divergence. English follows Subject-Verb-Object (SVO) order, whereas Tamil and Malayalam follow Subject-Object-Verb (SOV).
  • Many translators used older, conventional technologies such as Rule-Based Machine Translation (RBMT) or corpus-based approaches, and these approaches have their own flaws.
  • Another reason for the poor performance of these translators is not adopting the approaches and models introduced in recent years, such as the Transformer, sequence-to-sequence models with bidirectional LSTMs and attention layers, and newer ways to encode words like Byte-Pair Encoding (BPE) / subword tokenization.

In this blog we are going to work on translating English to Malayalam.

Before starting on the Transformer model: there is a newer approach called subword tokenization that we will use in the model, so it is crucial to understand it beforehand.

2. Tokenization and Embedding

Previously, for tokenization we used techniques like bag of words or TF-IDF, but these do not carry any information about the semantic meaning of words. So we moved to word embeddings, which give us a dense vector for each word; from that vector we can capture semantic meaning, e.g. if two words are similar, the distance between their vectors is small. Examples of such models are Word2Vec and GloVe.

But this has issues: since these word embedding models, pre-trained on Wikipedia, were limited by vocabulary size or by word-occurrence frequency, rare words like athazagoraphobia would never be captured, resulting in unknown <unk> tokens when they occur in the text.

That’s when BPE comes into the picture. It was adapted slightly and used as a subword tokenization technique. The idea: when you build a vocabulary and the corpus contains a word like “Hyperphobia”, if “Hyper” and “phobia” occur frequently enough, then “Hyper” and “phobia” are stored in the vocabulary rather than “Hyperphobia”. At inference time, if you encounter a word like “Hypertension”, it is split into “Hyper” and “tension” (assuming “tension” is in the vocabulary) and tokenized, so we retain semantic information about both pieces. The conclusion: even if “Hypertension” is not in your vocabulary, you do not get an <unk> token.
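To make the idea concrete, here is a tiny, self-contained sketch of how a learned subword vocabulary is applied at inference time using greedy longest-prefix matching (WordPiece-style splitting, where "##" marks a continuation piece). The toy vocabulary is made up for illustration, and learning the vocabulary itself via BPE merges is a separate step not shown here:

```python
# Toy subword vocabulary (made up); "##" marks a piece that continues a word.
vocab = {"hyper", "##phobia", "##tension"}

def split_into_subwords(word, vocab):
    """Greedy longest-prefix match of `word` against a fixed subword vocab."""
    word = word.lower()
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation-piece convention
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["<unk>"]  # no subword matches at all
        pieces.append(piece)
        start = end
    return pieces

print(split_into_subwords("Hypertension", vocab))  # ['hyper', '##tension']
print(split_into_subwords("Hyperphobia", vocab))   # ['hyper', '##phobia']
```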

For a clearer explanation, you can visit: https://towardsdatascience.com/byte-pair-encoding-the-dark-horse-of-modern-nlp-eb36c7df4f10

https://towardsdatascience.com/word-embeddings-for-nlp-5b72991e01d4

Now, before starting on our Transformer model, we need a few more things:

  • Dataset
  • Data cleaning
  • Vocab file from our corpus and
  • Custom Tokenizer

3. Dataset

I have downloaded the dataset from https://github.com/himanshudce/Indian-Language-Dataset (it has a large amount of data for low-resourced languages like Tamil, Malayalam, Telugu and Bengali). After loading the data into a pandas DataFrame, it looks like this:

Sample Dataset
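For reference, here is a minimal loading sketch; the file names and the one-sentence-per-line format are assumptions about the repository layout, not the original notebook's code:

```python
import pandas as pd

# Hypothetical file names; adjust to the actual files in the repository.
with open("english.txt", encoding="utf-8") as f_en, \
     open("malayalam.txt", encoding="utf-8") as f_ml:
    english_lines = f_en.read().splitlines()
    malayalam_lines = f_ml.read().splitlines()

# One row per parallel sentence pair.
df = pd.DataFrame({"english": english_lines, "malayalam": malayalam_lines})
print(df.head())
```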

4. Data cleaning

  • The preprocess_english method cleans the English text.
  • The preprocess_malayalam method cleans the Malayalam text (a rough sketch of both is shown below).
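The cleaning gist isn't embedded in this post, so below is a minimal sketch of what these two functions might look like; the exact cleaning rules (which characters to keep, how punctuation is handled) are assumptions:

```python
import re

def preprocess_english(text):
    # Lowercase, keep only Latin letters, digits and basic punctuation,
    # then squeeze repeated whitespace.
    text = text.lower().strip()
    text = re.sub(r"[^a-z0-9?.!,']+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

def preprocess_malayalam(text):
    # Keep Malayalam characters (Unicode block U+0D00..U+0D7F), digits and
    # basic punctuation, then squeeze repeated whitespace.
    text = text.strip()
    text = re.sub(r"[^\u0D00-\u0D7F0-9?.!,]+", " ", text)
    return re.sub(r"\s+", " ", text).strip()

print(preprocess_english("Hello,   World!! How are you?"))  # hello, world!! how are you?
```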

5. Creating Vocab file from our corpus

Before starting with the vocab file and custom tokenizer, please go through tensorflow_text. It is a companion library for TensorFlow that provides the methods we need to build a vocabulary file and a custom tokenizer; for a better explanation, refer to the tensorflow_text documentation. A sketch of the vocabulary-building step follows.
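The vocabulary-building gist isn't embedded here; the sketch below follows the approach from the TensorFlow subword tokenizer tutorial, and the dataset variables, vocab size and output file names are assumptions:

```python
import tensorflow as tf
from tensorflow_text.tools.wordpiece_vocab import bert_vocab_from_dataset as bert_vocab

# Assumed: df is the pandas DataFrame of cleaned parallel sentences from above.
train_en = tf.data.Dataset.from_tensor_slices(df["english"].values)
train_ml = tf.data.Dataset.from_tensor_slices(df["malayalam"].values)

reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]

en_vocab_args = dict(vocab_size=8000, reserved_tokens=reserved_tokens,
                     bert_tokenizer_params=dict(lower_case=True), learn_params={})
# lower_case=False for Malayalam so normalization does not strip combining vowel signs.
ml_vocab_args = dict(vocab_size=8000, reserved_tokens=reserved_tokens,
                     bert_tokenizer_params=dict(lower_case=False), learn_params={})

en_vocab = bert_vocab.bert_vocab_from_dataset(train_en.batch(1000).prefetch(2), **en_vocab_args)
ml_vocab = bert_vocab.bert_vocab_from_dataset(train_ml.batch(1000).prefetch(2), **ml_vocab_args)

def write_vocab_file(filepath, vocab):
    # One subword token per line; reserved tokens come first.
    with open(filepath, "w", encoding="utf-8") as f:
        for token in vocab:
            print(token, file=f)

write_vocab_file("en_vocab.txt", en_vocab)
write_vocab_file("ml_vocab.txt", ml_vocab)
```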

6. Custom Tokenizer
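The tokenizer gist isn't embedded here either. Below is a minimal sketch using text.BertTokenizer on the vocab files written above, with [START]/[END] handling along the lines of the TensorFlow tutorial; the assumption is that the reserved tokens sit at the top of the vocab files, so their indices can be looked up from the reserved-token list:

```python
import tensorflow as tf
import tensorflow_text as text

# WordPiece tokenizers backed by the vocab files from the previous step.
en_tokenizer = text.BertTokenizer("en_vocab.txt", lower_case=True)
ml_tokenizer = text.BertTokenizer("ml_vocab.txt", lower_case=False)

reserved_tokens = ["[PAD]", "[UNK]", "[START]", "[END]"]
START = tf.argmax(tf.constant(reserved_tokens) == "[START]")  # index of [START] in the vocab file
END = tf.argmax(tf.constant(reserved_tokens) == "[END]")      # index of [END] in the vocab file

def tokenize(tokenizer, texts):
    # tokenize() returns (batch, words, wordpieces); merge the last two dims
    # to get one flat token sequence per sentence, then add [START]/[END].
    tokens = tokenizer.tokenize(texts).merge_dims(-2, -1)
    count = tokens.bounding_shape()[0]
    starts = tf.fill([count, 1], START)
    ends = tf.fill([count, 1], END)
    return tf.concat([starts, tokens, ends], axis=1)

ids = tokenize(en_tokenizer, ["machine translation of indian languages"])
print(ids)                           # ragged tensor of subword ids
print(en_tokenizer.detokenize(ids))  # round-trip back to (lower-cased) words
```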

7. Transformer Model

I am going to explain a few of the important components that the Transformer uses, along with the code.

Transformer Model

Positional encoding

Since this model doesn’t contain any recurrence or convolution, positional encoding is added to give the model some information about the relative position of the words in the sentence.

The positional encoding vector is added to the embedding vector. Embeddings represent a token in a d-dimensional space where tokens with similar meaning are closer to each other, but they do not encode the relative position of words in a sentence. So, after adding the positional encoding, words are closer to each other in the d-dimensional space based on both the similarity of their meaning and their position in the sentence.

The formula for calculating the positional encoding is as follows:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model))

Code:
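The gist with the implementation isn't embedded in this post; here is a sketch following the standard TensorFlow tutorial implementation of positional encoding:

```python
import numpy as np
import tensorflow as tf

def get_angles(pos, i, d_model):
    # Each pair of dimensions gets its own sinusoid frequency.
    angle_rates = 1 / np.power(10000, (2 * (i // 2)) / np.float32(d_model))
    return pos * angle_rates

def positional_encoding(position, d_model):
    angle_rads = get_angles(np.arange(position)[:, np.newaxis],  # (position, 1)
                            np.arange(d_model)[np.newaxis, :],   # (1, d_model)
                            d_model)
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])  # sine on even indices
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])  # cosine on odd indices
    pos_encoding = angle_rads[np.newaxis, ...]          # (1, position, d_model)
    return tf.cast(pos_encoding, dtype=tf.float32)
```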

Scaled dot product attention

Scaled dot product attention diagram

The attention function used by the transformer takes three inputs: Q (query), K (key), V (value). The equation used to calculate the attention weights is:

Attention(Q, K, V) = softmax(Q·K^T / sqrt(d_k)) · V

The dot-product attention is scaled by a factor of the square root of the depth d_k. This is done because for large values of the depth, the dot product grows large in magnitude, pushing the softmax function into regions where it has small gradients and produces a very hard (nearly one-hot) distribution.

For example, consider that Q and K have a mean of 0 and variance of 1. Their matrix multiplication will have a mean of 0 and variance of d_k. So the square root of d_k is used for scaling, giving a consistent variance regardless of the value of d_k. If the variance is too low, the output may be too flat to optimize effectively. If the variance is too high, the softmax may saturate at initialization, making it difficult to learn.

The mask is multiplied with -1e9 (close to negative infinity). This is done because the mask is summed with the scaled matrix multiplication of Q and K and is applied immediately before a softmax. The goal is to zero out these cells, and large negative inputs to softmax are near zero in the output.

Code:
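Again, the original gist isn't embedded; a sketch following the TensorFlow tutorial implementation:

```python
def scaled_dot_product_attention(q, k, v, mask):
    # q: (..., seq_len_q, depth), k: (..., seq_len_k, depth), v: (..., seq_len_k, depth_v)
    matmul_qk = tf.matmul(q, k, transpose_b=True)            # (..., seq_len_q, seq_len_k)
    dk = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_attention_logits = matmul_qk / tf.math.sqrt(dk)   # scale by sqrt(d_k)
    if mask is not None:
        scaled_attention_logits += (mask * -1e9)             # masked positions -> ~0 after softmax
    attention_weights = tf.nn.softmax(scaled_attention_logits, axis=-1)  # sums to 1 over seq_len_k
    output = tf.matmul(attention_weights, v)                 # (..., seq_len_q, depth_v)
    return output, attention_weights
```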

As the softmax normalization is done on K, its values decide the amount of importance given to Q.

Multi-head attention

multi-head attention diagram

Multi-head attention consists of four parts:

  • Linear layers and split into heads
  • Scaled dot-product attention
  • Concatenation of heads
  • Final linear layer

Each multi-head attention block gets three inputs: Q (query), K (key), V (value). These are put through linear (Dense) layers and split into multiple heads.

The scaled_dot_product_attention defined above is applied to each head (broadcasted for efficiency). An appropriate mask must be used in the attention step. The attention output for each head is then concatenated and put through a final Dense layer.

Instead of one single attention head, Q, K, and V are split into multiple heads because it allows the model to jointly attend to information from different representation subspaces at different positions. After the split each head has a reduced dimensionality, so the total computation cost is the same as a single head attention with full dimensionality.

The output represents the multiplication of the attention weights and the V (value) vector. This ensures that the words you want to focus on are kept as-is and the irrelevant words are flushed out.

Code:
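A sketch of the multi-head attention layer, again along the lines of the TensorFlow tutorial implementation, reusing the scaled_dot_product_attention function defined above:

```python
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_model = d_model
        self.depth = d_model // num_heads            # per-head dimensionality
        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)  # final linear layer

    def split_heads(self, x, batch_size):
        # (batch, seq_len, d_model) -> (batch, num_heads, seq_len, depth)
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask):
        batch_size = tf.shape(q)[0]
        q = self.split_heads(self.wq(q), batch_size)
        k = self.split_heads(self.wk(k), batch_size)
        v = self.split_heads(self.wv(v), batch_size)
        # Attention runs on all heads in parallel (broadcast over the head axis).
        scaled_attention, attention_weights = scaled_dot_product_attention(q, k, v, mask)
        # (batch, num_heads, seq_len_q, depth) -> (batch, seq_len_q, d_model)
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention, (batch_size, -1, self.d_model))
        return self.dense(concat_attention), attention_weights
```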

Point wise feed forward network

Point wise feed forward network consists of two fully-connected layers with a ReLU activation in between.
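A sketch of this block (two Dense layers with a ReLU in between, matching the description above; dff is the hidden width, typically larger than d_model):

```python
def point_wise_feed_forward_network(d_model, dff):
    return tf.keras.Sequential([
        tf.keras.layers.Dense(dff, activation='relu'),  # (batch, seq_len, dff)
        tf.keras.layers.Dense(d_model)                  # (batch, seq_len, d_model)
    ])
```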

You can visit my code: https://github.com/KuldeepSangwan/MachineTranslation/blob/main/Tranformer_model_1.ipynb

For a detailed explanation of the code, you can visit https://www.tensorflow.org/text/tutorials/transformer

8. Results

I have also trained a sequence-to-sequence model on the same data, so we can compare the BLEU scores of both models.

BLEU score comparison between the sequence-to-sequence model and the Transformer

9. References

Contact me: Email, LinkedIn, GitHub
