Demystifying Part-of-Speech (POS) Tagging Techniques for Accurate Language Analysis

Tavish Aggarwal

December 20, 2023

Part-of-speech (POS) tagging is a natural language processing (NLP) technique that involves assigning grammatical labels (such as noun, verb, adjective, adverb, etc.) to individual words within a sentence. This procedure helps decipher word meanings, comprehend relationships between words, and facilitate a variety of linguistic and computational analyses of textual data. It also offers insights into the syntactic structure of the text.

In this post, we will cover various techniques that can be used to assign the POS tags to the words in the sentences using the NLTK library in Python. Let's get started.

Lexicon Based Approach

This is the simplest of all the available approaches. For each word, the algorithm assigns the POS tag that most frequently occurs for that word in the training corpus. As a consequence, no POS tag can be assigned to a word that doesn't appear in the training corpus.

For example, consider the two sentences 'I went for a run/NN' and 'I run/VB in the morning'. A lexicon tagger will tag 'run' with its most frequent tag. Since 'run' appears as a verb in most contexts, it will be wrongly tagged in the first sentence. This is the key limitation of the lexicon-based tagger.

Let us look at Python code to understand how to build a lexicon-based tagger. In this example, we use the treebank-tagged dataset that comes bundled with the NLTK library and already has POS tags assigned to its words.

import nltk
# Downloading the treebank dataset if it's not downloaded already
nltk.download('treebank')

import numpy as np
import pandas as pd
import random
from sklearn.model_selection import train_test_split
from nltk.tokenize import word_tokenize
import math

# Reading the Treebank tagged sentences
wsj = list(nltk.corpus.treebank.tagged_sents())

# Converting the list of sents to a list of (word, pos tag) tuples
tagged_words = [tup for sent in wsj for tup in sent]

# Splitting into train and test
train_set, test_set = train_test_split(wsj, test_size=0.3)

# Lexicon (or unigram tagger)
unigram_tagger = nltk.UnigramTagger(train_set)
unigram_tagger.evaluate(test_set)

The NLTK treebank corpus is a list of tagged sentences, where each sentence ends with a full stop '.' whose POS tag is also '.'. Since we don't need the corpus segmented by sentences for frequency counts, we also flatten it into a list of (word, tag) tuples. The UnigramTagger assigns each word the tag most commonly associated with it in the training data.
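
For a quick sanity check, the trained tagger can be applied to a new tokenized sentence. This is only an illustrative usage sketch; the exact output depends on the train/test split.

# Tagging a new sentence with the trained unigram tagger
print(unigram_tagger.tag(['I', 'went', 'for', 'a', 'run']))

# A word never seen in the training corpus (a made-up token here) gets the tag None
print(unigram_tagger.tag(['xyzzy']))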

Rule-based approach

In the rule-based POS tagging approach, we define rules that apply to the entire text, such as 'replace VB with NN if the previous tag is DT' or 'tag all words ending with ing as VBG'. Rule-based taggers are also known as syntagmatic taggers.

Let us look at an example; we will evaluate this approach on the same train/test split created in the previous treebank example.

# Specify patterns for tagging
# Example from the NLTK book
patterns = [
    (r'.*ing$', 'VBG'),              # gerund
    (r'.*ed$', 'VBD'),               # past tense
    (r'.*es$', 'VBZ'),               # 3rd singular present
    (r'.*ould$', 'MD'),              # modals
    (r'.*\'s$', 'NN$'),              # possessive nouns
    (r'.*s$', 'NNS'),                # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
    (r'.*', 'NN')                    # nouns
]

regexp_tagger = nltk.RegexpTagger(patterns)

regexp_tagger.evaluate(test_set)

As shown in the example above, we use RegexpTagger to tag words based on the defined rules.

It is important to note that defining such rules requires some exploratory data analysis and intuition, so that the resulting rules generalize across the entire corpus. Let us return to the treebank example and apply both the rule-based and lexicon-based taggers to the dataset.

# rule based tagger
rule_based_tagger = nltk.RegexpTagger(patterns)

# lexicon backed up by the rule-based tagger
lexicon_tagger = nltk.UnigramTagger(train_set, backoff=rule_based_tagger)

lexicon_tagger.evaluate(test_set)

The rule-based tagger on its own is quite ineffective since it has only a few handwritten rules. However, if we combine the lexicon and the rule-based tagger, we can potentially create a tagger much better than either one individually. NLTK provides a convenient way to combine taggers using the 'backoff' argument, as shown in the example above.
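
The same idea extends to longer backoff chains. Below is a minimal sketch (reusing the train_set and patterns defined above) in which a bigram tagger backs off to a unigram tagger, which in turn backs off to the rule-based tagger:

# Backoff chain: bigram -> unigram -> rule-based
rule_tagger = nltk.RegexpTagger(patterns)
uni_tagger = nltk.UnigramTagger(train_set, backoff=rule_tagger)
bi_tagger = nltk.BigramTagger(train_set, backoff=uni_tagger)

bi_tagger.evaluate(test_set)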

Probabilistic (or Stochastic) taggers

Unlike lexicon-based taggers, probabilistic taggers don't naively assign the highest-frequency tag to each word; instead, they look at slightly longer parts of the sequence, often using the tag(s) and word(s) appearing before the target word to be tagged.

This approach uses Bayes' theorem and the chain rule of probability to assign POS tags to words. Before moving ahead, if you are not sure how the Bayes algorithm works, I recommend going through the posts Naïve Bayes Algorithm Detailed Explanation and Unlocking the Power of Naive Bayes Algorithm for Text Classification.

To summarize the Bayes algorithm, suppose we have two features X = (x1, x2) and a binary target variable y. According to Bayes' rule, the probability of a point (x1, x2) belonging to the target class c1 is given by:

$$P(class = c1|x1, x2) = {P(x1, x2|c1).P(c1) \over P(x1, x2)}$$

Now, applying the chain rule of probability together with the naive assumption that the features are conditionally independent given the class, the term P(x1, x2 | c1) can be rewritten as P(x1|c1) · P(x2|c1).
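
As a quick illustration of this factorization with made-up numbers (the priors and conditional probabilities below are purely hypothetical):

# Toy illustration of the naive Bayes factorization (all numbers are made up)
p_c1, p_c2 = 0.6, 0.4                # priors P(c1), P(c2)
p_x1 = {'c1': 0.5, 'c2': 0.2}        # P(x1|class)
p_x2 = {'c1': 0.2, 'c2': 0.7}        # P(x2|class)

# Unnormalized scores: P(x1|c) * P(x2|c) * P(c)
score_c1 = p_x1['c1'] * p_x2['c1'] * p_c1   # 0.06
score_c2 = p_x1['c2'] * p_x2['c2'] * p_c2   # 0.056

# The denominator P(x1, x2) is common to both classes, so comparing scores is enough
print('c1' if score_c1 > score_c2 else 'c2')  # c1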

A similar idea is used by probabilistic taggers to assign POS tags to sequences of words. Consider an example where we want to assign POS tags to the sentence "The high cost", and for simplicity assume that only three POS tags (DT, JJ, NN) exist. Possible combinations in which the POS tags can be applied are:

  • (DT, JJ, NN)
  • (DT, NN, JJ)
  • (JJ, DT, NN), and so on.

With these assumptions let's see how we can apply a POS tag to the words. The objective is to find the most probable POS tag sequence for the sentence: "The high cost".

The probability that the tag sequence is (DT, JJ, NN) can be computed as:

$$P(DT,JJ,NN | The,high,cost) = {P(The,high,cost | DT,JJ,NN) * P(DT,JJ,NN) \over P(The,high,cost)} $$

Expanding the terms on the right side by using the chain rule:

$$ = {P(the|DT,JJ,NN)∗P(high|the,DT,JJ,NN)∗P(cost|the,high,DT,JJ,NN)∗P(DT)∗P(JJ|DT)∗P(NN|DT,JJ) \over P(The,high,cost)} $$

We now apply the Markov assumption, which states that a state's probability depends only on the previous state. That implies that in a tag sequence (DT, JJ, NN), the probability that a word is tagged as NN depends only on the previous tag JJ and not on DT.

Another simplifying assumption we make is that the probability of a word w being assigned a tag t depends only on the tag t and not on any other tag. Thus, the probability P(the|DT), i.e. the probability that a word is 'the' given its tag is DT, depends only on DT and not on any of the other tags NN or JJ.

Simplifying the above equation using the stated assumptions:

$$P(DT,JJ,NN | The,high,cost) = {P(the|DT)∗P(high|JJ)∗P(cost|NN)∗P(DT)∗P(JJ|DT)∗P(NN|JJ)\over P(The,high,cost)} $$

Similarly, we calculate the probability that the tag sequence is (DT, NN, JJ), (JJ, DT, NN), and so on as:

$$P(DT,NN,JJ | The,high,cost) = {P(the|DT)∗P(high|NN)∗P(cost|JJ)∗P(DT)∗P(NN|DT)∗P(JJ|NN)\over P(The,high,cost)} $$

$$P(JJ,DT,NN | The,high,cost) = {P(the|JJ)∗P(high|DT)∗P(cost|NN)∗P(JJ)∗P(DT|JJ)∗P(NN|DT)\over P(The,high,cost)} $$

After calculating all three probabilities we assign the POS tag with the highest probability.

NOTE: The denominator can be ignored in the calculation, as it is the same for all three equations.

For our example, where we have three POS tags - DT, JJ, NN, and we are tagging the sentence "The high cost", there are \(3^3\)=27 possible tag sequences (each word can have three possible tags).

In general, for a sequence of n words and t tags, a total of \(t^n\) tag sequences are possible. The Penn Treebank dataset in NLTK itself has 36 POS tags, so for a sentence of length say 10, there are \(36^{10}\) possible tag sequences (that's about three thousand trillion!).
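
A tiny brute-force sketch makes this combinatorial blow-up concrete (the tag set and sentence are the toy ones from above):

# Enumerating every candidate tag sequence for the toy example
from itertools import product

tags = ['DT', 'JJ', 'NN']
words = ['The', 'high', 'cost']

sequences = list(product(tags, repeat=len(words)))
print(len(sequences))  # 3**3 = 27 candidate sequences

# Each sequence would then be scored with its emission and transition
# probabilities and the best one kept; for a 10-word sentence with the
# 36 Penn Treebank tags this becomes 36**10 sequences
print(36 ** 10)        # 3656158440062976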

Computing trillions of probabilities to tag a 10-word sentence is impractical. Thus, we need to find a much more efficient approach to tagging. 

Viterbi Heuristic

The approach of calculating the probabilities of all possible tag sequences and picking the sequence with the maximum probability is conceptually simple but computationally very expensive, so it is not an efficient way of doing POS tagging.

In the Viterbi heuristic, the basic idea is: given a list of observations (words) \(O_1, O_2, \ldots, O_n\) to be tagged, rather than computing the probabilities of all possible tag sequences, we assign tags sequentially, i.e. we assign the most likely tag to each word using the previous tag. Let's understand this further.

Speaking in mathematical terms, we assign the tag \(T_j\) to each word \(O_i\) such that it maximizes the likelihood given below:

$$P(T_j|O_i)=P(O_i|T_j)∗P(T_j|T_{j−1})$$

where \(T_{j-1}\) is the tag assigned to the previous word. Note that the above equation also considers the Markov assumption.

We assign tags sequentially such that the tag assigned to every word \(O_i\) maximizes the likelihood \(P(T_j|O_i)\) locally. This is also why it is called a greedy algorithm as it assigns the tag that is most likely at every word, rather than looking for the overall most likely sequence.

Viterbi Algorithm is an example of a dynamic programming algorithm. (In general, algorithms that break down a complex problem into subproblems and solve each subproblem optimally are called dynamic programming algorithms.)

The limitation is that when the algorithm hits an unknown word (i.e. one not present in the training set), it naively assigns the first tag in the list of tags. Can you think of any of the techniques we have discussed so far to avoid this limitation? I am attaching the notebook POS tagging using modified Viterbi Algorithm.ipynb as a reference to answer this question, but I highly encourage you to give it a try before going through the notebook.

But how do we find out which is the most probable POS tag out of multiple POS tags available that can maximize the likelihood? This question can be answered after understanding how the Hidden Markov Model works.

Markov Model and Hidden Markov Model

The most commonly used probabilistic algorithm for POS tagging is the Hidden Markov Model (HMM). But before we go forward and discuss the HMM, it is important to understand the Markov model first.

Markov Model

Markov models are probabilistic (or stochastic) models that were developed to model sequential processes, such as text and speech. In a Markov process, it is usually assumed that the probability of each event (or state) depends only on the previous event. This simplifying assumption is known as the Markovian, one-Markov, or first-order Markov assumption.

$$P(X_{i+1}=s | X_i, X_{i-1}, \ldots, X_1) = P(X_{i+1}=s | X_i)$$

In the equation shown above, the probability of the next state 's' depends only on the previous state; that is why this is called the one-Markov assumption. The same can be generalized to a k-Markov (k-th order) assumption, where the state depends on the previous k steps.
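
Written out, the k-th order Markov assumption is:

$$P(X_{i+1}=s | X_i, X_{i-1}, \ldots, X_1) = P(X_{i+1}=s | X_i, X_{i-1}, \ldots, X_{i-k+1})$$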

A Markov process can also be represented as a probabilistic state transition diagram, as shown below:

Markov Model

  • Here, ‘a’, ‘p’, ‘i’, ‘t’, ‘e’, and ‘h’ are the states.
  • The numbers mentioned on the edges are transition probabilities, i.e. the probabilities of moving from one state to another. (More on transition probabilities in the next section of the post.)
  • The start state is a special state which represents the initial state of the process (e.g. the start of a sentence).

We can represent each word in a sentence as a state. The transition probabilities (which can be learned from a text corpus) would then represent the probability that the process moves from the current word to the next word. For example, the transition probability from the state 'San' to 'Francisco' will be higher than that to the state 'Delhi'.
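
As a rough sketch of how such word-to-word transition probabilities could be estimated from text, consider the toy corpus below (the sentences and counts are made up for illustration):

# Estimating word-to-word transition probabilities from a toy corpus
from collections import Counter

corpus = ['I', 'flew', 'to', 'San', 'Francisco', '.',
          'I', 'drove', 'to', 'San', 'Francisco', '.',
          'I', 'flew', 'to', 'Delhi', '.']

bigrams = Counter(zip(corpus[:-1], corpus[1:]))
unigrams = Counter(corpus[:-1])

# P(next_word | current_word) = count(current_word, next_word) / count(current_word)
def transition_probability(cur, nxt):
    return bigrams[(cur, nxt)] / unigrams[cur]

print(transition_probability('San', 'Francisco'))  # 1.0
print(transition_probability('to', 'Delhi'))       # 0.33...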

This is the basic definition of the Markov model. Now that we understand how the Markov model works, let's move forward and understand the Hidden Markov Model.

Hidden Markov Model

Relating the Markov model discussed above to our problem of POS tagging: our goal is to find the POS tags given the words in a sentence, where the POS tags are the unknown states of the Markov process. This brings us to the concept of the Hidden Markov Model.

The Hidden Markov Model (HMM) is an extension of the Markov process (it retains all the properties of the Markov model, i.e. there are states and a given state depends only on the previous state) that is used to model phenomena where the states are hidden (or latent) and each state emits observations. The only difference between an HMM and a Markov model is that in an HMM the states are hidden.

For example, in a speech recognition system (a speech-to-text converter), the states represent the actual text words that we want to predict, but we do not directly observe them (i.e. the states are hidden). Rather, we only observe the speech (audio) signals corresponding to each word, and we need to infer the states (or words) using the observations.

Similarly, in POS tagging, what we observe are the words in a sentence, while the POS tags themselves are hidden and we don't know what the POS tags are for the given word in the sentence. Thus, we can model the POS tagging task as an HMM with the hidden states representing POS tags that emit observations, i.e. words. Let's understand this process further.

HMM Model

The diagram shown above represents how the POS tagging problem can be modelled as an HMM. In the diagram, we have four words of a sentence, y1, y2, y3, and y4, which we want to tag using three possible tags, x1, x2, and x3, which are the hidden states.

The model is governed by two kinds of probabilities: transition probabilities, which hold between the hidden states, and emission probabilities, which go from the hidden states to the observed words of the sentence. In the diagram shown above, a12, a21, and a23 are transition probabilities, while b11, b21, b12, b13, etc. are emission probabilities.

For the sentence "The high cost" which we have seen earlier, the probabilities P(NN|JJ), P(JJ|DT), etc. are transition probabilities, while the P(high|JJ), P(cost|NN), etc. are the emission probabilities.

In an HMM for POS tagging, the POS tags are the hidden states. Only the emission probabilities, transition probabilities, and initial state probabilities (i.e., the probability of transitioning into a state from the initial state) are required to define the HMM, and these are usually learned from a training corpus. Let us understand how the probabilities are learned.
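
Putting this together, an HMM for POS tagging can be thought of as three lookup tables. The sketch below uses made-up values purely to show the structure:

# An HMM is fully specified by three sets of probabilities (toy values)
initial = {'DT': 0.4, 'NN': 0.3, 'VB': 0.3}              # P(tag | start)

transition = {('DT', 'NN'): 0.4, ('DT', 'JJ'): 0.3,      # P(next tag | current tag)
              ('JJ', 'NN'): 0.5, ('NN', 'VB'): 0.3}

emission = {('DT', 'the'): 0.6, ('JJ', 'high'): 0.2,     # P(word | tag)
            ('NN', 'cost'): 0.1}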

    Assigning PoS tags using the HMM model

    The two main problems in building an HMM for POS tagging are the learning problem and the explanation problem. Let us understand how solving these two problems helps us build an HMM for the POS tagging task.

    The Learning Problem

    This problem is about learning the probabilities from the training corpus. Here we calculate the transition probability (from one state to another), the emission probability (of a state emitting a word), and the initial state probabilities (of a state appearing at the start of a sentence).

    To assign POS tags, we need historical (tagged) data on which to train the HMM model, i.e. to calculate the emission and transition probabilities. The process of learning these probabilities from a tagged corpus is called training an HMM model.

    The emission probability of a word 'w' for tag 't' is calculated as:

    $$P(w|t) = \frac{\text{Number of times w has been tagged as t}}{\text{Number of times t appears}}$$

    The transition probability of tag t1 followed by tag t2 is calculated as:

    $$P(t_2|t_1) = \frac{\text{Number of times } t_1 \text{ is followed by } t_2}{\text{Number of times } t_1 \text{ appears}}$$

    This problem of calculating the emission and transition probabilities is called the HMM learning problem. Let us take an example corpus of three POS-tagged sentences and understand how the HMM learns the emission and transition probabilities. The three sentences are as follows:

    • Twitter/NN  is/VB  the/DT  most/JJ  open/JJ  social/JJ  media/NN  platform/NN.
    • Twitter/NN  is/VB  marketing/VB  itself/PRP  as/IN  /DT  news/NN  platform/NN.
    • Twitter/NN  was/VB  exploited/VB  by/IN  adversarial/JJ  governments/NN.

    Let's say we want to calculate the emission probability P("Twitter"|NN), which is the number of times 'Twitter' appears as a noun divided by the number of nouns. This is equal to 3/8. In the same way, we can calculate a transition probability. Let's say we want to calculate P(NN|JJ), which is the number of times an adjective is followed by a noun divided by the number of adjectives. This is equal to 2/4.

    Let's move forward and assume that we have a new sentence "Twitter was exploited" which we want to POS tag, and that we want to calculate the probability of the tag sequence (NN, VB, VB). Also, assume that P(NN|start) = 1/3. This probability is equal to P("Twitter"|NN) * P(NN|start) * P("was"|VB) * P(VB|NN) * P("exploited"|VB) * P(VB|VB), which can be calculated easily using the emission and transition probabilities computed above.
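
    These counts are easy to reproduce in code. Below is a minimal sketch over the three toy sentences above (the determiner whose word is not shown in the second sentence is left out, which does not affect the two probabilities computed here):

    # Reproducing the emission and transition counts from the toy corpus
    tagged = [[('Twitter', 'NN'), ('is', 'VB'), ('the', 'DT'), ('most', 'JJ'),
               ('open', 'JJ'), ('social', 'JJ'), ('media', 'NN'), ('platform', 'NN')],
              [('Twitter', 'NN'), ('is', 'VB'), ('marketing', 'VB'), ('itself', 'PRP'),
               ('as', 'IN'), ('news', 'NN'), ('platform', 'NN')],
              [('Twitter', 'NN'), ('was', 'VB'), ('exploited', 'VB'), ('by', 'IN'),
               ('adversarial', 'JJ'), ('governments', 'NN')]]

    pairs = [pair for sent in tagged for pair in sent]
    tags = [t for _, t in pairs]

    # Emission: P('Twitter'|NN) = count('Twitter' tagged NN) / count(NN)
    emission = sum(1 for w, t in pairs if w == 'Twitter' and t == 'NN') / tags.count('NN')
    print(emission)    # 3/8 = 0.375

    # Transition: P(NN|JJ) = count(JJ followed by NN) / count(JJ)
    transition = sum(1 for a, b in zip(tags, tags[1:]) if a == 'JJ' and b == 'NN') / tags.count('JJ')
    print(transition)  # 2/4 = 0.5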

    So far we have successfully calculated the HMM parameters, i.e. the transition and emission probabilities, but how do we find the best POS tags for a given sequence of words? The explanation problem is the answer.

    The Explanation Problem

    After learning the model parameters/probabilities, we need to find the best possible state (tag) sequence for each given sentence. The explanation problem, also called the decoding problem, is defined as follows: given a sequence of words/observations and an HMM (i.e. its transition, emission, and start state probabilities), find the tag/state sequence that maximizes the probability of the observed words.

    The decoding algorithm (or explanation problem) used for HMMs is called the Viterbi algorithm (which we have discussed briefly earlier).

    As we have seen earlier, assigning POS tags exhaustively is an exponential problem, so we need a more efficient algorithm. We have already discussed the Viterbi heuristic briefly; now let's understand how the Viterbi algorithm can help solve the explanation problem.

    The visualization of the HMM to solve the explanation problem using the Viterbi algorithm is as shown below:

    HMM POS Tag Example

    Here we have considered an example string 'The man slept' and have assumed that we have only three possible tags - DT, NN, and VB. We have also assumed some emission probabilities (P('the'|DT), P('the'|NN), P('the'|VB), etc.) and transition probabilities (P(NN|VB), P(VB|DT), P(DT|VB), etc.) that we obtained from the learning problem as seen earlier.

    Let us see how to calculate the most probable tag sequence using the Viterbi Heuristic.

    We start with the start state probabilities and multiply them by the emission probabilities. For example, we will calculate the following:

    • P(DT|Start) x P('the'|DT) = 0.4 x 0.6 = 0.24
    • P(NN|Start) x P('the'|NN) = 0.3 x 0.3 = 0.09
    • P(VB|Start) x P('the'|VB) = 0.3 x 0.1 = 0.03

    Since P(DT|Start) x P('the'|DT) is the highest, we assign the POS tag 'DT' to the word 'The'. We now move forward and find the POS tag for the word 'man', this time using transition probabilities instead of start probabilities, as shown:

    • P(DT|DT) x P('man'|DT) = 0.3 x 0.1 = 0.03
    • P(NN|DT) x P('man'|NN) = 0.4 x 0.6 = 0.24
    • P(VB|DT) x P('man'|VB) = 0.3 x 0.3 = 0.09

    Therefore, we assign the tag 'NN' to the word 'man', and in the same way, we POS tag the remaining words of the sentence.
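
    The two greedy steps above can be written out directly in code. Below is a minimal sketch using the assumed probabilities from the diagram (toy values):

    # Greedy tag assignment for 'The man' with the assumed toy probabilities
    start_p = {'DT': 0.4, 'NN': 0.3, 'VB': 0.3}                          # P(tag | Start)
    trans_p = {('DT', 'DT'): 0.3, ('DT', 'NN'): 0.4, ('DT', 'VB'): 0.3}  # P(tag | DT)
    emit_p = {('the', 'DT'): 0.6, ('the', 'NN'): 0.3, ('the', 'VB'): 0.1,
              ('man', 'DT'): 0.1, ('man', 'NN'): 0.6, ('man', 'VB'): 0.3}  # P(word | tag)

    # Step 1: maximize P(tag|Start) * P('the'|tag) -> 'DT' wins with 0.24
    first = max(start_p, key=lambda t: start_p[t] * emit_p[('the', t)])

    # Step 2: maximize P(tag|'DT') * P('man'|tag) -> 'NN' wins with 0.24
    second = max(trans_p, key=lambda pair: trans_p[pair] * emit_p[('man', pair[1])])[1]

    print(first, second)  # DT NN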

    To summarise, given a Hidden Markov Model defined by the initial state, emission, and transition probabilities, we assign a tag \(T_j\) to an observation \(O_i\) such that it maximizes the likelihood:

    $$P(T_j|O_i)=P(O_i|T_j)∗P(T_j|T_{j−1})$$

    Or, in simple terms, we use the Viterbi algorithm where for every word w in the sentence, a tag t is assigned to w such that it maximizes the likelihood of the occurrence of P(tag|word).

    $$P(tag|word) = P(word|tag) * P(tag|\text{previous tag}) = \text{Emission probability * Transition probability}$$

    HMM and Viterbi Algorithm Python Implementation

    Let us look at Python code to train a POS tagger using an HMM and the Viterbi algorithm, again using the treebank POS-tagged dataset as an example.

    In the code shown below, the very first step after splitting the dataset into train and test sets is to train an HMM model. As part of training the HMM, the emission probabilities \(P(w_i|t_j)\) and transition probabilities \(P(t_i|t_{i-1})\) are calculated from the treebank text corpus.

    # Importing libraries
    import nltk, re, pprint
    import numpy as np
    import pandas as pd
    import requests
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pprint, time
    import random
    from sklearn.model_selection import train_test_split
    from nltk.tokenize import word_tokenize
    
    random.seed(1234)
    
    # Reading the Treebank tagged sentences
    wsj = list(nltk.corpus.treebank.tagged_sents())
    
    # Splitting into train and test
    train_set, test_set = train_test_split(wsj,test_size=0.3)
    
    # Getting a list of tagged words
    train_tagged_words = [tup for sent in train_set for tup in sent]
    
    # Tokens and vocabulary
    tokens = [pair[0] for pair in train_tagged_words]
    V = set(tokens)
    
    # number of tags
    T = set([pair[1] for pair in train_tagged_words])
    
    # Compute word given tag: Emission Probability
    def word_given_tag(word, tag, train_bag = train_tagged_words):
        tag_list = [pair for pair in train_bag if pair[1]==tag]
        count_tag = len(tag_list)
        w_given_tag_list = [pair[0] for pair in tag_list if pair[0]==word]
        count_w_given_tag = len(w_given_tag_list)
        
        return (count_w_given_tag, count_tag)
    
    # Compute tag given tag: tag2(t2) given tag1 (t1), i.e. Transition Probability
    def t2_given_t1(t2, t1, train_bag = train_tagged_words):
        tags = [pair[1] for pair in train_bag]
        count_t1 = len([t for t in tags if t==t1])
        count_t2_t1 = 0
        for index in range(len(tags)-1):
            if tags[index]==t1 and tags[index+1] == t2:
                count_t2_t1 += 1
        return (count_t2_t1, count_t1)
    
    # Creating t x t transition matrix of tags
    tags_matrix = np.zeros((len(T), len(T)), dtype='float32')
    for i, t1 in enumerate(list(T)):
        for j, t2 in enumerate(list(T)): 
            tags_matrix[i, j] = t2_given_t1(t2, t1)[0]/t2_given_t1(t2, t1)[1]
    # convert the matrix to a df for better readability
    tags_df = pd.DataFrame(tags_matrix, columns = list(T), index=list(T))
    
    # Viterbi Heuristic
    def Viterbi(words, train_bag = train_tagged_words):
        state = []
        T = list(set([pair[1] for pair in train_bag]))
        
        for key, word in enumerate(words):
            # Initialize list of probability columns for a given observation
            p = [] 
            for tag in T:
                if key == 0:
                    transition_p = tags_df.loc['.', tag]
                else:
                    transition_p = tags_df.loc[state[-1], tag]
                    
                # compute emission and state probabilities
                emission_p = word_given_tag(words[key], tag)[0]/word_given_tag(words[key], tag)[1]
                state_probability = emission_p * transition_p    
                p.append(state_probability)
                
            pmax = max(p)
            # getting state for which probability is maximum
            state_max = T[p.index(pmax)] 
            state.append(state_max)
        return list(zip(words, state))
    
    # Running on entire test dataset would take more than 3-4hrs. 
    # Let's test our Viterbi algorithm on a few sample sentences of test dataset
    rndom = [random.randint(0, len(test_set)-1) for x in range(5)]
    # list of sents
    test_run = [test_set[i] for i in rndom]
    # list of tagged words
    test_run_base = [tup for sent in test_run for tup in sent]
    # list of untagged words
    test_tagged_words = [tup[0] for sent in test_run for tup in sent]
    tagged_seq = Viterbi(test_tagged_words)
    
    # Calculating accuracy
    check = [i for i, j in zip(tagged_seq, test_run_base) if i == j] 
    accuracy = len(check)/len(tagged_seq)

    After learning the model parameters, the explanation stage is coded: the Viterbi function finds the most probable tag for each word in a greedy fashion. Finally, the accuracy of the tags produced by the Viterbi algorithm is calculated. I am also attaching the notebook POS Tagging using HMMs and Viterbi.ipynb if you want to try executing the code on your own.

    Deep Learning Techniques: RNN Algorithm

    Recurrent Neural Networks (RNNs) have empirically been shown to outperform many conventional sequence models on tasks such as POS tagging, entity recognition, and dependency parsing. Below is an implementation using the Keras library to train an RNN for the task of POS tagging. We will understand how RNNs work in an upcoming post.

    import numpy as np
    from nltk.corpus import treebank
    from sklearn.model_selection import train_test_split
    
    from keras.preprocessing.text import Tokenizer
    from keras.preprocessing.sequence import pad_sequences
    from keras.utils.np_utils import to_categorical
    from keras.models import Sequential, Model
    from keras.layers import Dense, SimpleRNN
    
    # Creating treebank dataset
    treebank_sentences = treebank.tagged_sents(tagset='universal')
    
    # Separate the Words and Tags from the dataset of sentences
    Words = []
    Tags = []
    word_count = []
    tag_count = []
    
    for s in treebank_sentences:
        tmpX = []
        tmpY = []
        for word,tag in s:
            tmpX.append(word.lower())
            tmpY.append(tag.lower()) 
            word_count.append(word.lower())
            tag_count.append(tag.lower())
            
        Words.append(tmpX)
        Tags.append(tmpY)
        
    # Encode tags into numerical data using Keras tokenizer
    # Convert the extracted words into respective sequences using the tokeniser
    tag_tok = Tokenizer()
    tag_tok.fit_on_texts(Tags)
    tagEncoded = tag_tok.texts_to_sequences(Tags)
    tEnc = {}
    
    for i in range(len(Tags)):
        for j in range(len(Tags[i])):
            if Tags[i][j] in tEnc:
                pass
            else:
                tEnc[Tags[i][j]] = tagEncoded[i][j]
    
    tag_arr = np.empty(12, dtype=object)  # There are 12 POS tags in the universal tagset; use an object array to hold tag strings
    for (t,i) in tEnc.items():
        tag_arr[i-1] = t
    
    # Encode words into numerical data using Keras tokenizer
    word_tok = Tokenizer()
    word_tok.fit_on_texts(Words)
    wordEncoded = word_tok.texts_to_sequences(Words)
    wEnc = {}
    
    for i in range(len(Words)):
        for j in range(len(Words[i])):
            if Words[i][j] in wEnc:
                pass
            else:
                wEnc[Words[i][j]] = wordEncoded[i][j]
    
    # Create tag sets for each word, this will be used to convert each words into its corresponding known POS tags
    tag_set = {}
    
    for (sw,st) in zip(Words,Tags):
        for (word,tag) in zip(sw,st):
            if wEnc[word] in tag_set:
                if tEnc[tag] in tag_set[wEnc[word]]:
                    pass
                else:
                    tag_set[wEnc[word]].add(tEnc[tag])
            else:
                tag_set[wEnc[word]] = {tEnc[tag]}
    
    # Creating POS tag-set feature matrix for data-set
    MX = 5 #  Assuming that a word cannot have more than 5 possible POS tags to simplify the problem
    # Constructing the training features
    
    train_val = []
    for s in treebank_sentences:
        tmp = []
        for word,t in s:
            if wEnc[word.lower()] in tag_set:
                it = list(tag_set[wEnc[word.lower()]])
                it.sort()
                tmp.append(it)
            else:
                tmp.append([])
        tmp = pad_sequences(tmp, maxlen = MX, padding = 'post', truncating = 'post')
        train_val.append(tmp)
    
    # Padding the data-set, assuming the sentence size to be 100
    SENTENCE_SIZE = 100
    train_val = pad_sequences(train_val, maxlen = SENTENCE_SIZE, padding = 'post', truncating = 'post')
    padded_tag = pad_sequences(tagEncoded, maxlen = SENTENCE_SIZE, padding = 'post', truncating = 'post')
    X = train_val
    Y = padded_tag
    
    # Applying one-hot encoding to the POS tag labels
    Y = to_categorical(Y)
    
    # split entire data into training and testing sets
    X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.15, random_state=6)
    
    # Create a Sequential model for our RNN
    model = Sequential()
    model.add(SimpleRNN(64, input_shape = X_train.shape[1:], activation='relu', return_sequences = True))
    model.add(Dense(Y.shape[2], activation='softmax'))
    model.compile(loss = 'mean_squared_error', optimizer = 'adam', metrics=['acc'])
    model.fit(X_train, Y_train, batch_size = 10, epochs = 10)
    

    The steps to build an RNN for the POS tagging task include importing the libraries and the dataset, preprocessing the data and extracting features, splitting the data into train and test sets, creating and compiling an RNN model using the Keras library, and finally training the model on the training set.
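
    Once trained, the model can be checked on the held-out split created above. This is a minimal usage sketch; the exact numbers depend on the split and the training run, and note that the padded timesteps inflate the reported accuracy:

    # Evaluating the trained RNN on the held-out test split
    loss, acc = model.evaluate(X_test, Y_test, verbose=0)
    print(acc)

    # Predicted tag indices for the first test sentence (argmax over the softmax output)
    pred = model.predict(X_test[:1]).argmax(axis=-1)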

    Here we tokenize the words to extract features. A better way to extract features from text is to use word embeddings, which are discussed in the post Distributional Semantics: Techniques to represent words as vectors.

    Summary

    The post discussed Part-of-Speech (POS) tagging techniques, focusing on lexicon-based, rule-based, probabilistic, and deep learning methods. Lexicon-based and rule-based methods, such as frequency lookups and regular-expression rules, are simple and efficient but can struggle with ambiguity and lack of context.

    We saw how probabilistic methods, including Hidden Markov Models (HMM) and the Viterbi algorithm, offer improved accuracy by considering word sequences and probabilities. Deep learning techniques, such as Recurrent Neural Networks (RNN), can further enhance performance by learning complex patterns in data.

    I hope you enjoyed the post.

    Author Info

    Tavish Aggarwal

    Website: http://tavishaggarwal.com

    Tavish lives in Hyderabad and works as a research-oriented Data Scientist specializing in improving key business performance indicators in the areas of sales, marketing, logistics, and plant production. He is an innovative team leader with strong data-wrangling capabilities, such as outlier treatment, data discovery, and data transformation, with a focus on delivering high-quality results.