During my MSc program, I was lucky to squeeze into Michael Collin’s NLP class. We used his Coursera course as part of the program, which I’d highly recommend.

Recently I decided to review my NLP studies and I believe the best way to learn or relearn a subject is to teach it. This is one in a series of 4 posts with a walk-through of the algorithms we implemented during the course. I’ll provide links to my code hosted on Github.

Disclaimer: Before taking this NLP course, the only thing I knew about Python was that ‘it’s the one without curly brackets’. I learned Python on the go while implementing these algorithms. So if I did anything against Python code conventions or flat-out heinous, I apologize and thank you in advance for your understanding. Feel free to write and let me know.

The Concept

To quote Wikipedia, “Named-entity recognition (I’ve always known it as tagging) is a subtask of information extraction that seeks to locate and classify elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quanitites, monetary values, percentages, etc.”

For example, the algorithm receives as input some text

“Bill Gates founded Microsoft in 1975.”

and outputs

“Bill Gates[person] founded Microsoft[organization] in 1975[date].”

Off the top of my head, some useful applications are document matching (ex. a document containing Gates[person] may not be on the same topic as one containing gates[object]) and query searches. I’m sure there are lots more, if you check out Collin’s Coursera course he may discuss this in greater depth.

The Requirements

Development data: The file ner_dev.dat provided by prof. Michael Collins has a series of sentences separated by an empty line, one word per line.

Training data: The file ner_train.dat provided by prof. Michael Collins has a series of sentences separated by an empty line, one word and tag per line, speparated by a space.

Word-tag count data: The file ner.counts has the format [count] [type of tag] [label] [word]. The tags used are RARE, O, I-MISC, I-PER, I-ORG, I-LOC, B-MISC, B-PER, B-ORG, B-LOC. The tag O means it’s not an NE. This file is generated by count_freqs.py, a script provided by prof. Michael Collins. Run count_freqs.py on the training data ner_train.dat

The Algorithm

Python code: viterbi.py Usage: python viterbi.py ner.counts ngram.counts [input_file] > [output_file] Summary: The Viterbi algorithm finds the maximum probability path for a series of observations, based on emission and transition probabilities. In a Markov Process, emission is the probability of an output given a state and transition is the probability of transitioning to the state given the previous states. In our case, the emission parameter e(x|y) = the probability of the word being x given you attributed tag y. If your training data had 100 counts of ‘person’ tags, one of which is the word ‘London’ (I know a guy who named his kid London), e(‘London’|’person’) = 0.01. Now with 50 counts of ‘location’ tags, 5 of which are ‘London’, e(‘London’|’location’) = 0.1 which clearly trumps 0.01. The transition parameter q(yi | yi-1, yi-2) = the probability of putting tag y in position i given it’s two previous tags. This is calculated by Count(trigram)/Count(bigram). For each word in the development data, he Viterbi algorithm will associate a score for a word-tag combo based on the emission and transition parameters it obtained from the training data. It does this for every possible tag and sees which is more likely. Clearly this won’t be 100% correct as natural language is unpredictable, but you should get pretty high accuracy.

Optional Preprocessing

Re-label words in training data with frequency < 5 as ‘RARE’ - This isn’t required, but useful. Re-run count_freqs.py if used.

Python code: label_rare.py

Usage: python label_rare.py [input_file]


  1. Uses Python Counter to obtain word counts in [input_file]; removes all word-count pairs with count < 5, store remaining pairs in a dictionary named rare_words.
  2. Iterates through each line in [input file], checks if word is in rare words dictionary, if so, replaces word with RARE.

Step 1. Get Count(y) and Count(x~y)

Python code: emission_counts.py


  1. Iterate through each line in ner.counts file and store each word-label-count combo in a dictionary count_xy and update the dictionary of count_y. For example count_xy[Peter][I-PER] returns the number of times the word ‘Peter’ was labeled ‘I-PER’ in the training data and count_y[I-PER] the total number of ‘I-PER’ tags. The dictionary count_y contains 8 items, one for each label ( RARE , O, I-MISC, I-PER, I-ORG, I-LOC, B-MISC, B-PER, B-ORG, B-LOC);
  2. Return count_xy, count_y

Step 2. Get bigram and trigram counts

Python code: transition_counts.py


  1. Iterate through each line in the n-gram_counts file
  2. If the line contains ’2-GRAM’ add an item to the bigram_counts dictionary using the bigram (two space-separated labels following the tag type ‘2-gram’) as key, count as value. This dictionary will contain Count(yi-2,yi-1).
  3. If the line contains ’3-GRAM’, add an item to the trigram_counts dictionary using the trigram as key, count as value. This dictionary will contain Count(yi-2, yi-1, yi).
  4. Return dictionaries of bigram and trigram counts.

Step 3. Viterbi

(For each line in the [input_file]):

  1. If the word was seen in training data (present in the count_xy dictionary), for each of the possible labels for the word:
  2. Calculate emission = count_xy[word][label] / float(count_y[label]
  3. Calculate transition = trigram_counts[trigram])/float(bigram_counts[bigram] Note: yi-2 = *, yi-1 = * for the first round
  4. Set probability = emission x transition
  5. Update max(probability) and arg max if needed. 2 If the word was not seen in the training data:
  6. Calculate emission = count xy[RARE][label] / float(count y[label].
  7. Calculate q(yi yi-1, yi-2) = trigram counts[trigram])/float(bigram counts[bigram]. Note: yi-2 = ∗, yi-1 = ∗ for the first round
  8. Set probability = emission × transition
  9. Update max(probability) if needed, arg max = RARE
  10. Write arg max and log(max(probability)) to output file.
  11. Update yi-2, yi-1.
  12. Update yi-2, yi-1.


Prof. Michael Collins provided an evaluation script eval_ne_tagger.py to verify the output of your Viterbi implementation. Usage: python eval_ne_tagger.py ner_dev.key [output_file]