Unit 7: Visualization using NLP pipelines

Learning outcomes

By the end of this unit you should:

  • understand how natural language pipelines process text
  • have experimented with NLP Compromise to visualize language features
  • have created a natural language pipeline with the Natural Language Tool Kit (NLTK)

Activity 1: General introduction

Read this to understand the concepts of NLP, POS tags and parse trees.

Natural language processing pipeline

A natural language processing (NLP) system is usually called an NLP pipeline. This is because it usually involves several stages (steps or layers) of processing. The pipeline is one-directional: there is an input (natural language) and an output (processed text). Simply put, NLP is the application of artificial intelligence to human languages.
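The idea of stages feeding into one another can be sketched in a few lines of Python. This is only a toy illustration using the standard library, not a real NLP system; the stage names (split_on_whitespace, strip_punctuation, lowercase) and the run_pipeline helper are made up for this example.

```python
import string


def split_on_whitespace(text):
    """Stage 1: break the raw text into rough tokens."""
    return text.split()


def strip_punctuation(tokens):
    """Stage 2: remove punctuation from the edges of each token."""
    stripped = [token.strip(string.punctuation) for token in tokens]
    return [token for token in stripped if token]


def lowercase(tokens):
    """Stage 3: normalise every token to lower case."""
    return [token.lower() for token in tokens]


# The pipeline is an ordered list of stages; data only flows one way.
STAGES = [split_on_whitespace, strip_punctuation, lowercase]


def run_pipeline(text, stages=STAGES):
    data = text
    for stage in stages:
        data = stage(data)  # the output of one stage is the input to the next
    return data


result = run_pipeline("The son carried on swimming.")
print(result)  # ['the', 'son', 'carried', 'on', 'swimming']
```

Because each stage is an ordinary function, a stage can be added, removed or swapped without touching the others, which is the main attraction of the pipeline design.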

NLP pipeline
Source: Morioh

Part-of-speech (POS) tagging

POS tagging is the act of labelling words with a particular part of speech. The common parts of speech are noun, verb, adverb and adjective. However, most POS taggers use a much larger set of tags. The most popular POS tagset, the Penn Treebank tagset, has 36 tags. NLP pipelines that aim to map syntax or disambiguate meanings often use this layer. The Penn Treebank tagset is shown in the table below.

Tag   Description
CC    Coordinating conjunction
CD    Cardinal number
DT    Determiner
EX    Existential there
FW    Foreign word
IN    Preposition or subordinating conjunction
JJ    Adjective
JJR   Adjective, comparative
JJS   Adjective, superlative
LS    List item marker
MD    Modal
NN    Noun, singular or mass
NNS   Noun, plural
NNP   Proper noun, singular
NNPS  Proper noun, plural
PDT   Predeterminer
POS   Possessive ending
PRP   Personal pronoun
PRP$  Possessive pronoun
RB    Adverb
RBR   Adverb, comparative
RBS   Adverb, superlative
RP    Particle
SYM   Symbol
TO    to
UH    Interjection
VB    Verb, base form
VBD   Verb, past tense
VBG   Verb, gerund or present participle
VBN   Verb, past participle
VBP   Verb, non-3rd person singular present
VBZ   Verb, 3rd person singular present
WDT   Wh-determiner
WP    Wh-pronoun
WP$   Possessive wh-pronoun
WRB   Wh-adverb
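One way to get a feel for the tagset is to notice that related tags share a prefix (for example, every verb tag starts with VB). The sketch below stores a few tags from the table in a dictionary and maps each fine-grained tag back to a broad part of speech. The coarse_category function and the tagged sentence are hypothetical: the tags were assigned by hand for illustration, not produced by a tagger.

```python
# A few Penn Treebank tags and their descriptions, copied from the table.
PENN_TAGS = {
    "NN": "Noun, singular or mass",
    "NNS": "Noun, plural",
    "JJ": "Adjective",
    "RB": "Adverb",
    "VB": "Verb, base form",
    "VBD": "Verb, past tense",
    "VBG": "Verb, gerund or present participle",
    "VBN": "Verb, past participle",
}

# Prefixes shared by the fine-grained tags of each common part of speech.
PREFIXES = (("NN", "noun"), ("VB", "verb"), ("JJ", "adjective"), ("RB", "adverb"))


def coarse_category(tag):
    """Map a fine-grained Penn tag to a broad part of speech by its prefix."""
    for prefix, category in PREFIXES:
        if tag.startswith(prefix):
            return category
    return "other"


# Hand-tagged example (an assumption for illustration, not tagger output).
tagged = [("The", "DT"), ("father", "NN"), ("soon", "RB"), ("gave", "VBD"), ("up", "RP")]

for word, tag in tagged:
    description = PENN_TAGS.get(tag, "see the table")
    print(f"{word:8}{tag:5}{coarse_category(tag):10}{description}")
```

Grouping by prefix is a convenience, not part of the tagset itself. Tags such as MD (modal) or RP (particle) fall outside the four common categories and are reported here as "other".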

If you are keen on learning this tagset, try out this timed game.

Dependency parsing and parse trees

NLP pipelines can be used for many tasks. Dependency parsing is often used as one step or layer. It takes the part-of-speech tags assigned to words in a previous layer and creates a parse tree. The parse tree takes a sentence and splits it up sequentially, showing the relationships between the words. During this process, trees of parent and child words are created. Parse trees are used in many NLP tasks. However, remember that any errors in the POS tags will affect the accuracy of the parse tree. The example below shows how a simple sentence can be broken down and the relationships between individual words mapped onto a parse tree.

parse tree

Source: Wikicommons

fruit flies
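The parent and child relationships in a dependency parse can be stored very simply: each word records the position of its parent (its head). The sketch below is a hand-written annotation of a short sentence, not the output of a parser. The relation labels (det, nsubj, obj, root) follow Universal Dependencies naming, and the helper names children_of and render_tree are made up for this example.

```python
# Each token: (word, POS tag, index of its head word, relation label).
# A head index of -1 marks the root of the sentence.
SENTENCE = [
    ("The", "DT", 1, "det"),
    ("frog", "NN", 2, "nsubj"),
    ("ate", "VBD", -1, "root"),
    ("a", "DT", 4, "det"),
    ("fly", "NN", 2, "obj"),
]


def children_of(head_index, sentence):
    """Return the positions of all words whose parent is head_index."""
    return [i for i, (_, _, head, _) in enumerate(sentence) if head == head_index]


def render_tree(sentence, index=None, depth=0):
    """Return the tree as indented lines, each parent above its children."""
    if index is None:
        index = children_of(-1, sentence)[0]  # start from the root word
    word, tag, _, relation = sentence[index]
    lines = ["  " * depth + f"{word} ({tag}, {relation})"]
    for child in children_of(index, sentence):
        lines.extend(render_tree(sentence, child, depth + 1))
    return lines


tree_lines = render_tree(SENTENCE)
print("\n".join(tree_lines))
```

The verb sits at the top of the tree, with its subject and object as children and each determiner attached to its noun. If the tagger had wrongly labelled "fly" as a verb, a parser would be likely to build a different tree, which is why POS errors carry through to this layer.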

Activity 2: NLP Compromise

Watch and listen to this animated explanation of the JavaScript library NLP Compromise (17 min 52 sec).

Activity 3: Identifying tense using NLP Compromise

Try out this online tool created by a student team in 2019. This project was awarded a grade A. Great job! NLP Compromise is a JavaScript library that mimics a full-blown pipeline. It is completely rule-based. The problems in this tool all stem from NLP Compromise inaccurately tagging parts of speech.
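To see why a rule-based approach can mis-tag words, consider a deliberately naive rule written in Python. This is not how NLP Compromise works internally (its rules are far more extensive); the guess_past_tense function is a hypothetical example with a single rule.

```python
def guess_past_tense(word):
    """Toy rule: treat any word ending in -ed as past tense."""
    return word.lower().endswith("ed")


words = ["started", "carried", "swam", "fell", "bed", "hundred"]
guesses = {word: guess_past_tense(word) for word in words}

for word, is_past in guesses.items():
    print(f"{word:10}{'past' if is_past else 'not past'}")
```

The rule catches the regular verbs "started" and "carried", misses the irregular verbs "swam" and "fell", and wrongly flags "bed" and "hundred". Real rule-based systems add many more rules and word lists to reduce such errors, but some always remain.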

Activity 4: Experiment with NLP Compromise

Read this explanation of NLP Compromise written by its creator, Spencer Kelly. There are a couple of tutorials listed at the bottom of the page that should help you get started.

On its GitHub page you can find useful functions in the README section.

Activity 5: Natural Language Tool Kit (NLTK)

The Natural Language Tool Kit (NLTK) is one of the most popular libraries for creating NLP pipelines. There are many tutorials online to show you how to get started. For those who prefer a video introduction, check out the first video in a playlist by Sentdex, a popular programming YouTuber with over 900k subscribers. The topic is tokenizing.

Watch and listen to this short introduction to using NLTK with Python.
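Tokenizing means splitting text into sentences and then into words. The sketch below shows the idea using only Python's standard re module, so it runs without installing anything. It is a simplified stand-in, not NLTK's own method: NLTK provides sent_tokenize and word_tokenize in nltk.tokenize, which handle abbreviations and many other cases that these naive patterns do not. The function names split_sentences and split_words are made up for this example.

```python
import re

TEXT = "The father soon gave up and drowned. The son carried on swimming."


def split_sentences(text):
    """Naive rule: a sentence ends at ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_words(sentence):
    """Split into words (keeping internal apostrophes) and punctuation marks."""
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", sentence)


sentences = split_sentences(TEXT)
tokens = [split_words(sentence) for sentence in sentences]

for sentence_tokens in tokens:
    print(sentence_tokens)
```

Note that punctuation marks become tokens in their own right. Later layers, such as a POS tagger, usually expect the text in this tokenized form.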

Knowledge and application

Activity 6: Problem solving: NLP pipeline in Python

Create an NLP pipeline using NLTK. Your pipeline needs to process this text:

"Two frogs, a father and his son, accidentally fell into a bucket of milk. They started swimming for their lives. They swam for a long time, but there seemed no hope of their getting out. The father soon gave up and drowned. The son carried on swimming. During this time, the milk had begun to form a ball of butter. Using this island of butter as a platform, he managed to hop out of the bucket."

Solve as many of these problems as possible.

  1. Aspect: find all instances of progressive aspect
  2. Aspect: find all instances of perfect aspect
  3. Tense: find all instances of past tense
  4. Tense: find all instances of regular past tense
  5. Visualize: underline each verb group
  6. Visualize: display found instances in alphabetical list

Add comments in your code to show (1) the function of each important line of code and (2) the source of any code copied from tutorials, etc.

This pipeline can serve as a starting point for your final project. Submit your code, or a link to your code online, via ELMS.
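For the two visualization problems, one possible approach in a plain-text setting is to print a marker line under the text and to sort the found items without regard to case. The sketch below is only a starting idea, not a required solution. The helper names (find_span, underline, alphabetical) are hypothetical, and the example sentence, the verb group and the word list were chosen by hand rather than found by a tagger.

```python
def find_span(text, phrase):
    """Return the (start, end) character positions of phrase in text."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return (start, start + len(phrase))


def underline(text, spans, marker="~"):
    """Return the text plus a second line that marks each (start, end) span."""
    marks = [" "] * len(text)
    for start, end in spans:
        for position in range(start, end):
            marks[position] = marker
    return text + "\n" + "".join(marks).rstrip()


def alphabetical(items):
    """Remove duplicates and sort, ignoring upper and lower case."""
    return sorted(set(items), key=str.lower)


sentence = "She has been reading all day."
span = find_span(sentence, "has been reading")  # hand-picked verb group
underlined = underline(sentence, [span])
print(underlined)

found = alphabetical(["ran", "Jumped", "ate", "ran"])
print(found)  # ['ate', 'Jumped', 'ran']
```

In your own pipeline the spans would come from the tagged tokens rather than from a hand-typed phrase. The same idea carries over to HTML output, where an underline tag could replace the marker line.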

Enjoy an adaptation of this story in "Catch me if you can".

Review

Make sure you can explain the following 8 terms in simple English:

  1. natural language processing (NLP)
  2. NLP pipeline
  3. part-of-speech (POS) tagging
  4. POS tagset
  5. dependency parsing
  6. parse tree
  7. NLP Compromise
  8. Natural Language Tool Kit (NLTK)

"There are only two days in the year that nothing can be done. One is called yesterday and the other is called tomorrow, so today is the right day to love, believe, do and mostly live." - Dalai Lama

Copyright John Blake, 2020