By the end of this unit you should:
Read this to understand the concepts of NLP, POS tags and parse trees
A natural language processing (NLP) system is usually called an NLP pipeline. This is because it usually involves several stages (steps or layers) of processing. The pipeline is one-directional: there is an input (natural language) and an output (processed text). Simply put, NLP is the application of artificial intelligence to human languages.
Source: Morioh
Part-of-speech (POS) tagging is the act of labelling each word with its part of speech. The common parts of speech are noun, verb, adverb and adjective. However, most POS taggers use a much larger set of tags. The most popular POS tagset, the Penn Treebank tagset, has 36 tags. NLP pipelines that aim to map syntax or disambiguate meanings often use this layer. The Penn Treebank tagset is shown in the table below.
CC Coordinating conjunction | CD Cardinal number | DT Determiner |
EX Existential there | FW Foreign word | IN Preposition or subordinating conjunction |
JJ Adjective | JJR Adjective, comparative | JJS Adjective, superlative |
LS List item marker | MD Modal | NN Noun, singular or mass |
NNS Noun, plural | NNP Proper noun, singular | NNPS Proper noun, plural |
PDT Predeterminer | POS Possessive ending | PRP Personal pronoun |
PRP$ Possessive pronoun | RB Adverb | RBR Adverb, comparative |
RBS Adverb, superlative | RP Particle | SYM Symbol |
TO to | UH Interjection | VB Verb, base form |
VBD Verb, past tense | VBG Verb, gerund or present participle | VBN Verb, past participle |
VBP Verb, non-3rd person singular present | VBZ Verb, 3rd person singular present | WDT Wh-determiner |
WP Wh-pronoun | WP$ Possessive wh-pronoun | WRB Wh-adverb |
If you are keen on learning this tagset, try out this timed game.
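To make the idea of tagging concrete, here is a minimal sketch of a lookup-and-suffix tagger in plain Python. It is not how a real tagger works: the `LEXICON` dictionary and the suffix rules are hypothetical and hand-written for illustration only. Real taggers, such as the one behind NLTK's `nltk.pos_tag`, are trained on annotated text.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical mini-lexicon mapping lowercase words to Penn Treebank tags.
LEXICON: Dict[str, str] = {
    "the": "DT", "a": "DT", "frog": "NN", "frogs": "NNS",
    "bucket": "NN", "milk": "NN", "into": "IN", "of": "IN",
    "fell": "VBD", "swam": "VBD", "they": "PRP",
}

def tag(tokens: List[str]) -> List[Tuple[str, str]]:
    """Label each token with a Penn Treebank tag using crude rules."""
    tagged = []
    for tok in tokens:
        low = tok.lower()
        if low in LEXICON:                 # known word: look it up
            tagged.append((tok, LEXICON[low]))
        elif re.fullmatch(r"\d+", tok):    # digits only: cardinal number
            tagged.append((tok, "CD"))
        elif low.endswith("ly"):           # guess: adverb
            tagged.append((tok, "RB"))
        elif low.endswith("ing"):          # guess: gerund or present participle
            tagged.append((tok, "VBG"))
        elif low.endswith("ed"):           # guess: past tense verb
            tagged.append((tok, "VBD"))
        elif low.endswith("s"):            # guess: plural noun
            tagged.append((tok, "NNS"))
        else:                              # default: singular noun
            tagged.append((tok, "NN"))
    return tagged

tagged = tag(["The", "frogs", "fell", "into", "a", "bucket", "of", "milk"])
print(tagged)
```

Guesses like "ends in -ed, so it is a past tense verb" are often wrong (for example, "bed"), which is exactly why tagging errors are common and why they matter for later layers.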
NLP pipelines can be used for many tasks. Dependency parsing is a task that often forms one step or layer. It uses the part-of-speech tags assigned to words in a previous layer to create a parse tree. The parser takes one sentence at a time and works through it word by word, linking each word to the word it depends on. The result is a tree of parent and child words that shows the relationships between the words. Parse trees are used in many NLP tasks. However, remember that any errors in the POS tags will affect the accuracy of the parse tree. The example below shows how a simple sentence can be broken down and the relationships between individual words mapped onto a parse tree.
Source: Wikicommons
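A parse tree of parent and child words can also be stored as plain data. The sketch below is an illustration only: the sentence, its tags and its head links are annotated by hand, not produced by a parser, and the relation labels (`det`, `nsubj`, `advmod`) are borrowed from the Universal Dependencies naming scheme as an assumption. Each word records the position of its parent (its head), and the code turns those links into an indented tree.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# (word, POS tag, head position, relation). Words are numbered from 1;
# head position 0 means the word is the root of the sentence.
# Hand-annotated for illustration, not parser output.
SENTENCE: List[Tuple[str, str, int, str]] = [
    ("The", "DT", 2, "det"),
    ("son", "NN", 3, "nsubj"),
    ("swam", "VBD", 0, "root"),
    ("quickly", "RB", 3, "advmod"),
]

def build_children(sentence) -> Dict[int, List[int]]:
    """Map each head position to the positions of its child words."""
    children = defaultdict(list)
    for position, (_, _, head, _) in enumerate(sentence, start=1):
        children[head].append(position)
    return children

def render(sentence, children, node: int = 0, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, parents above their children."""
    lines = []
    for child in children.get(node, []):
        word, pos_tag, _, relation = sentence[child - 1]
        lines.append("  " * depth + f"{word}/{pos_tag} ({relation})")
        lines.extend(render(sentence, children, child, depth + 1))
    return lines

children = build_children(SENTENCE)
tree_lines = render(SENTENCE, children)
print("\n".join(tree_lines))
```

Notice that the head links rely on the tags being right: if "swam" were wrongly tagged as a noun, a parser would be unlikely to choose it as the root.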
Watch and listen to this animated explanation of the JavaScript library NLP Compromise (17 min 52 sec).
Try out this online tool created by a student team in 2019. This project was awarded grade A. Great job! NLP Compromise is a JavaScript library that mimics a full-blown pipeline. It is completely rule-based. The problems in this tool all stem from NLP Compromise tagging parts of speech inaccurately.
Read this explanation of NLP Compromise written by its creator, Spencer Kelly. There are a couple of tutorials listed at the bottom of the page that should help you get started.
On its GitHub page you can find useful functions in the Readme section.
The Natural Language Toolkit (NLTK) is one of the most popular libraries for creating NLP pipelines. There are many tutorials online to show you how to get started. For those who prefer a video introduction, check out the first video in a playlist by Sentdex, a popular programming YouTuber with over 900k subscribers. The topic is tokenizing.
Watch and listen to this short introduction to using NLTK with Python.
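Tokenizing means splitting text into sentences and then into words. NLTK provides `sent_tokenize` and `word_tokenize` for this. The sketch below is only a rough stand-in that uses Python's built-in `re` module, so you can see the idea before installing anything. Its two splitting rules are deliberately naive assumptions and are not NLTK's algorithm.

```python
import re
from typing import List

def split_sentences(text: str) -> List[str]:
    # Naive rule (an assumption): a sentence ends at ., ! or ? followed by
    # whitespace. This breaks on abbreviations such as "Dr. Smith".
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def split_words(sentence: str) -> List[str]:
    # A token is either a run of letters, digits or apostrophes,
    # or a single punctuation mark.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", sentence)

text = "NLP is fun. Is it hard? Not really!"
sentences = split_sentences(text)
tokens = [split_words(s) for s in sentences]
print(sentences)
print(tokens)
```

Punctuation marks are kept as separate tokens because later layers, such as a POS tagger, usually expect them.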
Create an NLP pipeline using NLTK. Your pipeline needs to process this text:
"Two frogs, a father and his son, accidentally fell into a bucket of milk. They started swimming for their lives. They swam for a long time, but there seemed no hope of their getting out. The father soon gave up and drowned. The son carried on swimming. During this time, the milk had begun to form a ball of butter. Using this island of butter as a platform, he managed to hop out of the bucket."
Solve as many of these problems as possible.
Add comments in your code to show (1) the function of each important line of code and (2) the source of any code copied from tutorials, etc.
This pipeline can serve as a starting point for your final project. Submit your code, or a link to your code online, via ELMS.
Enjoy an adaptation of this story in "Catch Me If You Can".
Make sure you can explain the following 8 terms in simple English: