
Unit 6: Python and Natural Language Processing

Learning outcomes

By the end of this unit you should:

  • be familiar with the basics of Python
  • have created some simple programs to process natural language
  • have solved some problems requiring natural language processing
  • have analyzed a complex NLP program written in Python
  • have drafted pseudocode for an authorship analysis tool
  • have identified aspects of Python and NLP to study in more depth

Activity 1 Python

Read.

This unit aims at helping you understand more about using Python for natural language processing (NLP). All computer science majors learn C and Java in their first and second year at the University of Aizu. Many students also learn C++. This means that concepts such as lists, arrays and loops should need no explanation.

Take this online quiz from W3Schools to assess how much you need to study before you can write a program in Python.

Activity 2 Introduction to Python

Watch this short introductory video.

This introductory video (6 mins 41 secs) covers the basics very quickly. It is probably too fast for those new to programming, but is suitable for those who already know what operators, lists and loops are.

Activity 3 The Python way

Work in pairs or threes. Name the following data types and explain the differences between them.

  1. ["apple","banana","carrot"]
  2. {"apple","banana","carrot"}
  3. {"food": "banana","colour": "yellow"}
  4. ("apple","banana","carrot")

Check your answers online.
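
You can also check your answers in the interpreter. The short sketch below asks Python to name the type of each literal and then shows two behaviours that differ between them. The labels and variable names are chosen here for illustration only.

```python
# The four literals from the activity, bound to illustrative labels.
samples = {
    "square brackets": ["apple", "banana", "carrot"],
    "curly braces": {"apple", "banana", "carrot"},
    "curly braces with colons": {"food": "banana", "colour": "yellow"},
    "round brackets": ("apple", "banana", "carrot"),
}

# type(x).__name__ gives the name Python itself uses for each data type.
type_names = {label: type(value).__name__ for label, value in samples.items()}
for label, name in type_names.items():
    print(f"{label}: {name}")

# Duplicates survive in a list but are dropped by a set.
list_length = len(["apple", "apple"])
set_length = len({"apple", "apple"})
print(list_length, set_length)

# A list can be changed in place; a tuple has no append method.
fruits = ["apple", "banana"]
fruits.append("carrot")
try:
    ("apple", "banana").append("carrot")
    tuple_changed = True
except AttributeError:
    tuple_changed = False
print(fruits, tuple_changed)
```

Try predicting each printed line before you run the code, then compare it with your group's explanation.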

Note that Python lists play the role that arrays play in C and Java. To work with true arrays in Python, you have to import a library. The most popular library for this is NumPy.


Activity 4: General introduction to NLP pipelines

Read this to understand the concepts of NLP, POS tags and parse trees.

Natural language processing pipeline

A natural language processing (NLP) system is usually called an NLP pipeline. This is because it usually involves several stages (steps or layers) of processing. The pipeline is one-directional: there is an input (natural language) and an output (processed text), and the output of each stage becomes the input of the next. Simply put, NLP is applying artificial intelligence to human languages.

NLP pipeline
Source: Morioh
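
The one-directional idea can be sketched in a few lines of plain Python. This is only a minimal illustration, not a real NLP system: the three stages, the function names and the tiny stop-word list are all assumptions made for this example.

```python
import re

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "the", "is", "to"}


def tokenize(text):
    """Stage 1: split raw text into word tokens."""
    return re.findall(r"[A-Za-z']+", text)


def lowercase(tokens):
    """Stage 2: normalize every token to lower case."""
    return [token.lower() for token in tokens]


def remove_stop_words(tokens):
    """Stage 3: drop very common words that carry little meaning."""
    return [token for token in tokens if token not in STOP_WORDS]


# The pipeline is an ordered list of stages.
PIPELINE = [tokenize, lowercase, remove_stop_words]


def run_pipeline(text, stages=PIPELINE):
    """Pass the data through each stage in order, in one direction only."""
    data = text
    for stage in stages:
        data = stage(data)
    return data


result = run_pipeline("The monkey wants to eat a banana.")
print(result)  # ['monkey', 'wants', 'eat', 'banana']
```

Because each stage only receives the output of the previous one, a stage can be swapped or added without changing the others.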
Part-of-speech (POS) tagging

POS tagging is the act of labelling words with a particular part of speech. The common parts of speech are noun, verb, adverb and adjective. However, most POS taggers use a much larger set of tags. The most popular POS tagset, the Penn Treebank tagset, has 36 tags. NLP pipelines that aim to map syntax or disambiguate meanings often use this layer. The Penn Treebank tagset is shown in the table below.

Tag    Description
CC     Coordinating conjunction
CD     Cardinal number
DT     Determiner
EX     Existential there
FW     Foreign word
IN     Preposition or subordinating conjunction
JJ     Adjective
JJR    Adjective, comparative
JJS    Adjective, superlative
LS     List item marker
MD     Modal
NN     Noun, singular or mass
NNS    Noun, plural
NNP    Proper noun, singular
NNPS   Proper noun, plural
PDT    Predeterminer
POS    Possessive ending
PRP    Personal pronoun
PRP$   Possessive pronoun
RB     Adverb
RBR    Adverb, comparative
RBS    Adverb, superlative
RP     Particle
SYM    Symbol
TO     to
UH     Interjection
VB     Verb, base form
VBD    Verb, past tense
VBG    Verb, gerund or present participle
VBN    Verb, past participle
VBP    Verb, non-3rd person singular present
VBZ    Verb, 3rd person singular present
WDT    Wh-determiner
WP     Wh-pronoun
WP$    Possessive wh-pronoun
WRB    Wh-adverb
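
To see what tagged text looks like, here is a toy sketch that labels words by simple dictionary lookup, using tags from the table above. The lexicon and the function name toy_tag are invented for this example. A real tagger also uses the surrounding words to choose a tag, so treat this only as an illustration of the output format: a list of (word, tag) pairs.

```python
# Hand-written toy lexicon (an assumption for this sketch, not a real tagger).
TOY_LEXICON = {
    "the": "DT",
    "monkey": "NN",
    "eats": "VBZ",
    "a": "DT",
    "yellow": "JJ",
    "banana": "NN",
    "bananas": "NNS",
}


def toy_tag(tokens, default="NN"):
    """Label each token with a Penn Treebank tag by dictionary lookup.

    Words missing from the lexicon receive the default tag.
    """
    return [(token, TOY_LEXICON.get(token.lower(), default)) for token in tokens]


tagged = toy_tag("The monkey eats a yellow banana".split())
for word, tag in tagged:
    print(f"{word}\t{tag}")
```

Note the weakness of pure lookup: every unknown word is tagged NN here, whether or not it is a noun.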

If you are keen on learning this tagset, try out this timed game.

Dependency parsing and parse trees

NLP pipelines can be used for many tasks. Dependency parsing is often used as one step or layer. It takes the part-of-speech tags assigned to words in a previous layer and creates a parse tree for each sentence. The parse tree shows the relationships between the words by linking each word to its parent word, so that trees of parent and child words are created. Parse trees are used in many NLP tasks. However, remember that any errors in the POS tags will affect the accuracy of the parse tree. The example below shows how a simple sentence can be broken down and the relationships between individual words mapped onto a parse tree.

parse tree

Source: Wikicommons
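
The parent and child relationships can also be stored as plain data. The sketch below is hand-annotated for illustration and is not the output of a real parser. The sentence, the relation labels (det, nsubj, obj) and the function names are assumptions made for this example.

```python
# Hand-annotated example (an assumption for illustration, not parser output).
# Each entry: (word, POS tag, index of parent word, relation label).
# A parent index of -1 marks the root of the tree.
SENTENCE = [
    ("The", "DT", 1, "det"),
    ("cat", "NN", 2, "nsubj"),
    ("chased", "VBD", -1, "root"),
    ("a", "DT", 4, "det"),
    ("mouse", "NN", 2, "obj"),
]


def children_of(parent_index):
    """Return the positions of all words whose parent is parent_index."""
    return [
        position
        for position, (_, _, head, _) in enumerate(SENTENCE)
        if head == parent_index
    ]


def render(index, depth=0):
    """Return the subtree under one word as a list of indented lines."""
    word, tag, _, relation = SENTENCE[index]
    lines = [f"{'  ' * depth}{word} ({tag}, {relation})"]
    for child in children_of(index):
        lines.extend(render(child, depth + 1))
    return lines


root_index = children_of(-1)[0]
tree_lines = render(root_index)
print("\n".join(tree_lines))
```

The verb is the root, the two nouns are its children, and each determiner is the child of its noun. If "chased" had been wrongly tagged as a noun in the previous layer, a parser would be likely to build a different and incorrect tree.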


Activity 5: Natural Language Tool Kit (NLTK)

The Natural Language Tool Kit (NLTK) is one of the most popular libraries for creating NLP pipelines. There are many tutorials online to show you how to get started. For those who prefer a video introduction, check out the first video in this playlist. The topic is tokenizing. Sentdex is a popular programming YouTuber with over a million subscribers.

Watch and listen to this short introduction to using NLTK with Python.

Activity 6: Teach to learn

Work in a group. Each group will be allocated a topic. Learn your assigned topic and prepare to explain the topic.

  • Group 1: TBD
  • Group 2: TBD
  • Group 3: TBD
  • Group 4: TBD
  • Group 5: TBD
  • Group 6: TBD
  • Group 7: TBD
  • Group 8: TBD
  • Group 9: TBD
  • Group 10: TBD

Cross group and explain your topic to students who have prepared a different topic. Change groups and repeat.

Activity 7: Explanation of authorship analysis program

Listen in English to the explanations of the Python code created by Jonathan Dunn, which is available on GitHub. These explanations were created by students in Spring 2022. If you find any errors, please notify your tutor.

  • Steps 1 to 4 Lines 0 to 1110 General introduction
  • Steps 1 and 2 Lines 0 to 100
  • Step 3 Lines 101 to 229
  • Step 3 Lines 230 to 343
  • Step 3 Lines 344 to 502
  • Step 3 Lines 503 to 641
  • Step 3 Lines 642 to 726
  • Step 3 Lines 727 to 814
  • Step 3 Lines 815 to 998
  • Step 4 Lines 999 to 1110

If you prefer, you can also listen to explanations of the code in Japanese.

  • Steps 1 to 4 Lines 0 to 1110 General introduction
  • Steps 1 and 2 Lines 0 to 100
  • Step 3 Lines 101 to 229
  • Step 3 Lines 230 to 343
  • Step 3 Lines 344 to 502
  • Step 3 Lines 503 to 641
  • Step 3 Lines 642 to 726
  • Step 3 Lines 727 to 814
  • Step 3 Lines 815 to 998
  • Step 4 Lines 999 to 1110

Activity 8: Example of Pseudocode and Python

Read the pseudocode and compare it to the Python code.

Pseudocode

Retrieve the number to be reversed from the user and store it in the variable sample_number.

Initialize the temporary variable test_number as zero.

Repeat the following steps in a while loop for as long as sample_number is greater than zero.

Divide sample_number by 10 and store the remainder.

Multiply test_number by 10 and add the remainder to the result.

Replace sample_number with the whole-number result of dividing it by 10.

Print the generated test_number to the console.

Python code

sample_number = int(input("Number to be reversed: "))
test_number = 0
# Repeat while digits remain in sample_number.
while sample_number > 0:
    # Take the last digit, append it to test_number, then drop it.
    remainder_number = sample_number % 10
    test_number = (test_number * 10) + remainder_number
    sample_number = sample_number // 10
print("Value after reverse : {}".format(test_number))

Source: Python Tutorial on EDUCBA.
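
The same loop can be wrapped in a function so that it can be tried on several numbers without typing input each time. This is a sketch, assuming the number is a non-negative integer; the function name reverse_number is chosen here for illustration.

```python
def reverse_number(sample_number):
    """Reverse the decimal digits of a non-negative integer."""
    test_number = 0
    while sample_number > 0:
        remainder_number = sample_number % 10               # last digit
        test_number = test_number * 10 + remainder_number   # append it
        sample_number = sample_number // 10                 # drop it
    return test_number


print(reverse_number(1234))  # 4321
print(reverse_number(1200))  # 21, because leading zeros are not kept
print(reverse_number(0))     # 0, because the loop body never runs
```

Tracing reverse_number(1234) by hand, one pass of the loop at a time, is a good way to check that the pseudocode and the Python code describe the same steps.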

Knowledge and application

Activity 9: Python practice

Work alone, but feel free to ask your friends for help. Practice using Python for some natural language processing tasks. You can use Terminal, any IDE, or an online compiler. Submit a Python file with solutions to the following problems. Name the file with your student number, e.g. s12322312.py

  1. Parse a sentence to find the word "banana".
  2. Insert "(fruit)" after matching the word "banana".
  3. Search a sentence for "banana", print "found" or "not found" depending on the result.
  4. Search a sentence for "banana", print the number of instances the word "banana" is found.
  5. Create three string variables. Each string is a short sentence. Randomly select one variable, and search that string for the word "banana".

Activity 10: Analyze and explain an authorship analysis program

Read this Python code created by Jonathan Dunn available on GitHub.

Explain the section of the code you are allocated. Record a Japanese and an English version of the same explanation. Submit both audio files via ELMS. The recommended length is around 2 minutes. However, the clarity of the content is more important than the length of the recording. Your topic is decided by the final digit of your student ID number. See the list below. Your audio file may be uploaded to this website for other students to listen to. Do not state your name or personal information! Speak clearly. Name the files by the starting and finishing line numbers and the two-letter code for the language, e.g. 0-100_en or 0-100_jp.

  • 0: Steps 1 and 2 Lines 0 to 100
  • 1: Step 3 Lines 101 to 229
  • 2: Step 3 Lines 230 to 343
  • 3: Step 3 Lines 344 to 502
  • 4: Step 3 Lines 503 to 641
  • 5: Step 3 Lines 642 to 726
  • 6: Step 3 Lines 727 to 814
  • 7: Step 3 Lines 815 to 998
  • 8: Step 4 Lines 999 to 1110
  • 9: Steps 1 to 4 Lines 0 to 1110 General introduction

Activity 11: Pseudocode program

Write in plain language the steps needed in a program to analyze authorship automatically. Work in a team. Only the team leader needs to submit the pseudocode to ELMS. This pseudocode is likely to become the basis for your final project.

Select one of the following options.

  1. Authorship verification for one text
  2. Authorship verification for two different texts
  3. Authorship profiling for age
  4. Authorship profiling for education level
  5. Authorship profiling for another aspect
  6. Authorship attribution of questioned text to one of two known texts
  7. Authorship attribution of questioned text to one of ten known texts

Review

Make sure you can explain the differences between the following in simple English:

  1. No new authorship terms in this unit.

Running count: 58 of 60 concepts covered so far.

"Python is the most powerful programming language that you can read." - Unknown

Copyright John Blake, 2022