Unit 6: Python and Natural Language Processing

Learning outcomes

By the end of this unit you should:

be familiar with the basics of Python
have created some simple programs to process natural language
have solved some problems requiring natural language processing
have analyzed a complex NLP program written in Python
have drafted pseudocode for an authorship analysis tool
have identified aspects of Python and NLP to study in more depth

Activity 1: Python

Read.

This unit aims at helping you understand more about using Python for natural language processing (NLP). All computer science majors learn C and Java in their first and second year at the University of Aizu. Many students also learn C++. This means that concepts such as lists, arrays and loops should need no explanation.

Take this online quiz from W3Schools to assess how much you need to study before you can write a program in Python.

Activity 2: Introduction to Python

Watch this short introductory video.

This introductory video (6 mins 41 secs) covers the basics at a brisk pace. It is probably too fast for those new to programming, but is suitable for those who already know what operators, lists and loops are.

Activity 3: The Python way

Work in pairs or threes. Name and explain the difference between the following data types.

["apple","banana","carrot"]
{"apple","banana","carrot"}
{"food": "banana","colour": "yellow"}
("apple","banana","carrot")

Check your answers online.
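
You can also check your answers by asking Python itself. The short sketch below is only an illustration (the variable names are made up for this example): it stores the four literals from the activity and prints the name of the built-in type that Python reports for each one.

```python
# The four literals from the activity, stored under neutral names.
samples = [
    ["apple", "banana", "carrot"],
    {"apple", "banana", "carrot"},
    {"food": "banana", "colour": "yellow"},
    ("apple", "banana", "carrot"),
]

# type(x).__name__ gives the name of the built-in type as a string.
type_names = [type(sample).__name__ for sample in samples]

for sample, name in zip(samples, type_names):
    print(name, "->", sample)
```

Run it after you have discussed your answers, then try changing one item in each sample to see which types allow changes.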

Note that to work with arrays in Python, you have to import a library. The most popular library for this is NumPy.

Activity 4: General introduction to NLP pipelines

Read this to understand the concepts of NLP, POS tags and parse trees.

Natural language processing pipeline

A natural language processing (NLP) system is usually called an NLP pipeline. This is because it usually involves several stages (steps or layers) of processing. The pipeline is one-directional: there is an input (natural language) and an output (processed text). Simply put, NLP is the application of artificial intelligence to human languages.

Source: Morioh
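
The idea of stages applied in one direction can be sketched in a few lines of Python. This is a minimal illustration, not a real NLP system: the three stages, the function names and the tiny stopword list are all assumptions chosen for this example.

```python
import re

def tokenize(text):
    """Stage 1: split raw text into word tokens."""
    return re.findall(r"[A-Za-z']+", text)

def lowercase(tokens):
    """Stage 2: normalize case."""
    return [t.lower() for t in tokens]

def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "an", "is"})):
    """Stage 3: drop very common words (tiny illustrative list)."""
    return [t for t in tokens if t not in stopwords]

def run_pipeline(text, stages):
    """Pass the data through each stage in order, in one direction only."""
    data = text
    for stage in stages:
        data = stage(data)
    return data

PIPELINE = [tokenize, lowercase, remove_stopwords]
result = run_pipeline("The pipeline is a sequence of stages.", PIPELINE)
print(result)
```

Each stage takes the output of the previous stage as its input, so an error in an early stage is passed on to every later stage.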

Part-of-speech (POS) tagging

POS tagging is the act of labelling words with a particular part of speech. The common parts of speech are noun, verb, adverb and adjective. However, most POS taggers use a much larger set of tags. The most popular POS tagset, the Penn Treebank tagset, has 36 tags and is shown in the table below. NLP pipelines that aim to map syntax or disambiguate meanings often use this layer.

CC Coordinating conjunction	CD Cardinal number	DT Determiner
EX Existential there	FW Foreign word	IN Preposition or subordinating conjunction
JJ Adjective	JJR Adjective, comparative	JJS Adjective, superlative
LS List item marker	MD Modal	NN Noun, singular or mass
NNS Noun, plural	NNP Proper noun, singular	NNPS Proper noun, plural
PDT Predeterminer	POS Possessive ending	PRP Personal pronoun
PRP$ Possessive pronoun	RB Adverb	RBR Adverb, comparative
RBS Adverb, superlative	RP Particle	SYM Symbol
TO to	UH Interjection	VB Verb, base form
VBD Verb, past tense	VBG Verb, gerund or present participle	VBN Verb, past participle
VBP Verb, non-3rd person singular present	VBZ Verb, 3rd person singular present	WDT Wh-determiner
WP Wh-pronoun	WP$ Possessive wh-pronoun	WRB Wh-adverb

If you are keen on learning this tagset, try out this timed game.
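
To see what "labelling words with a tag" looks like in code, here is a toy sketch. It is not how real taggers work: the lexicon below is a hypothetical hand-made lookup table, and unknown words simply receive a default tag. Real taggers use the surrounding context to choose between possible tags.

```python
# Hypothetical mini-lexicon mapping words to Penn Treebank tags.
# Real taggers use context; this lookup table cannot resolve ambiguity.
TOY_LEXICON = {
    "the": "DT",
    "dog": "NN",
    "dogs": "NNS",
    "barked": "VBD",
    "loudly": "RB",
    "big": "JJ",
}

def toy_pos_tag(tokens, default_tag="NN"):
    """Label each token with a tag from TOY_LEXICON, or default_tag if unknown."""
    return [(tok, TOY_LEXICON.get(tok.lower(), default_tag)) for tok in tokens]

tagged = toy_pos_tag(["The", "big", "dog", "barked", "loudly"])
print(tagged)
```

The output is a list of (word, tag) pairs. NLTK's `nltk.pos_tag` function returns pairs in the same shape, using Penn Treebank tags by default.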

Dependency parsing and parse trees

NLP pipelines can be used for many tasks. Dependency parsing is often used as one step or layer. It takes the part-of-speech tags assigned to words in a previous layer and creates a parse tree. The parse tree covers one sentence and shows the relationships between its words: each word is linked to a parent word, creating trees of parent and child words. Parse trees are used in many NLP tasks. However, remember that any errors in the POS tags will affect the accuracy of the parse tree. The example below shows how a simple sentence can be broken down and the relationships between individual words mapped out onto a parse tree.

Source: Wikicommons
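
The parent and child structure of a parse tree can be stored as a simple data structure. The sketch below is an illustration under stated assumptions: the analysis of the sentence is written by hand (no parser is run), and the relation labels (det, nsubj, obj, root) are illustrative names in the style commonly used for dependency relations.

```python
from collections import defaultdict

# Hand-written dependency analysis of "The dog chased the cat".
# Each entry: (position, word, position of parent, relation label).
# Parent position 0 means the word is the root of the tree.
PARSE = [
    (1, "The", 2, "det"),
    (2, "dog", 3, "nsubj"),
    (3, "chased", 0, "root"),
    (4, "the", 5, "det"),
    (5, "cat", 3, "obj"),
]

def build_children(parse):
    """Map each parent position to the positions of its child words."""
    children = defaultdict(list)
    for position, _word, parent, _relation in parse:
        children[parent].append(position)
    return children

def show_tree(parse, children, parent=0, depth=0):
    """Return the tree as indented lines, parents above their children."""
    words = {position: (word, relation) for position, word, _p, relation in parse}
    lines = []
    for position in children[parent]:
        word, relation = words[position]
        lines.append("  " * depth + f"{word} ({relation})")
        lines.extend(show_tree(parse, children, position, depth + 1))
    return lines

tree_lines = show_tree(PARSE, build_children(PARSE))
print("\n".join(tree_lines))
```

The verb is printed first because it is the root; each level of indentation shows a child of the word above it.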

Activity 5: Natural Language Toolkit (NLTK)

The Natural Language Toolkit (NLTK) is one of the most popular libraries for creating NLP pipelines. There are many tutorials online to show you how to get started. For those who prefer a video introduction, check out the first video in this playlist. The topic is tokenizing. Sentdex is a popular programming YouTuber with over a million subscribers.

Watch and listen to this short introduction to using NLTK with Python.
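
Tokenizing means splitting text into sentences and words. As a rough stand-in that needs no installation, the sketch below uses only Python's standard `re` module. The two function names are made up for this example. NLTK provides its own `sent_tokenize` and `word_tokenize` functions, which handle contractions, abbreviations and many other cases differently, so expect their output to differ from this sketch.

```python
import re

def simple_sent_tokenize(text):
    """Split after ., ! or ? when followed by whitespace (misses abbreviations)."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def simple_word_tokenize(sentence):
    """Return words and punctuation marks as separate tokens."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

text = "Hello there. How are you? I'm fine!"
sentences = simple_sent_tokenize(text)
tokens = [simple_word_tokenize(s) for s in sentences]
print(sentences)
print(tokens)
```

Try a sentence containing "Dr. Smith" to see why real sentence tokenizers need more than one regular expression.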

Activity 6: Teach to learn

Work in a group. Each group will be allocated a topic. Learn your assigned topic and prepare to explain the topic.

Group 1: TBD
Group 2: TBD
Group 3: TBD
Group 4: TBD
Group 5: TBD
Group 6: TBD
Group 7: TBD
Group 8: TBD
Group 9: TBD
Group 10: TBD

Cross group and explain your topic to students who have prepared a different topic. Change groups and repeat.

Activity 7: Explanation of authorship analysis program

Listen in English to the explanations of the Python code created by Jonathan Dunn, which is available on GitHub. These explanations were created by students in Spring 2022. If you find any errors, please notify your tutor.

Steps 1 to 4 Lines 0 to 1110 General introduction
Steps 1 and 2 Lines 0 to 100
Step 3 Lines 101 to 229
Step 3 Lines 230 to 343
Step 3 Lines 344 to 502
Step 3 Lines 503 to 641
Step 3 Lines 642 to 726
Step 3 Lines 727 to 814
Step 3 Lines 815 to 998
Step 4 Lines 999 to 1110

If you prefer, you can also listen to explanations of the code in Japanese.

Steps 1 to 4 Lines 0 to 1110 General introduction
Steps 1 and 2 Lines 0 to 100
Step 3 Lines 101 to 229
Step 3 Lines 230 to 343
Step 3 Lines 344 to 502
Step 3 Lines 503 to 641
Step 3 Lines 642 to 726
Step 3 Lines 727 to 814
Step 3 Lines 815 to 998
Step 4 Lines 999 to 1110

Activity 8: Example of Pseudocode and Python

Read the pseudocode and compare it to the Python code.

Pseudocode

Retrieve the number to be reversed from the user into variable sample_number.

Initialize the temporary variable test_number as zero.

Repeat a while loop as long as the sample_number variable is greater than zero.

Divide the sample_number variable by 10 and store the remainder (the modulus).

Multiply test_number by 10 and add the remainder to the result.

Integer-divide sample_number by 10 to drop its last digit.

Print the generated test_number onto the console.

Python code

sample_number = int(input("Number to be reversed: "))
test_number = 0
while sample_number > 0:
    remainder_number = sample_number % 10
    test_number = (test_number * 10) + remainder_number
    sample_number = sample_number // 10
print("Value after reverse : {}".format(test_number))

Source: Python Tutorial on EDUCBA.
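
To follow the loop step by step, it can help to wrap the same logic in a function and print the variables after each pass. The sketch below is one possible way to do this; the function name reverse_number and the trace format are assumptions for this example, not part of the original tutorial.

```python
def reverse_number(sample_number):
    """Reverse the decimal digits of a non-negative integer."""
    test_number = 0
    while sample_number > 0:
        remainder_number = sample_number % 10
        test_number = test_number * 10 + remainder_number
        sample_number //= 10
        # Show the state of the variables after each pass through the loop.
        print(f"remainder={remainder_number} test_number={test_number} sample_number={sample_number}")
    return test_number

reversed_value = reverse_number(1234)
print("Value after reverse :", reversed_value)
```

Compare each printed line with the matching step in the pseudocode. Note that trailing zeros are lost: reversing 120 gives 21.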

Knowledge and application

Activity 9: Python practice

Work alone, but feel free to ask your friends for help. Practice using Python for some natural language processing tasks. You can use Terminal, any IDE, or an online compiler. Copy and paste the code into the submission form.

Parse a sentence to find the word "but" starting with either a lowercase or an uppercase letter.
Insert "(conjunction)" after each match of the word "but".
Search a sentence for "But" and print "sentence-initial conjunction" or "punctuation mistake" depending on the result.
Search a sentence for all instances of "but" and print the number of times the word "but" is found.
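
As a hint, the sketch below shows the kind of tools you might use, applied to a different word ("and") so that the tasks above remain yours to solve. It relies only on Python's standard `re` module. The example sentence and variable names are made up for illustration, and this is one possible approach rather than the expected answer.

```python
import re

sentence = "And then we left, and nobody noticed the band."

# \b marks a word boundary, so "band" is not matched.
# re.IGNORECASE lets one pattern match "And" as well as "and".
pattern = re.compile(r"\band\b", flags=re.IGNORECASE)

matches = pattern.findall(sentence)
count = len(matches)

# Insert a label after every match; \g<0> stands for the matched text.
labelled = pattern.sub(r"\g<0> (conjunction)", sentence)

# A case-sensitive check at the start of the sentence only.
starts_with_capital_and = re.match(r"And\b", sentence) is not None

print(matches, count)
print(labelled)
print(starts_with_capital_and)
```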

Activity 10: Authorship marker analysis, pseudocode and Python program for adjectives

Describe your assigned authorship marker. Compare its usage with another related authorship marker. Write in plain language the steps needed in a program to automatically analyze your assigned language marker. Your code needs to identify, classify, count, compare and generate a value for the authorship feature. Work in a team. Only the team leader needs to submit the pseudocode and the Python program to ELMS.

Your assigned task is given below.

Team A, H: attributive vs. predicative adjectives (e.g. good man vs. The man is good)
Team B, I: coordinate vs. non-coordinate adjectives (e.g. quick, lazy fox vs. quick lazy fox)
Team C, J: gradable vs. non-gradable adjectives (e.g. cold, freezing)
Team D, K: prepositions of place vs. prepositions of time (e.g. in the bed, in the morning)
Team E, L: adverbials of place vs. adverbials of time (e.g. on the floor, on Saturday)
Team F, M: fronted adverbs of manner vs. regular adverbs of manner (e.g. He quickly ate it. vs. He ate it quickly.)
Team G, N: sentence-initial transitional adverbs vs. delayed transitional adverbs (e.g. However, Tony did it. vs. Tony, however, did it.)
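
The five verbs in the task (identify, classify, count, compare, generate a value) can be sketched with a marker pair that is not assigned to any team: contracted vs. full negative forms (e.g. don't vs. do not). Everything in this sketch is an assumption made for illustration: the two regular expressions, the short list of auxiliary verbs, the function name and the choice of "share of contracted forms" as the value. Your own marker will need different patterns and probably POS tags.

```python
import re

# Hypothetical marker pair: contracted vs. full negative forms.
CONTRACTED = re.compile(r"\b\w+n't\b", flags=re.IGNORECASE)
FULL = re.compile(
    r"\b(?:do|does|did|is|are|was|were|can|could|will|would|should)\s+not\b",
    flags=re.IGNORECASE,
)

def marker_value(text):
    """Identify, classify and count both forms, then return the share of contracted forms."""
    contracted = len(CONTRACTED.findall(text))
    full = len(FULL.findall(text))
    total = contracted + full
    share = contracted / total if total else 0.0
    return {"contracted": contracted, "full": full, "contracted_share": share}

text_a = "I don't know. She isn't here. We do not agree."
text_b = "I do not know. She is not here. We did not agree."

value_a = marker_value(text_a)
value_b = marker_value(text_b)
print(value_a)
print(value_b)
```

Comparing the two values gives a simple number that distinguishes the writers of the two texts. A real tool would need far more text than this before the difference means anything.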

Activity 11: Authorship marker analysis, pseudocode and Python program for verb phrases

Describe your NEW assigned authorship marker. Compare its usage with another related authorship marker. Write in plain language the steps needed in a program to automatically analyze your assigned language marker. Adapt your original code submitted for Activity 10. Your code needs to identify, classify, count, compare and generate a value for the authorship feature. Work in a team. Only the team leader needs to submit the pseudocode and the Python program to ELMS.

Your assigned task is given below.

Team A, H: passive voice vs. non-passive voice (e.g. It was found. vs. The police found it.)
Team B, I: tensed verbs vs. modalized verbs (e.g. The police found it. vs. The police may have found it.)
Team C, J: planned future vs. arranged future (e.g. I am going to go there tomorrow. vs. I'm going there tomorrow.)
Team D, K: will future vs. going-to future (e.g. I'll go there tomorrow. vs. I am going to go there tomorrow.)
Team E, L: perfect aspect/tense vs. non-perfect aspect/tense (e.g. The police have found it. vs. The police found it.)
Team F, M: regular past tenses vs. irregular past tenses (e.g. discovered vs. found)
Team G, N: regular verbs vs. phrasal verbs (e.g. find vs. come across)

Review

Make sure you can explain the differences between the following in simple English:

No new authorship terms in this unit.

Running count: 58 of 60 concepts covered so far.