By the end of this unit you should:
Read page 4 of this book chapter.
When you are ready, work in pairs. Compare and contrast the following:
The feature visualizer is a prototype tool that is under development. It is designed to help learners become familiar with the genre of short scientific research articles. In the UoA, the graduation thesis for undergraduates takes the form of such an article. The feature visualizer comprises a small database of research articles. Various language features can be identified and highlighted in the current version. Video and textual explanations will be added in the near future to give users more detail about the language that is visualized. The accuracy and usability of the functions vary.
Check out the functionalities listed below. Evaluate the accuracy, scope and usability of each function. Remember the purpose of this tool is pedagogic.
Discuss ways to improve the tool with a partner. Share your conclusions with your tutor.
Work in pairs. Design a rule-based expert system that identifies the appropriate grammatical structure to use in one of the following scenarios. Once you have worked out your system, create a decision tree using flowchart symbols, or a set of if-then decision rules (i.e. a rulebase).
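As a starting point, here is a minimal sketch of what a rulebase can look like in Python. The scenario (choosing an English article), the three facts and the three rules are all hypothetical and heavily simplified; they are only there to show the shape of a set of if-then rules evaluated in order, not to prescribe the rules for your own system.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Facts about the noun phrase, supplied by the user (hypothetical fact names).
Facts = Dict[str, bool]


@dataclass
class Rule:
    description: str                    # human-readable IF ... THEN ... text
    condition: Callable[[Facts], bool]  # the IF part
    conclusion: str                     # the THEN part


# Rules are checked top to bottom; the first rule whose condition holds wins.
RULEBASE: List[Rule] = [
    Rule("IF the reader can identify the referent THEN use 'the'",
         lambda f: f["specific"], "the"),
    Rule("IF the noun is countable AND singular THEN use 'a/an'",
         lambda f: f["countable"] and not f["plural"], "a/an"),
    Rule("OTHERWISE use no article",
         lambda f: True, "no article"),
]


def infer(facts: Facts) -> str:
    """Return the conclusion of the first rule that matches the facts."""
    for rule in RULEBASE:
        if rule.condition(facts):
            return rule.conclusion
    raise ValueError("no rule matched")


print(infer({"specific": False, "countable": True, "plural": False}))
```

Keeping the rules in a list, separate from the `infer` function, means that you can add or reorder rules without touching the inference code. Each entry corresponds to one branch of your decision tree.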
Read.
Regular expressions look daunting. There is a lot to learn, but it can be approached systematically by dividing the knowledge to be acquired into manageable blocks. There are many tutorial sites geared to helping learners understand and use regular expressions. I recommend using the website RegexOne to help you practise each of the following concepts. You will learn how to match, skip and capture characters and groups. There is a lesson and an exercise for each of the concepts listed below. Each lesson should only take a few minutes. If you are an expert, you should be finished in 15 minutes or less.
The solution to the exercise for Lesson 1 is shown below. When you solve the problem, you can continue to the next stage. The website offers solutions, but I strongly advise you to attempt the exercises yourself. If you cannot solve these exercises, you will almost certainly struggle with the assignment later in this course. Learn regex now.
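Once you have worked through the lessons, you can try the same ideas in Python with the built-in `re` module. The sketch below is not a RegexOne solution. The filenames and the pattern are made up for illustration. It shows matching (literal characters and character classes), skipping (text that is matched but not kept) and capturing (parentheses).

```python
import re

# Made-up sample text: two PDF files with a year in the name, one text file.
text = "file_2023.pdf, notes.txt, report_2019.pdf"

# (\w+)    capture the name before the underscore
# _        match a literal underscore (matched, but not captured)
# (\d{4})  capture exactly four digits
# \.pdf    match the extension; the dot is escaped so it is a literal dot
pattern = re.compile(r"(\w+)_(\d{4})\.pdf")

# With two capture groups, findall returns a list of (name, year) tuples.
matches = pattern.findall(text)
print(matches)
```

`notes.txt` is skipped entirely because it has no underscore, no four-digit number and the wrong extension.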
Read
POS tagging is the act of labelling each word with its part of speech. The common parts of speech are noun, verb, adverb and adjective. However, most POS taggers use a much larger set of tags. The most popular POS tagset, the Penn Treebank tagset, has 36 tags. NLP pipelines that aim to map syntax or disambiguate meanings often include POS tagging as one of their layers. The Penn Treebank tagset is shown in the table below.
CC Coordinating conjunction | CD Cardinal number | DT Determiner |
EX Existential there | FW Foreign word | IN Preposition or subordinating conjunction |
JJ Adjective | JJR Adjective, comparative | JJS Adjective, superlative |
LS List item marker | MD Modal | NN Noun, singular or mass |
NNS Noun, plural | NNP Proper noun, singular | NNPS Proper noun, plural |
PDT Predeterminer | POS Possessive ending | PRP Personal pronoun |
PRP$ Possessive pronoun | RB Adverb | RBR Adverb, comparative |
RBS Adverb, superlative | RP Particle | SYM Symbol |
TO to | UH Interjection | VB Verb, base form |
VBD Verb, past tense | VBG Verb, gerund or present participle | VBN Verb, past participle |
VBP Verb, non-3rd person singular present | VBZ Verb, 3rd person singular present | WDT Wh-determiner |
WP Wh-pronoun | WP$ Possessive wh-pronoun | WRB Wh-adverb |
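To see how the fine-grained tags in the table relate to the four common parts of speech, here is a small sketch that groups Penn Treebank tags by their prefix. The function name `coarse_class` and the example sentence are hypothetical, and the tags in the example were assigned by hand for illustration, not produced by a tagger.

```python
def coarse_class(tag: str) -> str:
    """Map a Penn Treebank tag to one of the four common parts of speech."""
    if tag.startswith("NN"):                 # NN, NNS, NNP, NNPS
        return "noun"
    if tag.startswith("VB"):                 # VB, VBD, VBG, VBN, VBP, VBZ
        return "verb"
    if tag.startswith("JJ"):                 # JJ, JJR, JJS
        return "adjective"
    if tag.startswith("RB") or tag == "WRB": # RB, RBR, RBS, WRB
        return "adverb"
    return "other"                           # determiners, prepositions, etc.


# A hand-tagged sentence: each item is a (word, tag) pair.
tagged = [("The", "DT"), ("tagger", "NN"), ("labels", "VBZ"),
          ("longer", "JJR"), ("sentences", "NNS"), ("slowly", "RB")]

for word, tag in tagged:
    print(f"{word:10} {tag:4} {coarse_class(tag)}")
```

Grouping by prefix works because related tags in this tagset share their first two letters. The wh-adverb tag `WRB` is the one exception handled separately here.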
The Natural Language Toolkit (NLTK) is one of the most popular libraries for creating NLP pipelines. There are many tutorials online to show you how to get started. For those who prefer a video introduction, check out the first video in a playlist by Sentdex, a popular programming YouTuber. The topic is tokenizing.
Watch and listen to this short introduction to using NLTK with Python.
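If you would like to try tokenizing yourself, here is a minimal sketch. It uses NLTK's `wordpunct_tokenize`, chosen here only because it needs no extra data download; the video may use a different tokenizer. As a convenience, the sketch falls back to an equivalent regular expression if NLTK is not installed, so that it still runs. The example sentence is made up.

```python
import re

try:
    # Regex-based tokenizer that ships with NLTK; no corpus download needed.
    from nltk.tokenize import wordpunct_tokenize
except ImportError:
    # Fallback (an assumption for convenience): the same splitting rule,
    # i.e. runs of word characters or runs of punctuation.
    def wordpunct_tokenize(text):
        return re.findall(r"\w+|[^\w\s]+", text)

sentence = "POS taggers label words; tokenizers split them first."
tokens = wordpunct_tokenize(sentence)
print(tokens)
```

Notice that punctuation marks become tokens of their own. A POS tagger would then assign a tag to each token in this list.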
Create a program that runs in Terminal/Command line for the expert system you have created.
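One possible way to wrap an expert system as a command-line program is with Python's built-in `argparse` module, as sketched below. The article-choosing rules and the three flags (`--countable`, `--plural`, `--specific`) are hypothetical placeholders; replace them with the facts and rules of the system you designed.

```python
import argparse


def choose_article(countable: bool, plural: bool, specific: bool) -> str:
    """Hypothetical, simplified rules; swap in your own decision rules."""
    if specific:                    # the reader can identify the referent
        return "the"
    if countable and not plural:    # singular countable noun
        return "a/an"
    return "no article"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        description="Suggest an article for a noun phrase.")
    # Each flag is False unless the user passes it on the command line.
    parser.add_argument("--countable", action="store_true")
    parser.add_argument("--plural", action="store_true")
    parser.add_argument("--specific", action="store_true")
    return parser


def main(argv=None) -> str:
    # argv=None makes argparse read the real command-line arguments.
    args = build_parser().parse_args(argv)
    answer = choose_article(args.countable, args.plural, args.specific)
    print(f"Suggested article: {answer}")
    return answer


if __name__ == "__main__":
    main()
```

If you save this as, say, `expert.py`, you could run `python expert.py --countable` in the terminal, and `python expert.py --help` to list the available flags. Keeping the rules in `choose_article`, separate from the argument parsing, should also make it easier to reuse them if you later put the system online.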
Discuss the best way to make your program accessible online. Identify the pros and cons of the different approaches.
Try out these expert systems developed by participants in the 2023 cohort. Consider the following questions:
Here are the links:
Result -
Can you do the following?
If you cannot, make sure that you do before your next class.