Unit 6: Regular expressions

Learning outcomes

By the end of this unit you should:

  • understand the power of regular expressions
  • know the components of regular expressions
  • be able to write simple regular expressions
  • have practised writing simple and complex regular expressions

Activity 1: Introduction to regular expressions

Read.

Regular expressions (regex or RE) are patterns that can be written to search texts for specific combinations or permutations of letters or other characters. In the words of Christiansen and Torkington (2003), regex are akin to "mutant wildcards on steroids". They are powerful search tools that can be used to identify predetermined patterns. Once a particular pattern is matched, the discovered language features can be highlighted. This could be achieved by using JavaScript to control the behaviour of elements in a webpage, emphasizing the matched language feature by altering its colour or size.

Regular expressions, or closely related wildcard patterns, are also available in word processors such as Microsoft Word. Regex syntax differs slightly from one programming language to another. However, if you have, for example, learnt regex for Python, it should be relatively easy to learn regex for JavaScript.
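The idea of matching a pattern and then highlighting it can be sketched in a few lines of Python. This is only an illustration: the sample sentence and the pattern are invented for this example, and asterisks stand in for the colour or size change that JavaScript would apply on a webpage.

```python
import re

# Sample text invented for this illustration.
text = "The cat sat on the mat with a bat."

# \b marks a word boundary; [a-z] matches exactly one lowercase letter.
# The pattern therefore matches three-letter words ending in "at".
pattern = re.compile(r"\b[a-z]at\b")

# Find every match in the text.
matches = pattern.findall(text)

# Wrap each match in asterisks as a plain-text stand-in for highlighting.
highlighted = pattern.sub(lambda m: f"**{m.group(0)}**", text)

print(matches)
print(highlighted)
```

The same pattern would work in JavaScript with only minor changes to the surrounding code.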

Reference: Christiansen, T. and Torkington, N. (2003). Perl Cookbook: Solutions and Examples for Perl Programmers. O'Reilly Media, Inc.

Activity 2: Charismatic explanations of regular expressions

Watch and listen to at least the first tutorial video (11 mins 14 sec) or, even better, the complete playlist of tutorials entitled "Introduction to regular expressions" (around 2 hours). Use the closed caption function if you find it difficult to understand the spoken English.

Activity 3: Learning regular expressions

Read.

Regular expressions look very daunting. There is a lot to learn, but it can be approached systematically by dividing the knowledge to be acquired into manageable blocks. There are many tutorial sites geared to helping learners understand and use regular expressions. I recommend using the website RegexOne to help you practise each of the following concepts. You will learn how to match, skip and capture characters and groups. There is a lesson and an exercise for each of the concepts listed below. Each lesson should only take a few minutes.

  • Lesson 1: matching letters
  • Lesson 1.5: matching letters and numbers
  • Lesson 2: wildcard character
  • Lesson 3: matching specific characters
  • Lesson 4: excluding specific characters
  • Lesson 5: character ranges
  • Lesson 6: matching repeated characters
  • Lesson 7: matching repeated characters (part 2)
  • Lesson 8: matching optional characters
  • Lesson 9: matching whitespace
  • Lesson 10: matching lines
  • Lesson 11: capturing groups
  • Lesson 12: capturing nested groups
  • Lesson 13: capturing multiple nested groups
  • Lesson 14: matching conditional text using the pipe symbol
  • Lesson 15: using metacharacters
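
Several of the concepts in this list (character ranges, repetition, capturing groups, nested groups and the pipe symbol) can be combined in one short Python sketch. The file name and the pattern are invented for this illustration and are not taken from the RegexOne exercises.

```python
import re

# Hypothetical file name used only for this illustration.
filename = "report_2021.pdf"

# Group 1 (outer) captures the whole name before the dot.
# Group 2 (nested) captures the letters, group 3 (nested) the four digits.
# Group 4 uses the pipe symbol to accept either "pdf" or "docx".
pattern = re.compile(r"^(([a-z]+)_(\d{4}))\.(pdf|docx)$")

m = pattern.match(filename)

whole_name = m.group(1)
word = m.group(2)
year = m.group(3)
extension = m.group(4)

# A file with a different extension does not match at all.
no_match = pattern.match("notes_2020.txt")

print(whole_name, word, year, extension, no_match)
```

Groups are numbered by the position of their opening parenthesis, which is why the outer group comes before the two groups nested inside it.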

The solution to the exercise for Lesson 1 is shown below. When you solve the problem, you can continue to the next stage. The website offers solutions, but I strongly advise you to attempt the exercises yourself. If you cannot solve these exercises, you will almost certainly struggle with the assignment later in this course. Learn regex now.

Activity 4: Teach to learn

Work in a group. Each group will be allocated a topic. Learn your assigned topic and prepare to explain the topic.

  • Group 1: anchors
  • Group 2: character classes
  • Group 3: quantifiers
  • Group 4: ranges
  • Group 5: special characters
  • Group 6: pattern modifiers
  • Group 7: assertions
  • Group 8: string replacement (back references)
  • Group 9: metacharacters
  • Group 10: assertions

Cross group and explain your topic to students who have prepared a different topic. Change groups and repeat.

Activity 5: Special characters

Read and remember.

It is necessary to remember a number of special characters to be able to write regular expressions that are powerful enough to match, skip or capture exactly what you want. The table below (adapted from the Python documentation) introduces the special characters you are most likely to need. The main usage of each character is given. For more detailed information, please refer to the Python documentation on Regular Expressions.

Special characters  Basic function
. (Dot)             In the default mode, this matches any character except a newline.
^ (Caret)           Matches the start of the string.
$                   Matches the end of the string or just before the newline at the end of the string.
*                   Causes the resulting RE to match 0 or more repetitions of the preceding RE.
+                   Causes the resulting RE to match 1 or more repetitions of the preceding RE.
?                   Causes the resulting RE to match 0 or 1 repetitions of the preceding RE.
*?, +?, ??          The '*', '+', and '?' quantifiers are all greedy; they match as much text as possible. Adding '?' after the quantifier makes it non-greedy, so that it matches as little text as possible.
{m}                 Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match.
{m,n}               Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible.
{m,n}?              Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible.
\                   Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence.
[]                  Used to indicate a set of characters.
|                   A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B.

This table is based on the Python Documentation.
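
The behaviour of several of these special characters is easier to remember after seeing them run. The following Python sketch uses sample strings invented for this illustration; it compares greedy and non-greedy quantifiers and shows the caret, the backslash, a character set and the pipe.

```python
import re

# Sample strings invented for this illustration.
html = "<b>bold</b> and <i>italic</i>"
numbers = "1 22 333 4444"
questions = "Why? How?"

# Greedy: .+ takes as much text as possible, so one long match is returned.
greedy = re.findall(r"<.+>", html)

# Non-greedy: .+? takes as little text as possible, so each tag matches separately.
lazy = re.findall(r"<.+?>", html)

# {2,3} prefers three digits; {2,3}? prefers two digits.
many_digits = re.findall(r"\d{2,3}", numbers)
few_digits = re.findall(r"\d{2,3}?", numbers)

# The backslash escapes '?', so it matches a literal question mark.
marks = re.findall(r"\?", questions)

# The caret anchors the pattern to the start of the string.
starts_with_why = re.search(r"^Why", questions) is not None
starts_with_how = re.search(r"^How", questions) is not None

# A set matches any one of the listed characters; the pipe matches either side.
vowels = re.findall(r"[aeiou]", "regex")
pets = re.findall(r"cat|dog", "cat dog bird")

print(greedy, lazy, many_digits, few_digits, marks, vowels, pets)
```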

Activity 6: Problem solving with regex

Write regular expressions that can do the following:

  1. match UoA student numbers.
  2. match HTTPS urls and HTTP urls.
  3. match sentences with one comma.
  4. match sentences with one or more commas.
  5. match regular verbs used in passive voice.
  6. capture X.
  7. capture X and Y separately.
  8. match X but skip Y.
  9. match X but skip Y.
  10. match X but skip Y.
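
As a starting point for item 2, one possible approach is sketched below in Python. The sample sentence and the pattern are assumptions made for this illustration; the pattern is deliberately simple and would, for example, include any punctuation that directly follows a URL.

```python
import re

# Sample sentence invented for this illustration.
text = "See http://example.com and https://example.org/page for details."

# s? makes the letter "s" optional, so both http and https are accepted.
# \S+ then takes every character up to the next whitespace.
url_pattern = re.compile(r"https?://\S+")

urls = url_pattern.findall(text)

print(urls)
```

Try to improve on this sketch before moving on to the other items.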

Knowledge and application

Activity 7: Audio recording

Explain the key point of your assigned lesson (Activity 3). Give an original example in your explanation. Record a Japanese and an English version of the same explanation. Submit both audio files via ELMS. The recommended length is between 1 and 2 minutes. However, the clarity of the content is more important than the length of the recording. Your topic is decided by the final digit of your student ID number. See the list below. Your audio file may be uploaded to this website for other students to listen to. Do not state your name or personal information! Speak clearly. Name the files Lesson_X_en or Lesson_X_jp. Replace X with your assigned lesson number.

  • 1: Lesson 5
  • 2: Lesson 6
  • 3: Lesson 7
  • 4: Lesson 8
  • 5: Lesson 9
  • 6: Lesson 10
  • 7: Lesson 11
  • 8: Lesson 12
  • 9: Lesson 13
  • 0: Lesson 14

Review

Make sure you can explain the differences between the following in simple English:

  1. literal character vs metacharacter
  2. match vs capture
  3. greedy vs non-greedy
  4. anchors vs assertions

"Give a man a regular expression and he’ll match a string... teach him to make his own regular expressions and you’ve got a man with problems." - Unknown

Copyright John Blake, 2021