Unit 3: Language as a fingerprint

Learning outcomes

By the end of this unit you should:

  • understand how ngrams and parts-of-speech are used to investigate authorship
  • know the difference between tokens and types
  • know the difference between combinations and permutations
  • have practised analyzing short extracts using various markers
  • be familiar with the basics of machine learning

Activity 1: Terminology review

Work in pairs. Discuss the differences between the following pairs of terms:

  1. authorship attribution vs. authorship profiling
  2. similarity detection vs. authorship attribution
  3. authorship profiling vs. authorship analysis
  4. authorship attribution vs. authorship verification
  5. text categorization vs. text classification

Activity 2: Deep learning

Work in pairs or threes. Discuss the following concepts in Japanese.

  1. Neural network
  2. What is between the input and output layers
  3. Recursive neural network
  4. How the input vector is transformed into the output
  5. weights and bias
  6. activation function

Now, discuss the same concepts in English.

Activity 3: Access points


To investigate authorship, there are a number of possible access points. An access point is the starting point for the investigation. In lay terms, the common access points are words and grammar. However, when these are operationalized, more precise terminology is necessary. To understand the different access points, let's consider the following sentence:

Questioned sentence

I want you to know what you did wrong. And to understand that you caused the problem.

Five different access points for analysis are given below.

  • Token - Tokens may be words or non-words, e.g. punctuation marks and numbers
  • Type - A type is a distinct token form; the number of types is the number of different tokens
  • Part of speech - Part of speech (POS) describes the eight main grammatical categories, e.g. verb, noun, adjective, adverb, etc.
  • POS tags - POS tags are more finely grained. The most popular tagset (Penn Treebank) comprises 36 tags
  • POS tags and tokens - POS tags and tokens can be used together to identify particular grammatical units, e.g. It is + ADJECTIVE
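
The access points above can be sketched in a few lines of Python. This is only a minimal illustration, not a standard tool: the example sentence is invented (the questioned sentence is left for you to analyze), the function names are hypothetical, the tokenizer is a simple regular expression, types are counted after ignoring capital letters (an assumption that a real study would need to state), and the POS tags are assigned by hand rather than by a tagger.

```python
import re

def tokenize(text):
    """Split text into tokens: words, numbers and punctuation marks."""
    # \w+(?:'\w+)* keeps contractions such as "can't" as a single token;
    # [^\w\s] treats each punctuation mark as its own token.
    return re.findall(r"\w+(?:'\w+)*|[^\w\s]", text)

def is_word_token(token):
    """Treat a token as a word if it starts with a letter (numbers are non-words)."""
    return token[0].isalpha()

def word_types(tokens):
    """Return the distinct word tokens, ignoring case (an assumption)."""
    return {t.lower() for t in tokens if is_word_token(t)}

def find_it_is_adjective(tagged):
    """Find 'It is + ADJECTIVE' in a list of (token, POS tag) pairs."""
    matches = []
    for i in range(len(tagged) - 2):
        (w1, _), (w2, _), (w3, tag3) = tagged[i:i + 3]
        # Penn Treebank adjective tags all start with JJ (JJ, JJR, JJS).
        if w1.lower() == "it" and w2.lower() == "is" and tag3.startswith("JJ"):
            matches.append((w1, w2, w3))
    return matches

# Invented example sentence, not the questioned sentence.
sentence = "The cat saw the dog, and the dog saw 2 cats."
tokens = tokenize(sentence)
words = [t for t in tokens if is_word_token(t)]
non_words = [t for t in tokens if not is_word_token(t)]

print(len(tokens))                  # 13 tokens in total
print(len(words))                   # 10 word tokens
print(non_words)                    # [',', '2', '.']
print(sorted(word_types(tokens)))   # ['and', 'cat', 'cats', 'dog', 'saw', 'the']

# Hand-tagged example using Penn Treebank tags (tags assigned manually here).
tagged = [("It", "PRP"), ("is", "VBZ"), ("clear", "JJ"), ("that", "IN"),
          ("he", "PRP"), ("left", "VBD"), (".", ".")]
print(find_it_is_adjective(tagged))  # [('It', 'is', 'clear')]
```

Note that "cat" and "cats" count as two types in this sketch, while "The" and "the" count as one. Different decisions about capital letters or word endings will give different type counts.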

Work with a partner to analyze the sentence above by answering these questions:

  1. How many word tokens are there?
  2. How many non-word tokens are there?
  3. How many word types are there?
  4. How many verbs are there?
  5. Which Penn-Treebank POS tags will be used for the verbs?

Compare your answers with another group.

Activity 4: Idiosyncratic language

Work alone. Identify the idiosyncratic language in each of the following cases.

For each of the pairs of expressions, one version is more natural (i.e. frequently used by many people), and one version is less natural (i.e. less frequently used). The less frequently used forms are idiosyncratic as they show creative or mistaken use of language.

Case 1: Authorship attribution - Unabomber
  1. you can't eat your cake and have it too
  2. You can't have your cake and eat it
Case 2: Authorship verification - UoA assignment
  1. The e-mail was writen by student X.
  2. The email was written by student X.
Case 3: Authorship profiling - chat forum
  1. it was very fun :)
  2. it was a lot of fun :)

Discuss your answers with a partner. What evidence do you have to support your decisions?


Source: Wikipedia

Knowledge and application

Activity 5: Teach-to-learn challenge

Work with your teammates to produce content for your assigned task. You should produce (1) a written explanation in English, (2) an audio explanation in Japanese and (3) a practice activity, e.g. questions with answers. Submit your work via ELMS.

Detailed instructions

  1. One submission per group.
  2. State the group name.
  3. Submit the explanation in HTML, with hyperlinks if necessary.
  4. Submit an audio file (length = between 1 and 2 minutes) in mp3 format.
  5. Submit the practice activity in HTML or HTML/JS.
  6. All materials may be placed on the course website, so do not include any personal information.


Make sure you can explain the following in simple English:

  1. token
  2. non-word token
  3. word token
  4. type
  5. part of speech
  6. POS tag
  7. combination
  8. permutation
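
The difference between the last two concepts can be sketched with Python's standard library. This is only an illustration: the three POS tags are an arbitrary choice, and pairs are selected without repeating a tag. In a combination the order is ignored; in a permutation the order matters.

```python
from itertools import combinations, permutations
from math import comb, perm

# Three Penn Treebank tags chosen only for illustration.
tags = ["NN", "VB", "JJ"]

# Combinations: choose 2 tags, order ignored, so (NN, VB) is the same as (VB, NN).
unordered_pairs = list(combinations(tags, 2))
print(unordered_pairs)   # [('NN', 'VB'), ('NN', 'JJ'), ('VB', 'JJ')]

# Permutations: choose 2 tags, order matters, so (NN, VB) and (VB, NN) both count.
ordered_pairs = list(permutations(tags, 2))
print(ordered_pairs)
# [('NN', 'VB'), ('NN', 'JJ'), ('VB', 'NN'), ('VB', 'JJ'), ('JJ', 'NN'), ('JJ', 'VB')]

# The counts can also be calculated directly without listing every pair.
print(comb(3, 2))   # 3
print(perm(3, 2))   # 6
```

Word order matters in language, so a sequence such as "you did" is not the same as "did you". For this reason, sequences of tokens or tags are usually treated as ordered, like permutations rather than combinations.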

Running count: 45 of 60 concepts covered so far.

TODO 2023: add examples for combination and permutation to webpage. Add marked and unmarked to concept list