Unit 3: Language as a fingerprint

Learning outcomes

By the end of this unit you should:

understand how ngrams and parts-of-speech are used to investigate authorship
know the difference between tokens and types
know the difference between combinations and permutations
have practised analyzing short extracts using various markers
be familiar with the basics of machine learning

Activity 1: Terminology review

Work in pairs. Discuss the differences between the following pairs of terms:

authorship attribution vs. authorship profiling
similarity detection vs. authorship attribution
authorship profiling vs. authorship analysis
authorship attribution vs. authorship verification
text categorization vs. text classification

Activity 2: Deep learning

Work in pairs or threes. Discuss the following concepts in Japanese.

Neural network
What is between the input and output layers
Recursive neural network
How an input vector is transformed into an output
weights and bias
activation function

Now, discuss the same concepts in English.

Activity 3: Access points

Read.

To investigate authorship, there are a number of possible access points. An access point is the starting point for the investigation. Common access points can be explained in lay terms as words and grammar. However, when these are operationalized, more precise terminology is necessary. To understand the different access points, let's consider the following sentence:

Questioned sentence

I want you to know what you did wrong. And to understand that you caused the problem.

Five different access points for analysis are given below.

Token - Tokens may be words or non-words, e.g. punctuation marks and numbers
Type - A type is a distinct token, so the number of types is the number of different tokens
Part of speech - Part of speech (POS) describes the eight main grammatical categories, e.g. verb, noun, adjective and adverb
POS tags - POS tags are more fine-grained. The most popular tagset (Penn Treebank) comprises 36 tags
POS tags and tokens - POS tags and tokens can be used together to identify particular grammatical units, e.g. It is + ADJECTIVE
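
The access points above can be sketched in a few lines of Python. This is only a minimal illustration, not a standard tool: the example sentence is invented (it is not the questioned sentence), the regular expression is one simple way of splitting text into tokens, word types are counted after lowercasing (an assumption; counting "It" and "it" separately is also defensible), and the Penn Treebank tags are assigned by hand rather than by a tagger.

```python
import re

# Invented example sentence (an assumption, not the questioned sentence).
text = "It is cold today. It is very cold, and it is windy."

# Tokens: alphabetic words, numbers, or single punctuation marks.
tokens = re.findall(r"[A-Za-z][A-Za-z']*|\d+|[^\w\s]", text)


def is_word(token):
    """Return True for word tokens; numbers and punctuation are non-word tokens."""
    return re.fullmatch(r"[A-Za-z][A-Za-z']*", token) is not None


word_tokens = [t for t in tokens if is_word(t)]
non_word_tokens = [t for t in tokens if not is_word(t)]

# Types: distinct word tokens, ignoring case (assumption).
word_types = {t.lower() for t in word_tokens}

# Hand-assigned Penn Treebank tags (assumption: no automatic tagger is used).
tagged = [
    ("It", "PRP"), ("is", "VBZ"), ("cold", "JJ"), ("today", "NN"), (".", "."),
    ("It", "PRP"), ("is", "VBZ"), ("very", "RB"), ("cold", "JJ"), (",", ","),
    ("and", "CC"), ("it", "PRP"), ("is", "VBZ"), ("windy", "JJ"), (".", "."),
]

# POS tags and tokens together: find the pattern "It is + ADJECTIVE" (JJ).
matches = []
for i in range(len(tagged) - 2):
    (w1, _), (w2, _), (w3, tag3) = tagged[i], tagged[i + 1], tagged[i + 2]
    if w1.lower() == "it" and w2.lower() == "is" and tag3 == "JJ":
        matches.append((w1, w2, w3))

print(len(tokens), len(word_tokens), len(non_word_tokens), len(word_types))
print(matches)
```

Note that "It is very cold" is not matched, because the token after "is" is tagged as an adverb (RB), not an adjective (JJ). Combining tokens with tags makes this kind of precise pattern possible.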

Work with a partner to analyze the sentence above by answering these questions:

How many word tokens are there?
How many non-word tokens are there?
How many word types are there?
How many verbs are there?
Which Penn-Treebank POS tags will be used for the verbs?

Compare your answers with another group.

Activity 4: Idiosyncratic language

Work alone. Identify the idiosyncratic language in each of the following cases.

For each of the pairs of expressions, one version is more natural (i.e. frequently used by many people), and one version is less natural (i.e. less frequently used). The less frequently used forms are idiosyncratic as they show creative or mistaken use of language.

Case 1: Authorship attribution - Unabomber

you can't eat your cake and have it too
You can't have your cake and eat it

Case 2: Authorship verification - UoA assignment

The e-mail was writen by student X.
The email was written by student X.

Case 3: Authorship profiling - chat forum

it was very fun :)
it was a lot of fun :)

Discuss your answers with a partner. What evidence do you have to support your decisions?

Source: Wikipedia

Knowledge and application

Activity 5: Teach-to-learn challenge

Work with your teammates to produce content for your assigned task. You should produce (1) a written explanation in English, (2) an audio explanation in Japanese and (3) a practice activity, e.g. questions with answers. Submit your work via ELMS.

Detailed instructions

One submission per group.
State the group name.
Submit the explanation in HTML, with hyperlinks if necessary.
Submit an audio file (length: between 1 and 2 minutes) in MP3 format.
Submit the practice activity in HTML or HTML/JS.
All materials may be placed on the course website, so do not include any personal information.

Review

Make sure you can explain the following in simple English:

token
non-word token
word token
type
part of speech
POS tag
combination
permutation
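
To check your understanding of the last two terms, the short Python sketch below may help. It uses the standard itertools module and three words chosen only for illustration (an assumption; any three items would do). In a combination the order of the items does not matter, whereas in a permutation it does.

```python
from itertools import combinations, permutations

# Three words chosen only for illustration (assumption).
words = ["you", "did", "wrong"]

# Combinations of 2: order does not matter, so ("you", "did") appears once.
combos = list(combinations(words, 2))

# Permutations of 2: order matters, so ("you", "did") and ("did", "you") both appear.
perms = list(permutations(words, 2))

print(len(combos), combos)
print(len(perms), perms)
```

With three items taken two at a time, there are 3 combinations but 6 permutations, because each pair can be ordered in two ways.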

Running count: 45 of 60 concepts covered so far.

TODO: add examples for combination and permutation to webpage. Add marked and unmarked to concept list