Unit 7: Problem breakdown

Learning outcomes

By the end of this unit you should:

have practised breaking down an authorship analysis problem

Activity 1: Terminology review

Work in pairs. Describe the differences between:

text classification and text categorization
idiolect and idiosyncratic language
syntactical and structural features
types and tokens
bigrams and trigrams

Activity 2: Problems in authorship analysis

There are four main types of problem in authorship analysis.

Authorship verification - Identifying whether a text was written by one or more authors
Authorship profiling - Identifying personal characteristics about the author from a text
Authorship attribution - Identifying which author wrote a text from a selection of candidate authors
Needle in the haystack - Searching for the author of a text on the (dark) web

Discuss the type of computer program that would be needed to identify authorship, provide sufficient evidence, and be explainable to a non-technical jury in court.

Knowledge and application - AY2024

Activity 4: Problem breakdown

Create the functionalities needed for a powerful corpus tool that authorship experts can use to help make judgements about authorship.

For the remainder of this course, your team will aim to create the best possible integrated program/tool. More specific details about your problem are provided on the Trello task board.

This task is very challenging. However, by breaking down the task into manageable sub-tasks you will be able to understand and solve the problems systematically. The core steps in breaking down a problem are:

understanding the problem,
identifying possible solutions,
selecting a solution, and
evaluating the solution.
Repeat until the optimum solution is found.

Knowledge and application - AY2023 (FOR REFERENCE ONLY)

Activity 4: Problem breakdown

Work on ONE of the scenarios described in this activity. Once you have fully understood your selected scenario, write the pseudocode for your planned program.

For the remainder of this course, your team will aim to solve one of the authorship problems listed above. More specific details about your problem are provided on the Trello task board.

The core steps in breaking down a problem are:

understanding the problem,
identifying possible solutions,
selecting a solution, and
evaluating the solution.
Repeat until the optimum solution is found.

Knowledge and application - AY2022 (FOR REFERENCE ONLY)

Activity 5: Scenarios

Work on ONE of the scenarios described in this activity. Once you have fully understood your selected scenario, write the pseudocode for your planned program.

The problem to be solved, details about the dataset and the program to be created are given below. Where necessary, additional comments are provided. In the program, include the source (in comments) of any sections of code that your team did not create.

Authorship verification

Problem - Identify whether Tweets were written by one author or another author. Select two authors who regularly share their opinions on Twitter.
Dataset - You can select two authors from an existing dataset, for example, the Top 20 most followed users, or you can adapt a program in Python that scrapes the Twitter feed of each author. There are many ready-made programs that you can easily adapt. Divide each dataset into two parts. The end result should be:

Author 1: 90% of Tweets (Known)
Author 1: 10% of Tweets (Questioned)
Author 2: 90% of Tweets (Known)
Author 2: 10% of Tweets (Questioned)
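
The 90%/10% division above could be sketched in Python as follows. This is only an illustration: the function name split_known_questioned, the fixed random seed and the placeholder Tweets are assumptions, not part of the task description.

```python
import random


def split_known_questioned(tweets, questioned_fraction=0.1, seed=0):
    """Shuffle a copy of the Tweets and return (known, questioned).

    The function name, the seed and the rounding rule are illustrative
    assumptions; adapt them to your own dataset.
    """
    shuffled = list(tweets)
    random.Random(seed).shuffle(shuffled)
    # Keep at least one Tweet in the questioned set.
    n_questioned = max(1, round(len(shuffled) * questioned_fraction))
    return shuffled[n_questioned:], shuffled[:n_questioned]


# Placeholder data; replace with the real Tweets for one author.
author1_tweets = [f"placeholder tweet {i}" for i in range(20)]
known, questioned = split_known_questioned(author1_tweets)
print(len(known), len(questioned))
```

Shuffling before splitting avoids putting only the newest (or oldest) Tweets in the Questioned set. Run the same function once per author.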

Program - Write a program that compares ONE Questioned dataset with ONE Known dataset. The program should evaluate whether each Tweet in the questioned dataset was written by the same author as the known dataset. The result should be numerical, e.g. 50 out of 70 Tweets were written by the same author.
Comments - You can use n-grams as the access point to discover the number of shared n-grams, or any other language features.
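
One possible starting point for the n-gram comparison is sketched below. It assumes word bigrams, whitespace tokenization and a decision rule of "at least one shared bigram"; all three are illustrative choices that your team should question and tune. The function names and the sample Tweets are made up for the example.

```python
def word_ngrams(text, n=2):
    """Return the set of word n-grams in a text (lower-cased, split on spaces)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def count_same_author(known_tweets, questioned_tweets, n=2, threshold=1):
    """Count questioned Tweets sharing at least `threshold` n-grams with the known set.

    The threshold is an arbitrary assumption, not a validated cut-off.
    """
    known_ngrams = set()
    for tweet in known_tweets:
        known_ngrams |= word_ngrams(tweet, n)

    matches = 0
    for tweet in questioned_tweets:
        shared = word_ngrams(tweet, n) & known_ngrams
        if len(shared) >= threshold:
            matches += 1
    return matches


# Invented sample data, for illustration only.
known = ["i really think the economy is doing great", "we will win big this year"]
questioned = ["i really think we will win", "thanks for the lovely weather today"]

matches = count_same_author(known, questioned)
print(f"{matches} out of {len(questioned)} Tweets were judged to be by the same author")
```

A fixed threshold like this is easy to explain but crude. Comparing the overlap with Author 1's Known set against the overlap with Author 2's Known set is one way to make the decision less arbitrary.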

Authorship profiling

Problem - Identify the personal characteristics of the author of a text. You should evaluate at least two characteristics, e.g. age (young, old), education level (junior high, university).
Dataset - Select TEN letters from Letters Anonymous and save each as a plain text file.
Program - Write a program that analyzes all TEN texts. The program should focus on the language features that are relevant to profiling. One feature that is likely to be useful is lexical density, which is evaluated here using the type-token ratio. This ratio provides a measure of the variety of vocabulary. We assume that the wider the variety, the higher the education level. Other useful features are punctuation and spaces. Older people tend to use a double space after periods, as this was the convention for typewritten text. A list of Generation Z vocabulary might help identify young people. The system should output the numerical values and the personal characteristics for the author of each text.
Comments -
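
Two of the features named above, the type-token ratio and double spaces after periods, might be computed as in this sketch. The tokenization rule (runs of letters and apostrophes) and the function names are assumptions, and the sample sentence is invented.

```python
import re


def type_token_ratio(text):
    """Number of distinct words (types) divided by total words (tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def double_space_count(text):
    """Count periods that are followed by two or more spaces."""
    return len(re.findall(r"\. {2,}", text))


# Invented sample text, for illustration only.
sample = "The cat sat.  The cat slept.  It was late."
ttr = type_token_ratio(sample)
doubles = double_space_count(sample)
print(f"type-token ratio: {ttr:.2f}, double spaces after periods: {doubles}")
```

The type-token ratio tends to fall as a text gets longer, so compare texts of similar length, or compute the ratio on a fixed number of tokens from each letter. Mapping the numbers to a characteristic (for example, "university") still needs a rule that your team must choose and justify.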

Authorship attribution

Problem - Identify the actual author of a text from ten candidate authors.
Dataset - Use this dataset: Reuters_50_50. Select 10 authors out of the 50 authors. For each author use 90% of the texts as the training set, and 10% as the test set.
Program - Write a program that measures the cosine similarity (or another measure) between each test set and each author's training set, and then uses the results to identify the author of each test set. The program should provide numerical results for all TEN test sets against all TEN authors. Finally, the program names the most likely author for each test set. This may be in the form of a probability statement, e.g. XXX is most likely the author (90%).
Comments -
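
A minimal sketch of the cosine similarity approach is given below, assuming raw word counts as the features and whitespace tokenization. The author labels, the sample texts and the function names are invented for the example and are not taken from Reuters_50_50.

```python
import math
from collections import Counter


def word_counts(text):
    """Bag-of-words vector: word -> frequency (lower-cased, split on spaces)."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_authors(training_texts, test_text):
    """Return (author, similarity) pairs, most similar author first."""
    test_vec = word_counts(test_text)
    scores = {}
    for author, texts in training_texts.items():
        profile = Counter()
        for t in texts:
            profile.update(word_counts(t))
        scores[author] = cosine_similarity(profile, test_vec)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Invented sample data, for illustration only.
training = {
    "author_a": ["shares rose sharply on strong earnings", "the shares fell on weak earnings"],
    "author_b": ["the team won the match in extra time", "the coach praised the team"],
}
ranking = rank_authors(training, "earnings lifted the shares")
for author, score in ranking:
    print(f"{author}: {score:.3f}")
```

Note that a cosine similarity score is not a probability. If you report a percentage, explain how it was derived from the scores, since a jury may read "90%" as the chance that the attribution is correct.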

Review

Make sure you can explain the following in simple English:

needle in the haystack
lexical density
