Unit 7: Problem breakdown

Learning outcomes

By the end of this unit you should:

  • have practised breaking down an authorship analysis problem

Activity 1: Terminology review

Work in pairs. Describe the differences between:

  1. text classification and text categorization
  2. idiolect and idiosyncratic language
  3. syntactical and structural features
  4. types and tokens
  5. bigrams and trigrams

Activity 2: Problems in authorship analysis

There are four main types of problem in authorship analysis.

  1. Authorship verification - Identifying whether or not a text was written by a particular author
  2. Authorship profiling - Identifying personal characteristics about the author from a text
  3. Authorship attribution - Identifying which author wrote a text from a selection of candidate authors
  4. Needle in the haystack - Searching for the author of a text on the (dark) web

Submit your work via ELMS.

Activity 3: Problem breakdown

Read.

For the remainder of this course, your team will be aiming to develop a system that can explain the linguistic differences between two datasets.

This task is an authorship verification task. Your dataset is the one that you sourced in an earlier assignment. If your dataset divides into two subsets, e.g. A or B, human or machine, then you can use it as is. If your dataset is more complex, then you may need to divide it. For example, if your dataset contains the writing of 20 authors, you can divide it into 1 author vs 19 authors. The task will then be to identify whether author 1 wrote a text or not. If you are unsure how to use your dataset, ask your tutor for advice. More specific details about your problem are provided on the Trello task board.
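
The 1 author vs 19 authors division described above can be sketched in Python as follows. This is a minimal illustration, not a required implementation: the function name `to_two_classes`, the labels "target" and "other", and the toy samples are all hypothetical, and it assumes your data can be loaded as (author, text) pairs.

```python
def to_two_classes(samples, target_author):
    """Relabel (author, text) pairs as the target author vs. everyone else."""
    return [
        (text, "target" if author == target_author else "other")
        for author, text in samples
    ]


# Hypothetical toy data standing in for a multi-author dataset.
samples = [
    ("author_01", "I reckon the match was brilliant."),
    ("author_02", "The committee approved the proposal."),
    ("author_03", "lol that was so good"),
    ("author_01", "I reckon we should leave early."),
]

labelled = to_two_classes(samples, "author_01")

# Count how many texts ended up in each of the two classes.
counts = {}
for _, label in labelled:
    counts[label] = counts.get(label, 0) + 1

print(counts)  # {'target': 2, 'other': 2}
```

With a real dataset the "other" class will usually be much larger than the "target" class, which is worth keeping in mind when you evaluate your results.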

Rather than analyzing new linguistic features, you are advised to focus on the use of adjectives and/or the use of verbs as you have already analyzed those in previous units.

The core steps in breaking down a problem are:

  1. understanding the problem,
  2. identifying possible solutions,
  3. selecting a solution,
  4. evaluating the solution, and
  5. repeating until the optimum solution is found.

Knowledge and application - AY2023

Activity 4: Problem breakdown

Work on ONE of the scenarios described in this activity. Once you have fully understood your selected scenario, write the pseudocode for your planned program.

For the remainder of this course, your team will be aiming to solve one of the authorship problems listed above. More specific details about your problem are provided on the Trello task board.

The core steps in breaking down a problem are:

  1. understanding the problem,
  2. identifying possible solutions,
  3. selecting a solution,
  4. evaluating the solution, and
  5. repeating until the optimum solution is found.

Knowledge and application - AY2022 (FOR REFERENCE ONLY)

Activity 5: Scenarios

Work on ONE of the scenarios described in this activity. Once you have fully understood your selected scenario, write the pseudocode for your planned program.

The problem to be solved, details about the dataset and the program to be created are given below. Where necessary, additional comments are provided. In the program, include the source (in comments) of any sections of code that your team did not create.

Authorship verification

  1. Problem - Identify whether Tweets were written by one author or another author. Select two authors who regularly share their opinions on Twitter.
  2. Dataset - You can select two authors from an existing dataset, for example, the Top 20 most followed users, or you can adapt a Python program that scrapes the Twitter feed of each author. There are many ready-made programs that you can easily adapt. Divide each author's dataset into two parts. The end result should be:
    1. Author 1: 90% of Tweets (Known)
    2. Author 1: 10% of Tweets (Questioned)
    3. Author 2: 90% of Tweets (Known)
    4. Author 2: 10% of Tweets (Questioned)
  3. Program - Write a program that compares ONE Questioned dataset with ONE Known dataset. The program should evaluate whether each Tweet in the questioned dataset was written by the same author as the known dataset. The result should be numerical, e.g. 50 out of 70 Tweets were written by the same author.
  4. Comments - You can use n-grams as the starting point, counting the number of n-grams shared by the questioned and known datasets, or you can use any other language features.
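
One possible shape for this scenario is sketched below. It is a rough starting point under stated assumptions, not the expected solution: the function names, the use of word bigrams, the whitespace tokenizer and the `threshold` of one shared n-gram are all choices made for illustration, and the toy Tweets are invented.

```python
import random


def split_known_questioned(tweets, questioned_share=0.1, seed=0):
    """Shuffle a copy of the Tweets and split it into (known, questioned)."""
    shuffled = tweets[:]
    random.Random(seed).shuffle(shuffled)
    n_questioned = max(1, round(len(shuffled) * questioned_share))
    return shuffled[n_questioned:], shuffled[:n_questioned]


def word_ngrams(text, n=2):
    """Return the set of word n-grams in a text (simple whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def count_same_author(known, questioned, n=2, threshold=1):
    """Count questioned Tweets sharing at least `threshold` n-grams with the known set."""
    known_ngrams = set()
    for tweet in known:
        known_ngrams |= word_ngrams(tweet, n)
    hits = sum(
        1 for tweet in questioned
        if len(word_ngrams(tweet, n) & known_ngrams) >= threshold
    )
    return hits, len(questioned)


# Splitting 20 placeholder Tweets gives 18 known and 2 questioned.
known_part, questioned_part = split_known_questioned([f"tweet {i}" for i in range(20)])

# Invented toy data to show the comparison step.
known = ["honestly this is great news", "honestly this is a mess"]
questioned = ["honestly this is not fine", "markets closed higher today"]
hits, total = count_same_author(known, questioned)
print(f"{hits} out of {total} Tweets were attributed to the known author")
```

A threshold of one shared bigram is very permissive, so common phrases will produce false matches. Your team will need to decide which n-gram size and threshold suit your data.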

Authorship profiling

  1. Problem - Identify the personal characteristics of the author of a text. You should evaluate at least two characteristics, e.g. age (young, old), education level (junior high, university).
  2. Dataset - Select TEN letters from Letters Anonymous and save each as a plain text file.
  3. Program - Write a program that analyzes all TEN texts. The program should focus on the language features that are relevant to profiling. One useful feature is likely to be lexical density, which is evaluated using type-token ratio. This ratio provides a measure of the variety of vocabulary. We assume that the wider the variety, the higher the education level. Other useful features are punctuation and spaces. Older people tend to use a double space after periods, as this was the convention for typewritten text. A list of Generation Z vocabulary might help identify young people. The system should output the numerical values and the personal characteristics for the author of each text.
  4. Comments -
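
Two of the features mentioned above, type-token ratio and double spaces after periods, could be measured roughly as follows. This is a sketch only: the function names, the regular expressions and the sample letter are assumptions made for illustration, and the simple tokenizer treats any run of letters or apostrophes as a token.

```python
import re


def type_token_ratio(text):
    """Number of distinct word forms (types) divided by number of words (tokens)."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def double_spaces_after_periods(text):
    """Count periods followed by exactly two spaces and then more text."""
    return len(re.findall(r"\.  (?=\S)", text))


# Invented sample text; in the activity this would be read from a plain text file.
letter = "I miss you.  I miss the old days.  I hope you are well."

print(round(type_token_ratio(letter), 3))   # 0.692 (9 types / 13 tokens)
print(double_spaces_after_periods(letter))  # 2
```

Type-token ratio tends to fall as texts get longer, so comparing letters of very different lengths may be misleading unless you take samples of equal size.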

Authorship attribution

  1. Problem - Identify the actual author of a text from ten candidate authors.
  2. Dataset - Use this dataset: Reuters_50_50. Select 10 authors out of the 50 authors. For each author use 90% of the texts as the training set, and 10% as the test set.
  3. Program - Write a program that measures the cosine similarity (or another measure) for the training set and then uses the result to identify the author of the test set. The program should provide numerical results for all TEN test sets against all TEN authors. Finally, the program names the most likely author for each test set. This may be in the form of a probability statement, e.g. XXX is most likely the author (90%).
  4. Comments -
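
The cosine similarity step could be sketched as below, using plain word counts as the features. This is one possible approach under stated assumptions, not the required one: the function names, the whitespace tokenizer, the choice to merge each author's training texts into one profile, and the two-author toy data are all hypothetical.

```python
import math
from collections import Counter


def word_counts(text):
    """Bag-of-words vector: how often each lowercased word occurs."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_authors(training_texts, test_text):
    """Score the test text against each author's merged training texts, best first."""
    test_vec = word_counts(test_text)
    scores = {
        author: cosine_similarity(word_counts(" ".join(texts)), test_vec)
        for author, texts in training_texts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Invented toy data; the real task would load ten authors from the dataset.
training = {
    "author_a": ["shares rose as the bank reported profits", "the bank said profits rose"],
    "author_b": ["the team won the match", "fans cheered as the team scored"],
}
ranking = rank_authors(training, "the bank reported that shares rose")
for author, score in ranking:
    print(f"{author}: {score:.2f}")
print(f"Most likely author: {ranking[0][0]}")
```

Note that a cosine score is a similarity, not a probability. If you want to report a percentage, you will need an extra step to convert the scores, and you should explain how you did it.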

Review

Make sure you can explain the following in simple English:

  1. needle in the haystack
  2. lexical density
