Unit 7: Problem breakdown
By the end of this unit you should:
- have practised breaking down an authorship analysis problem
Activity 1: Terminology review
Work in pairs. Describe the differences between:
- text classification and text categorization
- idiolect and idiosyncratic language
- syntactical and structural features
- types and tokens
- bigrams and trigrams
Activity 2: Problems in authorship analysis
There are four main types of problem in authorship analysis.
- Authorship verification - Identifying whether a text was written by a particular author
- Authorship profiling - Identifying personal characteristics about the author from a text
- Authorship attribution - Identifying which author wrote a text from a selection of candidate authors
- Needle in the haystack - Searching for the author of a text on the (dark) web
Submit your work via ELMS.
Activity 3: Problem breakdown
For the remainder of this course, your team will be aiming to solve one of the authorship problems listed above. More specific details about your problem are provided on the Trello task board.
The core steps in breaking down a problem are:
- understanding the problem,
- identifying possible solutions,
- selecting a solution, and
- evaluating the solution.
Repeat these steps until an optimal solution is found.
Knowledge and application
Activity 4: Scenarios
Work on ONE of the scenarios described in this activity. Once you have fully understood your selected scenario, write the pseudocode for your planned program.
The problem to be solved, details about the dataset and the program to be created are given below. Where necessary, additional comments are provided. In the program, include the source (in comments) of any sections of code that your team did not create.
- Problem - Identify whether Tweets were written by one author or another author. Select two authors who regularly share their opinions on Twitter.
- Dataset - You can select two authors from an existing dataset, for example, the Top 20 most followed users, or you can adapt a program in Python that scrapes the Twitter feed of each author. There are many ready-made programs that you can easily adapt. Divide each dataset into two parts. The end result should be:
- Author 1: 90% of Tweets (Known)
- Author 1: 10% of Tweets (Questioned)
- Author 2: 90% of Tweets (Known)
- Author 2: 10% of Tweets (Questioned)
- Program - Write a program that compares ONE Questioned dataset with ONE Known dataset. The program should evaluate whether each Tweet in the questioned dataset was written by the same author as the known dataset. The result should be numerical, e.g. 50 out of 70 Tweets were written by the same author.
- Comments - You can use n-grams as the starting point, counting the number of n-grams shared by the questioned and known datasets, or you can use any other language features.
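The scenario above could be sketched as follows. This is a minimal illustration and not a required solution: the function names, the use of word bigrams, the `min_shared` threshold and the example Tweets are all assumptions made for the sketch. A real program would need a better-justified decision rule than "shares at least one n-gram".

```python
import random


def split_known_questioned(tweets, known_fraction=0.9, seed=0):
    """Shuffle a list of Tweets and split it into (known, questioned)."""
    shuffled = list(tweets)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * known_fraction)
    return shuffled[:cut], shuffled[cut:]


def word_ngrams(text, n=2):
    """Return the set of word n-grams in a text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def count_same_author(questioned, known, n=2, min_shared=1):
    """Count questioned Tweets that share at least min_shared n-grams with the known set.

    The threshold is an assumption for illustration only.
    """
    known_ngrams = set()
    for tweet in known:
        known_ngrams |= word_ngrams(tweet, n)
    matches = 0
    for tweet in questioned:
        shared = word_ngrams(tweet, n) & known_ngrams
        if len(shared) >= min_shared:
            matches += 1
    return matches


# Made-up example data, not taken from any real account.
known = ["i think the weather is great today", "the weather is awful again"]
questioned = ["honestly the weather is fine", "buy cheap tickets now"]

matches = count_same_author(questioned, known)
print(f"{matches} out of {len(questioned)} Tweets were written by the same author")
```

With real data, `split_known_questioned` would be called once per author to produce the 90% Known and 10% Questioned parts before any comparison is made.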
- Problem - Identify the personal characteristics of the author of a text. You should evaluate at least two characteristics, e.g. age (young, old), education level (junior high, university).
- Dataset - Select TEN letters from Letters Anonymous and save each as a plain text file.
- Program - Write a program that analyzes all TEN texts. The program should focus on the language features that are relevant to profiling. One useful feature is likely to be lexical density, which is evaluated here using the type-token ratio. This ratio provides a measure of the variety of vocabulary. We assume that the wider the variety, the higher the education level. Other useful features are punctuation and spaces. Older people tend to use a double space after periods, as this was the convention for typewritten text. A list of Generation Z vocabulary might help identify young people. The program should output the numerical values and the personal characteristics for the author of each text.
- Comments -
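The features mentioned in this scenario could be computed roughly as shown below. This is only a sketch: the word list, the 0.6 threshold and the rules that map numbers to labels are hypothetical placeholders chosen for illustration, and your team would need to justify its own values.

```python
import re

# Hypothetical, illustrative word list; a real list would need to be researched.
GEN_Z_WORDS = {"sus", "yeet", "rizz"}


def tokenize(text):
    """Split a text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def type_token_ratio(text):
    """Number of distinct words (types) divided by total words (tokens)."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def double_space_count(text):
    """Count periods that are followed by two spaces."""
    return len(re.findall(r"\. {2}", text))


def gen_z_count(text):
    """Count tokens that appear in the illustrative Generation Z word list."""
    return sum(1 for token in tokenize(text) if token in GEN_Z_WORDS)


def profile(text, ttr_threshold=0.6):
    """Return numerical values and guessed characteristics for one text.

    The threshold and the decision rules are assumptions, not established facts.
    """
    ttr = type_token_ratio(text)
    doubles = double_space_count(text)
    slang = gen_z_count(text)
    if doubles > slang:
        age = "old"
    elif slang > doubles:
        age = "young"
    else:
        age = "unknown"
    return {
        "type_token_ratio": round(ttr, 2),
        "double_spaces": doubles,
        "gen_z_words": slang,
        "education": "university" if ttr >= ttr_threshold else "junior high",
        "age": age,
    }


# Made-up example text, not from Letters Anonymous.
print(profile("I saw the cat.  The cat saw me."))
```

Note that the type-token ratio depends on text length (longer texts repeat more words), so comparing letters of very different lengths may be misleading.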
- Problem - Identify the actual author of a text from ten candidate authors.
- Dataset - Use this dataset: Reuters_50_50. Select 10 authors out of the 50 authors. For each author use 90% of the texts as the training set, and 10% as the test set.
- Program - Write a program that measures the cosine similarity (or another measure) between each test set and each author's training set, and then uses the results to identify the author of each test set. The program should provide numerical results for all TEN test sets against all TEN authors. Finally, the program names the most likely author for each test set. This may be in the form of a probability statement, e.g. XXX is most likely the author (90%).
- Comments -
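One possible way to approach this scenario is sketched below, assuming each set of texts is represented by its raw word counts. The function names and the tiny example texts are invented for illustration; they are not part of Reuters_50_50, and other features or similarity measures may work better.

```python
import math
from collections import Counter


def word_counts(texts):
    """Build a word-frequency vector (as a Counter) from a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def cosine_similarity(a, b):
    """Cosine of the angle between two word-frequency vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(test_texts, training_sets):
    """Score one test set against every author and return (best_author, scores)."""
    test_vector = word_counts(test_texts)
    scores = {
        author: cosine_similarity(test_vector, word_counts(texts))
        for author, texts in training_sets.items()
    }
    best_author = max(scores, key=scores.get)
    return best_author, scores


# Made-up example data with two hypothetical authors.
training_sets = {
    "author_a": ["the market rose sharply today"],
    "author_b": ["the team won the match"],
}
test_texts = ["the market fell today"]

best, scores = attribute(test_texts, training_sets)
for author, score in scores.items():
    print(f"{author}: {score:.3f}")
print(f"{best} is most likely the author")
```

For the full task, `attribute` would be called once for each of the TEN test sets. A cosine similarity score is not itself a probability, so a statement such as "90%" would need an extra step, for example normalising the scores, and that choice should be explained.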
Make sure you can explain the following in simple English:
- needle in the haystack
- lexical density