Unit 2: Types of authorship analysis

Learning outcomes

By the end of this unit you should:

  • understand the differences between authorship attribution, profiling and verification
  • understand the importance of machine learning in authorship analysis
  • know why white-box models are preferable to black-box models
  • be able to name four areas in stylometry

Activity 1: Three types of authorship analysis

Listen to this introduction to authorship analysis to discover the difference between authorship attribution, authorship profiling and similarity detection.

Activity 2: Authorship attribution

Read the introduction below:

The main focus of authorship attribution is identifying the real author of a disputed document. For example, in criminal cases, the text could be an anonymous threatening letter or a will suspected of being completely or partially forged. Authorship attribution is a type of text categorization (also called text classification) problem.

Machine learning models, specifically Support Vector Machines (SVMs), can attribute authorship with reasonably high accuracy. A key problem, however, is that machine learning models are often black boxes, which means that we cannot explain exactly how the algorithm makes its decisions. White-box machine learning models are preferred in court cases because it is easier for judges and juries to understand how authorship is attributed.
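
To make the idea of a white-box model concrete, here is a minimal sketch in Python. It is not an SVM. It is a much simpler nearest-profile comparison in which every step can be inspected. The marker word list and the function names are assumptions made up for this illustration.

```python
import re
from collections import Counter

# Hypothetical marker words, chosen only for illustration.
MARKERS = ["the", "of", "and", "but", "however"]

def profile(text):
    """Relative frequency of each marker word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid division by zero for empty text
    return {m: counts[m] / total for m in MARKERS}

def explain_distance(disputed, known):
    """Return the total distance and the per-feature differences."""
    p, q = profile(disputed), profile(known)
    diffs = {m: abs(p[m] - q[m]) for m in MARKERS}
    return sum(diffs.values()), diffs

def attribute(disputed, candidates):
    """Pick the candidate whose profile is closest to the disputed text."""
    scores = {name: explain_distance(disputed, text)[0]
              for name, text in candidates.items()}
    return min(scores, key=scores.get), scores
```

Because `explain_distance` returns the difference for each marker word, anyone can see which features pushed the decision towards one author. That visibility is what a black-box model lacks.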

Activity 3: Machine learning

Work in pairs or threes. Discuss your answers to the questions listed below.

  1. How does machine learning work?
  2. What are features and feature space?
  3. What is the difference between machine learning and deep learning?
  4. What is a Support Vector Machine?
  5. What is a Neural Network?
  6. What are the differences between SVMs and NNs?
  7. What is the difference between supervised and unsupervised learning?

If you are not already familiar with machine learning, you should dedicate some time to getting up to speed. The prototype that you will develop is highly likely to use machine learning.

Activity 4: Authorship profiling

Read.

Authorship profiling differs from authorship attribution in that the aim is to create a character profile for the writer rather than identify the real author. Profiles include details about the gender, age, occupation and educational level of the author.

Authorship profiling is featured in many detective dramas, including Criminal Minds and Sherlock. Frequently, a letter is discovered and the genius-level detective is able to profile the author. Naturally, that is the difference between television and reality. Actual profilers use large databases and make use of various feature extraction and processing procedures.

Activity 5: Watching (Optional)

Watch this analysis of the handwriting of an anonymous letter by Reid, one of the profilers in the television series Criminal Minds (47 sec). Be prepared to listen intently as there are no closed captions and he speaks quickly.

Activity 6: Authorship verification

Read.

Authorship verification is a form of similarity detection. The aim is to determine whether texts were produced by the same author or by different authors. In universities, authorship verification is often used to decide whether a text contains plagiarism. Take the example of a student who wrote an essay but copied many words from an online source without citation. In short, the student stole the words and used them without giving credit to the true author. This is called plagiarism. Plagiarism is often easy to detect using basic similarity detection tools.
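
A basic similarity check can be sketched in a few lines of Python. The example below compares the sets of three-word sequences in two texts and reports their overlap (the Jaccard similarity). This is one common approach, not a description of any particular plagiarism tool, and the function names are assumptions for illustration.

```python
import re

def shingles(text, n=3):
    """Set of n-word sequences (shingles) found in the text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b, n=3):
    """Shared shingles divided by all shingles; 0.0 means no overlap."""
    sa, sb = shingles(a, n), shingles(b, n)
    if not sa and not sb:
        return 0.0  # both texts are too short to compare
    return len(sa & sb) / len(sa | sb)
```

A score close to 1 suggests that large parts of one text were copied from the other. A score close to 0 suggests little shared wording, although it cannot rule out copied ideas that have been paraphrased.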

Activity 7: Stylometry

Read.

Stylometry is the study of the linguistic features that can be used for authorship analysis. There are four main types of stylometric features, namely lexical, syntactic, structural and content-specific. Texts are made up of words. Words themselves are made up of letters. Digital texts can be considered as strings of characters.

Lexical features are often analyzed by looking at individual words (unigrams, 1-grams). The unit of analysis is called a token. A token may refer to a word, a punctuation mark or a number. Collocations between words are usually analyzed by looking at two-word sequences (bigrams, 2-grams).
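
The short Python sketch below shows one way to split a text into tokens and count unigrams and bigrams. The tokenization rule (words, numbers and single punctuation marks) is a simplifying assumption. Real tokenizers handle many more cases.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into word, number and punctuation tokens."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]", text)

def ngrams(tokens, n):
    """List of n-token sequences: n=1 gives unigrams, n=2 gives bigrams."""
    return list(zip(*(tokens[i:] for i in range(n))))

tokens = tokenize("to be or not to be")
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
```

In this example, `bigram_counts` shows that the bigram ("to", "be") occurs twice, while every other bigram occurs once.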

Syntactic features are those features that structure the words within and between sentences. For example, prepositions and conjunctions are used to link noun phrases and clauses to main clauses. To analyze these automatically, it is necessary to tag the text with parts of speech prior to analysis.
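
Proper part-of-speech tagging needs a trained tagger, which is beyond a short example. As a rough stand-in, the Python sketch below looks words up in two small closed lists and reports how often prepositions and conjunctions occur. The word lists are hypothetical and incomplete, and a lookup like this cannot resolve words that belong to more than one part of speech.

```python
import re

# Hypothetical, incomplete word lists used only for illustration.
PREPOSITIONS = {"in", "on", "at", "by", "with", "from", "of"}
CONJUNCTIONS = {"and", "but", "or", "because", "although"}

def function_word_rates(text):
    """Share of words that are listed prepositions or conjunctions."""
    words = re.findall(r"[a-z]+", text.lower())
    total = len(words)
    if total == 0:
        return {"preposition": 0.0, "conjunction": 0.0}
    prep = sum(w in PREPOSITIONS for w in words)
    conj = sum(w in CONJUNCTIONS for w in words)
    return {"preposition": prep / total, "conjunction": conj / total}
```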

Structural features are related to how documents are organized. For example, does the writer of a note use greetings and salutations? See the example below.

Dear Professor

Please do not give us homework today.

Overworked student
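
Structural features like these can be detected with simple rules. The Python sketch below checks a note for a greeting on the first line and a sign-off on the last line. The greeting words and the sign-off rule (a final line without sentence-ending punctuation) are rough assumptions for illustration, not an established method.

```python
import re

# Hypothetical list of greeting words, for illustration only.
GREETING = re.compile(r"^(dear|hello|hi|greetings)\b", re.IGNORECASE)

def structural_features(note):
    """Report simple layout features of a short note."""
    lines = [ln.strip() for ln in note.splitlines() if ln.strip()]
    return {
        # First non-empty line starts with a greeting word.
        "has_greeting": bool(lines) and bool(GREETING.match(lines[0])),
        # Last line looks like a name, not a sentence (rough heuristic).
        "has_signoff": len(lines) >= 2
                       and not lines[-1].endswith((".", "!", "?")),
        "line_count": len(lines),
    }
```

Applied to the note above, this sketch reports a greeting, a sign-off and three non-empty lines.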

Content-specific features are related to the particular text type or genre of writing. The genre of writing has a significant effect on the choice of vocabulary used. When writing a New Year greeting card, it would be usual to write "Happy New Year", and so a writer who uses "Cheerful New Year" would stand out.

Knowledge and application

Activity 8: Authorship analysis

Compare and contrast the questioned text with the three known texts. Decide which of the known authors is most likely to be the author of the questioned text. Collect evidence to support your decision. Be prepared to present your conclusion and evidence in class. There is no need to create a slideshow, but you should make some bullet-point notes.

Questioned text

I think that this email was not written by a British army officer. There are three reason why. First, “Gary Hoffman” is an american name. Second, This email is written politely and british are polite. Third, he do not use British Pounds but he use United States Dollars.

Known text 1

I think this email was written by an American army officer because there are three reasons. First, it can see that the email is written by an educated military man because the text in the email is not abbreviated. Second, it can see that the email is written by an American because the word which "Truly Yours" and "Greetings" is used in the email. Finally, it can see that the email is written by an American because 38,000,000 is the American way of expression. Therefore, the email is written by an American army officer.

Known text 2

I have concluded that this email is written by an American army officer. Firstly, he gives his name as "Maj. Gary Hoffman." "Maj." means a major general in US army. Secondly, he explains his situation. He explains his position and current tasks in army. Lastly, he uses formal and polite language such as "Greetings," and "Truly Yours." Therefore, we can think that he is an educated person. For these reasons, I concluded that the writer of this email is an American army officer.

Known text 3

I think this email was written by an American army officer. I have three reason. First, it's an american name. I think "Gary Hoffman" is american name. Second, This email is written politely. For example, "My name is" and "Truly yours". Third, he use United States Dollars.

Review

Make sure you can explain the following in simple English:

  1. profiling
  2. forgery (to forge)
  3. attribution
  4. authorship verification
  5. similarity detection
  6. disputed
  7. anonymous
  8. plagiarism
  9. text categorization
  10. text classification
  11. judge
  12. jury
  13. machine learning
  14. support vector machine
  15. black-box model
  16. white-box model
  17. stylometry
  18. lexical features
  19. syntactic features
  20. structural features
  21. content-specific features

Make sure you can explain the differences between the following in simple English:

  1. authorship attribution vs. authorship profiling
  2. similarity detection vs. authorship attribution
  3. authorship profiling vs. authorship analysis
  4. authorship attribution vs. authorship verification
  5. text categorization vs. text classification
