Unit 1: Solving crimes using authorship analysis

Learning outcomes

By the end of this unit you should:

  • know more about your teacher, classmates and course
  • have identified patterns in texts
  • have used patterns to attribute authorship
  • understand how authorship analysis can be used to solve crimes
  • be able to define the terms in the review section

Activity 1: Your tutor

Listen to this introduction to find out the name of your teacher and how to contact him on campus and via email.


Activity 2: Your course

Read the introduction below:

The official university course syllabus provides details of the grade percentages awarded to participation, quizzes and final assessment.

The course divides into two parts: (1) authorship and language, and (2) prototype development. You will work individually, in pairs or in teams to understand how language can be used to analyze authorship. You will work in teams to develop a prototype. Students who can program in Python and prefer not to work in teams may form a team of one.

Active participation is defined (by me) as submitting assignments or completing assigned tasks via the learning management system (ELMS).

In general, each assignment or task is awarded either zero or 100%. Most assignments involve solving problems. This emoticon is used to remind you of these. Quizzes are conducted either online or live. The final assignment is the creation of a prototype authorship analysis tool. For this assignment, you need to design, develop and evaluate an original tool. Your group will need to submit three items, namely the source code, a written report and a video evaluation.

Activity 3: Your classmates

Introduce yourself to your classmates. State your preferred name, something you are proficient at (programming, gaming, maths?), and share the reason why you selected this course.

Activity 4: Course content

Read the following.

The course divides into two parts: knowledge acquisition and prototype development. In the knowledge acquisition part, we focus on the core concepts of time and tense. In the prototype development part, we focus on visualization of language.

Authorship and language

The first five units are dedicated to understanding how language can be used to ascertain authorship, and enabling you to apply this knowledge to texts written in English. The five units to be covered are:

  1. Solving crimes using authorship analysis
  2. Types of authorship analysis
  3. Language as a fingerprint
  4. Stylometry
  5. Case studies

Prototype development

In this part, different visualization tools are introduced. This is followed by a brief introduction to different natural language pipelines. The lion's share of this part will be spent on prototype development. This prototype needs to be evaluated and so methods of evaluation are also covered. The final unit aims to review the course, bringing together all the core concepts covered.

  6. Python for Natural Language Processing (NLP), including Natural Language ToolKit (NLTK)
  7. Problem breakdown
  8. Prototype development: idea generation, design and development
  9. Prototype evaluation: usability, accuracy and efficacy
  10. Revision

The course comprises 14 sessions and 10 units. The first half of the course will focus on Units 1 to 5. The remainder of the course will focus on Units 6 to 10.

Activity 5: Authorship

Read and think.

Authorship identification can be simple, difficult or impossible. There are many factors that impact authorship identification. Consider your own writing in your first language. Is your writing stable? Do you use the same spellings, same structure and same punctuation consistently? Does your language change when you write short messages, posts on social network sites, or university assignments? Do you use some words or phrases more frequently than other people? Do you have a catchphrase that other people could identify as being yours? Do you ever copy the language of anyone else?

Discuss your answers in pairs or small groups.

Activity 6: Genre identification

Work individually, in pairs or in teams. Decide the genre each text comes from. The genres are children's story, newspaper article and personal letter. Explain the reasons for your choices.

  1. A MAN gunned down outside his Bolton home was arrested early today by detectives in a swoop on a Lancashire hotel. Barry Lomax, aged 38, was held by police after officers burst into a room at the Yarrow Bridge Hotel, Chorley.
  2. I hope that everything is going well. Life is pretty much the same here. Jack has managed to get another job and so we hope to be able to get to see you all soon. Brad is enjoying his new school and can now read simple books. He won't be reading War and Peace for a while, though.
  3. Once upon a time there was a widow who had two daughters; one of them was beautiful and industrious, the other ugly and lazy. The mother, however, loved the ugly and lazy one best, because she was her own daughter, and so the other, who was only her stepdaughter, was made to do all the work of the house, and was quite the Cinderella of the family.

Work with a partner. Discuss any patterns that you were able to find.

Activity 7: Ima desho

Consider the following phrase.


Who said this? Are you sure? What is the probability? What evidence do you have?

Activity 8: Issues affecting authorship identification

The activities above should have raised your awareness of the effect of genre and borrowing on language choice. There is another key issue which relates to whether the language a person uses stands out as markedly different. Language that is different is creative and original. So for example, someone introducing themselves as:

"I am John."

shows conformity and not creativity. But, someone introducing themselves as:

"I was named John, so that is my name."

shows creativity (but is likely to be considered a little odd or strange by others).

How about:

"The name is Blake, John Blake."

This version of an introduction draws on the format: family name, given name then family name, which was made famous by the British secret agent, James Bond. Clearly, I am not James Bond and am not claiming to be James Bond, but I borrowed the structure, not the name.

Draw a Venn diagram to show how the three aspects of genre, borrowed language and idiosyncratic (creative and original) language interact.

Activity 9: Introduction to markers for English

Identifying authorship is a classification problem. Texts can be classified at multiple levels, which include the language itself (e.g. English or French), the genre (e.g. letter or note) and the author (e.g. Shakespeare or Chaucer). Classification at each of these levels requires the classifier (automatic or human) to make decisions based on probability.

Work in pairs or threes to solve the following problems. All decisions must be based on evidence.

  1. Which of the following is written in English?
    1. This is English.
    2. Dit is Engels
  2. Which of the following is written in American English?
    1. I'm gonna close the trunk.
    2. I'm going to close the boot.
  3. Which of the following is written by a learner of English?
    1. I'm a safety driver.
    2. I'm a careful driver.
  4. Which of the following is written by a younger person?
    1. That song is splendid.
    2. That song is super.
    3. That song is sick.
    4. That song is bad.
  5. Which of the following SMS messages is written by a younger person?
    1. See you later.
    2. see ya later
    3. cu l8r
  6. Which of the following extracts from a formal letter is written by a younger person?
    1. The policy is clear.  We need to implement it now, or there will be problems.
    2. The policy is clear. We need to implement it now, or there will be problems.
  7. Which of the following extracts from a formal letter is likely to have been written by a more educated person?
    1. Its time to implement it.
    2. It's time to implement it.
  8. Which of the following is more likely to be said by a gang member?
    1. Get the f**k out of here.
    2. Would you please leave?

Compare your answers with other groups.
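
The idea of classifying a text from evidence, as in problem 1 above, can be sketched in Python. This is only a minimal illustration: the word lists and the function names (`language_scores`, `guess_language`) are hypothetical and invented for this example, and a real classifier would need far more evidence than a handful of function words.

```python
# Tiny, hypothetical function-word lists (an assumption for illustration only).
FUNCTION_WORDS = {
    "English": {"this", "is", "the", "in", "a", "of", "and", "to"},
    "Dutch": {"dit", "is", "de", "het", "een", "van", "en", "te"},
}


def language_scores(text):
    """Return, for each language, the share of tokens found in its word list."""
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    tokens = [t for t in tokens if t]
    if not tokens:
        return {lang: 0.0 for lang in FUNCTION_WORDS}
    return {
        lang: sum(t in words for t in tokens) / len(tokens)
        for lang, words in FUNCTION_WORDS.items()
    }


def guess_language(text):
    """Pick the language with the highest score (the most supporting evidence)."""
    scores = language_scores(text)
    return max(scores, key=scores.get)


if __name__ == "__main__":
    for sentence in ["This in English.", "Dit is Engels"]:
        print(sentence, language_scores(sentence), guess_language(sentence))
```

The scores are proportions, not certainties. A short text gives little evidence, so the decision remains a matter of probability.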

Activity 10: Questioned and Known text comparison

Discuss and decide on the authorship of the following cases.

  1. California case - given name (McMenamin, 2002, p.77)
    • Questioned text: Mary Ann x 2
    • Known text 1: Mary Ann x 5
    • Known text 2: Maryanne x 4, Maryann x 2
  2. Anonymous letter, written in 1984 - spelling
    • Questioned text: buisness x 7
    • Known text 1: Business x 1, business x 4
    • Known text 2: Buisness x 3
  3. Court case, handwriting testimony (Wellman, 1936)
    • Questioned text: toutch
    • Known text 1: toutch
    • Known text 2: touch
  4. California case of 686 letters - state abbreviation (McMenamin, 2002, p.77)
    • Questioned text: Ca.
    • Known text group 1 (514 letters): CA
    • Known text group 2 (39 letters): CA.
    • Known text group 3 (31 letters): Ca
    • Known text group 4 (76 letters): Ca.
    • Known text group 5 (1 letter): ca
    • Known text group 6 (1 letter): ca.
    • Known text group 7 (24 letters): Other, e.g. Cal, California
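
For case 4, it may help to turn the raw counts into proportions before deciding how much weight the questioned form carries. The sketch below uses only the figures listed above. The variable and function names are hypothetical, and how to interpret the resulting percentages is left to your discussion.

```python
from collections import Counter

# Number of known letters using each form of the state abbreviation (case 4).
known_counts = Counter({
    "CA": 514,
    "CA.": 39,
    "Ca": 31,
    "Ca.": 76,
    "ca": 1,
    "ca.": 1,
    "Other": 24,
})


def relative_frequency(counts, variant):
    """Share of all letters that use the given variant."""
    total = sum(counts.values())
    return counts[variant] / total if total else 0.0


total_letters = sum(known_counts.values())
questioned_share = relative_frequency(known_counts, "Ca.")

if __name__ == "__main__":
    print(f"Total letters: {total_letters}")
    for variant, n in known_counts.most_common():
        print(f"{variant:<6} {n:>4} {n / total_letters:6.1%}")
    print(f"Share of known letters using the questioned form: {questioned_share:.1%}")
```

A form that is rare in the known letters narrows the field more than a common one, which is why the proportion matters as well as the match itself.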

Knowledge and application

Knowledge and application activities are designed to help you activate the key terminology and apply the concepts covered in the course so far. Try to use the terminology and concepts accurately and appropriately.

Activity 11: Email analysis

Analyze the language in this email to decide whether or not the email was written by an American army officer. Identify the markers you use to make your decision and justify your decision.


My name is Maj. Gary Hoffman. I am an American soldier, presently in Iraqi for the protection of the US embassy and advise the Iraqi army in relation to the advance of ISIS. With a very desperate need for assistance, I have decided to contact you for your kind assistance to move the sum of Thirty eight Million United States Dollars to you if I can be assured that my share will be safe in your care until I complete my service.

More details will be follow

Truly Yours

Activity 12: Authorship analysis

Discuss the authorship of the following cases. How could authorship be analyzed in each case?

  1. Plagiarism of university essay
    1. Questioned text: utterly ridiculous, completely insane, absolutely fabulous,
    2. Known text 1: very good, abolutely bored, quite exciting, fairly difficult
    3. Known text 2: totally unbelievable, utterly stupid, entirely useless
  2. Beale letters, written in 1882 in the UK. (Juola, 2006, p.31)
    1. Questioned text (extract): Keeping well together they followed their trail for two weeks or more, securing many, and stampeding the rest.
    2. Choice one: Authentic
    3. Choice two: Counterfeit
  3. Books of the Bible. Do they have the same or different authors?
    1. Questioned text: Genesis
    2. Known text: Exodus

Activity 13: Authorship analysis

Compare and contrast the questioned text with the two known texts. Decide which markers are important. Prepare to present your evidence in support of your decision.

Questioned text

There is a bom in XXXX school. It will explode this afternoon. This is no joke. Evacuate the school by 2.00 pm or else their will be many casulties. You have been warned.

Known text 1

Yesterday afternoon, the headmaster received an anonymous email, which stated that there was a bomb in our school. To ensure the safety of the students and staff, our school was evacuated and all lessons were cancelled.

Known text 2

We had the afternoon off school yesturday. Someone sent a bom threat to Mr XXX. Our chemistry test was cancelled. The police and the bom squad arrived to search for the bom.
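
Once you have chosen your markers by hand, you could check your counts with a short script. The sketch below is one possible approach and not a complete analysis: it counts a single spelling marker chosen for illustration, and the names `tokenize` and `marker_counts` are hypothetical. You should decide for yourself which other markers are worth counting.

```python
import re
from collections import Counter

TEXTS = {
    "questioned": (
        "There is a bom in XXXX school. It will explode this afternoon. "
        "This is no joke. Evacuate the school by 2.00 pm or else their will "
        "be many casulties. You have been warned."
    ),
    "known_1": (
        "Yesterday afternoon, the headmaster received an anonymous email, "
        "which stated that there was a bomb in our school. To ensure the "
        "safety of the students and staff, our school was evacuated and all "
        "lessons were cancelled."
    ),
    "known_2": (
        "We had the afternoon off school yesturday. Someone sent a bom threat "
        "to Mr XXX. Our chemistry test was cancelled. The police and the bom "
        "squad arrived to search for the bom."
    ),
}

# Illustrative marker only: two spellings of the same word.
MARKERS = ["bom", "bomb"]


def tokenize(text):
    """Lowercase the text and keep alphabetic word tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def marker_counts(text, markers):
    """Count how often each marker spelling occurs as a whole word."""
    counts = Counter(tokenize(text))
    return {marker: counts[marker] for marker in markers}


if __name__ == "__main__":
    for name, text in TEXTS.items():
        print(name, marker_counts(text, MARKERS))
```

A single marker is rarely enough evidence on its own, so treat the counts as support for your argument and not as proof.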


Review

Make sure you can explain the following in simple English:

  1. authorship
  2. analysis
  3. attribution
  4. genre
  5. borrowed language
  6. idiosyncratic language
  7. punctuation
  8. spelling
  9. vocabulary
  10. markers
  11. text
  12. questioned text
  13. known text
  14. authentic
  15. counterfeit
  16. collocation

Make sure you can explain the differences between the following in simple English:

  1. authorship analysis vs. authorship attribution
  2. questioned text vs. known text
  3. authentic vs. counterfeit

Running count: 16 of 60 concepts covered so far.