Unit 4: Stylometry

Learning outcomes

By the end of this unit you should:

  • understand how language feature profiles are used in authorship analysis
  • know how to use the R package Stylo
  • have used Stylo to identify the author of unknown texts.

Activity 1: Language feature profiles


Everyone is different. We each have a DNA profile that is unique to us, although it is very similar to those of our parents, siblings (brothers and sisters) and offspring (children). Many of our physical features are also unique. In the criminal justice system, the most commonly used feature is the fingerprint. Each fingerprint combines many small features, so the number of possible patterns is extremely large. Other physical features that are unique are our eyes (specifically the pattern in the iris), ears (the shape of the lobe) and tongues (bumps and ridges).

Our speech patterns are also unique, both in terms of speech production (sounds) and lexical and grammatical choices (words). This course focusses on written language, so let's consider the lexical and grammatical profile of your language, your idiolect. We will focus on linking words, punctuation marks and letter case. There are two letter cases, namely lower case and upper case. Upper case letters are also called capital letters.

Check your usage of the following markers (language features) in your written English. Your written English is the English you write without any help from dictionaries, grammar checkers, etc. Use the following scales and questions. Write your answers down.

Occurrence scale

  • A = absent
  • P = present

Do you ever:

  1. use a comma (,) before but
  2. use a semi-colon (;) before but
  3. use a full stop (.) before But
  4. use a comma (,) before however
  5. use a semi-colon (;) before however
  6. use a full stop (.) before However
  7. use a comma (,) before and
  8. use a semi-colon (;) before and
  9. use a full stop (.) before And
  10. use the structure Though.... , .... .
  11. use the structure Although.... , .... .
  12. use the structure Even though.... , .... .
  13. use the structure .... though .... .
  14. use the structure .... although .... .
  15. use the structure .... even though .... .

Frequency scale

  • 0 = never
  • 1 = rarely
  • 2 = sometimes
  • 3 = often
  • 4 = usually
  • 5 = always

How often do you use the following:

  1. but / But
  2. however / However
  3. and / And
  4. in addition / In addition
  5. Additionally
  6. so / So
  7. therefore / Therefore
  8. thus / Thus
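
If you have a sample of your own writing, you can also check some of these markers automatically. The following is a minimal sketch in base R. The sample text, the function name count_matches and the marker names are made up for illustration, and only four of the markers are covered.

```r
# Hypothetical sample text; replace it with a piece of your own writing.
sample_text <- paste(
  "I like tea, but I prefer coffee.",
  "However, I drink both, but not at night.",
  "I enjoy them; however, I sleep badly."
)

# Count how many times a regular expression matches in a text.
count_matches <- function(pattern, text) {
  hits <- gregexpr(pattern, text, perl = TRUE)[[1]]
  if (hits[1] == -1) 0L else length(hits)
}

# A few of the markers from the lists above, written as regular expressions.
markers <- c(
  comma_before_but         = ", but\\b",
  semicolon_before_however = "; however\\b",
  full_stop_before_However = "\\. However\\b",
  full_stop_before_And     = "\\. And\\b"
)

counts <- sapply(markers, count_matches, text = sample_text)

# Occurrence scale: P if the marker appears at least once, otherwise A.
occurrence <- ifelse(counts > 0, "P", "A")

print(counts)
print(occurrence)
```

A short sample can only show that a marker is present. A marker that is absent from one text may still be part of your idiolect.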

Activity 2: Unique or shared idiolect?

Write your answers to the two sets of questions above as a string of 23 characters (15 letters followed by 8 digits), e.g. AAPAPPPAAAPPPPA54533542

Try to find someone with the same language feature profile.
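
If you collect profiles from several classmates, a short script can do the comparison for you. This is a sketch in base R that assumes each profile is written as 15 letters (A or P) followed by 8 digits (0 to 5). The two profiles and the function names are invented for illustration.

```r
# Two hypothetical profiles: 15 occurrence letters followed by 8 frequency digits.
profile_a <- "AAPAPPPAAAPPPPA54533542"
profile_b <- "AAPAPPPAPAPPPPA54523542"

# Check that a profile has the expected shape before comparing it.
is_valid_profile <- function(profile) {
  grepl("^[AP]{15}[0-5]{8}$", profile)
}

# Count the positions at which two profiles differ.
count_differences <- function(x, y) {
  sum(strsplit(x, "")[[1]] != strsplit(y, "")[[1]])
}

stopifnot(is_valid_profile(profile_a), is_valid_profile(profile_b))

same_profile  <- identical(profile_a, profile_b)          # TRUE only if every marker matches
n_differences <- count_differences(profile_a, profile_b)  # markers on which the two writers differ

print(same_profile)
print(n_differences)
```

Counting differences is more informative than a simple yes or no, because two writers can be close without being identical.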

As English is (probably) not your mother tongue, and many of you share a similar language learning experience, it is possible that, with just 23 language markers, you may find someone with the same profile. However, if the number of language features doubles or triples, the probability of finding someone else with the same profile decreases dramatically.
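
To see why more features make a match less likely, you can count the possible profiles. The sketch below assumes 15 markers with two possible answers (A or P) and 8 markers with six possible answers (0 to 5). The function name count_profiles is made up for illustration.

```r
n_occurrence <- 15  # markers answered with A or P (2 options each)
n_frequency  <- 8   # markers answered on the 0-5 scale (6 options each)

# Number of distinct profiles for a given number of each marker type.
count_profiles <- function(n_occ, n_freq) {
  2^n_occ * 6^n_freq
}

profiles_23 <- count_profiles(n_occurrence, n_frequency)          # 23 markers
profiles_46 <- count_profiles(2 * n_occurrence, 2 * n_frequency)  # 46 markers

print(profiles_23)
print(profiles_46)
```

With 23 markers there are about 55 billion possible profiles, and doubling the markers squares that number. In practice, answers are not equally likely and learners with similar backgrounds tend to answer in similar ways, so matches are more common than these counts alone suggest.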

Activity 3: Introduction to Stylo

Watch and listen to this basic introduction to the R package Stylo by its creator Maciej Eder (12 min 47 sec).

He gives a demonstration of its use on a Mac. First, you need to install R. Then, he explains how to:

  • load the library Stylo - library(stylo)
  • set the working directory - setwd("yourfoldername")
  • run Stylo with default settings - stylo()
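
Put together, the three steps might look like the sketch below. It assumes that the stylo package is already installed and that "yourfoldername" (a placeholder) contains a subfolder called corpus holding the plain text files you want to compare. The subfolder name comes from Stylo's default settings and is not stated in the steps above.

```r
# One-off installation (needs internet access); uncomment if stylo is not installed yet.
# install.packages("stylo")

library(stylo)            # load the package
setwd("yourfoldername")   # placeholder: the folder that contains your "corpus" subfolder
stylo()                   # open the settings window with the default values
```

When the settings window opens, you can accept the defaults and click OK to run the analysis.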

With Cluster Analysis, Stylo creates a dendrogram plot, which makes it easy to see the clusters that are formed. The closer two texts or clusters are joined on the dendrogram, the more similar their language features. His first demonstration uses Cluster Analysis while the second demonstration uses Multidimensional Scaling (MDS).
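
To get a feel for the two methods before using Stylo, you can try them on a tiny table of numbers. The sketch below uses only base R, with invented word frequencies and plain Euclidean distance. Stylo uses its own distance measures (such as Burrows's Delta) and many more words, so treat this as an illustration of the idea and not as a copy of what Stylo does.

```r
# Made-up relative frequencies (per 100 words) of three linking words in four texts.
freqs <- rbind(
  AuthorA_text1 = c(but = 1.2, however = 0.1, and = 3.0),
  AuthorA_text2 = c(but = 1.1, however = 0.2, and = 3.1),
  AuthorB_text1 = c(but = 0.3, however = 0.9, and = 2.0),
  AuthorB_text2 = c(but = 0.4, however = 1.0, and = 2.2)
)

# Pairwise distances between the texts (plain Euclidean distance for simplicity).
distances <- dist(freqs)

# Cluster analysis: build the tree and draw it as a dendrogram.
tree <- hclust(distances, method = "ward.D2")
plot(tree, main = "Toy cluster analysis")

# Cut the tree into two groups to see which texts cluster together.
groups <- cutree(tree, k = 2)
print(groups)

# Multidimensional scaling: place the texts on a 2D map that preserves the distances.
coords <- cmdscale(distances, k = 2)
plot(coords, type = "n", xlab = "Dimension 1", ylab = "Dimension 2", main = "Toy MDS")
text(coords, labels = rownames(freqs))
```

Both plots are drawn from the same distances. The dendrogram shows which texts join first, while the MDS map shows how far apart the texts are.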

Activity 4: Introduction to Stylo: Installation

Watch and listen to this introduction to the installation process for the R package Stylo by its creator Maciej Eder (8 min 06 sec).

Activity 5: Introduction to Stylo: Basic parameters

Watch and listen to this introduction to the basic parameters that you can use in the R package Stylo by its creator Maciej Eder (18 min 46 sec).

Knowledge and application

Activity 6: Authorship analysis using the R package Stylo

Work in your Trello teams. Identify the authorship of the questioned dataset using the R package Stylo. This dataset can be used with Stylo. The team leader (i.e. the first person listed on Trello) should submit the R code and a PDF of the results of your analysis via ELMS.
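
Because you need to submit R code, it may help to run Stylo without the settings window so that every setting is written in your script. The following is only a sketch. The folder name is a placeholder and the values (Cluster Analysis on the 100 most frequent words) are illustrative assumptions, not the settings required for this activity.

```r
library(stylo)
setwd("yourfoldername")   # placeholder: the folder that contains your "corpus" subfolder

# Run Stylo without the settings window so that the analysis can be repeated exactly.
results <- stylo(
  gui = FALSE,             # do not open the settings window
  corpus.dir = "corpus",   # subfolder holding the known and questioned texts
  analysis.type = "CA",    # Cluster Analysis; use "MDS" for Multidimensional Scaling
  mfw.min = 100,           # use the 100 most frequent words...
  mfw.max = 100,           # ...and only that one setting
  write.pdf.file = TRUE    # save the plot as a PDF in the working directory
)
```

Try more than one setting and check whether the questioned texts stay in the same cluster before you decide on an author.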


Source: Maciej Eder


Make sure you can explain the following in simple English:

  1. idiolect
  2. Stylo
  3. dendrogram
  4. Cluster Analysis
  5. Multidimensional Scaling
  6. letter case
  7. upper case
  8. lower case
  9. capital letters

Running count: 54 of 60 concepts covered so far.