By the end of this unit you should:
Listen to this introduction to authorship analysis to discover the difference between authorship attribution, authorship profiling and similarity detection.
Read the introduction below:
The main focus of authorship attribution is on identifying the real author of a disputed document. For example, in criminal cases, the text could be an anonymous threatening letter or a will suspected of being completely or partially forged. Authorship attribution is a type of text categorization or text classification problem.
Machine learning methods, specifically Support Vector Machines (SVMs), are able to attribute authorship with reasonably high degrees of accuracy. A key problem, however, is that machine learning models are often black boxes, which means that we cannot explain exactly how the algorithm makes its decisions. White-box machine learning models are preferred in court cases because it is easier for the judges and jury to understand how authorship is attributed.
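To make the white-box idea concrete, here is a minimal sketch in Python. It is not an SVM and is far too simple for real casework. It compares how often a few function words appear in the questioned text and in each known text, then reports the word-by-word differences so that a reader can see why one author was chosen. The word list, the function names and the three short texts are all invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, hand-picked function words; a real study would justify the list.
FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "because"]

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return {w: counts[w] / total for w in FUNCTION_WORDS}

def attribute(questioned, known_texts):
    """Return the closest author, all distances, and the evidence for the winner."""
    q = profile(questioned)
    distances, evidence = {}, {}
    for author, text in known_texts.items():
        k = profile(text)
        # Per-word absolute differences are kept so the decision can be explained.
        diffs = {w: abs(q[w] - k[w]) for w in FUNCTION_WORDS}
        evidence[author] = diffs
        distances[author] = sum(diffs.values())
    best = min(distances, key=distances.get)
    return best, distances, evidence[best]

# Invented toy texts, not taken from any real case.
known = {
    "author_a": "I stayed home because it rained and because the bus was late.",
    "author_b": "The report of the committee is in the hands of the chair.",
}
questioned = "We left early because the road was icy and because the sky was dark."

best, distances, evidence = attribute(questioned, known)
print(best)
for word, diff in evidence.items():
    print(word, round(diff, 3))
```

Because every number in the output is a difference in how often one named word is used, the reasoning behind the result can be read out and questioned, which is the property the paragraph above describes as white-box.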
Work in pairs or threes. Discuss your answers to the questions listed below.
If you are not already familiar with machine learning, you should dedicate some time to getting up to speed. The prototype that you will develop is highly likely to use machine learning.
Authorship profiling differs from authorship attribution in that the aim is to create a character profile for the writer rather than identify the real author. Profiles include details about the gender, age, occupation and educational level of the author.
Authorship profiling is featured in many detective dramas, including Criminal Minds and Sherlock. Frequently, a letter is discovered and the genius-level detective is able to profile the author at a glance. Reality is different. Actual profilers rely on large databases and make use of various feature extraction and processing procedures.
Watch this analysis of the handwriting of an anonymous letter by Reid, one of the profilers in the television series Criminal Minds (47 sec). Be prepared to listen intently as there are no closed captions and he speaks quickly.
Authorship verification is a form of similarity detection. The aim is to determine whether texts are produced by the same or different authors. In universities, authorship verification is often used to decide whether a text contains plagiarism or not. Take the example of a student who wrote an essay but copied many words from an online source without citation. In short, the student stole the words and used them without giving credit to the true author. This is called plagiarism. Plagiarism is often easy to detect using basic similarity detection tools.
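A basic similarity check can be sketched in a few lines of Python. The sketch below assumes one common approach: split each text into overlapping three-word sequences and measure how many sequences the two texts share (Jaccard similarity). The function names and the sample sentences are invented for illustration, and real tools are considerably more sophisticated.

```python
import re

def word_ngrams(text, n=3):
    """Set of all runs of n consecutive lowercase words in the text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard_similarity(text_a, text_b, n=3):
    """Shared n-grams divided by all distinct n-grams in the two texts."""
    a, b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    if not a and not b:
        return 0.0  # nothing to compare
    return len(a & b) / len(a | b)

# Invented sample sentences.
source = "Stylometry is the study of the linguistic features of a text."
essay = "As we know, stylometry is the study of the linguistic features used by writers."
unrelated = "The weather was cold and the train was late again."

print(jaccard_similarity(source, essay))
print(jaccard_similarity(source, unrelated))
```

A score near 1 means the texts share almost all of their three-word sequences, and a score of 0 means they share none. A high score only shows overlap; a person still has to check whether the shared words were quoted and cited properly.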
Stylometry is the study of the linguistic features that can be used for authorship analysis. There are four main types of stylometric features, namely lexical, syntactic, structural and content-specific. Texts are made up of words. Words themselves are made up of letters. Digital texts can be considered as strings of characters.
Lexical features are often analyzed by looking at individual words (unigrams, 1-grams). The unit of analysis is called a token. A token may refer to a word, a punctuation mark or a number. Collocations between words are usually analyzed by looking at two-word sequences (bigrams, 2-grams).
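The following Python sketch shows one way to turn a text into tokens and then count unigrams and bigrams. The tokenization rule is an assumption made for this example (words, numbers and single punctuation marks, all lowercased); other rules are equally valid.

```python
import re
from collections import Counter

# Assumed rule: a token is a word (optionally with an internal apostrophe),
# a number (optionally with , or . separators), or one punctuation mark.
TOKEN_PATTERN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+(?:[.,]\d+)*|[^\w\s]")

def tokenize(text):
    """Split text into lowercase word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sample = "Happy New Year! I wish you a happy new year."
tokens = tokenize(sample)
unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))

print(tokens)
print(unigram_counts.most_common(3))
print(bigram_counts.most_common(3))
```

Counting bigrams rather than single words is what lets an analyst notice that one writer habitually pairs particular words together.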
Syntactic features are those features that structure the words within and between sentences. For example, prepositions and conjunctions are used to link noun phrases and clauses to main clauses. To analyze these automatically, it is necessary to tag the text with parts of speech prior to analysis.
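The Python sketch below gives a rough idea of what tagging makes possible. It does not use a real part-of-speech tagger. Instead it assumes two small, hand-made word lists and labels each word as a preposition, a conjunction or something else. The lists, the tag names and the sample sentence are invented for illustration.

```python
import re
from collections import Counter

# Hypothetical mini-lexicon standing in for a real part-of-speech tagger.
PREPOSITIONS = {"in", "on", "at", "by", "for", "with", "from", "to", "of"}
CONJUNCTIONS = {"and", "but", "or", "because", "although", "so", "while"}

def tag(text):
    """Label each word PREP, CONJ or OTHER by looking it up in the sets above."""
    words = re.findall(r"[a-z']+", text.lower())
    tagged = []
    for word in words:
        if word in PREPOSITIONS:
            tagged.append((word, "PREP"))
        elif word in CONJUNCTIONS:
            tagged.append((word, "CONJ"))
        else:
            tagged.append((word, "OTHER"))
    return tagged

def tag_rates(text):
    """Share of words carrying each tag."""
    tagged = tag(text)
    counts = Counter(label for _, label in tagged)
    total = len(tagged) or 1  # avoid dividing by zero on empty input
    return {label: counts[label] / total for label in ("PREP", "CONJ", "OTHER")}

sentence = "I waited at the station because the train from Kyoto was late."
print(tag(sentence))
print(tag_rates(sentence))
```

A lookup like this cannot tell whether "to" is a preposition or part of an infinitive, which is one reason proper tagging tools are used in practice.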
Structural features are related to how documents are organized. For example, does the writer of a note use greetings and salutations? See the example below.
Please do not give us homework today.
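Structural features such as these can be checked automatically. The Python sketch below assumes a short list of opening and closing formulas and reports whether a note starts with a greeting and contains a sign-off. The lists, the function name and the second, framed version of the note are invented for illustration.

```python
import re

# Hypothetical opening and closing formulas; extend them for the genre studied.
GREETING = re.compile(r"^(dear|hello|hi|greetings)\b", re.IGNORECASE)
SIGN_OFF = re.compile(
    r"^(truly yours|yours truly|sincerely|regards|best wishes|thank you)\b",
    re.IGNORECASE,
)

def structural_features(note):
    """Report simple layout features of a note, ignoring blank lines."""
    lines = [line.strip() for line in note.splitlines() if line.strip()]
    return {
        "has_greeting": bool(lines) and bool(GREETING.match(lines[0])),
        "has_sign_off": any(SIGN_OFF.match(line) for line in lines[1:]),
        "line_count": len(lines),
    }

bare = "Please do not give us homework today."
# Invented variant of the same note with a greeting and a sign-off added.
framed = (
    "Dear Professor,\n\n"
    "Please do not give us homework today.\n\n"
    "Best wishes,\n"
    "Your students"
)

print(structural_features(bare))
print(structural_features(framed))
```

Two writers can send the same request in the same words and still differ in these features, which is why structure is treated as a separate kind of evidence.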
Content-specific features are related to the particular text type or genre of writing. The genre of writing has a significant effect on the choice of vocabulary used. When writing a new year greeting card, it would be usual to write "Happy New Year" and so a writer who uses "Cheerful New Year" would stand out.
Compare and contrast the questioned text with the three known texts. Decide which of the known authors is most likely to be the author of the questioned text. Collect evidence to support your decision. Be prepared to present your conclusion and evidence in class. There is no need to create a slideshow, but you should make some bullet-point notes.
Questioned text
I think that this email was not written by a British army officer. There are three reason why. First, “Gary Hoffman” is an american name. Second, This email is written politely and british are polite. Third, he do not use British Pounds but he use United States Dollars.
Known text 1
I think this email was written by an American army officer because there are three reasons. First, it can see that the email is written by an educated military man because the text in the email is not abbreviated. Second, it can see that the email is written by an American because the word which "Truly Yours" and "Greetings" is used in the email. Finally, it can see that the email is written by an American because 38,000,000 is the American way of expression. Therefore, the email is written by an American army officer.
Known text 2
I have concluded that this email is written by an American army officer. Firstly, he gives his name as "Maj. Gary Hoffman." "Maj." means a major general in US army. Secondly, he explains his situation. He explains his position and current tasks in army. Lastly, he uses formal and polite language such as "Greetings," and "Truly Yours." Therefore, we can think that he is an educated person. For these reasons, I concluded that the writer of this email is an American army officer.
Known text 3
I think this email was written by an American army officer. I have three reason. First, it's an american name. I think "Gary Hoffman" is american name. Second, This email is written politely. For example, "My name is" and "Truly yours". Third, he use United States Dollars.
Make sure you can explain the following in simple English:
Make sure you can explain the differences between the following in simple English: