Unit 3 Text corpora

Learning outcomes

By the end of this unit you should be able to:

  • work with large text collections and understand corpus linguistics
  • load and process various corpus formats using NLTK
  • build custom corpora from different sources
  • extract patterns and insights from corpus data

Activity 1 Introduction to corpora

Read about corpus linguistics and its applications.

A corpus (plural: corpora) is a large, structured collection of texts used for linguistic analysis. Corpora enable researchers to study language patterns, word usage, and linguistic phenomena across thousands or millions of texts. In NLP, corpora serve as training data for machine learning models and provide insights into natural language patterns.

Types of corpora we'll explore:

  • Literary corpora: Collections of novels, poems, and literary works
  • News corpora: Newspaper articles and journalistic content
  • Academic corpora: Research papers and scholarly publications
  • Social media corpora: Posts, tweets, and informal communication
  • Multilingual corpora: Parallel texts in different languages

Activity 2 Interactive corpus explorer

Explore different text types using the interactive browser.

Use the tabs below to browse sample texts from different corpora and observe how language varies across genres and contexts.

Corpus Text Browser
News Text Sample

Scientists at the University announced yesterday that they have developed a new machine learning algorithm that can predict weather patterns with 95% accuracy. The breakthrough research, published in Nature Climate Change, could revolutionize meteorological forecasting and help communities better prepare for extreme weather events.

Characteristics: Formal tone, third-person, factual reporting, temporal references

Activity 3 NLTK corpus resources

Watch how to access and use NLTK's built-in corpora.

This video (9 minutes) demonstrates how to load and explore NLTK's extensive collection of corpora, including the Brown Corpus, Reuters Corpus, and Gutenberg Collection.
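
As a rough companion to the video, the sketch below shows one way the three corpora named above might be loaded and sized. The helper names `summarize` and `builtin_summaries` are hypothetical, not part of NLTK; the sketch assumes NLTK is installed and that the corpus data can be downloaded on first use.

```python
def summarize(reader, name):
    """Return basic size figures for any reader exposing fileids() and words()."""
    return {
        "corpus": name,
        "files": len(reader.fileids()),
        "tokens": len(reader.words()),
    }


def builtin_summaries():
    """Download three NLTK corpora (network needed on first run) and size them."""
    # Imported here so summarize() stays usable without NLTK installed.
    import nltk
    from nltk.corpus import brown, gutenberg, reuters

    for package in ("brown", "gutenberg", "reuters"):
        nltk.download(package, quiet=True)

    readers = {"brown": brown, "gutenberg": gutenberg, "reuters": reuters}
    return [summarize(reader, name) for name, reader in readers.items()]
```

Calling `builtin_summaries()` in a session with NLTK available should return one dictionary per corpus. Each reader also offers methods such as `fileids()`, `words()` and `raw()` for closer inspection of individual files.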

Activity 4 N-gram generator

Extract and visualize n-grams from corpus text.

N-grams are sequences of n consecutive words that help identify common phrases and patterns in text. This tool generates bi-grams (2 words) and tri-grams (3 words) from your input.
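
The idea can be sketched in a few lines of plain Python. This is an illustration only, not the code behind the tool: the `tokenize` and `ngrams` helpers and the sample sentence are assumptions made for the example, and the tokenizer is deliberately simple (lowercase letters, digits and apostrophes).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Zipping n shifted copies of the list lines up consecutive tokens.
    return list(zip(*(tokens[i:] for i in range(n))))


if __name__ == "__main__":
    tokens = tokenize("the cat sat on the mat and the cat slept")
    print(Counter(ngrams(tokens, 2)).most_common(3))  # frequent bi-grams
    print(ngrams(tokens, 3)[:2])                      # first two tri-grams
```

NLTK provides a comparable helper, `nltk.util.ngrams(sequence, n)`, which returns an iterator of tuples rather than a list.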

N-gram Extraction Tool

Activity 5 Concordance tool (KWIC)

Search for keywords in context across corpus data.

A concordance shows how a word appears in different contexts. KWIC (Keyword in Context) displays your search term surrounded by neighboring words, helping you understand usage patterns.
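
A minimal KWIC sketch in plain Python is shown below. The function names, the window size and the column width are assumptions chosen for illustration; the search tool itself may work differently.

```python
import re


def kwic(text, keyword, window=4):
    """Return (left context, keyword, right context) for each match."""
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    target = keyword.lower()
    rows = []
    for i, token in enumerate(tokens):
        if token.lower() == target:
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, token, right))
    return rows


def format_kwic(rows, width=30):
    """Right-align the left context so the keywords line up in one column."""
    return [f"{left:>{width}}  {kw}  {right}" for left, kw, right in rows]


if __name__ == "__main__":
    sample = "The model learns patterns. A larger model needs more data."
    for line in format_kwic(kwic(sample, "model", window=2)):
        print(line)
```

Aligning the keyword in a single column is what makes recurring left and right neighbors easy to spot. In NLTK, `nltk.Text(tokens).concordance(word)` prints a similar display.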

Keyword in Context Search

Activity 6 Building custom corpora

Watch how to create and organize your own text collections.

This video (8 minutes) shows how to collect texts from various sources, organize them into a structured corpus, and prepare them for analysis.
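
One simple way to organize a custom corpus is one plain-text file per document inside a single directory. The sketch below assumes that layout; the file names, the sample texts and the helpers `build_corpus` and `load_corpus` are hypothetical and may differ from what the video uses.

```python
import re
import tempfile
from pathlib import Path


def build_corpus(root, documents):
    """Write each name -> text pair into root as a UTF-8 .txt file."""
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    for name, text in documents.items():
        (root / f"{name}.txt").write_text(text.strip() + "\n", encoding="utf-8")
    return root


def load_corpus(root):
    """Read every .txt file under root into a file name -> token list mapping."""
    corpus = {}
    for path in sorted(Path(root).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        corpus[path.name] = re.findall(r"\w+", text.lower())
    return corpus


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        docs = {"news_01": "Rain is expected.", "blog_01": "so much rain today"}
        root = build_corpus(Path(tmp) / "my_corpus", docs)
        for fileid, tokens in load_corpus(root).items():
            print(fileid, len(tokens), tokens)
```

A directory laid out this way can also be opened with NLTK's `PlaintextCorpusReader(root, r".*\.txt")`, which exposes the same `fileids()`, `words()` and `raw()` methods as the built-in corpora.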

Activity 7 Corpus comparison project

Compare linguistic features across different text collections.

Upload or paste texts from different genres and compare their linguistic characteristics using automated analysis.
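
The kind of automated analysis involved can be approximated with a few surface measures. The sketch below is an assumption about what such a comparison might compute (token count, vocabulary size, type-token ratio, average word length and average sentence length); it is not the tool's actual feature set, and the sentence splitter is intentionally naive.

```python
import re


def profile(text):
    """Compute simple surface statistics for one text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    types = set(tokens)
    return {
        "tokens": len(tokens),
        "types": len(types),
        "type_token_ratio": len(types) / len(tokens) if tokens else 0.0,
        "avg_word_length": sum(map(len, tokens)) / len(tokens) if tokens else 0.0,
        "avg_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
    }


def compare(corpus_a, corpus_b):
    """Pair up each statistic as (value for A, value for B)."""
    a, b = profile(corpus_a), profile(corpus_b)
    return {key: (a[key], b[key]) for key in a}


if __name__ == "__main__":
    news = "Scientists announced a new algorithm. It predicts weather patterns."
    chat = "omg the weather is so bad today! so so bad!"
    for name, values in compare(news, chat).items():
        print(name, values)
```

Type-token ratio falls as texts get longer, so it is best compared across samples of similar size.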

Multi-Corpus Comparison Tool

Unit Review

Test your understanding of corpus linguistics:

Self-Assessment Quiz

1. What is a corpus in NLP?

2. What does KWIC stand for?

3. What are bi-grams?