UW-Milwaukee, 2013 | Matthew L. Jockers

April 19, 2013: Text Analysis and Topic Modeling in the Humanities.

University of Wisconsin-Milwaukee, 2013.

Description:

Text collections such as the HathiTrust Digital Library and Google Books have provided scholars in many fields with convenient access to their materials in digital form, but text analysis at the scale of millions or billions of words still requires tools and methods that may initially seem complex or esoteric to researchers in the humanities. Text Analysis and Topic Modeling in the Humanities will provide a practical introduction to text analysis, with a special emphasis on topic modeling. The course will cover data ingestion, data preparation, data preprocessing, part-of-speech tagging, topic modeling, and data analysis. The main computing environment for the course will be R, “the open source programming language and software environment for statistical computing and graphics.” While no programming experience is required, students should have basic computer skills, be familiar with their computer’s file system, and be comfortable entering commands in a command-line environment.

Materials:

RStudio: Download
Instructions for installing R and RStudio
Class Materials: (Download 40MB Zip)

Schedule (subject to change, even at the last minute ☺):

SESSION ONE (9:00-10:15)

The R computing environment
R console vs. RStudio
Vectors and basic math in R
Text manipulation in R

BREAK (10:15-10:30)
SESSION TWO (10:30-12:00)

Downloading and exploring the exercise corpus
What is Latent Dirichlet Allocation (LDA) anyhow?
Text chunking
lapply(ing)
A toy model
Stop list implementation
A function
A toy model

LUNCH BREAK (12:00-1:30)
SESSION THREE (1:30-2:45)

A better stop list
A slightly better model
POS tagging
3 more functions
Re-Modeling

BREAK (2:45-3:00)
SESSION FOUR (3:00-4:30)

Ingesting entire corpus
Modeling the corpus
Visualizing topics as clouds
Topic data analysis
Basic data plotting