April 19, 2013: Text Analysis and Topic Modeling in the Humanities.
University of Wisconsin-Milwaukee, 2013.
Description:
Text collections such as the HathiTrust Digital Library and Google Books have provided scholars in many fields with convenient access to their materials in digital form, but text analysis at the scale of millions or billions of words still requires tools and methods that may initially seem complex or esoteric to researchers in the humanities. Text Analysis and Topic Modeling in the Humanities will provide a practical introduction to text analysis, with a special emphasis on topic modeling. The course will cover data ingestion, data preparation, data preprocessing, part-of-speech tagging, topic modeling, and data analysis. The main computing environment for the course will be R, “the open source programming language and software environment for statistical computing and graphics.” While no programming experience is required, students should have basic computer skills, be familiar with their computer’s file system, and be comfortable entering commands in a command-line environment.
Materials:
- RStudio: Download
- Instructions for installing R and RStudio
- Class Materials: (Download 40MB Zip)
Schedule (subject to change, even at the last minute☺):
- SESSION ONE (9:00-10:15)
  - The R computing environment
  - R console vs. RStudio
  - Vectors and basic math in R
  - Text manipulation in R
- BREAK (10:15-10:30)
- SESSION TWO (10:30-12:00)
  - Downloading and exploring the exercise corpus
  - What is Latent Dirichlet Allocation (LDA) anyhow?
  - Text chunking
  - lapply(ing)
  - A toy model
  - Stop list implementation
  - A function
  - A toy model
- LUNCH BREAK (12:00-1:30)
- SESSION THREE (1:30-2:45)
  - A better stoplist
  - A slightly better model
  - POS tagging
  - 3 more functions
  - Re-Modeling
- BREAK (2:45-3:00)
- SESSION FOUR (3:00-4:30)
  - Ingesting entire corpus
  - Modeling the corpus
  - Visualizing topics as clouds
  - Topic data analysis
  - Basic data plotting