Corpus, the Latin word for “body,” refers to a body of naturally occurring texts, and a corpus-based approach to language analysis discovers patterns of language use by examining those texts. Corpora (the plural of corpus) are becoming increasingly popular thanks to computers and databases that allow for near-instant searching of huge collections of written or spoken language. We talk about corpora in this class as part of our interest in the pragmatics of language and our hope that we can pique students’ interest in the way language works in real situations. I’ve identified here a few of the most popular corpora (many housed here at BYU, you’ll notice) that you might want to access as you complete assignments for this course.
Mark Davies, a professor of linguistics here at BYU, is responsible for organizing some of the most respected corpora in the world, and they’re housed on campus. The COCA (Corpus of Contemporary American English) is perhaps the most popular and contains millions of indexed words from a balanced set of spoken language, fiction, popular magazines, newspapers, and academic texts. The Time magazine corpus indexes articles that appeared in the popular news magazine from 1923 to 2006. Davies also provides a powerful interface for the British National Corpus, if you’re specifically interested in British uses of words and language. Any of these will be useful for your work in this course, although another corpus may be more helpful depending on your needs, so check out the complete list of corpora described at this page.
Google’s Ngram Viewer
The Google Ngram Viewer is an online tool, initially based on Google Books, that charts the frequency of any word or short phrase using yearly counts of n-grams (sequences of n consecutive words) found in sources printed between 1800 and 2012 in American English, British English, French, German, Spanish, Russian, Hebrew, and Chinese. It’s great for comparing the frequencies of words, for tracking a word’s popularity over time, and so forth.
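If you’re curious what a “yearly count of n-grams” looks like in practice, here is a minimal Python sketch of the general idea. It is not Google’s implementation: the function names and the tiny two-year corpus are made up for illustration, and splitting on whitespace is a deliberate simplification that ignores punctuation.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def yearly_relative_frequency(texts_by_year, phrase):
    """Map each year to the share of that year's n-grams that match the phrase."""
    target = tuple(phrase.lower().split())
    n = len(target)
    result = {}
    for year, texts in sorted(texts_by_year.items()):
        counts = Counter()
        for text in texts:
            # Naive tokenization: lowercase and split on whitespace (an assumption).
            counts.update(ngrams(text.lower().split(), n))
        total = sum(counts.values())
        result[year] = counts[target] / total if total else 0.0
    return result


# Hypothetical toy corpus: a couple of short "texts" per year.
toy_corpus = {
    1900: ["the horse and carriage waited", "a carriage passed the gate"],
    2000: ["the car waited outside", "a car passed the gate"],
}

print(yearly_relative_frequency(toy_corpus, "carriage"))
print(yearly_relative_frequency(toy_corpus, "car"))
```

Dividing by the total number of n-grams in each year, rather than reporting raw counts, is what makes years with very different amounts of text comparable on the same chart.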
Michigan Corpus of Academic Spoken English
For those of you specifically interested in the way words or phrases are used in speech, this is the corpus for you. It has some pretty decent search tools that can help you identify differences in the way spoken language is used. It is not the largest corpus, but its focus makes it an interesting one.