There is a legend about an Egyptian pharaoh who wanted to revive the ancient Egyptian language, which in his time had become a ceremonial language that no one but the priests understood. To this end, he took two newborn babies and had them raised in complete isolation, in the hope that they would grow up speaking “pure Egyptian”.
We’ve come a long way since then in understanding the inner workings of languages, but no one could say that linguists have yet reached a consensus on subjects like the nature, or even the structure, of language.
I suspect that this is an area where the computational component of humanities computing can make a significant contribution, and one of the most sophisticated ways of doing so is through the development and study of “monitor corpora”.
**********
On October 9th I attended Mark Davies’s lecture on “Using robust corpora to examine genre-based variation and recent historical shifts in English”, presented as part of the Humanities Computing Colloquium series.
The speaker, who is the creator of several searchable online corpora, used this occasion to describe to the audience the structure of a monitor corpus, its characteristics and purpose, and the way in which it can be used to track recent changes in language.
He based his presentation on his largest corpus, the Corpus of Contemporary American English (COCA), which contains more than 400 million words collected from American sources between 1990 and 2009.
According to Davies, in order to function as a monitor corpus, a corpus needs to be larger than 100 million words, to balance its composition between genres and across the time span of the collection, and to have a robust architecture. Among the goals of what the speaker defined as robust architecture are support for annotation and for complex searches at reasonable speeds, no more than 2 or 3 seconds per query. All of these goals are attainable if relational databases are employed.

The five genres selected for COCA (fiction, speech, popular magazines, newspapers, and academic papers) each represent 20% of the total and are intended to cover most areas of the language. Various comparative and statistical data are retrievable through the interface provided, and the user can extract the collocations of a chosen word within a span of up to 10 words on either side.
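To make the relational-database idea concrete, here is a minimal sketch of one way such a design could work: each word occurrence becomes a row, so collocates within a chosen span fall out of a simple self-join. The table layout, column names, and the tiny sample data are my own assumptions for illustration, not Davies’s actual schema.

```python
import sqlite3

# Hypothetical one-row-per-token layout; COCA's real schema is not public here.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE corpus (
    text_id INTEGER, pos INTEGER, word TEXT, genre TEXT, year INTEGER)""")

# A made-up six-token sample standing in for a 400-million-word corpus.
sample = [
    (1, 1, "the", "news", 1995), (1, 2, "strong", "news", 1995),
    (1, 3, "coffee", "news", 1995), (1, 4, "was", "news", 1995),
    (2, 1, "powerful", "fiction", 2003), (2, 2, "coffee", "fiction", 2003),
]
conn.executemany("INSERT INTO corpus VALUES (?, ?, ?, ?, ?)", sample)

def collocates(node, span=4):
    """Frequencies of words within `span` positions of `node`, in any text."""
    rows = conn.execute("""
        SELECT c.word, COUNT(*) AS freq
        FROM corpus AS n JOIN corpus AS c
          ON c.text_id = n.text_id
         AND c.pos BETWEEN n.pos - ? AND n.pos + ?
         AND c.pos != n.pos
        WHERE n.word = ?
        GROUP BY c.word
        ORDER BY freq DESC""", (span, span, node))
    return rows.fetchall()

print(collocates("coffee"))
```

Because the span is just an arithmetic condition on an indexed position column, the same query shape scales to very large tables with the query times Davies mentioned, which is the point of preferring a relational design over flat text files.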
What I find most exciting about large corpora like the one described is the fact that their usage is by no means restricted to computational linguistics. As some of the examples provided by Mark Davies in his lecture have shown, the context and the meaning of words across time and media are influenced by complex socio-political and economic factors. I don’t think I exaggerate at all when I say that monitor corpora like COCA could prove valuable tools in a variety of social sciences research projects.
Another field where the usage of contemporary language corpora could prove highly beneficial is the study of foreign languages. I myself, as a non-native English speaker, have recently used COCA as a last resort for clarifying some semantic issues on which I had reached a dead end using more traditional methods.
And finally, if we dare to think further ahead, well-structured, purpose-oriented corpora are more likely to make a valuable contribution to the development of natural language processing than giant internet-based corpora like the Google corpus, where little attention is given to the structural principles mentioned above.
I, for one, am looking forward to the creation of COHA (the Corpus of Historical American English) next year, which will contain texts spanning more than two centuries and which, I think, will offer a more panoramic view of the changes that have taken place in the language spoken in the US during most of the country’s existence. Having historical linguists and historians work together, tracking changes and identifying their background and consequences, would make for a most interesting study.
(Just for fun, here is our little computation: This post has 666 words.)
Further reading:
Davies, M. (2009). The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics, 14(2), 159-190.
Davies, M. (2009). Relational databases as a robust architecture for the analysis of word frequency. In D. Archer (Ed.), What's in a Word-list? Investigating Word Frequency and Keyword Extraction. Ashgate Publishing.