Tuesday, November 24, 2009

Corpus Linguistics or How to Make Sense of the Needle in the Haystack.

There is a legend about an Egyptian pharaoh who wanted to revive the ancient Egyptian language, which in his time had become a ceremonial language that no one but the priests understood. To this end, he took two newborn babies and had them raised in complete isolation, in the hope that they would grow up speaking “pure Egyptian”.

We’ve come a long way since then in understanding the inner workings of language, but no one could say that linguists have yet reached a consensus on subjects like the nature, or even the structure, of language.

I suspect that this is an area where the computational component of humanities computing can make a significant contribution, and one of the most complex ways of doing so is through the development and study of “monitor corpora”.

**********

On October 9th I attended Mark Davies’s lecture “Using robust corpora to examine genre-based variation and recent historical shifts in English”, presented as part of the Humanities Computing Colloquium series.

The speaker, who is the creator of several searchable online corpora, used this occasion to describe to the audience the structure of a monitor corpus, its characteristics and purpose, and the way in which it can be used to track recent changes in language.

He based his presentation on his largest corpus, the Corpus of Contemporary American English (COCA), which contains more than 400 million words collected from American sources between 1990 and 2009.

According to Davies, to function as a monitor corpus, a corpus needs to be larger than 100 million words, to balance its composition across genres and across the time span of the collection, and to have a robust architecture. Among the goals of what the speaker defined as robust architecture are support for annotation and for complex searches at reasonable speeds, of no more than 2 or 3 seconds per query. All of these are attainable goals if relational databases are employed. The five genres selected for COCA (fiction, spoken language, popular magazines, newspapers, and academic papers) each represent 20% of the total and are intended to cover most areas of the language. Various comparative and statistical data are retrievable through the interface provided, and the user can extract the collocates of a chosen word within a span of up to 10 words on either side.
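Davies’s actual schema is not described in the lecture notes above, but the relational-database idea behind this kind of collocate search can be sketched in a few lines of Python with SQLite: each word of the corpus occupies one row keyed by its position in the text, and collocates within a chosen span fall out of a simple self-join on the position column. The table and column names here are my own invention for illustration, and the one-sentence “corpus” is obviously a toy stand-in for 400 million words.

```python
import sqlite3

# Toy relational corpus: one row per word token, keyed by position.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE corpus (text_id INTEGER, pos INTEGER, word TEXT)")

sample = "the quick brown fox jumps over the lazy dog".split()
conn.executemany("INSERT INTO corpus VALUES (1, ?, ?)", list(enumerate(sample)))

def collocates(node, span=10):
    """Frequency of words within `span` positions of every occurrence of `node`.

    The self-join pairs each occurrence of the node word (alias n)
    with every other token (alias c) whose position differs by at
    most `span`, mimicking a "10 words on either side" query.
    """
    rows = conn.execute(
        """
        SELECT c.word, COUNT(*) AS freq
        FROM corpus AS n
        JOIN corpus AS c
          ON c.text_id = n.text_id
         AND ABS(c.pos - n.pos) BETWEEN 1 AND ?
        WHERE n.word = ?
        GROUP BY c.word
        ORDER BY freq DESC, c.word
        """,
        (span, node),
    )
    return rows.fetchall()

print(collocates("fox", span=2))  # neighbours within 2 words of "fox"
```

Because the heavy lifting is an indexed join rather than a scan of raw text, this is the sort of design that can keep query times in the 2–3 second range Davies mentioned, even at a scale of hundreds of millions of rows.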

What I find most exciting about large corpora like the one described is that their usage is by no means restricted to computational linguistics. As some of the examples Mark Davies provided in his lecture have shown, the context and meaning of words across time and media are influenced by complex socio-political and economic factors. I don’t think I exaggerate at all when I say that monitor corpora like COCA could prove valuable tools in a variety of social science research projects.

In addition, another field where contemporary language corpora could prove highly beneficial is the study of foreign languages. I myself, as a non-native English speaker, have recently used COCA as a last resort for clarifying some semantic issues on which I had reached a dead end using more traditional methods.

And finally, if we dare to think further ahead, well-structured, purpose-oriented corpora are more likely to make a valuable contribution to the development of natural language processing than other giant internet-based corpora, like the Google corpus, where little attention was given to the structural principles mentioned above.

I for one am looking forward to the creation of COHA (the Corpus of Historical American English) next year. It will contain texts spanning more than two centuries, and I think it will offer a more panoramic view of the changes that took place in the language spoken in the US during most of the country’s existence. Having historical linguists and historians work together, tracking changes and identifying their background and consequences, would make for a most interesting study.

(Just for fun, here is our little computation: This post has 666 words.)
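Sticking with the playful spirit, the count itself is easy to reproduce; a naive whitespace tokenizer (which will disagree slightly with any word processor over hyphens, numbers, and punctuation) is all it takes:

```python
# Naive word count: split on whitespace and count the tokens.
# Word processors tokenize hyphens and punctuation differently,
# so counts of the same text can vary by a few words.
def word_count(text: str) -> int:
    return len(text.split())

print(word_count("This post has 666 words."))  # 5 tokens in this sample
```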

Further readings:

Davies, M. (2009). The 385+ million word Corpus of Contemporary American English (1990–2008+): Design, architecture, and linguistic insights. International Journal of Corpus Linguistics, 14(2), 159–190.

Davies, M. (2009). Relational databases as a robust architecture for the analysis of word frequency. In D. Archer (Ed.), What's in a word-list? Investigating word frequency and keyword extraction. Ashgate.

To boldly go...

A grande dame of high society is said to have asked for Frédéric Chopin’s opinion of her daughter’s musical talents. After watching the young lady dance and listening to her sing, the great musician replied:

“Congratulations, Madam, your daughter sings magnificently for a ballerina and dances divinely for a singer!”

Paraphrasing this joke, is the modern literature consumer in danger of becoming “a brilliant reader for a writer and a wonderful writer for a reader”?

***********

On September 24 I attended Teresa Dobson’s talk on "The Role of Multimedia Literature in Critical Literary Education", presented as part of the Humanities Computing Colloquium series.

The speaker opened her lecture with a short history of the emergence and development of e-literature, then mentioned some critical works in this new field. She continued by showing two examples of second-generation e-literature, “Cruising” (Ankerson & Sapnar, 2001/2006) and “Girls' Day Out” (Lawrynovicz, 2004). These works, together with other pieces of e-literature, have been gathered in an electronic volume that can be found at http://collection.eliterature.org/1/. The speaker finished the first part of her talk by listing the main characteristics of e-literature: the collaboration of authors, the juxtaposition of layers, bricolage appropriation, its iconoclastic nature, and the involvement of the reader in the form of embodied metaphor.

The last part of Teresa Dobson’s lecture, which actually prompted this blog entry, was dedicated to describing her research project on digital literacy. The first phase of the project is described in detail in the article “In search of a story: Reading and writing e-literature” (Luce-Kapler, R., & Dobson, T.M., 2005). Dobson and her colleague introduced a group of undergraduate English education majors (ages 23 to 40) to several e-literature works, recording their reactions to the unfamiliar form of the literary creations. They also held a workshop that taught the participants the skills necessary for creating similar hyperlink projects, and introduced them to Shelley Jackson’s “Patchwork Girl”. What the authors of the study discovered was that the participants, who generally rejected the electronic piece of literature at the beginning, became much more receptive in the end, after being involved themselves in the creation of a similar product. From this, the two project coordinators concluded that a rethinking of the reading-writing relationship is necessary in the context of hypertextual literature.

The second phase of their project is described in the article “For the love of a good narrative: Digitality and textuality” (Dobson, T.M., 2006). Dobson prompted the participants in this phase of her project to write a narrative based on the first five paragraphs of Alice Munro’s “The Love of a Good Woman”, a work they were unacquainted with. After that, Munro’s narrative was introduced to them, and the author of the study concluded that the pre-reading exercise better prepared them for the complex structure of the work at hand and that, by extrapolation, writing prepares the participant for reading e-literature.

***********

Teresa Dobson’s hypothesis is undeniably well developed and strongly supported by both theoretical arguments and carefully designed case studies. However, it raises the following dilemma:

The participants in both her studies were English majors or graduates, who presumably had extensive training in critical reading, and perhaps some in creative writing as well, at the time of the studies. If we reasonably admit that hyperlink literature is not written for the sole purpose of being read by literature “professionals”, what relevance do these studies have for “regular” readers (people with a high level of education in fields other than English who enjoy reading new fiction for the pleasure of intellectual stimulation)?

If we were to repeat the studies with participants selected from the above-mentioned category, would we get the same results?

In my opinion, “regular” readers would probably be more willing to accept unusual forms of literary creation, since their minds are less set in certain patterns of expectation than those of English students or graduates.

I also believe that the method of “learning to read through writing” is a double-edged sword for most of us who have little training in creative writing and who, in most cases, lack the skills or “the spark” for writing decent fiction.

Going back to our little joke at the beginning: bad writing is less obvious than bad music, and I speculate that scaling up the method of familiarizing readers with e-literature through writing exercises would probably risk producing three types of results:

· Some participants would be discouraged by the results of their literary experiments and would presumably be pulled away from e-literature by the memory of their own failure.

· Some would become more accepting of e-literature, but only out of an ill-conceived notion of solidarity (from one e-literature author to another).

· Very few participants would go through the process and come close to the same results described in Teresa Dobson’s lecture.

It is my opinion that e-literature and its pool of readers are going through an organic expansion and evolution, as was the case with other literary genres, like science-fiction. As long as e-literature writers continue to “boldly go where no author has gone before” the numbers of their readers, admirers and imitators will grow, with no particular need of external help in this matter.

References:

Luce-Kapler, R., & Dobson, T. (2005, May/June). In search of a story: Reading and writing e-literature. Reading Online, 8(6). Available: http://www.readingonline.org/articles/art_index.asp?HREF=luce-kapler/index.html

Dobson, T.M. (2006). For the love of a good narrative: Digitality and textuality. English Teaching: Practice and Critique, 5(2), 56–68. Available: http://education.waikato.ac.nz/research/journal/index.php?id=1

Sunday, November 22, 2009

LAW AND disORDER IN INTERNET REGULATION

On September 23, I attended a research colloquium organized by The School of Library and Information Studies. The talk, entitled “Social Networking on the Internet & the Law”, was given by Cameron Hutchison who is an Assistant Professor at the Faculty of Law at U of A.

The title of the lecture implied a focus on ways of regulating the activity of social networking sites like Facebook or MySpace, but the actual lecture encompassed a broader range of subjects, such as cyberbullying, copyright law, jurisdiction on the Internet, and privacy.

The speaker opened his talk with a brief review of the Lori Drew case and used it to argue in favour of legislative measures to cover the legal “terra incognita” in which most Internet-based activities take place. He then turned to the legal act that currently governs the Internet in Canada, the Personal Information Protection and Electronic Documents Act (PIPEDA), and its failure to foresee and provide an adequate legal response to the variety of Internet infractions that commonly occur.

The speaker continued by listing some solutions that could close this legal gap. The most accessible is appealing to common law when statutory law on the particular matter being judged is missing or incomplete. The disadvantage of this method, in Hutchison’s opinion, is that precedent decisions are not always relevant to new Internet-based infractions. The more effective solution would be the adoption of international statutes, which he argued would work best if they were created with a dynamic, purposive interpretation of the law in mind. In other words, international Internet legislation should be worded broadly enough to allow judges and other ruling forums to abide by the spirit more than by the letter of the law in their decision-making.

Toward the end of his lecture, the speaker offered the example of the dispute over whether digital downloads should be taxed, and raised the question of what happens to someone’s online intellectual property in the case of death, to support his argument that law can be adapted to the Internet on two conditions: that the language of the document is broad enough, and that the intent of a legal decision is to conform to the purpose behind the rule more than to the form of the legal document that contains it.

************

The reason I attended this lecture is that the subject of Internet-related crimes is increasingly present in the media. From relatively minor infringements of rights, like the recent Kindle copyright scandal, to more serious ones, like Facebook’s compliance, or lack thereof, with Canadian privacy law, the need for comprehensive Internet legislation becomes clearer every day.

I have never seen this more clearly than two weeks after attending the lecture, when I watched a CBC documentary about Nadia Kajouji, a Canadian student who killed herself at the instigation of William Melchert-Dinkel, an American middle-aged nurse and family man. The man in question is now suspected of having encouraged eight people to commit suicide, and the legal complications of the case, such as the question of jurisdiction, seem endless at this point.

Stories like this one make the Lori Drew case look like an ill-fated child’s prank and raise some very disturbing questions: Is the Internet today’s Wild West where law enforcement is concerned? Do we have any viable methods of preventing stories like Nadia’s from recurring? How can we control the currents and flows in the ocean of information that the Internet has become?

I am confident that solutions will be found eventually – after all, necessity is the mother of invention. However, it remains to be seen if it will be soon enough for all.

References:

Steinhauer, J. (2008, November 27). Verdict in MySpace Suicide Case. New York Times, A25. Available: http://www.nytimes.com/2008/11/27/us/27myspace.html?_r=2&hp

Picker, R. (2009, July 18). The Kindle Fiasco? The University of Chicago Law School Faculty Blog. Retrieved November 25, 2009, from http://uchicagolaw.typepad.com/faculty/2009/07/the-kindle-fiasco.html

Sites of interest for the topic:

http://www.pipedainfo.com/ (PIPEDA)

http://www.priv.gc.ca/media/nr-c/2009/nr-c_090716_e.cfm (Facebook’s clash with the Privacy Commissioner of Canada)

http://www.cbc.ca/fifth/2009-2010/death_online/ (suicide of Nadia Kajouji)