From: Christopher Potts
Date: July 16, 2011 09:54:05 MDT
To: Class list
Subject: Computational Pragmatics: continuing the review data explorations

Computational Pragmaticists!

I enjoyed class quite a lot yesterday, but it felt a bit chaotic near the end, with the projector going out just before I was going to explain what all the lines meant ... The bright side is that it gave you all a chance to explore the data, which is essential to understanding large data sets like this. If you want to continue the exploration, here's what I recommend:

1. If you haven't already, download these files and place them in the same directory:

   http://compprag.christopherpotts.net/code-data/review_functions.R
   http://compprag.christopherpotts.net/code-data/imdb-words.csv.zip

   Unzip the second. (Probably this just means double-clicking the file.)

2. Start R and navigate to the directory containing the above files. This can be done from the pull-down menus (Apple-D on the Mac), and purists can use setwd(dirname), where dirname is the full path of the directory you want.

3. Install plyr if you haven't already. (I recommend using Packages & Data > Package Installer.)

4. Enter these commands:

   source('review_functions.R')
   imdb = read.csv('imdb-words.csv')

5. Now you can explore via visualization with commands like this:

   WordDisplay(imdb, 'memorable', 'a')

The main page I was going through today gives all the details on what this function is displaying. A breakdown:

* Black line with dots: the empirical distribution Pr(rating|word), as described and derived here: http://compprag.christopherpotts.net/reviews.html#pr
* Red vertical line: the expected rating, a weighted average of the ratings based on the probability distribution: http://compprag.christopherpotts.net/reviews.html#er
* Blue line: the fitted values of a logistic regression model, as described here: http://compprag.christopherpotts.net/reviews.html#logit

That's it!
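In case it helps to see the red line's computation concretely, here is a minimal sketch of an expected rating as a weighted average. The numbers below are made up for illustration; the real distributions come from imdb-words.csv via the derivation on the reviews page.

```r
# Hypothetical distribution Pr(rating | word) over the ten IMDB star ratings.
# (Invented values for illustration only -- not from the actual data.)
ratings <- 1:10
probs <- c(0.02, 0.03, 0.04, 0.06, 0.08, 0.10, 0.15, 0.20, 0.18, 0.14)
stopifnot(isTRUE(all.equal(sum(probs), 1)))  # must be a probability distribution

# Expected rating: each rating weighted by its probability, then summed.
er <- sum(ratings * probs)
er
```

The same quantity could also be computed with the built-in weighted.mean(ratings, probs).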
The connection with the IQAP data is described here:

http://compprag.christopherpotts.net/iqap-experiments.html#oneword_review

We won't have time to go through the experiments described on that page, because I want to move on to the Switchboard Dialog Act corpus. Suffice it to say that I think I've set up two useful comparison points if you decide to do your own experiments --- one deterministic approach that depends on the linguist to synthesize and weight the data, and a probabilistic (MaxEnt) model that purports to handle that weighting for us via training.

Finally, the raw data for building lexical scales are in this CSV file, which can be read into R and studied on its own --- such a quantitative look could be useful after you've worked with WordDisplay for a little while.

http://compprag.christopherpotts.net/code-data/imdb-words-assess.csv.zip

---Chris

P.S. Links to similar data:

# More varied data --- not just star ratings, but other categories too, from other corpora:
http://www.stanford.edu/~cgpotts/data/salt20/potts-salt20-data-and-code.zip

# Bigrams with POS tags:
http://compprag.christopherpotts.net/code-data/imdb-bigrams.csv.zip

# Cross-linguistic data:
http://semanticsarchive.net/Archive/jQ0ZGZiM/readme.html

# 50,000 review subset of IMDB (flat text files):
http://ai.stanford.edu/~amaas/data/aclImdb_v1.tar.gz
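P.P.S. If you want a quick quantitative first look at the word-level data before diving into the assess file, something like the following works (a sketch; I haven't assumed anything about the column names, so check the header first):

```r
# A first quantitative pass over the word-level review data.
# NOTE: inspect names(imdb) for the actual column names before
# building anything on top of them.
imdb <- read.csv('imdb-words.csv')
str(imdb)        # column names and types
summary(imdb)    # per-column distributional summaries
head(imdb, 10)   # first few rows
```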