Course overview and computer set-up

  1. Goals and approach
  2. Requirements
    1. Regular exercises
    2. Project problem
    3. A note on hardware
  3. Tools
    1. Freedom
    2. This is not a programming course
  4. Testing your set-up
    1. R
    2. Python and NLTK

Goals and approach

This course is called "Computational Pragmatics", but it might be more accurately called "Doing pragmatics with computational resources".

The phrase "computational pragmatics" probably calls to mind dialogue systems and intelligent agents. While I think our explorations could inform research in those areas, they won't be our focus. Rather, we will concentrate on using computational resources (corpora, algorithms, etc.) to explore pragmatic phenomena.

The current plan goes roughly like this:

  1. Conversational implicature, using a new corpus of question–answer pairs
  2. Clause-typing and illocutionary force, using the Switchboard Dialog Act Corpus (SwDA)
  3. Meanings for discourse particles, using the SwDA
  4. Discourse coherence, using the Penn Discourse TreeBank 2.0 (PDTB)
  5. Attribution and speaker commitment, using FactBank (with a pragmatic extension) and the PDTB

I am not sure what our pace will be, or whether we will get diverted onto other topics, but this list should give you a sense of the kinds of things I have in mind to cover, as well as the resources available to you as part of this course.

Requirements

Regular exercises

Each Friday, I'll pick one or two of the exercise sections, which are at the bottoms of all the content pages. On the following Tuesday, you should submit, by email, answers to at least two of those exercises. Most involve coding of some kind, but a handful from each section can be done without a computer.

Project problem

On August 7, you should turn in a report on one of the project problems. These are more open-ended and involved than the regular exercises. I think any of them could be developed into something publishable. You're also welcome to define your own project.

A note on hardware

This is a hands-on, experiment-driven course in computational pragmatics. Ideally, you will attend class with a laptop that is set up with R, Python, and NLTK, as described below.

If you don't have a laptop with you, I hope you can get regular access to a computer at the Institute with these tools on it, so that you can play around with the ideas outside of the classroom.

If you really have no computer access while here, you're still welcome to stay enrolled. In this case, talk to me during office hours or drop me a note by email. We might want to make special arrangements for the exercises and project problem.

Tools

If you have access to a computer while here, you should install the following:

  1. R
  2. NLTK and Python
    1. Python is the main language. It's worth downloading the version linked from the NLTK page, because the NLTK developers have picked a distribution that plays well with NLTK.
    2. Alternatively, the Python installation you already have should be fine, as long as its version is at least 2.5 and no higher than 2.6.6.
    3. NLTK requires you to install PyYAML in addition to Python.
    4. Our work will call on NumPy and SciPy. At present, the newest version of SciPy has some incompatibilities with NLTK, so I advise installing NumPy 1.5.1 and SciPy 0.8.0. I've tested this combination against the relevant code for this class, and it seems to be fine. (The snippet after this list shows one way to confirm which versions you ended up with.)
    5. matplotlib is easy to install and very useful, so you might as well go ahead with that.
    6. Feel free to try installing the other packages recommended by the NLTK group, but don't worry if any of them fail.
  3. NLTK data: A separate install, but crucial!
  4. The dateutil Python library: for parsing strings into datetime objects
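
Once everything is installed, it's worth a quick sanity check. Here is a minimal sketch of one; it assumes that each package exposes a __version__ attribute, which is true of recent releases:

  #!/usr/bin/env python

  # Set-up check: import everything and print version numbers.
  import nltk
  import yaml
  import numpy
  import scipy
  import matplotlib
  import dateutil  # if this import succeeds, dateutil is in place

  print 'NLTK:      ', nltk.__version__
  print 'PyYAML:    ', yaml.__version__
  print 'NumPy:     ', numpy.__version__   # recommended: 1.5.1
  print 'SciPy:     ', scipy.__version__   # recommended: 0.8.0
  print 'matplotlib:', matplotlib.__version__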

Freedom

I'm going to be using R and Python extensively, but the goal of the course is to study linguistic phenomena, not programming. You should feel free to use other programming tools. For example, if you are a whiz with the likes of Excel, MySQL, or SPSS, then you might prefer to stick with that instead of using R. Similarly, Perl, Ruby, Java, Scheme — these are all great for computational linguistics. Use whatever will get you as swiftly as possible to doing analysis.

My main reason for using Python: it's currently the optimal mix of (i) intuitive, (ii) efficient, and (iii) well-supported by scientific libraries.

I rely on R for statistical analysis and visualization. This can be done in Python too, of course, but I think R is a more natural choice for these things.

This is not a programming course

I'm assuming that you have some programming experience. I'm not going to give explicit instruction in R or Python. Rather, we will just dive right in. The lectures and exercises will acquaint you with a wide range of tools and techniques, especially if you're willing to use the Web and documentation to fill in gaps. I'm also happy to answer programming questions during my office hours.

To prep for this, you might spend some time with the official tutorials and documentation for R and Python.
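
As a rough gauge of the level I'm assuming, here is a small Python snippet of my own (not course code); if you can predict what it prints, you should be fine:

  # Count word frequencies, then print them from most to least frequent:
  words = 'the cat saw the dog and the dog saw the cat'.split()
  counts = {}
  for word in words:
      counts[word] = counts.get(word, 0) + 1
  for word in sorted(counts, key=counts.get, reverse=True):
      print word, counts[word]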

Testing your set-up

R

R has an excellent graphical interface that will launch when you start it like any other program.

When you launch it, you'll be in interactive mode. Try some basic mathematical expressions to warm up. Then paste the following code into the buffer and hit Enter:

  > ratings = seq(1,10)
  > # Check out the ratings vector.
  > ratings
   [1]  1  2  3  4  5  6  7  8  9 10
  > counts = c(1324, 604, 783, 881, 1404, 2031, 3800, 6468, 7484, 21142)
  > totals = c(25395214, 11755132, 13995838, 14963866, 20390515, 27420036, 40192077, 48723444, 40277743, 73948447)
  > relfreq = counts/totals
  > plot(ratings, relfreq, main="Relative frequency of 'awesome' in IMDB")

(The above is how I will display interactive R code, with the greater-than prompt, code in blue, comments in maroon, and output in black. This reflects the default style for the graphical interface.)

This should pop up on your screen:

[figures/intro-r-awesome.png: relative frequency of 'awesome' plotted against IMDB rating]


You can also put your R code into a separate file. Here is some code that you could paste into a text editor (you can use R's built-in editor by selecting File > New Document):

  ## A basic plotting function for relative frequencies.
  ## Args:
  ##   xvals: the values for the x-axis
  ##   counts: numerator in the relative frequency
  ##   totals: denominator in the relative frequency
  ##   main: title value (default: no title)
  PlotRelFreq = function(xvals, counts, totals, main='') {
    plot(xvals, counts/totals, main=main)
  }

(This is how I will display R code presumed to be written and saved in a separate file — same style as in interactive mode, but with no > prompt.)

Save the file (I picked the name basic_plots.R). Now, back at the interactive prompt, you can load and use this function:

  > source('basic_plots.R')
  > PlotRelFreq(ratings, counts, totals, main="Relative frequency of 'awesome' in IMDB")
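
Before we leave R: if you'd like to exercise your matplotlib install on the same data, here is a rough Python sketch of PlotRelFreq (my own translation, assuming matplotlib is installed as described in the Tools section):

  import matplotlib.pyplot as plt

  # Rough Python analogue of the R function PlotRelFreq:
  def plot_rel_freq(xvals, counts, totals, main=''):
      # Elementwise relative frequencies; float() avoids integer division:
      relfreqs = [c / float(t) for c, t in zip(counts, totals)]
      plt.plot(xvals, relfreqs, 'o')
      plt.title(main)
      plt.show()

  ratings = range(1, 11)
  counts = [1324, 604, 783, 881, 1404, 2031, 3800, 6468, 7484, 21142]
  totals = [25395214, 11755132, 13995838, 14963866, 20390515, 27420036,
            40192077, 48723444, 40277743, 73948447]
  plot_rel_freq(ratings, counts, totals,
                main="Relative frequency of 'awesome' in IMDB")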

Python and NLTK

For absolute beginners (people who haven't used a programming language at all before, or who have done so only in an environment that someone else set up and maintained), I suggest following the tips at the NLTK Getting Started page.

All (or nearly all) of the NLTK modules have demos. These provide an easy way to make sure that everything is installed correctly. Here's a sequence of commands that use modules we'll rely on throughout the course; start the Python interpreter and try pasting them in:

  >>> # WordNet demo --- displays lots of Synsets with descriptions:
  >>> from nltk.corpus.reader import wordnet
  >>> wordnet.demo()
  >>> # Classifier demo --- trains a classifier for boy vs. girl names:
  >>> from nltk.classify import maxent
  >>> maxent.demo()
  >>> # Tree demo --- displays trees with descriptions:
  >>> from nltk import tree
  >>> tree.demo()
  >>> # K-means clustering demo --- two small examples:
  >>> from nltk.cluster import kmeans
  >>> kmeans.demo()
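
If a demo dies complaining about missing data, the likely culprit is the separate NLTK data install mentioned above. Here's a quick check; nltk.data.find raises a LookupError, with download instructions, when it can't locate a resource:

  >>> import nltk
  >>> # Returns a path if the WordNet data is installed; raises LookupError otherwise:
  >>> nltk.data.find('corpora/wordnet')
  >>> # nltk.download() opens the data installer if anything is missing.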

Now that the demos have all worked beautifully(?), let's try some code of our own.

Start the Python interpreter and load the NLTK WordNet lemmatizer:

  >>> from nltk.stem import WordNetLemmatizer

Instantiate the lemmatizer for later use:

  >>> wnl = WordNetLemmatizer()

Test the lemmatizer. (The first argument is the string, and the second is a POS tag — values: a, n, r, v.)

  >>> wnl.lemmatize('dogs', 'n')
  'dog'
  >>> wnl.lemmatize('helpless', 'a')
  'helpless'
  >>> wnl.lemmatize('assisted', 'a')
  'assisted'
  >>> wnl.lemmatize('assisted', 'v')
  'assist'
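
One wrinkle worth knowing, since it matters for the function below: if you omit the POS argument, it defaults to 'n' (noun), so verb and adjective forms can go unanalyzed:

  >>> wnl.lemmatize('assisted')    # no POS tag given: defaults to 'n'
  'assisted'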

To generalize this code a bit, open up an empty file in a text editor and paste in this code:

  #!/usr/bin/env python

  import re
  from nltk.stem import WordNetLemmatizer

  def wn_string_lemmatizer(s):
      """
      WordNet lemmatizer for strings.

      Argument:
      s -- a string of word/tag pairs, separated by spaces. If a word is
           missing a tag, or if its tag is not one of the WordNet pos values
           (a, n, r, v), then its tag is ignored. (It seems that the
           lemmatizer does much less in such cases.)

      Output:
      lemmatized (list) -- the lemmatized strings (no tags)
      """
      # Instantiate the lemmatizer:
      wnl = WordNetLemmatizer()
      # Split on whitespace to create a list of word/tag strings:
      lemma_strs = re.split(r'\s+', s)
      # The output list:
      lemmatized = []
      # Now loop through the word/tag strings, trying to lemmatize them:
      for sl in lemma_strs:
          word = ''
          tag = None
          try:  # If there is no slash divider,
              word, tag = re.split(r'/', sl)
              tag = tag.lower()
          except ValueError:  # treat the whole unit as a word.
              word = sl
          # Make sure the tag is WordNet-kosher ('s' marks satellite adjectives):
          if tag in ('a', 'n', 's', 'r', 'v'):
              lemmatized.append(wnl.lemmatize(word, tag))
          else:
              lemmatized.append(wnl.lemmatize(word))
      return lemmatized

If you save this in a file (mine is in wordnet_functions.py), then you can import it into the interpreter and use it:

  >>> from wordnet_functions import wn_string_lemmatizer
  >>> pos_sentence = 'Sam/NNP was/V happily/R watching/v movies/n while his friends/n did/v the dishes/n'
  >>> wn_string_lemmatizer(pos_sentence)
  ['Sam', 'be', 'happily', 'watch', 'movie', 'while', 'his', 'friend', 'do', 'the', 'dish']
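
Notice that tags outside the WordNet set (like NNP above) and untagged words (like 'while' and 'his') fall through to the noun default. If you'll be working with Penn Treebank tags, one natural extension (a hypothetical helper of my own, not part of the course code) is to map Treebank tags onto WordNet pos values before lemmatizing:

  def ptb_to_wn_tag(ptb_tag):
      """Map a Penn Treebank tag onto a WordNet pos value, else None."""
      for prefix, wn_tag in (('JJ', 'a'), ('NN', 'n'), ('RB', 'r'), ('VB', 'v')):
          if ptb_tag.upper().startswith(prefix):
              return wn_tag
      return None

For example, wnl.lemmatize('watching', ptb_to_wn_tag('VBG')) returns 'watch', where the noun default would leave the word untouched.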

If all of the above worked out for you, then we can be pretty sure that you have Python, NLTK, and the NLTK data installed properly. If something went wrong, try to interpret the error messages you get back to determine what the problem is, and feel free to come to my office hours for trouble-shooting, debugging, and other technical woes.