This section is less assessment oriented than the previous ones have been. I want to instead map out a general investigative strategy that is only feasible for data-intensive, computational approaches like the ones we've been exploring.
The guiding empirical idea is that the meanings for discourse particles, very broadly construed, are best given as use conditions, rather than as denotations that depend on a truth-functional foundation. That is, whereas it might be fruitful to analyze a term like cat as the (world-relative set of cats), this idea stumbles with things like interjection-well, which makes no claims, but rather functions to manage the information flow in complex ways.
The technique I propose is an example of vector-space semantics. It makes crucial use of the dialog act tags in the SwDA. It doesn't get us down to specific meanings, but it moves us in that direction, by exposing abstract usage patterns that are reflective of those usage conditions.
Associated reading:
Code and data:
The main class in swda_wordtag_clusterer.py is SwdaWordTagClusterer, which makes use of the NLTK clustering module. Here is the instance that I make use of for most of this section:
Calling clust.kmeans() then does the following:
The clutering is repeated 10 times because the clustering method is not guaranteed to find a globally optimal clustering. The final output is the most frequently seen set of clusters.
The next few subsections explain this pocedure in more detail.
The kmeans() method starts by building a count dictionary using the method build_count_dictionary, which constructs a mapping word → DAMSL-tag → count capturing the number of times that (word, pos) appears in an utterance marked DAMSL-tag, where pos is in cats. This count dictionary is an intermediate step towards the matix we need.
The method build_matrix() maps the count dictionary to an n x m matrix, where n is the size of the vocabulary and m is the number of DAMSL tags. The cells in this matrix are filled with the final count values from the count dictionary.
Intuitively, the rows represent words and the columns represent tags. Table MATRIX gives the upper left corner of this matrix.
| word | % | + | ^2 | ^g | ^h | ^q | aa | ... | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | absolutely | 0 | 2 | 0 | 0 | 0 | 0 | 95 | ... | 
| 2 | actually | 17 | 12 | 0 | 0 | 1 | 0 | 4 | ... | 
| 3 | anyway | 23 | 14 | 0 | 0 | 0 | 0 | 0 | ... | 
| 4 | boy | 5 | 3 | 1 | 0 | 5 | 2 | 1 | ... | 
| 5 | bye | 0 | 1 | 0 | 0 | 0 | 0 | 0 | ... | 
| 6 | bye-bye | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 
| 7 | dear | 0 | 0 | 0 | 0 | 1 | 0 | 0 | ... | 
| 8 | definitely | 0 | 2 | 0 | 0 | 0 | 0 | 56 | ... | 
| 9 | exactly | 2 | 6 | 1 | 0 | 0 | 0 | 294 | ... | 
| 10 | gee | 0 | 3 | 0 | 0 | 2 | 1 | 1 | ... | 
| 11 | god | 0 | 2 | 0 | 0 | 1 | 0 | 0 | ... | 
| 12 | golly | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 
| 13 | good | 1 | 0 | 0 | 0 | 0 | 0 | 2 | ... | 
| 14 | good-bye | 0 | 2 | 1 | 0 | 0 | 2 | 0 | ... | 
| 15 | goodness | 1 | 0 | 0 | 0 | 2 | 0 | 0 | ... | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
Figure COUNTS depicts the distances between these count vectors from the initial vector absolutely (which I chose more or less arbitrarily for the sake of illustration).
The distance measure is euclidean distance. Here is the Python code from nltk.cluster.util.euclidean_distance():
It's of course very hard to visualize this in 44 dimensions (the length of our vectors), but it's easy in two and even three dimensions:
The distance between raw count vectors is very heavily dependent upon the sum of the counts in those vectors. For example:
The first pair of vectors are similar in the sense that their totals are distributed in the same way. The second pair is very different in this regard.
This is not what we want from semantic word clusters; overall frequency is not a good predictor of meaning or usage conditions. The relevant notion of similarity for us is distribution with respect to the tags.
Thus, when initializing SwdaWordTagClusterer cluster instances, we call the method length_normalize_matrix, which rescales each row of the matrix by dividing each of its elements by its length (magnitude).
(This step can also be done with normalise=False as one of the arguments to NLTK's cluster.KMeansClusterer, but I decided to do this to the matrix beforehand, to make it easier to study its effects.)
Table NORMMATRIX is the length-normalized version of table MATRIX.
| word | % | + | ^2 | ^g | ^h | ^q | aa | ... | |
|---|---|---|---|---|---|---|---|---|---|
| 1 | absolutely | 0.00 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 0.99 | ... | 
| 2 | actually | 0.09 | 0.06 | 0.00 | 0.00 | 0.01 | 0.00 | 0.02 | ... | 
| 3 | anyway | 0.34 | 0.21 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 
| 4 | boy | 0.04 | 0.03 | 0.01 | 0.00 | 0.04 | 0.02 | 0.01 | ... | 
| 5 | bye | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 
| 6 | bye-bye | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 
| 7 | dear | 0.00 | 0.00 | 0.00 | 0.00 | 0.03 | 0.00 | 0.00 | ... | 
| 8 | definitely | 0.00 | 0.04 | 0.00 | 0.00 | 0.00 | 0.00 | 0.98 | ... | 
| 9 | exactly | 0.01 | 0.02 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 | ... | 
| 10 | gee | 0.00 | 0.05 | 0.00 | 0.00 | 0.03 | 0.02 | 0.02 | ... | 
| 11 | god | 0.00 | 0.06 | 0.00 | 0.00 | 0.03 | 0.00 | 0.00 | ... | 
| 12 | golly | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | ... | 
| 13 | good | 0.01 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02 | ... | 
| 14 | good-bye | 0.00 | 0.06 | 0.03 | 0.00 | 0.00 | 0.06 | 0.00 | ... | 
| 15 | goodness | 0.02 | 0.00 | 0.00 | 0.00 | 0.05 | 0.00 | 0.00 | ... | 
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | 
Figure NORMED depicts the normed distances from absolutely (cf. figure COUNTS).
The impact of normalization is dramatic.
First, two words that should be close:
And two words that ought not to be close:
This section gives a brief overview of how k-means clustering works. I don't devote too much time to this because I actually think that k-means is not the right approach to this kind of modeling — I am using it as a first step because it is conceptually and computationally simple introduction to using clustering for pragmatic analysis.
The goal of k-means clustering is to group items, qua vectors, into k clusters, where k is a prespecified integer value. The algorithm works by randomly picking k mean values, assigning every item to the closes of those means, and then recalcualting the means for those new clusters. This process repeats iteratively until the mean values stop changing.
The Wikipedia page for the algorithm tells the story both in math and in pictures. See also Manning and Schütze 1999: §14. Figure KMEANS shows Wikipedia's examples for a particular run with randomly chosen initial values (which is what NLTK does; the Wikipedia example begins with pre-selected ones).
The SwdaWordTagClusterer method kmeans() calls the NLTK clustering algorithm using the following code:
This instructs the clusterer to use the user's value for the number of means, distance measure, and number of repeats. normalise=False because we normalize the matrix ourselves. self.mat is our matrix. assign_clusters=True does the actual clustering. trace=False supresses the printing of a little bit of progress reporting.
For more on how to work with the NLTK interface, check out their demo, which you can run with from nltk.cluster import kmeans; kmeans.demo(), and also the class documentation.
[uh] with a threshold of 20, euclidean distance:
0: [bye, bye-bye, good-bye, hello, hi, thanks] 1: [dear, golly, good, goodness, gosh, great, my, ooh, ugh, wow] 2: [absolutely, definitely, exactly, huh, huh-uh, no, oh, okay, really, right, sure, true, uh-huh, ye-, yeah, yep, yes] 3: [actually, anyway, hey, like, now, say, see, so, uh, um, well] 4: [boy, gee, god, jeez, man, shoot]
Assessment: Remarkably good; the dialog act tags capture something important about how these items are used.
['prp', 'prp$', 'wp', 'wp$'], with a threshold of 20, euclidean distance:
0: [her, hers, herself, mine, my, myself, ours, she] 1: [he, him, himself, his, i, me, our, theirs, w-, we] 2: [it, ourselves, them, they, us, what, whatever, who, whoever, whose] 3: [i-, its, itself, one, th-, their, themselves, y-, you, your, yourself] 4: ['s, wh, wha, yo-, yours]
Assessment: Pretty good; pronouns are a mix of dialog-act relevant things (who, whatever) and things that are largely independent of dialog act.
[in] with a threshold of 20, euclidean distance:
0: [across, around, at, before, behind, course, down, during, except, inside, outside, prior, since, through, till, underneath, up, while] 1: [about, above, after, along, although, because, between, by, cause, due, for, from, in, into, like, of, off, on, out, over, past, per, so, the, to, under, until, with, within] 2: [instead, onto, throughout] 3: [among, once, rather, such, than, towards, unless, upon, versus, whereas] 4: [against, as, besides, beyond, but, i-, if, that, though, toward, whether, without]
Assessment: Very bad; the dialog act tags seem not to have any interesting relationship to preposition usage. I think this is what we expect given that prepositions are not typically discourse oriented in a sense reflected in the tags.
The results of k-means clustering seem promising overall, though I think the approach is not ideal. Some criticisms:
WORDCMP The following code builds a word–tag matrix, stores the matrix in the numpy.array variable mat, and stores the vocabulary in the list variable vocab:
Row i of mat corresponds to string i of vocab.
Write a function with the following behavior:
OTHERSPick some other subsets of Penn Treebank 3 Switchboard tags and cluster them, using different values for num_means. What, if anything, do you see in the results?
LSA The NLTK k-means clustering interface has an option for performing singular value decomposition (SVD), the heart of Latent Semantic Analysis, on the matrix. Modify the kmeans() method of swda_wordtag_clusterer.SwdaWordTagClusterer so that the user can specify whether or not to apply svd. Compare the kmeans output from the results section above with the results using SVD.
ALGORITHMS NLTK provides a number of other vector-space clustering alogrithms: