Analysis: Clustering words by tags in the SwDA

  1. Overview
  2. Word vectors
    1. Count dictionary
    2. Word–tag matrix
    3. Length normalization
  3. K-means clustering
    1. Background on k-means clustering
    2. Experiment runs
      1. Interjections
      2. Pronouns
      3. Prepositions
    3. Discussion
  4. Exercises

Overview

This section is less assessment oriented than the previous ones have been. I want to instead map out a general investigative strategy that is only feasible for data-intensive, computational approaches like the ones we've been exploring.

The guiding empirical idea is that the meanings of discourse particles, very broadly construed, are best given as use conditions, rather than as denotations that depend on a truth-functional foundation. That is, whereas it might be fruitful to analyze a term like cat as the (world-relative) set of cats, this idea stumbles with things like the interjection well, which makes no claims, but rather functions to manage the flow of information in complex ways.

The technique I propose is an example of vector-space semantics. It makes crucial use of the dialog act tags in the SwDA. It doesn't get us down to specific meanings, but it moves us in that direction, by exposing abstract usage patterns that are reflective of those usage conditions.

Associated reading:

Code and data: swda_wordtag_clusterer.py and the SwDA corpus ('swda').

Word vectors

The main class in swda_wordtag_clusterer.py is SwdaWordTagClusterer, which makes use of the NLTK clustering module. Here is the instance that I make use of for most of this section:

  corpus = CorpusReader('swda')  # The SwDA corpus.
  cats = ['uh']                  # A set of POS tags.
  clust = SwdaWordTagClusterer(
      cats,
      corpus,
      count_threshold=20,  # Remove words with fewer than this many tokens.
      num_means=5,         # Number of word clusters to build.
      repeats=10,          # Number of times to repeat the clustering with random initial means.
      distance_measure=nltk.cluster.util.euclidean_distance)  # Distance measure.

Calling clust.kmeans() then does the following:

  1. Builds a word–tag matrix for all the words that appear at least 20 times with one of the POS tags in cats.
  2. Applies k-means clustering to that matrix, forming 5 clusters and seeking to minimize the mean Euclidean distance of points within those clusters.

The clustering is repeated 10 times because the clustering method is not guaranteed to find a globally optimal clustering. The final output is the most frequently seen set of clusters.

The next few subsections explain this procedure in more detail.

Count dictionary

The kmeans() method starts by building a count dictionary using the method build_count_dictionary, which constructs a mapping word → DAMSL-tag → count capturing the number of times that (word, pos) appears in an utterance marked DAMSL-tag, where pos is one of the tags in cats. This count dictionary is an intermediate step towards the matrix we need.
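
Here is a minimal sketch of this step. The helper iter_tagged_words, which yields (word, pos, damsl_tag) triples from the corpus, is hypothetical; the actual method works off the SwDA CorpusReader directly.

  from collections import defaultdict

  def build_count_dictionary(cats, iter_tagged_words):
      # Map word -> DAMSL tag -> count, restricted to the POS tags in cats.
      counts = defaultdict(lambda: defaultdict(int))
      for word, pos, damsl_tag in iter_tagged_words():
          if pos in cats:
              counts[word][damsl_tag] += 1
      return counts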

Word–tag matrix

The method build_matrix() maps the count dictionary to an n x m matrix, where n is the size of the vocabulary and m is the number of DAMSL tags. The cells in this matrix are filled with the final count values from the count dictionary.
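
A minimal sketch of this mapping, assuming the count dictionary sketched above and a fixed ordering of the DAMSL tags:

  import numpy

  def build_matrix(counts, tags):
      # Rows are words (in sorted vocabulary order); columns are DAMSL tags.
      vocab = sorted(counts.keys())
      mat = numpy.zeros((len(vocab), len(tags)))
      for i, word in enumerate(vocab):
          for j, tag in enumerate(tags):
              mat[i, j] = counts[word].get(tag, 0)
      return vocab, mat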

Intuitively, the rows represent words and the columns represent tags. Table MATRIX gives the upper left corner of this matrix.

  word          %    +   ^2   ^g   ^h   ^q    aa  ...
  absolutely    0    2    0    0    0    0    95  ...
  actually     17   12    0    0    1    0     4  ...
  anyway       23   14    0    0    0    0     0  ...
  boy           5    3    1    0    5    2     1  ...
  bye           0    1    0    0    0    0     0  ...
  bye-bye       0    0    0    0    0    0     0  ...
  dear          0    0    0    0    1    0     0  ...
  definitely    0    2    0    0    0    0    56  ...
  exactly       2    6    1    0    0    0   294  ...
  gee           0    3    0    0    2    1     1  ...
  god           0    2    0    0    1    0     0  ...
  golly         0    0    0    0    0    0     0  ...
  good          1    0    0    0    0    0     2  ...
  good-bye      0    2    1    0    0    2     0  ...
  goodness      1    0    0    0    2    0     0  ...
  ...         ...  ...  ...  ...  ...  ...   ...  ...
Table MATRIX
The count matrix for interjections.

Figure COUNTS depicts the distances of these count vectors from the vector for absolutely (which I chose more or less arbitrarily for the sake of illustration).

figures/swda/distance-counts.png
Figure COUNTS
Count distances from absolutely.

The distance measure is Euclidean distance. Here is the Python code for nltk.cluster.util.euclidean_distance():

  def euclidean_distance(u, v):
      """
      Returns the euclidean distance between vectors u and v. This is equivalent
      to the length of the vector (u - v).
      """
      diff = u - v
      return math.sqrt(numpy.dot(diff, diff))

It's of course very hard to visualize this in 44 dimensions (the length of our vectors), but it's easy in two and even three dimensions:

  >>> euclidean_distance(numpy.array([0,0]), numpy.array([0,1]))
  1.0
  >>> euclidean_distance(numpy.array([0,0]), numpy.array([1,1]))
  1.4142135623730951
  >>> euclidean_distance(numpy.array([0,0]), numpy.array([1,0]))
  1.0
  >>> euclidean_distance(numpy.array([0,0]), numpy.array([-1,-1]))
  1.4142135623730951
  >>> euclidean_distance(numpy.array([0,0,0]), numpy.array([0,1,0]))
  1.0
  >>> euclidean_distance(numpy.array([0,0,0]), numpy.array([0,1,1]))
  1.4142135623730951

Length normalization

The distance between raw count vectors is very heavily dependent upon the sum of the counts in those vectors. For example:

  >>> euclidean_distance(numpy.array([1,1]), numpy.array([4,4]))
  4.2426406871192848
  >>> euclidean_distance(numpy.array([1,1]), numpy.array([1,2]))
  1.0

The first pair of vectors is similar in the sense that their totals are distributed in the same way across the two dimensions, yet the distance between them is large. The second pair differs in how its totals are distributed, yet the distance between them is small.

This is not what we want from semantic word clusters; overall frequency is not a good predictor of meaning or usage conditions. The relevant notion of similarity for us is distribution with respect to the tags.

Thus, when initializing SwdaWordTagClusterer instances, we call the method length_normalize_matrix, which rescales each row of the matrix by dividing each of its elements by the row's length (magnitude).

  def length_normalization(vec):
      return vec / numpy.sqrt(numpy.dot(vec, vec))
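
For the matrix as a whole, length_normalize_matrix applies this rescaling row by row. Here is a minimal sketch of that step (the loop-based form is mine; the actual implementation may differ):

  import numpy

  def length_normalize_matrix(mat):
      # Rescale each row of mat to unit length, leaving all-zero rows untouched.
      normed = numpy.zeros(mat.shape)
      for i in range(mat.shape[0]):
          norm = numpy.sqrt(numpy.dot(mat[i], mat[i]))
          if norm > 0:
              normed[i] = mat[i] / norm
      return normed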

(This step can also be handled by passing normalise=True to NLTK's cluster.KMeansClusterer, but I decided to normalize the matrix beforehand, to make it easier to study the effects of normalization.)

  >>> euclidean_distance(length_normalization(numpy.array([1,1])), length_normalization(numpy.array([4,4])))
  0.0
  >>> euclidean_distance(length_normalization(numpy.array([1,1])), length_normalization(numpy.array([1,2])))
  0.32036448601393441

Table NORMMATRIX is the length-normalized version of table MATRIX.

  word            %     +    ^2    ^g    ^h    ^q    aa  ...
  absolutely   0.00  0.02  0.00  0.00  0.00  0.00  0.99  ...
  actually     0.09  0.06  0.00  0.00  0.01  0.00  0.02  ...
  anyway       0.34  0.21  0.00  0.00  0.00  0.00  0.00  ...
  boy          0.04  0.03  0.01  0.00  0.04  0.02  0.01  ...
  bye          0.00  0.00  0.00  0.00  0.00  0.00  0.00  ...
  bye-bye      0.00  0.00  0.00  0.00  0.00  0.00  0.00  ...
  dear         0.00  0.00  0.00  0.00  0.03  0.00  0.00  ...
  definitely   0.00  0.04  0.00  0.00  0.00  0.00  0.98  ...
  exactly      0.01  0.02  0.00  0.00  0.00  0.00  1.00  ...
  gee          0.00  0.05  0.00  0.00  0.03  0.02  0.02  ...
  god          0.00  0.06  0.00  0.00  0.03  0.00  0.00  ...
  golly        0.00  0.00  0.00  0.00  0.00  0.00  0.00  ...
  good         0.01  0.00  0.00  0.00  0.00  0.00  0.02  ...
  good-bye     0.00  0.06  0.03  0.00  0.00  0.06  0.00  ...
  goodness     0.02  0.00  0.00  0.00  0.05  0.00  0.00  ...
  ...           ...   ...   ...   ...   ...   ...   ...  ...
Table NORMMATRIX
The length-normalized matrix for interjections.

Figure NORMED depicts the normed distances from absolutely (cf. figure COUNTS).

figures/swda/distance-normed.png
Figure NORMED
Normed distances from absolutely.

The impact of normalization is dramatic.

First, two words that should be close:

  1. yeah (43008 tokens) vs. yep (318 tokens)
    1. Count: 25989.16 (25 words apart)
    2. Normed: 1.117172 (5 words apart; between them: right, sure, huh-uh)

And two words that ought not to be close:

  1. jeez (120 tokens) vs. good-bye (80 tokens)
    1. Count: 50.65570 (13 words apart)
    2. Normed: 1.4293055 (40 words apart)

K-means clustering

Background on k-means clustering

This section gives a brief overview of how k-means clustering works. I don't devote too much time to this because I actually think that k-means is not the right approach to this kind of modeling; I am using it as a first step because it is a conceptually and computationally simple introduction to using clustering for pragmatic analysis.

The goal of k-means clustering is to group items, qua vectors, into k clusters, where k is a prespecified integer value. The algorithm works by randomly picking k mean values, assigning every item to the closest of those means, and then recalculating the means for those new clusters. This process repeats iteratively until the mean values stop changing.
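
A minimal sketch of the algorithm in numpy, assuming Euclidean distance; this is an illustration of the idea, not the NLTK implementation:

  import numpy

  def kmeans(points, k, max_iters=100):
      # points: an (n, d) array of n items represented as d-dimensional vectors.
      # Randomly pick k of the points as the initial means.
      indices = numpy.random.choice(len(points), k, replace=False)
      means = points[indices].copy()
      for _ in range(max_iters):
          # Assign every item to the closest of the current means.
          dists = numpy.array([[numpy.linalg.norm(p - m) for m in means] for p in points])
          assignments = dists.argmin(axis=1)
          # Recalculate the mean of each new cluster (keep the old mean if a cluster is empty).
          new_means = numpy.array([points[assignments == j].mean(axis=0)
                                   if numpy.any(assignments == j) else means[j]
                                   for j in range(k)])
          # Stop once the mean values stop changing.
          if numpy.allclose(new_means, means):
              break
          means = new_means
      return assignments, means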

The Wikipedia page for the algorithm tells the story both in math and in pictures. See also Manning and Schütze 1999: §14. Figure KMEANS shows the Wikipedia example data for a particular run with randomly chosen initial values (which is what NLTK does; the Wikipedia walkthrough itself begins with pre-selected ones).

figures/swda/kmeans-wikipedia-ex.png
Figure KMEANS
The Wikipedia k-means example with numerical values and randomly chosen initial means. The mean values are given as colored squares, and the data points are dots, with color representing which cluster they belong to. The initial means are very poor, but the algorithm recovers. In the last two panels, the change in means does not affect the clustering: the algorithm has converged.

The SwdaWordTagClusterer method kmeans() calls the NLTK clustering algorithm using the following code:

  clusterer = cluster.KMeansClusterer(
      self.num_means, self.distance_measure,
      repeats=self.repeats, normalise=False)
  cluster_vector = clusterer.cluster(self.mat, assign_clusters=True, trace=False)

This instructs the clusterer to use the user's values for the number of means, the distance measure, and the number of repeats. We pass normalise=False because we normalize the matrix ourselves. self.mat is our matrix. assign_clusters=True tells cluster() to return the cluster assignment for each row. trace=False suppresses a small amount of progress reporting.

For more on how to work with the NLTK interface, check out their demo, which you can run with from nltk.cluster import kmeans; kmeans.demo(), and also the class documentation.
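
Here is a small toy example of the same interface, assuming a standard NLTK installation with nltk.cluster available; the vectors and the number of means are arbitrary choices for illustration:

  import numpy
  from nltk import cluster
  from nltk.cluster.util import euclidean_distance

  # Six 2-dimensional points forming two obvious groups.
  vectors = [numpy.array(v) for v in
             [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
              [4.0, 4.0], [4.1, 3.9], [3.9, 4.2]]]

  clusterer = cluster.KMeansClusterer(2, euclidean_distance, repeats=10, normalise=False)
  assignments = clusterer.cluster(vectors, assign_clusters=True, trace=False)
  print(assignments)  # e.g., [0, 0, 0, 1, 1, 1] (cluster labels may be permuted)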

Experiment runs

Interjections

['uh'] with a threshold of 20, Euclidean distance:

0: [bye, bye-bye, good-bye, hello, hi, thanks]
1: [dear, golly, good, goodness, gosh, great, my, ooh, ugh, wow]
2: [absolutely, definitely, exactly, huh, huh-uh, no, oh, okay, really, right, sure, true, uh-huh, ye-, yeah, yep, yes]
3: [actually, anyway, hey, like, now, say, see, so, uh, um, well]
4: [boy, gee, god, jeez, man, shoot]

Assessment: Remarkably good; the dialog act tags capture something important about how these items are used.

Pronouns

['prp', 'prp$', 'wp', 'wp$'] with a threshold of 20, Euclidean distance:

0: [her, hers, herself, mine, my, myself, ours, she]
1: [he, him, himself, his, i, me, our, theirs, w-, we]
2: [it, ourselves, them, they, us, what, whatever, who, whoever, whose]
3: [i-, its, itself, one, th-, their, themselves, y-, you, your, yourself]
4: ['s, wh, wha, yo-, yours]

Assessment: Pretty good; pronouns are a mix of dialog-act relevant things (who, whatever) and things that are largely independent of dialog act.

Prepositions

['in'] with a threshold of 20, Euclidean distance:

0: [across, around, at, before, behind, course, down, during, except, inside, outside, prior, since, through, till, underneath, up, while]
1: [about, above, after, along, although, because, between, by, cause, due, for, from, in, into, like, of, off, on, out, over, past, per, so, the, to, under, until, with, within]
2: [instead, onto, throughout]
3: [among, once, rather, such, than, towards, unless, upon, versus, whereas]
4: [against, as, besides, beyond, but, i-, if, that, though, toward, whether, without]

Assessment: Very bad; the dialog act tags seem not to have any interesting relationship to preposition usage. I think this is what we expect given that prepositions are not typically discourse oriented in a sense reflected in the tags.

Discussion

The results of k-means clustering seem promising overall, though I think the approach is not ideal. Some criticisms:

  1. It is difficult to know how many clusters to use, but whatever number we choose has a dramatic impact on performance.
  2. K-means is a hard clustering algorithm in the sense that each word belongs to one and only one cluster. It would be better to allow words to belong to multiple clusters, or no cluster at all (for true outliers/isolates).
  3. We lack an independent measure of success to use for assessment, so we are left to stare at the clusters and try to make sense of them (which can lead us to perceive patterns that are not really there).

Exercises

WORDCMP The following code builds a word–tag matrix, stores the matrix in the numpy.array variable mat, and stores the vocabulary in the list variable vocab:

  cats = ['uh']  # A set of POS tags, as in the examples above.
  corpus = CorpusReader('swda')
  clust = SwdaWordTagClusterer(cats, corpus)
  mat = clust.mat
  vocab = clust.vocab

Row i of mat corresponds to string i of vocab.

Write a function with the following behavior:

OTHERS Pick some other subsets of Penn Treebank 3 Switchboard tags and cluster them, using different values for num_means. What, if anything, do you see in the results?

LSA The NLTK k-means clustering interface has an option for performing singular value decomposition (SVD), the heart of Latent Semantic Analysis, on the matrix. Modify the kmeans() method of swda_wordtag_clusterer.SwdaWordTagClusterer so that the user can specify whether or not to apply SVD. Compare the k-means output from the results section above with the results using SVD.
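
As a hint, here is a minimal sketch of the relevant change, assuming NLTK's svd_dimensions keyword argument to KMeansClusterer and a new, hypothetical svd_dimensions parameter on the method (the value 10 is arbitrary):

  # Inside kmeans(), with svd_dimensions=None meaning no SVD:
  clusterer = cluster.KMeansClusterer(
      self.num_means, self.distance_measure,
      repeats=self.repeats, normalise=False,
      svd_dimensions=svd_dimensions)  # e.g., svd_dimensions=10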

ALGORITHMS NLTK provides a number of other vector-space clustering algorithms. For this exercise:

  1. Pick another clustering algorithm.
  2. Extend swda_wordtag_clusterer.SwdaWordTagClusterer with a method for clustering using the algorithm you chose.
  3. Compare your results with those listed in the results section above, by clustering with respect to those category sets. How do the results differ? Can you explain the contrasts in terms of the nature of the algorithm? Which approach seems better for the task at hand? Does your algorithm address any of the criticisms from the discussion section?