From: Christopher Potts
Date: July 24, 2011 22:13:51 MDT
To: Class list
Subject: Computational Pragmatics: student suggestions implemented

Computational Pragmaticists!

I'm writing to describe two improvements I made to the course codebase and webpages based on student suggestions from Friday's class.

1. Joseph observed that the current Switchboard Dialog Act code and experiments treat '+' as just another tag, even though its interpretation is "continued from previous by same speaker". I've now modified swda.py so that one can follow these + tags back to their source. The implementation is not as intuitive as I would like (to avoid a lot of modifications to the code at this point), but it works well enough:

   a. Instantiate the corpus as usual:

         corpus = CorpusReader('swda')

   b. When you iterate with corpus.iter_utterances() or corpus.iter_transcripts(), you can supply a keyword argument follow_plus. If follow_plus=False (the default), + is left alone. If follow_plus=True, then + is converted to its source act tag. For example:

         # Count the tags, following + to its source:
         d = defaultdict(int)
         for utt in corpus.iter_utterances(follow_plus=True):
             d[utt.act_tag] += 1
         # Or, for DAMSL, which is also affected:
         # d[utt.damsl_act_tag()] += 1

   I also adjusted tag_counts() in swda_functions.py so that tag_counts(follow_plus=True) will do the counts this way, whereas tag_counts() or tag_counts(follow_plus=False) will leave + alone.

2. During our discussion of the word-tag clustering experiment, I wasn't able to articulate why I normalized the vectors by length rather than by turning them into probability distributions. Relatedly, I did not have an explanation for what the x-axis meant in figure 2:

      http://compprag.christopherpotts.net/swda-clustering.html#NORMED

   Ariel explained it to me after class, though: the function length_normalization is sensitive to the extent to which the values are distributed across the vector: it delivers higher values for vectors that are highly distributed.
   Here's an illustration in Python:

      import numpy

      def length_normalization(vec):
          vec = numpy.array(vec)
          return vec / numpy.sqrt(numpy.dot(vec, vec))

      ## Most distributed; sum = 40
      sum(length_normalization([8, 8, 8, 8, 8]))
      2.2360679774997898

      ## Less distributed; sum = 40
      sum(length_normalization([10, 10, 10, 10, 0]))
      2.0

      ## Even less distributed; sum = 39.999
      sum(length_normalization([13.333, 13.333, 13.333, 0, 0]))
      1.7320508075688772

   In terms of figure 2, we see that 'absolutely', 'exactly', and 'definitely' occur with basically only 1 act tag (it happens to be 'aa'), whereas 'oh' is widely distributed relative to the act tags. I've added a note explaining this to the caption of figure 2.

---Chris
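P.S. A further note on point 2, in case the contrast helps: had we normalized by the sum instead (turning each vector into a probability distribution), every vector would sum to 1, so the sum would tell us nothing about how spread out the counts are. A minimal sketch of the comparison (the prob_normalization helper is hypothetical, written just for this illustration):

```python
import numpy

def length_normalization(vec):
    # Divide by the Euclidean (L2) norm, as in the clustering code.
    vec = numpy.array(vec, dtype=float)
    return vec / numpy.sqrt(numpy.dot(vec, vec))

def prob_normalization(vec):
    # Hypothetical alternative: divide by the sum (L1 norm),
    # turning counts into a probability distribution.
    vec = numpy.array(vec, dtype=float)
    return vec / vec.sum()

# A probability distribution always sums to 1, so the sum cannot
# distinguish concentrated vectors from distributed ones:
round(sum(prob_normalization([8, 8, 8, 8, 8])), 10)   # 1.0
round(sum(prob_normalization([40, 0, 0, 0, 0])), 10)  # 1.0

# By contrast, the sum of a length-normalized vector ranges from 1
# (all mass on one tag) up to sqrt(n) (evenly spread over n tags):
sum(length_normalization([40, 0, 0, 0, 0]))  # 1.0
sum(length_normalization([8, 8, 8, 8, 8]))   # ~sqrt(5), about 2.236
```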