Penn Discourse Treebank 2 Competition

  1. Overview
  2. Train/test split
  3. The classifier set-up and interface
  4. Team Potts?
    1. Verbal word-pairs baseline
    2. Harvard Inquirer baseline
    3. Verbal word-pairs and Harvard Inquirer combined
  5. Team Banana Wugs
  6. Team Banana Slugs
  7. And the winner is ...
  8. Addendum: The combined Slugs–Wugs model
  9. Exercises


In the second half of class on July 29, we broke up into teams (originally there were three, but two formed an alliance):

Team Banana Slugs
This is a banana slug

Team Banana Wugs
This is a wug

Each team developed a probabilistic model of the semantics of Implicit coherence relations in the Penn Discourse Treebank 2 (see Pitler et al. 2009 and also this preliminary discussion). In essence, the goal was to accurately predict the primary semantic class (Comparison, Contingency, Expansion, Temporal) for Implicit examples.

I implemented the teams' theories to the best of my ability using Python/NLTK. I also defined a balanced train/test split for evaluation.

This page describes the results of the competition. Read on to find out who won ...

Code/data listing

Train/test split

The train/test split is released as two pickle files giving sets of indices.

The indices correspond to what one gets when using the pdtb.CorpusReader method iter_data() to go through the corpus. They also correspond to the row numbers in the CSV file, starting from 0 and ignoring the header.

These sets are relatively small. There are only 826 Temporal examples in the Implicit part of the corpus, which puts an upper bound on the sizes given the goal of having balanced datasets.
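To make the size constraint concrete, here is a minimal sketch of how a balanced split could be drawn and stored as pickled index sets. It does not reproduce the released split: the function name `balanced_split`, the seed, the per-class sizes, and the toy label list are all assumptions made for illustration.

```python
import pickle
import random
from collections import defaultdict

def balanced_split(labels, n_train, n_test, seed=0):
    """Return (train, test): disjoint sets of corpus indices containing
    n_train and n_test examples of every class, respectively."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    train, test = set(), set()
    for label, indices in sorted(by_class.items()):
        # The smallest class caps how many examples each class can contribute.
        if len(indices) < n_train + n_test:
            raise ValueError("class %r has only %d examples" % (label, len(indices)))
        sample = rng.sample(indices, n_train + n_test)
        train.update(sample[:n_train])
        test.update(sample[n_train:])
    return train, test

# Toy labels standing in for the primary semantic classes of the Implicit examples.
toy_labels = (["Comparison"] * 30 + ["Contingency"] * 50 +
              ["Expansion"] * 80 + ["Temporal"] * 12)
train_indices, test_indices = balanced_split(toy_labels, n_train=8, n_test=4)

# Sets of indices survive a pickle round trip unchanged.
restored = pickle.loads(pickle.dumps(train_indices))
```

With only 12 toy Temporal examples, asking for more than 12 per class in total raises an error, which is the same pressure the 826 Temporal examples put on the real split.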

The classifier set-up and interface

The classifier interface is the same as the one described on the first predicting Implicit relations page.

The model is a maximum entropy (MaxEnt) classifier in which features can be boolean-valued, integer-valued, or float-valued. MaxEnt classifiers are a close cousin to generalized linear models of the sort used for the IMDB review data.

For the current task, the dependent variable is the primary semantic class of the example (Comparison, Contingency, Expansion, Temporal). Each team came up with a bunch of feature functions, each of which maps a datum instance to a property/feature of that example. The model uses the training data to estimate weights for those features. The effectiveness of the features and associated weights is evaluated on the test data.
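As a rough illustration of what a feature function looks like, here is a minimal sketch. It assumes each datum is a plain dict with `arg1`, `arg2`, and `label` keys, which is a simplification and not the pdtb module's own datum class. The function names and the feature name `Arg1_Longer` are hypothetical.

```python
def length_features(datum):
    """Hypothetical feature function: maps one example to named feature values."""
    arg1_words = datum["arg1"].split()
    arg2_words = datum["arg2"].split()
    return {
        "Arg1_Longer": len(arg1_words) > len(arg2_words),                # boolean-valued
        "Arg2_Word_Count": len(arg2_words),                              # integer-valued
        "Arg_Length_Ratio": len(arg1_words) / max(len(arg2_words), 1),   # float-valued
    }

def featurize(data, feature_function):
    """Pair each example's feature dict with its class label."""
    return [(feature_function(datum), datum["label"]) for datum in data]

toy_data = [
    {"arg1": "Sales fell sharply last year", "arg2": "Profits rose", "label": "Comparison"},
]
training_pairs = featurize(toy_data, length_features)
```

Lists of (feature dict, label) pairs like `training_pairs` are the input format that NLTK's classifier trainers accept, so a feature function of this shape is all a team needs to supply.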

The competition is defined by the following code:

Team Potts?

I wasn't officially part of the competition (that wouldn't be fair, since I handled the implementation). However, before the teams were formed, I did briefly assess two different kinds of feature function, adapted from the model developed by Pitler et al. 2009. I assessed them against random train/test splits. How do they do against our fixed, balanced train/test split?

Verbal word-pairs baseline

The verbal word-pairs experiment had a feature set consisting entirely of verb-pairs (including modals), where V1 was drawn from Arg1 and V2 from Arg2.

Its accuracy for the random train/test split was about 41%, but this is not a very useful evaluation number in light of the highly imbalanced nature of the corpus's primary semantic classes.

Here's a summary of its performance on our train/test split:

  Confusion matrix
               Comparison  Contingency  Expansion  Temporal
  Comparison         67.0         46.0       31.0      56.0
  Contingency        36.0         67.0       49.0      48.0
  Expansion          51.0         34.0       51.0      64.0
  Temporal           39.0         27.0       32.0     102.0

  Effectiveness
               precision  recall    f1
  Comparison        0.35    0.34  0.34
  Contingency       0.39    0.34  0.36
  Temporal          0.38    0.51  0.43
  Expansion         0.31    0.26  0.28
  --------------------------------------------------
  Average           0.36    0.36  0.35

  Accuracy: 0.36
  Train set accuracy: 0.96

  Feature count: 15679

The huge gap between the train accuracy and the test accuracy is a reliable indicator of problematic over-fitting. The feature-space for this model is very large relative to the size of the dataset, and it is also (in turn) very sparse.
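The effectiveness numbers for the verbal word-pairs model can be recomputed from its confusion matrix. The sketch below reads rows as true classes and columns as predicted classes. That orientation is an inference (it reproduces the reported precision and recall values), not something stated in the output. The function names are my own.

```python
CLASSES = ["Comparison", "Contingency", "Expansion", "Temporal"]

# Verbal word-pairs confusion matrix: rows = true class, columns = predicted class.
confusion = [
    [67, 46, 31, 56],
    [36, 67, 49, 48],
    [51, 34, 51, 64],
    [39, 27, 32, 102],
]

def effectiveness(matrix):
    """Return {class: (precision, recall, f1)} for a square confusion matrix."""
    results = {}
    for i, label in enumerate(CLASSES):
        true_positives = matrix[i][i]
        predicted = sum(row[i] for row in matrix)   # column total
        actual = sum(matrix[i])                     # row total
        precision = true_positives / predicted
        recall = true_positives / actual
        f1 = 2 * precision * recall / (precision + recall)
        results[label] = (precision, recall, f1)
    return results

def accuracy(matrix):
    """Proportion of test examples on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

for label, scores in effectiveness(confusion).items():
    print("%-12s %.2f %.2f %.2f" % ((label,) + scores))
print("Accuracy: %.2f" % accuracy(confusion))
```

Each row sums to 200, so the test set has 800 examples, and the diagonal (287 correct) gives the accuracy directly.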

Harvard Inquirer baseline

The features of the Harvard Inquirer model were again pairs of elements derived from Arg1 and Arg2, but here they were the abstract semantic classes of those texts, rather than words themselves.

Here is a summary of the effectiveness of this approach on our train/test split:

  Confusion matrix
               Comparison  Contingency  Expansion  Temporal
  Comparison         58.0         38.0       41.0      63.0
  Contingency        53.0         51.0       42.0      54.0
  Expansion          56.0         38.0       37.0      69.0
  Temporal           44.0         35.0       34.0      87.0

  Effectiveness
               precision  recall    f1
  Comparison        0.27    0.29  0.28
  Contingency       0.31    0.26  0.28
  Temporal          0.32    0.44  0.37
  Expansion         0.24    0.19  0.21
  --------------------------------------------------
  Average           0.29    0.29  0.29

  Accuracy: 0.29
  Train set accuracy: 0.61

  Feature count: 3933

This feature set is about four times smaller than the previous one. Its performance on the test set is not as strong, but it does less over-fitting. Still, it would be hard to argue that one of these two models is better than the other.

Here's an organized list of the top five positive features from each category. The links go to the Harvard Inquirer page for the semantic categories.

  Comparison
    4.769  (Our, Causal)
    4.372  (ECON, If)
    4.356  (Self, Space)
    4.282  (Quan, IAV)
    4.124  (If, Know)

  Contingency
    5.826  (Causal, Strong)
    5.166  (IAV, Strong)
    5.141  (SV, PtLw)
    4.871  (Means, You)
    4.395  (SV, You)

  Expansion
    6.910  (MALE, PtLw)
    5.172  (Our, Time@)
    5.053  (MALE, Rel)
    4.994  (TimeSpc, ECON)
    4.450  (You, PtLw)

  Temporal
    5.665  (Female, Means)
    4.861  (MALE, Compare)
    4.750  (Ovrst, Know)
    4.297  (H4, MALE)
    4.099  (Causal, MALE)

Verbal word-pairs and Harvard Inquirer combined

Here are the results for a feature function that combines the above models (this is an exercise on the first page about predicting Implicit relations):

  Confusion matrix
               Comparison  Contingency  Expansion  Temporal
  Comparison         70.0         48.0       46.0      36.0
  Contingency        31.0         88.0       54.0      27.0
  Expansion          35.0         55.0       64.0      46.0
  Temporal           21.0         41.0       36.0     102.0

  Effectiveness
               precision  recall    f1
  Comparison        0.45    0.35  0.39
  Contingency       0.38    0.44  0.41
  Temporal          0.48    0.51   0.5
  Expansion         0.32    0.32  0.32
  --------------------------------------------------
  Average           0.41    0.41   0.4

  Accuracy: 0.41
  Train set accuracy: 1.0

  Feature count: 632559

Accuracy and effectiveness are up, but with over 600,000 features, we memorized the training data and still didn't do all that well in testing.

Team Banana Wugs

Here's the experiment code, which is well-documented with the team's informal statements and my own interpretations of them:

High-level overview of the Wugs' features:

  1. Negation: features capturing negation balances and imbalances across the Args, targeting both sentential and constituent negation.
  2. Sentiment: A separate sentiment score for each Arg, representing the sum of the coefficients for all the words in that Arg (p-value threshold at 0.1).
  3. Overlap: the cardinality of the intersection of the Arg1 and Arg2 words divided by their union.
  4. Structural complexity: features capturing, for each Arg, whether it has an embedded clause, the number of embedded clauses, and the height of its largest tree.
  5. Complexity ratios: a feature for log of the ratio of the lengths (in words) of the two Args, a feature for the ratio of the clause-counts for the two Args, and a feature for the ratio of the max heights for the two Args.
  6. Pronominal subjects: a pair-feature capturing whether the subject of the Arg is pronominal (pro) or non-pronominal (non-pro). The features are pairs from {pro, non-pro} x {pro, non-pro}.
  7. It seems: returns False if the first bigram of the second Arg is not "it seems".
  8. Tense agreement: a feature for the degree to which the verbal nodes in the two Args have the same tense.
  9. Modals: a pair-feature capturing whether Arg contains a modal (modal) or not (non-modal). The features are pairs from {modal, non-modal} x {modal, non-modal}.
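Three of these features are simple enough to sketch without any parse trees: the overlap score, the log length ratio, and the modal pair-feature. This is a rough reconstruction from the descriptions above, not the team's code. The function names and the modal word list are assumptions, and the Args are taken to be non-empty lists of word strings.

```python
import math

# Assumed modal list; the team's actual criterion may have been tag-based.
MODALS = {"can", "could", "may", "might", "must", "shall", "should", "will", "would"}

def token_overlap(arg1_words, arg2_words):
    """Size of the intersection of the two word sets divided by the size of their union."""
    set1, set2 = set(arg1_words), set(arg2_words)
    union = set1 | set2
    return len(set1 & set2) / len(union) if union else 0.0

def log_length_ratio(arg1_words, arg2_words):
    """Log of the ratio of the Arg lengths in words (0.0 when they are equal)."""
    return math.log(len(arg1_words) / len(arg2_words))

def modal_pair(arg1_words, arg2_words):
    """Pair from {modal, non-modal} x {modal, non-modal}."""
    def tag(words):
        return "modal" if MODALS & {w.lower() for w in words} else "non-modal"
    return (tag(arg1_words), tag(arg2_words))

arg1 = "the company could raise prices".split()
arg2 = "prices could fall".split()
features = {
    "Normalized_Shared_Token_Count": token_overlap(arg1, arg2),
    "Arg_Length_Ratio": log_length_ratio(arg1, arg2),
    modal_pair(arg1, arg2): True,
}
```

Taking the log of the length ratio makes the feature symmetric around zero, so "Arg1 twice as long" and "Arg2 twice as long" get weights of equal magnitude and opposite sign.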

Performance summary for the balanced train/test split:

  Confusion matrix
               Comparison  Contingency  Expansion  Temporal
  Comparison         46.0         50.0       35.0      69.0
  Contingency        37.0         77.0       37.0      49.0
  Expansion          35.0         42.0       46.0      77.0
  Temporal           23.0         36.0       41.0     100.0

  Effectiveness
               precision  recall    f1
  Comparison        0.33    0.23  0.27
  Contingency       0.38    0.39  0.38
  Temporal          0.34     0.5   0.4
  Expansion         0.29    0.23  0.26
  --------------------------------------------------
  Average           0.33    0.34  0.33

  Accuracy: 0.34
  Train set accuracy: 0.37

  Feature count: 116

  Top 5 features for each category

  Comparison
    4.009  Normalized_Shared_Token_Count==<type 'float'>
    0.564  ('non-neg', 'neg')==True
    0.427  Arg2_Coef_Sum==<type 'float'>
    0.381  Sentential_Negation_Balanced==False
    0.306  Arg_Length_Ratio==<type 'float'>

  Contingency
    1.130  ('neg', 'neg')==True
    0.616  Constituent_Negation_Balanced==True
    0.597  Arg2_Startswith_It_Seems==False
    0.560  Arg1_Coef_Sum==<type 'float'>
    0.473  ('modal', 'non-modal')==True

  Expansion
    0.635  ('neg', 'neg')==True
    0.360  ('pro-subj', 'pro-subj')==True
    0.322  Normalized_Shared_Token_Count==<type 'float'>
    0.231  Constituent_Negation_Balanced==True
    0.191  Arg2_Startswith_It_Seems==False

  Temporal
    2.737  Normalized_Shared_Token_Count==<type 'float'>
    0.928  ('non-neg', 'non-neg')==True
    0.260  Tree_Height_Ratio==<type 'float'>
    0.231  ('non-modal', 'non-modal')==True
    0.153  Constituent_Negation_Balanced==False

Team Banana Slugs

Here's the well-documented experiment code:

High-level overview of the Slugs' features:

  1. Negation: for each Arg, a feature for whether it was negated and the number of negations it contains. Also, a feature capturing negation balance/imbalance across the Args.
  2. Main verbs: for each Arg, a feature for its main verb. Also, a feature returning True if the two Args' main verbs match, else False.
  3. Length ratio: a feature for the ratio of the lengths (in words) of Arg1 and Arg2.
  4. WordNet antonyms: the number of words in Arg2 that are antonyms of a word in Arg1.
  5. Genre: a feature for the genre of the file containing the example.
  6. Modals: for each Arg, the number of modals in it.
  7. WordNet hypernyms (not used; see below): a feature (hyp1, hyp2) for every hypernym hyp1 consistent with a word in Arg1 and every hypernym hyp2 consistent with a word in Arg2.
  8. WordNet hypernym counts (written by me; see below): for Arg1, a feature for the number of words in Arg2 that are hypernyms of a word in Arg1, and ditto for Arg2.
  9. N-gram features: for each Arg, a feature for each unigram it contains. (The team suggested going to 2- or 3-grams, but I called a halt at 1 because the data-set is not that big.)
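The negation features can be sketched as follows. The feature names match the ones that appear in the results output for this team, but the negation word list and the function names are assumptions, and the team's actual negation test may have been different.

```python
# Assumed negation markers; not the team's actual list.
NEGATION_WORDS = {"not", "n't", "no", "never", "nobody", "nothing", "neither", "nor"}

def negation_count(words):
    """Number of negation markers among the words of an Arg."""
    return sum(1 for word in words if word.lower() in NEGATION_WORDS)

def negation_features(arg1_words, arg2_words):
    """Per-Arg negation flags and counts, plus one balance pair-feature."""
    count1 = negation_count(arg1_words)
    count2 = negation_count(arg2_words)
    balance = ("neg" if count1 else "non-neg", "neg" if count2 else "non-neg")
    return {
        "Arg1_Negated": count1 > 0,
        "Arg2_Negated": count2 > 0,
        "Arg1_Negation_Count": count1,
        "Arg2_Negation_Count": count2,
        balance: True,   # e.g. ('neg', 'non-neg')==True in the results output
    }

slug_features = negation_features(["Sales", "did", "n't", "rise"], ["Profits", "grew"])
```

Only one of the four balance pairs is present for any given example, so it behaves like a four-way categorical feature.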

Two comments before assessment.

Let's first look at the results without the 1-gram features:

  Confusion matrix
               Comparison  Contingency  Expansion  Temporal
  Comparison         56.0         48.0       49.0      47.0
  Contingency        32.0         78.0       55.0      35.0
  Expansion          34.0         57.0       74.0      35.0
  Temporal           31.0         27.0       47.0      95.0

  Effectiveness
               precision  recall    f1
  Comparison        0.37    0.28  0.32
  Contingency       0.37    0.39  0.38
  Temporal          0.45    0.48  0.46
  Expansion         0.33    0.37  0.35
  --------------------------------------------------
  Average           0.38    0.38  0.38

  Accuracy: 0.38
  Train set accuracy: 0.73

  Feature count: 1824

  Top 5 features for each category

  Comparison
    6.844  Arg2_Main_Verb_provide==True
    6.837  Arg2_Main_Verb_earned==True
    6.765  Arg1_Main_Verb_fled==True
    6.712  Arg1_Main_Verb_totaled==True
    6.433  Arg1_Main_Verb_result==True

  Contingency
    7.577  Arg2_Main_Verb_raises==True
    7.185  Arg2_Main_Verb_believe==True
    7.001  Arg1_Main_Verb_help==True
    6.660  Arg1_Main_Verb_bristle==True
    6.425  Arg1_Main_Verb_leapt==True

  Expansion
    6.880  Arg2_Main_Verb_include==True
    6.749  Arg2_Main_Verb_shows==True
    6.642  Arg1_Main_Verb_prompted==True
    6.494  Arg2_Main_Verb_tumbled==True
    6.430  Arg2_Main_Verb_drag==True

  Temporal
    7.899  Arg1_Main_Verb_joined==True
    7.279  Arg2_Main_Verb_begin==True
    7.239  Arg1_Main_Verb_begins==True
    7.142  Arg2_Main_Verb_managed==True
    7.073  Arg2_Main_Verb_remains==True

And now with the 1-gram features added in:

  Confusion matrix
               Comparison  Contingency  Expansion  Temporal
  Comparison         67.0         38.0       63.0      32.0
  Contingency        49.0         72.0       56.0      23.0
  Expansion          39.0         55.0       66.0      40.0
  Temporal           34.0         33.0       40.0      93.0

  Effectiveness
               precision  recall    f1
  Comparison        0.35    0.34  0.34
  Contingency       0.36    0.36  0.36
  Temporal          0.49    0.47  0.48
  Expansion         0.29    0.33  0.31
  --------------------------------------------------
  Average           0.38    0.37  0.37

  Accuracy: 0.37
  Train set accuracy: 1.0

  Feature count: 28758

Nearly indistinguishable accuracy and effectiveness, but with a much larger model and a lot of over-fitting! I say we toss out the unigrams.

Here's my hypothesis about why unigram features are particularly bad for this application: the PDTB has instances where the same sentence, or largely overlapping portions of the same sentence, are repeated with a focus on different connectives and, in turn, with different coherence semantics. The model thus gets a lot of contradictory word-level information.

One more model: the non-unigrams model is dominated by main-verb features. This obscures the contributions of the others. So I ran one more Slugs variation: a model without unigrams or main-verb features. It turns out to be the smallest model of the whole competition, at 86 features, and it is still pretty competitive.

  Confusion matrix
               Comparison  Contingency  Expansion  Temporal
  Comparison         50.0         44.0       36.0      70.0
  Contingency        29.0         58.0       45.0      68.0
  Expansion          23.0         35.0       48.0      94.0
  Temporal           27.0         24.0       25.0     124.0

  Effectiveness
               precision  recall    f1
  Comparison        0.39    0.25   0.3
  Contingency       0.36    0.29  0.32
  Temporal          0.35    0.62  0.45
  Expansion         0.31    0.24  0.27
  --------------------------------------------------
  Average           0.35    0.35  0.34

  Accuracy: 0.35
  Train set accuracy: 0.34

  Feature count: 86

  Top 5 features for each category

  Comparison
    0.500  Main_Verb_Match==True
    0.292  ('non-neg', 'neg')==True
    0.277  Arg_Length_Ratio==<type 'float'>
    0.169  Genre_ptb-errata==True
    0.153  Arg1_Modal_Count==<type 'int'>

  Contingency
    0.362  Arg1_Negated==True
    0.355  Arg2_Negated==True
    0.299  Genre_ptb-essays==True
    0.299  Arg1_Negation_Count==<type 'int'>
    0.244  ('neg', 'neg')==True

  Expansion
    0.443  Genre_ptb-highlights==True
    0.314  ('non-neg', 'non-neg')==True
    0.284  Arg2_Negation_Count==<type 'int'>
    0.255  Arg1_Negation_Count==<type 'int'>
    0.120  ('neg', 'neg')==True

  Temporal
    0.476  Main_Verb_Match==False
    0.208  Calendar_Words_Count==<type 'int'>
    0.083  Arg_Length_Ratio==<type 'float'>
    0.053  Main_Verb_Match==True
    0.014  Genre_ptb-errata==True

And the winner is ...

I think there is no clear winner; both teams are winners! Some considerations:

  1. The Slugs' accuracy was 37% (38% without unigram features), whereas the Wugs' was 34%. Going by the diagonals of the confusion matrices above, the Slugs got 29 more of the 800 test examples correct than the Wugs did (34 more without unigram features).
  2. However, the Slugs had training accuracy of 100% (over-fitting) for their unigrams model and 73% for their non-unigrams model. The gap between these numbers and the test accuracy is worrisome. In contrast, the Wugs had just 37% accuracy on the training data.
  3. The Wugs model has only 116 features, compared with 1824 for the Slugs without unigrams and 28758 with them. However, both teams had around a dozen feature functions; it's just that some of the Slugs' functions generated large numbers of features.
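To line these considerations up side by side, the sketch below tabulates the train/test gap for each run. The numbers are copied from the summaries on this page; the run labels and the function name are my own.

```python
# (run, feature count, train accuracy, test accuracy), as reported above.
RUNS = [
    ("Wugs", 116, 0.37, 0.34),
    ("Slugs, no unigrams", 1824, 0.73, 0.38),
    ("Slugs, with unigrams", 28758, 1.00, 0.37),
    ("Slugs, no unigrams or main verbs", 86, 0.34, 0.35),
]

def generalization_gap(train_accuracy, test_accuracy):
    """Train accuracy minus test accuracy; large positive values suggest over-fitting."""
    return train_accuracy - test_accuracy

gaps = {name: generalization_gap(train, test) for name, _, train, test in RUNS}

for name, feature_count, train, test in RUNS:
    print("%-34s %6d features   gap %+.2f" % (name, feature_count, gaps[name]))
```

The two small models have gaps near zero, while the gap grows with the feature count for the larger Slugs models.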

The results from all the runs reported above, including all features and their weights, are in this directory.

Addendum: The combined Slugs–Wugs model

The following code implements a simple merger of the Slugs and Wugs models (the union of the two feature sets, but without unigram features):

Here are the summary performance numbers for this model:

  Confusion matrix
               Comparison  Contingency  Expansion  Temporal
  Comparison         56.0         45.0       49.0      50.0
  Contingency        45.0         78.0       52.0      25.0
  Expansion          46.0         47.0       64.0      43.0
  Temporal           31.0         27.0       50.0      92.0

  Effectiveness
               precision  recall    f1
  Comparison        0.31    0.28   0.3
  Contingency        0.4    0.39  0.39
  Temporal          0.44    0.46  0.45
  Expansion          0.3    0.32  0.31
  --------------------------------------------------
  Average           0.36    0.36  0.36

  Accuracy: 0.36
  Train set accuracy: 0.74

  Feature count: 1920

It's surprising that this resulted in a drop in performance from the Slugs model without unigrams. A more intelligent merger of the two models might produce better overall results. See exercise GLM for data that might help with this merger project.
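For readers who want to experiment with other mergers, a union of feature functions can be sketched in a few lines. This is an illustration of the general idea rather than the competition code. The helper name `merge_feature_functions` and the two toy feature functions are hypothetical stand-ins.

```python
def merge_feature_functions(*feature_functions):
    """Return a feature function whose output is the union of the inputs' outputs."""
    def merged(datum):
        features = {}
        for function in feature_functions:
            # Identically named features collapse into one; the later function wins.
            features.update(function(datum))
        return features
    return merged

# Toy stand-ins for the two teams' feature functions (values are invented).
def toy_slugs_features(datum):
    return {"Arg_Length_Ratio": 1.5, "Arg1_Negated": False}

def toy_wugs_features(datum):
    return {"Arg_Length_Ratio": 1.5, "Normalized_Shared_Token_Count": 0.2}

combined = merge_feature_functions(toy_slugs_features, toy_wugs_features)
merged_features = combined(None)   # the toy functions ignore their datum
```

Both teams' outputs include a feature named Arg_Length_Ratio, so a plain union counts it once. The reported feature count of 1920 is 20 short of 1824 + 116 = 1940, which would be consistent with some shared names collapsing in this way, though that is a guess on my part.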


Exercises

FEATURES The results directory contains the raw output of the assessment function. It also contains a command-line program that facilitates exploring the feature weights. Basic usage:

python -f foo

where foo is one of the results filenames. By default, it prints out the top 5 features for each class, restricting to positive values. The following returns 10 features for each class and includes negative weights:

python -f foo -n 10 -k

The keyword arguments can be given in any order.

Use this feature to compare, in general terms, the top features from each team's model. How are they alike, how are they different, and why? Also, do you notice any shared or nearly shared features that differ markedly across teams (say, because their weights have contrasting polarity for a given class)?
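If you would rather do this comparison inside Python, the same kind of ranking is easy to sketch. The code below is not the command-line program. It assumes the weights have been loaded into a dict from class label to a dict from feature name to weight, and it ranks by absolute value when negative weights are included; both choices are assumptions. The toy weights are invented.

```python
def top_features(weights, n=5, include_negative=False):
    """Return {class: [(feature, weight), ...]} with the n highest-ranked features."""
    ranked = {}
    for label, feature_weights in weights.items():
        items = list(feature_weights.items())
        if include_negative:
            # Rank by magnitude so that strongly negative features surface too.
            items.sort(key=lambda item: abs(item[1]), reverse=True)
        else:
            items = [item for item in items if item[1] > 0]
            items.sort(key=lambda item: item[1], reverse=True)
        ranked[label] = items[:n]
    return ranked

# Invented toy weights, for illustration only.
toy_weights = {
    "Comparison": {"feature_a": 0.9, "feature_b": -1.4, "feature_c": 0.2},
    "Temporal": {"feature_a": -0.3, "feature_b": 0.7},
}
positive_only = top_features(toy_weights, n=2)
with_negative = top_features(toy_weights, n=2, include_negative=True)
```

Running it over both teams' weights with `include_negative=True` makes contrasting polarities for a shared feature easy to spot.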

NEGS Both teams used the negation features (neg, non-neg), (neg, neg), (non-neg, neg), and (non-neg, non-neg) indicating the negation balance between the two Args. To what extent is the behavior (in)consistent across their models?

GLM Classifier models often obscure the contribution of individual features, which can be disheartening, especially for linguists inclined to lovingly craft their features based on specific scientific intuitions. Thus, it is often smart to balance classifier results with more fine-grained visualization and statistical modeling.

To facilitate such study, each team's feature function file (Banana Slugs, Banana Wugs) contains a function model_to_csv() that, limiting attention to the training data, creates a CSV file for all of the relatively constrained features (that is, excluding n-gram features, hypernym-pair features, etc. — the ones that would result in tens of thousands of columns). This function calls on the general facilities in, which you should also download if you want to create the files yourself.

Direct links to the CSV files:

The leftmost column is the semantic class, and the remaining columns represent features. This format makes it easy to identify associations between feature values and classes using plotting functions and statistical modeling.
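As a starting point, here is a minimal sketch that averages one numeric feature within each class. The rows are invented, and the header name `Class` for the leftmost column is an assumption, so check the actual files before reusing it.

```python
import csv
import io
from collections import defaultdict

# Invented rows in the described layout: class first, then one column per feature.
TOY_CSV = """Class,Arg1_Negated,Arg_Length_Ratio
Comparison,True,1.2
Comparison,False,0.8
Temporal,False,2.0
Temporal,False,1.0
"""

def mean_by_class(csv_file, feature, class_column="Class"):
    """Return {class: mean value of the named numeric feature}."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in csv.DictReader(csv_file):
        totals[row[class_column]] += float(row[feature])
        counts[row[class_column]] += 1
    return {label: totals[label] / counts[label] for label in totals}

means = mean_by_class(io.StringIO(TOY_CSV), "Arg_Length_Ratio")
```

To run it on a real file, pass an open file object instead of the `io.StringIO` wrapper. Large differences between class means are a hint that the feature carries signal worth plotting or modeling more carefully.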

See what you can find in the CSV file that could inform our understanding of the current models and help us build better ones!