Penn Discourse Treebank 2 Competition
- Overview
- Train/test split
- The classifier set-up and interface
- Team Potts?
- Verbal word-pairs baseline
- Harvard Inquirer baseline
- Verbal word-pairs and Harvard Inquirer combined
- Team Banana Wugs
- Team Banana Slugs
- And the winner is ...
- Addendum: The combined Slugs–Wugs model
- Exercises
In the second half of class on July 29, we broke up into teams
(originally there were three, but two formed an alliance):
- Team Banana Slugs
- Team Banana Wugs
Each team developed a probabilistic model of the semantics of
Implicit coherence relations in the Penn Discourse Treebank 2
(see Pitler et al. 2009
and also this preliminary discussion).
In essence, the goal was to accurately predict the primary semantic
class (Comparison, Contingency, Expansion, Temporal) for Implicit
examples.
I implemented the teams' theories to the best of my ability using
Python/NLTK. I also defined a balanced train/test split for
evaluation.
This page describes the results of the competition. Read on to find
out who won ...
Code/data listing
The train/test split is released as two pickle files giving sets of
indices.
The indices correspond to what one gets when using the
pdtb.CorpusReader method
iter_data() to go through the
corpus. They also correspond to the row numbers in the CSV file,
starting from 0 and ignoring the header.
- Training set of 2400 examples: 600 randomly chosen examples from
each of the four primary semantic classes that we are trying to
predict.
- Test set of 800 examples: 200 randomly chosen examples from each
of the four primary semantic classes that we are trying to
predict.
These sets are relatively small. There are only 826 Temporal
examples in the Implicit part of the corpus, which puts an upper bound
on the sizes given the goal of having balanced datasets.
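To make the indexing concrete, here is a minimal sketch of how the pickled
index sets might be used to pull out the corresponding Datum objects. The
pickle and CSV filenames below are placeholders, and the CorpusReader calls
are assumptions based on the description above.

    import pickle
    from pdtb import CorpusReader  # the course's pdtb.py module

    train_indices = pickle.load(open('pdtb-train-indices.pickle', 'rb'))  # placeholder filename
    test_indices = pickle.load(open('pdtb-test-indices.pickle', 'rb'))    # placeholder filename

    # Row i of the CSV (header excluded) is the i-th datum yielded by iter_data().
    corpus = CorpusReader('pdtb2.csv')  # placeholder filename
    train, test = [], []
    for i, datum in enumerate(corpus.iter_data()):
        if i in train_indices:
            train.append(datum)
        elif i in test_indices:
            test.append(datum)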
The classifier interface is the same as the one described on the
first page about predicting Implicit relations.
The model is a maximum entropy (MaxEnt) classifier in which
features can be boolean-valued, integer-valued, or float-valued.
MaxEnt classifiers are close cousins of the generalized linear models
used for the IMDB review data.
For the current task, the dependent variable is the primary
semantic class of the example (Comparison, Contingency, Expansion,
Temporal). Each team came up with a bunch of feature functions, each
of which maps a datum instance to a
property/feature of that example. The model uses the training data to
estimate weights for those features. The effectiveness of the
features and associated weights is evaluated on the test data.
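To make this concrete, here is a minimal sketch of what a feature function
and the training step look like. The Datum accessors and feature names are
hypothetical stand-ins; the real interface is defined in the files listed
below.

    from nltk.classify import MaxentClassifier
    from nltk.classify.maxent import TypedMaxentFeatureEncoding

    def my_feature_function(datum):
        """Map a PDTB Datum to a dict of boolean-, int-, or float-valued features."""
        arg1 = datum.arg1_words()  # assumed accessor for Arg1's word list
        arg2 = datum.arg2_words()  # assumed accessor for Arg2's word list
        return {
            'Arg_Length_Ratio': float(len(arg1)) / max(len(arg2), 1),
            'Arg2_Contains_Not': 'not' in [w.lower() for w in arg2]}

    def train_model(labeled_data):
        """labeled_data: iterable of (Datum, primary-semantic-class) pairs."""
        train_toks = [(my_feature_function(d), label) for d, label in labeled_data]
        # A typed encoding lets int/float feature values be treated numerically
        # rather than as atomic symbols.
        encoding = TypedMaxentFeatureEncoding.train(train_toks)
        return MaxentClassifier.train(train_toks, algorithm='iis',
                                      encoding=encoding, max_iter=10)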
The competition is defined by the following code:
- pdtb_classifier.py:
interface to the NLTK MaxEnt classifier, specialized for this task,
with a lot of methods for assessment
- pdtb_competition.py:
code for training, testing, and assessing the teams' models using
pdtb_classifier.py.
I wasn't officially part of the competition (that wouldn't be
fair, since I handled the implementation). However, before the
teams were formed, I did briefly assess two different kinds of
feature function, adapted from the model developed by
Pitler et al. 2009.
I assessed them against random train/test splits. How do they
do against our fixed, balanced train/test split?
The verbal word-pairs experiment
had a feature set consisting entirely of verb-pairs (including modals), where
V1 was drawn from Arg1 and V2 from Arg2.
Its accuracy for the random train/test split was about 41%, but
this is not a very useful evaluation number in light of the highly
imbalanced nature of the corpus's primary semantic classes.
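For reference, here is a rough sketch of how such a feature space can be
generated. The accessors arg1_pos() and arg2_pos(), assumed to return
(word, POS-tag) pairs for each Arg, are stand-ins for the real interface in
the experiment code.

    def verbal_word_pairs(datum):
        """Pair every verbal word (verb or modal) in Arg1 with every one in Arg2."""
        verb_tags = ('VB', 'MD')  # Penn tags VB, VBD, VBG, VBN, VBP, VBZ, plus MD
        v1s = set(w.lower() for w, t in datum.arg1_pos() if t.startswith(verb_tags))
        v2s = set(w.lower() for w, t in datum.arg2_pos() if t.startswith(verb_tags))
        feats = {}
        for v1 in v1s:
            for v2 in v2s:
                feats['VerbPair_%s_%s' % (v1, v2)] = True
        return feats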
Here's a summary of its performance on our train/test split:
- VERBAL WORD-PAIRS MODEL, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 67.0 46.0 31.0 56.0
- Contingency 36.0 67.0 49.0 48.0
- Expansion 51.0 34.0 51.0 64.0
- Temporal 39.0 27.0 32.0 102.0
-
- Effectiveness
- precision recall f1
- Comparison 0.35 0.34 0.34
- Contingency 0.39 0.34 0.36
- Temporal 0.38 0.51 0.43
- Expansion 0.31 0.26 0.28
- --------------------------------------------------
- Average 0.36 0.36 0.35
-
- Accuracy: 0.36
- Train set accuracy: 0.96
-
- Feature count: 15679
The huge gap between the train accuracy and the test accuracy is a
reliable indicator of problematic over-fitting. The feature space for
this model is very large relative to the size of the dataset, and it
is therefore also very sparse.
The features of the
Harvard Inquirer model were again pairs of elements derived from
Arg1 and Arg2, but here they were the abstract semantic classes of
those texts, rather than the words themselves.
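The idea can be sketched as follows. Here inquirer_classes is a hypothetical
lexicon mapping each word to its list of Harvard Inquirer category labels;
the real lookup is handled in the experiment code.

    def inquirer_pairs(datum, inquirer_classes):
        """Pair every Inquirer category attested in Arg1 with every one attested in Arg2."""
        cats1, cats2 = set(), set()
        for w in datum.arg1_words():
            cats1.update(inquirer_classes.get(w.upper(), []))
        for w in datum.arg2_words():
            cats2.update(inquirer_classes.get(w.upper(), []))
        feats = {}
        for c1 in cats1:
            for c2 in cats2:
                feats[(c1, c2)] = True  # e.g., ('Our', 'Causal'), as in the lists below
        return feats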
Here is a summary of the effectiveness of this approach on our
train/test split:
- HARVARD INQUIRER MODEL, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 58.0 38.0 41.0 63.0
- Contingency 53.0 51.0 42.0 54.0
- Expansion 56.0 38.0 37.0 69.0
- Temporal 44.0 35.0 34.0 87.0
-
- Effectiveness
- precision recall f1
- Comparison 0.27 0.29 0.28
- Contingency 0.31 0.26 0.28
- Temporal 0.32 0.44 0.37
- Expansion 0.24 0.19 0.21
- --------------------------------------------------
- Average 0.29 0.29 0.29
-
- Accuracy: 0.29
- Train set accuracy: 0.61
-
- Feature count: 3933
This feature set is about four times smaller than the previous one.
Its performance on the test set is not as strong, but it over-fits
less. Still, it would be hard to argue that one of these two
models is better than the other.
Here's an organized list of the top five positive features from
each category. The links go to the Harvard Inquirer page for the
semantic categories.
- Comparison
- 4.769 (Our, Causal)
- 4.372 (ECON, If)
- 4.356 (Self, Space)
- 4.282 (Quan, IAV)
- 4.124 (If, Know)
- Contingency
- 5.826 (Causal, Strong)
- 5.166 (IAV, Strong)
- 5.141 (SV, PtLw)
- 4.871 (Means, You)
- 4.395 (SV, You)
- Expansion
- 6.910 (MALE, PtLw)
- 5.172 (Our, Time@)
- 5.053 (MALE, Rel)
- 4.994 (TimeSpc, ECON)
- 4.450 (You, PtLw)
- Temporal
- 5.665 (Female, Means)
- 4.861 (MALE, Compare)
- 4.750 (Ovrst, Know)
- 4.297 (H4, MALE)
- 4.099 (Causal, MALE)
Here are the results for a feature function that combines the above
models (this was an exercise on
the first page about predicting Implicit relations):
- COMBINED VERBAL WORD-PAIRS AND HARVARD INQUIRER MODEL, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 70.0 48.0 46.0 36.0
- Contingency 31.0 88.0 54.0 27.0
- Expansion 35.0 55.0 64.0 46.0
- Temporal 21.0 41.0 36.0 102.0
-
- Effectiveness
- precision recall f1
- Comparison 0.45 0.35 0.39
- Contingency 0.38 0.44 0.41
- Temporal 0.48 0.51 0.5
- Expansion 0.32 0.32 0.32
- --------------------------------------------------
- Average 0.41 0.41 0.4
-
- Accuracy: 0.41
- Train set accuracy: 1.0
-
- Feature count: 632559
Accuracy and effectiveness are up, but with over 600,000 features, we
memorized the training data and still didn't do all that well in testing.
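For concreteness, the combination just takes the union of the two feature
dictionaries. A minimal sketch, reusing the hypothetical helpers from the
sketches above:

    def combined_features(datum, inquirer_classes):
        """Union of the verbal word-pair and Harvard Inquirer feature sets."""
        feats = verbal_word_pairs(datum)                        # word-pairs sketch above
        feats.update(inquirer_pairs(datum, inquirer_classes))   # Inquirer sketch above
        return feats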
Here's the experiment code, which is well-documented with the team's informal
statements and my own interpretations of them:
High-level overview of the Wugs' features (a rough sketch of two of
them follows the list):
- Negation: features capturing negation balances
and imbalances across the Args, targeting both sentential and
constituent negation.
- Sentiment: A separate sentiment score for each
Arg, representing the sum of the coefficients for all the words in
that Arg (p-value threshold at 0.1).
- Overlap: the cardinality of the intersection of
the Arg1 and Arg2 word sets divided by the cardinality of their union.
- Structural complexity: features capturing, for
each Arg, whether it has an embedded clause, the number of
embedded clauses, and the height of its largest tree.
- Complexity ratios: a feature for the log of the
ratio of the lengths (in words) of the two Args, a feature for the
ratio of the clause-counts for the two Args, and a feature for the
ratio of the max heights for the two Args.
- Pronominal subjects: a pair-feature capturing
whether the subject of each Arg is pronominal (pro) or
non-pronominal (non-pro). The features are pairs from {pro,
non-pro} x {pro, non-pro}.
- It seems: a feature that is True if the first
bigram of Arg2 is it seems, else False.
- Tense agreement: a feature for the degree to
which the verbal nodes in the two Args have the same tense.
- Modals: a pair-feature capturing whether each Arg
contains a modal (modal) or not (non-modal). The features are
pairs from {modal, non-modal} x {modal, non-modal}.
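Here is a rough sketch of two of these features, the overlap score and the
length ratio. The accessors are assumptions, and the feature names simply
echo the ones in the output below; the real definitions are in the team's
experiment code.

    import math

    def wugs_overlap_and_length_ratio(datum):
        feats = {}
        words1 = [w.lower() for w in datum.arg1_words()]  # assumed accessor
        words2 = [w.lower() for w in datum.arg2_words()]  # assumed accessor
        # Overlap: |intersection| / |union| of the two Args' word sets.
        union = set(words1) | set(words2)
        if union:
            feats['Normalized_Shared_Token_Count'] = \
                len(set(words1) & set(words2)) / float(len(union))
        # Complexity ratio: log of the ratio of the Args' lengths in words.
        if words1 and words2:
            feats['Arg_Length_Ratio'] = math.log(len(words1) / float(len(words2)))
        return feats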
Performance summary for the balanced train/test split:
- BANANA WUGS MODEL, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 46.0 50.0 35.0 69.0
- Contingency 37.0 77.0 37.0 49.0
- Expansion 35.0 42.0 46.0 77.0
- Temporal 23.0 36.0 41.0 100.0
-
- Effectiveness
- precision recall f1
- Comparison 0.33 0.23 0.27
- Contingency 0.38 0.39 0.38
- Temporal 0.34 0.5 0.4
- Expansion 0.29 0.23 0.26
- --------------------------------------------------
- Average 0.33 0.34 0.33
-
- Accuracy: 0.34
- Train set accuracy: 0.37
-
- Feature count: 116
-
- Top 5 features for each category
-
- Comparison
- 4.009 Normalized_Shared_Token_Count==<type 'float'>
- 0.564 ('non-neg', 'neg')==True
- 0.427 Arg2_Coef_Sum==<type 'float'>
- 0.381 Sentential_Negation_Balanced==False
- 0.306 Arg_Length_Ratio==<type 'float'>
- Contingency
- 1.130 ('neg', 'neg')==True
- 0.616 Constituent_Negation_Balanced==True
- 0.597 Arg2_Startswith_It_Seems==False
- 0.560 Arg1_Coef_Sum==<type 'float'>
- 0.473 ('modal', 'non-modal')==True
- Expansion
- 0.635 ('neg', 'neg')==True
- 0.360 ('pro-subj', 'pro-subj')==True
- 0.322 Normalized_Shared_Token_Count==<type 'float'>
- 0.231 Constituent_Negation_Balanced==True
- 0.191 Arg2_Startswith_It_Seems==False
- Temporal
- 2.737 Normalized_Shared_Token_Count==<type 'float'>
- 0.928 ('non-neg', 'non-neg')==True
- 0.260 Tree_Height_Ratio==<type 'float'>
- 0.231 ('non-modal', 'non-modal')==True
- 0.153 Constituent_Negation_Balanced==False
Here's the well-documented experiment code:
High-level overview of the Slugs' features (a rough sketch of the
WordNet antonym feature follows the list):
- Negation: for each Arg, a feature for whether
it was negated and a feature for the number of negations it contains. Also, a
feature capturing negation balance/imbalance across the Args.
- Main verbs: for each Arg, a feature for its
main verb. Also, a feature returning True if the two Args' main
verbs match, else False.
- Length ratio: a feature for the ratio of the
lengths (in words) of Arg1 and Arg2.
- WordNet antonyms: the number of words in Arg2
that are antonyms of a word in Arg1.
- Genre: a feature for the genre of the file
containing the example.
- Modals: for each Arg, the number of modals in
it.
- WordNet hypernyms (not used; see below): a
feature (hyp1, hyp2) for every hypernym hyp1 consistent with a
word in Arg1 and every hypernym hyp2 consistent with a word in
Arg2.
- WordNet hypernym counts (written by me; see
below): for Arg1, a feature for the number of words in Arg2 that
are hypernyms of a word in Arg1, and ditto for Arg2.
- N-gram features: for each Arg, a feature for
each unigram it contains. (The team suggested going to 2- or
3-grams, but I called a halt at 1 because the data-set is not that
big.)
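Here is a rough sketch of the WordNet antonym feature, using NLTK's WordNet
interface. The Datum accessors and the feature name are assumptions; the
real definition is in the team's experiment code.

    from nltk.corpus import wordnet as wn

    def antonym_lemmas(word):
        """All antonym lemma names for any sense of word."""
        ants = set()
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                for ant in lemma.antonyms():
                    ants.add(ant.name().lower())
        return ants

    def wordnet_antonym_count(datum):
        """Number of words in Arg2 that are antonyms of some word in Arg1."""
        arg1_ants = set()
        for w in datum.arg1_words():
            arg1_ants.update(antonym_lemmas(w.lower()))
        count = sum(1 for w in datum.arg2_words() if w.lower() in arg1_ants)
        return {'WordNet_Antonym_Count': count}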
Two comments before assessment.
- First, WordNet hypernyms generated so many
features that Python ran out of memory when trying to do the
feature encoding. (It could map each datum to its feature values,
but the MaxEnt model has to map those to a much larger joint
(feature, class) space. That's where Python/NLTK conked out.)
Thus, I replaced this with WordNet hypernym counts,
which I thought would carry a lot of the same information, but
with a radically smaller number of features (a sketch of this
replacement feature follows these comments).
- Second, N-gram features will generate a huge
number of features, even with N=1. Thus, I report results for
models with and without these included.
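Here is a rough sketch of the hypernym-count replacement. Again, the
accessors and feature names are assumptions; a word in one Arg counts if any
of its WordNet synsets is a (transitive) hypernym of some word in the other
Arg.

    from nltk.corpus import wordnet as wn

    def all_hypernyms(word):
        """All synsets that are transitive hypernyms of any sense of word."""
        hyps = set()
        for synset in wn.synsets(word):
            hyps.update(synset.closure(lambda s: s.hypernyms()))
        return hyps

    def wordnet_hypernym_counts(datum):
        arg1_words = [w.lower() for w in datum.arg1_words()]  # assumed accessor
        arg2_words = [w.lower() for w in datum.arg2_words()]  # assumed accessor
        arg1_hyps = set()
        for w in arg1_words:
            arg1_hyps.update(all_hypernyms(w))
        arg2_hyps = set()
        for w in arg2_words:
            arg2_hyps.update(all_hypernyms(w))
        return {
            'Arg2_Hypernym_Of_Arg1_Count': sum(1 for w in arg2_words
                                               if set(wn.synsets(w)) & arg1_hyps),
            'Arg1_Hypernym_Of_Arg2_Count': sum(1 for w in arg1_words
                                               if set(wn.synsets(w)) & arg2_hyps)}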
Let's first look at the results without the 1-gram
features:
- BANANA SLUGS MODEL, NO UNIGRAM FEATURES, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 56.0 48.0 49.0 47.0
- Contingency 32.0 78.0 55.0 35.0
- Expansion 34.0 57.0 74.0 35.0
- Temporal 31.0 27.0 47.0 95.0
-
- Effectiveness
- precision recall f1
- Comparison 0.37 0.28 0.32
- Contingency 0.37 0.39 0.38
- Temporal 0.45 0.48 0.46
- Expansion 0.33 0.37 0.35
- --------------------------------------------------
- Average 0.38 0.38 0.38
-
- Accuracy: 0.38
- Train set accuracy: 0.73
-
- Feature count: 1824
-
- Top 5 features for each category
-
- Comparison
- 6.844 Arg2_Main_Verb_provide==True
- 6.837 Arg2_Main_Verb_earned==True
- 6.765 Arg1_Main_Verb_fled==True
- 6.712 Arg1_Main_Verb_totaled==True
- 6.433 Arg1_Main_Verb_result==True
- Contingency
- 7.577 Arg2_Main_Verb_raises==True
- 7.185 Arg2_Main_Verb_believe==True
- 7.001 Arg1_Main_Verb_help==True
- 6.660 Arg1_Main_Verb_bristle==True
- 6.425 Arg1_Main_Verb_leapt==True
- Expansion
- 6.880 Arg2_Main_Verb_include==True
- 6.749 Arg2_Main_Verb_shows==True
- 6.642 Arg1_Main_Verb_prompted==True
- 6.494 Arg2_Main_Verb_tumbled==True
- 6.430 Arg2_Main_Verb_drag==True
- Temporal
- 7.899 Arg1_Main_Verb_joined==True
- 7.279 Arg2_Main_Verb_begin==True
- 7.239 Arg1_Main_Verb_begins==True
- 7.142 Arg2_Main_Verb_managed==True
- 7.073 Arg2_Main_Verb_remains==True
And now with the 1-gram features added in:
- BANANA SLUGS MODEL, WITH UNIGRAM FEATURES, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 67.0 38.0 63.0 32.0
- Contingency 49.0 72.0 56.0 23.0
- Expansion 39.0 55.0 66.0 40.0
- Temporal 34.0 33.0 40.0 93.0
-
- Effectiveness
- precision recall f1
- Comparison 0.35 0.34 0.34
- Contingency 0.36 0.36 0.36
- Temporal 0.49 0.47 0.48
- Expansion 0.29 0.33 0.31
- --------------------------------------------------
- Average 0.38 0.37 0.37
-
- Accuracy: 0.37
- Train set accuracy: 1.0
-
- Feature count: 28758
Nearly indistinguishable accuracy and effectiveness, but with a
much larger model and a lot of over-fitting! I say we toss out the
unigrams.
Here's my hypothesis about why unigram features are particularly
bad for this application: the PDTB has instances where the same
sentence, or largely overlapping portions of the same sentence, are
repeated with a focus on different connectives and, in turn, with
different coherence semantics. The model thus gets a lot of
contradictory word-level information.
One more model: the non-unigrams model is dominated by main-verb
features. This obscures the contributions of the others. So I ran one
more Slugs variation: a model without unigram or main-verb
features. It turns out to be the smallest model of the whole
competition, at 86 features, and it is still pretty competitive.
- BANANA SLUGS MODEL, NO UNIGRAM OR MAIN-VERB FEATURES, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 50.0 44.0 36.0 70.0
- Contingency 29.0 58.0 45.0 68.0
- Expansion 23.0 35.0 48.0 94.0
- Temporal 27.0 24.0 25.0 124.0
-
- Effectiveness
- precision recall f1
- Comparison 0.39 0.25 0.3
- Contingency 0.36 0.29 0.32
- Temporal 0.35 0.62 0.45
- Expansion 0.31 0.24 0.27
- --------------------------------------------------
- Average 0.35 0.35 0.34
-
- Accuracy: 0.35
- Train set accuracy: 0.34
-
- Feature count: 86
-
- Top 5 features for each category
-
- Comparison
- 0.500 Main_Verb_Match==True
- 0.292 ('non-neg', 'neg')==True
- 0.277 Arg_Length_Ratio==<type 'float'>
- 0.169 Genre_ptb-errata==True
- 0.153 Arg1_Modal_Count==<type 'int'>
- Contingency
- 0.362 Arg1_Negated==True
- 0.355 Arg2_Negated==True
- 0.299 Genre_ptb-essays==True
- 0.299 Arg1_Negation_Count==<type 'int'>
- 0.244 ('neg', 'neg')==True
- Expansion
- 0.443 Genre_ptb-highlights==True
- 0.314 ('non-neg', 'non-neg')==True
- 0.284 Arg2_Negation_Count==<type 'int'>
- 0.255 Arg1_Negation_Count==<type 'int'>
- 0.120 ('neg', 'neg')==True
- Temporal
- 0.476 Main_Verb_Match==False
- 0.208 Calendar_Words_Count==<type 'int'>
- 0.083 Arg_Length_Ratio==<type 'float'>
- 0.053 Main_Verb_Match==True
- 0.014 Genre_ptb-errata==True
I think there is no clear winner; both teams are winners! Some considerations:
- The Slugs' accuracy was 37% (38% without unigram features),
whereas the Wugs' was 34%. Thus, the Slugs got roughly 30 more of the
800 test examples correct than the Wugs did.
- However, the Slugs had training accuracy of 100% (over-fitting)
for their unigrams model and 73% for their non-unigrams model.
The gap between these numbers and the test accuracy is
worrisome. In contrast, the Wugs had just 37% accuracy on the training
data.
- The Wugs model has only 116 features, compared to 1824 for the Slugs
without unigrams and 28758 with them. However, both teams had
around a dozen feature functions; it's just that some of the Slugs'
functions generated large numbers of features.
The results from all the runs reported above, including all
features and their weights, are in
this directory.
The following code implements a simple merger of the Slugs and Wugs
models: the union of the two feature sets, but without the unigram
features:
Here are the summary performance numbers for this model:
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 56.0 45.0 49.0 50.0
- Contingency 45.0 78.0 52.0 25.0
- Expansion 46.0 47.0 64.0 43.0
- Temporal 31.0 27.0 50.0 92.0
-
- Effectiveness
- precision recall f1
- Comparison 0.31 0.28 0.3
- Contingency 0.4 0.39 0.39
- Temporal 0.44 0.46 0.45
- Expansion 0.3 0.32 0.31
- --------------------------------------------------
- Average 0.36 0.36 0.36
-
- Accuracy: 0.36
- Train set accuracy: 0.74
-
- Feature count: 1920
It's surprising that this resulted in a drop in performance from
the Slugs model without unigrams. A more intelligent merger of the
two models might produce better overall results. See
exercise GLM for
data that might help with this merger project.
FEATURES The results
directory contains the raw output of the assessment function.
It also contains a command-line
program pdtb_view_top_features.py
that facilitates exploring the feature weights. Basic usage:
python pdtb_view_top_features.py -f foo
where foo is one of the results filenames.
By default, it prints out the top 5 features for each class, restricting
attention to features with positive weights. The following returns 10
features for each class and includes negative weights:
python pdtb_view_top_features.py -f foo -n 10 -k
The keyword arguments can be given in any order.
Use this program to compare, in general terms, the top features
from each team's model. How are they alike, how are they
different, and why? Also, do you notice any shared or nearly
shared features that differ markedly across teams (say,
because their weights have contrasting polarity for a given
class)?
NEGS Both teams used
the negation features
(neg, non-neg),
(neg, neg),
(non-neg, neg),
and (non-neg, non-neg)
indicating the negation balance between the two Args. To what
extent is the behavior (in)consistent across their models?
GLM Classifier models
often obscure the contribution of individual features, which can
be disheartening, especially for linguists inclined to lovingly
craft their features based on specific scientific intuitions.
Thus, it is often smart to balance classifier results with
more fine-grained visualization and statistical modeling.
To facilitate such study, each team's feature function
file
(Banana Slugs,
Banana Wugs)
contains a function model_to_csv()
that, limiting attention to the training data, creates a CSV file
for all of the relatively constrained features (that is, excluding
n-gram features, hypernym-pair features, etc. — the ones
that would result in tens of thousands of columns). This function
calls on the general facilities in
pdtb_competition_model_to_csv.py,
which you should also download if you want to create the files
yourself.
Direct links to the CSV files:
The leftmost column is the semantic class, and the remaining
columns represent features. This format makes it easy to identify
associations between feature values and classes using plotting
functions and statistical modeling.
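For example, assuming pandas is available, something like the following gives
a quick first look (the filename is a placeholder for whichever CSV you
download or generate):

    import pandas as pd

    df = pd.read_csv('slugs_training_features.csv')  # placeholder filename
    class_col = df.columns[0]  # the leftmost column holds the semantic class

    # Mean of each numeric feature, broken down by semantic class.
    print(df.groupby(class_col).mean(numeric_only=True))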
See what you can find in the CSV file that could inform our
understanding of the current models and help us build better
ones!