Penn Discourse Treebank 2 Competition
- Overview
- Train/test split
- The classifier set-up and interface
- Team Potts?
- Verbal word-pairs baseline
- Harvard Inquirer baseline
- Verbal word-pairs and Harvard Inquirer combined
- Team Banana Wugs
- Team Banana Slugs
- And the winner is ...
- Addendum: The combined Slugs–Wugs model
- Exercises
In the second half of class on July 29, we broke up into teams
(originally there were three, but two formed an alliance):
- Team Banana Slugs
- Team Banana Wugs
Each team developed a probabilistic model of the semantics of
Implicit coherence relations in the Penn Discourse Treebank 2
(see Pitler et al. 2009
and also this preliminary discussion).
In essence, the goal was to accurately predict the primary semantic
class (Comparison, Contingency, Expansion, Temporal) for Implicit
examples.
I implemented the teams' theories to the best of my ability using
Python/NLTK. I also defined a balanced train/test split for
evaluation.
This page describes the results of the competition. Read on to find
out who won ...
Code/data listing
The train/test split is released as two pickle files giving sets of
indices.
The indices correspond to what one gets when using the
pdtb.CorpusReader method
iter_data() to go through the
corpus. They also correspond to the row numbers in the CSV file,
starting from 0 and ignoring the header.
- Training set of 2400 examples: 600 randomly chosen examples from
each of the four primary semantic classes that we are trying to
predict.
- Test set of 800 examples: 200 randomly chosen examples from each
of the four primary semantic classes that we are trying to
predict.
These sets are relatively small. There are only 826 Temporal
examples in the Implicit part of the corpus, which puts an upper bound
on the sizes given the goal of having balanced datasets.
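To make the indexing concrete, here is a minimal sketch of how the pickled
index sets might be used to pull out the corresponding Datum objects. The
pickle and CSV filenames below are placeholders, and the CorpusReader calls
are assumptions based on the description above.

    import pickle
    from pdtb import CorpusReader  # the course's pdtb.py module

    train_indices = pickle.load(open('pdtb-train-indices.pickle', 'rb'))  # placeholder filename
    test_indices = pickle.load(open('pdtb-test-indices.pickle', 'rb'))    # placeholder filename

    # Row i of the CSV (header excluded) is the i-th datum yielded by iter_data().
    corpus = CorpusReader('pdtb2.csv')  # placeholder filename
    train, test = [], []
    for i, datum in enumerate(corpus.iter_data()):
        if i in train_indices:
            train.append(datum)
        elif i in test_indices:
            test.append(datum)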
The classifier interface is the same as the one described on the
first page about predicting Implicit relations.
The model is a maximum entropy (MaxEnt) classifier in which
features can be boolean-valued, integer-valued, or float-valued.
MaxEnt classifiers are close cousins of the generalized linear models
used for the IMDB review data.
For the current task, the dependent variable is the primary
semantic class of the example (Comparison, Contingency, Expansion,
Temporal). Each team came up with a bunch of feature functions, each
of which maps a datum instance to a
property/feature of that example. The model uses the training data to
estimate weights for those features. The effectiveness of the
features and associated weights is evaluated on the test data.
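To make this concrete, here is a minimal sketch of what a feature function
and the training step look like. The Datum accessors and feature names are
hypothetical stand-ins; the real interface is defined in the files listed
below.

    from nltk.classify import MaxentClassifier
    from nltk.classify.maxent import TypedMaxentFeatureEncoding

    def my_feature_function(datum):
        """Map a PDTB Datum to a dict of boolean-, int-, or float-valued features."""
        arg1 = datum.arg1_words()  # assumed accessor for Arg1's word list
        arg2 = datum.arg2_words()  # assumed accessor for Arg2's word list
        return {
            'Arg_Length_Ratio': float(len(arg1)) / max(len(arg2), 1),
            'Arg2_Contains_Not': 'not' in [w.lower() for w in arg2]}

    def train_model(labeled_data):
        """labeled_data: iterable of (Datum, primary-semantic-class) pairs."""
        train_toks = [(my_feature_function(d), label) for d, label in labeled_data]
        # A typed encoding lets int/float feature values be treated numerically
        # rather than as atomic symbols.
        encoding = TypedMaxentFeatureEncoding.train(train_toks)
        return MaxentClassifier.train(train_toks, algorithm='iis',
                                      encoding=encoding, max_iter=10)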
The competition is defined by the following code:
- pdtb_classifier.py:
interface to the NLTK MaxEnt classifier, specialized for this task,
with a lot of methods for assessment
- pdtb_competition.py:
code for training, testing, and assessing the teams' models using
pdtb_classifier.py.
I wasn't officially part of the competition (that wouldn't be
fair, since I handled the implementation). However, before the
teams were formed, I did briefly assess two different kinds of
feature function, adapted from the model developed by
Pitler et al. 2009.
I assessed them against random train/test splits. How do they
do against our fixed, balanced train/test split?
The verbal word-pairs experiment
had a feature set consisting entirely of verb-pairs (including modals), where
V1 was drawn from Arg1 and V2 from Arg2.
Its accuracy for the random train/test split was about 41%, but
this is not a very useful evaluation number in light of the highly
imbalanced nature of the corpus's primary semantic classes.
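For reference, here is a rough sketch of how such a feature space can be
generated. The accessors arg1_pos() and arg2_pos(), assumed to return
(word, POS-tag) pairs for each Arg, are stand-ins for the real interface in
the experiment code.

    def verbal_word_pairs(datum):
        """Pair every verbal word (verb or modal) in Arg1 with every one in Arg2."""
        verb_tags = ('VB', 'MD')  # Penn tags VB, VBD, VBG, VBN, VBP, VBZ, plus MD
        v1s = set(w.lower() for w, t in datum.arg1_pos() if t.startswith(verb_tags))
        v2s = set(w.lower() for w, t in datum.arg2_pos() if t.startswith(verb_tags))
        feats = {}
        for v1 in v1s:
            for v2 in v2s:
                feats['VerbPair_%s_%s' % (v1, v2)] = True
        return feats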
Here's a summary of its performance on our train/test split:
- VERBAL WORD-PAIRS MODEL, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 67.0 46.0 31.0 56.0
- Contingency 36.0 67.0 49.0 48.0
- Expansion 51.0 34.0 51.0 64.0
- Temporal 39.0 27.0 32.0 102.0
-
- Effectiveness
- precision recall f1
- Comparison 0.35 0.34 0.34
- Contingency 0.39 0.34 0.36
- Temporal 0.38 0.51 0.43
- Expansion 0.31 0.26 0.28
- --------------------------------------------------
- Average 0.36 0.36 0.35
-
- Accuracy: 0.36
- Train set accuracy: 0.96
-
- Feature count: 15679
The huge gap between the train accuracy and the test accuracy is a
reliable indicator of problematic over-fitting. The feature space for
this model is very large relative to the size of the dataset, and it
is therefore also very sparse.
The features of the
Harvard Inquirer model were again pairs of elements derived from
Arg1 and Arg2, but here they were the abstract semantic classes of
those texts, rather than the words themselves.
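The idea can be sketched as follows. Here inquirer_classes is a hypothetical
lexicon mapping each word to its list of Harvard Inquirer category labels;
the real lookup is handled in the experiment code.

    def inquirer_pairs(datum, inquirer_classes):
        """Pair every Inquirer category attested in Arg1 with every one attested in Arg2."""
        cats1, cats2 = set(), set()
        for w in datum.arg1_words():
            cats1.update(inquirer_classes.get(w.upper(), []))
        for w in datum.arg2_words():
            cats2.update(inquirer_classes.get(w.upper(), []))
        feats = {}
        for c1 in cats1:
            for c2 in cats2:
                feats[(c1, c2)] = True  # e.g., ('Our', 'Causal'), as in the lists below
        return feats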
Here is a summary of the effectiveness of this approach on our
train/test split:
- HARVARD INQUIRER MODEL, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 58.0 38.0 41.0 63.0
- Contingency 53.0 51.0 42.0 54.0
- Expansion 56.0 38.0 37.0 69.0
- Temporal 44.0 35.0 34.0 87.0
-
- Effectiveness
- precision recall f1
- Comparison 0.27 0.29 0.28
- Contingency 0.31 0.26 0.28
- Temporal 0.32 0.44 0.37
- Expansion 0.24 0.19 0.21
- --------------------------------------------------
- Average 0.29 0.29 0.29
-
- Accuracy: 0.29
- Train set accuracy: 0.61
-
- Feature count: 3933
This feature set is about four times smaller than the previous one.
Its performance on the test set is not as strong, but it over-fits
less. Still, it would be hard to argue that one of these two
models is better than the other.
Here's an organized list of the top five positive features from
each category. The links go to the Harvard Inquirer page for the
semantic categories.
- Comparison
- 4.769 (Our, Causal)
- 4.372 (ECON, If)
- 4.356 (Self, Space)
- 4.282 (Quan, IAV)
- 4.124 (If, Know)
- Contingency
- 5.826 (Causal, Strong)
- 5.166 (IAV, Strong)
- 5.141 (SV, PtLw)
- 4.871 (Means, You)
- 4.395 (SV, You)
- Expansion
- 6.910 (MALE, PtLw)
- 5.172 (Our, Time@)
- 5.053 (MALE, Rel)
- 4.994 (TimeSpc, ECON)
- 4.450 (You, PtLw)
- Temporal
- 5.665 (Female, Means)
- 4.861 (MALE, Compare)
- 4.750 (Ovrst, Know)
- 4.297 (H4, MALE)
- 4.099 (Causal, MALE)
Here are the results for a feature function that combines the above
models (this was an exercise on
the first page about predicting Implicit relations):
- COMBINED VERBAL WORD-PAIRS AND HARVARD INQUIRER MODEL, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 70.0 48.0 46.0 36.0
- Contingency 31.0 88.0 54.0 27.0
- Expansion 35.0 55.0 64.0 46.0
- Temporal 21.0 41.0 36.0 102.0
-
- Effectiveness
- precision recall f1
- Comparison 0.45 0.35 0.39
- Contingency 0.38 0.44 0.41
- Temporal 0.48 0.51 0.5
- Expansion 0.32 0.32 0.32
- --------------------------------------------------
- Average 0.41 0.41 0.4
-
- Accuracy: 0.41
- Train set accuracy: 1.0
-
- Feature count: 632559
Accuracy and effectiveness are up, but with over 600,000 features, we
memorized the training data and still didn't do all that well in testing.
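For concreteness, the combination just takes the union of the two feature
dictionaries. A minimal sketch, reusing the hypothetical helpers from the
sketches above:

    def combined_features(datum, inquirer_classes):
        """Union of the verbal word-pair and Harvard Inquirer feature sets."""
        feats = verbal_word_pairs(datum)                        # word-pairs sketch above
        feats.update(inquirer_pairs(datum, inquirer_classes))   # Inquirer sketch above
        return feats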
Here's the experiment code, which is well-documented with the team's informal
statements and my own interpretations of them:
High-level overview of the Wugs' features (a rough sketch of two of
them follows the list):
- Negation: features capturing negation balances
and imbalances across the Args, targeting both sentential and
constituent negation.
- Sentiment: A separate sentiment score for each
Arg, representing the sum of the coefficients for all the words in
that Arg (p-value threshold at 0.1).
- Overlap: the cardinality of the intersection of
the Arg1 and Arg2 word sets divided by the cardinality of their union.
- Structural complexity: features capturing, for
each Arg, whether it has an embedded clause, the number of
embedded clauses, and the height of its largest tree.
- Complexity ratios: a feature for the log of the
ratio of the lengths (in words) of the two Args, a feature for the
ratio of the clause-counts for the two Args, and a feature for the
ratio of the max heights for the two Args.
- Pronominal subjects: a pair-feature capturing
whether the subject of each Arg is pronominal (pro) or
non-pronominal (non-pro). The features are pairs from {pro,
non-pro} x {pro, non-pro}.
- It seems: a feature that is True if the first
bigram of Arg2 is it seems, else False.
- Tense agreement: a feature for the degree to
which the verbal nodes in the two Args have the same tense.
- Modals: a pair-feature capturing whether each Arg
contains a modal (modal) or not (non-modal). The features are
pairs from {modal, non-modal} x {modal, non-modal}.
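Here is a rough sketch of two of these features, the overlap score and the
length ratio. The accessors are assumptions, and the feature names simply
echo the ones in the output below; the real definitions are in the team's
experiment code.

    import math

    def wugs_overlap_and_length_ratio(datum):
        feats = {}
        words1 = [w.lower() for w in datum.arg1_words()]  # assumed accessor
        words2 = [w.lower() for w in datum.arg2_words()]  # assumed accessor
        # Overlap: |intersection| / |union| of the two Args' word sets.
        union = set(words1) | set(words2)
        if union:
            feats['Normalized_Shared_Token_Count'] = \
                len(set(words1) & set(words2)) / float(len(union))
        # Complexity ratio: log of the ratio of the Args' lengths in words.
        if words1 and words2:
            feats['Arg_Length_Ratio'] = math.log(len(words1) / float(len(words2)))
        return feats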
Performance summary for the balanced train/test split:
- BANANA WUGS MODEL, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 46.0 50.0 35.0 69.0
- Contingency 37.0 77.0 37.0 49.0
- Expansion 35.0 42.0 46.0 77.0
- Temporal 23.0 36.0 41.0 100.0
-
- Effectiveness
- precision recall f1
- Comparison 0.33 0.23 0.27
- Contingency 0.38 0.39 0.38
- Temporal 0.34 0.5 0.4
- Expansion 0.29 0.23 0.26
- --------------------------------------------------
- Average 0.33 0.34 0.33
-
- Accuracy: 0.34
- Train set accuracy: 0.37
-
- Feature count: 116
-
- Top 5 features for each category
-
- Comparison
- 4.009 Normalized_Shared_Token_Count==<type 'float'>
- 0.564 ('non-neg', 'neg')==True
- 0.427 Arg2_Coef_Sum==<type 'float'>
- 0.381 Sentential_Negation_Balanced==False
- 0.306 Arg_Length_Ratio==<type 'float'>
- Contingency
- 1.130 ('neg', 'neg')==True
- 0.616 Constituent_Negation_Balanced==True
- 0.597 Arg2_Startswith_It_Seems==False
- 0.560 Arg1_Coef_Sum==<type 'float'>
- 0.473 ('modal', 'non-modal')==True
- Expansion
- 0.635 ('neg', 'neg')==True
- 0.360 ('pro-subj', 'pro-subj')==True
- 0.322 Normalized_Shared_Token_Count==<type 'float'>
- 0.231 Constituent_Negation_Balanced==True
- 0.191 Arg2_Startswith_It_Seems==False
- Temporal
- 2.737 Normalized_Shared_Token_Count==<type 'float'>
- 0.928 ('non-neg', 'non-neg')==True
- 0.260 Tree_Height_Ratio==<type 'float'>
- 0.231 ('non-modal', 'non-modal')==True
- 0.153 Constituent_Negation_Balanced==False
Here's the well-documented experiment code:
High-level overview of the Slugs' features (a rough sketch of the
WordNet antonym feature follows the list):
- Negation: for each Arg, a feature for whether
it was negated and a feature for the number of negations it contains. Also, a
feature capturing negation balance/imbalance across the Args.
- Main verbs: for each Arg, a feature for its
main verb. Also, a feature returning True if the two Args' main
verbs match, else False.
- Length ratio: a feature for the ratio of the
lengths (in words) of Arg1 and Arg2.
- WordNet antonyms: the number of words in Arg2
that are antonyms of a word in Arg1.
- Genre: a feature for the genre of the file
containing the example.
- Modals: for each Arg, the number of modals in
it.
- WordNet hypernyms (not used; see below): a
feature (hyp1, hyp2) for every hypernym hyp1 consistent with a
word in Arg1 and every hypernym hyp2 consistent with a word in
Arg2.
- WordNet hypernym counts (written by me; see
below): for Arg1, a feature for the number of words in Arg2 that
are hypernyms of a word in Arg1, and ditto for Arg2.
- N-gram features: for each Arg, a feature for
each unigram it contains. (The team suggested going to 2- or
3-grams, but I called a halt at 1 because the data-set is not that
big.)
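Here is a rough sketch of the WordNet antonym feature, using NLTK's WordNet
interface. The Datum accessors and the feature name are assumptions; the
real definition is in the team's experiment code.

    from nltk.corpus import wordnet as wn

    def antonym_lemmas(word):
        """All antonym lemma names for any sense of word."""
        ants = set()
        for synset in wn.synsets(word):
            for lemma in synset.lemmas():
                for ant in lemma.antonyms():
                    ants.add(ant.name().lower())
        return ants

    def wordnet_antonym_count(datum):
        """Number of words in Arg2 that are antonyms of some word in Arg1."""
        arg1_ants = set()
        for w in datum.arg1_words():
            arg1_ants.update(antonym_lemmas(w.lower()))
        count = sum(1 for w in datum.arg2_words() if w.lower() in arg1_ants)
        return {'WordNet_Antonym_Count': count}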
Two comments before assessment.
- First, WordNet hypernyms generated so many
features that Python ran out of memory when trying to do the
feature encoding. (It could map each datum to its feature values,
but the MaxEnt model has to map those to a much larger joint
(feature, class) space. That's where Python/NLTK conked out.)
Thus, I replaced this with WordNet hypernym counts,
which I thought would carry a lot of the same information, but
with a radically smaller number of features (a sketch of this
replacement feature follows these comments).
- Second, N-gram features will generate a huge
number of features, even with N=1. Thus, I report results for
models with and without these included.
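Here is a rough sketch of the hypernym-count replacement. Again, the
accessors and feature names are assumptions; a word in one Arg counts if any
of its WordNet synsets is a (transitive) hypernym of some word in the other
Arg.

    from nltk.corpus import wordnet as wn

    def all_hypernyms(word):
        """All synsets that are transitive hypernyms of any sense of word."""
        hyps = set()
        for synset in wn.synsets(word):
            hyps.update(synset.closure(lambda s: s.hypernyms()))
        return hyps

    def wordnet_hypernym_counts(datum):
        arg1_words = [w.lower() for w in datum.arg1_words()]  # assumed accessor
        arg2_words = [w.lower() for w in datum.arg2_words()]  # assumed accessor
        arg1_hyps = set()
        for w in arg1_words:
            arg1_hyps.update(all_hypernyms(w))
        arg2_hyps = set()
        for w in arg2_words:
            arg2_hyps.update(all_hypernyms(w))
        return {
            'Arg2_Hypernym_Of_Arg1_Count': sum(1 for w in arg2_words
                                               if set(wn.synsets(w)) & arg1_hyps),
            'Arg1_Hypernym_Of_Arg2_Count': sum(1 for w in arg1_words
                                               if set(wn.synsets(w)) & arg2_hyps)}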
Let's first look at the results without the 1-gram
features:
- BANANA SLUGS MODEL, NO UNIGRAM FEATURES, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 56.0 48.0 49.0 47.0
- Contingency 32.0 78.0 55.0 35.0
- Expansion 34.0 57.0 74.0 35.0
- Temporal 31.0 27.0 47.0 95.0
-
- Effectiveness
- precision recall f1
- Comparison 0.37 0.28 0.32
- Contingency 0.37 0.39 0.38
- Temporal 0.45 0.48 0.46
- Expansion 0.33 0.37 0.35
- --------------------------------------------------
- Average 0.38 0.38 0.38
-
- Accuracy: 0.38
- Train set accuracy: 0.73
-
- Feature count: 1824
-
- Top 5 features for each category
-
- Comparison
- 6.844 Arg2_Main_Verb_provide==True
- 6.837 Arg2_Main_Verb_earned==True
- 6.765 Arg1_Main_Verb_fled==True
- 6.712 Arg1_Main_Verb_totaled==True
- 6.433 Arg1_Main_Verb_result==True
- Contingency
- 7.577 Arg2_Main_Verb_raises==True
- 7.185 Arg2_Main_Verb_believe==True
- 7.001 Arg1_Main_Verb_help==True
- 6.660 Arg1_Main_Verb_bristle==True
- 6.425 Arg1_Main_Verb_leapt==True
- Expansion
- 6.880 Arg2_Main_Verb_include==True
- 6.749 Arg2_Main_Verb_shows==True
- 6.642 Arg1_Main_Verb_prompted==True
- 6.494 Arg2_Main_Verb_tumbled==True
- 6.430 Arg2_Main_Verb_drag==True
- Temporal
- 7.899 Arg1_Main_Verb_joined==True
- 7.279 Arg2_Main_Verb_begin==True
- 7.239 Arg1_Main_Verb_begins==True
- 7.142 Arg2_Main_Verb_managed==True
- 7.073 Arg2_Main_Verb_remains==True
And now with the 1-gram features added in:
- BANANA SLUGS MODEL, WITH UNIGRAM FEATURES, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 67.0 38.0 63.0 32.0
- Contingency 49.0 72.0 56.0 23.0
- Expansion 39.0 55.0 66.0 40.0
- Temporal 34.0 33.0 40.0 93.0
-
- Effectiveness
- precision recall f1
- Comparison 0.35 0.34 0.34
- Contingency 0.36 0.36 0.36
- Temporal 0.49 0.47 0.48
- Expansion 0.29 0.33 0.31
- --------------------------------------------------
- Average 0.38 0.37 0.37
-
- Accuracy: 0.37
- Train set accuracy: 1.0
-
- Feature count: 28758
Nearly indistinguishable accuracy and effectiveness, but with a
much larger model and a lot of over-fitting! I say we toss out the
unigrams.
Here's my hypothesis about why unigram features are particularly
bad for this application: the PDTB has instances where the same
sentence, or largely overlapping portions of the same sentence, are
repeated with a focus on different connectives and, in turn, with
different coherence semantics. The model thus gets a lot of
contradictory word-level information.
One more model: the non-unigrams model is dominated by main-verb
features. This obscures the contributions of the others. So I ran one
more Slugs variation: a model without unigram or main-verb
features. It turns out to be the smallest model of the whole
competition, at 86 features, and it is still pretty competitive.
- BANANA SLUGS MODEL, NO UNIGRAM OR MAIN-VERB FEATURES, BALANCED TRAIN-TEST SPLIT
-
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 50.0 44.0 36.0 70.0
- Contingency 29.0 58.0 45.0 68.0
- Expansion 23.0 35.0 48.0 94.0
- Temporal 27.0 24.0 25.0 124.0
-
- Effectiveness
- precision recall f1
- Comparison 0.39 0.25 0.3
- Contingency 0.36 0.29 0.32
- Temporal 0.35 0.62 0.45
- Expansion 0.31 0.24 0.27
- --------------------------------------------------
- Average 0.35 0.35 0.34
-
- Accuracy: 0.35
- Train set accuracy: 0.34
-
- Feature count: 86
-
- Top 5 features for each category
-
- Comparison
- 0.500 Main_Verb_Match==True
- 0.292 ('non-neg', 'neg')==True
- 0.277 Arg_Length_Ratio==<type 'float'>
- 0.169 Genre_ptb-errata==True
- 0.153 Arg1_Modal_Count==<type 'int'>
- Contingency
- 0.362 Arg1_Negated==True
- 0.355 Arg2_Negated==True
- 0.299 Genre_ptb-essays==True
- 0.299 Arg1_Negation_Count==<type 'int'>
- 0.244 ('neg', 'neg')==True
- Expansion
- 0.443 Genre_ptb-highlights==True
- 0.314 ('non-neg', 'non-neg')==True
- 0.284 Arg2_Negation_Count==<type 'int'>
- 0.255 Arg1_Negation_Count==<type 'int'>
- 0.120 ('neg', 'neg')==True
- Temporal
- 0.476 Main_Verb_Match==False
- 0.208 Calendar_Words_Count==<type 'int'>
- 0.083 Arg_Length_Ratio==<type 'float'>
- 0.053 Main_Verb_Match==True
- 0.014 Genre_ptb-errata==True
I think there is no clear winner; both teams are winners! Some considerations:
- The Slugs' accuracy was 37% (38% without unigram features),
whereas the Wugs' was 34%. Thus, the Slugs got roughly 30 more of the
800 test examples correct than the Wugs did.
- However, the Slugs had training accuracy of 100% (over-fitting)
for their unigrams model and 73% for their non-unigrams model.
The gap between these numbers and the test accuracy is
worrisome. In contrast, the Wugs had just 37% accuracy on the training
data.
- The Wugs model has only 116 features, compared to 1824 for the Slugs
without unigrams and 28758 with them. However, both teams had
around a dozen feature functions; it's just that some of the Slugs'
functions generated large numbers of features.
The results from all the runs reported above, including all
features and their weights, are in
this directory.
The following code implements a simple merger of the Slugs and Wugs
models: the union of the two feature sets, but without the unigram
features:
Here are the summary performance numbers for this model:
- Confusion matrix
- Comparison Contingency Expansion Temporal
- Comparison 56.0 45.0 49.0 50.0
- Contingency 45.0 78.0 52.0 25.0
- Expansion 46.0 47.0 64.0 43.0
- Temporal 31.0 27.0 50.0 92.0
-
- Effectiveness
- precision recall f1
- Comparison 0.31 0.28 0.3
- Contingency 0.4 0.39 0.39
- Temporal 0.44 0.46 0.45
- Expansion 0.3 0.32 0.31
- --------------------------------------------------
- Average 0.36 0.36 0.36
-
- Accuracy: 0.36
- Train set accuracy: 0.74
-
- Feature count: 1920
It's surprising that this resulted in a drop in performance from
the Slugs model without unigrams. A more intelligent merger of the
two models might produce better overall results. See
exercise GLM for
data that might help with this merger project.
FEATURES The results
directory contains the raw output of the assessment function.
It also contains a command-line
program pdtb_view_top_features.py
that facilitates exploring the feature weights. Basic usage:
python pdtb_view_top_features.py -f foo
where foo is one of the results filenames.
By default, it prints out the top 5 features for each class, restricting
attention to features with positive weights. The following returns 10
features for each class and includes negative weights:
python pdtb_view_top_features.py -f foo -n 10 -k
The keyword arguments can be given in any order.
Use this program to compare, in general terms, the top features
from each team's model. How are they alike, how are they
different, and why? Also, do you notice any shared or nearly
shared features that differ markedly across teams (say,
because their weights have contrasting polarity for a given
class)?
NEGS Both teams used
the negation features
(neg, non-neg),
(neg, neg),
(non-neg, neg),
and (non-neg, non-neg)
indicating the negation balance between the two Args. To what
extent is the behavior (in)consistent across their models?
GLM Classifier models
often obscure the contribution of individual features, which can
be disheartening, especially for linguists inclined to lovingly
craft their features based on specific scientific intuitions.
Thus, it is often smart to balance classifier results with
more fine-grained visualization and statistical modeling.
To facilitate such study, each team's feature function
file
(Banana Slugs,
Banana Wugs)
contains a function model_to_csv()
that, limiting attention to the training data, creates a CSV file
for all of the relatively constrained features (that is, excluding
n-gram features, hypernym-pair features, etc. — the ones
that would result in tens of thousands of columns). This function
calls on the general facilities in
pdtb_competition_model_to_csv.py,
which you should also download if you want to create the files
yourself.
Direct links to the CSV files:
The leftmost column is the semantic class, and the remaining
columns represent features. This format makes it easy to identify
associations between feature values and classes using plotting
functions and statistical modeling.
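For example, assuming pandas is available, something like the following gives
a quick first look (the filename is a placeholder for whichever CSV you
download or generate):

    import pandas as pd

    df = pd.read_csv('slugs_training_features.csv')  # placeholder filename
    class_col = df.columns[0]  # the leftmost column holds the semantic class

    # Mean of each numeric feature, broken down by semantic class.
    print(df.groupby(class_col).mean(numeric_only=True))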
See what you can find in the CSV file that could inform our
understanding of the current models and help us build better
ones!