The goal of this section is to provide a framework for running (probabilistic) experiments that attempt to predict the semantics of implicit discourse coherence relations in the Penn Discourse Treebank 2.0.
I provide two detailed illustrations of the kinds of predictors we might use. Their success is limited, but that's okay. My hope is that seeing these examples will help you to implement your own hypotheses.
To warm up, I suggest downloading the following file and studying it:
This file contains 30 randomly selected Implicit examples from each of the four primary semantic classes (Temporal, Contingency, Expansion, Comparison). What features of Arg1 and Arg2 might be shaping this semantic judgment? This is the primary question for this section; I'm hoping that studying some specific examples will get our linguistic and scientific intuitions flowing.
I've included the connective string (always chosen by the annotators for these Implicit cases) in column 2. This might help you understand the semantic classification. It shouldn't be incorporated into the predictive model, though.
Associated reading:
Data and code:
I've written a very focused interface for building classifier models to predict implicit coherence relations:
The file contains just one class, PdtbClassifier, which trains, tests, and evaluates MaxEnt classifiers using nltk.MaxentClassifier.
The interface is very similar to that of iqap_classifier.py. Here is a description of the initialization process:
Once the model is instantiated, we typically train it, test it, and then look at the confusion matrix to assess performance:
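To make this concrete, here is a minimal sketch of what such a class could look like. This is not the course implementation; the constructor arguments (`feature_function`, `train_share`) and the internal details are my assumptions based on the description above.

```python
import random
import numpy as np
from nltk.classify import MaxentClassifier

class PdtbClassifierSketch:
    """Illustrative stand-in for PdtbClassifier; see the course file
    for the authoritative version."""
    def __init__(self, feature_function, train_share=0.7):
        self.feature_function = feature_function  # maps a datum to a feature dict
        self.train_share = train_share            # fraction of the data used for training

    def train(self, data):
        # data: a list of (datum, semclass) pairs.
        featurized = [(self.feature_function(d), label) for d, label in data]
        random.shuffle(featurized)
        cut = int(len(featurized) * self.train_share)
        self.train_set, self.test_set = featurized[:cut], featurized[cut:]
        self.model = MaxentClassifier.train(self.train_set, trace=0, max_iter=10)

    def test(self):
        # Build a confusion matrix (rows: gold labels; columns: predictions)
        # and return overall accuracy.
        gold = [label for _, label in self.test_set]
        predicted = [self.model.classify(feats) for feats, _ in self.test_set]
        self.classes = sorted(set(gold) | set(predicted))
        index = {c: i for i, c in enumerate(self.classes)}
        self.cm = np.zeros((len(self.classes), len(self.classes)))
        for g, p in zip(gold, predicted):
            self.cm[index[g], index[p]] += 1
        return sum(g == p for g, p in zip(gold, predicted)) / float(len(gold))
```

A typical session is then `model = PdtbClassifierSketch(my_feature_function)`, followed by `model.train(data)`, `model.test()`, and an inspection of `model.cm`.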
This section implements (versions of) two of the features used by Pitler et al. 2009: word-pairs and Harvard Inquirer semantic-class pairs. The guiding intuition is that these features encode something about the semantic relation between the two arguments of the implicit coherence relation, and thus that they will be predictive of that relation.
Building on earlier work, Pitler et al. 2009 suggest creating features (word1, word2) drawn from the cross-product of the words in Arg1 and the words in Arg2. The rationale is that these lexical pairings approximate the semantic connection between the two arguments.
It is an open question which subset of this cross-product to include. One could restrict by frequency, for example by removing very common words or very rare ones. One could also restrict by part of speech, for example by keeping only content words or removing function words. And so on. (We probably don't want to keep all the pairings, since doing so would likely lead to overfitting.)
The following function implements a basic version of the word-pairs feature, by allowing the user optionally to restrict attention to just words that are tagged with one of the tags in the keyword argument list tags:
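The course file has the authoritative version; the following is a sketch of how such a function might look, assuming a PDTB datum interface on which `arg1_pos()` and `arg2_pos()` return (word, tag) pairs:

```python
from itertools import product

def word_pair_features(datum, tags=None):
    """Sketch of the word-pairs feature function. If tags is given,
    keep only words whose Treebank tag starts with one of the supplied
    prefixes (e.g., tags=['VB'] keeps all and only verbs)."""
    def keep(tag):
        return tags is None or any(tag.startswith(t) for t in tags)
    arg1 = [word.lower() for word, tag in datum.arg1_pos() if keep(tag)]
    arg2 = [word.lower() for word, tag in datum.arg2_pos() if keep(tag)]
    # One binary feature per (Arg1-word, Arg2-word) pairing.
    return dict(("%s|%s" % (w1, w2), True) for w1, w2 in product(arg1, arg2))
```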
There is a great deal of work suggesting that verbs are particularly important for determining coherence relationships (McKoon et al. 1993), so that seems like a reasonable first experiment.
The following code provides a general framework for training, testing, and assessing classifier models. It basically reduces the task of running an experiment to the task of writing an interesting value for the feature_function argument.
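A sketch of that framework, built on the PdtbClassifierSketch above; `implicit_data` stands in for whatever corpus reading the course file does, and is assumed to be a list of (datum, semclass) pairs:

```python
def pdtb_predict_implicit_experiment(feature_function, data=None):
    """Train, test, and report. data defaults to implicit_data, an
    assumed module-level list of (datum, semclass) pairs."""
    data = data if data is not None else implicit_data
    model = PdtbClassifierSketch(feature_function=feature_function)
    model.train(data)
    accuracy = model.test()
    print("Accuracy: %0.3f" % accuracy)
    # By-category effectiveness measures would go here; see exercise
    # EFFECTIVENESS below.
    print(model.classes)
    print(model.cm)
```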
The verbal word-pairs experiment then takes the following form:
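Given the pieces above, it might be as simple as this sketch:

```python
def verb_pairs_experiment():
    # The 'VB' prefix covers all verbal Treebank tags
    # (VB, VBD, VBG, VBN, VBP, VBZ).
    verbal_pairs = lambda datum: word_pair_features(datum, tags=['VB'])
    pdtb_predict_implicit_experiment(verbal_pairs)
```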
Here are the relevant parts of the output for a run of verb_pairs_experiment(); your results will differ because the train/test splits are random:
Exercises: POS, EFFECTIVENESS, FREQ
Pitler et al. 2009 also call on the Harvard General Inquirer Database to generate pair-features.
The Harvard Inquirer is a giant CSV file providing classifications for 11788 sense-disambiguated words. Table HI provides a fragment of the file (which is a real monster, at 187 columns and 11788 rows).
| | Entry | Positiv | Negativ | Hostile | ...184 classes... | Othtags | Defined |
|---|---|---|---|---|---|---|---|
| 1 | A | | | | | DET ART | |
| 2 | ABANDON | | Negativ | | | SUPV | |
| 3 | ABANDONMENT | | Negativ | | | Noun | |
| 4 | ABATE | | Negativ | | | SUPV | |
| 5 | ABATEMENT | | | | | Noun | |
| ... | | | | | | | |
| 35 | ABSENT#1 | | Negativ | | | Modif | |
| 36 | ABSENT#2 | | | | | SUPV | |
| ... | | | | | | | |
| 11788 | ZONE | | | | | Noun | |
The sense-disambiguations are given as #X suffixes on the values in the Entry column. Unfortunately, these are not aligned with WordNet, so I don't see any practical way to take advantage of them.
However, we can appeal to the Othtags column, which gives specialized part-of-speech tags, to distinguish some of the senses. Because these tags are specific to the Inquirer, we can't really use them directly with PDTB data, which uses Penn Treebank tags. To get us over this hurdle, I've created a pickled version of the Inquirer that remaps the Othtags values to Penn Treebank values:
This version takes the form of a dict mapping (Entry, POS) pairs to sets of values, where POS is the Treebank remapping of Othtags.
A brief look at harvard_inquirer.pickle:
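Something like the following, where the key format follows the description above and the uppercase entries are an assumption on my part:

```python
import pickle

# Load the remapped Inquirer: (Entry, Treebank POS) -> set of class names.
with open('harvard_inquirer.pickle', 'rb') as f:
    inquirer = pickle.load(f)

print(len(inquirer))                           # number of (Entry, POS) keys
print(inquirer.get(('ABSENT', 'JJ'), set()))   # hypothetical lookup
```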
The feature function has two parts. First, words2inquirer_classes() maps a set of lemmas to their associated set of semantic classes, via look-up in the Inquirer:
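A sketch of that first step, assuming (lemma, POS) pairs as input and the uppercase entry keys from the pickle-loading example above:

```python
def words2inquirer_classes(tagged_lemmas):
    """Map (lemma, POS) pairs to the union of their Inquirer classes.
    Lemmas missing from the Inquirer contribute nothing."""
    classes = set()
    for lemma, pos in tagged_lemmas:
        classes |= inquirer.get((lemma.upper(), pos), set())
    return classes
```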
The core feature function uses words2inquirer_classes() to get the semantic classes associated with each argument, and then it creates features very much like our verbal ones above:
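A sketch of the core function, parallel to word_pair_features() above (a lemmatizing variant of arg1_pos()/arg2_pos(), if the datum interface provides one, would be the natural input), together with the corresponding experiment runner referenced below:

```python
from itertools import product

def inquirer_features(datum):
    """One binary feature per (Arg1-class, Arg2-class) pairing."""
    classes1 = words2inquirer_classes(datum.arg1_pos())
    classes2 = words2inquirer_classes(datum.arg2_pos())
    return dict(("%s|%s" % (c1, c2), True)
                for c1, c2 in product(classes1, classes2))

def inquirer_pairs_experiment():
    pdtb_predict_implicit_experiment(inquirer_features)
```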
The performance is comparable to that of verb_pairs_experiment():
Exercises: COMBO, SEMLIMIT, POLARITY, OTHERS
POS The function verb_pairs_experiment() tests the extent to which verbs carry information about the nature of implicit coherence relationships. What other categories or sets of categories are worth testing?
EFFECTIVENESS The semantic classes for Implicit coherence relationships are highly imbalanced:
For this reason, accuracy is not a good measure of the quality of a classifier — always guessing Expansion or Contingency would yield fairly good performance. It would be better to look at by-category precision and recall.
The code for pdtb_predict_implicit_experiment() includes a comment, below the accuracy calculation, calling for a function that calculates these effectiveness measures.
Write such a function. Its arguments should be a two-dimensional numpy.array object (the confusion matrix model.cm in pdtb_predict_implicit_experiment()) and a set of labels indexed according to that matrix (model.classes in this context). It should return the precision and recall values for each class.
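To fix the definitions: for each class, precision is the number of correct predictions of that class divided by all predictions of it, and recall divides instead by all gold instances of it. One possible shape for such a function, assuming, as in the sketch above, that rows are gold labels and columns are predictions:

```python
import numpy as np

def precision_recall(cm, classes):
    """Per-class precision and recall from a confusion matrix cm, where
    cm[i, j] counts items with gold label classes[i] predicted as classes[j]."""
    results = {}
    for i, label in enumerate(classes):
        correct = cm[i, i]
        predicted = cm[:, i].sum()  # everything the model called `label`
        actual = cm[i, :].sum()     # everything that really was `label`
        results[label] = (correct / predicted if predicted else 0.0,
                          correct / actual if actual else 0.0)
    return results
```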
FREQ Pitler et al. 2009 discuss approaches to word-pair features that involve filtering based on frequency. This cannot be done as part of building the model, because it depends on statistics gathered from the entire corpus.
To prepare for the day when we might want to set frequency thresholds, build a dictionary mapping lemmatized word-pairs to their counts for the whole corpus, and store this in a pickle file for later use. The following code begins this; your task is just to finish this function and run it:
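Since the starter code isn't reproduced here, the following sketch supplies comparable scaffolding; the corpus interface (iter_data() plus the arg1_pos()/arg2_pos() methods assumed earlier) is my guess, and the lemmatization step is the part left to finish:

```python
import pickle
from collections import defaultdict
from itertools import product

def build_word_pair_counts(corpus, output_filename='word_pair_counts.pickle'):
    """Count (Arg1-word, Arg2-word) pairs over the whole corpus and
    pickle the resulting dict for later frequency thresholding."""
    counts = defaultdict(int)
    for datum in corpus.iter_data():
        # TODO: lemmatize here rather than just lowercasing.
        arg1 = [word.lower() for word, tag in datum.arg1_pos()]
        arg2 = [word.lower() for word, tag in datum.arg2_pos()]
        for pair in product(arg1, arg2):
            counts[pair] += 1
    with open(output_filename, 'wb') as f:
        pickle.dump(dict(counts), f)
```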
Of course, if you want to make use of these counts in building a classifier, please do! In that case, this problem should be considered two problems for the purposes of requirements.
COMBO Combine verb_pairs_experiment() and inquirer_pairs_experiment() into a single experiment and run it. For this, you should provide (i) your experiment code (a function comparable to verb_pairs_experiment() or inquirer_pairs_experiment()) and (ii) its output.
SEMLIMIT Modify inquirer_features() so that the user can optionally supply a limited set of semantic classes to include when building features (analogous to what is done with the tags in word_pair_features()).
POLARITY Pitler et al. 2009 hypothesize that polarity agreement and opposition will correlate with different coherence relations. We have a variety of methods for testing this hypothesis. Pick one of the following approaches to getting polarity word scores and integrate it with a classifier experiment (alone or with other predictors — the design of the experiment is up to you):
I favor the IMDB reviews approach myself, but all of them seem potentially valuable. (A systematic comparison of these approaches in the context of this PDTB prediction task would make an excellent final project.)
OTHERS Pitler et al. 2009 employ a wide variety of predictors not discussed here. Pick one of them, implement it, and run an experiment using pdtb_predict_implicit_experiment() to test its effectiveness.