The goal of this section is to provide a framework for running (probabilistic) experiments that attempt to predict the semantics of implicit discourse coherence relations in the Penn Discourse Treebank 2.0.
I provide two detailed illustrations of the kinds of predictors we might use. Their success is limited, but that's okay. My hope is that seeing these examples will help you to implement your own hypotheses.
To warm up, I suggest downloading the following file and studying it:
This file contains 30 randomly selected Implicit examples from each of the four primary semantic classes (Temporal, Contingency, Expansion, Comparison). What features of Arg1 and Arg2 might be shaping this semantic judgment? This is the primary question for this section; I'm hoping that studying some specific examples will get our linguistic and scientific intuitions flowing.
I've included the connective string (always chosen by the annotators for these Implicit cases) in column 2. This might help you understand the semantic classification. It shouldn't be incorporated into the predictive model, though.
Associated reading:
Data and code:
I've written a very focused interface for building classifier models to predict implicit coherence relations:
The file contains just one class, PdtbClassifier, which trains, tests, and evaluates MaxEnt classifiers using nltk.MaxentClassifier.
The interface is very similar to that of iqap_classifier.py. Here is a description of the initialization process:
Once the model is instantiated, we typically train it, test it, and then look at the confusion matrix to assess performance:
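To make this concrete, here is a minimal sketch of what such a class could look like. This is not the course implementation; the constructor arguments (`feature_function`, `train_share`) and the internal details are my assumptions based on the description above.

```python
import random
import numpy as np
from nltk.classify import MaxentClassifier

class PdtbClassifierSketch:
    """Illustrative stand-in for PdtbClassifier; see the course file
    for the authoritative version."""
    def __init__(self, feature_function, train_share=0.7):
        self.feature_function = feature_function  # maps a datum to a feature dict
        self.train_share = train_share            # fraction of the data used for training

    def train(self, data):
        # data: a list of (datum, semclass) pairs.
        featurized = [(self.feature_function(d), label) for d, label in data]
        random.shuffle(featurized)
        cut = int(len(featurized) * self.train_share)
        self.train_set, self.test_set = featurized[:cut], featurized[cut:]
        self.model = MaxentClassifier.train(self.train_set, trace=0, max_iter=10)

    def test(self):
        # Build a confusion matrix (rows: gold labels; columns: predictions)
        # and return overall accuracy.
        gold = [label for _, label in self.test_set]
        predicted = [self.model.classify(feats) for feats, _ in self.test_set]
        self.classes = sorted(set(gold) | set(predicted))
        index = {c: i for i, c in enumerate(self.classes)}
        self.cm = np.zeros((len(self.classes), len(self.classes)))
        for g, p in zip(gold, predicted):
            self.cm[index[g], index[p]] += 1
        return sum(g == p for g, p in zip(gold, predicted)) / float(len(gold))
```

A typical session is then `model = PdtbClassifierSketch(my_feature_function)`, followed by `model.train(data)`, `model.test()`, and an inspection of `model.cm`.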
This section implements (versions of) two of the features used by Pitler et al. 2009: word-pairs and Harvard Inquirer semantic-class pairs. The guiding intuition is that these features encode something about the semantic relation between the two arguments of the implicit coherence relation, and thus that they will be predictive of that relation.
Building on earlier work, Pitler et al. 2009 suggest creating features (word1, word2) drawn from the cross-product of the words in Arg1 and the words in Arg2. The rationale is that these lexical pairings approximate the semantic connection between the two arguments.
It is an open question which subset of this cross-product to include. One could restrict by frequency, for example by removing very common words or very rare ones. One could also restrict by part of speech, for example by keeping only content words or removing function words. And so on. (We probably don't want to keep all the pairings, since doing so would likely lead to overfitting.)
The following function implements a basic version of the word-pairs feature, by allowing the user optionally to restrict attention to just words that are tagged with one of the tags in the keyword argument list tags:
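The course file has the authoritative version; the following is a sketch of how such a function might look, assuming a PDTB datum interface on which `arg1_pos()` and `arg2_pos()` return (word, tag) pairs:

```python
from itertools import product

def word_pair_features(datum, tags=None):
    """Sketch of the word-pairs feature function. If tags is given,
    keep only words whose Treebank tag starts with one of the supplied
    prefixes (e.g., tags=['VB'] keeps all and only verbs)."""
    def keep(tag):
        return tags is None or any(tag.startswith(t) for t in tags)
    arg1 = [word.lower() for word, tag in datum.arg1_pos() if keep(tag)]
    arg2 = [word.lower() for word, tag in datum.arg2_pos() if keep(tag)]
    # One binary feature per (Arg1-word, Arg2-word) pairing.
    return dict(("%s|%s" % (w1, w2), True) for w1, w2 in product(arg1, arg2))
```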
There is a great deal of work suggesting that verbs are particularly important for determining coherence relationships (McKoon et al. 1993), so that seems like a reasonable first experiment.
The following code provides a general framework for training, testing, and assessing classifier models. It basically reduces the task of running an experiment to the task of writing an interesting value for the feature_function argument.
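A sketch of that framework, built on the PdtbClassifierSketch above; `implicit_data` stands in for whatever corpus reading the course file does, and is assumed to be a list of (datum, semclass) pairs:

```python
def pdtb_predict_implicit_experiment(feature_function, data=None):
    """Train, test, and report. data defaults to implicit_data, an
    assumed module-level list of (datum, semclass) pairs."""
    data = data if data is not None else implicit_data
    model = PdtbClassifierSketch(feature_function=feature_function)
    model.train(data)
    accuracy = model.test()
    print("Accuracy: %0.3f" % accuracy)
    # By-category effectiveness measures would go here; see exercise
    # EFFECTIVENESS below.
    print(model.classes)
    print(model.cm)
```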
The verbal word-pairs experiment then takes the following form:
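Given the pieces above, it might be as simple as this sketch:

```python
def verb_pairs_experiment():
    # The 'VB' prefix covers all verbal Treebank tags
    # (VB, VBD, VBG, VBN, VBP, VBZ).
    verbal_pairs = lambda datum: word_pair_features(datum, tags=['VB'])
    pdtb_predict_implicit_experiment(verbal_pairs)
```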
Here are the relevant parts of the output for a run of verb_pairs_experiment(); your results will differ because the train/test splits are random:
Exercises: POS, EFFECTIVENESS, FREQ
Pitler et al. 2009 also call on the Harvard General Inquirer Database to generate pair-features.
The Harvard Inquirer is a giant CSV file providing classifications for 11788 sense-disambiguated words. Table HI provides a fragment of the file (which is a real monster, at 187 columns and 11788 rows).
| | Entry | Positiv | Negativ | Hostile | ...184 classes... | Othtags | Defined |
|---|---|---|---|---|---|---|---|
| 1 | A | | | | | DET ART | |
| 2 | ABANDON | | Negativ | | | SUPV | |
| 3 | ABANDONMENT | | Negativ | | | Noun | |
| 4 | ABATE | | Negativ | | | SUPV | |
| 5 | ABATEMENT | | | | | Noun | |
| ... | | | | | | | |
| 35 | ABSENT#1 | | Negativ | | | Modif | |
| 36 | ABSENT#2 | | | | | SUPV | |
| ... | | | | | | | |
| 11788 | ZONE | | | | | Noun | |
The sense-disambiguations are given as #X suffixes on the values in the Entry column. Unfortunately, these are not aligned with WordNet, so I don't see any practical way to take advantage of them.
However, we can appeal to the Othtags column, which gives specialized part-of-speech tags, to distinguish some of the senses. Because these tags are specific to the Inquirer, we can't really use them directly with PDTB data, which uses Penn Treebank tags. To get us over this hurdle, I've created a pickled version of the Inquirer that remaps the Othtags values to Penn Treebank values:
This version takes the form of a dict mapping (Entry, POS) pairs to sets of values, where POS is the Treebank remapping of Othtags.
A brief look at harvard_inquirer.pickle:
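Something like the following, where the key format follows the description above and the uppercase entries are an assumption on my part:

```python
import pickle

# Load the remapped Inquirer: (Entry, Treebank POS) -> set of class names.
with open('harvard_inquirer.pickle', 'rb') as f:
    inquirer = pickle.load(f)

print(len(inquirer))                           # number of (Entry, POS) keys
print(inquirer.get(('ABSENT', 'JJ'), set()))   # hypothetical lookup
```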
The feature function has two parts. First, words2inquirer_classes() maps a set of lemmas to their associated set of semantic classes, via look-up in the Inquirer:
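A sketch of that first step, assuming (lemma, POS) pairs as input and the uppercase entry keys from the pickle-loading example above:

```python
def words2inquirer_classes(tagged_lemmas):
    """Map (lemma, POS) pairs to the union of their Inquirer classes.
    Lemmas missing from the Inquirer contribute nothing."""
    classes = set()
    for lemma, pos in tagged_lemmas:
        classes |= inquirer.get((lemma.upper(), pos), set())
    return classes
```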
The core feature function uses words2inquirer_classes() to get the semantic classes associated with each argument, and then it creates features very much like our verbal ones above:
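A sketch of the core function, parallel to word_pair_features() above (a lemmatizing variant of arg1_pos()/arg2_pos(), if the datum interface provides one, would be the natural input), together with the corresponding experiment runner referenced below:

```python
from itertools import product

def inquirer_features(datum):
    """One binary feature per (Arg1-class, Arg2-class) pairing."""
    classes1 = words2inquirer_classes(datum.arg1_pos())
    classes2 = words2inquirer_classes(datum.arg2_pos())
    return dict(("%s|%s" % (c1, c2), True)
                for c1, c2 in product(classes1, classes2))

def inquirer_pairs_experiment():
    pdtb_predict_implicit_experiment(inquirer_features)
```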
The performance is comparable to that of verb_pairs_experiment():
Exercises: COMBO, SEMLIMIT, POLARITY, OTHERS
POS The function verb_pairs_experiment() tests the extent to which verbs carry information about the nature of implicit coherence relationships. What other categories or sets of categories are worth testing?
EFFECTIVENESS The semantic classes for Implicit coherence relationships are highly imbalanced:
For this reason, accuracy is not a good measure of the quality of a classifier — always guessing Expansion or Contingency would yield fairly good performance. It would be better to look at by-category precision and recall.
The code for pdtb_predict_implicit_experiment() includes a comment, below the accuracy calculation, calling for a function that calculates these effectiveness measures.
Write such a function. Its arguments should be a two-dimensional numpy.array object (the confusion matrix model.cm in pdtb_predict_implicit_experiment()) and a set of labels indexed according to that matrix (model.classes in this context). It should return the precision and recall values for each class.
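To fix the definitions: for each class, precision is the number of correct predictions of that class divided by all predictions of it, and recall divides instead by all gold instances of it. One possible shape for such a function, assuming, as in the sketch above, that rows are gold labels and columns are predictions:

```python
import numpy as np

def precision_recall(cm, classes):
    """Per-class precision and recall from a confusion matrix cm, where
    cm[i, j] counts items with gold label classes[i] predicted as classes[j]."""
    results = {}
    for i, label in enumerate(classes):
        correct = cm[i, i]
        predicted = cm[:, i].sum()  # everything the model called `label`
        actual = cm[i, :].sum()     # everything that really was `label`
        results[label] = (correct / predicted if predicted else 0.0,
                          correct / actual if actual else 0.0)
    return results
```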
FREQ Pitler et al. 2009 discuss approaches to word-pair features that involve filtering based on frequency. This cannot be done as part of building the model, because it depends on statistics gathered from the entire corpus.
To prepare for the day when we might want to set frequency thresholds, build a dictionary mapping lemmatized word-pairs to their counts for the whole corpus, and store this in a pickle file for later use. The following code begins this; your task is just to finish this function and run it:
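Since the starter code isn't reproduced here, the following sketch supplies comparable scaffolding; the corpus interface (iter_data() plus the arg1_pos()/arg2_pos() methods assumed earlier) is my guess, and the lemmatization step is the part left to finish:

```python
import pickle
from collections import defaultdict
from itertools import product

def build_word_pair_counts(corpus, output_filename='word_pair_counts.pickle'):
    """Count (Arg1-word, Arg2-word) pairs over the whole corpus and
    pickle the resulting dict for later frequency thresholding."""
    counts = defaultdict(int)
    for datum in corpus.iter_data():
        # TODO: lemmatize here rather than just lowercasing.
        arg1 = [word.lower() for word, tag in datum.arg1_pos()]
        arg2 = [word.lower() for word, tag in datum.arg2_pos()]
        for pair in product(arg1, arg2):
            counts[pair] += 1
    with open(output_filename, 'wb') as f:
        pickle.dump(dict(counts), f)
```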
Of course, if you want to make use of these counts in building a classifier, please do! In that case, this problem should be considered two problems for the purposes of requirements.
COMBO Combine verb_pairs_experiment() and inquirer_pairs_experiment() into a single experiment and run it. For this, you should provide (i) your experiment code (a function comparable to verb_pairs_experiment() or inquirer_pairs_experiment()) and (ii) its output.
SEMLIMIT Modify inquirer_features() so that the user can optionally supply a limited set of semantic classes to include when building features (analogous to what is done with the tags in word_pair_features()).
POLARITY Pitler et al. 2009 hypothesize that polarity agreement and opposition will correlate with different coherence relations. We have a variety of methods for testing this hypothesis. Pick one of the following approaches to getting polarity word scores and integrate it with a classifier experiment (alone or with other predictors — the design of the experiment is up to you):
I favor the IMDB reviews approach myself, but all of them seem potentially valuable. (A systematic comparison of these approaches in the context of this PDTB prediction task would make an excellent final project.)
OTHERS Pitler et al. 2009 employ a wide variety of predictors not discussed here. Pick one of them, implement it, and run an experiment using pdtb_predict_implicit_experiment() to test its effectiveness.