The Penn Discourse Treebank 2.0 (PDTB) is an incredibly rich resource for studying not only the way discourse coherence is expressed but also how information about discourse commitments (content attribution) is conveyed linguistically. The goal of this section is to provide an overview of the basic structure of the corpus and introduce you to some tools for working with it.
Associated reading:
Code and data:
With great power and richness comes great complexity, so bear with me... Let's start with an intuitive example from Prasad et al. 2008 to get a feel for what the corpus is like:
And the example in all its glory (here and throughout, click to enlarge):
This figure breaks down into three parts: Arg1, the connective, and Arg2. Within each, there is basic semantic and pragmatic information (black) as well as attribution information (blue). The Args can also have supplementary text (green), though this one doesn't have any.
The nodes containing heavily bracketed strings are the associated Penn Treebank 2.0 parsetrees. Each span of text (the _RawText nodes) has an associated set of such parsetrees.
The Institute has obtained a license for all of us to access the corpus for the purposes of this course, so I suggest that you download it in its usual distribution form:
The PDTB is a complex resource because it pools information from the Penn Treebank 2.0, both its merged parsetrees and its raw text. To make it easier for us to work with the corpus, I've created a single CSV file that pools (most of) the requisite information. I strongly suggest that you work with this version during the course:
This is a large file with a lot of information in it. The basic structure is as follows: each row represents a single item (henceforth datum). There are 54 columns for each datum (though most of them are empty for a given example). These columns encode information like that depicted in figure WHILE.
The PDTB team has released a Java tool for searching and browsing. I myself won't be working with it, but it is flexible and powerful, so you might check it out if you dislike using Python or R.
The classes:
The main interface provided by pdtb.py is the CorpusReader:
The central method for CorpusReader objects is iter_data(), which allows you to iterate through the data in the corpus. Intuitively, iter_data() reads each row of the source csv file pdtb2.csv and turns it into a Datum object, which has lots of methods and attributes for doing cool things.
To test your set-up, paste the following code into a file, change the path to pdtb.py as needed (if your file is in the same directory as pdtb.py, then you needn't do anything), and then run python on it.
If all goes well, your output will be the following table of relation-type counts, which is the same as Prasad et al. 2008, table 2 (though see their footnote for why the count for Implicit is slightly different).
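The test script itself isn't reproduced here, but the core of what relation_count() presumably does is a simple tally over the Relation column. The following is a minimal, self-contained sketch that performs the same kind of count with Python's plain csv module, over a tiny hypothetical sample standing in for pdtb2.csv:

```python
import csv
import io
from collections import defaultdict

# A tiny stand-in for pdtb2.csv; the real file has 54 columns and many
# thousands of rows. The column names follow the attribute table below.
SAMPLE = """Relation,Section,FileNumber
Explicit,00,0001
Implicit,00,0001
Explicit,00,0002
EntRel,01,0100
"""

def relation_count(src):
    """Count the number of rows for each Relation type."""
    counts = defaultdict(int)
    for row in csv.DictReader(src):
        counts[row["Relation"]] += 1
    return dict(counts)

print(relation_count(io.StringIO(SAMPLE)))
# → {'Explicit': 2, 'Implicit': 1, 'EntRel': 1}
```

With pdtb.py, you would instead iterate with iter_data() and read the Relation attribute off each Datum object; the counting logic is the same.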
In addition, pdtb.py allows you to create Datum objects directly from a string. For details on this, see Appendix A below.
It is possible to work with the PDTB in a limited way using just pdtb2.csv. Since it is a CSV file, it can be read into a program like Excel or R. For example, the following R code does exactly what the Python function relation_count() does:
As with the Switchboard Dialog Act Corpus, this has the potential to be a useful way of working with the corpus, but it will require you to write a lot of auxiliary functions to deal with the non-string and non-numeric values (trees, Gorn addresses, etc.), whereas pdtb.py does all of this for you.
Let's begin to unpack the PDTB. There is a lot of information to assimilate. The strategy I have taken is to give a high-level overview and some reference diagrams, and then hope that we can come to appreciate the details through some specific case studies.
Each datum in the PDTB has three basic parts: Arg1, the connective, and Arg2. Arg1 and Arg2 are always (heavily annotated) spans of text. The structure of the connective depends on the nature of the relation.
Table ATTRIBUTES provides a full listing of all the attributes of Datum instances. Thus, if dat is a Datum and att is an attribute, then dat.att will return the corresponding value for that attribute.
Attribute name | Object type | Applicable relations (None for others) | Description |
---|---|---|---|
Relation | str | Explicit|Implicit|AltLex|EntRel|NoRel | Explicit|Implicit|AltLex|EntRel|NoRel |
Section | str | Explicit|Implicit|AltLex|EntRel|NoRel | 00 .. 24 |
FileNumber | str | Explicit|Implicit|AltLex|EntRel|NoRel | 4-digit number where digits 1-2 == Section |
Connective_SpanList | list | Explicit|AltLex | a list of lists where each member is a pair of integers |
Connective_GornList | list | Explicit|AltLex | a list of lists where each member is a sequence of integers |
Connective_RawText | str | Explicit|AltLex | raw text (same as obtainable with Connective_SpanList) |
Connective_Trees | list | Explicit|AltLex | a list of nltk.tree.Tree objects (same as obtainable with Connective_GornList) |
Connective_StringPosition | int | Implicit|EntRel|NoRel | the position of the inferred relation for Implicit |
SentenceNumber | int | Implicit|EntRel|NoRel | number of the source sentence in the raw and parsed files |
ConnHead | str | Explicit | the head of the connective string (which could be phrasal) |
Conn1 | str | Implicit | the obligatory inferred connective |
Conn2 | str | Implicit | an optional second connective |
ConnHeadSemClass1 | str | Explicit|Implicit|AltLex | the primary semantic class of the connective (ConnHead/Conn1); see the gray table in EXPLICIT |
ConnHeadSemClass2 | str | Explicit|Implicit|AltLex | optional second semantic class for the connective, drawn from the same values as ConnHeadSemClass1 |
Conn2SemClass1 | str | Implicit | the semantic class of Conn2 if present; see the gray table in EXPLICIT |
Conn2SemClass2 | str | Implicit | optional second semantic class for Conn2 if present, drawn from the same values as ConnHeadSemClass1 |
Attribution_Source | str | Explicit|Implicit|AltLex | Wr|Ot|Arb |
Attribution_Type | str | Explicit|Implicit|AltLex | Comm|PAtt|Ftv|Ctrl |
Attribution_Polarity | str | Explicit|Implicit|AltLex | Neg|Null |
Attribution_Determinacy | str | Explicit|Implicit|AltLex | Indet|Null |
Attribution_SpanList | list | Explicit|Implicit|AltLex | a list of lists where each member is a pair of integers |
Attribution_GornList | list | Explicit|Implicit|AltLex | a list of lists where each member is a sequence of integers |
Attribution_RawText | str | Explicit|Implicit|AltLex | raw text (same as obtainable with Attribution_SpanList) |
Arg1_SpanList | list | Explicit|Implicit|AltLex|EntRel|NoRel | a list of lists where each member is a pair of integers |
Arg1_GornList | list | Explicit|Implicit|AltLex|EntRel|NoRel | a list of lists where each member is a sequence of integers |
Arg1_RawText | str | Explicit|Implicit|AltLex|EntRel|NoRel | raw text (same as obtainable with Arg1_SpanList) |
Arg1_Trees | list | Explicit|Implicit|AltLex|EntRel|NoRel | a list of nltk.tree.Tree objects (same as obtainable with Arg1_GornList) |
Arg1_Attribution_Source | str | Explicit|Implicit|AltLex | Wr|Ot|Arb|Inh |
Arg1_Attribution_Type | str | Explicit|Implicit|AltLex | Comm|PAtt|Ftv|Ctrl |
Arg1_Attribution_Polarity | str | Explicit|Implicit|AltLex | Neg|Null |
Arg1_Attribution_Determinacy | str | Explicit|Implicit|AltLex | Indet|Null |
Arg1_Attribution_SpanList | list | Explicit|Implicit|AltLex | a list of lists where each member is a pair of integers |
Arg1_Attribution_GornList | list | Explicit|Implicit|AltLex | a list of lists where each member is a sequence of integers |
Arg1_Attribution_RawText | str | Explicit|Implicit|AltLex | raw text (same as obtainable with Arg1_Attribution_SpanList) |
Arg1_Attribution_Trees | list | Explicit|Implicit|AltLex | a list of nltk.tree.Tree objects (same as obtainable with Arg1_Attribution_GornList) |
Arg2_SpanList | list | Explicit|Implicit|AltLex|EntRel|NoRel | a list of lists where each member is a pair of integers |
Arg2_GornList | list | Explicit|Implicit|AltLex|EntRel|NoRel | a list of lists where each member is a sequence of integers |
Arg2_RawText | str | Explicit|Implicit|AltLex|EntRel|NoRel | raw text (same as obtainable with Arg2_SpanList) |
Arg2_Trees | list | Explicit|Implicit|AltLex|EntRel|NoRel | a list of nltk.tree.Tree objects (same as obtainable with Arg2_GornList) |
Arg2_Attribution_Source | str | Explicit|Implicit|AltLex | Wr|Ot|Arb|Inh |
Arg2_Attribution_Type | str | Explicit|Implicit|AltLex | Comm|PAtt|Ftv|Ctrl |
Arg2_Attribution_Polarity | str | Explicit|Implicit|AltLex | Neg|Null |
Arg2_Attribution_Determinacy | str | Explicit|Implicit|AltLex | Indet|Null |
Arg2_Attribution_SpanList | list | Explicit|Implicit|AltLex | a list of lists where each member is a pair of integers |
Arg2_Attribution_GornList | list | Explicit|Implicit|AltLex | a list of lists where each member is a sequence of integers |
Arg2_Attribution_RawText | str | Explicit|Implicit|AltLex | raw text (same as obtainable with Arg2_Attribution_SpanList) |
Sup1_SpanList | list | Explicit|Implicit|AltLex | a list of lists where each member is a pair of integers |
Sup1_GornList | list | Explicit|Implicit|AltLex | a list of lists where each member is a sequence of integers |
Sup1_RawText | str | Explicit|Implicit|AltLex | optional supporting text for Arg1 (same as obtainable with Sup1_SpanList) |
Sup1_Trees | list | Explicit|Implicit|AltLex | a list of nltk.tree.Tree objects (same as obtainable with Sup1_GornList) |
Sup2_SpanList | list | Explicit|Implicit|AltLex | a list of lists where each member is a pair of integers |
Sup2_GornList | list | Explicit|Implicit|AltLex | a list of lists where each member is a sequence of integers |
Sup2_RawText | str | Explicit|Implicit|AltLex | optional supporting text for Arg2 (same as obtainable with Sup2_SpanList) |
Sup2_Trees | list | Explicit|Implicit|AltLex | a list of nltk.tree.Tree objects (same as obtainable with Sup2_GornList) |
The next few subsections work to make this clearer with diagrams.
There are five types of connective: Explicit, Implicit, AltLex, EntRel, and NoRel. The following characterizations and examples are from Prasad et al. 2008.
Prasad et al. 2008: "Explicit connectives are drawn from three grammatical classes: subordinating conjunctions (e.g., because, when, etc.), coordinating conjunctions (e.g., and, or, etc.), and discourse adverbials (e.g., for example, instead, etc.)."
Here is the example from above, repeated without its trees and Args.
Here is the abstract structure of such connectives:
Prasad et al. 2008: "[S]uch inferred relations are annotated by inserting a connective expression — called an 'Implicit' connective — that best expresses the inferred relation"
Figure IMPLICIT gives the abstract structure of such connectives:
Prasad et al. 2008: "the insertion of an Implicit connective to express an inferred relation led to a redundancy due to the relation being alternatively lexicalized by some non-connective expression"
Figure ALTLEX depicts the general structure:
Prasad et al. 2008 on EntRel: "only an entity-based coherence relation could be perceived between the sentences"
Prasad et al. 2008 on NoRel: "no discourse relation or entity-based relation could be perceived between the sentences"
Very few things are defined for these, as the abstract structure, figure ENTNO, shows.
The arguments each have (i) basic attributes for their content (the raw text and the trees), (ii) attribution information, and (iii) supplementary text helping to contextualize the content.
The arguments always have basic information. Supplementary text is somewhat rarer. Attribution information is always present for Explicit, Implicit, and AltLex, and it is always absent for EntRel and NoRel.
Figure ARG shows the full structure, though without the Attribution information. Each edge label corresponds to an attribute of Datum objects as long as you insert 1 or 2 after Arg for each one.
Figure ATTRIBUTION breaks down the structure of attributions, showing the relevant attributes as edge labels (insert 1 or 2 after Arg) and the range of values on the nodes, with description.
The Arg_Attribution_Source value Inh means that the attribution is inherited from the connective. The Datum methods final_arg1_attribution_source() and final_arg2_attribution_source() handle this for you, providing the final attribution value in every case.
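The resolution step those methods perform is presumably a small lookup: if an Arg's own source is Inh, fall back to the connective-level Attribution_Source. A minimal sketch of that logic (the real methods live in pdtb.py; this is just an illustration):

```python
def final_attribution_source(arg_source, connective_source):
    """Resolve an Arg's attribution source: 'Inh' means the value is
    inherited from the connective's attribution, so fall back to it."""
    return connective_source if arg_source == "Inh" else arg_source

print(final_attribution_source("Inh", "Wr"))  # → Wr
print(final_attribution_source("Ot", "Wr"))   # → Ot
```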
There are lots of different kinds of text to work with:
For each of these, there are associated methods for getting at their structure. For example:
There are similarly named methods for Sups, connectives, and attributions.
The SpanList and GornList attributes are for connecting with the Penn Treebank files. The relevant material is already inserted into the CSV file and accessible via the _RawText and _Trees attributes, so you probably won't need it, but it is there just in case you need to connect with the external files.
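A Gorn address is just a path of child indices from the root of a parsetree down to a node. To illustrate the idea without requiring the corpus or nltk, the following sketch walks a tree encoded as nested (label, children) pairs; nltk.tree.Tree objects support the same style of lookup via tuple indexing:

```python
def gorn_lookup(tree, address):
    """Follow a Gorn address (a sequence of child indices) down a tree.

    Here a tree is a (label, [children]) pair with strings as leaves;
    in the corpus itself the trees are nltk.tree.Tree objects.
    """
    node = tree
    for i in address:
        node = node[1][i]   # descend into the i-th child
    return node

# (S (NP (DT The) (NN cat)) (VP (VBD slept)))
t = ("S",
     [("NP", [("DT", ["The"]), ("NN", ["cat"])]),
      ("VP", [("VBD", ["slept"])])])

print(gorn_lookup(t, (1,)))      # → ('VP', [('VBD', ['slept'])])
print(gorn_lookup(t, (0, 1)))    # → ('NN', ['cat'])
```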
The function count_semantic_classes() looks at the values for ConnHeadSemClass1, which gives the primary sense for the connective for Implicit, Explicit, and AltLex data.
The following uses count_semantic_classes to sort the output from most to least frequent and put it into a CSV file.
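As a rough sketch of the sort-and-dump step, assuming count_semantic_classes() returns a dict mapping class names to counts (the small demo counts below are made up, not corpus figures):

```python
import csv
import io

def counts_to_csv(counts, outfile):
    """Write (class, count) rows sorted from most to least frequent."""
    writer = csv.writer(outfile)
    for cls, n in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        writer.writerow([cls, n])

# Hypothetical counts standing in for count_semantic_classes() output:
demo = {"Temporal.Asynchronous": 1, "Expansion.Conjunction": 3,
        "Comparison.Contrast": 2}
buf = io.StringIO()  # in real use, open a file instead
counts_to_csv(demo, buf)
print(buf.getvalue())
```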
The output has 41 lines, one for each relation seen in the gray and black tables for Explicit, Implicit, and AltLex. The following R code turns this into a barplot with (I think) readable labels:
exercise SEMCLASSES, exercise CM, exercise MAINVERB, exercise WORDS
There is no single attribute of Datum objects that provides an accurate high-level summary of its nature:
Therefore, Datum objects include the method conn_str() for getting at the above strings directly; datum.conn_str() returns the above-listed values. The following code uses this method to gather counts of all of the different regularized heads:
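The counting loop itself is straightforward; here is a self-contained sketch using collections.Counter, with a hypothetical list of strings standing in for the conn_str() values you would gather inside an iter_data() loop:

```python
from collections import Counter

# Hypothetical conn_str() values for a handful of data; in real use,
# these come from datum.conn_str() while iterating over the corpus.
conn_strs = ["but", "and", "because", "but", "if", "and", "but"]

counts = Counter(conn_strs)
print(counts.most_common())
# → [('but', 3), ('and', 2), ('because', 1), ('if', 1)]
```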
The resulting dictionary is long:
To summarize this, a function for mapping this into a format that Wordle can understand:
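The mapping function isn't reproduced here; assuming Wordle's advanced interface accepts one "word:weight" pair per line (an assumption about the format), a sketch might look like this:

```python
def to_wordle(counts):
    """Map a dict of counts to Wordle advanced-format lines,
    largest weights first ("word:weight" format is assumed)."""
    return "\n".join(f"{w}:{n}" for w, n in
                     sorted(counts.items(), key=lambda kv: kv[1], reverse=True))

print(to_wordle({"but": 3, "and": 2}))
```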
Here is the result of pasting the sublists into the Wordle advanced function:
exercise WORDLEEX, exercise ROOTS
The basic distribution of attribution values.
And the output:
A simple look at the nature of the attributions:
The top of the output of print_attribution_texts() (the frequent repeats are due to the fact that individual spans of text are often involved in multiple relationships):
The Datum method relative_arg_order() determines the ordering of Arg1 and Arg2. Its values:
The function distribution_of_relative_arg_order() in pdtb_functions.py calculates the distribution of these orderings. Here is the output:
As expected, it is most common for Arg1 to precede Arg2. The two 'overlap' relations, arg2_precedes_and_overlaps_but_does_not_contain_arg1 and arg1_precedes_and_overlaps_but_does_not_contain_arg2, turn out to be unattested.
The Datum methods arg1_precedes_arg2(), arg1_contains_arg2(), arg1_precedes_and_overlaps_but_does_not_contain_arg2(), arg2_precedes_arg1(), arg2_contains_arg1(), and arg2_precedes_and_overlaps_but_does_not_contain_arg1() allow for quick identification of these subsets. Each returns True or False.
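These predicates presumably compare the character offsets in the Arg1_SpanList and Arg2_SpanList attributes. The following sketch shows how two of them could be computed from span lists (the exact boundary conditions in pdtb.py may differ; this is an illustration, not the library's code):

```python
def span_extent(spanlist):
    """The overall (start, end) extent of a list of [start, end] pairs."""
    return min(s for s, _ in spanlist), max(e for _, e in spanlist)

def arg1_precedes_arg2(arg1_spans, arg2_spans):
    """True if Arg1's extent ends before Arg2's begins."""
    return span_extent(arg1_spans)[1] <= span_extent(arg2_spans)[0]

def arg1_contains_arg2(arg1_spans, arg2_spans):
    """True if Arg2's extent falls properly inside Arg1's extent."""
    s1, e1 = span_extent(arg1_spans)
    s2, e2 = span_extent(arg2_spans)
    return s1 <= s2 and e2 <= e1 and (s1, e1) != (s2, e2)

print(arg1_precedes_arg2([[0, 40]], [[45, 90]]))   # → True
print(arg1_contains_arg2([[0, 90]], [[10, 30]]))   # → True
```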
The best way to get acquainted with a new large-scale corpus resource is to hypothesize that important relationships might exist between two of its properties and then test to see whether the hypothesis holds up.
The function contingencies() in pdtb_functions.py seeks to provide a general method for exploring such relationships at the level of properties of Datum objects.
This function takes as its arguments two functions on Datum instances and calculates the observed/expected (O/E) values for the two classes of results. The print-out includes the observed contingency table, the expected contingency table, and an ordered list of O/E values. Values of None are ignored and can thus be used as a filter.
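The heart of such a function is the observed/expected computation itself. Here is a self-contained sketch of that core, run over made-up (relation, semclass) pairs; under the independence null hypothesis, the expected count for a cell is the product of its row and column marginals divided by the total:

```python
from collections import Counter

def observed_expected(pairs):
    """Map each (x, y) pair to its observed/expected ratio, where
    E[x, y] = count(x) * count(y) / N under independence."""
    obs = Counter(pairs)
    x_marg = Counter(x for x, _ in pairs)
    y_marg = Counter(y for _, y in pairs)
    n = len(pairs)
    return {(x, y): c / (x_marg[x] * y_marg[y] / n)
            for (x, y), c in obs.items()}

# Hypothetical (relation, semantic class) pairs:
pairs = [("Explicit", "Comparison"), ("Explicit", "Comparison"),
         ("Implicit", "Expansion"), ("Implicit", "Expansion"),
         ("Explicit", "Expansion"), ("Implicit", "Comparison")]
oe = observed_expected(pairs)
print(round(oe[("Explicit", "Comparison")], 2))  # → 1.33
```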
The following code tests for associations between relation-type and the primary semantic classes:
The output:
Values above 1 indicate that the observed values are larger than we would expect given the null hypothesis that the two variables are independent of each other. Values below 1 indicate that the observed values are smaller than we would expect given this hypothesis.
The Count table can be the input to chisq.test and g.test in R. However, the counts involved in these tables are so large that the null hypothesis is almost certain to look false, so a more qualitative assessment might be called for, followed by more articulated regression modeling.
The output:
The following code tests for associations between relative argument ordering and the primary semantic class:
The output:
The following explores the relationship between argument ordering and the connective's canonical name:
Here's a sample of the ranked output (the contingency tables are too large to display usefully here):
Chris Brown suggested looking at patterns of negation across the two Args — both where the negations are imbalanced and where they are balanced. The following adapts his own code for exploring this issue:
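Chris's actual code isn't reproduced here, but one simple way to operationalize the idea is to flag negation in each Arg via surface tokens and then classify the pair as balanced or imbalanced. A hedged sketch (the token list and function names are illustrative, not his):

```python
import re

# Crude surface cues for negation; a real study would want a richer set.
NEG_RE = re.compile(r"n't\b|\b(?:not|no|never|none|nothing)\b", re.IGNORECASE)

def has_negation(text):
    """True if the span contains a surface negation cue."""
    return bool(NEG_RE.search(text))

def negation_pattern(arg1_text, arg2_text):
    """Classify an Arg pair by whether their negations match."""
    a1, a2 = has_negation(arg1_text), has_negation(arg2_text)
    if a1 == a2:
        return "balanced"
    return "neg_in_arg1_only" if a1 else "neg_in_arg2_only"

print(negation_pattern("He did not resign", "He stayed on"))
# → neg_in_arg1_only
```

These labels can then be cross-tabulated against ConnHeadSemClass1 with contingencies().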
The output ranking:
This looks like solid support for Chris's hypothesis that mismatches in negation would associate with Comparison relations.
You can create Datum objects directly from a string. The idea is that you might copy and paste a row out of pdtb.csv so that you can work with it directly. It's unwieldy, but it is useful for looking at specific examples:
The Datum method to_graphviz() will generate a Graphviz representation, as a plain-text file. If you then open that file with Graphviz, you should get a nice image that you can save.
The Graphviz language is easy to learn (I advise studying the gallery of examples), and it is a flexible, powerful way to visualize data.
SEMCLASSES Modify count_semantic_classes() so that it counts only via the highest level semantic classifications, by splitting ConnHeadSemClass1 values on the leftmost period and using only the first element. Provide your modified code and the output counts.
CM Construct a confusion matrix with the Relation types as rows, the ConnHeadSemClass1 values as columns, and the cells representing the number of times that the corresponding row and column values occur together. Are there patterns here that we might take advantage of in experiments predicting Relation types or semantic coherence classes?
WORDS The Datum methods arg1_pos(wn_format=True, lemmatize=True) and arg2_pos(wn_format=True, lemmatize=True) return the stemmed (word, pos) pairs for Arg1 and Arg2. Use these methods to try to find words that are predictive of ConnHeadSemClass1 values. I suggest constructing count dictionaries and looking at the relationships between counts, but feel free to try something more ambitious.
MAINVERB Explore the role that main predicates play in determining coherence relations. To do this, you'll need to:
I leave the exact formulation of this problem to you. You might want to restrict attention to one of the Args, or use both of them. You might want to look only at verbs (and perhaps stem them to address sparseness problems; see wn_string_lemmatizer in wordnet_functions.py). And so forth; just be sure to explain what you did and why; I myself am not sure how best to go about this.
WORDLEEX The Wordle digrams in figure WORDLE depict the nature of the connectives in Explicit, Implicit, and AltLex relations. The pictures are quite different. Pick a connective and offer an explanation for its distribution across these relation categories.
ROOTS Study the relationship between the root node labels for Arg1 and Arg2 and the value of conn_str(). In the spirit of the earlier clause-typing experiment, what do the associations say about clause-types and discourse coherence?
ATT Explore the output of print_attribution_texts(). What semantic relationships recur in the list? What might this tell us about the nature of speaker commitment?