The Penn Discourse Treebank 2.0

  1. Overview
  2. Getting and using the corpus
    1. Downloads
    2. Python classes (preferred)
    3. Working directly with the CSV file (dispreferred but okay)
  3. PDTB structure
    1. Datum attributes
    2. Connectives
      1. Explicit
      2. Implicit
      3. AltLex
      4. EntRel and NoRel
    3. Args
    4. Attribution
    5. Working with text and trees
  4. Some basic analysis
    1. The distribution of semantic classes
    2. A common string representation for connectives
    3. The distribution of attribution values
    4. Relative argument ordering
    5. Hunting for interesting associations
      1. Relation by primary semantic class
      2. Relation by argument order
      3. Argument order by primary semantic class
      4. Explicit connective by precedence argument order
      5. Negation balances and imbalances
  5. Appendix A: Graphviz representations
  6. Exercises

Overview

The Penn Discourse Treebank 2.0 (PDTB) is an incredibly rich resource for studying not only the way discourse coherence is expressed but also how information about discourse commitments (content attribution) is conveyed linguistically. The goal of this section is to provide an overview of the basic structure of the corpus and introduce you to some tools for working with it.

Associated reading:

Code and data:

With great power and richness comes great complexity, so bear with me... Let's start with an intuitive example from Prasad et al. 2008 to get a feel for what the corpus is like:

  1. [Arg1 Factory orders and construction outlays were largely flat in December ] while purchasing agents said [Arg2 manufacturing shrank further in October ].
    1. Arg1: attributed to the writer
    2. Connective: while in the sense of contrast/juxtaposition, with this sense attributed to the writer
    3. Arg2: attributed to purchasing agents

And the example in all its glory (here and throughout, click to enlarge):

figures/pdtb/pdtb-ex-explicit-attr-tree.png
Figure WHILE
The structure of example (16) from Prasad et al. 2008

This figure breaks down into three parts: Arg1, the connective, and Arg2. Within each, there is basic semantic and pragmatic information (black) as well as attribution information (blue). The Args can also have supplementary text (green), though this one doesn't have any.

The nodes containing heavily bracketed strings are the associated Penn Treebank 2.0 parsetrees. Each span of text (the _RawText nodes) has an associated set of such parsetrees.

Getting and using the corpus

Downloads

The Institute has obtained a license for all of us to access the corpus for the purposes of this course, so I suggest that you download it in its usual distribution form:

The PDTB is a complex resource because it pools information from the Penn Treebank 2.0, both its merged parsetrees and its raw text. To make it easier for us to work with the corpus, I've created a single CSV file that pools (most of) the requisite information. I strongly suggest that you work with this version during the course:

This is a large file with a lot of information in it. The basic structure is as follows: each row represents a single item (henceforth datum). There are 54 columns for each datum (though most of them are empty for a given example). These columns determine information like that depicted in figure WHILE.

The PDTB team has released a Java tool for searching and browsing. I myself won't be working with it, but it is flexible and powerful, so you might check it out if you dislike using Python or R.

Python classes (preferred)

The classes:

The main interface provided by pdtb.py is the CorpusReader:

  1. # Import the needed classes:
  2. from pdtb import CorpusReader, Datum
  3. # CorpusReader objects are built from pdtb2.csv.
  4. # The initialization argument is the full path to that file.
  5. # Here I assume that it is in the same directory as pdtb.py:
  6. pdtb = CorpusReader('pdtb2.csv')

The central method for CorpusReader objects is iter_data(), which allows you to iterate through the data in the corpus. Intuitively, iter_data() reads each row of the source csv file pdtb2.csv and turns it into a Datum object, which has lots of methods and attributes for doing cool things.

To test your set-up, paste the following code into a file, adjust the path to pdtb2.csv as needed (if your file is in the same directory as pdtb.py and pdtb2.csv, then you needn't do anything), and then run python on it.

    #!/usr/bin/env python

    from collections import defaultdict
    from pdtb import CorpusReader, Datum

    def relation_count():
        """Calculate and display the distribution of relations."""
        pdtb = CorpusReader('pdtb2.csv')
        # Create a count dictionary of relations.
        d = defaultdict(int)
        for datum in pdtb.iter_data():
            d[datum.Relation] += 1
        # Print the results to standard output.
        for key, val in d.iteritems():
            print key, val

    relation_count()

If all goes well, your output will be the following table of relation-type counts, which is the same as Prasad et al. 2008, table 2 (though see their footnote for why the count for Implicit is slightly different).

    EntRel 5210
    Explicit 18459
    Implicit 16053
    AltLex 624
    NoRel 254

In addition, pdtb.py allows you to create Datum objects directly from a string. For details on this, see Appendix A below.

Working directly with the CSV file (dispreferred but okay)

It is possible to work with the PDTB in a limited way using just pdtb2.csv. Since it is a CSV file, it can be read into a program like Excel or R. For example, the following R code does exactly what the Python function relation_count() does:

    pdtb = read.csv('pdtb2.csv')
    xtabs(~ Relation, data=pdtb)
    Relation
      AltLex   EntRel Explicit Implicit    NoRel
         624     5210    18459    16053      254

As with the Switchboard Dialog Act Corpus, this has the potential to be a useful way of working with the corpus, but it will require you to write a lot of auxiliary functions to deal with the non-string and non-numeric values (trees, Gorn addresses, etc.), whereas pdtb.py does all of this for you.
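
To give a sense of what those auxiliary functions involve, here is a minimal sketch of parsers for the SpanList and GornList cells. The cell formats (character spans like 3635..3670 and Gorn addresses as comma-separated integers, with multiple values joined by semicolons) are assumptions based on the example row shown in Appendix A; pdtb.py handles all of this for you.

    def parse_span_list(cell):
        """'3635..3670;3672..3683' -> [[3635, 3670], [3672, 3683]]"""
        if not cell:
            return []
        return [[int(i) for i in span.split('..')] for span in cell.split(';')]

    def parse_gorn_list(cell):
        """'26,1,1,4,0;26,2' -> [[26, 1, 1, 4, 0], [26, 2]]"""
        if not cell:
            return []
        return [[int(i) for i in addr.split(',')] for addr in cell.split(';')]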

PDTB structure

Let's begin to unpack the PDTB. There is a lot of information to assimilate. The strategy I have taken is to give a high-level overview and some reference diagrams, and then hope that we can come to appreciate the details through some specific case studies.

Each datum in the PDTB has three basic parts: Arg1, the connective, and Arg2. Arg1 and Arg2 are always (heavily annotated) spans of text. The structure of the connective depends on the nature of the relation.

Datum attributes

Table ATTRIBUTES provides a full listing of all the attributes of Datum instances. Thus, if dat is a Datum and att is an attribute, then dat.att will return the corresponding value for that attribute.

Attribute name Object type Applicable relations (None for others) Description
Relation str Explicit|Implicit|AltLex|EntRel|NoRel Explicit|Implicit|AltLex|EntRel|NoRel
Section str Explicit|Implicit|AltLex|EntRel|NoRel 00 .. 24
FileNumber str Explicit|Implicit|AltLex|EntRel|NoRel 4-digit number where digits 1-2 == Section
Connective_SpanList list Explicit|AltLex a list of lists where each member is a pair of integers
Connective_GornList list Explicit|AltLex a list of lists where each member is a sequence of integers
Connective_RawText str Explicit|AltLex raw text (same as obtainable with Connective_SpanList)
Connective_Trees list Explicit|AltLex a list of nltk.tree.Tree objects (same as obtainable with Connective_GornList)
Connective_StringPosition int Implicit|EntRel|NoRel the position of the inferred relation for Implicit
SentenceNumber int Implicit|EntRel|NoRel number of the source sentence in the raw and parsed files
ConnHead str Explicit the head of the connective string (which could be phrasal)
Conn1 str Implicit the obligatory inferred connective
Conn2 str Implicit an optional second connective
ConnHeadSemClass1 str Explicit|Implicit|AltLex the semantic class of Conn1; see the gray table in EXPLICIT
ConnHeadSemClass2 str Explicit|Implicit|AltLex optional second semantic class for Conn1, drawn from the same values as ConnHeadSemClass1
Conn2SemClass1 str Implicit the semantic class of Conn2 if present; see the gray table in EXPLICIT
Conn2SemClass2 str Implicit optional second semantic class for Conn2 if present, drawn from the same values as ConnHeadSemClass1
Attribution_Source str Explicit|Implicit|AltLex Wr|Ot|Arb
Attribution_Type str Explicit|Implicit|AltLex Comm|PAtt|Ftv|Ctrl
Attribution_Polarity str Explicit|Implicit|AltLex Neg|Null
Attribution_Determinacy str Explicit|Implicit|AltLex Indet|Null
Attribution_SpanList list Explicit|Implicit|AltLex a list of lists where each member is a pair of integers
Attribution_GornList list Explicit|Implicit|AltLex a list of lists where each member is a sequence of integers
Attribution_RawText str Explicit|Implicit|AltLex raw text (same as obtainable with Attribution_SpanList)
Arg1_SpanList list Explicit|Implicit|AltLex|EntRel|NoRel a list of lists where each member is a pair of integers
Arg1_GornList list Explicit|Implicit|AltLex|EntRel|NoRel a list of lists where each member is a sequence of integers
Arg1_RawText str Explicit|Implicit|AltLex|EntRel|NoRel raw text (same as obtainable with Arg1_SpanList)
Arg1_Trees list Explicit|Implicit|AltLex|EntRel|NoRel a list of nltk.tree.Tree objects (same as obtainable with Arg1_GornList)
Arg1_Attribution_Source str Explicit|Implicit|AltLex Wr|Ot|Arb|Inh
Arg1_Attribution_Type str Explicit|Implicit|AltLex Comm|PAtt|Ftv|Ctrl
Arg1_Attribution_Polarity str Explicit|Implicit|AltLex Neg|Null
Arg1_Attribution_Determinacy str Explicit|Implicit|AltLex Indet|Null
Arg1_Attribution_SpanList list Explicit|Implicit|AltLex a list of lists where each member is a pair of integers
Arg1_Attribution_GornList list Explicit|Implicit|AltLex a list of lists where each member is a sequence of integers
Arg1_Attribution_RawText str Explicit|Implicit|AltLex raw text (same as obtainable with Arg1_Attribution_SpanList)
Arg1_Attribution_Trees list Explicit|Implicit|AltLex list of nltk.tree.Tree objects (same as obtainable with Arg1_Attribution_GornList)
Arg2_SpanList list Explicit|Implicit|AltLex|EntRel|NoRel a list of lists where each member is a pair of integers
Arg2_GornList list Explicit|Implicit|AltLex|EntRel|NoRel a list of lists where each member is a sequence of integers
Arg2_RawText str Explicit|Implicit|AltLex|EntRel|NoRel raw text (same as obtainable with Arg2_SpanList)
Arg2_Trees list Explicit|Implicit|AltLex|EntRel|NoRel a list of nltk.tree.Tree objects (same as obtainable with Arg2_GornList)
Arg2_Attribution_Source str Explicit|Implicit|AltLex Wr|Ot|Arb|Inh
Arg2_Attribution_Type str Explicit|Implicit|AltLex Comm|PAtt|Ftv|Ctrl
Arg2_Attribution_Polarity str Explicit|Implicit|AltLex Neg|Null
Arg2_Attribution_Determinacy str Explicit|Implicit|AltLex Indet|Null
Arg2_Attribution_SpanList list Explicit|Implicit|AltLex a list of lists where each member is a pair of integers
Arg2_Attribution_GornList list Explicit|Implicit|AltLex a list of lists where each member is a sequence of integers
Arg2_Attribution_RawText str Explicit|Implicit|AltLex raw text (same as obtainable with Arg2_Attribution_SpanList)
Sup1_SpanList list Explicit|Implicit|AltLex a list of lists where each member is a pair of integers
Sup1_GornList list Explicit|Implicit|AltLex a list of lists where each member is a sequence of integers
Sup1_RawText str Explicit|Implicit|AltLex optional supporting text for Arg1 (same as obtainable with Sup1_SpanList)
Sup1_Trees list Explicit|Implicit|AltLex list of nltk.tree.Tree objects (same as obtainable with Sup1_GornList)
Sup2_SpanList list Explicit|Implicit|AltLex a list of lists where each member is a pair of integers
Sup2_GornList list Explicit|Implicit|AltLex a list of lists where each member is a sequence of integers
Sup2_RawText str Explicit|Implicit|AltLex optional supporting text for Arg2 (same as obtainable with Sup2_SpanList)
Sup2_Trees list Explicit|Implicit|AltLex list of nltk.tree.Tree objects (same as obtainable with Sup2_GornList)
Table ATTRIBUTES
The attributes of Datum objects from pdtb.py.
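
As a quick illustration of how these attributes are accessed, here is a minimal sketch (assuming pdtb2.csv is in the working directory) that prints the connective head and primary semantic class for the first few Explicit data:

    from pdtb import CorpusReader

    def preview_explicit(n=5):
        pdtb = CorpusReader('pdtb2.csv')
        seen = 0
        for datum in pdtb.iter_data(display_progress=False):
            if datum.Relation == 'Explicit':
                # ConnHead and ConnHeadSemClass1 are defined for Explicit data.
                print datum.ConnHead, datum.ConnHeadSemClass1
                seen += 1
                if seen >= n:
                    break

    preview_explicit()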

The next few subsections work to make this clearer with diagrams.

Connectives

There are five types of connective: Explicit, Implicit, AltLex, EntRel, and NoRel. The following characterizations and examples are from Prasad et al. 2008.

Explicit

Prasad et al. 2008: "Explicit connectives are drawn from three grammatical classes: subordinating conjunctions (e.g., because, when, etc.), coordinating conjunctions (e.g., and, or, etc.), and discourse adverbials (e.g., for example, instead, etc.)."

Here is an example, shown without its trees and Args.

  1. [Arg1 that hung over parts of the factory ] even though [Arg2 exhaust fans ventilated the area ].
figures/pdtb/pdtb-conn-explicit.png
Figure EXPLICIT_EX
An Explicit connective.

Here is the abstract structure of such connectives:

figures/pdtb-relation-explicit.png
Figure EXPLICIT
Explicit

Implicit

Prasad et al. 2008: "[S]uch inferred relations are annotated by inserting a connective expression — called an 'Implicit' connective — that best expresses the inferred relation"

  1. But a few funds have taken other defensive steps. [Arg1 Some have raised their cash positions to record levels ]. Implicit = BECAUSE [Arg2 High cash positions help buffer a fund when the market falls ].
figures/pdtb/pdtb-conn-implicit.png
Figure IMPLICIT_EX
Example (6) from Prasad et al. 2008

Figure IMPLICIT gives the abstract structure of such connectives:

figures/pdtb/pdtb-implicit.png
Figure IMPLICIT
Implicit

AltLex

Prasad et al. 2008: "the insertion of an Implicit connective to express an inferred relation led to a redundancy due to the relation being alternatively lexicalized by some non-connective expression"

  1. [Arg1 Ms. Bartlett's previous work, which earned her an international reputation in the non-horticultural art world, often took gardens as its nominal subject ]. [Arg2 Mayhap this metaphorical connection made the BPC Fine Arts Committee think she had a literal green thumb ].
figures/pdtb/pdtb-conn-altlex.png
Figure ALTLEX_EX
Example (7) from Prasad et al. 2008

Figure ALTLEX depicts the general structure:

figures/pdtb/pdtb-altlex.png
Figure ALTLEX
AltLex

EntRel and NoRel

Prasad et al. 2008 on EntRel: "only an entity-based coherence relation could be perceived between the sentences"

  1. [Arg1 Hale Milgrim, 41 years old, senior vice president, marketing at Elecktra Entertainment Inc., was named president of Capitol Records Inc., a unit of this entertainment concern ]. [Arg2 Mr. Milgrim succeeds David Berman, who resigned last month ].

Prasad et al. 2008 on NoRel: "no discourse relation or entity-based relation could be perceived between the sentences"

  1. [Arg1 Jacobs is an international engineering and construction concern ]. [Arg2 Total capital investment at the site could be as much as $400 million, according to Intel ].

Very few things are defined for these, as the abstract structure, figure ENTNO, shows.

figures/pdtb/pdtb-entrelnorel.png
Figure ENTNO
EntRel and NoRel

Args

The arguments each have (i) basic attributes for their content (the raw text and the trees) as well as (ii) attribution information and (iii) supplementary text helping to contextualize the content.

The arguments always have basic information. Supplementary text is somewhat rarer. Attribution text is always present for Implicit, Explicit, and AltLex, and it is always absent for EntRel and NoRel.

Figure ARG shows the full structure, though without the Attribution information. Each edge label corresponds to an attribute of Datum objects as long as you insert 1 or 2 after Arg for each one.

figures/pdtb/pdtb-argument.png
Figure ARG
Arg
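
To get a feel for how much rarer supplementary text is, here is a small sketch (assuming pdtb2.csv is in the working directory) that counts the data with non-empty Sup1 or Sup2 text:

    from pdtb import CorpusReader

    def count_sups():
        pdtb = CorpusReader('pdtb2.csv')
        sup_count = 0
        total = 0
        for datum in pdtb.iter_data(display_progress=False):
            total += 1
            # Sup1_RawText and Sup2_RawText are empty where there is no
            # supplementary text.
            if datum.Sup1_RawText or datum.Sup2_RawText:
                sup_count += 1
        print sup_count, 'of', total, 'data have supplementary text'

    count_sups()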

Attribution

Figure ATTRIBUTION breaks down the structure of attributions, showing the relevant attributes as edge labels (insert 1 or 2 after Arg) and the range of values on the nodes, with description.

The Arg_Attribution_Source value Inh means that the attribution is inherited from the connective. The Datum methods final_arg1_attribution_source() and final_arg2_attribution_source() handle this for you, providing the final attribution value in every case.

figures/pdtb/pdtb-attribution.png
Figure ATTRIBUTION
Attribution
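
Here is a minimal sketch (again assuming pdtb2.csv is in the working directory) that uses final_arg1_attribution_source() to tabulate Arg1 attribution sources with Inh values resolved to the connective's source:

    from collections import defaultdict
    from pdtb import CorpusReader

    def final_arg1_source_counts():
        pdtb = CorpusReader('pdtb2.csv')
        d = defaultdict(int)
        for datum in pdtb.iter_data(display_progress=False):
            # Attribution is defined only for Explicit, Implicit, and AltLex.
            if datum.Relation in ('Explicit', 'Implicit', 'AltLex'):
                d[datum.final_arg1_attribution_source()] += 1
        return d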

Working with text and trees

There are lots of different kinds of text to work with:

For each of these, there are associated methods for getting at their structure. For example:

  1. from pdtb import Datum
  2. s = '''Explicit,00,03,3672..3683,"26,1,1,4,1,1,3,0;26,1,1,4,1,1,3,1", (RB even) ||| (IN though) ,even though,,,though,,,Comparison.Concession.Expectation,,,,Wr,Comm,Null,Null,,,,,3635..3670,"26,1,1,4,0;26,1,1,4,1,0;26,1,1,4,1,1,0;26,1,1,4,1,1,1;26,1,1,4,1,1,2"," (WHNP-1 (WDT that) ) ||| (NP-SBJ (-NONE- *T*-1) ) ||| (VBD hung) ||| (PP-LOC (IN over) (NP (NP (NNS parts) ) (PP (IN of) (NP (DT the) (NN factory) ) ) ) ) ||| (, ,) ",that hung over parts of the factory,Ot,Comm,Null,Null,3595..3612,"26,0;26,1,0;26,1,1,0;26,1,1,3;26,2", (NP-SBJ (NNS Workers) ) ||| (VBD described) ||| (`` ``) ||| ('' '') ||| (. .) ,Workers described,3684..3716,"26,1,1,4,1,1,3,2", (S (NP-SBJ (NN exhaust) (NNS fans) ) (VP (VBD ventilated) (NP (DT the) (NN area) ) ) ) ,exhaust fans ventilated the area,Inh,Null,Null,Null,,,,,,,,,,,,,"that hung over parts of the factory, even though exhaust fans ventilated the area"'''
  3. d = Datum(s)
  4. d.arg1_words()
  5. ['that', '*T*-1', 'hung', 'over', 'parts', 'of', 'the', 'factory', ',']
  6. d.arg1_words(lemmatize=True)
  7. ['that', '*T*-1', 'hang', 'over', 'part', 'of', 'the', 'factory', ',']
  8. d.arg1_pos(wn_format=True)
  9. [('that', 'wdt'), ('*T*-1', '-none-'), ('hung', 'v'), ('over', 'in'), ('parts', 'n'), ('of', 'in'), ('the', 'dt'), ('factory', 'n'), (',', ',')]
  10. d.arg1_pos(lemmatize=True)
  11. [('that', 'wdt'), ('*T*-1', '-none-'), ('hang', 'v'), ('over', 'in'), ('part', 'n'), ('of', 'in'), ('the', 'dt'), ('factory', 'n'), (',', ',')]

There are similarly named methods for Sups, connectives, and attributions.

The SpanList and GornList attributes are for connecting with the Penn Treebank files. The relevant material is already inserted into the CSV file and accessible via the _RawText and _Trees attributes, so you probably won't need it, but it is there just in case you need to connect with the external files.
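
If you do go back to the external files, a Gorn address works as follows (as I understand it): the leading integer picks out the tree in the source file, and the remaining integers trace a path of child indices within that tree. nltk Tree objects support such paths directly via tuple indexing. A toy illustration (the tree here is made up, not drawn from the corpus):

    from nltk.tree import Tree

    # Tree.fromstring is the nltk 3 name; older versions call this Tree.parse.
    t = Tree.fromstring('(S (NP (DT the) (NN factory)) (VP (VBD closed)))')
    # Follow child 1 (the VP), then that node's child 0:
    print t[(1, 0)]    # (VBD closed)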

Some basic analysis

The distribution of semantic classes

The function count_semantic_classes() looks at the values for ConnHeadSemClass1, which gives the primary sense for the connective for Implicit, Explicit, and AltLex data.

    #!/usr/bin/env python

    from collections import defaultdict
    from pdtb import CorpusReader

    def count_semantic_classes():
        """Count ConnHeadSemClass1 values."""
        pdtb = CorpusReader('pdtb2.csv')
        d = defaultdict(int)
        for datum in pdtb.iter_data():
            sc = datum.ConnHeadSemClass1
            # Filter None values (should be just EntRel/NoRel data).
            if sc:
                d[sc] += 1
        return d

The following uses count_semantic_classes to sort the output from most to least frequent and put it into a CSV file.

    import csv

    def count_semantic_classes_to_csv(output_filename):
        """Write the results of count_semantic_classes() to a CSV file."""
        # Create the CSV writer.
        csvwriter = csv.writer(file(output_filename, 'w'))
        # Add the header row.
        csvwriter.writerow(['ConnHeadSemClass1', 'Count'])
        # Get the counts.
        d = count_semantic_classes()
        # Sort by name so that we can perhaps see trends in the
        # super-categories.
        for sem, count in sorted(d.items()):
            csvwriter.writerow([sem, count])

    count_semantic_classes_to_csv('ConnHeadSemClass1.csv')

The output has 41 lines, one for each semantic class seen in the gray and black tables for Explicit, Implicit, and AltLex. The following R code turns this into a barplot with (I think) readable labels:

  1. sem = read.csv('ConnHeadSemClass1.csv')
  2. par(mar=c(18,4,2,2))
  3. barplot(sem$Count, names.arg=sem$ConnHeadSemClass1, main='ConnHeadSemClass1', las=3, cex.names=0.8)
figures/pdtb/ConnHeadSemClass1.png
Figure ConnHeadSemClass1
ConnHeadSemClass1

A common string representation for connectives

There is no single attribute of Datum objects that provides an accurate high-level summary of a connective's nature.

Therefore, Datum objects include the method conn_str() for getting at such a summary string directly. The following code uses this method to gather counts of all of the different regularized heads:

    #!/usr/bin/env python

    from collections import defaultdict
    from pdtb import CorpusReader

    def connective_distribution():
        """Counts of connectives by relation type."""
        pdtb = CorpusReader('pdtb2.csv')
        d = defaultdict(lambda : defaultdict(int))
        for datum in pdtb.iter_data():
            cs = datum.conn_str(distinguish_implicit=False)
            # Filter None values (should be just EntRel/NoRel data).
            if cs:
                # Downcase for further collapsing, and add 1.
                d[datum.Relation][cs.lower()] += 1
        return d

The resulting dictionary is long:

  1. Explicit: 100 distinct connectives
  2. Implicit: 201 distinct connectives
  3. AltLex: 477 distinct connectives

To summarize this, here is a function for mapping the dictionary into a format that Wordle can understand:

    from operator import itemgetter

    def connective_distribution2wordle(d):
        """
        Map the dictionary returned by connective_distribution() to a
        Wordle format. The return value is a string; the sublists it
        contains can be pasted in at http://www.wordle.net/advanced.
        """
        s = ''
        # Print lists of words with the relation type as the header.
        for rel, counts in d.items():
            s += '======================================================================\n'
            s += rel + '\n'
            s += '======================================================================\n'
            # Map the counts dict to a list of pairs via items() and sort on
            # the second member (index 1) of those pairs, largest to smallest.
            sorted_counts = sorted(counts.items(), key=itemgetter(1), reverse=True)
            # Print the result in Wordle format.
            for conn, c in sorted_counts:
                # Spacing is hard to interpret in Wordle. This should help.
                conn = conn.replace(' ', '_')
                # Append to the growing string.
                s += '%s:%s\n' % (conn, c)
        return s

Here is the result of pasting the sublists into the Wordle advanced function:

figures/pdtb/conn-counts-wordle-explicit.png figures/pdtb/conn-counts-wordle-implicit.png figures/pdtb/conn-counts-wordle-altlex.png
Figure WORDLE
Wordle representations of the connectives, by relation type: Explicit (left), Implicit (middle), and AltLex (right).

The distribution of attribution values

The following code computes the basic distribution of attribution source values.

    #!/usr/bin/env python

    from collections import defaultdict
    from pdtb import CorpusReader

    def attribution_counts():
        """Create a count dictionary of non-null attribution values."""
        pdtb = CorpusReader('pdtb2.csv')
        d = defaultdict(int)
        for datum in pdtb.iter_data():
            src = datum.Attribution_Source
            if src:
                d[src] += 1
        return d

And the output:

    Arb 65
    Ot 6539
    Wr 28532

A simple look at the nature of the attributions:

    #!/usr/bin/env python

    from pdtb import CorpusReader

    def print_attribution_texts():
        """Inspect the strings characterizing attribution values."""
        pdtb = CorpusReader('pdtb2.csv')
        for datum in pdtb.iter_data(display_progress=False):
            txt = datum.Attribution_RawText
            if txt:
                print txt

The top of the output of print_attribution_texts() (the frequent repeats are due to the fact that individual spans of text are often involved in multiple relationships):

    researchers said
    A Lorillard spokewoman said
    A Lorillard spokewoman said
    said Darrell Phillips, vice president of human resources for Hollingsworth & Vose
    said Darrell Phillips, vice president of human resources for Hollingsworth & Vose
    Longer maturities are thought
    Shorter maturities are considered
    considered by some
    said Brenda Malizia Negus, editor of Money Fund Report
    the Treasury said
    The Treasury said
    Newsweek said
    said Mr. Spoon
    According to Audit Bureau of Circulations
    According to Audit Bureau of Circulations
    saying that
    . . .

Relative argument ordering

The Datum method relative_arg_order() determines the ordering of Arg1 and Arg2. Its values:

  1. arg1_precedes_arg2: Arg1 completely precedes Arg2 (perhaps with connective and attribution values intervening).
  2. arg1_contains_arg2: Arg1 properly contains Arg2.
  3. arg1_precedes_and_overlaps_but_does_not_contain_arg2: Arg1 both begins and ends before Arg2.
  4. arg2_precedes_arg1: Arg2 completely precedes Arg1 (perhaps with connective and attribution values intervening).
  5. arg2_contains_arg1: Arg2 properly contains Arg1.
  6. arg2_precedes_and_overlaps_but_does_not_contain_arg1: Arg2 both begins and ends before Arg1.

The function distribution_of_relative_arg_order() in pdtb_functions.py calculates the distribution of these orderings. Here is the output:

  1. arg1_precedes_arg2 38041
  2. arg2_precedes_arg1 1763
  3. arg1_contains_arg2 765
  4. arg2_contains_arg1 31

As expected, it is most common for Arg1 to precede Arg2. The two 'overlap' relations, arg2_precedes_and_overlaps_but_does_not_contain_arg1 and arg1_precedes_and_overlaps_but_does_not_contain_arg2, turn out to be non-attested.

The Datum methods arg1_precedes_arg2(), arg1_contains_arg2(), arg1_precedes_and_overlaps_but_does_not_contain_arg2(), arg2_precedes_arg1(), arg2_contains_arg1(), and arg2_precedes_and_overlaps_but_does_not_contain_arg1() allow for quick identification of these subsets. Each returns True or False.
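
These predicates make it easy to pull out the unusual cases. For instance, here is a small sketch (assuming pdtb2.csv is in the working directory) that counts the connectives of Explicit data in which Arg2 precedes Arg1, anticipating the contingency study below:

    from collections import defaultdict
    from pdtb import CorpusReader

    def arg2_first_connectives():
        pdtb = CorpusReader('pdtb2.csv')
        d = defaultdict(int)
        for datum in pdtb.iter_data(display_progress=False):
            if datum.Relation == 'Explicit' and datum.arg2_precedes_arg1():
                d[datum.conn_str().lower()] += 1
        return d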

Hunting for interesting associations

The best way to get acquainted with a new large-scale corpus resource is to hypothesize that important relationships might exist between two of its properties and then test to see whether the hypothesis holds up.

The function contingencies() in pdtb_functions.py seeks to provide a general method for exploring such relationships at the level of properties of Datum objects.

This function takes as its arguments two functions on Datum instances and calculates the observed/expected (O/E) values for the two classes of results. The print-out includes the observed contingency table, the expected contingency table, and an ordered list of O/E values. Values of None are ignored and can thus be used as a filter.
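
The real implementation is in pdtb_functions.py; the following is only a schematic sketch of the core observed/expected calculation, included so that the tables below are easier to interpret (it is not the actual contingencies() code):

    from collections import defaultdict
    from pdtb import CorpusReader

    def observed_expected(f, g):
        """Schematic O/E calculation: f and g map Datum objects to
        category labels, with None acting as a filter."""
        obs = defaultdict(int)
        row_totals = defaultdict(int)
        col_totals = defaultdict(int)
        n = 0
        pdtb = CorpusReader('pdtb2.csv')
        for datum in pdtb.iter_data(display_progress=False):
            x, y = f(datum), g(datum)
            if x is None or y is None:
                continue
            obs[(x, y)] += 1
            row_totals[x] += 1
            col_totals[y] += 1
            n += 1
        # Expected count under independence: (row total * column total) / n.
        oe = {}
        for x in row_totals:
            for y in col_totals:
                expected = row_totals[x] * col_totals[y] / float(n)
                oe[(x, y)] = obs[(x, y)] / expected
        return oe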

Relation by primary semantic class

The following code tests for associations between relation-type and the primary semantic classes:

  1. import pdtb_functions
  2. def get_Relation(datum): return datum.Relation
  3. def get_primary_semclass1(datum): return datum.primary_semclass1()
  4. pdtb_functions.contingencies(get_Relation, get_primary_semclass1)

The output:

  1. ======================================================================
  2. Observed
  3. Comparison Contingency Expansion Temporal
  4. AltLex 46.0 275.0 217.0 86.0
  5. Explicit 5471.0 3250.0 6298.0 3440.0
  6. Implicit 2441.0 4185.0 8601.0 826.0
  7. --------------------------------------------------
  8. Expected
  9. Comparison Contingency Expansion Temporal
  10. AltLex 141.33 136.93 268.45 77.29
  11. Explicit 4180.8 4050.51 7941.32 2286.36
  12. Implicit 3635.87 3522.56 6906.23 1988.35
  13. --------------------------------------------------
  14. O/E
  15. ('AltLex', 'Contingency') 2.00838072433
  16. ('Explicit', 'Temporal') 1.50457452606
  17. ('Explicit', 'Comparison') 1.30860003806
  18. ('Implicit', 'Expansion') 1.24539803789
  19. ('Implicit', 'Contingency') 1.18805677982
  20. ('AltLex', 'Temporal') 1.1126979638
  21. ('AltLex', 'Expansion') 0.808333502962
  22. ('Explicit', 'Contingency') 0.80236713482
  23. ('Explicit', 'Expansion') 0.793067077948
  24. ('Implicit', 'Comparison') 0.671366948954
  25. ('Implicit', 'Temporal') 0.415419877538
  26. ('AltLex', 'Comparison') 0.325477990218
  27. ======================================================================

Values above 1 indicate that the observed values are larger than we would expect given the null hypothesis that the two variables are independent of each other. Values below 1 indicate that the observed values are smaller than we would expect given this hypothesis.

The table of observed counts can be the input to chisq.test and g.test in R. However, the counts involved in these tables are so large that the null hypothesis is almost certain to look false, so a more qualitative assessment might be called for, followed by more articulated regression modeling.

Relation by argument order

  1. import pdtb_functions
  2. def get_Relation(datum): return datum.Relation
  3. def get_arg_order(datum): return datum.relative_arg_order()
  4. pdtb_functions.contingencies(get_Relation, get_arg_order)

The output:

  1. ======================================================================
  2. Observed
  3. arg1_contains_arg2 arg1_precedes_arg2 arg2_contains_arg1 arg2_precedes_arg1
  4. AltLex 0.0 623.0 0.0 1.0
  5. EntRel 0.0 5210.0 0.0 0.0
  6. Explicit 765.0 15901.0 31.0 1762.0
  7. Implicit 0.0 16053.0 0.0 0.0
  8. NoRel 0.0 254.0 0.0 0.0
  9. --------------------------------------------------
  10. Expected
  11. arg1_contains_arg2 arg1_precedes_arg2 arg2_contains_arg1 arg2_precedes_arg1
  12. AltLex 11.76 584.67 0.48 27.1
  13. EntRel 98.17 4881.62 3.98 226.24
  14. Explicit 347.81 17295.54 14.09 801.56
  15. Implicit 302.48 15041.19 12.26 697.08
  16. NoRel 4.79 237.99 0.19 11.03
  17. --------------------------------------------------
  18. O/E
  19. ('Explicit', 'arg1_contains_arg2') 2.19946909367
  20. ('Explicit', 'arg2_contains_arg1') 2.19946909367
  21. ('Explicit', 'arg2_precedes_arg1') 2.19822152186
  22. ('NoRel', 'arg1_precedes_arg2') 1.06726952499
  23. ('Implicit', 'arg1_precedes_arg2') 1.06726952499
  24. ('EntRel', 'arg1_precedes_arg2') 1.06726952499
  25. ('AltLex', 'arg1_precedes_arg2') 1.06555915716
  26. ('Explicit', 'arg1_precedes_arg2') 0.919370102216
  27. ('AltLex', 'arg2_precedes_arg1') 0.0369053332752
  28. ('AltLex', 'arg2_contains_arg1') 0.0
  29. ('EntRel', 'arg2_contains_arg1') 0.0
  30. ('Implicit', 'arg2_contains_arg1') 0.0
  31. ('NoRel', 'arg1_contains_arg2') 0.0
  32. ('EntRel', 'arg2_precedes_arg1') 0.0
  33. ('AltLex', 'arg1_contains_arg2') 0.0
  34. ('NoRel', 'arg2_precedes_arg1') 0.0
  35. ('Implicit', 'arg2_precedes_arg1') 0.0
  36. ('EntRel', 'arg1_contains_arg2') 0.0
  37. ('Implicit', 'arg1_contains_arg2') 0.0
  38. ('NoRel', 'arg2_contains_arg1') 0.0

Argument order by primary semantic class

The following code tests for associations between relative argument ordering and the primary semantic class:

  1. import pdtb_functions
  2. def get_arg_order(datum): return datum.relative_arg_order()
  3. def get_primary_semclass1(datum): return datum.primary_semclass1()
  4. pdtb_functions.contingencies(get_arg_order, get_primary_semclass1)

The output:

  1. ======================================================================
  2. Observed
  3. Comparison Contingency Expansion Temporal
  4. arg1_contains_arg2 157.0 286.0 26.0 296.0
  5. arg1_precedes_arg2 7302.0 6745.0 15049.0 3481.0
  6. arg2_contains_arg1 10.0 8.0 7.0 6.0
  7. arg2_precedes_arg1 489.0 671.0 34.0 569.0
  8. --------------------------------------------------
  9. Expected
  10. Comparison Contingency Expansion Temporal
  11. arg1_contains_arg2 173.27 167.87 329.11 94.75
  12. arg1_precedes_arg2 7378.41 7148.47 14015.08 4035.04
  13. arg2_contains_arg1 7.02 6.8 13.34 3.84
  14. arg2_precedes_arg1 399.3 386.86 758.47 218.37
  15. --------------------------------------------------
  16. O/E
  17. ('arg1_contains_arg2', 'Temporal') 3.12387543253
  18. ('arg2_precedes_arg1', 'Temporal') 2.60569383738
  19. ('arg2_precedes_arg1', 'Contingency') 1.73447541443
  20. ('arg1_contains_arg2', 'Contingency') 1.70373693446
  21. ('arg2_contains_arg1', 'Temporal') 1.56261859583
  22. ('arg2_contains_arg1', 'Comparison') 1.424251514
  23. ('arg2_precedes_arg1', 'Comparison') 1.22463010214
  24. ('arg2_contains_arg1', 'Contingency') 1.17605121125
  25. ('arg1_precedes_arg2', 'Expansion') 1.07377178874
  26. ('arg1_precedes_arg2', 'Comparison') 0.989644292634
  27. ('arg1_precedes_arg2', 'Contingency') 0.943558446203
  28. ('arg1_contains_arg2', 'Comparison') 0.906121845572
  29. ('arg1_precedes_arg2', 'Temporal') 0.862693184834
  30. ('arg2_contains_arg1', 'Expansion') 0.524870037303
  31. ('arg1_contains_arg2', 'Expansion') 0.0790000466977
  32. ('arg2_precedes_arg1', 'Expansion') 0.0448272440902

Explicit connective by precedence argument order

The following explores the relationship between argument ordering and the connective's canonical name:

    import pdtb_functions

    def get_explicit_conn_str(datum):
        if datum.Relation == 'Explicit':
            return datum.conn_str()
        else:
            return None

    def get_precedence_arg_order(datum):
        order = datum.relative_arg_order()
        if order in ('arg1_precedes_arg2', 'arg2_precedes_arg1'):
            return order
        else:
            return None

    pdtb_functions.contingencies(get_explicit_conn_str, get_precedence_arg_order)

Here's a sample of the ranked output (the contingency tables are too large to display usefully here):

  1. O/E
  2. ('when and if', 'arg2_precedes_arg1') 10.0244040863
  3. ('insofar as', 'arg2_precedes_arg1') 10.0244040863
  4. ('although', 'arg2_precedes_arg1') 5.86712613745
  5. ('once', 'arg2_precedes_arg1') 5.66596752702
  6. ('besides', 'arg2_precedes_arg1') 5.56911338126
  7. ('if', 'arg2_precedes_arg1') 4.95217567136
  8. ('now that', 'arg2_precedes_arg1') 4.71736662883
  9. ('whereas', 'arg2_precedes_arg1') 4.00976163451
  10. ('when', 'arg2_precedes_arg1') 3.7854393053
  11. ('while', 'arg2_precedes_arg1') 3.47966759272
  12. ('since', 'arg2_precedes_arg1') 3.2414240758
  13.  
  14. [...]
  15.  
  16. ('until', 'arg1_precedes_arg2') 1.0259079167
  17. ('overall', 'arg1_precedes_arg2') 1.01824308744
  18. ('except', 'arg2_precedes_arg1') 1.00244040863
  19. ('except', 'arg1_precedes_arg2') 0.999729576756
  20. ('before', 'arg1_precedes_arg2') 0.984842423838
  21. ('as long as', 'arg1_precedes_arg2') 0.971959310735
  22. ('as soon as', 'arg1_precedes_arg2') 0.971959310735
  23.  
  24. [...]
  25.  
  26. # Random selection of the 0-valued elements:
  27.  
  28. ('either or', 'arg2_precedes_arg1') 0.0
  29. ('by then', 'arg2_precedes_arg1') 0.0
  30. ('meanwhile', 'arg2_precedes_arg1') 0.0
  31. ('ultimately', 'arg2_precedes_arg1') 0.0
  32. ('thus', 'arg2_precedes_arg1') 0.0
  33. ('further', 'arg2_precedes_arg1') 0.0
  34. ('lest', 'arg2_precedes_arg1') 0.0
  35. ('regardless', 'arg2_precedes_arg1') 0.0
  36. ('moreover', 'arg2_precedes_arg1') 0.0
  37. ('And', 'arg2_precedes_arg1') 0.0
  38. ('thereby', 'arg2_precedes_arg1') 0.0

Negation balances and imbalances

Chris Brown suggested looking at patterns of negation across the two Args — both where the negations are imbalanced and where they are balanced. The following adapts his own code for exploring this issue:

    import re
    import pdtb_functions

    def get_negation_balance(datum):
        neg_search = re.compile(r"(\bnot?\b)|(n't\b)|(\bneither\b)|(\bnever\b)", re.IGNORECASE)
        a1 = 'POSITIVE'
        a2 = 'POSITIVE'
        if neg_search.search(datum.Arg1_RawText):
            a1 = 'NEGATED'
        if neg_search.search(datum.Arg2_RawText):
            a2 = 'NEGATED'
        return (a1, a2)

    def get_primary_semclass1(datum):
        return datum.primary_semclass1()

    pdtb_functions.contingencies(get_negation_balance, get_primary_semclass1)

The output ranking:

  1. O/E
  2. (('POSITIVE', 'NEGATED'), 'Comparison') 1.58401926288
  3. (('NEGATED', 'NEGATED'), 'Contingency') 1.27867520864
  4. (('NEGATED', 'POSITIVE'), 'Comparison') 1.2168104343
  5. (('NEGATED', 'POSITIVE'), 'Contingency') 1.19215210451
  6. (('POSITIVE', 'POSITIVE'), 'Temporal') 1.18743450321
  7. (('POSITIVE', 'NEGATED'), 'Contingency') 1.16698575247
  8. (('NEGATED', 'NEGATED'), 'Expansion') 1.12699367317
  9. (('POSITIVE', 'POSITIVE'), 'Expansion') 1.03542149117
  10. (('NEGATED', 'NEGATED'), 'Comparison') 0.981151042976
  11. (('POSITIVE', 'POSITIVE'), 'Contingency') 0.940714688988
  12. (('NEGATED', 'POSITIVE'), 'Expansion') 0.897101503701
  13. (('POSITIVE', 'POSITIVE'), 'Comparison') 0.887653120057
  14. (('POSITIVE', 'NEGATED'), 'Expansion') 0.820563740053
  15. (('NEGATED', 'POSITIVE'), 'Temporal') 0.620529298736
  16. (('POSITIVE', 'NEGATED'), 'Temporal') 0.259483699411
  17. (('NEGATED', 'NEGATED'), 'Temporal') 0.0996732026144

This looks like solid support for Chris's hypothesis that mismatches in negation would associate with Comparison relations.

Appendix A: Graphviz representations

You can create Datum objects directly from a string. The idea is that you might copy and paste a row out of pdtb2.csv so that you can work with it directly. It's unwieldy, but it is useful for looking at specific examples:

  1. from pdtb import *
  2. # Use triple single quotes to avoid clashes with the internal punctuation of the example.
  3. s = '''Explicit,00,03,3672..3683,"26,1,1,4,1,1,3,0;26,1,1,4,1,1,3,1", (RB even) ||| (IN though) ,even though,,,though,,,Comparison.Concession.Expectation,,,,Wr,Comm,Null,Null,,,,,3635..3670,"26,1,1,4,0;26,1,1,4,1,0;26,1,1,4,1,1,0;26,1,1,4,1,1,1;26,1,1,4,1,1,2"," (WHNP-1 (WDT that) ) ||| (NP-SBJ (-NONE- *T*-1) ) ||| (VBD hung) ||| (PP-LOC (IN over) (NP (NP (NNS parts) ) (PP (IN of) (NP (DT the) (NN factory) ) ) ) ) ||| (, ,) ",that hung over parts of the factory,Ot,Comm,Null,Null,3595..3612,"26,0;26,1,0;26,1,1,0;26,1,1,3;26,2", (NP-SBJ (NNS Workers) ) ||| (VBD described) ||| (`` ``) ||| ('' '') ||| (. .) ,Workers described,3684..3716,"26,1,1,4,1,1,3,2", (S (NP-SBJ (NN exhaust) (NNS fans) ) (VP (VBD ventilated) (NP (DT the) (NN area) ) ) ) ,exhaust fans ventilated the area,Inh,Null,Null,Null,,,,,,,,,,,,,"that hung over parts of the factory, even though exhaust fans ventilated the area"'''
  4. d = Datum(s)
  5. d.Arg1_RawText
  6. 'that hung over parts of the factory'
  7. d.Arg2_RawText
  8. 'exhaust fans ventilated the area'

The datum method to_graphiviz() will generate a Graphviz representation, as a plain-text file. If you then open that file with Graphviz, you should get a nice image that you can save.

The Graphviz language is easy to learn (I advise studying the gallery of examples), and it is a flexible, powerful way to visualize data.

Exercises

SEMCLASSES Modify count_semantic_classes() so that it counts only the highest-level semantic classifications, obtained by splitting ConnHeadSemClass1 values on the leftmost period and using only the first element. Provide your modified code and the output counts.

CM Construct a confusion matrix with the Relation types as rows, the ConnHeadSemClass1 values as columns, and the cells representing the number of times that the corresponding row and column values occur together. Are there patterns here that we might take advantage of in experiments predicting Relation types or semantic coherence classes?

WORDS The Datum method calls arg1_pos(wn_format=True, lemmatize=True) and arg2_pos(wn_format=True, lemmatize=True) return the stemmed (word, pos) pairs for Arg1 and Arg2. Use these functions to try to find words that are predictive of ConnHeadSemClass1 values. I suggest constructing count dictionaries and looking at the relationships between counts, but feel free to try something more ambitious.

MAINVERB Explore the role that main predicates play in determining coherence relations. To do this, you'll need to:

  1. Write a function that, when given a Tree, identifies the head of the main predicate of that tree if there is one. The functions defined at the top of swda_experiment_clausetyping.py might provide a useful guide here.
  2. Write a function that counts the relationships between Datum.conn_str() values and the predicates you identify.

I leave the exact formulation of this problem to you. You might want to restrict attention to one of the Args, or use both of them. You might want to look only at verbs (and perhaps stem them to address sparseness problems; see wn_string_lemmatizer in wordnet_functions.py). And so forth; just be sure to explain what you did and why; I myself am not sure how best to go about this.

WORDLEEX The Wordle diagrams in figure WORDLE depict the nature of the connectives in Explicit, Implicit, and AltLex relations. The pictures are quite different. Pick a connective and offer an explanation for its distribution across these relation categories.

ROOTS Study the relationship between the root node labels for Arg1 and Arg2 and the value of conn_str(). In the spirit of the earlier clause-typing experiment, what do the associations say about clause-types and discourse coherence?

ATT Explore the output of print_attribution_texts(). What semantic relationships recur in the list? What might this tell us about the nature of speaker commitment?