FactBank and the Stanford PragBank Extension
- Overview
- Corpus distribution and tools
- Examples
- Comparing the FactBank and PragBank annotations
- Non-AUTHOR annotations
- Lexical associations
- C-commanding modals
- Exercises
This section provides an overview of (parts of) the FactBank corpus
as well as a recent extension of it created at Stanford. I've kept this
material to a minimum because de Marneffe et al. 2011
covers exactly the ground that I would cover here, and it is short and
accessible.
As in previous sections, I've got some data and Python code to aid investigation. The data distribution is a CSV file, so you can also read it into Excel, R, and so forth.
Associated reading:
- Saurí and Pustejovsky 2009: introduces FactBank and reports on experiments predicting veridicality
- de Marneffe et al. 2011: discusses and compares FactBank with the PragBank annotations, and reports on experiments predicting veridicality distributions
FactBank is distributed as multiple files with stand-off
annotations, and the Stanford PragBank distribution can be merged with
it. Working directly with these files requires a lot of careful setup, so I've put together a single CSV file containing just the
information about veridicality/commitment that we'll focus on here.
Table COLUMNS
summarizes the column values for this file. See de Marneffe et al. 2011
for additional information about the data and annotations.
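If you just want a quick look at the file before turning to the corpus reader, something like the following sketch will do. It uses only the standard library and assumes that fb-semprag.csv sits in the current directory and begins with a header row naming the columns:
- import csv
- 
- # Print the header row and the first data row of the distribution.
- # Assumes fb-semprag.csv is in the current working directory:
- reader = csv.reader(open('fb-semprag.csv'))
- print reader.next()
- print reader.next()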
This section gives a few simple examples designed to convey how the code works and what the data are like.
The following function generates the confusion matrix in Table III of de Marneffe et al. 2011 (like the rest of the code in this section, it assumes that csv, re, collections.defaultdict, and operator.itemgetter have been imported and that FactbankCorpusReader from the code distribution is in scope):
- def semprag_confusion_matrix(output_filename):
-     """
-     Build a confusion matrix comparing the FactBank annotations
-     with the PragBank extension, limiting attention to cases
-     where 6/10 agreed on a single category (which we then take
-     to be the correct label).
- 
-     The output CSV file has FactBank annotations as rows and the
-     pragmatic annotations as columns.
- 
-     The output is the same as table III of de Marneffe et al.'s
-     'Veridicality and utterance understanding'.
-     """
-     # 2-d defaultdict for the counts:
-     cm = defaultdict(lambda : defaultdict(int))
-     # Instantiate the corpus:
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     # Iterate through the training set:
-     for event in corpus.train_events():
-         # Where defined, this will be a pair like ('ct_plus', 7):
-         pv, pv_count = event.majority_pragvalue()
-         # The AUTHOR-level factuality value is the most comparable to the pragmatic annotations:
-         fv = event.FactValues['AUTHOR']
-         # We limit attention to the items where the majority got at least 6 votes:
-         if pv and pv_count >= 6:
-             cm[fv][pv] += 1
-     # CSV output with the FactBank annotations as rows and the
-     # pragmatic annotations as columns:
-     csvwriter = csv.writer(open(output_filename, 'w'))
-     # Listing the keys like this ensures an intuitive ordering:
-     keys = ['ct_plus', 'pr_plus', 'ps_plus', 'ct_minus', 'pr_minus', 'ps_minus', 'uu']
-     csvwriter.writerow(['FactBank'] + keys)
-     for fb in keys:
-         row = [fb] + [cm[fb][pv] for pv in keys]
-         csvwriter.writerow(row)
I ran it as follows:
- from factbank_functions import semprag_confusion_matrix
- semprag_confusion_matrix('factbank-semprag-confusion-matrix.csv')
Feel free to download factbank-semprag-confusion-matrix.csv
if you'd prefer not to regenerate it yourself.
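As a quick sanity check, a few additional lines of code (this sketch is not part of factbank_functions) will read the matrix back in and compute the share of events on which the two annotation schemes assign the same label, i.e., the fraction of the counts on the diagonal:
- import csv
- 
- def confusion_matrix_agreement(filename='factbank-semprag-confusion-matrix.csv'):
-     """Fraction of the confusion matrix's counts that lie on the diagonal."""
-     rows = list(csv.reader(open(filename)))
-     header = rows[0][1:]  # The PragBank labels; rows[1:] pair a FactBank label with counts.
-     total = agree = 0
-     for row in rows[1:]:
-         for colname, count in zip(header, map(int, row[1:])):
-             total += count
-             if colname == row[0]:
-                 agree += count
-     print 'Agreement: %s/%s = %0.3f' % (agree, total, agree / float(total))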
exercise CM
Most FactBank events are annotated from multiple perspectives
— not just the author of the text but also various other
participants named in the sentence.
The following function provides a look at the source strings for these non-AUTHOR annotations:
- def nonauthor_factbank_annotations():
-     """Look at the source strings associated with non-AUTHOR annotations in FactBank."""
-     d = defaultdict(int)
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     for event in corpus.train_events():
-         for src in event.FactValues:
-             if src != 'AUTHOR':
-                 d[src] += 1
-     for key, val in sorted(d.items(), key=itemgetter(1), reverse=True):
-         print key, val
The output has 131 lines, and most of the sources occur only once. Here is the top of the list, showing the steep drop-off to very few tokens:
- GEN_AUTHOR 44
- officials_AUTHOR 19
- company_AUTHOR 10
- he_AUTHOR 10
- DUMMY_AUTHOR 7
- He_AUTHOR 6
- analysts_AUTHOR 5
- spokesman_AUTHOR 5
- Doyle_AUTHOR 3
- unit_AUTHOR 3
- ...
There might, though, be interesting generalizations over these
labels based on their syntactic roles.
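As a rough first pass at such generalizations, the following sketch (again, not part of the code distribution) strips the _AUTHOR suffix seen in the output above and buckets the remaining source strings into personal pronouns, the special GEN and DUMMY sources, and everything else:
- def bucket_nonauthor_sources():
-     """Rough grouping of the non-AUTHOR source strings."""
-     pronouns = set(['i', 'you', 'he', 'she', 'it', 'we', 'they'])
-     buckets = defaultdict(int)
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     for event in corpus.train_events():
-         for src in event.FactValues:
-             if src == 'AUTHOR':
-                 continue
-             # Sources look like 'officials_AUTHOR'; keep the initial string:
-             head = src.split('_')[0].lower()
-             if head in ('gen', 'dummy'):
-                 buckets[head.upper()] += 1
-             elif head in pronouns:
-                 buckets['pronoun'] += 1
-             else:
-                 buckets['other'] += 1
-     for key, val in sorted(buckets.items(), key=itemgetter(1), reverse=True):
-         print key, val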
It is quite illuminating to compare AUTHOR and non-AUTHOR annotations
for the same event:
- def author_nonauthor_factvalue_compare():
-     """Compare FactBank AUTHOR and non-AUTHOR annotations for the same event."""
-     d = defaultdict(int)
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     for event in corpus.train_events():
-         fvs = event.FactValues
-         auth = fvs['AUTHOR']
-         for src, fv in fvs.items():
-             if src != 'AUTHOR':
-                 d[(auth, fv)] += 1
-     keys = ['ct_plus', 'pr_plus', 'ps_plus', 'ct_minus', 'pr_minus', 'ps_minus', 'uu']
-     # Little function for displaying neat columns:
-     def fmt(a):
-         return "".join(map((lambda x : str(x).rjust(14)), a))
-     # Output printing:
-     print fmt(['author\\other'] + keys)
-     for auth in keys:
-         row = [auth]
-         for other in keys:
-             row.append(d[(auth, other)])
-         print fmt(row)
The full output:
- author\other ct_plus pr_plus ps_plus ct_minus pr_minus ps_minus uu
- ct_plus 5 0 0 0 0 0 0
- pr_plus 1 53 0 0 0 0 10
- ps_plus 0 1 3 0 0 0 2
- ct_minus 0 0 0 18 0 0 4
- pr_minus 0 0 0 0 2 0 0
- ps_minus 0 0 0 0 0 0 0
- uu 95 18 6 14 3 0 40
In the vast majority of cases, the author annotation (rows) is weaker than the other annotation, in the sense that it is lower on the scale of veridicality. Strikingly, the largest category is (uu, ct_plus). Many of these examples are roughly of the form X said that S, where, according to the FactBank annotation guidelines, S is certain from the perspective of the subject X but unknown (merely reported) from the perspective of the author of the sentence.
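To look at a few of these cases directly, one can print the sentences behind the (uu, ct_plus) cell. Here is a sketch along the lines of the functions above, assuming (as modal_stats() below also does) that event.SentenceParse is a parse tree whose leaves() method returns the words of the sentence:
- def uu_ctplus_examples(n=5):
-     """Print up to n sentences where AUTHOR has uu but some other source has ct_plus."""
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     printed = 0
-     for event in corpus.train_events():
-         fvs = event.FactValues
-         others = [fv for src, fv in fvs.items() if src != 'AUTHOR']
-         if fvs['AUTHOR'] == 'uu' and 'ct_plus' in others:
-             print ' '.join(event.SentenceParse.leaves())
-             printed += 1
-             if printed >= n:
-                 break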
exercise PRAGNONAUTHOR
The function lexical_associations()
explores how the distribution of words differs between the FactBank
and PragBank annotations.
- def lexical_associations(n=10):
-     """
-     This function looks for words that are unusually over-represented
-     in the FactBank (PragBank) annotations.
- 
-     The output is the top n items for each tag (default n=10).
- 
-     To do this, it iterates through the subset of the training set
-     where there was a 6/10 majority choice label selected by the
-     Turkers.
- 
-     For each event, it iterates through the words in the sentence for
-     that event, adding 1 for (factbank-label, word) pairs and
-     subtracting 1 for (pragbank-label, word) pairs.
- 
-     Thus, if the two sets of annotations were the same, these values
-     would all be 0. What we see instead is a lot of lexical variation
-     (though the results are somewhat marred by the tendency for
-     high-frequency words to end up with very large counts).
-     """
-     # Keep track of the differences:
-     diff = defaultdict(lambda : defaultdict(int))
-     # Instantiate the corpus:
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     # Limit to the training set:
-     events = corpus.train_events()
-     # Limit to the events with at least a 6/10 majority choice:
-     events = filter((lambda e : e.majority_pragvalue()[1] and e.majority_pragvalue()[1] >= 6), events)
-     # Iterate through this restricted set of events:
-     for event in events:
-         # Lemmatize:
-         event_words = event.leaves(wn_lemmatize=True)
-         # Remove punctuation, so that we look only at real words:
-         event_words = filter((lambda x : not re.search(r"\W", x)), event_words)
-         # Downcase:
-         event_words = map(str.lower, event_words)
-         # Word counting: add 1 for the FactBank label and subtract 1
-         # for the PragBank label, as described in the docstring:
-         for word in event_words:
-             diff[event.FactValues['AUTHOR']][word] += 1
-             diff[event.majority_pragvalue()[0]][word] -= 1
-     # Function for formatting the results:
-     def fmt(a):
-         return ', '.join(map((lambda x : '%s: %s' % x), a))
-     # View the results:
-     keys = ['ct_plus', 'pr_plus', 'ps_plus', 'ct_minus', 'pr_minus', 'ps_minus', 'uu']
-     for key in keys:
-         sorted_vals = sorted(diff[key].items(), key=itemgetter(1))
-         print key
-         print '\tFactBank:', fmt(sorted(sorted_vals[-n:], key=itemgetter(1), reverse=True)) # Put these in decreasing order.
-         print '\tPragBank:', fmt(sorted_vals[:n])
Output with n=10:
- ct_plus
- FactBank the: 312, of: 176, to: 161, a: 137, in: 125, be: 114, and: 104, say: 87, million: 59, it: 59
- PragBank secretary: 1, zinc: 1, unisys: 1, settlement: 1, hate: 1, clearance: 1, whose: 1, violate: 1, herbicide: 1, supreme: 1
- pr_plus
- FactBank the: 266, to: 225, be: 175, a: 91, of: 90, in: 89, and: 86, that: 69, expect: 61, for: 49
- PragBank focus: 1, follow: 1, bronfman: 1, lure: 1, sony: 1, chairman: 1, lever: 1, ten: 1, core: 1, merieux: 1
- ps_plus
- FactBank the: 187, to: 113, of: 79, a: 75, be: 65, and: 48, in: 48, that: 44, could: 42, for: 31
- PragBank represent: 1, whose: 1, cheney: 1, send: 1, petersburg: 1, telerate: 1, past: 1, appear: 1, above: 1, public: 1
- ct_minus
- FactBank the: 436, be: 226, to: 217, a: 155, of: 148, have: 128, and: 126, t: 126, it: 115, in: 110
- PragBank follow: 1, whose: 1, violate: 1, breach: 1, labor: 1, solution: 1, constitution: 1, categorically: 1, along: 1, engage: 1
- pr_minus
- FactBank the: 18, and: 13, be: 12, to: 8, t: 7, in: 7, computer: 6, maker: 6, n: 6, for: 6
- PragBank japan: 1, yet: 1, potential: 1, do: 1, capital: 1, despite: 1, stiff: 1, fully: 1, large: 1, force: 1
- ps_minus
- FactBank from: 2, the: 2, edelman: 1, boost: 1, run: 1, may: 1, a: 1, asher: 1, martin: 1, make: 1
- PragBank and: 1, prevent: 1, help: 1, move: 1, intelogic: 1, at: 1, concern: 1, 20: 1, ackerman: 1, stake: 1
- uu
- FactBank the: 281, to: 173, of: 141, be: 126, say: 108, a: 103, in: 103, and: 92, it: 57, that: 53
- PragBank represent: 1, debenture: 1, 80486: 1, dollar: 1, focus: 1, nadeau: 1, zinc: 1, settlement: 1, hate: 1, whose: 1
exercise HIGHFREQ
Modals are clearly related to veridicality. In FactBank, they are taken to be fairly reliable markers of specific veridicality categories, and such associations turn up in the PragBank annotations as well. This section provides an initial look at the way modals relate to particular veridicality values.
The first step is defining a function that, when given a tree and a
word in that tree (in our case, the event text), returns the set of
modals that c-command that word (by the most conservative definition
of c-command):
- def tree_has_modal_daughter(tree):
-     """
-     If tree has a preterminal daughter whose terminal is a modal,
-     return that modal, else return False.
-     """
-     # 'Wo' covers "won't", which the treebank tokenization splits into "wo n't":
-     modal_re = re.compile(r'^(Can|Could|Shall|Should|Will|Would|May|Might|Must|Wo)$', re.I)
-     for daught in tree:
-         if swda_experiment_clausetyping.is_preterminal(daught) and modal_re.search(daught[0]):
-             return daught[0]
-     return False
- 
- def c_commanding_modals(tree, terminal):
-     """Return the set of modals that c-command terminal in tree."""
-     modals = set([])
-     for subtree in tree.subtrees():
-         md = tree_has_modal_daughter(subtree)
-         if md:
-             if terminal in subtree.leaves():
-                 modals.add(md)
-     return modals
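To see what c_commanding_modals() does, it helps to run it on a toy parse, independently of the corpus. The sketch below builds a tree with nltk's Tree class (Tree.fromstring in nltk 3; older versions used Tree.parse) and assumes the two functions above, together with their swda_experiment_clausetyping.is_preterminal dependency, are in scope:
- from nltk.tree import Tree
- 
- t = Tree.fromstring('(S (NP (DT The) (NN deal)) (VP (MD might) (VP (VB collapse))))')
- print c_commanding_modals(t, 'collapse')
- # Expected output: set(['might']), since 'might' is a daughter of the VP
- # that dominates 'collapse' and hence c-commands it.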
The following function then uses this modal-finding ability to gather a count matrix relating the FactBank or PragBank annotations to modal token counts:
- def modal_stats(factbank_or_pragbank):
-     """
-     Gather and print a matrix relating modal use (rows) to
-     veridicality values:
- 
-     factbank_or_pragbank (str) -- if 'factbank' (case-insensitive),
-                                   then use the FactBank annotations,
-                                   else use the PragBank majority annotation
- 
-     The calculations are limited to the subset of the events that have
-     a 6/10 majority category in PragBank, to facilitate comparisons
-     between the two annotation groups.
-     """
- 
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     # Limit to the training set:
-     events = corpus.train_events()
-     # Limit to the events with at least a 6/10 majority choice:
-     events = filter((lambda e : e.majority_pragvalue()[1] and e.majority_pragvalue()[1] >= 6), events)
-     # For the counts:
-     counts = defaultdict(lambda : defaultdict(int))
-     # Iterate through the events:
-     for event in events:
-         modals = c_commanding_modals(event.SentenceParse, event.eText)
-         for modal in modals:
-             val = None
-             if factbank_or_pragbank.lower() == 'factbank':
-                 val = event.FactValues['AUTHOR']
-             else:
-                 val = event.majority_pragvalue()[0]
-             counts[modal][val] += 1
-     # Modals:
-     modals = sorted(counts.keys())
-     # Categories:
-     keys = ['ct_plus', 'pr_plus', 'ps_plus', 'ct_minus', 'pr_minus', 'ps_minus', 'uu']
-     # Little function for displaying neat columns:
-     def fmt(a):
-         return "".join(map((lambda x : str(x).rjust(10)), a))
-     # Output printing:
-     print "======================================================================"
-     print factbank_or_pragbank
-     print fmt([''] + keys)
-     for modal in modals:
-         row = [modal]
-         for cat in keys:
-             row.append(counts[modal][cat])
-         print fmt(row)
Here is the output for the two annotation types:
- modal_stats('FactBank')
- ======================================================================
- FactBank
- ct_plus pr_plus ps_plus ct_minus pr_minus ps_minus uu
- can 0 0 0 0 0 0 3
- could 0 0 22 10 0 0 2
- may 0 0 12 0 0 1 2
- might 0 0 9 1 0 0 1
- must 0 0 0 0 0 0 3
- should 0 1 0 0 0 0 2
- will 2 8 1 3 0 0 14
- would 0 2 4 7 0 0 8
- modal_stats('PragBank')
- ======================================================================
- PragBank
- ct_plus pr_plus ps_plus ct_minus pr_minus ps_minus uu
- can 0 0 1 0 0 0 2
- could 1 1 20 10 0 0 2
- may 0 1 13 0 0 0 1
- might 0 1 9 1 0 0 0
- must 1 0 0 0 0 0 2
- should 0 2 0 0 1 0 0
- will 9 12 2 3 0 0 2
- would 8 2 3 8 0 0 0
The patterns are very similar, but modals are less categorical in their behavior on the PragBank data, with the trend again toward less use of the uncertainty category. This shift is especially pronounced for will and would.
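To put numbers on this shift: the row totals for each modal are the same in both tables (the same events are involved), so the uu shares can be computed directly from the counts above:
- # uu shares for 'will' and 'would', read off the two tables above:
- for modal, fb_uu, pb_uu, total in [('will', 14, 2, 28), ('would', 8, 0, 21)]:
-     print '%s: FactBank uu share %0.2f, PragBank uu share %0.2f' % (
-         modal, fb_uu / float(total), pb_uu / float(total))
For will, the uu share drops from 0.50 to about 0.07, and for would from 0.38 to 0.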
exercise COMMANDEX
PRAGNONAUTHOR The
function author_nonauthor_factvalue_compare()
compares AUTHOR and non-AUTHOR FactBank
annotations. As discussed above, what we
see is a very strong tendency for the AUTHOR annotation to be more
uncertain than the non-AUTHOR one.
How do the PragBank (READER-level) annotations compare to the
non-AUTHOR ones? To address this, write a function comparable
to author_nonauthor_factvalue_compare()
but that compares the 6/10 majority choice PragBank annotation
(where there is one) to the FactBank non-AUTHOR annotations.
What is the overall picture like and how does it compare to the
output of author_nonauthor_factvalue_compare()?
HIGHFREQ The
function lexical_associations() is
somewhat helpful in understanding the way lexical items associate
with the veridicality tags for the two annotation types, but it
is clear that the method is heavily biased in favor of high-frequency
words. This is presumably because even small differences get amplified by the high token counts. Though these associations might be important, they are clearly dampening the effects of more interesting markers of veridicality.
Your task: try to devise a method for finding these associations that is less susceptible to frequency effects (both high-frequency, as with the current function, and low-frequency).
COMMANDEX
The goal of this exercise is to generalize the c-commanding modals
functionality:
- Modify (and perhaps change the name of) c_commanding_modals() so that it takes a regular expression as an additional argument and then looks for c-commanding nodes whose terminals match it.
- Modify (and perhaps change the name of) modal_stats() so that it takes a regular expression as its second argument and passes it to your revised c_commanding_modals().
- Write an interesting regular expression (capturing, say, a class of verbs, determiners, etc.) and provide the results of running your modified code with it. How do FactBank and PragBank compare for this output?