FactBank and the Stanford PragBank Extension
- Overview
- Corpus distribution and tools
- Examples
- Comparing the FactBank and PragBank annotations
- Non-AUTHOR annotations
- Lexical associations
- C-commanding modals
- Exercises
This section provides an overview of (parts of) the FactBank corpus
as well as a recent extension of it created at Stanford. I've kept this
material to a minimum because de Marneffe et al. 2011
covers exactly the ground that I would cover here, and it is short and
accessible.
As in previous sections, I've got some data and Python code to aid investigation. The data distribution is a CSV file, so you can also read it into Excel, R, and so forth.
Associated reading:
- Saurí and Pustejovsky 2009: introduces FactBank and reports on experiments predicting veridicality
- de Marneffe et al. 2011: discusses and compares FactBank with the PragBank annotations, and reports on experiments predicting veridicality distributions
FactBank is distributed as multiple files with stand-off
annotations, and the Stanford PragBank distribution can be merged with
it. Working directly with these files requires a lot of careful setup, so I've put together a single CSV file containing just the
information about veridicality/commitment that we'll focus on here.
Table COLUMNS
summarizes the column values for this file. See de Marneffe et al. 2011
for additional information about the data and annotations.
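If you just want a quick look at the file before turning to the corpus reader, something like the following sketch will do. It uses only the standard library and assumes that fb-semprag.csv sits in the current directory and begins with a header row naming the columns:
- import csv
- 
- # Print the header row and the first data row of the distribution.
- # Assumes fb-semprag.csv is in the current working directory:
- reader = csv.reader(open('fb-semprag.csv'))
- print reader.next()
- print reader.next()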
This section gives a few simple examples designed to convey how the code works and what the data are like.
The following function generates the confusion matrix in Table III of de Marneffe et al. 2011 (like the rest of the code in this section, it assumes that csv, re, collections.defaultdict, and operator.itemgetter have been imported and that FactbankCorpusReader from the code distribution is in scope):
- def semprag_confusion_matrix(output_filename):
-     """
-     Build a confusion matrix comparing the FactBank annotations
-     with the PragBank extension, limiting attention to cases
-     where 6/10 agreed on a single category (which we then take
-     to be the correct label).
- 
-     The output CSV file has FactBank annotations as rows and the
-     pragmatic annotations as columns.
- 
-     The output is the same as table III of de Marneffe et al.'s
-     'Veridicality and utterance understanding'.
-     """
-     # 2-d defaultdict for the counts:
-     cm = defaultdict(lambda : defaultdict(int))
-     # Instantiate the corpus:
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     # Iterate through the training set:
-     for event in corpus.train_events():
-         # Where defined, this will be a pair like ('ct_plus', 7):
-         pv, pv_count = event.majority_pragvalue()
-         # The AUTHOR-level factuality value is the most comparable to the pragmatic annotations:
-         fv = event.FactValues['AUTHOR']
-         # We limit attention to the items where the majority got at least 6 votes:
-         if pv and pv_count >= 6:
-             cm[fv][pv] += 1
-     # CSV output with the FactBank annotations as rows and the
-     # pragmatic annotations as columns:
-     csvwriter = csv.writer(open(output_filename, 'w'))
-     # Listing the keys like this ensures an intuitive ordering:
-     keys = ['ct_plus', 'pr_plus', 'ps_plus', 'ct_minus', 'pr_minus', 'ps_minus', 'uu']
-     csvwriter.writerow(['FactBank'] + keys)
-     for fb in keys:
-         row = [fb] + [cm[fb][pv] for pv in keys]
-         csvwriter.writerow(row)
I ran it as follows:
- from factbank_functions import semprag_confusion_matrix
- semprag_confusion_matrix('factbank-semprag-confusion-matrix.csv')
Feel free to download factbank-semprag-confusion-matrix.csv
if you'd prefer not to regenerate it yourself.
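As a quick sanity check, a few additional lines of code (this sketch is not part of factbank_functions) will read the matrix back in and compute the share of events on which the two annotation schemes assign the same label, i.e., the fraction of the counts on the diagonal:
- import csv
- 
- def confusion_matrix_agreement(filename='factbank-semprag-confusion-matrix.csv'):
-     """Fraction of the confusion matrix's counts that lie on the diagonal."""
-     rows = list(csv.reader(open(filename)))
-     header = rows[0][1:]  # The PragBank labels; rows[1:] pair a FactBank label with counts.
-     total = agree = 0
-     for row in rows[1:]:
-         for colname, count in zip(header, map(int, row[1:])):
-             total += count
-             if colname == row[0]:
-                 agree += count
-     print 'Agreement: %s/%s = %0.3f' % (agree, total, agree / float(total))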
exercise CM
Most FactBank events are annotated from multiple perspectives
— not just the author of the text but also various other
participants named in the sentence.
The following function provides a look at the source strings for these non-AUTHOR annotations:
- def nonauthor_factbank_annotations():
-     """Look at the source strings associated with non-AUTHOR annotations in FactBank."""
-     d = defaultdict(int)
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     for event in corpus.train_events():
-         for src in event.FactValues:
-             if src != 'AUTHOR':
-                 d[src] += 1
-     for key, val in sorted(d.items(), key=itemgetter(1), reverse=True):
-         print key, val
The output has 131 lines, and most of the sources occur only once. Here is the top of the list, showing the steep drop-off to very few tokens:
- GEN_AUTHOR 44
- officials_AUTHOR 19
- company_AUTHOR 10
- he_AUTHOR 10
- DUMMY_AUTHOR 7
- He_AUTHOR 6
- analysts_AUTHOR 5
- spokesman_AUTHOR 5
- Doyle_AUTHOR 3
- unit_AUTHOR 3
- ...
There might, though, be interesting generalizations over these
labels based on their syntactic roles.
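As a rough first pass at such generalizations, the following sketch (again, not part of the code distribution) strips the _AUTHOR suffix seen in the output above and buckets the remaining source strings into personal pronouns, the special GEN and DUMMY sources, and everything else:
- def bucket_nonauthor_sources():
-     """Rough grouping of the non-AUTHOR source strings."""
-     pronouns = set(['i', 'you', 'he', 'she', 'it', 'we', 'they'])
-     buckets = defaultdict(int)
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     for event in corpus.train_events():
-         for src in event.FactValues:
-             if src == 'AUTHOR':
-                 continue
-             # Sources look like 'officials_AUTHOR'; keep the initial string:
-             head = src.split('_')[0].lower()
-             if head in ('gen', 'dummy'):
-                 buckets[head.upper()] += 1
-             elif head in pronouns:
-                 buckets['pronoun'] += 1
-             else:
-                 buckets['other'] += 1
-     for key, val in sorted(buckets.items(), key=itemgetter(1), reverse=True):
-         print key, val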
It is quite illuminating to compare AUTHOR and non-AUTHOR annotations
for the same event:
- def author_nonauthor_factvalue_compare():
-     """Compare FactBank AUTHOR and non-AUTHOR annotations for the same event."""
-     d = defaultdict(int)
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     for event in corpus.train_events():
-         fvs = event.FactValues
-         auth = fvs['AUTHOR']
-         for src, fv in fvs.items():
-             if src != 'AUTHOR':
-                 d[(auth, fv)] += 1
-     keys = ['ct_plus', 'pr_plus', 'ps_plus', 'ct_minus', 'pr_minus', 'ps_minus', 'uu']
-     # Little function for displaying neat columns:
-     def fmt(a):
-         return "".join(map((lambda x : str(x).rjust(14)), a))
-     # Output printing:
-     print fmt(['author\\other'] + keys)
-     for auth in keys:
-         row = [auth]
-         for other in keys:
-             row.append(d[(auth, other)])
-         print fmt(row)
The full output:
- author\other ct_plus pr_plus ps_plus ct_minus pr_minus ps_minus uu
- ct_plus 5 0 0 0 0 0 0
- pr_plus 1 53 0 0 0 0 10
- ps_plus 0 1 3 0 0 0 2
- ct_minus 0 0 0 18 0 0 4
- pr_minus 0 0 0 0 2 0 0
- ps_minus 0 0 0 0 0 0 0
- uu 95 18 6 14 3 0 40
In the vast majority of cases, the author annotation (rows) is weaker than the other annotation, in the sense that it is lower on the scale of veridicality. Strikingly, the largest category is (uu, ct_plus). Many of these examples are roughly of the form X said that S, where, according to the FactBank annotation guidelines, S is certain from the perspective of the subject X but unknown (merely reported) from the perspective of the author of the sentence.
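To look at a few of these cases directly, one can print the sentences behind the (uu, ct_plus) cell. Here is a sketch along the lines of the functions above, assuming (as modal_stats() below also does) that event.SentenceParse is a parse tree whose leaves() method returns the words of the sentence:
- def uu_ctplus_examples(n=5):
-     """Print up to n sentences where AUTHOR has uu but some other source has ct_plus."""
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     printed = 0
-     for event in corpus.train_events():
-         fvs = event.FactValues
-         others = [fv for src, fv in fvs.items() if src != 'AUTHOR']
-         if fvs['AUTHOR'] == 'uu' and 'ct_plus' in others:
-             print ' '.join(event.SentenceParse.leaves())
-             printed += 1
-             if printed >= n:
-                 break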
exercise PRAGNONAUTHOR
The function lexical_associations()
explores how the distribution of words differs between the FactBank
and PragBank annotations.
- def lexical_associations(n=10):
-     """
-     This function looks for words that are unusually over-represented
-     in the FactBank (PragBank) annotations.
- 
-     The output is the top n items for each tag (default n=10).
- 
-     To do this, it iterates through the subset of the training set
-     where there was a 6/10 majority choice label selected by the
-     Turkers.
- 
-     For each event, it iterates through the words in the sentence for
-     that event, adding 1 for (factbank-label, word) pairs and
-     subtracting 1 for (pragbank-label, word) pairs.
- 
-     Thus, if the two sets of annotations were the same, these values
-     would all be 0. What we see instead is a lot of lexical variation
-     (though the results are somewhat marred by the tendency for
-     high-frequency words to end up with very large counts).
-     """
-     # Keep track of the differences:
-     diff = defaultdict(lambda : defaultdict(int))
-     # Instantiate the corpus:
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     # Limit to the training set:
-     events = corpus.train_events()
-     # Limit to the events with at least a 6/10 majority choice:
-     events = filter((lambda e : e.majority_pragvalue()[1] and e.majority_pragvalue()[1] >= 6), events)
-     # Iterate through this restricted set of events:
-     for event in events:
-         # Lemmatize:
-         event_words = event.leaves(wn_lemmatize=True)
-         # Remove punctuation, so that we look only at real words:
-         event_words = filter((lambda x : not re.search(r"\W", x)), event_words)
-         # Downcase:
-         event_words = map(str.lower, event_words)
-         # Word counting: add 1 for the FactBank label and subtract 1
-         # for the PragBank label, as described in the docstring:
-         for word in event_words:
-             diff[event.FactValues['AUTHOR']][word] += 1
-             diff[event.majority_pragvalue()[0]][word] -= 1
-     # Function for formatting the results:
-     def fmt(a):
-         return ', '.join(map((lambda x : '%s: %s' % x), a))
-     # View the results:
-     keys = ['ct_plus', 'pr_plus', 'ps_plus', 'ct_minus', 'pr_minus', 'ps_minus', 'uu']
-     for key in keys:
-         sorted_vals = sorted(diff[key].items(), key=itemgetter(1))
-         print key
-         print '\tFactBank:', fmt(sorted(sorted_vals[-n:], key=itemgetter(1), reverse=True)) # Put these in decreasing order.
-         print '\tPragBank:', fmt(sorted_vals[:n])
Output with n=10:
- ct_plus
- FactBank the: 312, of: 176, to: 161, a: 137, in: 125, be: 114, and: 104, say: 87, million: 59, it: 59
- PragBank secretary: 1, zinc: 1, unisys: 1, settlement: 1, hate: 1, clearance: 1, whose: 1, violate: 1, herbicide: 1, supreme: 1
- pr_plus
- FactBank the: 266, to: 225, be: 175, a: 91, of: 90, in: 89, and: 86, that: 69, expect: 61, for: 49
- PragBank focus: 1, follow: 1, bronfman: 1, lure: 1, sony: 1, chairman: 1, lever: 1, ten: 1, core: 1, merieux: 1
- ps_plus
- FactBank the: 187, to: 113, of: 79, a: 75, be: 65, and: 48, in: 48, that: 44, could: 42, for: 31
- PragBank represent: 1, whose: 1, cheney: 1, send: 1, petersburg: 1, telerate: 1, past: 1, appear: 1, above: 1, public: 1
- ct_minus
- FactBank the: 436, be: 226, to: 217, a: 155, of: 148, have: 128, and: 126, t: 126, it: 115, in: 110
- PragBank follow: 1, whose: 1, violate: 1, breach: 1, labor: 1, solution: 1, constitution: 1, categorically: 1, along: 1, engage: 1
- pr_minus
- FactBank the: 18, and: 13, be: 12, to: 8, t: 7, in: 7, computer: 6, maker: 6, n: 6, for: 6
- PragBank japan: 1, yet: 1, potential: 1, do: 1, capital: 1, despite: 1, stiff: 1, fully: 1, large: 1, force: 1
- ps_minus
- FactBank from: 2, the: 2, edelman: 1, boost: 1, run: 1, may: 1, a: 1, asher: 1, martin: 1, make: 1
- PragBank and: 1, prevent: 1, help: 1, move: 1, intelogic: 1, at: 1, concern: 1, 20: 1, ackerman: 1, stake: 1
- uu
- FactBank the: 281, to: 173, of: 141, be: 126, say: 108, a: 103, in: 103, and: 92, it: 57, that: 53
- PragBank represent: 1, debenture: 1, 80486: 1, dollar: 1, focus: 1, nadeau: 1, zinc: 1, settlement: 1, hate: 1, whose: 1
exercise HIGHFREQ
Modals are clearly related to veridicality. In FactBank, they are taken to be fairly reliable markers of specific veridicality categories, and such associations turn up in the PragBank annotations as well. This section provides an initial look at the way modals relate to particular veridicality values.
The first step is defining a function that, when given a tree and a
word in that tree (in our case, the event text), returns the set of
modals that c-command that word (by the most conservative definition
of c-command):
- def tree_has_modal_daughter(tree):
-     """
-     If tree has a preterminal daughter whose terminal is a modal,
-     return that modal, else return False.
-     """
-     # 'Wo' covers "won't", which the treebank tokenization splits into "wo n't":
-     modal_re = re.compile(r'^(Can|Could|Shall|Should|Will|Would|May|Might|Must|Wo)$', re.I)
-     for daught in tree:
-         if swda_experiment_clausetyping.is_preterminal(daught) and modal_re.search(daught[0]):
-             return daught[0]
-     return False
- 
- def c_commanding_modals(tree, terminal):
-     """Return the set of modals that c-command terminal in tree."""
-     modals = set([])
-     for subtree in tree.subtrees():
-         md = tree_has_modal_daughter(subtree)
-         if md:
-             if terminal in subtree.leaves():
-                 modals.add(md)
-     return modals
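To see what c_commanding_modals() does, it helps to run it on a toy parse, independently of the corpus. The sketch below builds a tree with nltk's Tree class (Tree.fromstring in nltk 3; older versions used Tree.parse) and assumes the two functions above, together with their swda_experiment_clausetyping.is_preterminal dependency, are in scope:
- from nltk.tree import Tree
- 
- t = Tree.fromstring('(S (NP (DT The) (NN deal)) (VP (MD might) (VP (VB collapse))))')
- print c_commanding_modals(t, 'collapse')
- # Expected output: set(['might']), since 'might' is a daughter of the VP
- # that dominates 'collapse' and hence c-commands it.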
The following function then uses this modal-finding ability to gather a count matrix relating the FactBank or PragBank annotations to modal token counts:
- def modal_stats(factbank_or_pragbank):
-     """
-     Gather and print a matrix relating modal use (rows) to
-     veridicality values:
- 
-     factbank_or_pragbank (str) -- if 'factbank' (case-insensitive),
-                                   then use the FactBank annotations,
-                                   else use the PragBank majority annotation
- 
-     The calculations are limited to the subset of the events that have
-     a 6/10 majority category in PragBank, to facilitate comparisons
-     between the two annotation groups.
-     """
- 
-     corpus = FactbankCorpusReader('fb-semprag.csv')
-     # Limit to the training set:
-     events = corpus.train_events()
-     # Limit to the events with at least a 6/10 majority choice:
-     events = filter((lambda e : e.majority_pragvalue()[1] and e.majority_pragvalue()[1] >= 6), events)
-     # For the counts:
-     counts = defaultdict(lambda : defaultdict(int))
-     # Iterate through the events:
-     for event in events:
-         modals = c_commanding_modals(event.SentenceParse, event.eText)
-         for modal in modals:
-             val = None
-             if factbank_or_pragbank.lower() == 'factbank':
-                 val = event.FactValues['AUTHOR']
-             else:
-                 val = event.majority_pragvalue()[0]
-             counts[modal][val] += 1
-     # Modals:
-     modals = sorted(counts.keys())
-     # Categories:
-     keys = ['ct_plus', 'pr_plus', 'ps_plus', 'ct_minus', 'pr_minus', 'ps_minus', 'uu']
-     # Little function for displaying neat columns:
-     def fmt(a):
-         return "".join(map((lambda x : str(x).rjust(10)), a))
-     # Output printing:
-     print "======================================================================"
-     print factbank_or_pragbank
-     print fmt([''] + keys)
-     for modal in modals:
-         row = [modal]
-         for cat in keys:
-             row.append(counts[modal][cat])
-         print fmt(row)
Here is the output for the two annotation types:
- modal_stats('FactBank')
- ======================================================================
- FactBank
- ct_plus pr_plus ps_plus ct_minus pr_minus ps_minus uu
- can 0 0 0 0 0 0 3
- could 0 0 22 10 0 0 2
- may 0 0 12 0 0 1 2
- might 0 0 9 1 0 0 1
- must 0 0 0 0 0 0 3
- should 0 1 0 0 0 0 2
- will 2 8 1 3 0 0 14
- would 0 2 4 7 0 0 8
- modal_stats('PragBank')
- ======================================================================
- PragBank
- ct_plus pr_plus ps_plus ct_minus pr_minus ps_minus uu
- can 0 0 1 0 0 0 2
- could 1 1 20 10 0 0 2
- may 0 1 13 0 0 0 1
- might 0 1 9 1 0 0 0
- must 1 0 0 0 0 0 2
- should 0 2 0 0 1 0 0
- will 9 12 2 3 0 0 2
- would 8 2 3 8 0 0 0
The patterns are very similar, but modals are less categorical in their behavior on the PragBank data, with the trend again toward less use of the uncertainty category. This shift is especially pronounced for will and would.
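To put numbers on this shift: the row totals for each modal are the same in both tables (the same events are involved), so the uu shares can be computed directly from the counts above:
- # uu shares for 'will' and 'would', read off the two tables above:
- for modal, fb_uu, pb_uu, total in [('will', 14, 2, 28), ('would', 8, 0, 21)]:
-     print '%s: FactBank uu share %0.2f, PragBank uu share %0.2f' % (
-         modal, fb_uu / float(total), pb_uu / float(total))
For will, the uu share drops from 0.50 to about 0.07, and for would from 0.38 to 0.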
exercise COMMANDEX
PRAGNONAUTHOR The
function author_nonauthor_factvalue_compare()
compares AUTHOR and non-AUTHOR FactBank
annotations. As discussed above, what we
see is a very strong tendency for the AUTHOR annotation to be more
uncertain than the non-AUTHOR one.
How do the PragBank (READER-level) annotations compare to the
non-AUTHOR ones? To address this, write a function comparable
to author_nonauthor_factvalue_compare()
but that compares the 6/10 majority choice PragBank annotation
(where there is one) to the FactBank non-AUTHOR annotations.
What is the overall picture like and how does it compare to the
output of author_nonauthor_factvalue_compare()?
HIGHFREQ The
function lexical_associations() is
somewhat helpful in understanding the way lexical items associate
with the veridicality tags for the two annotation types, but it
is clear that the method is heavily biased in favor of high-frequency
words. This is presumably because even small differences get amplified by the high token counts. Though these associations might be important, they are clearly dampening the effects of more interesting markers of veridicality.
Your task: try to devise a method for finding these associations that is less susceptible to frequency effects (both high-frequency, as with the current function, and low-frequency).
COMMANDEX
The goal of this exercise is to generalize the c-commanding modals
functionality:
- Modify (and perhaps change the name of) c_commanding_modals() so that it takes a regular expression as an additional argument and then looks for c-commanding nodes whose terminals match it.
- Modify (and perhaps change the name of) modal_stats() so that it takes a regular expression as its second argument and passes it to your revised c_commanding_modals().
- Write an interesting regular expression (capturing, say, a class of verbs, determiners, etc.) and provide the results of running your modified code with it. How do FactBank and PragBank compare for this output?