FactBank and the Stanford PragBank Extension

  1. Overview
  2. Corpus distribution and tools
  3. Examples
    1. Comparing the FactBank and PragBank annotations
    2. Non-AUTHOR annotations
    3. Lexical associations
    4. C-commanding modals
  4. Exercises

Overview

This section provides an overview of (parts of) the FactBank corpus as well as a recent extension of it created at Stanford. I've kept this material to a minimum because de Marneffe et al. 2011 covers exactly the ground that I would cover here, and it is short and accessible.

As in previous sections, I've got some data and Python code to aid investigation. The data distribution is a single CSV file (fb-semprag.csv), so you can also read it into Excel, R, and so forth.

Associated reading: de Marneffe et al. 2011, 'Veridicality and utterance understanding'.

Corpus distribution and tools

FactBank is distributed as multiple files with stand-off annotations, and the Stanford PragBank distribution can be merged with it. Working directly with these files requires a lot of careful setup, so I've put together a single CSV file containing just the information about veridicality/commitment that we'll focus on here. Table COLUMNS summarizes the column values for this file. See de Marneffe et al. 2011 for additional information about the data and annotations.

Column/Attribute name  Python type      Description
File                   str              FactBank source filename
sentId                 int              FactBank sentence id
Sentence               str              the full text of the sentence
SentenceParse          nltk.tree.Tree   the parse of the sentence — not hand-corrected, but rather the Stanford Parser output
eId                    str              FactBank event Id (integer with 'e' prefix)
eiId                   str              FactBank event instance Id (integer with 'ei' prefix; I believe this is a unique identifier for events, but I often use (File, sentId, eId) to be cautious)
eText                  str              text of the event description
Normalization          str              the question posed to the Mechanical Turk annotators
FactValues             dict             a dictionary mapping FactBank relSourceText values to {ct_plus, pr_plus, ps_plus, ct_minus, pr_minus, ps_minus, uu}
PragValues             dict             a dictionary mapping {ct_plus, pr_plus, ps_plus, ct_minus, pr_minus, ps_minus, uu} to counts
Table COLUMNS
The column values of fb-semprag.csv. The column names are also factbank.Event attributes with the types listed. For FactValues and PragValues, the values appear in the CSV file as name:value pairs separated by |.
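If you want to work with fb-semprag.csv directly, outside of the provided corpus reader (say, before loading it into R or Excel), the FactValues and PragValues columns need to be unpacked. Here is a minimal sketch of one way to do that, assuming only the name:value-pairs-separated-by-| format described in the caption of Table COLUMNS; the function name read_fb_semprag and the details are illustrative, not part of the distribution:

import csv

def read_fb_semprag(filename='fb-semprag.csv'):
    """Illustrative reader for fb-semprag.csv (not part of the distribution).
    FactValues and PragValues are stored as name:value pairs separated by |."""
    rows = []
    for d in csv.DictReader(open(filename)):
        # FactValues: map relSourceText strings (e.g., AUTHOR) to FactBank labels:
        d['FactValues'] = dict(pair.split(':') for pair in d['FactValues'].split('|') if pair)
        # PragValues: map labels (ct_plus, ..., uu) to Turker vote counts:
        d['PragValues'] = dict((lab, int(count)) for lab, count in
                               (pair.split(':') for pair in d['PragValues'].split('|') if pair))
        rows.append(d)
    return rows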

Examples

This section gives a few simple examples designed to give you a sense of how the code works and what the data are like.

Comparing the FactBank and PragBank annotations

The following function generates the confusion matrix in Table III of de Marneffe et al. 2011:

def semprag_confusion_matrix(output_filename):
    """
    Build a confusion matrix comparing the FactBank annotations
    with the PragBank extension, limiting attention to cases
    where 6/10 agreed on a single category (which we then take
    to be the correct label).

    The output CSV file has FactBank annotations as rows and the
    pragmatic annotations as columns.

    The output is the same as table III of de Marneffe et al.'s
    'Veridicality and utterance understanding'.
    """
    # 2-d defaultdict for the counts:
    cm = defaultdict(lambda : defaultdict(int))
    # Instantiate the corpus:
    corpus = FactbankCorpusReader('fb-semprag.csv')
    # Iterate through the training set:
    for event in corpus.train_events():
        # Where defined, this will be a pair like (ct_plus, 7):
        pv, pv_count = event.majority_pragvalue()
        # The AUTHOR-level factuality value is the most comparable to the pragmatic annotations:
        fv = event.FactValues['AUTHOR']
        # We limit attention to the items where the majority got at least 6 votes:
        if pv and pv_count >= 6:
            cm[fv][pv] += 1
    # CSV output with the FactBank annotations as rows and the
    # pragmatic annotations as columns:
    csvwriter = csv.writer(file(output_filename, 'w'))
    # Listing the keys like this ensures an intuitive ordering:
    keys = ['ct_plus', 'pr_plus', 'ps_plus', 'ct_minus', 'pr_minus', 'ps_minus', 'uu']
    csvwriter.writerow(['FactBank'] + keys)
    for fb in keys:
        row = [fb] + [cm[fb][pv] for pv in keys]
        csvwriter.writerow(row)
I ran it as follows:

from factbank_functions import semprag_confusion_matrix

semprag_confusion_matrix('factbank-semprag-confusion-matrix.csv')

Feel free to download factbank-semprag-confusion-matrix.csv if you'd prefer not to regenerate it yourself.
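If you do generate (or download) the file, a quick follow-up check is to measure how often the two annotation schemes agree outright. The helper below is only a sketch (the name confusion_matrix_agreement is mine, not part of factbank_functions); it simply sums the diagonal of the matrix written by semprag_confusion_matrix():

import csv

def confusion_matrix_agreement(filename='factbank-semprag-confusion-matrix.csv'):
    """Sketch: proportion of events whose FactBank AUTHOR label matches
    the PragBank 6/10 majority label, read off the confusion matrix CSV."""
    reader = csv.reader(open(filename))
    header = reader.next()  # ['FactBank', 'ct_plus', ..., 'uu']
    agree = 0
    total = 0
    for row in reader:
        fb = row[0]
        for prag, count in zip(header[1:], row[1:]):
            total += int(count)
            if prag == fb:
                agree += int(count)
    print 'Exact agreement: %s/%s (%0.3f)' % (agree, total, agree / float(total))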

Non-AUTHOR annotations

Most FactBank events are annotated from multiple perspectives — not just the author of the text but also various other participants named in the sentence.

The following function provides a look at the labels for these non-AUTHOR annotations:

def nonauthor_factbank_annnotations():
    """Look at the strings associated with non-AUTHOR annotations in FactBank."""
    d = defaultdict(int)
    corpus = FactbankCorpusReader('fb-semprag.csv')
    for event in corpus.train_events():
        for src, fv in event.FactValues.items():
            if src != 'AUTHOR':
                d[src] += 1
    for key, val in sorted(d.items(), key=itemgetter(1), reverse=True):
        print key, val

The output has 131 lines, and most of the source strings occur only once. Here is the top of the list, showing the steep drop to very few tokens:

GEN_AUTHOR 44
officials_AUTHOR 19
company_AUTHOR 10
he_AUTHOR 10
DUMMY_AUTHOR 7
He_AUTHOR 6
analysts_AUTHOR 5
spokesman_AUTHOR 5
Doyle_AUTHOR 3
unit_AUTHOR 3
...

There might, though, be interesting generalizations over these labels based on their syntactic roles.
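As a first pass at such generalizations, one could collapse the source strings into a few coarse classes. The sketch below is purely illustrative: the class definitions are my own guesses, and I assume that the FactbankCorpusReader used throughout this section is importable from the factbank module (suggested by the factbank.Event attributes noted in Table COLUMNS):

import re
from operator import itemgetter
from collections import defaultdict
from factbank import FactbankCorpusReader  # assumption: same reader used above

def nonauthor_source_classes():
    """Illustrative sketch: group non-AUTHOR sources into coarse classes
    (FactBank's special GEN/DUMMY sources, pronouns, other nominals)."""
    pronoun_re = re.compile(r'^(he|she|it|they|we|i|you)_', re.I)
    d = defaultdict(int)
    corpus = FactbankCorpusReader('fb-semprag.csv')
    for event in corpus.train_events():
        for src in event.FactValues:
            if src == 'AUTHOR':
                continue
            if src.startswith('GEN_') or src.startswith('DUMMY_'):
                cls = 'special (GEN/DUMMY)'
            elif pronoun_re.search(src):
                cls = 'pronoun'
            else:
                cls = 'other nominal'
            d[cls] += 1
    for cls, count in sorted(d.items(), key=itemgetter(1), reverse=True):
        print cls, count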

It is quite illuminating to compare AUTHOR and non-AUTHOR annotations for the same event:

def author_nonauthor_factvalue_compare():
    """Compare FactBank AUTHOR and non-AUTHOR annotations for the same event."""
    d = defaultdict(int)
    corpus = FactbankCorpusReader('fb-semprag.csv')
    for event in corpus.train_events():
        fvs = event.FactValues
        auth = fvs['AUTHOR']
        for src, fv in fvs.items():
            if src != 'AUTHOR':
                d[(auth, fv)] += 1
    keys = ['ct_plus', 'pr_plus', 'ps_plus', 'ct_minus', 'pr_minus', 'ps_minus', 'uu']
    # Little function for displaying neat columns:
    def fmt(a):
        return "".join(map((lambda x : str(x).rjust(14)), a))
    # Output printing:
    print fmt(['author\other'] + keys)
    for auth in keys:
        row = [auth]
        for other in keys:
            row.append(d[(auth, other)])
        print fmt(row)

The full output:

author\other   ct_plus   pr_plus   ps_plus  ct_minus  pr_minus  ps_minus        uu
     ct_plus         5         0         0         0         0         0         0
     pr_plus         1        53         0         0         0         0        10
     ps_plus         0         1         3         0         0         0         2
    ct_minus         0         0         0        18         0         0         4
    pr_minus         0         0         0         0         2         0         0
    ps_minus         0         0         0         0         0         0         0
          uu        95        18         6        14         3         0        40

In the vast majority of cases, the author annotation (rows) is weaker than the non-author annotation (columns), in the sense that it is lower on the scale of veridicality. Strikingly, the largest category is (uu, ct_plus). Many of these examples are roughly of the form X said that S, where (according to the FactBank annotation guidelines) S is certain from the perspective of the subject X but unknown (merely reported) from the perspective of the author of the sentence.
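To see this pattern for yourself, it is easy to pull out a few of the sentences behind the large (uu, ct_plus) cell. The following is a rough exploratory sketch (the function name and the choice of n are mine, and it assumes the same setup as the functions above):

def uu_ctplus_examples(n=5):
    """Sketch: print up to n training sentences where the AUTHOR value is uu
    but some non-AUTHOR source is annotated ct_plus."""
    corpus = FactbankCorpusReader('fb-semprag.csv')
    found = 0
    for event in corpus.train_events():
        if found >= n:
            break
        fvs = event.FactValues
        if fvs.get('AUTHOR') != 'uu':
            continue
        sources = [src for src, fv in fvs.items() if src != 'AUTHOR' and fv == 'ct_plus']
        if sources:
            # Show the event word, the certain source(s), and the sentence:
            print event.eText, sources, event.Sentence
            found += 1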

Lexical associations

The function lexical_associations() explores how the distribution of words differs between the FactBank and PragBank annotations.

def lexical_associations(n=10):
    """
    This function looks for words that are unusually over-represented
    in the FactBank (PragBank) annotations.

    The output is the top n items for each tag (default n=10).
    To do this, it iterates through the subset of the training set
    where there was a 6/10 majority choice label selected by the
    Turkers.

    For each event, it iterates through the words in the sentence for
    that event, adding 1 for (factbank-label, word) pairs and
    subtracting 1 for (pragbank-label, word) pairs.

    Thus, if the two sets of annotations were the same, these values
    would all be 0. What we see instead is a lot of lexical variation
    (though the results are somewhat marred by the tendency for
    high-frequency words to end up with very large counts).
    """
    # Keep track of the differences:
    diff = defaultdict(lambda : defaultdict(int))
    # Instantiate the corpus:
    corpus = FactbankCorpusReader('fb-semprag.csv')
    # Limit to the training set:
    events = corpus.train_events()
    # Limit to the events with at least a 6/10 majority choice:
    events = filter((lambda e : e.majority_pragvalue()[1] and e.majority_pragvalue()[1] >= 6), events)
    # Iterate through this restricted set of events:
    for event in events:
        # Lemmatize:
        event_words = event.leaves(wn_lemmatize=True)
        # Remove punctuation, so that we look only at real words:
        event_words = filter((lambda x : not re.search(r"\W", x)), event_words)
        # Downcase:
        event_words = map(str.lower, event_words)
        # Word counting:
        for word in event_words:
            diff[event.FactValues['AUTHOR']][word] += 1
            diff[event.majority_pragvalue()[0]][word] += 1
    # Function for formatting the results:
    def fmt(a):
        return ', '.join(map((lambda x : '%s: %s' % x), a))
    # View the results:
    keys = ['ct_plus', 'pr_plus', 'ps_plus', 'ct_minus', 'pr_minus', 'ps_minus', 'uu']
    for key in keys:
        sorted_vals = sorted(diff[key].items(), key=itemgetter(1))
        print key
        print '\tFactBank:', fmt(sorted(sorted_vals[-n:], key=itemgetter(1), reverse=True)) # Put these in decreasing order.
        print '\tPragBank:', fmt(sorted_vals[:n])

Output with n=10:

ct_plus
	FactBank: the: 312, of: 176, to: 161, a: 137, in: 125, be: 114, and: 104, say: 87, million: 59, it: 59
	PragBank: secretary: 1, zinc: 1, unisys: 1, settlement: 1, hate: 1, clearance: 1, whose: 1, violate: 1, herbicide: 1, supreme: 1
pr_plus
	FactBank: the: 266, to: 225, be: 175, a: 91, of: 90, in: 89, and: 86, that: 69, expect: 61, for: 49
	PragBank: focus: 1, follow: 1, bronfman: 1, lure: 1, sony: 1, chairman: 1, lever: 1, ten: 1, core: 1, merieux: 1
ps_plus
	FactBank: the: 187, to: 113, of: 79, a: 75, be: 65, and: 48, in: 48, that: 44, could: 42, for: 31
	PragBank: represent: 1, whose: 1, cheney: 1, send: 1, petersburg: 1, telerate: 1, past: 1, appear: 1, above: 1, public: 1
ct_minus
	FactBank: the: 436, be: 226, to: 217, a: 155, of: 148, have: 128, and: 126, t: 126, it: 115, in: 110
	PragBank: follow: 1, whose: 1, violate: 1, breach: 1, labor: 1, solution: 1, constitution: 1, categorically: 1, along: 1, engage: 1
pr_minus
	FactBank: the: 18, and: 13, be: 12, to: 8, t: 7, in: 7, computer: 6, maker: 6, n: 6, for: 6
	PragBank: japan: 1, yet: 1, potential: 1, do: 1, capital: 1, despite: 1, stiff: 1, fully: 1, large: 1, force: 1
ps_minus
	FactBank: from: 2, the: 2, edelman: 1, boost: 1, run: 1, may: 1, a: 1, asher: 1, martin: 1, make: 1
	PragBank: and: 1, prevent: 1, help: 1, move: 1, intelogic: 1, at: 1, concern: 1, 20: 1, ackerman: 1, stake: 1
uu
	FactBank: the: 281, to: 173, of: 141, be: 126, say: 108, a: 103, in: 103, and: 92, it: 57, that: 53
	PragBank: represent: 1, debenture: 1, 80486: 1, dollar: 1, focus: 1, nadeau: 1, zinc: 1, settlement: 1, hate: 1, whose: 1

C-commanding modals

Modals are clearly related to veridicality. In FactBank, they are treated as fairly reliable markers of specific veridicality categories, and such associations turn up in the PragBank annotations as well.

The present section provides an initial look at the way modals relate to specific veridicality values.

The first step is defining a function that, when given a tree and a word in that tree (in our case, the event text), returns the set of modals that c-command that word (by the most conservative definition of c-command):

def tree_has_modal_daughter(tree):
    """
    If tree has a preterminal daughter whose terminal is a modal,
    return that modal, else return False.
    """
    modal_re = re.compile(r'^(Can|Could|Shall|Should|Will|Would|May|Might|Must|Wo)$', re.I)
    for daught in tree:
        if swda_experiment_clausetyping.is_preterminal(daught) and modal_re.search(daught[0]):
            return daught[0]
    return False

def c_commanding_modals(tree, terminal):
    """Return the set of modals that c-command terminal in tree."""
    modals = set([])
    for subtree in tree.subtrees():
        md = tree_has_modal_daughter(subtree)
        if md:
            if terminal in subtree.leaves():
                modals.add(md)
    return modals
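To get a feel for what these functions return, here is a quick check (same setup as the functions above) that finds the first training event whose event word has a c-commanding modal and prints it; the sentence you see will depend on the corpus ordering, so no output is shown:

corpus = FactbankCorpusReader('fb-semprag.csv')
for event in corpus.train_events():
    modals = c_commanding_modals(event.SentenceParse, event.eText)
    if modals:
        # Print the event word, its c-commanding modals, and the sentence:
        print event.eText, sorted(modals), event.Sentence
        break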

The following function then uses this modal-finding ability to gather a count matrix relating modal tokens to the FactBank or PragBank annotations:

def modal_stats(factbank_or_pragbank):
    """
    Gather and print a matrix relating modal use (rows) to
    veridicality values:

    factbank_or_pragbank (str) -- if 'factbank' (case-insensitive),
        then use the FactBank annotations,
        else use the PragBank majority annotation

    The calculations are limited to the subset of the events that have
    a 6/10 majority category in PragBank, to facilitate comparisons
    between the two annotation groups.
    """
    corpus = FactbankCorpusReader('fb-semprag.csv')
    # Limit to the training set:
    events = corpus.train_events()
    # Limit to the events with at least a 6/10 majority choice:
    events = filter((lambda e : e.majority_pragvalue()[1] and e.majority_pragvalue()[1] >= 6), events)
    # For the counts:
    counts = defaultdict(lambda : defaultdict(int))
    # Iterate through the events:
    for event in events:
        modals = c_commanding_modals(event.SentenceParse, event.eText)
        for modal in modals:
            val = None
            if factbank_or_pragbank.lower() == 'factbank':
                val = event.FactValues['AUTHOR']
            else:
                val = event.majority_pragvalue()[0]
            counts[modal][val] += 1
    # Modals:
    modals = sorted(counts.keys())
    # Categories:
    keys = ['ct_plus', 'pr_plus', 'ps_plus', 'ct_minus', 'pr_minus', 'ps_minus', 'uu']
    # Little function for displaying neat columns:
    def fmt(a):
        return "".join(map((lambda x : str(x).rjust(10)), a))
    # Output printing:
    print "======================================================================"
    print factbank_or_pragbank
    print fmt([''] + keys)
    for modal in modals:
        row = [modal]
        for cat in keys:
            row.append(counts[modal][cat])
        print fmt(row)

Here is the output for the two annotation types:

modal_stats('FactBank')
======================================================================
FactBank
             ct_plus   pr_plus   ps_plus  ct_minus  pr_minus  ps_minus        uu
       can         0         0         0         0         0         0         3
     could         0         0        22        10         0         0         2
       may         0         0        12         0         0         1         2
     might         0         0         9         1         0         0         1
      must         0         0         0         0         0         0         3
    should         0         1         0         0         0         0         2
      will         2         8         1         3         0         0        14
     would         0         2         4         7         0         0         8
modal_stats('PragBank')
======================================================================
PragBank
             ct_plus   pr_plus   ps_plus  ct_minus  pr_minus  ps_minus        uu
       can         0         0         1         0         0         0         2
     could         1         1        20        10         0         0         2
       may         0         1        13         0         0         0         1
     might         0         1         9         1         0         0         0
      must         1         0         0         0         0         0         2
    should         0         2         0         0         1         0         0
      will         9        12         2         3         0         0         2
     would         8         2         3         8         0         0         0

The patterns are broadly similar, but the modals are less categorical in their behavior on the PragBank data, with the trend again being toward less use of the uncertainty category uu. This shift is especially pronounced for will and would: for example, will is labeled uu in 14 of its 28 FactBank tokens but in only 2 of its 28 PragBank tokens.

Exercises

CM Summarize de Marneffe et al.'s (2011) characterization of the confusion matrix generated by semprag_confusion_matrix(), and then offer new observations about the pattern and/or what is causing it.

PRAGNONAUTHOR The function author_nonauthor_factvalue_compare() compares AUTHOR and non-AUTHOR FactBank annotations. As discussed above, what we see is a very strong tendency for the AUTHOR annotation to be more uncertain than the non-AUTHOR one.

How do the PragBank (READER-level) annotations compare to the non-AUTHOR ones? To address this, write a function comparable to author_nonauthor_factvalue_compare() but that compares the 6/10 majority choice PragBank annotation (where there is one) to the FactBank non-AUTHOR annotations. What is the overall picture like and how does it compare to the output of author_nonauthor_factvalue_compare()?

HIGHFREQ The function lexical_associations() is somewhat helpful in understanding the way lexical items associate with the veridicality tags for the two annotation types, but it is clear that the method is heavily biased in favor of high-frequency words. This is presumably because even small differences get amplified by the high token counts. Though these associations might be important, they clearly dampen the effects of more interesting markers of veridicality.

Your task: Try to devise a method for finding these associations that is less susceptible to frequency effects (both high-frequency effects, as with the current function, and low-frequency ones).

COMMANDEX The goal of this exercise is to generalize the c-commanding modals functionality:

  1. Modify (and perhaps change the name of) c_commanding_modals() so that it takes a regular expression as an additional argument and then looks for c-commanding nodes whose terminals match it.
  2. Modify (and perhaps change the name of) modal_stats() so that it takes a regular expression as its second argument and passes it to your revised c_commanding_modals().
  3. Write an interesting regular expression (capturing, say, a class of verbs, determiners, etc.) and provide the results of running your modified code with it. How do FactBank and PragBank compare for this output?