Experiment: Question acts and interrogative clauses in the SwDA

  1. Overview
  2. On the SwDA 'q'-type tags
    1. qy: yes-no questions
    2. qr: 'or' questions
    3. qrr: or-clause tacked on after a y/n question
    4. ^d: declarative questions
    5. ^g: question tags
  3. Identifying interrogative clauses
  4. The experiment
    1. Method
    2. Results
    3. Discussion
      1. False positives
      2. False negatives
  5. Exploring other associations between form and function
    1. The distribution of root nodes
    2. The hearer perspective: P(tag|clause-type)
    3. The speaker perspective: P(clause-type|tag)
    4. Putting the pieces together
  6. Exercises

Overview

To what extent is dialog act predictable from clause typing? I explore this question with a narrow focus on question acts and polar interrogative clause types. In the context of the SwDA, this boils down to the relationship between the dialog-act tag 'qy' and polar interrogative clausal forms. Most of the work involves homing in on what these clausal forms are actually like.

Code and data:

On the SwDA 'q'-type tags

This section looks at the q-type tags that are likely to be confusable with qy. To keep the discussion streamlined, I leave the following tags aside:

Searching the section numbers is an easy way to find the descriptions in the Coders' Manual.

qy: yes-no questions (5.2.2.1)

DAMSL group: qy, qy^g, qy^t, qy^r, qy^h, qy^g^t, qy^m, qy^c, qy^2, qy(^q), qy^g^r, qy^g^c, qy^c^r

The Coders' Manual for the SwDA suggests that we can simply assume that 'qy', as a dialog act, will correspond to an inverted polar interrogative clause:

qy is used for yes-no questions only if they both have the pragmatic force of a yes-no-question *and* if they have the syntactic and prosodic markings of a yes-no question (i.e. subject-inversion, question intonation).

qy          B.82 utt1: Do you have to have any special training? /  
qy          A.1 utt1: Do you know anyone that, {F uh, }[ is, + is ] in a
qy          A.1 utt1:  Okay, {F um, }  Chuck, do you have any pets # there at your home? # /  
qy          B.28 utt1:  Does he bite her enough to draw blood? /  
qy          B.48 utt1:  Is that the only pet that you have? /  
qy          A.55 utt2: {D So } have you tried any other pets? /  
qy          A.96 utt3: Do you? /  

Yes-no questions that are pragmatically questions but have declarative syntax are marked with ^d. Yes-no questions that are syntactically (in form) questions but do not rhetorically function as questions ("rhetorical questions") are marked either as qh or bh, depending on whether the rhetorical question is functioning as a backchannel. See the other sections for examples of each of these other kinds of "questions".

However, the following note in the section on declarative questions indicates that there are going to be problems with a syntactic selection method that depends on the canonical inverted form for polar interrogatives.

However, if the statement has an "ellipsed" aux-inversion at the beginning, we don't code it as a declarative question (following Weber 1993).

qy          B.44 utt1: Worried that they're not going to get enough attention? /

qr: 'or' questions (5.2.2.4)

DAMSL group: qr, qr^d, qr^t, qr(^q)

Coder's Heuristics

examples:

qr          B.50 utt1:  {D Well, } do you live, [ [ you, + you ] + ] in a house, 
                        or a  place where you, {F uh, } -/  
qr          B.95 utt1:  # {D Well } # do you all work for T I, or for, -/  
qr          B.36 utt1:  # {D Now, } # [ are they, + are they ] rehabilitative 
                            [ or, + or ] not. /

One problem with or-questions is that the listener often interrupts before the or-clause is complete and answers the or-question as if it were a yes-no question about the first clause. For example:

qr      B60 utt1:  Did you bring him to a doggy obedience school or --

nn      A61 utt1:  No --  /

+       B62 utt1:  -- just --

sd^e    A63 utt1:  -- we never did. /

+       B64 utt1:  -- train him on your own   /

We count this as a qr since the speaker goes on to finish his qr, even though the listener answers it immediately as a yes-no question. Our current viewpoint is that if there's a conflict between labeling "what the speaker thinks" and "what the hearer thinks", go with whichever coding is more informative for the reader, which in this case is the speaker-labeling (because if you were reading the transcript you could figure out that a qr followed by a "No" answer means that the listener misinterpreted). But if you labeled it the other way (i.e., as a "qy") then it would be harder to figure out that the speaker was thinking of the utterance as an or-question.

qrr: or-clause tacked on after a y/n question (5.2.2.5)

DAMSL group: qrr, qrr^t, qrr^d

Coder's Heuristics

These are used when you think the speaker tacked on an or-clause to what had been a yes-no question, so "qrr" marks a sort of "dangling or-clause", e.g. B.18.utt2.

qy          B.18 utt1:  # [ Do you watch, + # do you watch ] [ the network, +  
                         {D like } major network ] news,  /
qrr          B.18 utt2: {C or } do you watch {D like } --

sd          A.19 utt1:  [ Just the # regular channel # -- +

+          B.20 utt1:  -- # the MACNEIL LEHRER HOUR? # /

sd          A.21 utt1:  -- just channel eight. ] /  

When the speaker uses the word "or" after a qy in a slash-unit by itself at the end of a turn, it is coded as a turn-exit (i.e. %):

qy*      B.64 utt1:  {F Uh, } is that the crime  /  [[*listen]]
qy       B.64 utt2:  {C and } it's already,  ((   ))  some chart and 
              determine the punishment,  /
%      B.64 utt3: {C or. } -/

^d: declarative questions (5.2.2.6)

DAMSL group: ^d, ^d^t, ^d^r, ^d^m, ^d^t, ^d^h, ^d^c, ^d(^q)

These labels are in an independent dimension from the other question labels (qy,qw,qo,qr,qrr). Like some of the other SWBD-DAMSL "extra dimensions", these are primarily designed to code form.

Declarative questions (^d) are utterances which function pragmatically as questions but which do not have "question form". We don't know if declarative questions will have a different conversational function than non-declarative questions (although see Weber 1993 for thoughts on this), but we definitely expect them to be useful for ASR language-model purposes.

Declarative questions normally have no wh-word as the argument of the verb (except in "echo-question" format), and have "declarative" word order in which the subject precedes the verb. See Weber 1993, Chapter 4, for a survey of declarative questions and their various realizations.

Declarative questions *may* have rising "question-intonation". The "declarative" tag is added solely based on form. This does not mean that the intonation of the question is irrelevant. We are marking the prosodic features of each utterance in Switchboard in another, distinct database.

Coder's Heuristics

These are all ^d (declarative questions): (B.46.utt1 is an example of a declarative question with a wh-word)

qy^d          B.44 utt1:  <Laughter>  {D So } you're taking a government course? /
qw^d      B.46 utt1:  At what?  /
qy^d         B.46 utt2:  The university? /
qw^d          B.22 utt1:  [ {C And, } + {C and } ] you say you've had him how long? /
qy^d          A.1 utt3:  I don't know if you are familiar with that./
qy^d          A.3 utt1:  {C But } not for petty theft? 
qy^d          A.65 utt1:  {D Well, } I guess we'll get pretty good news coverage
                     in a couple of years when you host the, { F uh, } 
                     summer olympics <laughter>. /  

Or the following:

qy^d          B.2 utt2: You're asking what my opinion about,

 ny           A.3 utt1:  # Yeah. # /

  +          @B.4 utt1:  # whether it's # possible <laughter> to have honesty in government.  /

Or here's another one:

qy^d          A.64 utt2: you must be a T I employee. /

However, if the statement has an "ellipsed" aux-inversion at the beginning, we don't code it as a declarative question (following Weber 1993).

qy          B.44 utt1: Worried that they're not going to get enough
        attention? /

^g: question tags (5.2.2.7)

DAMSL group: ^g, ^g^t, ^g^r, ^g^c

A 'tag' question consists of a statement and a 'tag' which seeks confirmation of the statement. Because the tag gives the statement the force of a question, the tag question is coded 'qy^g'. The tag may also be transcribed as a separate slash unit, in which case it is coded '^g'.

Coder's Heuristics

A question designed to check whether the listener understands what the speaker's point is should be distinguished from a question tag. The listener may respond affirmatively to such an 'understanding check' (i.e., confirm that s/he "understands what I'm saying") while still disagreeing with the speaker's statement. The appropriate response to a tag question, on the other hand, confirms the *statement*.

The appropriate code for an understanding check is "qy".

The appropriate response to an understanding check, like the response to a tag question, is usually coded 'ny' or 'nn'.

In answering a true tag, you are confirming or disconfirming the statement that precedes it.

In answering a question about 'understanding-check', listener is not taking any position on the statement that preceded it. S/He is merely indicating that the statement was understood.

Tag questions all have either an aux-inversion at the end (don't you? doesn't it? isn't he? aren't you?) which (almost always) reverses the polarity of the auxiliary in the matrix statement, or a one-word tag like ", right?" or ", huh?".

Here are some examples of ^g (tag questions): single-word tag:

qy^g      A.39 utt2: {F Uh, } I guess a year ago you're probably watching C N N a lot, right? /

unreversed polarity, with subject-aux inverted tag:

qy^g@     @B:  {D So } you live in Utah do you? /

reversed polarity, with subject-aux inverted tag:

qy^g       A.27 utt1:  That's a problem, isn't it? /
qy^g       B.54 utt1:  # {C But } that doesn't eliminate it, does it? # /

tag in single slash unit:

sd      A.1 utt 1:      Well, Hank Williams is one we forgot about.  /
^g      A.2 utt 2:      Right?  /

__________
sd	A.13 utt2: as a matter of fact, I want to think they took the top 
             managers first,  /
^g      A.13 utt3: isn't that a fact?  /
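
The cues just described (a sentence-final aux + pronoun sequence after a comma, or a one-word tag like ", right?" or ", huh?") suggest a simple string-level heuristic. The following sketch is my own illustration, not part of the SwDA tools; the regex is rough and, for instance, misses comma-less tags like "you live in Utah do you?":

```python
import re

# Subject-aux tags like "isn't it?" or "do you?"; the alternatives 'wo',
# 'ca', and 'ai' cover the contracted forms won't, can't, and ain't.
AUX = (r"(?:is|are|was|were|do|does|did|have|has|had|can|could|will|would|"
       r"shall|should|may|might|must|wo|ca|ai)(?:n't)?")
PRONOUN = r"(?:i|you|he|she|it|we|they)"
TAG_RE = re.compile(
    r",\s*(?:" + AUX + r"\s+" + PRONOUN + r"|right|huh|eh)\s*\?\s*$", re.I)

def looks_like_tag_question(utterance):
    """Rough check for a comma-preceded, sentence-final question tag."""
    return bool(TAG_RE.search(utterance))

print(looks_like_tag_question("That's a problem, isn't it?"))  # True
print(looks_like_tag_question("I watch C N N a lot, right?"))  # True
print(looks_like_tag_question("Do you have any pets?"))        # False
```

A real version would also have to cope with the SwDA's disfluency markup and trailing slash-unit symbols.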

Identifying interrogative clauses

SQ root nodes roughly pick out polar interrogative clause types. Some of them have extensions like -CLF (Was it Sue... and Who was it ...) and -UNF (a clause that the speaker did not finish). Our first step in identifying polar interrogative matrix clauses is to define a function that selects these SQ trees:

  def root_is_labeled_SQ(tree):
      """Return True if the root of tree is an SQ node of some kind, else False."""
      if tree.node.startswith('SQ'):
          return True
      else:
          return False

This will collect polar interrogatives. It excludes constituent questions because they have 'SBARQ' as their root node, with an embedded SQ:

  (SBARQ (WHNP-1 (WP What)) (SQ (VBP do) (NP-SBJ (PRP you)) (VP (VB think) (NP (-NONE- *T*-1)))) (. ?) (-DFL- E_S))

However, root-level SQ labels a lot of structures that do not involve inversion. For example, all tag questions have this as their root label:

  (SQ (S (NP-SBJ (DT that)) (VP (VBZ explains) (NP (PRP it)))) (SQ (VBZ does) (RB n't) (NP-SBJ (PRP it))) (. .) (-DFL- E_S))

It also captures cases where there is no inversion. Some of these are declarative questions that we want to exclude:

  1. EXAMPLE

To address this, I define a function has_leftmost_aux_daughter(tree) that tries to identify clauses with auxiliary verb daughters that are to the left of any nominal or sentential nodes (signaling inversion but ignoring disfluencies, interjections, adverbial clauses, etc.):

  def has_leftmost_aux_daughter(tree):
      """Return True if tree has an aux-tree daughter that precedes any NP or SBAR nodes."""
      daughters = [daughter for daughter in tree if re.search(r'(NP$|SBAR$|^V|MD)', daughter.node)]
      if daughters and is_aux_tree(daughters[0]):
          return True
      else:
          return False

  def is_aux_tree(tree):
      """Return True if argument tree is of the form (V*|MD aux), else False."""
      verbal_re = re.compile(r'(^V|MD)', re.I)
      aux_re = re.compile(r'(Is|Are|Was|Were|Have|Has|Had|Can|Could|Shall|Should|Will|Would|May|Might|Must|Do|Does|Did|Wo)', re.I)
      if is_preterminal(tree) and verbal_re.search(tree.node) and aux_re.search(tree[0]):
          return True
      else:
          return False

  def is_preterminal(tree):
      """Return True if tree is of the form (PARENT CHILD), else False."""
      if len(list(tree.subtrees())) == 1:
          return True
      else:
          return False

This leads to the proposal for characterizing polar interrogatives syntactically:

  def is_polar_interrogative(tree):
      """Return True if tree is rooted at an SQ node and has a leftmost aux, else False."""
      if root_is_labeled_SQ(tree) and has_leftmost_aux_daughter(tree):
          return True
      else:
          return False
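
To see how these functions interact without loading the corpus, here is a self-contained sketch. The tiny Tree class is a stand-in for nltk.tree.Tree (supporting just .node, iteration over daughters, indexing, and subtrees()), and the two example trees are simplified by hand; none of this is part of the swda package:

```python
import re

class Tree:
    """Minimal stand-in for nltk.tree.Tree: a labeled node with daughters."""
    def __init__(self, node, children):
        self.node = node
        self.children = list(children)
    def __iter__(self):
        # Iterate over subtree daughters only (skip string leaves):
        return iter(c for c in self.children if isinstance(c, Tree))
    def __getitem__(self, i):
        return self.children[i]
    def subtrees(self):
        yield self
        for child in self:
            yield from child.subtrees()

AUX_RE = re.compile(r'(Is|Are|Was|Were|Have|Has|Had|Can|Could|Shall|Should|'
                    r'Will|Would|May|Might|Must|Do|Does|Did|Wo)', re.I)

def is_preterminal(tree):
    return len(list(tree.subtrees())) == 1

def is_aux_tree(tree):
    return bool(is_preterminal(tree)
                and re.search(r'(^V|MD)', tree.node)
                and AUX_RE.search(tree[0]))

def has_leftmost_aux_daughter(tree):
    daughters = [d for d in tree if re.search(r'(NP$|SBAR$|^V|MD)', d.node)]
    return bool(daughters) and is_aux_tree(daughters[0])

def is_polar_interrogative(tree):
    return tree.node.startswith('SQ') and has_leftmost_aux_daughter(tree)

# "Do you have pets?" -- inverted, with the aux directly under SQ:
inverted = Tree('SQ', [
    Tree('VBP', ['Do']),
    Tree('NP-SBJ', [Tree('PRP', ['you'])]),
    Tree('VP', [Tree('VB', ['have']), Tree('NP', [Tree('NNS', ['pets'])])])])

# "You have pets?" -- declarative question, also rooted at SQ:
declarative = Tree('SQ', [
    Tree('NP-SBJ', [Tree('PRP', ['You'])]),
    Tree('VP', [Tree('VBP', ['have']), Tree('NP', [Tree('NNS', ['pets'])])])])

print(is_polar_interrogative(inverted))     # True
print(is_polar_interrogative(declarative))  # False
```

The crucial signal, as in the corpus functions, is an auxiliary preterminal that precedes any NP/SBAR/verbal daughters of the SQ root.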

And now an informal inspection, limiting attention to the utterances that have a single tree corresponding perfectly to the utterance text. More precisely, the following function classifies trees and keeps a random sample of 100 yes's and 100 no's for us to inspect.

  def inspect_trees(characterizing_function, output_filename, num=100):
      """
      Print num trees that characterizing_function says True to, and num
      trees that characterizing_function says False to.

      Arguments:
      characterizing_function -- a function that says True or False for trees
      output_filename (str) -- filename for the output
      num (int) -- the number of randomly selected trees to print out (default: 100)

      Value: stores the trees in output_filename
      """
      yes_trees = []
      no_trees = []
      for utt in CorpusReader('swda').iter_utterances(display_progress=True):
          if utt.tree_is_perfect_match():
              tree = utt.trees[0]
              if characterizing_function(tree):
                  yes_trees.append(tree)
              else:
                  no_trees.append(tree)
      s = ''
      for trees, label in ((yes_trees, 'True'), (no_trees, 'False')):
          shuffle(trees)
          for t in trees[:num]:
              s += '======================================================================\n'
              s += label + "\n"
              s += t.pprint() + "\n"
      open(output_filename, 'w').write(s)

The evaluation is then done as follows:

  inspect_trees(is_polar_interrogative, 'swda-clausetyping-polar-int-test.txt', num=100)

The results are encouraging, broadly speaking. The list of 'True' responses looks to be high precision, capturing even difficult looking cases like:

  (SQ (INTJ (UH Um)) (, ,) (RB so) (, ,) (PP-TMP (IN at) (NP (NP (DT this) (NN time)) (PP (IN of) (NP (DT the) (NN year))))) (VBP are) (NP-SBJ (PRP you)) (VP (VBG doing) (NP (JJ much) (NN garden) (NN work))) (. ?) (. .) (-DFL- E_S))

However, there is a worrisome pattern in the 'False' list: lots of apparently elliptical interrogatives like these:

  (SQ (NP-SBJ (PRP You)) (ADVP-TMP (RB ever)) (VP (VBN been) (PP-DIR-PRD (IN to) (NP (NP (NNP Houston) (NNP 's)) (PP-LOC (IN on) (NP (NNP Belt) (NNP Line)))))) (. ?) (-DFL- E_S))
  (SQ (NP-SBJ (PRP You)) (VP (VBP mean) (EDITED (RM (-DFL- \[)) (PP (IN in) (NP-UNF (DT the))) (, ,) (IP (-DFL- \+))) (PP (IN in) (NP (DT the) (RS (-DFL- \])) (ADJP (RBS most) (JJ recent)) (NN conflict)))) (. ?) (-DFL- E_S))
  (SQ (INTJ (UH Well)) (, ,) (NP-SBJ (PRP they)) (VP (VBG pushing) (NP (DT the) (NN death) (NN penalty))) (. ?) (-DFL- E_S))

Nonetheless, let's assume that we have the ability, with is_polar_interrogative(tree), to identify polar interrogative trees. The next step is to make sure we understand what things get labeled qy.

The experiment

Method

The experiment simply involves gathering data on the relationship between the act-tag 'qy' and the values returned by is_polar_interrogative(tree). The following function puts all these pairs into a CSV file so that we can study and visualize the patterns in R:

  def inversion_and_qy_to_csv(tree_classifying_function, output_filename):
      """
      Runs a clause-typing experiment, by relating dialog act tags to
      the values given by the supplied function. The results are put
      into CSV format:

      Tag, PolarIntClauseType
      t1,  func_value1
      t2,  func_value2
      ...

      Arguments:
      tree_classifying_function -- any function from nltk.tree.Tree
                                   objects with outputs that can be
                                   sensibly stringified
      output_filename (str) -- the output CSV filename
      """
      rows = []
      for utt in CorpusReader('swda').iter_utterances(display_progress=True):
          if utt.tree_is_perfect_match():
              # Capitalize this string so that R treats it as a boolean:
              is_polar_int = str(tree_classifying_function(utt.trees[0])).upper()
              rows.append([utt.act_tag, is_polar_int])
      csvwriter = csv.writer(open(output_filename, 'w'))
      csvwriter.writerow(['Tag', 'PolarIntClauseType'])
      csvwriter.writerows(rows)

The resulting file:

Results

Now over in R, we read in the CSV file and check out the confusion matrix:

  d = read.csv('swda-clausetyping-results.csv')
  d$qy = d$Tag=='qy'
  m = xtabs(~ qy + PolarIntClauseType, data=d)
  m
         PolarIntClauseType
  qy      FALSE  TRUE
    FALSE 93783   709
    TRUE    382  1496

This looks good, and it's of course initially very satisfying to look at overall accuracy:

  # Percentage correct:
  sum(diag(m)) / sum(m)
  [1] 0.988679

Whoa! Nearly perfect! However, this is a poor assessment figure in this context, because the (FALSE, FALSE) corner of the confusion matrix is so massive. Consider, for example, the right column of the confusion matrix. It says that 1496 of the clauses we identified are in fact 'qy' acts. However, in order to achieve this high number, we had to guess wrong 709 times, a fairly substantial part of the total. Intuitively, our precision is low. Formally, a system's precision for a category C is defined as the number of true positives for C divided by the total number of times that the system guessed C. Here are the relevant calculations for the 'qy' and non-'qy' categories:

  # Precision for qy == FALSE:
  m[1,1] / sum(m[, 1])
  [1] 0.9959433
  # Precision for qy == TRUE:
  m[2,2] / sum(m[, 2])
  [1] 0.678458

Precision for the non-'qy' category remains high. We do indeed do very well there. It's much more modest for the smaller 'qy' category, though.

A system can achieve perfect precision for a category C by never guessing it. For this reason, precision is generally paired with recall, which, for a category C, is the number of C-type things that the system correctly identifies divided by the total number of C-type things. Thus, we now calculate row-wise in the confusion matrix:

  # Recall for qy == FALSE:
  m[1,1] / sum(m[1, ])
  [1] 0.9924967
  # Recall for qy == TRUE:
  m[2,2] / sum(m[2, ])
  [1] 0.7965921

Once again the 'qy' category is the worrisome one.
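
For cross-checking, the same precision and recall computations can be done in plain Python; the 2x2 matrix below copies the counts from the confusion matrix above, with rows giving the actual qy status (FALSE, TRUE) and columns the predicted clause type:

```python
# Confusion matrix: m[actual][predicted], with index 0 = FALSE, 1 = TRUE.
m = [[93783, 709],
     [382, 1496]]

def precision(m, c):
    """True positives for class c over everything predicted as c (a column sum)."""
    return m[c][c] / sum(row[c] for row in m)

def recall(m, c):
    """True positives for class c over everything actually of class c (a row sum)."""
    return m[c][c] / sum(m[c])

print(round(precision(m, 1), 6))  # 0.678458
print(round(recall(m, 1), 6))     # 0.796592
```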

We want to get to the bottom of this. That's what the discussion section below is all about. However, it's nice to round out this section with a bit of statistical evidence that we have found a true association between clause-type and dialog act, in the form of a quick chi-squared test on the matrix:

  # Chi-squared test:
  chisq.test(m)

          Pearson's Chi-squared test with Yates' continuity correction

  data:  m
  X-squared = 51249.22, df = 1, p-value < 2.2e-16

Now let's figure out what's going on with the mis-alignments.

Discussion

False positives

  fp = subset(d, PolarIntClauseType==TRUE & d$qy==FALSE)
  fpMatrix = xtabs(~ Tag, data=fp, drop.unused.levels=TRUE)
  fpTable = as.data.frame(fpMatrix)
  fpTable = fpTable[order(fpTable$Freq, decreasing=TRUE), ]
  rownames(fpTable) = seq(1, nrow(fpTable))
  library(xtable)
  x = xtable(head(fpTable, 22), digits=0)
  print(x, type='html')
   Tag    Gloss                                     Freq
1  bh     rhetorical question continuer              205
2  %      indeterminate, abandoned                   118
3  qrr    or-clause (or is it more of a company?)     78
4  qr     alternative (`or') question                 64
5  qh     rhetorical question                         45
6  qy^t   qy + "about the task"                       32
7  qy^d   declarative question                        19
8  sd     statement non-opinion                       14
9  b      acknowledge (backchannel)                   13
10 ba     appreciation                                13
11 sv     statement opinion                           13
12 ^q     quotation                                   12
13 qy^r   qy + repeated                                9
14 ad     action directive                             7
15 qy^c   qy + about communication                     7
16 fc     conventional closing                         6
17 qo     open-ended question                          6
18 aa     accept                                       5
19 ^g     tag-question                                 4
20 qrr^t  qrr + "about the task"                       3
21 qy^g   qy + tag question                            3
22 qy^h   qy + "let me think"-style hold               3

Some of these mistakes are forgivable. For example a qy^t (question about the task) is still a qy, as is qy^r (repeated question). This is arguably true of qrr (or-initial question) and qy^h (question with request for a moment to think).

Other mistakes are not forgivable, but they are informative. For example, the overall syntactic structure does not reveal whether a question is rhetorical or not, and this is the largest source of errors, with bh and qh together accounting for 250 mistakes.

We could get some mileage out of reaching down into the structure to try to detect whether the question is an 'or' question or not.
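
As a rough illustration of that idea, the sketch below scans a tree's leaves for the word "or". The trees here are represented as plain nested lists of the form [label, daughter, ...] purely for illustration; a real version would walk nltk.tree.Tree objects and would probably restrict attention to CC daughters rather than arbitrary leaves:

```python
def leaves(tree):
    """Collect the terminal strings of a nested-list tree [label, d1, d2, ...]."""
    out = []
    for daughter in tree[1:]:
        if isinstance(daughter, list):
            out.extend(leaves(daughter))
        else:
            out.append(daughter)
    return out

def contains_or(tree):
    """Rough signal that an SQ could be a qr/qrr ('or') question."""
    return any(leaf.lower() == 'or' for leaf in leaves(tree))

# "Are they rehabilitative or not?"
sq = ['SQ', ['VBP', 'Are'], ['NP-SBJ', ['PRP', 'they']],
      ['ADJP-PRD', ['JJ', 'rehabilitative']], ['CC', 'or'], ['RB', 'not']]
print(contains_or(sq))  # True
```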

False negatives

To understand the false negatives, we need to inspect the trees themselves. The following function writes a randomly selected subset of the false-negative trees to a file called 'swda-clausetyping-fn-trees.txt'.

  def inspect_false_negatives(output_filename, num=50):
      """
      Identify false negatives and print num randomly selected instances
      to output_filename for inspection.
      """
      trees = []
      for utt in CorpusReader('swda').iter_utterances(display_progress=False):
          if utt.tree_is_perfect_match():
              if utt.act_tag == 'qy' and not is_polar_interrogative(utt.trees[0]):
                  trees.append(utt.trees[0])
      shuffle(trees)
      s = ""
      for tree in trees[:num]:
          s += '======================================================================\n'
          s += tree.pprint() + '\n'
      open(output_filename, 'w').write(s)

I ran this function and then hand-annotated the output, which you can check out here:

I found the breakdown of error types given in Table FN. The "Declarative forms with visible auxs" are the ones that have haunted us throughout the investigation. Their high number is surprising: these utterances have the form of rising declaratives, yet the annotators judged them not to serve the special rhetorical function associated with those clauses (otherwise they presumably would have been tagged qy^d).

Error type                             Count
Declarative forms with visible auxs       21
Elliptical forms                          15
Fragments of various kinds                11
wh interrogatives                          2
Unanticipated EDITED node structure        1

Table FN
Different kinds of false negative.

Exploring other associations between form and function

This section introduces a method for seeking out other areas in which it seems fruitful to study the associations between form (syntax) and function (pragmatics).

My basic strategy is to use the root nodes of the Penn Treebank parses as rough approximations of their clause-types, relating them to act tags in various ways.

Set-up

The following Python function builds a CSV file whose rows are (ActTag, DAMSL act tag, root label) triples. This will be the basis for the exploration.

  #!/usr/bin/env python

  import csv
  from collections import defaultdict
  from operator import itemgetter
  from swda import *

  def act_tags_and_rootlabels():
      """
      Create a CSV file named swda-actags-and-rootlabels.csv in
      which each utterance utt has its own row consisting of just

      utt.act_tag, utt.damsl_act_tag(), and utt.trees[0].node

      restricting attention to cases in which utt has a single,
      perfectly matching tree associated with it.
      """
      csvwriter = csv.writer(open('swda-actags-and-rootlabels.csv', 'w'))
      csvwriter.writerow(['ActTag', 'DamslActTag', 'RootNode'])
      corpus = CorpusReader('swda')
      for utt in corpus.iter_utterances(display_progress=True):
          if utt.tree_is_perfect_match():
              csvwriter.writerow([utt.act_tag, utt.damsl_act_tag(), utt.trees[0].node])

Here's a direct link to the resulting file:

From here on, we'll work with the file in R:

  df = read.csv('swda-actags-and-rootlabels.csv')
  head(df)
    ActTag     DamslActTag RootNode
  1      o fo_o_fw_"_by_bc     INTJ
  2     qy              qy       SQ
  3     sd              sd        S
  4     ad              ad        S
  5      h               h        S
  6     ad              ad    S-IMP

The distribution of root nodes

Table ROOTS provides the full set of root nodes in the SwDA parses with their counts (restricting attention to utterances for which tree_is_perfect_match() == True); this table was created with the following commands:

  ## Create a data.frame of RootNode counts:
  counts = as.data.frame(xtabs(~ RootNode, data=df))
  ## Sort from most to least frequent:
  counts = counts[order(counts$Freq, decreasing=TRUE), ]
  ## Rename the rows to reflect this new ranking:
  rownames(counts) = seq(1, nrow(counts))
  ## Load the xtable library for printing the data.frame:
  library(xtable)
  x = xtable(counts, digits=0)
  print(x, type='html')
RootNode Count
1 S 51205
2 INTJ 30970
3 S-UNF 3807
4 SQ 2309
5 FRAG 2167
6 SBARQ 1454
7 NP 1113
8 SBAR-PRP 568
9 S-IMP 555
10 X 317
11 PP 183
12 ADVP 153
13 ADJP 150
14 SBAR 132
15 SBAR-ADV 123
16 SQ-UNF 116
17 S-1 98
18 VP 92
19 PP-LOC 80
20 SINV 71
21 WHNP 65
22 NP-UNF 50
23 SBAR-TMP 49
24 INTJ-UNF 47
25 S-2 44
26 SBARQ-UNF 44
27 S-CLF 29
28 ADVP-TMP 26
29 NP-TTL 26
30 NP-TMP 24
31 PP-TMP 22
32 SBAR-PRP-UNF 22
33 WHADVP 20
34 PP-UNF 19
35 S-3 18
36 SBAR-ADV-UNF 17
37 WHADJP 15
38 SQ-CLF 12
39 ADVP-UNF 10
40 PRN 9
41 SBAR-UNF 9
42 ADVP-LOC 8
43 NP-ADV 8
44 SBAR-NOM 8
45 UCP 8
46 VP-UNF 8
47 S-4 6
48 S-PRP 6
49 ADVP-MNR 5
50 S-SEZ 5
51 SBAR-LOC 5
52 SBAR-TMP-UNF 5
53 ADJP-PRD 4
54 NP-VOC 4
55 WHNP-UNF 4
56 FRAG-1 3
57 PP-DIR 3
58 S-ADV 3
59 2
60 CONJP 2
61 PP-PRP 2
62 PP-TMP-UNF 2
63 S-5 2
64 WHADVP-UNF 2
65 WHPP 2
66 ADJP-UNF 1
67 ADVP-TMP-UNF 1
68 FRAG-2 1
69 FRAG-3 1
70 FRAG-UNF 1
71 NAC 1
72 NP-2 1
73 NP-SBJ-UNF 1
74 NP-TTL-UNF 1
75 PP-DTV 1
76 PP-LOC-UNF 1
77 PP-PRP-UNF 1
78 S-6 1
79 S-CLF-3 1
80 S-IMP-UNF 1
81 S-UNF-1 1
82 SBAR-MNR 1
83 SBAR-NOM-SBJ 1
84 SINV-SEZ 1
85 SINV-UNF 1
86 UCP-TMP 1
87 VP-TTL 1
88 WHADJP-UNF 1
Table ROOTS
Root nodes in the SwDA, with their counts, restricting attention to utterances for which tree_is_perfect_match() == True.

The hearer perspective: P(tag|clause-type)

The first perspective I take is a hearer perspective in the following sense: it asks, given that I heard/parsed clause-type C, what are the probabilities of the various act tags (pragmatic functions) that the speaker might have intended?

For example, the following code builds such a distribution for the root node S-IMP (imperative):

  imp = subset(df, RootNode=='S-IMP')
  tagCounts = xtabs(~ DamslActTag, data=imp)
  tagDist = tagCounts / sum(tagCounts)
  tagDist
  DamslActTag
           ^2          ^g          ^h          ^q           %           +          aa ...
  0.014414414 0.000000000 0.227027027 0.070270270 0.021621622 0.000000000 0.001801802 ...
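
The same conditional distribution can be computed directly in Python. The toy rows below are invented stand-ins for the (RootNode, DamslActTag) pairs in swda-actags-and-rootlabels.csv:

```python
from collections import Counter

# Hypothetical (RootNode, DamslActTag) pairs for illustration:
rows = [('S-IMP', 'ad'), ('S-IMP', 'ad'), ('S-IMP', '^h'), ('S', 'sd')]

def tag_given_label(rows, label):
    """P(tag | RootNode == label), as a dict of relative frequencies."""
    counts = Counter(tag for root, tag in rows if root == label)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()}

print(tag_given_label(rows, 'S-IMP'))  # ad: 2/3, ^h: 1/3
```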

This distribution is very hard to plot, because there are so many values for the tags. The visualizations become more readable if we restrict attention to just the values that are above the 25th percentile, first filtering off the 0-valued elements:

  ## Remove the 0-valued elements:
  tagDist = tagDist[tagDist > 0]
  ## Sort the distribution from smallest to largest:
  sortedDist = sort(tagDist)
  ## Threshold size, as an integer, so that it can be an index:
  thresholdIndex = round(length(sortedDist) * 0.25, 0)
  ## The threshold value:
  threshold = sortedDist[thresholdIndex]
  ## Filter the distribution based on that index:
  tagDist = tagDist[tagDist >= threshold]
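
The R filtering steps translate straightforwardly into Python. The function below mirrors the snippet above (drop zeros, then keep values at or above the 25th percentile of what remains); the example distribution is made up for illustration:

```python
def filter_distribution(dist, percentile=0.25):
    """Drop 0-valued entries, then keep only values at or above the given
    percentile of the remaining values."""
    dist = {k: v for k, v in dist.items() if v > 0}
    sorted_vals = sorted(dist.values())
    # 1-based index of the threshold value, as in the R code:
    idx = max(round(len(sorted_vals) * percentile), 1)
    threshold = sorted_vals[idx - 1]
    return {k: v for k, v in dist.items() if v >= threshold}

tag_dist = {'sd': 0.30, 'ad': 0.20, '^h': 0.15, 'aa': 0.12, '%': 0.10,
            'sv': 0.06, 'b': 0.04, 'qy': 0.03, 'fc': 0.0}
filtered = filter_distribution(tag_dist)  # drops 'fc' (zero) and 'qy' (below threshold)
```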

Figure IMP plots this distribution. I've also added the entropy of the entire (un-filtered) distribution, as a summary measure of its diversity. This value was calculated with GeneralizedResponseEntropy() using the original counts vector.

  e = GeneralizedResponseEntropy(tagCounts)
  title = paste('P(act_tag|S-IMP) -- entropy:', round(e, 2))
  barplot(tagDist, main=title, las=3, ylim=c(0,1))
figures/swda/dist-simp.png
Figure IMP
The distribution of DAMSL act tags for trees rooted at S-IMP. The barplot depicts only non-0 values above the 25th percentile. The entropy calculation is for the entire distribution.
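
GeneralizedResponseEntropy() is defined in the accompanying swda_functions.R. Assuming it computes ordinary Shannon entropy (in bits) over a vector of counts, a Python equivalent would be:

```python
from math import log2

def shannon_entropy(counts):
    """Shannon entropy (bits) of the distribution induced by a counts vector."""
    total = sum(counts)
    probs = [c / total for c in counts if c > 0]
    return sum(-p * log2(p) for p in probs)

print(shannon_entropy([25, 25, 25, 25]))  # 2.0 (uniform over four tags)
print(shannon_entropy([100, 0, 0, 0]))    # 0.0 (all mass on one tag)
```

Higher values indicate a more diverse mapping from the clause type to act tags; 0 means the clause type fully determines the tag.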

The function TagGivenLabel() inside swda_functions.R generalizes this code: the user supplies a regular expression over root labels and the function handles the rest. Thus, generating the above plot is as simple as this:

  TagGivenLabel(df, '^S-IMP$')

Because the second argument is a regular expression, one can also pool different root-labels:

  ## Pools all nodes that begin with SQ (but perhaps have something following):
  TagGivenLabel(df, '^SQ')

The speaker perspective: P(clause-type|tag)

The speaker perspective turns around the hearer distribution P(tag|clause-type). That is, we want to study P(clause-type|tag). The idea is that the speaker decides what pragmatic move to make and then, given that decision, has a choice about which clause-type to use to express it.

The process is exactly the same as the one we used above, except here we begin by picking a tag and then look at the distribution of root-nodes for that tag. In my illustration, I choose 'ad' (action directive).

  ad = subset(df, DamslActTag=='ad')
  labelCounts = xtabs(~ RootNode, data=ad)
  labelDist = labelCounts / sum(labelCounts)
  labelCounts
  RootNode
      ADJP ADJP-PRD ADJP-UNF     ADVP ADVP-LOC ADVP-MNR ADVP-TMP ...
         0        0        0        0        0        0        0 ...

I filter the distribution in the same way, again to improve readability:

  ## Remove the 0-valued elements:
  labelDist = labelDist[labelDist > 0]
  ## Sort the distribution from smallest to largest:
  sortedDist = sort(labelDist)
  ## Threshold size, as an integer, so that it can be an index:
  thresholdIndex = round(length(sortedDist) * 0.25, 0)
  ## The threshold value:
  threshold = sortedDist[thresholdIndex]
  ## Filter the distribution based on that index:
  labelDist = labelDist[labelDist >= threshold]

Figure AD plots the filtered distribution and also provides the entropy of the entire distribution:

  e = GeneralizedResponseEntropy(labelCounts)
  title = paste('P(root-node|ad) -- entropy:', round(e, 2))
  barplot(labelDist, main=title, las=3, ylim=c(0,1))
figures/swda/dist-ad.png
Figure AD
The distribution of root nodes for utterances labeled 'ad' (action directive). The barplot depicts only non-0 values above the 25th percentile. The entropy calculation is for the entire distribution.

The function LabelGivenTag() inside swda_functions.R generalizes this code: the user supplies a regular expression over DAMSL act tags and the function handles the rest. Thus, generating the above plot is as simple as this:

  LabelGivenTag(df, '^ad$')

As with TagGivenLabel, the regular expression interface allows you to pool groups of tags.

Putting the pieces together

The file swda_functions.R also contains functions LabelGivenTagPlots() and TagGivenLabelPlots(). The first argument to each is the full dataframe derived from swda-actags-and-rootlabels.csv (our df above) and the second is an integer n (default: 10). The functions then display information for the n most frequent root nodes (DAMSL act tags). Figure ROOTDISTS and Figure TAGDISTS give the output of these functions for n=10.

figures/swda/dist-labels.png
Figure ROOTDISTS
The DAMSL-act-tag distributions for the 10 most frequent root-node labels.
figures/swda/dist-tags.png
Figure TAGDISTS
The root-node distributions for the 10 most frequent DAMSL act tags.

Exercises

DAMSL The DAMSL act simplifications for questions are pretty aggressive. I am concerned in particular about the fact that they reduce qy^g to qy. What impact might this have on our experiment? You can answer this question by relying on the samples from the Coders' Manual, or you can write code to study these tags more systematically. (Note: inspect_trees() will return a sample for any tree-to-boolean function you write.)

RHETORICAL Using the corpus or your own intuitions, identify 3-5 morphosyntactic features that could be used to distinguish rhetorical questions from regular questions. (These could be heuristics; I think you won't find any deterministic features for this.)

MOD Our theory of the form of polar interrogatives is embodied in is_polar_interrogative(). Modify or completely rewrite that function and then evaluate it using inspect_trees(). Summarize how well your function does, perhaps contrasting its strengths and weaknesses with those of is_polar_interrogative.

ELLIPTICAL Elliptical forms like You going? are misdiagnosed because is_polar_interrogative() returns False for them. We might be able to catch them by looking for VP nodes that lack an aux daughter but do have a 'VBG' (gerund) or 'VBN' (past participle) daughter. Write a function that seeks to identify such configurations. (Feel free to improve on the technique as well.)

TAGS It seems that tag questions are structures with SQ roots that have SQ daughter nodes. Is this characterization correct? Write a function that identifies such structures and use inspect_trees() to assess it.