The indirect question–answer pairs corpus

Overview
Experimental data
Question–answer relationships
In-class group-work questions
Exercises

Overview

This section introduces an experimental dataset involving 215 indirect question–answer pairs with annotation distributions attached to them. The data include a (semi-)random development/evaluation split, with 150 examples in the development set and 65 in the evaluation set.

I don't look at all at the evaluation set here. Some of you might want to do your own experiments with the data. If that happens, we can have a mini bake-off, to see who does better on the evaluation data. Thus, if you pursue your own experiments, stick to the development set for now.

Code and data:

iqap-data.csv.zip: the corpus file
iqap.py: Python classes for working with iqap-data.csv
iqap_functions.R: Examples of how to work with iqap-data.csv in R
iqap_functions.py: Examples of how to work with iqap-data.csv using iqap.py

Experimental data

Table COLUMNS gives an overview of the column values in iqap-data.csv.

	Column	Description
1	Item	Item number
2	Classification	CNN, Yahoo, Hirschberg, Switchboard
3	Source	Source file where applicable, else repeats Classification
4	Question	question text
5	Answer	answer text
6	Prefix	'yes', 'no' or blank
7	definite-yes	number of annotators who chose this category (0..30)
8	probable-yes	number of annotators who chose this category (0..30)
9	definite-no	number of annotators who chose this category (0..30)
10	probable-no	number of annotators who chose this category (0..30)
11	DevEval	DEVELOPMENT or EVALUATION
12	QuestionParse	Stanford parser parse, with hand corrections, of Question
13	AnswerParse	Stanford parser parse, with hand corrections, of Answer

Table COLUMNS

Column values for the experiment file iqap-data.csv.

The file can be read into R:

iqap = read.csv('iqap-data.csv')

Here, we limit attention to the development set:

iqap = subset(iqap, DevEval=='DEVELOPMENT')

Here are the first two lines of the data:

head(iqap, 2)
Item Classification Source Question Answer Prefix definite.yes probable.yes definite.no probable.no DevEval QuestionParse AnswerParse 1 146 CNN LarryKingLive/0206-07-lkl.00.txt Did he do a good job? He did a great job. 30 0 0 0 DEVELOPMENT (SQ (VBD Did) (NP (PRP he)) (VP (VB do) (NP (DT a) (JJ-CONTRAST good) (NN job))) (. ?)) (S (NP (PRP He)) (VP (VBD did) (NP (DT a) (JJ-CONTRAST great) (NN job))) (. .)) 2 122 CNN LouDobbsTonight/0804-08-ldt.01.txt Do you think that's a good idea? It's a terrible idea. 0 0 29 1 DEVELOPMENT (SQ (VBP Do) (NP (PRP you)) (VP (VB think) (SBAR (S (NP (DT that)) (VP (VBZ 's) (NP (DT a) (JJ-CONTRAST good) (NN idea)))))) (. ?)) (S (NP (PRP It)) (VP (VBZ 's) (NP (DT a) (JJ-CONTRAST terrible) (NN idea))) (. .))

Sources

The experimental data come from a variety of sources. Table SOURCES gives a description.

Source name	Description	Examples
CNN	From CNN show transcripts	40
Hirschberg	From Hirschberg 1985, many of which come from corpora	26
Switchboard	From the Switchboard Dialog Act Corpus	26
Yahoo	From highly restrictive regex searches over the Yahoo Answers corpus	58

Table SOURCES

Data sources. The counts come from xtabs(~ Classification, data=iqap). They are for the development set only. The evaluation set numbers are proportionally about the same.

The dev/eval split is random except that the proportion of items from each of these categories is the same in the two sections (and thus the counts are CNN: 17, Hirschberg: 12, Switchboard: 11, Yahoo: 25). I did this because the data were collected in very different ways and pose different kinds of challenge, so it seemed worthwhile to ensure both development and evaluation contain similar proportions from each.

Corrections/Regularization

I went through the raw data and made adjustments where I thought that would deliver better results in the annotation phase and more focused experimental work:

Resolved anaphoric dependencies (esp. CNN).
Simplified overall structures (esp. CNN).
Removed information about real-world events that could create biases (distract people from evaluating the answerer's intentions).
Removed "Yes" and "No" prefixes on answers. More on this below.

Annotations

The data were annotated on Amazon's Mechanical Turk. Each IQAP received annotations from 30 separate Turkers. Here is the prompt:

Indirect Answers to Yes/No Questions

In the following dialogue, speaker A asks a simple Yes/No question, but speaker B answers with something more indirect and complicated:

${Question}

${Answer}

Which of the following best captures what speaker B meant here?

B definitely meant to convey "Yes".

B probably meant to convey "Yes".

B definitely meant to convey "No".

B probably meant to convey "No".

Cautionary note: in general, there is no unique right answer. However, a few of our HITs do have obvious right answers, which we have inserted to help ensure that we approve only careful work.

Any comments would be very much appreciated:

This prompt differs from that of de Marneffe et al. 2010 in not containing an "Uncertain" option. That option turned out to be rarely chosen and difficult to interpret because it meant dealing with two kinds of uncertainty:

Where "Uncertain" was chosen, it seemed to mean something like "I am certain enough of my uncertainty to make this choice".
Since we solicited annotation distributions over the categories, there was a second kind of uncertainty as well: the degree to which the annotators, as a group, agreed on the right category.

The current prompt removes the first kind of uncertainty. Where Turkers were uncertain of the right response, they were forced to make a choice. We can expect uncertainty to arise from the randomness of these forced choices.

Seventy-six Turkers participated in the annotating. Figure WORKERS gives some additional information about what the annotators did.

Figure WORKERS

Information on the MTurk annotators. 15 workers did 10 items or fewer. 15 did at least 200, and 6 did all 215.

Annotation distributions

The level of agreement about the basic polarity (yes vs. no) of the answerer's intentions was very high: 70/150 item showed complete agreement, and 111/150 items showed agreement by at least 28/30 Turkers:

iqap$yes = iqap$definite.yes + iqap$probable.yes
iqap$no = iqap$definite.no + iqap$probable.no
xtabs(~ yes, data=iqap)
yes 0 1 2 3 4 5 6 8 9 10 13 16 17 19 20 21 24 25 27 28 29 30 30 16 10 4 8 1 4 6 1 2 2 1 1 1 1 1 1 2 3 5 10 40

The entropy of the response distribution is another way of getting a grip on how much variation there was in the annotations overall. The following R function calculates this for an item row (you can paste it into your R window or else read it in from a file):

## Calculate response entropy for an IQAP item.
## Argument:
## pf: the row for an item from iqap-data.csv
## Value:
## e: the entropy of the response distribution for pf
ResponseEntropy = function(pf){
counts = c(pf$definite.yes, pf$probable.yes, pf$definite.no, pf$probable.no)
dist = counts/sum(counts)
## Remove 0 counts to avoid an Inf value from log2.
dist = dist[dist > 0]
## The entropy calculation: -sum_x (p(x) * log2(p(x)))
e = -sum(log2(dist) * dist)
return(e)
}

Next, we add a column giving the response entropy for each item:

library(plyr)
## ddply calls ResponseEntropy for each row:
iqapEntropy = ddply(iqap, .variables=c('Item'), .fun=ResponseEntropy)
colnames(iqapEntropy)[2] = 'entropy'
## Sort the entropy frame and the original by Item to ensure a proper alignment:
iqapEntropy = iqapEntropy[order(iqapEntropy$Item), ]
iqap = iqap[order(iqap$Item), ]
## Add a column of entropy values to the main iqap frame:
iqap$entropy = iqapEntropy$entropy

Figure ENTROPY shows the distribution of response entropy values, generated by the following code:

## The binning of hist() seemed misleading,
## so I use a barplot and bin by rounding to one place.
adjustedValues = xtabs(~ round(iqap$entropy, 1))
barplot(adjustedValues, axes=F, main='Response entropy', ylab='Item count', xlab='Entropy')
axis(2, at=adjustedValues, las=1)

Figure ENTROPY

The entropy of the response distributions. 0 entropy means complete agreement; the higher the entropy, the more mixed the responses.

There are 6 items in the development set with perfect agreement in terms of responses (entropy = 0):

  Item                           Question                  Answer   Response
   118 Is the new judge really this good?            He is great.   definite-yes
   146              Did he do a good job?     He did a great job.   definite-yes
  2011              Do you like that one?              I love it.   definite-yes
  3064               Is a shark a mammal?    A shark is a mammal.   definite-yes
  3080        Is Cassandra a pretty name? It is a beautiful name.   definite-yes

And here are the items with the highest entropy (least agreement) in the development set:


Item                               Question                        Answer definite.yes probable.yes definite.no probable.no  entropy
4012       Have you read the third chapter?            I read the fourth.            3           16           2           9 1.597417
4034                          Are you sick?           I've got allergies.            3           10           2          15 1.620973
4031          Do you have italian dressing?          We have vinaigrette.            1            7           9          13 1.697340
2016 Did you buy a program to handle menus? It came with a menus program.            2            6          11          11 1.786315
3068                 Is zero a real number?          It's a prime number.            9           12           2           7 1.800212
2003     Do you want to go ahead and start?  I was hoping that you would.            4            4          10          12 1.832263

Prefixes

The Prefix column in the data contains either 'yes', 'no', or the empty string. The 'yes' and 'no' cases are those where the answer originally had this word as a prefix but I removed it before the annotation process. One might worry about this, or at least consider it when analyzing the data. Speakers might have chosen this item because they feared that the bare answer would create too much uncertainty, so removing it could artificially increase uncertainty. The present section explores this. My overall conclusion is that removing the prefix did not increase uncertainty.

For easy identification, a column for the prefixed items:

iqap$prefixed = iqap$Prefix != ''

As discussed below, all but one of the prefixed cases are from the Yahoo section of the corpus.

Relationship to the majority choice category

Let's first check the relationship between the prefix value and the majority choice of the Turkers:

## Add a column for the majority-choice category:
iqap$majority = ifelse(iqap$definite.yes > 15, 'definite-yes', ifelse(iqap$probable.yes > 15, 'probable-yes', ifelse(iqap$definite.no > 15, 'definite-no', ifelse(iqap$probable.no > 15, 'probable-no', 'none'))))
## Compare the prefixes with the majority choices:
xtabs(~ majority + Prefix, data=subset(iqap, prefixed==TRUE), drop.unused.levels=TRUE)
Prefix majority no yes definite-no 14 0 definite-yes 0 8 none 1 0 probable-no 5 0 probable-yes 1 1

This looks really good: just one example has mixed polarity between the majority choice and the author's original prefix, and only one prefixed-example lacked a majority choice.

exercise MAJ, exercise CLASS

Relationship to category choices

If removing the prefixes increased uncertainty, then we would expect to see increased use of the 'probable-yes' and 'probable-no' categories over their 'definite' counterparts. Let's see if that is the case

Figure PROB, produced by the following code, is a boxplot of the values. The picture suggests that, if anything, there is more use of the 'probable' categories when no prefix was present in the original, though the difference is small and it seems unlikely that it traces to this property of the examples. (It is more likely that it traces to something about the Yahoo portion of the corpus.)

iqap$prob = iqap$probable.yes + iqap$probable.no
par(mar=c(4,4,2,4))
b = boxplot(iqap$prob ~ iqap$prefixed, ylab='Probable annotations per item', main="Use of the 'probable' categories for the prefix and no-prefix subsets", axes=FALSE)
axis(1, at=c(1,2), labels=c('No prefix', 'Prefix'))
axis(2, at=round(b$stats[, 1], 2), las=1)
axis(4, at=round(b$stats[, 2], 2), las=1)

Figure PROB

The two y-axes pick out these values for each subset (from bottom to top): (i) the smallest value for the subset, (ii) the 25th percentile, (iii) the median (50th percentile), (iv) the 75th percentile (v) the largest value.

exercise PROBEX

Relationship to overall response entropy

We can check to see whether having a prefix correlates with response entropy:

iqap$prefixed = iqap$Prefix != ''
par(mar=c(4,4,2,4))
b = boxplot(iqap$entropy ~ iqap$prefixed, ylab='Entropy', main='Entropy distributions for the prefix and no-prefix subsets', axes=FALSE)
axis(1, at=c(1,2), labels=c('No prefix', 'Prefix'))
axis(2, at=round(b$stats[, 1], 2), las=1)
axis(4, at=round(b$stats[, 2], 2), las=1)

Figure ENTDIST shows the output of this boxplot command. As before, the entropy for the non-prefix population is actually higher than it is for the prefixed one, which runs counter to our initial hypothesis:

Figure ENTDIST

Entropy value distributions for the prefix and no-prefix subsets

I conclude that, though the prefixing might be an important feature (especially internal to the Yahoo section of the corpus), it was not definitive in terms of the nature of the responses people gave.

exercise ENT

Additional structure

I parsed the data with the Stanford Parser and hand-corrected the results. If you decide to do experiments that depend on the trees, then you might want to extend and make use of the following classes:

iqap.py

I also hand-annotated the questions and answers for something approximating the 'contrast' predicate. For example, in the simple dialogue Was the movie good? It was great., the words good and great would be highlighted. These annotations take the form of one or more nodes suffixed with '-CONTRAST'. The Python classes include functionality for grabbing these subtrees and working with their leaf nodes. The iqap.CorpusReader method view_contrast_preds() gives you a sense for what kind of information is available here. Here's a sample of the output:

======================================================================
Did he do a good job?
[Tree('JJ-CONTRAST', ['good'])]
He did a great job.
[Tree('JJ-CONTRAST', ['great'])]
======================================================================
Is it a comedy?
[Tree('NN-CONTRAST', ['comedy'])]
I think it's a black comedy.
[Tree('JJ-CONTRAST', ['black']), Tree('NN-CONTRAST', ['comedy'])]
======================================================================
Do you belong to a gun club?
[Tree('PRP-CONTRAST', ['you']), Tree('VB-CONTRAST', ['belong'])]
My husband belonged to one awhile back.
[Tree('NP-CONTRAST', [Tree('PRP$', ['My']), Tree('NN', ['husband'])]), Tree('VBD-CONTRAST', ['belonged'])]
======================================================================
Now, do you work outside of the home?
[Tree('VB-CONTRAST', ['work'])]
I have just retired.
[Tree('VBN-CONTRAST', ['retired'])]

exercise INTJ, exercise NEG, exercise PYENT

Question–answer relationships

Let's look at the nature of the relationships between question and answer that the data include, on the look-out especially for implicatures.

I frame the discussion in terms of which predicate is stronger, the question radical or the answer. The notion of strength here is the flexible one that Hirschberg 1985 identifies — not just entailment, but rather something closer to communicative force.

The answer is stronger than the question radical

Item                           Question                                Answer  definite.yes probable.yes definite.no probable.no
  92           Is the weather nice now?                    Beautiful weather.            27            3           0           0
 146              Did he do a good job?                   He did a great job.            30            0           0           0
1009                    Is it a comedy?          I think it's a black comedy.            14           16           0           0
3015   Is Cadillac an American company?    It's a division of General Motors.             7           22           0           1
2013  Are there poisonous snakes there?        We have a lot of cotton mouth.            14           14           0           2
3000                  Is chess a sport?            It is a sport of the mind.             9           19           1           1

Entailment: the "yes" is not an implicature, but rather a contextual entailment. That is, since one cannot consistently agree that the job was great but deny that it was good, we needn't reason in terms of the maxims in order to arrive at the "yes" answer for a case like that.

Implicatures: the answerer conveys that there is something inappropriate about simply affirming the question predicate.

In examples like 92, this is likely because he wants to avoid an upper-bounding implicature (here, that the weather is merely nice). The responses are basically categorically 'definite-yes' for these cases
In examples like 1009 and 3000, the answer seems to convey that there is something inexact or misleading about the general term, i.e., that only the specific one is really accurate. This can even give rise to the sense that the sub-kind is not a normal instance of the general kind. These answers seem to generate increased uncertainty.

The answer is stronger than the negation of the question radical

Item                           Question                                Answer  definite.yes probable.yes definite.no probable.no
 122   Do you think that's a good idea?                 It's a terrible idea.             0            0          29           1
3067            Is Santa an only child?          He has a brother named Fred.             1            0          24           5

Entailment: the "no" is not an implicature, but rather a contextual entailment arising from the lexical semantics of the predicates involved and a very general notion of speaker commitment.

Implicatures:

In examples like 122, merely saying "no" might convey that "merely no" is true, i.e., that the idea might not be good, but at least it is okay/satisfactory. The answers choses a stronger predicate to block this inference.
For the others, the user gives additional information in anticipation of a more specific question under discussion.

The question radical is stronger than the answer

Item                                        Question                             Answer  definite.yes probable.yes definite.no probable.no
2014              Did you get any information on it?        I sent off for stuff on it.             0            4          23           3
3081                       Is it a sin to get drunk?    It is a sin to drink to excess.            13           14           1           2
4000 Did you manage to read that section I gave you?  I read the first couple of pages.             0            6           6          18
4016                               Do you need this?                         I want it.             2           15           0          13

Implicatures: here, "no" is often an implicature. Informally, we can reason as follows:

Some pragmatic pressure prevented the answerer from affirming the question predicate.
In many situations of direct questioning, we assume that the answerer is forthcoming and in a position to provide a complete and accurate resolution of the issue.
However, the answer provided is entailed by the question radical — it is semantically a partial resolution of the issue.
By 2 and the cooperative principle (quantity, relevance), the answerer will not provide a partial resolution.
However, if we assume the upper-bounding implicature that the complete positive resolution is pragmatically inappropriate, then we can continue to maintain that the answer was complete.
By the expert assumption 2, we conclude that the upper-bounding implicature is actually a negation of the stronger values.

The question radical is stronger than the negation of the answer

Item                           Question                                Answer  definite.yes probable.yes definite.no probable.no
4024   Have you mailed that letter yet?               I haven't proofread it.             0            0          25           5
4007               Did you buy a house?     We haven't gotten a mortgage yet.             0            2          16          12

Entailment: here, "no" is arguably a contextual entailment. The answer conveys that a precondition for the truth of the question radical has not been met.

Implicature: the over-answers can help to avoid inferences on the part of the questioner about how close or how far reality is from the question being true.

The question radical and the answer seem to be independent

Item                               Question                                Answer  definite.yes probable.yes definite.no probable.no
2020           Do you know how to spell it?                   It starts with a K.             0            6           4          20
3048    Is America an imperialistic nation?      It's a representative democracy.             1            2          10          17
4001  Have you made fondue in this pot yet?                 Not chocolate fondue.             1           19           1           9
4006                   Do you speak Ladino?                      I speak Spanish.             1            9           4          16
4012       Have you read the third chapter?                    I read the fourth.             3           16           2           9
4032                     Do you have paste?                We have rubber cement.             0            8           4          18

Implicatures: the answer is typically rich in implicature concerning "yes" or "no", because of the pressures of relevance. Determining which enrichment is appropriate can be challenging.

exercise INDY, exercise IMP

In-class group-work questions

In class on July 12, the class broke up into groups and formulated questions and hunches about the IQAP data, focussing on patterns that might be relevant for building predictive models of the response distributions.

The next few subsections give my (partial) responses to these questions. I've used a mix of Python and R code, playing to the strengths of each and also aiming to further illustrate how the classes in iqap.py work.

Where do the prefixes appear?

As discussed above, about 20% of the examples originally had a 'yes' or 'no' prefix on the answer, which I removed before annotation. The analysis of this section suggests that this did not have a systematic effect on the annotation choices people made.

In exploring the prefixes, I neglected to address a basic question about them: are they evenly distributed across corpus types? The answer is that they are not; all but one of them comes from the Yahoo part of the corpus, as the following analyses show.

This imbalance is likely an artifact of the way the data were collected. It suggests that we might do well to re-mine the other sources for additional prefixed-answers, which could then be included in an expanded data release.

## Add a column with TRUE where the example has a prefix, else FALSE:
iqap$prefixed = iqap$Prefix != ''
## Cross-tabulate the Classification with the prefixed column:
xtabs(~ Classification + prefixed, data=iqap)

Python:

from collections import defaultdict
from iqap import *
def classification_by_prefixed():
"""Shows how the prefixes are distributed across the source types."""
d = defaultdict(int)
iqap = IqapReader('iqap-data.csv')
for item in iqap.dev_set():
prefixed = False
if item.Prefix:
prefixed = True
d[(item.Classification, prefixed)] += 1
# Printing:
for key, val in d.items():
print key, val

Is 'definite-yes' more likely if the answer is stronger than the question?

This one stumped me! I don't see a way to test this comprehensively without classifying the examples by hand or relying on the models developed on this page. All I can say now is that the definite-yes's clearly dominate in this small sample!

Is 'definite-yes' more likely if the question and answer are syntactically similar?

The group that proposed this pointed out that there are a number of ways it could be addressed. Here, I treat it as a simple problem of lexical overlap. A more sophisticated approach might compare the tree structures (including or excluding their node labels and/or lexical items). For a survey of relevant algorithms, see Brille 2005.

I did this as a mix of Python code and R code: Python to generate the CSV file, and R to plot the contents of that CSV file. This seemed like the path of least resistance: NLTK/Python is excellent for working with language data, and R is excellent for visualization.

def lexical_overlap(item):
"""
Compare the lexical items in the question and answer to determine
their degree of overlap. The score is the cardinality of the
intersection divided by the cardinality of the union.
"""
que_words = set(item.question_words(wn_lemmatize=True))
ans_words = set(item.answer_words(wn_lemmatize=True))
int_card = len(que_words & ans_words)
union_card = len(que_words | ans_words)
return int_card / float(union_card)
def lexical_overlap_by_definite():
"""
Creates a CSV file named 'iqap-lexical-overlap-by-definite.csv'
with the format
Definite, LexicalOverlap
where Definite is the number of 'definite' annotations for the
item and LexicalOverlap is the question-answer intersection
divided by the question-answer union, as given by
lexical_overlap().
"""
pairs = []
iqap = IqapReader('iqap-data.csv')
for item in iqap.dev_set():
definite = item.definite_yes + item.definite_no
pairs.append((definite, lexical_overlap(item)))
csvwriter = csv.writer(open('iqap-lexical-overlap-by-definite.csv', 'w'))
csvwriter.writerow(['Definite', 'LexicalOverlap'])
csvwriter.writerows(pairs)

I then plotted this quickly in R; the results don't look promising, but perhaps this just indicates that the relevant notion of similarity is less lexical than lexical_overlap() construes it.

d = read.csv('iqap-lexical-overlap-by-definite.csv')
plot(d$Definite, d$LexicalOverlap)

How do specific word choices in the answer affect response distributions?

There were a few variants of this question in class, involving attitude predicates, modals, hedges, additive particles, and exclusive particles.

The following code seeks to provide general functionality for addressing these questions using regular expressions. More sophisticated approaches would use the tree structure as well.

The R code produces a boxplot:

## Add a column grouping the 'probable' responses:
iqap$prob = iqap$probable.yes + iqap$probable.no
## Tweak this regex as needed for particular versions of the question:
regex = '\\b(thinks?|think|thought|can|could|shall|should|will|would|may|might|must)\\b'
## Add the column for this regex:
iqap$regex = grepl(regex, iqap$Answer)
## Quick boxplot relating the two classes:
boxplot(prob ~ regex, data=iqap, xlab='Regex match', ylab='Probable annotations (by item)')

I don't want to deal with creating a boxplot in Python, so I'll take a slightly more abstract approach for the Python version, by relating regular expression matches to majority responses:

import re
from collections import defaultdict
from iqap import *
def regex_by_majority_response(regex):
"""
Relates matches for the regular expression regex in the answer to
majority response categories.
"""
d = defaultdict(lambda : defaultdict(int))
iqap = IqapReader('iqap-data.csv')
for item in iqap.dev_set():
# Regex search:
match = False
if regex.search(item.Answer):
match = True
# Majority label:
maj_label = item.majority_label()
d[maj_label][match] += 1
# Print (category, percentage-match) pairs:
print 'Category', 'Percentage-matching'
for label, match_dict in d.items():
print label, match_dict[True] / float(sum(match_dict.values()))
regex = re.compile(r'\b(thinks?|think|thought|can|could|shall|should|will|would|may|might|must)\b')
regex_by_majority_response(regex)

What effect do tense mismatches have on annotation choices?

This question is on sound theoretical footing — Hirschberg (1985 §5) discusses similar patterns. However, I don't see how to test it computationally. I suppose we need a morphological parser that can isolate tense marking.

How should we group the response categories?

The group that posed this question addressed it by studying how the entropy of the response distributions changes as we group the categories in these two ways.

I did this experiment in R, by generalizing the code above for calculating response entropy. The code looks rather involved, but it just calculates entropy for three kinds of distribution, adds the values to the iqap frame, and creates box-plots of the results for initial study.

## General function for calculating the entropy of the counts vector.
GeneralizedResponseEntropy = function(counts){
dist = counts/sum(counts)
## Remove 0 counts to avoid an Inf value from log2.
dist = dist[dist > 0]
## The entropy calculation: sum_x (p(x) * log2(p(x)))
e = -sum(log2(dist) * dist)
return(e)
}
## Group the responses by yes/no and return the entropy of that distribution.
PolarityResponseEntropy = function(pf){
counts = c(
pf$definite.yes + pf$probable.yes,
pf$definite.no + pf$probable.no)
return(GeneralizedResponseEntropy(counts))
}
## Group the responses into definite.yes, probable, definite.no:
TriResponseEntropy = function(pf){
counts = c(
pf$definite.yes,
pf$probable.yes + pf$probable.no,
pf$definite.no)
return(GeneralizedResponseEntropy(counts))
}
## Group the responses by definite/probable and return the entropy of that distribution.
DegreeResponseEntropy = function(pf){
counts = c(
pf$definite.yes + pf$definite.no,
pf$probable.yes + pf$probable.no)
return(GeneralizedResponseEntropy(counts))
}
## Calculate all entropy values.
Entropys = function(pf){
return(c(PolarityResponseEntropy(pf), DegreeResponseEntropy(pf), TriResponseEntropy(pf)))
}
library(plyr)
## Add the new entropy values to iqap.
EntropyComparisons = function(iqap){
## Add the entropy values as two new columns for iqap.
## ddply calls ResponseEntropy for each row:
iqapEntropy = ddply(iqap, .variables=c('Item'), .fun=Entropys)
colnames(iqapEntropy)[2] = 'PolarityEntropy'
colnames(iqapEntropy)[3] = 'DegreeEntropy'
colnames(iqapEntropy)[4] = 'TriEntropy'
## Sort the entropy frame and the original by Item to ensure a proper alignment:
iqapEntropy = iqapEntropy[order(iqapEntropy$Item), ]
iqap = iqap[order(iqap$Item), ]
## Add columns for the entropy values to the main iqap frame:
iqap$PolarityEntropy = iqapEntropy$PolarityEntropy
iqap$DegreeEntropy = iqapEntropy$DegreeEntropy
iqap$TriEntropy = iqapEntropy$TriEntropy
## Return the augmented frame:
return(iqap)
}
iqap = EntropyComparisons(iqap)
boxplot(cbind(iqap$PolarityEntropy, iqap$DegreeEntropy, iqap$TriEntropy), names=c('Yes/No', 'Def/Prob', 'Yes/Probable/No'))

Figure ENTCMP provides the boxplot, which we might refer to later on when we think about the experimental results:

Figure ENTCMP

Comparing the response distribution entropy for three views on the categories.

Exercises

MAJ In discussing the relationship of prefixing to the majority choice category, I extended the iqap table with a column of values indicating which, if any, is the majority choice. Calculate the majority distribution using xtabs and plot it using barplot. Are there any patterns or imbalances that we should be aware of?

CLASS What is the relationship between the majority choice category and Classification values. Use xtabs and barplot to study the relationships. What experimentally relevant information does this analysis provide?

PROBEX The section on relating prefixing to categories choices suggests that there is a difference between the prefixed and unprefixed classes with respect to use of the 'probable' categories. Is this difference reliable? To test this, use hist to check on the distribution of probable values for the prefixed and unprefixed subsets. If the results look normally distributed, then use t.test to probe what the contrast is like. If the results seem not to be normally distributed, use wilcox.test.

ENT The section on relating prefixing to response entropy suggests that there is a difference between the prefixed and unprefixed classes with respect to the entropy of the response distributions. Is this difference reliable? As in the previous problem, first check the distribution of values using hist and let this guide the statistical test you use.

INTJ This question relies on the iqap.py classes. The following function identifies whether an item's answer is interjection initial.

def interjection_initial_answer(item):
"""Return True if item's answer has an initial INTJ subtree, else False."""
tree = item.AnswerParse
if tree[0].node.startswith('INTJ'):
return True
else:
return False

Use this function to study the relationship between interjections and response distributions. Are there any patterns that we might exploit in predicting response distributions?

The following code gets this started by providing the relevant looping structure:

def interjection_initial_answer_effects():
"""Check the relationship between INTJ-initial answers and responses."""
corpus = IqapReader('iqap-data.csv')
for item in corpus.dev_set():
# Use interjection_initial_answer in here.
# To get item's response distribution,
# use item.response_dist() [probabilities] or
# item.response_counts() [raw counts]

NEG Negation is likely to be important. Write a function, akin to interjection_initial_answer above, that determines whether the contrast predicate in the answer is negated. (The Item method answer_contrast_pred_pos() grabs all the (word, tag) tuples from the -CONTRAST predicates.)

PYENT Extend the iqap.Item class with a function that calculates the entropy of the response distribution. The R code here is a useful model, I think, but you can find others on the net.

INDY The section The question radical and the answer seem to be independent identifies a reliable implicature of such answers. Provide a general Gricean derivation of this implicature.

IMP Some of the examples in The question radical and the answer seem to be independent have majority 'yes'-type answers and other majority 'no'-type answers. Pick one of each kind and state a set of contextual assumptions (one might suffice) that, if included, would likely reverse the annotators' judgments about the answer's intentions.

	B definitely meant to convey "Yes".
	B probably meant to convey "Yes".
	B definitely meant to convey "No".
	B probably meant to convey "No".