Experiments: Enriching indirect answers

  1. Overview
  2. Experiment 1: A deterministic strategy
    1. The one-word cases
      1. WordNet enrichment
      2. Review enrichment
      3. Combining resources
    2. Multi-word cases
    3. Discussion
  3. Experiment 2: A probabilistic (MaxEnt) model
    1. Background
    2. Features
      1. Subset and superset features
      2. Negation feature
      3. Review score feature
      4. Review score inference
      5. WordNet relations
      6. WordNet inferences
    3. The experiment
    4. Results
    5. Discussion
  4. Exercises

Overview

In this section, we return to the IQAP corpus, bringing together the information we extracted from WordNet and the IMDB reviews to try to enrich the IQAP answers. I begin with a deterministic enrichment strategy and then move to a probabilistic model.

Throughout this section, I work with a binary version of the ratings: 'definite-yes' and 'probable-yes' collapse to 'yes', and 'definite-no' and 'probable-no' collapse to 'no'. This makes the experiments conceptually simpler, and it side-steps my current lack of understanding of how the 'probable' categories were used.

In iqap.Item, the methods response_counts, response_dist, and max_label all have an optional keyword argument make_binary, which performs this category collapse if set to True.
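
For example, in Python (a quick sketch, assuming IqapReader is defined in iqap.py and that iqap-data.csv is in the working directory):

  from iqap import IqapReader

  corpus = IqapReader('iqap-data.csv')
  for item in corpus.dev_set():
      # Four-way counts, binary counts, and the binary majority label:
      print item.response_counts(), item.response_counts(make_binary=True), item.max_label(make_binary=True)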

In R, one can achieve the collapse by adding column values:

  iqap = read.csv('iqap-data.csv')
  iqap$yes = iqap$definite.yes + iqap$probable.yes
  iqap$no = iqap$definite.no + iqap$probable.no

Code and data:

To rerun the experiments, head to the bottom of iqap_experiments.py, where you will see options for running either the series of deterministic experiments or the probabilistic one.

Experiment 1: A deterministic strategy

The first experiment is deterministic in the following sense: we work towards defining a single function that predicts 'yes' or 'no' as the label for each Item.

To prevent mistyping, and to allow flexibility, I define two global Python variables for specifying predictions:

  YES = 'yes'
  NO = 'no'

The one-word cases

To begin, we look at cases where both the question and the answer have just one word in their '-CONTRAST' sets, which means that we can compare the two words directly using the resources we've developed.

The following function allows us to move quickly to just these 'one-worders'; using this will simplify a lot of our code.

  def is_one_worder(item):
      """
      The argument is an iqap.Item instance. If the -CONTRAST tree sets
      for both the question and the answer contain exactly one terminal,
      return those terminals, else return False.
      """
      q_words = item.question_contrast_pred_pos()
      a_words = item.answer_contrast_pred_pos()
      if len(q_words) == 1 and len(a_words) == 1:
          return (q_words[0], a_words[0])
      else:
          return False

WordNet enrichment

The following is a WordNet prediction function. It uses wordnet_relations to get all the relations that hold between its arguments word1 and word2, and then it uses the following heuristics to guess an answer:

  1. If the question is inconsistent with the answer, return NO. (Entailment.)
  2. If the question word is stronger than the answer, return NO. (Upper-bounding implicature.)
  3. If the question word is weaker than the answer, return YES. (Entailment.)
  4. If the question word is synonymous with the answer, return YES. (Entailment.)
  5. If no relation can be found, return None. (Ultimately, we will treat such lack of evidence as positive evidence that the two predicates are unrelated, which implicates NO.)
  def wordnet_word_predict(word1, word2):
      """
      Uses wordnet_functions.wordnet_relations to try to determine the
      rich enrichment for an IQAP answer, based on both logical and
      pragmatic principles.
      """
      rels = wordnet_functions.wordnet_relations(word1, word2)
      # Where WordNet can't relate the two, opt out:
      if not rels:
          return None
      else:
          # Where the question radical and answer are incompatible:
          if set(['antonyms']) & rels:
              return NO
          # Where word1 entails word2 (Great? / Good.):
          elif set(['entailments', 'hypernyms', 'member_meronyms', 'substance_meronyms', 'part_meronyms']) & rels:
              return NO
          # Where word2 entails word1 (Good? / Great!):
          elif set(['causes', 'hyponyms', 'member_holonyms', 'substance_holonyms', 'part_holonyms']) & rels:
              return YES
          # Where word1 and word2 seem roughly synonymous:
          elif set(['also_sees', 'similar_tos']) & rels:
              return YES
          # In the event of (only) a different relation, opt out:
          else:
              return None
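
For instance (a hypothetical call; words are (string, pos) pairs, as elsewhere in this section, and I'm assuming that wordnet_relations reports the antonym relation for this pair):

  # 'good' and 'bad' are antonyms in WordNet, so the question radical and
  # answer are treated as incompatible:
  print wordnet_word_predict(('good', 'a'), ('bad', 'a'))    # NO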

To assess this function as well as the others defined below, I define an assessment function, which keeps track of the overall coverage and prints out an effectiveness summary for the items that the function does make predictions about:

  def assess_one_worder_performance(prediction_function, print_errors=False):
      """
      See how many of the one_worder cases are captured by the resource
      behind prediction_function, and also assess whether it makes
      correct predictions where it makes predictions at all. Prints the
      assessment to standard output.
      """
      coverage = defaultdict(int)
      cm = defaultdict(lambda : defaultdict(int))
      corpus = IqapReader('iqap-data.csv')
      for item in corpus.dev_set():
          status = is_one_worder(item)
          if status:
              q_word, a_word = status
              predicted = prediction_function(q_word, a_word)
              # Check for a value at all:
              if predicted:
                  coverage['related'] += 1
                  actual = item.max_label(make_binary=True)
                  cm[actual][predicted] += 1
                  # View incorrect predictions:
                  if print_errors and actual != predicted:
                      print q_word, a_word
              else:
                  coverage['unrelated'] += 1
      # View the confusion matrix:
      print_effectiveness(cm)
      # View the coverage:
      print "coverage:", coverage['related'], "/", sum(coverage.values()), "=", round(coverage['related'] / float(sum(coverage.values())), 2)

The assessment is done by print_effectiveness(), which does standard precision, recall, and accuracy calculations for each class:

  def print_effectiveness(d):
      """
      Prints an effectiveness report to standard output.
      Argument: d -- a two-dimensional defaultdict in which the initial
      keys are the actual values and the secondary keys are the
      predicted values.
      The first step is creating a confusion matrix in which the rows are
      the actual values and the columns are the predicted values. Once
      that is created, the precision and recall calculations are done
      for each class.
      """
      # Create a proper matrix to make the calculations easier:
      cm = numpy.zeros((len(d), len(d)))
      keys = sorted(d.keys())
      for i, actual in enumerate(keys):
          for j, predicted in enumerate(keys):
              cm[i, j] = d[actual][predicted]
      # Precision and recall by category:
      precisions = []
      recalls = []
      for i, key in enumerate(keys):
          # Precision: tp / (tp + fp).
          precision = cm[i, i] / numpy.sum(cm[:, i])
          precisions.append(precision)
          print key, 'precision:', round(precision, 2)
          # Recall: tp / (tp + fn).
          recall = cm[i, i] / numpy.sum(cm[i, :])
          recalls.append(recall)
          print key, 'recall:', round(recall, 2)
      # Micro-averaging:
      print 'micro-averaged precision', round(numpy.mean(precisions), 2)
      print 'micro-averaged recall', round(numpy.mean(recalls), 2)
      # Accuracy (percentage correct):
      print 'accuracy:', round(numpy.sum(cm.diagonal()) / numpy.sum(cm), 2)

The assessment:

  assess_one_worder_performance(wordnet_word_predict, print_errors=False)
  no precision: 1.0
  no recall: 0.75
  yes precision: 0.82
  yes recall: 1.0
  micro-averaged precision 0.91
  micro-averaged recall 0.88
  accuracy: 0.88
  coverage: 17 / 82 = 0.21

Where WordNet makes a guess, it is highly accurate, missing just two items. This is encouraging. However, the overall coverage is very slight, so we will need to recruit other information.

Review enrichment

Next we recruit the IMDB review data. My expectation is that this will have better coverage but prove less accurate than WordNet.

If you followed along with the IMDB discussion through to the section on building scales, then you probably created a file called imdb-words-assess.csv. If not, you can get my version here:

  1. imdb-words-assess.csv.zip

The function review_functions.get_all_imdb_scores reads in this file and turns it into a Python dictionary mapping word-tag pairs to dictionaries of values {'Word':word, 'Tag':tag, 'ER':float, 'Coef':float, 'P':float}:

  SCORER = review_functions.get_all_imdb_scores('imdb-words-assess.csv')

The following function uses p-values to restrict attention to just the linear coefficient values that we are willing to call reliable. If the p-value threshold isn't met, it returns None.

  def coef_score(word, p=0.1):
      """
      Return the linear coefficient for word according to SCORER, if
      word's p-value is at or below the threshold p.
      Arguments:
      word -- a (string, pos) pair
      p -- a p-value threshold (default: 0.1; use 1 for no threshold)
      Value: the coefficient for word, or None if the word is not in the
      dictionary or the p-value is too high.
      """
      val_dict = SCORER.get(word, None)
      if not val_dict:
          return None
      else:
          if val_dict['P'] <= p:
              return val_dict['Coef']
          else:
              return None

The prediction function compares coefficient values. The first comparison checks whether the two coefficients have the same sign. Differing signs are assumed to correspond to NO; these are pairs like good and bad. Once that comparison is made, we move to absolute values, using their size as a proxy for strength. This is simply an efficient way of ensuring that, for example, bad has a "smaller" (less negative) coefficient than terrible.

  def reviews_word_predict(word1, word2, p=0.1):
      """
      Make an IQAP answer prediction based on review coefficient scores.
      Arguments:
      word1, word2 -- (string, pos) pairs
      p -- a p-value threshold (default: 0.1; use 1 for no threshold)
      Value: YES, NO, or None
      """
      word1 = wordnet_functions.wordnet_sanitize(word1)
      word2 = wordnet_functions.wordnet_sanitize(word2)
      score1 = coef_score(word1, p=p)
      score2 = coef_score(word2, p=p)
      # If a score is missing, opt out:
      if score1 is None or score2 is None:
          return None
      # Where the signs are different, the words are predicted to be
      # inconsistent with each other:
      if sign(score1) != sign(score2):
          return NO
      score1 = abs(score1)
      score2 = abs(score2)
      # Where word1 is greater than word2 (Great? Good.):
      if score1 > score2:
          return NO
      # Where word1 is less than or equal to word2 (Good? Great.):
      if score1 <= score2:
          return YES
And finally an assessment of the IMDB data on their own:

  assess_one_worder_performance(reviews_word_predict, print_errors=False)
  no precision: 0.77
  no recall: 0.77
  yes precision: 0.62
  yes recall: 0.62
  micro-averaged precision 0.69
  micro-averaged recall 0.69
  accuracy: 0.71
  coverage: 35 / 82 = 0.43

Combining resources

Still restricting attention to the one-worders, we carefully put the resources together into a single prediction function. In doing this, we favor WordNet, appealing to the reviews only where WordNet provides no information.

The function also includes one further inference: where neither of our resources makes a prediction, we venture a NO, on the grounds that unrelated predicates deliver this implicature via a Relevance-based calculation.

  def combined_resources_word_predict(word1, word2):
      """
      Use WordNet and the reviews to make predictions. The logic:
      1. If WordNet makes a prediction, go with it, since it is a high-precision resource.
      2. Where WordNet makes no prediction, use the reviews prediction if there is one.
      3. Where the reviews make no prediction, assume that the two predicates are independent,
         which, by Gricean reasoning, means NO in the IQAP context.
      """
      predicted = None
      wn_predicted = wordnet_word_predict(word1, word2)
      if wn_predicted:
          predicted = wn_predicted
      else:
          predicted = reviews_word_predict(word1, word2)
      if not predicted:
          predicted = NO
      return predicted

The whole is indeed better than its parts, in the sense that we now cover the entire data set (though with a small drop in overall performance):

  assess_one_worder_performance(combined_resources_word_predict, print_errors=False)
  no precision: 0.68
  no recall: 0.85
  yes precision: 0.68
  yes recall: 0.44
  micro-averaged precision 0.68
  micro-averaged recall 0.65
  accuracy: 0.68
  coverage: 82 / 82 = 1.0

Multi-word cases

We can put it off no longer: about 40% of the development set examples are not one-worders, so we need to generalize our approach to the complexities of multi-word expressions.

Both WordNet and the reviews consist of data on single words, so our multi-word strategy needs to make use of those basic resources somehow. Some heuristics:

  1. If the question's contrast predication is a subset of the answer's, return YES. The heuristic is that all modification will be restrictive, and thus that the more heavily restricted answer will be stronger as well.
  2. Reverse the above reasoning if the answer's contrast predication is a subset of the question's.
  3. For the remaining cases, make predictions about the cross-product of the question vocabulary and the answer vocabulary, and then pick the majority choice where there is one, else choose NO.
  4. If the answer's predication includes a negation, reverse the final decision.

The following code implements this set of heuristics, drawing heavily on code we developed for the one-worders:

  def combined_resources_item_predict(item):
      """
      The function makes predictions for all iqap.Item instances,
      whether one-worders or multi-worders. The return value is always
      YES or NO.
      """
      # Get the words:
      q_words = set(item.question_contrast_pred_pos())
      a_words = set(item.answer_contrast_pred_pos())
      # Predictions:
      predicted = NO
      # Get this easy identity case out of the way:
      if q_words == a_words:
          return YES
      # Check the subset/superset relationships:
      if q_words.issubset(a_words):
          predicted = YES
      elif a_words.issubset(q_words):
          predicted = NO
      else:
          # Now check the cross-product of q_words and a_words for relevant connections:
          prediction_dict = defaultdict(int)
          for q in q_words:
              for a in a_words:
                  prediction_dict[combined_resources_word_predict(q, a)] += 1
          # We sort all the predictions by their degree of representation, and then
          # pick the best represented, favoring NO in the case of ties:
          sorted_predictions = sorted(prediction_dict.items(), key=itemgetter(1), reverse=True)
          predicted = sorted_predictions[0][0]
          if len(sorted_predictions) > 1 and sorted_predictions[0][1] == sorted_predictions[1][1]:
              predicted = NO
      # If the answer is negated, then we reverse the prediction:
      if a_words & set(["n't", "not", "never"]):
          if predicted == NO:
              predicted = YES
          elif predicted == YES:
              predicted = NO
      return predicted

A new assessment method (one that doesn't restrict to one-worders):

  def assess_performance(print_errors=False):
      """
      Assesses a full deterministic experiment and prints a summary to
      standard output. Use print_errors=True to see the errors above the
      performance report.
      """
      cm = defaultdict(lambda : defaultdict(int))
      corpus = IqapReader('iqap-data.csv')
      for item in corpus.dev_set():
          predicted = combined_resources_item_predict(item)
          actual = item.max_label(make_binary=True)
          cm[actual][predicted] += 1
          # View incorrect predictions for the multi-word cases (we saw the single-word ones already):
          if print_errors and actual != predicted and \
             (len(item.question_contrast_pred_pos()) > 1 or len(item.answer_contrast_pred_pos()) > 1):
              print '======================================================================'
              print item.question_contrast_pred_pos()
              print item.answer_contrast_pred_pos()
              print actual, predicted
      # View the confusion matrix:
      print_effectiveness(cm)

Here is the output of the assessment:

  assess_performance(print_errors=False)
  no precision: 0.66
  no recall: 0.86
  yes precision: 0.71
  yes recall: 0.44
  micro-averaged precision 0.68
  micro-averaged recall 0.65
  accuracy: 0.67

Discussion

The performance of the deterministic model is pretty good, on the whole. We are well above chance on accuracy, and the precision and recall numbers show that we are making solid, educated guesses.

One unsatisfying thing about the model is that it is so hand-crafted. We decide ahead of time how to prioritize the information, and then we reduce it all to a single judgment. In doing this, we are likely hiding, ignoring, or obscuring many important factors.

The next model we develop addresses all of these concerns. In my view, it allows us to be guided by specific scientific insights, but it also learns from the data we provide it. In general, its chief advantage is that it synthesizes a wide variety of information into a single predictive function, and it is also attentive to the data, in the sense that it is trained on, and therefore learns from, its details.

Experiment 2: A probabilistic (MaxEnt) model

Background

Maximum Entropy (MaxEnt) classification is a version of multinomial logistic regression. Since we'll deal with just two classes ('yes' and 'no') for our experiments, the model is effectively the same as a basic logistic regression model of the sort that we built with the review data.

One of the great strengths of MaxEnt models is that they are able to deal with lots of different kinds of features, and they deal reasonably well with correlations between predictors that can lead other models astray. What this means for us as linguists is that we are relatively free to define a lot of features that make sense to us scientifically, relying on the model to sort them out.

To get a feel for how the model works, it's useful to consider a simple model in which there is just one predictor. For this, let's use the difference in the cumulative coefficient scores between the question and the answer. The function for calculating the cumulative score for each example is review_score_feature below. We'll take the difference of the two values to be the sole predictor.

What we do at this point is reduce each item in the data to its feature vector. Here, our feature vectors have just one element in them — the float-valued difference between the scores. (For the models we develop below, we will map each item to a long and diverse feature vector.)

The model is then trained on some percentage of the data. For standard classification, this means that we feed it pairs consisting of the correct label for the example and the associated feature vector.

However, there is one twist: each of our examples has, not one label, but rather 30 of them. We could take the majority label, but this would obscure the truly mixed message sent by some of our annotations. Thus, when training our model, we include each item 30 times, once for each label it has:

          item   annotation   review_score_diff
   1      4015   yes                      -0.07
   2      4015   yes                      -0.07
   ...    ...    ...                        ...
   13     4015   yes                      -0.07
   14     4015   no                       -0.07
   15     4015   no                       -0.07
   ...    ...    ...                        ...
   30     4015   no                       -0.07
   31     4016   yes                       0.00
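
In code, the construction of these training pairs might look roughly like this (a sketch, not the actual training code: review_score_diff here is a hypothetical helper computing the single predictor, and I assume that response_counts(make_binary=True) returns a dictionary mapping each binary label to its count):

  # Each item contributes 30 (feature vector, label) pairs, one per annotation.
  train = []
  corpus = IqapReader('iqap-data.csv')
  for item in corpus.dev_set():
      feats = {'review_score_diff': review_score_diff(item)}  # hypothetical one-feature vector
      for label, count in item.response_counts(make_binary=True).items():
          train.extend([(feats, label)] * count)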

The model then learns a weight associated with each of the predictors, based on the associations in the training data. For the actual example here, the weight it learns for the feature review_score_diff is 2.923 for the correct class being 'no'. (This weight will vary somewhat depending on the composition of the train/test split used.)

The prediction step then involves taking each feature vector and plugging it into the model equation. For example, if the feature vector is -0.07, as for Item 4015 above, then we calculate invlogit(3.331 * -0.07) = 0.56. That is, the model says that there is a 56% chance that this example is labeled 'no'. In turn, there is a 1.0-0.56 = 0.44 chance that the correct label is 'yes'.

In a regular classification setting, we can simply take the larger of the two probabilities to be the predicted class. Since each of our examples has 30 labels, our method is slightly different. We take two basic approaches:

  1. Max-label: For example 4015, the max label is 'no', with 17 of the 30 annotations. The max label chosen by our model is also 'no', since it assigns that category the higher of the two probabilities. Thus, we say that the model classifies correctly here.
  2. KL divergence: For this, we turn the annotations into a distribution and compare it to the distribution returned by the model. Example 4015 has the distribution {yes: 13/30, no: 17/30} = {yes: 0.43, no: 0.57}, whereas the model predicts {yes: 0.44, no: 0.56}. The KL divergence of the predicted distribution from the actual one is 0.0002032661. We calculate this value for each item and then take some kind of average of the values. (I use the mean; a small computational sketch follows this list.)
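
For concreteness, here is a minimal sketch of that KL computation for Item 4015, using the rounded probabilities from above (the value quoted in item 2 was presumably computed from the model's unrounded predictions, so this rounded version won't reproduce it exactly):

  import math

  def kl_divergence(actual, predicted):
      """KL(actual || predicted), using natural log, for two label -> probability dicts."""
      return sum(p * math.log(p / predicted[label]) for label, p in actual.items() if p > 0)

  actual    = {'yes': 13/30.0, 'no': 17/30.0}  # annotation distribution for Item 4015
  predicted = {'yes': 0.44, 'no': 0.56}        # the model's (rounded) predicted distribution
  print kl_divergence(actual, predicted)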

Each of these methods has advantages and disadvantages. Max-label is very easy to interpret, but it ignores a lot of the structure in our annotations. KL divergence embraces the uncertainty inherent in our annotations and predicted by our model, but it is hard to interpret in isolation, since the KL values don't mean much. In what follows, I use both methods together, comparing to baseline models to firm up our intuitions about what the numbers mean.

figures/iqap/maxent-illustration.png
Figure FITS
A graphical illustration of how to use a fitted model (the one from the prose) to classify examples.

Features

I now define a bunch of feature functions that we will use to reduce our items to feature vectors.

Subset and superset features

The deterministic model returns a definitive answer based on these simple relationships between the question and the answer. This often works well (Good? / Very good!), but it can also fail (Good? / Somewhat good.). With a MaxEnt model, we can include these features, expecting them to be balanced against other considerations.

  def subset_feature(q_words, a_words):
      """Return True if q_words is a subset of a_words, else False."""
      if q_words.issubset(a_words):
          return True
      else:
          return False

  def superset_feature(q_words, a_words):
      """Return True if a_words is a subset of q_words, else False."""
      if a_words.issubset(q_words):
          return True
      else:
          return False

Negation feature

We'll use negation in a variety of ways. This is the basis for those uses, and we'll also include it as a feature:

  def negation_feature(words):
      """Return True if words contains a negation, else False."""
      if words & set(["n't", "not", "never"]):
          return True
      else:
          return False

Figure NEGIMPACT helps to convey why the negation feature is important: encouraged is weaker than optimistic, suggesting a 'yes' reply, but the negation in the answer reverses this.

figures/iqap/neg-impact.png
Figure NEGIMPACT
Negation reverses the ordering expected of the adjectives.

Figure NEGIMPACT was generated by the following code:

  source('review_functions.R')
  imdb = read.csv('imdb-words.csv')
  par(mfrow=c(1,2), oma=c(2,0,0,0))
  WordDisplay(imdb, 'encouraged', 'a')
  WordDisplay(imdb, 'optimistic', 'a')
  mtext("Are you encouraged by the new developments?; I'm not so optimistic for the time being.", side=1, outer=TRUE)

Review score feature

This function builds a total score for the words provided, based on the review data. It reverses the sign of the final score if negation_feature == True. The intended use is to compute negation_feature(words) and pass the result in as the second argument.

  def review_score_feature(words, negation_feature):
      """
      Returns the cumulative review score for words. If negation_feature
      is True (this should be determined by the negation_feature
      function), then flip the sign of the final score.
      """
      score = 0.0
      for w in words:
          coef = coef_score(w)
          if coef:
              score += coef
      if negation_feature:
          score = -score
      return score

The reversal triggered where negation_feature == True is motivated by interactions like that of figure NEGIMPACT.

Review score inference

In addition to using the raw review scores, we also use this careful, pragmatically-informed comparison of them:

  def review_score_inference(q_score, a_score):
      """
      This function employs the same logic as reviews_word_predict,
      except now dealing with the cumulative scores as given by
      review_score_feature. (Both q_score and a_score are presumed to
      have come from that function.)
      """
      if sign(q_score) != sign(a_score):
          return False
      q_score = abs(q_score)
      a_score = abs(a_score)
      if q_score < a_score:
          return True
      else:
          return False

WordNet relations

  def wordnet_relation_features(q_words, a_words, features):
      """
      This function creates a number of features, in that it adds a
      mapping relname -> True to features for each relation that it
      finds between question and answer words. In addition, it reverses
      the scalar relations in the presence of differing negation values.
      """
      for q in q_words:
          for a in a_words:
              for rel in wordnet_functions.wordnet_relations(q, a):
                  if features['q_negated'] != features['a_negated']:
                      # Swap hypernym/hyponym and meronym/holonym, using a
                      # placeholder so that the two replacements don't collide:
                      rel = rel.replace('hypernym', '@').replace('hyponym', 'hypernym').replace('@', 'hyponym')
                      rel = rel.replace('meronym', '@').replace('holonym', 'meronym').replace('@', 'holonym')
                  features[rel] = True
      return features

WordNet inferences

This function essentially acts as though each (question_word, answer_word) pair were the sole basis for prediction, using wordnet_word_predict from the deterministic experiment to make a YES/NO inference about each pairing:

  def wordnet_inferences(q_words, a_words, features):
      """
      This function uses wordnet_word_predict to create predictions,
      doing so for each pair of words (q, a) and adding to features a
      boolean feature recording each prediction (including None).
      """
      for q in q_words:
          for a in a_words:
              features['wn_' + str(wordnet_word_predict(q, a))] = True
      return features

The experiment

The feature function puts all of the above pieces together:

  def combined_resources_item_features(item):
      """
      Feature function for iqap.Item instances. The intent is to feed
      this into iqap_classifier.IqapClassifier as its feature function,
      for maxent classification.
      Argument: item -- any iqap.Item instance
      Value: features -- a defaultdict with a variety of different value types
      """
      features = defaultdict(int)
      # Get the words as sets:
      q_words = set(item.question_contrast_pred_pos())
      a_words = set(item.answer_contrast_pred_pos())
      # Subset and superset features (boolean valued):
      features['subset'] = subset_feature(q_words, a_words)
      features['superset'] = superset_feature(q_words, a_words)
      # Negation features (boolean valued):
      features['q_negated'] = negation_feature(q_words)
      features['a_negated'] = negation_feature(a_words)
      # Score features (float valued):
      features['q_score'] = review_score_feature(q_words, features['q_negated'])
      features['a_score'] = review_score_feature(a_words, features['a_negated'])
      # Relational score feature (boolean valued):
      features['review_score_cmp'] = review_score_inference(features['q_score'], features['a_score'])
      # WordNet relational features (boolean valued):
      features = wordnet_relation_features(q_words, a_words, features)
      # WordNet inference features (boolean valued):
      features = wordnet_inferences(q_words, a_words, features)
      return features
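
As a quick sanity check, one can inspect the feature vector for a single development item (a usage sketch; I assume here that dev_set() returns an indexable list):

  corpus = IqapReader('iqap-data.csv')
  item = corpus.dev_set()[0]
  # Convert to a plain dict just for tidier printing:
  print dict(combined_resources_item_features(item))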

Results

Table RESULTS gives a set of results for the model:

  Run   Log-lik.    Max-label acc.      Mean KL div.   Micro-avg prec.  Micro-avg recall    Train acc.
    1      -0.48               0.8              0.54              0.88              0.73          0.81
    2      -0.48              0.77              0.59              0.77              0.76          0.78
    3       -0.4              0.83              0.43              0.85              0.83          0.78
    4      -0.57              0.67              0.83              0.68              0.66          0.82
    5      -0.49               0.7              0.65              0.71              0.69          0.81
    6      -0.52              0.83              0.66              0.85              0.79           0.8
    7      -0.39               0.8              0.49              0.83              0.79          0.79
    8      -0.44              0.77              0.59              0.82              0.81          0.81
    9      -0.51              0.73              0.55              0.78               0.7          0.82
   10       -0.5              0.83               0.5              0.85              0.79           0.8
Means      -0.48              0.77              0.58               0.8              0.76           0.8
Table RESULTS
The results for the MaxEnt model, 10 random train/test splits of the development set, 80% training data.

Discussion

An examination of the model's feature weights suggests that it is behaving sensibly. Table POS and table NEG provide information about the top indicators of 'yes' and 'no', respectively.

Feature                              Weight   Notes
a_score                               4.669   High answer scores increase the probability of yes
similar_tos==True                     1.829   similar_to is a kind of synonym relation
wn_yes==True                          1.688   WordNet makes good predictions
q_score                               1.646   High question scores increase the probability of yes
also_sees==True                       1.549   also_see is a kind of synonym relation
subset==True                          1.500   Restrictive modification does deliver 'yes', mostly
review_score_cmp==True                0.339   The reviews make good predictions (True is like 'yes'; see review_score_inference)
derivationally_related_forms==True    0.210   Derivational relations preserve meaning
Table POS
Top features favoring a 'yes' label.
Feature                   Weight   Notes
hyponyms==True             1.856   WordNet hyponym relations hold between question and answer words
wn_no==True                0.951   WordNet predicts 'no' for at least one word pair
antonyms==True             0.805   Antonymy signals that the question and answer predicates are incompatible
superset==True             0.203   The answer's contrast predication is a subset of the question's
hypernyms==True            0.146   The question word is stronger than the answer word
review_score_cmp==False    0.094   The reviews make good predictions (False is like 'no'; see review_score_inference)
wn_None==True              0.043   Independence in WordNet creates Relevance implicatures
Table NEG
Top features favoring a 'no' label.

It is somewhat hard to think about the performance numbers in isolation. Some considerations:

  1. Performance is about the same on both the training and testing sets, suggesting that we are not over-fitting.
  2. The model certainly does better than the deterministic one. Here are the deterministic results again:
     assess_performance(print_errors=False)
     no precision: 0.66
     no recall: 0.86
     yes precision: 0.71
     yes recall: 0.44
     micro-averaged precision 0.68
     micro-averaged recall 0.65
     accuracy: 0.67

To gain another baseline, I fit a model that simply uses word-counts inside the -CONTRAST predicates as features. (This experiment is run if iqap_classifier.py is run from the command-line.) This is a standard sort of baseline in natural language processing. The results are given in table UNIGRAMS:

  1. The overall accuracy and KL-divergence numbers are about the same as for our model, except this one is more volatile in its performance.
  2. The model has so many features that it simply memorizes the training data, producing perfect performance on it each time. This suggests that it won't generalize well.
  Run    Log-lik.    Max-label acc.      Mean KL div.   Micro-avg prec.  Micro-avg recall    Train acc.
    1        -0.3              0.83              0.57              0.83              0.83           1.0
    2        -0.5               0.7              0.52              0.66              0.63           1.0
    3       -0.39              0.77              0.57              0.77              0.76           1.0
    4       -0.52               0.7              0.64              0.66               0.6           1.0
    5       -0.51              0.63              0.78              0.58              0.57           1.0
    6       -0.53              0.63              0.75              0.61              0.58           1.0
    7       -0.47               0.7              0.64               0.7              0.65           1.0
    8       -0.55               0.6              0.76              0.61              0.58           1.0
    9       -0.43              0.73              0.59              0.75              0.71           1.0
   10       -0.43              0.67              0.68              0.73              0.68           1.0
Means       -0.46               0.7              0.65              0.69              0.66           1.0
Table UNIGRAMS
Development set results for a unigrams baseline model, where the features are just counts of the words in the -CONTRAST trees.
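
For reference, the unigram baseline's feature function can be sketched roughly as follows (the actual implementation is in iqap_classifier.py and may differ in its details; I assume the contrast methods return (word, tag) pairs):

  from collections import defaultdict

  def unigram_item_features(item):
      """Counts of the words in the two -CONTRAST trees, kept separate
      for the question and the answer."""
      features = defaultdict(int)
      for word, tag in item.question_contrast_pred_pos():
          features['q_' + word] += 1
      for word, tag in item.answer_contrast_pred_pos():
          features['a_' + word] += 1
      return features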

I conclude that we have identified a number of important predictors, and they are working well together. Of course, there are lots of other possibilities to try. That's what most of the exercises are about, and this area is also ripe for projects.

Exercises

ERROR The assessment functions assess_one_worder_performance() and assess_performance() both have an optional argument print_errors. If print_errors=True, then the function prints out the error examples. Study the errors from one of these functions (for assess_one_worder_performance, with your choice of prediction function as its first argument), and draw some general lessons about the nature of the errors, with an eye towards making improvements.

FEATURES For the deterministic system, propose three modifications to features we used or three new features (or a mix of these options). Motivate each one, either by studying the data or (for bonus points) by implementing and assessing them on your own.

BIGRAMS The CSV file imdb-bigrams.csv.zip contains data on a large number of bigrams from the IMDB, in the following format:

  bi = read.csv('imdb-bigrams.csv')

     Word1 Tag1   Word2 Tag2 Category Count    Total
  1     aa    n meeting    n        1     6 25395214
  2     aa    n meeting    n        2     0 11755132
  3     aa    n meeting    n        3     4 13995838
  4     aa    n meeting    n        4     2 14963866
  5     aa    n meeting    n        5     0 20390515
  6     aa    n meeting    n        6     9 27420036
  7     aa    n meeting    n        7    17 40192077
  8     aa    n meeting    n        8    22 48723444
  9     aa    n meeting    n        9    15 40277743
  10    aa    n meeting    n       10    10 73948447

Incorporate this information into the non-deterministic model. This is decidedly non-compositional, but it might help — there are some striking patterns in the bigram distributions, as seen in figure ADVADJ.

figures/reviews/imdb-adv-adj-combos.png
Figure ADVADJ
Adverb–adjective pairs in the bigrams data. The adverbs' effects seem highly predictable from their own distributions and those of their complements.

ERRORCMP If you run assess_performance(print_errors=True), you get a performance assessment for our deterministic model with the errors printed out. You can do the same for a MaxEnt classifier model built from a random train/test split, using our central feature function.
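
One way to do this is sketched below, using NLTK's MaxentClassifier directly rather than the iqap_classifier wrapper; for simplicity, this version trains on each item's binary max label rather than on all 30 annotations, and it assumes dev_set() returns a list:

  import random
  from nltk.classify import MaxentClassifier

  corpus = IqapReader('iqap-data.csv')
  items = corpus.dev_set()
  random.shuffle(items)
  split_point = int(len(items) * 0.8)
  train_items, test_items = items[:split_point], items[split_point:]
  train = [(combined_resources_item_features(i), i.max_label(make_binary=True)) for i in train_items]
  classifier = MaxentClassifier.train(train, trace=0)
  # Print the test-set errors:
  for item in test_items:
      predicted = classifier.classify(combined_resources_item_features(item))
      actual = item.max_label(make_binary=True)
      if predicted != actual:
          print item.question_contrast_pred_pos()
          print item.answer_contrast_pred_pos()
          print actual, predicted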

Compare the errors from the deterministic and maxent models. How are they alike and how do they differ? What, if anything, do these findings tell us about which model to favor?

SENTI SentiWordNet provides another set of values, akin to the IMDB scores. Should we add these features to either one of the models? If so, how? If not, why not? For this, it might be useful to do this exercise comparing the SentiWordNet and IMDB values, to see what kind of new information the SentiWordNet scores would add.

PROPAGATE Blair-Goldensohn et al. 2008 propose an innovative method for learning sentiment information from WordNet using seed sets of known sentiment words and a propagation algorithm. A Python/NLTK implementation of their algorithm is here, and the output of that algorithm run with the Harvard Inquirer as the seed sets is wnscores_inquirer.csv.zip. (For more on the file's generation, see the readme file. For an overview of the algorithm, see page 12 of this handout.) Are these scores better or worse than the IMDB scores when it comes to making predictions about our dataset? You can answer this with whatever evidence you like: a direct comparison of the two lexicons, or a deterministic or maxent assessment.

CLASS NLTK includes a number of other classifier models. Perhaps modify/extend iqap_classifier.py so that the user can take advantage of one or more of these other models. (See also this project problem for an experimental/empirical angle on this coding project.)

PER For both the core MaxEnt experiment and the unigrams comparison (table UNIGRAMS), I used 80% of the development-set data for training. How do the models compare if we use less (perhaps much less) of the data for training? Do such experiments further support one model over the other?