Learning lexical scales: WordNet and SentiWordNet

  1. Overview
  2. WordNet structure
    1. Synset lists: From strings into WordNet
    2. Synsets
    3. Lemma lists: From strings into WordNet
    4. Lemmas
  3. SentiWordNet extension
  4. A function for obtaining lexical alternatives
  5. Exercises

Overview

This section is about using WordNet to build lexical scales. For the exploration, I use the NLTK WordNet module. There are a variety of other interfaces:

  1. Web interface
  2. Download (includes browsing tools)
  3. Browser interface
  4. Other APIs

I use only the English WordNet, but there are WordNets for other languages as well:

  1. OpenThesaurus (German)
  2. Global WordNet (with free downloads for at least Arabic, Danish, French, Hindi, Russian, Tamil)

Associated reading:

Code and data:

WordNet structure

The next few subsections are a fast overview of the structure of WordNet, using NLTK Python code. If you're new to using WordNet, I recommend pausing right now to read section 2.5 of the NLTK book. Once that's done, start Python's command-line interpreter, type this, and hit enter:

  >>> from nltk.corpus import wordnet as wn

This loads the WordNet module, which provides access to the structure of WordNet (plus other cool functionality).
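If the import succeeds but subsequent lookups raise a LookupError, you probably haven't installed the WordNet data itself yet. You can get it via NLTK's downloader (a one-time step):

  >>> import nltk
  >>> nltk.download('wordnet')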

The two most important WordNet constructs are lemmas and synsets:

  1. Lemma: close to the linguistic concept of a word. Lemmas are identified by strings like idle.s.03.unused, where
    1. idle is the stem identifier for the Synset containing this Lemma
    2. s is the WordNet part of speech
    3. 03 is the sense number (01, 02, ...)
    4. unused is the morphological form
  2. Synset: a collection of Lemmas that are synonymous (by the standards of WordNet). Synsets are identified by strings like idle.s.03, where
    1. idle is the canonical string name
    2. s is the WordNet part of speech
    3. 03 is the sense number (01, 02, ...); sense 01 is considered the primary sense
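These identifier strings can be handed to the singular look-up functions wn.synset() and wn.lemma() to retrieve the corresponding objects directly. A quick check in the interpreter (assuming the import above):

  >>> wn.synset('idle.s.03')
  Synset('idle.s.03')
  >>> wn.lemma('idle.s.03.unused')
  Lemma('idle.s.03.unused')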

Synset lists: From strings into WordNet

One of the bridges from strings into the structured objects of WordNet is the function wn.synsets(), which returns the list of Synset objects compatible with the string, or string–tag pair, provided. (The other such bridge is wn.lemmas().)

  >>> wn.synsets('idle')
  [Synset('idle.n.01'), Synset('idle.v.01'), Synset('idle.v.02'),
   Synset('idle.a.01'), Synset('baseless.s.01'), Synset('idle.s.03'),
   Synset('idle.s.04'), Synset('idle.s.05'), Synset('dead.s.09'),
   Synset('idle.s.07')]
  >>> wn.synsets('idle', 'a')
  [Synset('idle.a.01'), Synset('baseless.s.01'), Synset('idle.s.03'),
   Synset('idle.s.04'), Synset('idle.s.05'), Synset('dead.s.09'),
   Synset('idle.s.07')]
  >>> wn.synsets('idle', 'v')
  [Synset('idle.v.01'), Synset('idle.v.02')]
  >>> wn.synsets('idle', 'n')
  [Synset('idle.n.01')]
  >>> wn.synsets('idle', 'r')
  []

The first member of these lists is the primary (most frequent) sense for the input supplied.

The outputs are Python lists, and thus all of the list methods are available for them. If you're just getting started with Python, you might experiment with them using methods from the Python list documentation. A couple of examples:

  >>> # Store the list in a variable:
  >>> idle_a = wn.synsets('idle', 'a')
  >>> # Get the 3rd element (counting starts at 0):
  >>> idle_a[2]
  Synset('idle.s.03')
  >>> # Reverse the list (changes the list in place):
  >>> idle_a.reverse()
  >>> idle_a
  [Synset('idle.s.07'), Synset('dead.s.09'), Synset('idle.s.05'),
   Synset('idle.s.04'), Synset('idle.s.03'), Synset('baseless.s.01'),
   Synset('idle.a.01')]

Synsets

Let's grab one of the synsets returned by wn.synsets('idle', 'a') and work with it briefly:

  >>> idle_synsets = wn.synsets('idle', 'a')
  >>> baseless = idle_synsets[1]
  >>> baseless.definition
  'without a basis in reason or fact'
  >>> baseless.examples
  ['baseless gossip', 'the allegations proved groundless', 'idle fears',
   'unfounded suspicions', 'unwarranted jealousy']
  >>> baseless.lemmas
  [Lemma('baseless.s.01.baseless'), Lemma('baseless.s.01.groundless'),
   Lemma('baseless.s.01.idle'), Lemma('baseless.s.01.unfounded'),
   Lemma('baseless.s.01.unwarranted'), Lemma('baseless.s.01.wild')]

If you're just starting out with Python, you might pause now to experiment some more with Synset objects, trying out methods and manipulations from the Synset documentation. If syn is your synset, then syn.method() will deliver the value for the various choices of method (e.g., hypernyms(), part_meronyms()), and syn.attribute will give you the value of each attribute (name, pos, lemmas, definition, examples, offset).

All of the above manipulations can easily be done with the Web or command-line interface to WordNet itself. The power of working within a full programming language is that we can also take a high-level perspective on the data. For example, the following code (which you might save as a file) looks at the distribution of Synsets by part-of-speech (pos) category:

  #!/usr/bin/env python

  from nltk.corpus import wordnet as wn
  from collections import defaultdict

  def wn_pos_dist():
      """Count the Synsets in each WordNet POS category."""
      # One-dimensional count dict with 0 as the default value:
      cats = defaultdict(int)
      # The counting loop:
      for synset in wn.all_synsets():
          cats[synset.pos] += 1
      # Print the results to the screen:
      for tag, count in cats.items():
          print tag, count
      # Total number (sum of the above):
      print 'Total', sum(cats.values())

If all is well with your set-up, calling wn_pos_dist() will print out this information (a: adjective, s: adjective satellite, n: noun, r: adverb, v: verb):

  a 7463
  n 82115
  s 10693
  r 3621
  v 13767
  Total 117659

The Synset documentation provides the full list of methods and attributes for these objects. The list is quite tantalizing from the point of view of forming scales, because the methods include a few that seem directly keyed into the hierarchies that drive scalar reasoning. Some examples:

  >>> tree = wn.synsets('tree', 'n')[0]
  >>> tree.definition
  'a tall perennial woody plant having a main trunk and branches forming a distinct elevated crown; includes both gymnosperms and angiosperms'
  >>> # A is a hypernym of B iff B is a type of A.
  >>> tree.hypernyms()
  [Synset('woody_plant.n.01')]
  >>> # The most abstract/general containing class for A.
  >>> tree.root_hypernyms()
  [Synset('entity.n.01')]
  >>> # A is a hyponym of B iff A is a type of B.
  >>> tree.hyponyms()
  [Synset('calaba.n.01'), Synset('australian_nettle.n.01'), Synset('caracolito.n.01'), ...
  >>> # A is a member holonym of B iff B-type things are members of A-type things.
  >>> tree.member_holonyms()
  [Synset('forest.n.01')]
  >>> # A is a substance holonym of B iff A-type things are made of B-type things.
  >>> wn.synsets('flour', 'n')[0].substance_holonyms()
  [Synset('bread.n.01'), Synset('dough.n.01'), Synset('pastry.n.02')]
  >>> # A is a part holonym of B iff B-type things are subparts of A-type things.
  >>> wn.synsets('bark', 'n')[0].part_holonyms()
  [Synset('root.n.01'), Synset('trunk.n.01'), Synset('branch.n.02')]
  >>> # A is a member meronym of B iff A-type things are members of B-type things.
  >>> wn.synsets('forest', 'n')[0].member_meronyms()
  [Synset('underbrush.n.01'), Synset('tree.n.01')]
  >>> # A is a substance meronym of B iff B-type things are made of A-type things.
  >>> tree.substance_meronyms()
  [Synset('heartwood.n.01'), Synset('sapwood.n.01')]
  >>> # A is a part meronym of B iff A-type things are parts of B-type things.
  >>> tree.part_meronyms()
  [Synset('burl.n.02'), Synset('crown.n.07'), Synset('stump.n.01'), ...

In addition, there are two methods for directly relating Synset objects:

  >>> flower = wn.synsets('flower', 'n')[0]
  >>> tree.common_hypernyms(flower)
  [Synset('plant.n.02'), Synset('living_thing.n.01'), Synset('physical_entity.n.01'), ...
  >>> tree.lowest_common_hypernyms(flower)
  [Synset('vascular_plant.n.01')]

Verbs have two of their own methods that could be useful for scalar implicature:

  >>> verb = wn.synsets('transfer', 'v')[4]
  >>> verb.entailments()
  [Synset('move.v.02')]
  >>> verb.causes()
  [Synset('change_hands.v.01')]

With so many methods to look at, it often proves useful to have a quick way of summarizing the information attached to a given Synset instance. The following function does this:

  #!/usr/bin/env python

  from nltk.corpus import wordnet as wn
  from collections import defaultdict

  def synset_method_values(synset):
      """
      For a given synset, get all the (method_name, value) pairs
      for that synset. Returns the list of such pairs.
      """
      name_value_pairs = []
      # All the available synset methods:
      method_names = ['hypernyms', 'instance_hypernyms',
                      'hyponyms', 'instance_hyponyms',
                      'member_holonyms', 'substance_holonyms', 'part_holonyms',
                      'member_meronyms', 'substance_meronyms', 'part_meronyms',
                      'attributes', 'entailments', 'causes',
                      'also_sees', 'verb_groups', 'similar_tos']
      for method_name in method_names:
          # Get the method's value for this synset based on its string name:
          method = getattr(synset, method_name)
          vals = method()
          name_value_pairs.append((method_name, vals))
      return name_value_pairs

An example of the above in action (assuming that the above is saved in a file called wordnet_functions.py):

  >>> from wordnet_functions import synset_method_values
  >>> tree = wn.synsets('tree', 'n')[0]
  >>> for key, val in synset_method_values(tree):
  ...     print key, val
  ...
  hypernyms [Synset('woody_plant.n.01')]
  instance_hypernyms []
  hyponyms [Synset('calaba.n.01'), Synset('australian_nettle.n.01'), ...
  instance_hyponyms []
  member_holonyms [Synset('forest.n.01')]
  substance_holonyms []
  part_holonyms []
  member_meronyms []
  substance_meronyms [Synset('heartwood.n.01'), Synset('sapwood.n.01')]
  part_meronyms [Synset('burl.n.02'), Synset('crown.n.07'), ...
  attributes []
  entailments []
  causes []
  also_sees []
  verb_groups []
  similar_tos []

As you can see, even a richly hierarchical lexical item like this has non-empty values for only a small subset of the Synset methods. In fact, the distribution of non-empty values is uneven and depends on many factors. One very important factor is part of speech. The following function uses synset_method_values() to gather counts of all the non-empty values by POS:

  def synset_methods():
      """
      Iterate through all of the synsets in WordNet. For each,
      iterate through all the Synset methods, creating a mapping

      method_name --> pos --> count

      where pos is a WordNet pos and count is the number of Synsets that
      have non-empty values for method_name.
      """
      # Two-dimensional count dict with 0 as the default final value:
      d = defaultdict(lambda : defaultdict(int))
      # Iterate through all the synsets using wn.all_synsets():
      for synset in wn.all_synsets():
          for method_name, vals in synset_method_values(synset):
              if vals: # If vals is non-empty:
                  d[method_name][synset.pos] += 1
      return d
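If you want to eyeball the resulting dictionary without any formatting niceties, something like the following will print it as a rough table (this helper, print_table(), is my addition, not part of the code above):

  def print_table(d, poses=('a', 's', 'n', 'r', 'v')):
      """Print the method_name --> pos --> count dict d as an aligned table."""
      print '%-28s' % 'method', ' '.join('%8s' % pos for pos in poses)
      for method_name in sorted(d):
          counts = [d[method_name][pos] for pos in poses]
          print '%-28s' % method_name, ' '.join('%8s' % c for c in counts)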

Table METHODS formats the dictionary d returned by synset_methods():

  method                   a      s      n     r      v
  hypernyms                0      0  74389     0  13208
  instance_hypernyms       0      0   7730     0      0
  hyponyms                 0      0  16693     0   3315
  instance_hyponyms        0      0    945     0      0
  member_holonyms          0      0  12201     0      0
  substance_holonyms       0      0    551     0      0
  part_holonyms            0      0   7859     0      0
  member_meronyms          0      0   5553     0      0
  substance_meronyms       0      0    666     0      0
  part_meronyms            0      0   3699     0      0
  attributes             620      0    320     0      0
  entailments              0      0      0     0    390
  causes                   0      0      0     0    218
  also_sees             1333      0      0     0      1
  verb_groups              0      0      0     0   1498
  similar_tos           2512  10693      0     0      0
  total                 7463  10693  82115  3621  13767
Table METHODS
The distribution of non-empty return values for various Synset methods, by POS.

This suggests that WordNet might primarily be useful in the nominal domain, with verbs somewhat well covered too. However, none of the requisite relations apply to adverbs ('r'), and none of the ones useful for scales apply to adjectives ('a'). SentiWordNet (discussed below) and the review data we discuss next can be seen as attempts to fill this gap in WordNet's coverage.

Lemma lists: From strings into WordNet

Parallel to wn.synsets(), the function wn.lemmas() will take you from strings to lists of Lemma objects:

  >>> wn.lemmas('idle')
  [Lemma('idle.n.01.idle'), Lemma('idle.v.01.idle'), Lemma('idle.v.02.idle'),
   Lemma('idle.a.01.idle'), Lemma('baseless.s.01.idle'), Lemma('idle.s.03.idle'),
   Lemma('idle.s.04.idle'), Lemma('idle.s.05.idle'), Lemma('dead.s.09.idle'),
   Lemma('idle.s.07.idle')]
  >>> wn.lemmas('idle', 'a')
  [Lemma('idle.a.01.idle'), Lemma('baseless.s.01.idle'), Lemma('idle.s.03.idle'),
   Lemma('idle.s.04.idle'), Lemma('idle.s.05.idle'), Lemma('dead.s.09.idle'),
   Lemma('idle.s.07.idle')]

Lemmas

Lemmas are the most intuitively word-like objects in WordNet.

  >>> idle_synsets = wn.synsets('idle', 'a')
  >>> baseless = idle_synsets[1]
  >>> baseless.name
  'baseless.s.01'
  >>> baseless.definition
  'without a basis in reason or fact'
  >>> baseless.lemmas
  [Lemma('baseless.s.01.baseless'), Lemma('baseless.s.01.groundless'),
   Lemma('baseless.s.01.idle'), Lemma('baseless.s.01.unfounded'),
   Lemma('baseless.s.01.unwarranted'), Lemma('baseless.s.01.wild')]

The output of baseless.lemmas is a list of Lemma objects. Let's look at the fourth:

  >>> lem = baseless.lemmas[3]
  >>> lem.name
  'unfounded'
  >>> # I always forget that .pos doesn't work for lemmas, only for their synsets.
  >>> lem.pos
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  AttributeError: 'Lemma' object has no attribute 'pos'
  >>> lem.synset.pos
  's'

Lemmas can't be identified with (string, pos) pairs, because different lemmas can be identical with regard to those two values. This is the role that the sense index plays:

  >>> wn.lemma('idle.s.03.idle').synset.definition
  'not in active use'
  >>> wn.lemma('idle.s.04.idle').synset.definition
  'silly or trivial'

Nonetheless, when we move between WordNet and unstructured text, we are often forced to act as though a lemma were just a (string, pos) pair (or even just a string, if we lack POS annotations).

The number of lemmas is much larger than the number of synsets:

  >>> # Use a generator expression to efficiently count lemmas:
  >>> sum(len(synset.lemmas) for synset in wn.all_synsets())
  206978
  >>> # Compare to the number of synsets:
  >>> len(list(wn.all_synsets()))
  117659

Lemma objects have a lot of methods associated with them but, as with Synset objects, the majority return empty values. The functions lemma_method_values() and lemma_methods() are parallel to synset_method_values() and synset_methods():

  def lemma_method_values(lemma):
      """
      For a given lemma, get all the (method_name, value) pairs
      for that lemma. Returns the list of such pairs.
      """
      name_value_pairs = []
      # All the available lemma methods:
      method_names = [# These are sometimes non-empty for Lemmas:
                      'antonyms', 'derivationally_related_forms',
                      'also_sees', 'verb_groups', 'pertainyms',
                      # These were undefined for Lemmas in earlier versions
                      # of NLTK but are now defined:
                      'topic_domains', 'region_domains', 'usage_domains',
                      # These are always empty for Lemmas:
                      'hypernyms', 'instance_hypernyms',
                      'hyponyms', 'instance_hyponyms',
                      'member_holonyms', 'substance_holonyms',
                      'part_holonyms', 'member_meronyms',
                      'substance_meronyms', 'part_meronyms',
                      'attributes', 'entailments', 'causes', 'similar_tos']
      for method_name in method_names:
          # Check to make sure the method is defined:
          if hasattr(lemma, method_name):
              method = getattr(lemma, method_name)
              # Get the values from running that method:
              vals = method()
              name_value_pairs.append((method_name, vals))
      return name_value_pairs
  29.  
  def lemma_methods():
      """
      Iterate through all of the lemmas in WordNet. For each,
      iterate through all the Lemma methods, creating a mapping

      method_name --> pos --> count

      where pos is a WordNet pos and count is the number of Lemmas that
      have non-empty values for method_name.
      """
      # Two-dimensional count dict with 0 as the default final value:
      d = defaultdict(lambda : defaultdict(int))
      for synset in wn.all_synsets():
          for lemma in synset.lemmas:
              for method_name, vals in lemma_method_values(lemma):
                  if vals: # If vals is non-empty:
                      d[method_name][synset.pos] += 1
      return d

Table LEMMAS summarizes the distribution just for the relations that are non-empty for at least one Lemma. (Note: topic_domains(), region_domains(), and usage_domains() are not defined for Lemma objects in some versions of NLTK, which is why lemma_method_values() checks with hasattr() before calling each method.)

  method                          a     s      n     r      v
  antonyms                     3872     0   2120   707   1069
  derivationally_related_forms 4725   580  62675    81  13102
  also_sees                       0     0      0     0    324
  verb_groups                     0     0      0     0      2
  pertainyms                   4665     0      0  3220      0
  topic_domains                   5     1      3     0      1
  region_domains                  0     1     14     0      0
  usage_domains                   0     1    365     0      2
  Table LEMMAS
  The distribution of non-empty return values for various Lemma methods, by POS.

For additional details and methods, see the documentation for NLTK Lemma objects.

SentiWordNet extension

As the above overview of WordNet Synset and Lemma objects makes clear, we have relatively little information about where adjectives and adverbs fit into the overall hierarchy. To some extent, SentiWordNet can fill this void. SentiWordNet extends Synsets with positive and negative sentiment scores. The extension was achieved via a complex mix of propagation methods and classifiers. It is thus not a gold standard resource like WordNet (which was compiled by humans), but it has proven useful in a wide range of tasks.

SentiWordNet is distributed as a single file with the following basic structure:

# POS	ID	PosScore	NegScore	SynsetTerms	Gloss
a	00001740	0.125	0	able#1	(usually followed by `to') having the necessary means or [...]
a	00002098	0	0.75	unable#1	(usually followed by `to') not having the necessary means or [...]
a	00002312	0	0	dorsal#2 abaxial#1	facing away from the axis of an organ or organism; [...]
a	00002527	0	0	ventral#2 adaxial#1	nearest to or facing toward the axis of an organ or organism; [...]
a	00002730	0	0	acroscopic#1	facing or on the side toward the apex
a	00002843	0	0	basiscopic#1	facing or on the side toward the base
a	00002956	0	0	abducting#1 abducent#1	especially of muscles; [...]
a	00003131	0	0	adductive#1 adducting#1 adducent#1	especially of muscles; [...]
a	00003356	0	0	nascent#1	being born or beginning; [...]
a	00003553	0	0	emerging#2 emergent#2	coming into existence; [...]
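The format is simple enough that you can also parse the file directly if you prefer. A minimal sketch, assuming the tab-separated layout shown above (parse_sentiwordnet is my name for the helper, not part of any distribution):

  def parse_sentiwordnet(filename):
      """Yield (pos, offset, pos_score, neg_score, terms, gloss) tuples."""
      for line in open(filename):
          # Skip the comment lines and any blank lines:
          if line.startswith('#') or not line.strip():
              continue
          pos, offset, pos_score, neg_score, terms, gloss = line.rstrip('\n').split('\t')
          # terms is a space-separated list of word#sensenumber strings:
          terms = terms.split(' ')
          yield (pos, offset, float(pos_score), float(neg_score), terms, gloss)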

I've written a Python interface to SentiWordNet. Just place it in the same directory as your SentiWordNet source file (restricted link) and then work like this:

  >>> from sentiwordnet import SentiWordNetCorpusReader, SentiSynset
  >>> swn_filename = 'SentiWordNet_3.0.0_20100705.txt'
  >>> swn = SentiWordNetCorpusReader(swn_filename)

You create SentiSynset objects from their WordNet string representations:

  >>> swn.senti_synset('breakdown.n.03')
  breakdown.n.03 PosScore: 0.0 NegScore: 0.25

You can get SentiSynset lists just as with the NLTK WordNet interface:

  >>> swn.senti_synsets('slow')
  [SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'),
   SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'),
   SentiSynset('slow.a.04'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]
  >>> happy = swn.senti_synsets('happy', 'a')[0]
  >>> happy.pos_score
  0.625
  >>> happy.neg_score
  0.25
  >>> happy.obj_score
  0.125

(Since no sentiment information is attached directly to Lemmas by SentiWordNet, there is not a corresponding function swn.senti_lemmas(). However, if senti_syn is a SentiSynset object, then senti_syn.synset.lemmas will deliver the associated list.)
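A rough substitute for the missing function can be pieced together from the above. This is just a sketch (senti_lemma_scores is my name, and I assume senti_synsets() accepts the same arguments as wn.synsets()):

  def senti_lemma_scores(string, pos=None):
      """Map a string (and optional pos) to (lemma name, pos_score, neg_score)
      triples, one for each lemma of each matching SentiSynset."""
      triples = []
      for senti_synset in swn.senti_synsets(string, pos):
          for lemma in senti_synset.synset.lemmas:
              triples.append((lemma.name, senti_synset.pos_score, senti_synset.neg_score))
      return triples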

You can iterate through all of the synsets in SentiWordNet. Here is a function for doing that and printing the name of the synset (from WordNet) along with its positive and negative scores (from SentiWordNet):

  #!/usr/bin/env python

  from sentiwordnet import SentiWordNetCorpusReader

  def senti_synset_viewer():
      swn = SentiWordNetCorpusReader('SentiWordNet_3.0.0_20100705.txt')
      for senti_synset in swn.all_senti_synsets():
          print senti_synset.synset.name, senti_synset.pos_score, senti_synset.neg_score

Here is the beginning of the output of senti_synset_viewer():

  snowmobile.n.01 0.0 0.0
  fortunate.s.02 0.875 0.0
  temperature.n.02 0.0 0.25
  summer.n.02 0.0 0.0
  whirring.s.01 0.0 0.0
  presbytes.n.01 0.0 0.0
  pure.a.06 0.5 0.0
  moleskin.n.01 0.0 0.0
  sidewalk.n.01 0.0 0.0
  danaus.n.01 0.0 0.0
  ...

The number of SentiSynsets is the same as the number of Synsets:

  >>> len(list(wn.all_synsets()))
  117659
  >>> len(list(swn.all_senti_synsets()))
  117659

However, there are discrepancies between the current version of WordNet and SentiWordNet, so one needs to write code that plans for mismatches.
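For example, a name drawn from wn.all_synsets() might be missing from your SentiWordNet file (or vice versa). A defensive wrapper along these lines can help (a sketch; safe_senti_synset is my name, and I assume a failed lookup raises an exception, or returns None, which the wrapper passes through):

  def safe_senti_synset(swn, name):
      """Return the SentiSynset for name, or None if the lookup fails."""
      try:
          return swn.senti_synset(name)
      except Exception:
          return None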

A function for obtaining lexical alternatives

Now that we've explored the basics of WordNet, we can put the pieces together into a function that might be useful for understanding scalar implicature. My initial attempt at this:

  #!/usr/bin/env python

  from nltk.corpus import wordnet as wn
  from collections import defaultdict

  def wordnet_relations(word1, word2):
      """
      Uses the lemmas and synsets associated with word1 and word2 to
      gather all relationships between these two words. There is
      imprecision in this, since we range over all the lemmas and
      synsets consistent with each (string, pos) pair, but it seems
      to work well in practice.

      Arguments:
      word1, word2 (str, str) -- (string, pos) pairs

      Value:
      rels (set of str) -- the set of all WordNet relations that hold
      between word1 and word2
      """
      # This function ensures that we have a well-formed WordNet pos
      # (or None for that value):
      s1, t1 = wordnet_sanitize(word1)
      s2, t2 = wordnet_sanitize(word2)
      # Output set of strings:
      rels = set()
      # Loop through both synset and lemma relations:
      for lem1 in wn.lemmas(s1, t1):
          lemma_methodname_value_pairs = lemma_method_values(lem1)
          synset_methodname_value_pairs = synset_method_values(lem1.synset)
          for lem2 in wn.lemmas(s2, t2):
              # Lemma relations:
              for rel, rel_lemmas in lemma_methodname_value_pairs:
                  if lem2 in rel_lemmas:
                      rels.add(rel)
              # Synset relations:
              for rel, rel_synsets in synset_methodname_value_pairs:
                  if lem2.synset in rel_synsets:
                      rels.add(rel)
      return rels

  def wordnet_sanitize(word):
      """
      Ensure that word is a (string, pos) pair that WordNet can understand.

      Argument: word (str, str) -- a (string, pos) pair

      Value: a possibly modified (string, pos) pair, where pos=None if
      the input pos is outside of WordNet.
      """
      string, tag = word
      string = string.lower()
      tag = tag.lower()
      # Map Penn Treebank-style tags to WordNet tags:
      if tag.startswith('v'):    tag = 'v'
      elif tag.startswith('n'):  tag = 'n'
      elif tag.startswith('j'):  tag = 'a'
      elif tag.startswith('rb'): tag = 'r'
      if tag in ('a', 'n', 'r', 'v'):
          return (string, tag)
      else:
          return (string, None)

The function wordnet_relations() takes two (string, pos) pairs as input and then uses wn.lemmas() (and, through each Lemma, its Synset) to move into WordNet and begin relating them. As the docstring says, this process is approximate, because the (string, pos) pairs might be compatible with a range of different lemmas and synsets, but it's the only choice we have (and we will see later that it works well in practice).

To get a feel for this function, let's play around with it a bit:

  >>> from wordnet_functions import wordnet_relations
  >>> wordnet_relations(('tree', 'n'), ('elm', 'n'))
  set(['hyponyms'])
  >>> wordnet_relations(('good', 'a'), ('bad', 'a'))
  set(['antonyms'])
  >>> wordnet_relations(('happy', 'a'), ('sad', 'a'))
  set([])
  >>> wordnet_relations(('run', 'v'), ('move', 'v'))
  set(['hypernyms'])
  >>> wordnet_relations(('move', 'v'), ('run', 'v'))
  set(['hyponyms'])
  >>> wordnet_relations(('flour', 'n'), ('bread', 'n'))
  set(['substance_holonyms'])

Exercises

SYNSETS Get a feel for what Synsets are like. For this, you can use either the NLTK interface or the Web or command-line interface. (An in-between option is to run from nltk.app import wordnet; wordnet(), which serves a browser-based interface from a local server.)

  1. Pick a domain (e.g., mammals, sciences, vehicles).
  2. Search (string, pos) pairs from that domain and see what Synsets associate with them. Note any worrisome gaps in coverage or potential confusion of senses, etc.
  3. Use the Synset relations (e.g., hypernyms, hyponyms) to explore the interconnections within this domain. How does the coverage look?
  4. Optional (requires NLTK): Use lowest_common_hypernyms() to explore pairs of Synsets in your domain. Do the return values generally belong specifically to your domain, or are they more/overly general?
  5. Based on the above findings, provide an overall assessment of WordNet's coverage in your chosen domain.

PATHS NLTK provides a number of path similarity measures for Synsets, via its WordNetCorpusReader class. For example:

  >>> from nltk.corpus import wordnet as wn
  >>> tree = wn.synsets('tree', 'n')[0]
  >>> flower = wn.synsets('flower', 'n')[0]
  >>> wn.path_similarity(tree, flower)
  0.16666666666666666

Pick a domain (e.g., mammals, sciences, vehicles). (If you did problem SYNSETS, you might continue with your domain from there.) Check on the path similarity of pairs of things inside and outside your domain. Do the results make sense? Feel free to compare the path similarity measures provided to see how they are alike and how they are different. Might there be applications of these measures to scalar implicature?
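For the comparison part of this exercise, NLTK defines other measures alongside path_similarity(), including wup_similarity() (Wu-Palmer) and lch_similarity() (Leacock-Chodorow; it requires that the two Synsets share a POS). A quick way to line them up (a sketch):

  tree = wn.synsets('tree', 'n')[0]
  flower = wn.synsets('flower', 'n')[0]
  for measure in (wn.path_similarity, wn.wup_similarity, wn.lch_similarity):
      print measure.__name__, measure(tree, flower)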

VERBS The verbal relations 'causes' and 'entailments' remain somewhat mysterious to me. Using your preferred interface, sample around to try to get a sense for which verbs have non-empty values for these relations and what those values are like. NLTK users might use synset_method_values() and a modified version of synset_methods() to answer this question comprehensively, by pulling out all and only the relevant Synsets. In your report, summarize your findings briefly.
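As a starting point for the comprehensive route, the following sketch restricts all_synsets() to verbs and keeps just those with a non-empty value for the relation of interest (verb_synsets_with is my name for it):

  def verb_synsets_with(method_name):
      """Return the list of verb Synsets with non-empty values for method_name."""
      return [synset for synset in wn.all_synsets('v')
              if getattr(synset, method_name)()]

For example, verb_synsets_with('causes') should return the Synsets counted in the 'v' column of the causes row of Table METHODS.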

LEMMAEX Explore the Lemma relations in the context of scalar reasoning.

  1. Think up 5 antonym pairs. Which are captured via the antonyms relation in WordNet?
  2. Are pertainyms reliably ordered via scales? Explore this question by sampling. (You might use lemma_method_values() and a variant of lemma_methods() to home in on the lemmas that have non-empty values here.)
  3. Are also_sees reliably ordered via scales? Explore this question by sampling.

SENTISCORES SentiWordNet offers continuous values between 0 and 1 for both positivity and negativity scores. What is the distribution of these scores like? For this, you might want to use Python to create a CSV file pairing words with their positive and negative scores. A start:

  #!/usr/bin/env python

  import csv
  from sentiwordnet import SentiWordNetCorpusReader

  def sentiwordnet_scores_to_csv():
      swn = SentiWordNetCorpusReader('SentiWordNet_3.0.0_20100705.txt')
      csvwriter = csv.writer(file('sentiwordnet-scores.csv', 'w'))
      csvwriter.writerow(['Word', 'Tag', 'PosScore', 'NegScore'])
      for senti_synset in swn.all_senti_synsets():
          synset = # Complete this by getting senti_synset's synset.
          tag = # Complete this by getting synset's pos.
          # Treat adjective satellites as adjectives:
          if tag == 's':
              tag = 'a'
          pos_score = # Complete this by getting the positive score.
          neg_score = # Complete this by getting the negative score.
          for lemma in senti_synset.synset.lemmas:
              row = [lemma.name, tag, pos_score, neg_score]
              csvwriter.writerow(row)

Once you have the CSV file, you might also check on the distribution of the SentiWordNet objectivity values, defined as 1 - (neg_score+pos_score). What important patterns do you see in the examples, and what consequences might this have for using this resource to build scales?

RELEXPLORE Pick three examples from the IQAP corpus (Question–answer relationships contains a sample) and try to determine whether wordnet_relations, if properly applied, would get those examples right. What assumptions would we have to make to achieve the best results?

IQAP (This problem pairs well with the corresponding one for the IMDB data.) Recall that the iqap.Item methods question_contrast_pred_trees and answer_contrast_pred_trees extract the subtrees rooted at -CONTRAST nodes from the questions and answers. If we are going to use WordNet to compare these trees, then we should first get a sense for WordNet's coverage of these nodes. As a first step towards doing this, the following code loops through the development set of the IQAP corpus, pulling just the items that have a single '-CONTRAST' word in both the question and the answer. Finish this code so that it uses wordnet_relations to compare these words.

  def check_wordnet_one_worder_coverage():
      corpus = IqapReader('iqap-data.csv')
      for item in corpus.dev_set():
          q_words = item.question_contrast_pred_pos()
          a_words = item.answer_contrast_pred_pos()
          if len(q_words) == 1 and len(a_words) == 1:
              # Use wordnet_relations to compare the two: