Learning lexical scales: WordNet and SentiWordNet
- Overview
- WordNet structure
- Synset lists: From strings into WordNet
- Synsets
- Lemma lists: From strings into WordNet
- Lemmas
- SentiWordNet extension
- A function for obtaining lexical alternatives
- Exercises
This section is about using WordNet
to build lexical scales. For the exploration, I use the
NLTK WordNet module.
There are a variety of other interfaces:
- Web interface
- Download (includes browsing tools)
- Browser interface
- Other APIs
I use only the English WordNet, but there are WordNets for other
languages as well:
- OpenThesaurus (German)
- Global WordNet (with free downloads for at least Arabic, Danish, French, Hindi, Russian, Tamil)
The next few subsections are a fast overview of the structure of
WordNet, using NLTK Python code. If you're new to using WordNet, I
recommend pausing right now to read
section 2.5 of the NLTK book.
Once that's done, start Python's command-line interpreter, type this, and hit enter:
- from nltk.corpus import wordnet as wn
This loads the WordNet module, which provides access to the structure of WordNet (plus other cool functionality).
The two most important WordNet constructs are lemmas and synsets:
- Lemma: near to the linguistic concept of
a word. Lemmas are identified by strings
like idle.s.03.unused, where
- idle is the stem identifier for the
Synset containing
this Lemma
- s is the WordNet part of speech
- 03 is the sense number (01 ...)
- unused is the morphological form
- Synset: A collection
of Lemmas that are synonymous (by the
standards of WordNet). Synsets are
identified by strings like idle.s.03 where
- idle is the canonical string name
- s is the WordNet part of speech
- 03 is the sense number (01 ...);
sense 01 is considered the primary sense
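These identifier strings can also be used directly for look-up. A quick illustration using the identifiers from the bullets above (the wn module is the one loaded earlier):
- # Look up a Synset by its identifier string:
- wn.synset('idle.s.03')
- Synset('idle.s.03')
- # Look up a Lemma by its identifier string:
- wn.lemma('idle.s.03.unused')
- Lemma('idle.s.03.unused')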
One of the bridges from strings into the structured objects of
WordNet is the function wn.synsets(), which
returns the list of Synset objects compatible
with the string, or string–tag pair, provided. (The other such
bridge is wn.lemmas().)
- wn.synsets('idle')
- [Synset('idle.n.01'), Synset('idle.v.01'), Synset('idle.v.02'), \
Synset('idle.a.01'), Synset('baseless.s.01'), \
Synset('idle.s.03'), Synset('idle.s.04'), Synset('idle.s.05'), \
Synset('dead.s.09'), Synset('idle.s.07')]
- wn.synsets('idle', 'a')
- [Synset('idle.a.01'), Synset('baseless.s.01'), Synset('idle.s.03'), \
Synset('idle.s.04'), Synset('idle.s.05'), \
Synset('dead.s.09'), Synset('idle.s.07')]
- wn.synsets('idle', 'v')
- [Synset('idle.v.01'), Synset('idle.v.02')]
- wn.synsets('idle', 'n')
- [Synset('idle.n.01')]
- wn.synsets('idle', 'r')
- []
The first member of these lists is the primary (most frequent)
sense for the input supplied.
The outputs are Python lists, and thus all of the list methods are
available for them. If you're just getting started with Python, you
might toy around with them as lists using methods from
the Python list documentation.
A couple of examples:
- # Store the list in a variable:
- idle_a = wn.synsets('idle', 'a')
- # Get the 3rd element (counting starts at 0):
- idle_a[2]
- Synset('idle.s.03')
- # Reverse the list (changes the list in place):
- idle_a.reverse()
- idle_a
- [Synset('idle.s.07'), Synset('dead.s.09'), Synset('idle.s.05'), \
Synset('idle.s.04'), Synset('idle.s.03'), Synset('baseless.s.01'), \
Synset('idle.a.01')]
Let's grab one of the synsets returned by
wn.synsets('idle', 'a') and work with it briefly:
- idle_synsets = wn.synsets('idle', 'a')
- baseless = idle_synsets[1]
- baseless.definition
- 'without a basis in reason or fact'
- baseless.examples
- ['baseless gossip', 'the allegations proved groundless', 'idle fears', \
'unfounded suspicions', 'unwarranted jealousy']
- baseless.lemmas
- [Lemma('baseless.s.01.baseless'), Lemma('baseless.s.01.groundless'),\
Lemma('baseless.s.01.idle'), Lemma('baseless.s.01.unfounded'), \
Lemma('baseless.s.01.unwarranted'), Lemma('baseless.s.01.wild')]
If you're just starting out with Python, you might pause now to experiment some
more with Synset objects, by trying out methods and manipulations based on the
Synset documentation.
If syn is your synset,
then syn.method() will deliver the value
for all the different choices of method
(e.g., hypernyms(),
part_meronyms()),
and syn.attribute will give you the value
of each attribute
(name, pos,
lemmas, definition,
examples, offset).
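For example, picking up the baseless Synset from above:
- baseless.name
- 'baseless.s.01'
- baseless.pos
- 's'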
All of the above manipulations can easily be done with the Web or
command-line interface to WordNet itself. The power of working within
a full programming language is that we can also take a high-level
perspective on the data. For example, the following code (which you
might save as a file) looks at the distribution of Synsets by
part-of-speech (pos) category:
- #!/usr/bin/env python
-
- from nltk.corpus import wordnet as wn
- from collections import defaultdict
-
- def wn_pos_dist():
- """Count the Synsets in each WordNet POS category."""
- # One-dimensional count dict with 0 as the default value:
- cats = defaultdict(int)
- # The counting loop:
- for synset in wn.all_synsets():
- cats[synset.pos] += 1
- # Print the results to the screen:
- for tag, count in cats.items():
- print tag, count
- # Total number (sum of the above):
- print 'Total', sum(cats.values())
If all is well with your set-up, the above code will print out this
information:
- a 7463
n 82115
s 10693
r 3621
v 13767
Total 117659
The Synset documentation
provides the full list of methods and attributes for
these objects. The list is quite tantalizing from the point of view
of forming scales, because the methods include a few that seem
directly keyed into the hierarchies that drive scalar reasoning. Some
examples:
- tree = wn.synsets('tree', 'n')[0]
- tree.definition
- 'a tall perennial woody plant having a main trunk and \
branches forming a distinct elevated crown; includes both gymnosperms \
and angiosperms'
- # A is a hypernym of B iff B is a type of A.
- tree.hypernyms()
- [Synset('woody_plant.n.01')]
- # The most abstract/general containing class for A.
- tree.root_hypernyms()
- [Synset('entity.n.01')]
- # A is a hyponym of B iff A is a type of B.
- tree.hyponyms()
- [Synset('calaba.n.01'), Synset('australian_nettle.n.01'), Synset('caracolito.n.01'), ...
- # A is a member holonym of B iff B-type things are members of A-type things.
- tree.member_holonyms()
- [Synset('forest.n.01')]
- # A is a substance holonym of B iff A-type things are made of B-type things.
- wn.synsets('flour', 'n')[0].substance_holonyms()
- [Synset('bread.n.01'), Synset('dough.n.01'), Synset('pastry.n.02')]
- # A is a part holonym of B iff B-type things are subparts of A-type things.
- wn.synsets('bark', 'n')[0].part_holonyms()
- [Synset('root.n.01'), Synset('trunk.n.01'), Synset('branch.n.02')]
- # A is a member meronym of B iff A-type things are members of B-type things.
- wn.synsets('forest', 'n')[0].member_meronyms()
- [Synset('underbrush.n.01'), Synset('tree.n.01')]
- # A is a substance meronym of B iff B-type things are made of A-type things.
- tree.substance_meronyms()
- [Synset('heartwood.n.01'), Synset('sapwood.n.01')]
- # A is a part meronym of B iff A-type things are parts of B-type things.
- tree.part_meronyms()
- [Synset('burl.n.02'), Synset('crown.n.07'), Synset('stump.n.01'), ...
In addition, there are two methods for directly
relating pairs of Synset objects:
- flower = wn.synsets('flower', 'n')[0]
- tree.common_hypernyms(flower)
- [Synset('plant.n.02'), Synset('living_thing.n.01'), Synset('physical_entity.n.01'), ...
- tree.lowest_common_hypernyms(flower)
- [Synset('vascular_plant.n.01')]
Verbs have two of their own methods that could be useful for
scalar implicature:
- verb = wn.synsets('transfer', 'v')[4]
- verb.entailments()
- [Synset('move.v.02')]
- verb.causes()
- [Synset('change_hands.v.01')]
With so many methods to look at, I find it useful to
have a quick way of summarizing the information attached to a given
Synset instance. The following function
does this:
- #!/usr/bin/env python
-
- from nltk.corpus import wordnet as wn
- from collections import defaultdict
-
- def synset_method_values(synset):
- """
- For a given synset, get all the (method_name, value) pairs
- for that synset. Returns the list of such pairs.
- """
- name_value_pairs = []
- # All the available synset methods:
- method_names = ['hypernyms', 'instance_hypernyms', 'hyponyms', 'instance_hyponyms',
'member_holonyms', 'substance_holonyms', 'part_holonyms',
'member_meronyms', 'substance_meronyms', 'part_meronyms',
'attributes', 'entailments', 'causes', 'also_sees', 'verb_groups',
'similar_tos']
- for method_name in method_names:
- # Get the method's value for this synset based on its string name.
- method = getattr(synset, method_name)
- vals = method()
- name_value_pairs.append((method_name, vals))
- return name_value_pairs
An example of the above in action (assuming that the above is saved in a file
called wordnet_functions.py):
- from wordnet_functions import synset_method_values
- tree = wn.synsets('tree', 'n')[0]
- for key, val in synset_method_values(tree):
- print key, val
- hypernyms [Synset('woody_plant.n.01')]
instance_hypernyms []
hyponyms [Synset('calaba.n.01'), Synset('australian_nettle.n.01'), ...
instance_hyponyms []
member_holonyms [Synset('forest.n.01')]
substance_holonyms []
part_holonyms []
member_meronyms []
substance_meronyms [Synset('heartwood.n.01'), Synset('sapwood.n.01')]
part_meronyms [Synset('burl.n.02'), Synset('crown.n.07'), ...
attributes []
entailments []
causes []
also_sees []
verb_groups []
similar_tos []
As you can see, even a richly hierarchical lexical item like this
has non-empty values for only a small subset of
the Synset methods. In fact, the
distribution of non-empty values is uneven and depends on many
factors. One very important factor is part of speech. The following
method uses synset_method_values() to gather
counts for all the non-empty values relative to POS:
- def synset_methods():
- """
- Iterates through all of the synsets in WordNet. For each,
- iterate through all the Synset methods, creating a mapping
-
- method_name --> pos --> count
-
- where pos is a WordNet pos and count is the number of Synsets that
- have non-empty values for method_name.
- """
- # Two-dimensional count dict with 0 as the default final value:
- d = defaultdict(lambda : defaultdict(int))
- # Iterate through all the synsets using wn.all_synsets():
- for synset in wn.all_synsets():
- for method_name, vals in synset_method_values(synset):
- if vals: # If vals is nonempty:
- d[method_name][synset.pos] += 1
- return d
Table METHODS formats the
dictionary d returned by the above
function:
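If you want to print such a table yourself, the following minimal sketch will do it. Here pos_table() is just an illustrative helper name (not part of NLTK), and it assumes the two-level count dict returned by synset_methods():
- def pos_table(d, poses=('n', 'v', 'a', 's', 'r')):
-     """Print a method_name --> pos --> count dict as a tab-separated table."""
-     print 'method' + ''.join('\t' + pos for pos in poses)
-     for method_name in sorted(d.keys()):
-         print method_name + ''.join('\t%s' % d[method_name][pos] for pos in poses)
Calling pos_table(synset_methods()) should reproduce the counts behind Table METHODS.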
This suggests that WordNet might primarily be useful in the nominal
domain, with verbs somewhat well covered too. However, none of the
requisite relations apply to adverbs ('r'), and none of the useful ones
for scales apply to adjectives ('a'). SentiWordNet
(discussed below) and the review data
we'll discuss next (here) can be seen as
attempts to fill this gap in the coverage of WordNet.
exercise SYNSETS,
exercise PATHS,
exercise VERBS
Parallel to wn.synsets(), the
function wn.lemmas() will take you from
strings to lists of Lemma objects:
- wn.lemmas('idle')
- [Lemma('idle.n.01.idle'), Lemma('idle.v.01.idle'), Lemma('idle.v.02.idle'), \
Lemma('idle.a.01.idle'), Lemma('baseless.s.01.idle'), Lemma('idle.s.03.idle'), \
Lemma('idle.s.04.idle'), Lemma('idle.s.05.idle'), Lemma('dead.s.09.idle'), \
Lemma('idle.s.07.idle')]
- wn.lemmas('idle', 'a')
- [Lemma('idle.a.01.idle'), Lemma('baseless.s.01.idle'), Lemma('idle.s.03.idle'), \
Lemma('idle.s.04.idle'), Lemma('idle.s.05.idle'), Lemma('dead.s.09.idle'), \
Lemma('idle.s.07.idle')]
Lemmas
are the most intuitively word-like objects in WordNet.
- idle_synsets = wn.synsets('idle', 'a')
- baseless = idle_synsets[1]
- baseless.name
- 'baseless.s.01'
- baseless.definition
- 'without a basis in reason or fact'
- baseless.lemmas
- [Lemma('baseless.s.01.baseless'), Lemma('baseless.s.01.groundless'), \
Lemma('baseless.s.01.idle'), Lemma('baseless.s.01.unfounded'), \
Lemma('baseless.s.01.unwarranted'), Lemma('baseless.s.01.wild')]
The output of baseless.lemmas is a list
of Lemma objects. Let's look at the fourth:
- lem = baseless.lemmas[3]
- lem.name
- 'unfounded'
- # I always forget that .pos doesn't work for lemmas, only for their synsets.
- lem.pos
- Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Lemma' object has no attribute 'pos'
- lem.synset.pos
- 's'
Lemmas can't be identified with (string, pos) pairs, because
different lemmas can be identical with regard to those two values.
This is the role that the sense index plays:
- wn.lemma('idle.s.03.idle').synset.definition
- 'not in active use'
- wn.lemma('idle.s.04.idle').synset.definition
- 'silly or trivial'
Nonetheless, when we move to and from WordNet and unstructured
text, we are often forced to act as though a lemma was just a (string,
pos) pair (or even just a string, if we lack POS annotations).
The number of lemmas is much larger than the number of synsets:
- # Use a generator expression to efficiently count lemmas:
- sum((len(synset.lemmas) for synset in wn.all_synsets()))
- 206978
- # Compare to the number of synsets:
- len(list(wn.all_synsets()))
- 117659
Lemma objects have a lot of methods
associated with them, but, as with Synset
objects, the majority are empty. The
functions lemma_method_values() and
lemma_methods() are parallel to
synset_method_values()
and synset_methods():
- def lemma_method_values(lemma):
- """
- For a given lemma, get all the (method_name, value) pairs
- for that lemma. Returns the list of such pairs.
- """
- name_value_pairs = []
- # All the available lemma methods:
- method_names = [# These are sometimes non-empty for Lemmas:
- 'antonyms', 'derivationally_related_forms',
- 'also_sees', 'verb_groups', 'pertainyms',
- # These are defined for Lemmas only in some NLTK versions,
- # hence the hasattr() check below:
- 'topic_domains', 'region_domains', 'usage_domains',
- # These are always empty for Lemmas:
- 'hypernyms', 'instance_hypernyms',
- 'hyponyms', 'instance_hyponyms',
- 'member_holonyms', 'substance_holonyms',
- 'part_holonyms', 'member_meronyms',
- 'substance_meronyms', 'part_meronyms',
- 'attributes', 'entailments', 'causes', 'similar_tos']
- for method_name in method_names:
- # Check to make sure the method is defined:
- if hasattr(lemma, method_name):
- method = getattr(lemma, method_name)
- # Get the values from running that method:
- vals = method()
- name_value_pairs.append((method_name, vals))
- return name_value_pairs
-
- def lemma_methods():
- """
- Iterates through all of the lemmas in WordNet. For each,
- iterate through all the Lemma methods, creating a mapping
-
- method_name --> pos --> count
-
- where pos is a WordNet pos and count is the number of Lemmas that
- have non-empty values for method_name.
- """
- # Two-dimensional count dict with 0 as the default final value:
- d = defaultdict(lambda : defaultdict(int))
- for synset in wn.all_synsets():
- for lemma in synset.lemmas:
- for method_name, vals in lemma_method_values(lemma):
- if vals: # If vals is nonempty:
- d[method_name][synset.pos] += 1
- return d
Table LEMMAS summarizes the
distribution just for the non-empty relations. (Note: whether
topic_domains(), region_domains(), and
usage_domains() are defined for Lemma objects
depends on your NLTK version, which is why
lemma_method_values() checks with hasattr() before calling each method.)
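(The hypothetical pos_table() helper sketched earlier works for this dictionary too: pos_table(lemma_methods()).)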
For additional details and methods, see the documentation for
NLTK
Lemma objects.
exercise LEMMAEX
As the above overview of WordNet Synset and
Lemma objects makes clear, we
have relatively little information about where adjectives and adverbs
fit into the overall hierarchy. To some extent,
SentiWordNet can fill
this void. SentiWordNet extends Synsets with positive and negative
sentiment scores. The extension was achieved via a complex mix of
propagation methods and classifiers. It is thus not a gold standard
resource like WordNet (which was compiled by humans), but it has proven
useful in a wide range of tasks.
SentiWordNet is distributed as a single file with the following
basic structure:
# POS ID PosScore NegScore SynsetTerms Gloss
a 00001740 0.125 0 able#1 (usually followed by `to') having the necessary means or [...]
a 00002098 0 0.75 unable#1 (usually followed by `to') not having the necessary means or [...]
a 00002312 0 0 dorsal#2 abaxial#1 facing away from the axis of an organ or organism; [...]
a 00002527 0 0 ventral#2 adaxial#1 nearest to or facing toward the axis of an organ or organism; [...]
a 00002730 0 0 acroscopic#1 facing or on the side toward the apex
a 00002843 0 0 basiscopic#1 facing or on the side toward the base
a 00002956 0 0 abducting#1 abducent#1 especially of muscles; [...]
a 00003131 0 0 adductive#1 adducting#1 adducent#1 especially of muscles; [...]
a 00003356 0 0 nascent#1 being born or beginning; [...]
a 00003553 0 0 emerging#2 emergent#2 coming into existence; [...]
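The raw file can also be parsed directly. Here is a minimal sketch, assuming the six tab-separated columns shown above, '#'-prefixed comment lines, and space-separated word#sensenumber entries in the SynsetTerms column:
- def read_sentiwordnet(filename):
-     """Yield one dict per data line of a SentiWordNet source file."""
-     for line in open(filename):
-         # Skip comment lines and blank lines:
-         if line.startswith('#') or not line.strip():
-             continue
-         pos, offset, pos_score, neg_score, terms, gloss = line.strip().split('\t')
-         yield {'pos': pos, 'offset': offset,
-                'pos_score': float(pos_score), 'neg_score': float(neg_score),
-                'terms': terms.split(' '), 'gloss': gloss}
In practice, though, the interface described next is more convenient.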
I've written a Python interface to
SentiWordNet. Just place it in the same directory as your
SentiWordNet
source file (restricted link) and then work like this:
- from sentiwordnet import SentiWordNetCorpusReader, SentiSynset
- swn_filename = 'SentiWordNet_3.0.0_20100705.txt'
- swn = SentiWordNetCorpusReader(swn_filename)
You create SentiSynset objects from their
WordNet string representations:
- swn.senti_synset('breakdown.n.03')
- breakdown.n.03 PosScore: 0.0 NegScore: 0.25
You can get SentiSynset lists just as
with the NLTK WordNet interface:
- swn.senti_synsets('slow')
- [SentiSynset('decelerate.v.01'), SentiSynset('slow.v.02'), \
SentiSynset('slow.v.03'), SentiSynset('slow.a.01'), SentiSynset('slow.a.02'), \
SentiSynset('slow.a.04'), SentiSynset('slowly.r.01'), SentiSynset('behind.r.03')]
- happy = swn.senti_synsets('happy', 'a')[0]
- happy.pos_score
- 0.625
- happy.neg_score
- 0.25
- happy.obj_score
- 0.125
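The objectivity score is just what is left over after the positive and negative scores, so the three values sum to 1:
- happy.pos_score + happy.neg_score + happy.obj_score
- 1.0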
(Since no sentiment information is attached directly to Lemmas by
SentiWordNet, there is not a corresponding function
swn.senti_lemmas(). However, if
senti_syn is a SentiSynset object, then
senti_syn.synset.lemmas will deliver the
associated list.)
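If you do want something lemma-like, a small wrapper can pair each Lemma with the scores of its containing SentiSynset. A sketch, where senti_lemma_scores() is just an illustrative name and swn is the reader created above:
- def senti_lemma_scores(string, pos):
-     """Map a (string, pos) pair to (Lemma, pos_score, neg_score) triples."""
-     triples = []
-     for senti_synset in swn.senti_synsets(string, pos):
-         for lemma in senti_synset.synset.lemmas:
-             triples.append((lemma, senti_synset.pos_score, senti_synset.neg_score))
-     return triples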
You can iterate through all of the synsets in SentiWordNet. Here is
a function for doing that and printing the name of the synset (from
WordNet) along with its positive and negative scores (from
SentiWordNet):
- #!/usr/bin/env python
-
- from sentiwordnet import SentiWordNetCorpusReader
-
- def senti_synset_viewer():
- swn = SentiWordNetCorpusReader('SentiWordNet_3.0.0_20100705.txt')
- for senti_synset in swn.all_senti_synsets():
- print senti_synset.synset.name, senti_synset.pos_score, senti_synset.neg_score
Here is the beginning of the output
of senti_synset_viewer():
- snowmobile.n.01 0.0 0.0
fortunate.s.02 0.875 0.0
temperature.n.02 0.0 0.25
summer.n.02 0.0 0.0
whirring.s.01 0.0 0.0
presbytes.n.01 0.0 0.0
pure.a.06 0.5 0.0
moleskin.n.01 0.0 0.0
sidewalk.n.01 0.0 0.0
danaus.n.01 0.0 0.0
.
.
.
The number of SentiSynsets is the same as the number of Synsets:
- len(list(wn.all_synsets()))
- 117659
- len(list(swn.all_senti_synsets()))
- 117659
However, there are discrepancies between the current version of
WordNet and SentiWordNet, so one needs to write code that plans for
mismatches.
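One simple defensive strategy, sketched here, is to gather the synset names that SentiWordNet actually covers and check membership before doing any look-ups (this relies only on all_senti_synsets() and the name attribute used above):
- # The set of WordNet synset names covered by SentiWordNet:
- swn_names = set(ss.synset.name for ss in swn.all_senti_synsets())
- # Guarded look-ups while iterating over WordNet:
- for synset in wn.all_synsets():
-     if synset.name in swn_names:
-         senti_synset = swn.senti_synset(synset.name)
-     # else: this Synset has no SentiWordNet entry; plan for that case.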
exercise SENTISCORES
Now that we've explored the basics of WordNet, we can put the
pieces together into a function that might be useful for understanding
scalar implicature. My initial attempt at this:
- #!/usr/bin/env python
-
- from nltk.corpus import wordnet as wn
- from collections import defaultdict
-
- def wordnet_relations(word1, word2):
- """
- Uses the lemmas and synsets associated with word1 and word2 to
- gather all relationships between these two words. There is
- imprecision in this, since we range over all the lemmas and
- synsets consistent with each (string, pos) pair, but it seems
- to work well in practice.
-
- Arguments:
- word1, word2 (str, str) -- (string, pos) pairs
-
- Value:
- rels (set of str) -- the set of all WordNet relations that hold between word1 and word2
- """
- # This function ensures that we have a well-formed WordNet pos (or None for that value):
- s1, t1 = wordnet_sanitize(word1)
- s2, t2 = wordnet_sanitize(word2)
- # Output set of strings:
- rels = set([])
- # Loop through both synset and lemma relations:
- for lem1 in wn.lemmas(s1, t1):
- lemma_methodname_value_pairs = lemma_method_values(lem1)
- synset_methodname_value_pairs = synset_method_values(lem1.synset)
- for lem2 in wn.lemmas(s2, t2):
- # Lemma relations:
- for rel, rel_lemmas in lemma_methodname_value_pairs:
- if lem2 in rel_lemmas:
- rels.add(rel)
- # Synset relations:
- for rel, rel_synsets in synset_methodname_value_pairs:
- if lem2.synset in rel_synsets:
- rels.add(rel)
- return rels
-
- def wordnet_sanitize(word):
- """
- Ensure that word is a (string, pos) pair that WordNet can understand.
-
- Argument: word (str, str) -- a (string, pos) pair
-
- Value: a possibly modified (string, pos) pair, where pos=None if
- the input pos is outside of WordNet.
- """
- string, tag = word
- string = string.lower()
- tag = tag.lower()
- if tag.startswith('v'): tag = 'v'
- elif tag.startswith('n'): tag = 'n'
- elif tag.startswith('j'): tag = 'a'
- elif tag.startswith('rb'): tag = 'r'
- if tag in ('a', 'n', 'r', 'v'):
- return (string, tag)
- else:
- return (string, None)
The function wordnet_relations() takes
two (string, pos) pairs as input and then
uses wn.lemmas
and wn.synsets to move into WordNet and
begin relating them. As my documentation says, this process is
approximate, because the (string, pos) pairs might be compatible with
a range of different lemmas and synsets, but it's the only choice we
have (and we will see later that it works well in practice).
To get a feel for this function, let's play around with it a bit:
- from wordnet_functions import wordnet_relations
- wordnet_relations(('tree','n'), ('elm', 'n'))
- set(['hyponyms'])
- wordnet_relations(('good','a'), ('bad', 'a'))
- set(['antonyms'])
- wordnet_relations(('happy','a'), ('sad', 'a'))
- set([])
- wordnet_relations(('run','v'), ('move', 'v'))
- set(['hypernyms'])
- wordnet_relations(('move','v'), ('run', 'v'))
- set(['hyponyms'])
- wordnet_relations(('flour','n'), ('bread', 'n'))
- set(['substance_holonyms'])
exercise RELEXPLORE,
exercise IQAP
SYNSETS Get a feel
for what Synsets are like. For this, you can use either the NLTK
interface or the Web or command-line interface. (An in-between
option is to run from nltk.app import wordnet;
wordnet(), which launches a local Web browser
interface — this requires that you have a local server running):
- Pick a domain (e.g., mammals, sciences, vehicles).
- Search (string, pos) pairs from that domain and see what
Synsets associate with them. Note any worrisome gaps in coverage
or potential confusion of senses, etc.
- Use the Synset relations (e.g., hypernyms, hyponyms) to explore the
interconnections within this domain. How does the coverage
look?
- Optional (requires NLTK):
Use lowest_common_hypernyms() to
explore pairs of Synsets in your domain. Do the return values
generally belong specifically to your domain, or are they
more/overly general?
- Based on the above findings, provide an overall assessment of
WordNet's coverage in your chosen domain.
PATHS NLTK provides
a number of path similarity measures for Synsets, via its
WordNetCorpusReader class.
For example:
- from nltk.corpus import wordnet as wn
- tree = wn.synsets('tree', 'n')[0]
- flower = wn.synsets('flower', 'n')[0]
- wn.path_similarity(tree, flower)
- 0.16666666666666666
Pick a domain (e.g., mammals, sciences, vehicles). (If you did
problem SYNSETS, you might continue with your
domain from there.) Check on the path similarity of pairs of things
inside and outside your domain. Do the results make sense? Feel free
to compare the path similarity measures provided to see how they are
alike and how they are different. Might there be applications of
these measures to scalar implicature?
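Two of the other measures, which like path_similarity() require no extra resources, can be called the same way (note that lch_similarity() requires the two Synsets to have the same POS):
- # Wu-Palmer similarity, based on depths in the hypernym hierarchy:
- wn.wup_similarity(tree, flower)
- # Leacock-Chodorow similarity (same-POS Synsets only):
- wn.lch_similarity(tree, flower)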
VERBS The verbal
relations 'causes' and 'entailments' remain somewhat mysterious to me.
Using your preferred interface, sample around to try to get a sense
for which verbs have non-empty values for these relations and what
those values are like. NLTK users might
use synset_method_values() and a
modified version of synset_methods() to
answer this question comprehensively, by pulling out all and only
the relevant Synsets. In your report, summarize your findings
briefly.
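For the NLTK route, here is a starting sketch along those lines; it reuses synset_method_values() and the imports from wordnet_functions.py above, and simply collects the verb Synsets with non-empty values for the two relations:
- def verb_relation_examples():
-     """Map 'causes' and 'entailments' to the verb Synsets with non-empty values for them."""
-     examples = defaultdict(list)
-     for synset in wn.all_synsets('v'):
-         for method_name, vals in synset_method_values(synset):
-             if method_name in ('causes', 'entailments') and vals:
-                 examples[method_name].append((synset, vals))
-     return examples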
LEMMAEX
Explore the Lemma relations in the context of scalar reasoning.
- Think up 5 antonym pairs. Which are captured via the antonyms relation in WordNet?
- Are pertainyms reliably ordered via scales? Explore this question by sampling.
(You might use lemma_method_values() and a variant of
lemma_methods() to home in on the lemmas that have non-empty values here.)
- Are also_sees reliably ordered via scales? Explore this question by sampling.
SENTISCORES
SentiWordNet offers continuous values between 0 and 1 for both
positivity and negativity scores. What is the distribution of these
scores like? For this, you might want to use Python to create a CSV
file pairing words with their positive and negative scores. A
start:
- #!/usr/bin/env python
-
- import csv
- from sentiwordnet import SentiWordNetCorpusReader
-
- def sentiwordnet_scores_to_csv():
- swn = SentiWordNetCorpusReader('SentiWordNet_3.0.0_20100705.txt')
- csvwriter = csv.writer(file('sentiwordnet-scores.csv', 'w'))
- csvwriter.writerow(['Word', 'Tag', 'PosScore', 'NegScore'])
- for senti_synset in swn.all_senti_synsets():
- synset = # Complete this by getting senti_synset's synset.
- tag = # Complete this by getting synset's pos.
- if tag == 's':
- tag = 'a'
- pos_score = # Complete this by getting the positive score.
- neg_score = # Complete this by getting the negative score.
- for lemma in senti_synset.synset.lemmas:
- row = [lemma.name, tag, pos_score, neg_score]
- csvwriter.writerow(row)
Once you have the CSV file, you might also check on the
distribution of the SentiWordNet objectivity values, defined
as 1 - (neg_score+pos_score). What
important patterns do you see in the examples, and what consequences
might this have for using this resource to build scales?
RELEXPLORE Pick
three examples from the IQAP corpus
(Question–answer relationships contains a
sample) and try to determine
whether wordnet_relations, if properly
applied, would get those examples right. What assumptions would we
have to make to achieve the best results?
IQAP (This problem
pairs well with the corresponding one
for the IMDB data.) Recall that the
iqap.Item
methods question_contrast_pred_trees
and
answer_contrast_pred_trees extract the
subtrees rooted at -CONTRAST nodes from the questions and
answers. If we are going to use WordNet to compare these trees, then
we should first get a sense for WordNet's coverage of these nodes.
As a first step towards doing this, the following code loops through
the development set of the IQAP corpus, pulling just the items that
have a single '-CONTRAST' word in both the question and the answer.
Finish this code so that it
uses wordnet_relations to compare these
words.
- # Assuming the IqapReader class lives in the iqap module (cf. iqap.Item above):
- from iqap import IqapReader
- from wordnet_functions import wordnet_relations
-
- def check_wordnet_one_worder_coverage():
- corpus = IqapReader('iqap-data.csv')
- for item in corpus.dev_set():
- q_words = item.question_contrast_pred_pos()
- a_words = item.answer_contrast_pred_pos()
- if len(q_words) == 1 and len(a_words) == 1:
- # Use wordnet_relations to compare the two: