Rhea Myers

WordNet

We can use NLTK’s support for WordNet to help generate and classify text.

from nltk.corpus import wordnet as wn
from nltk.corpus import sentiwordnet as swn

def make_synset(word, category='n', number='01'):
    """Conveniently make a synset"""
    number = int(number)
    return wn.synset('%s.%s.%02i' % (word, category, number))

>>> dog = make_synset('dog')
>>> dog.definition
'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'

A synset is WordNet’s representation of a word/concept. Looking at the definition confirms that we have the synset for canis familiaris rather than persecution or undesirability.

>>> dog.hypernyms()
[Synset('domestic_animal.n.01'), Synset('canine.n.02')]

Hypernyms are more general concepts. ‘dog’ has two of them, which shows that WordNet is not arranged in a simple tree of concepts. This makes checking for common ancestors slightly more complex but represents concepts more realistically.

>>> dog.hyponyms()
[Synset('puppy.n.01'), Synset('great_pyrenees.n.01'), Synset('basenji.n.01'), Synset('newfoundland.n.01'), Synset('lapdog.n.01'), Synset('poodle.n.01'), Synset('leonberg.n.01'), Synset('toy_dog.n.01'), Synset('spitz.n.01'), Synset('pooch.n.01'), Synset('cur.n.01'), Synset('mexican_hairless.n.01'), Synset('hunting_dog.n.01'), Synset('working_dog.n.01'), Synset('dalmatian.n.02'), Synset('pug.n.01'), Synset('corgi.n.01'), Synset('griffon.n.02')]

Hyponyms are more specific concepts. ‘dog’ has several. These may have hypernyms other than ‘dog’, and may have several hyponyms themselves.

def _recurse_all_hypernyms(synset, all_hypernyms):
    synset_hypernyms = synset.hypernyms()
    if synset_hypernyms:
        all_hypernyms += synset_hypernyms
        for hypernym in synset_hypernyms:
            _recurse_all_hypernyms(hypernym, all_hypernyms)

def all_hypernyms(synset):
    """Get the set of hypernyms of the hypernym of the synset etc.
       Nouns can have multiple hypernyms, so we can't just create a depth-sorted
       list."""
    hypernyms = []
    _recurse_all_hypernyms(synset, hypernyms)
    return set(hypernyms)

>>> all_hypernyms(dog)
>>> set([Synset('chordate.n.01'), Synset('living_thing.n.01'), Synset('physical_entity.n.01'), Synset('animal.n.01'), Synset('mammal.n.01'), Synset('object.n.01'), Synset('vertebrate.n.01'), Synset('entity.n.01'), Synset('carnivore.n.01'), Synset('domestic_animal.n.01'), Synset('canine.n.02'), Synset('placental.n.01'), Synset('organism.n.01'), Synset('whole.n.02')])

We can recursively fetch the hypernyms of a synset. since ‘dog’ has two hypernyms this isn’t a single list of hypernyms. We can use this to find how similar different words are by searching for common ancestors. The Python WordNet library can find common hypernyms for us though.

>>> cat = make_synset('cat')
>>> cat.common_hypernyms(dog)
[Synset('chordate.n.01'), Synset('living_thing.n.01'), Synset('physical_entity.n.01'), Synset('animal.n.01'), Synset('mammal.n.01'), Synset('vertebrate.n.01'), Synset('entity.n.01'), Synset('carnivore.n.01'), Synset('object.n.01'), Synset('placental.n.01'), Synset('organism.n.01'), Synset('whole.n.02')]
>>> steel = make_synset('steel')
>>> steel.common_hypernyms(dog)
[Synset('physical_entity.n.01'), Synset('entity.n.01')]
>>> sunset = make_synset('sunset')
>>> sunset.common_hypernyms(dog)
[Synset('entity.n.01')]

As might be expected, cats and dogs are more similar than steel or sunsets. We can recursively fetch the hyponyms of a synset. This gives us the set of objects or concepts with a kind-of relationship to the word.

def _recurse_all_hyponyms(synset, all_hyponyms):
    synset_hyponyms = synset.hyponyms()
    if synset_hyponyms:
        all_hyponyms += synset_hyponyms
        for hyponym in synset_hyponyms:
            _recurse_all_hyponyms(hyponym, all_hyponyms)

def all_hyponyms(synset):
    """Get the set of the tree of hyponyms under the synset"""
    hyponyms = []
    _recurse_all_hyponyms(synset, hyponyms)
    return set(hyponyms)

>>> all_hyponyms(dog)
set([Synset('harrier.n.02'), Synset('water_spaniel.n.01'), Synset('standard_poodle.n.01'), Synset('dandie_dinmont.n.01'), Synset('wirehair.n.01'), Synset('toy_manchester.n.01'), Synset('puppy.n.01'), Synset('briard.n.01'), Synset('beagle.n.01'), Synset('siberian_husky.n.01'), Synset('manchester_terrier.n.01'), Synset('bloodhound.n.01'), ...

WordNet has some support for synonyms and antonyms via lemmas.

def synset_synonyms(synset):
    """Get the synonyms for the synset"""
    return set([lemma.synset for lemma in synset.lemmas])

def synset_antonyms(synset):
    """Get the antonyms for [the first lemma of] the synset"""
    return set([lemma.synset for lemma in synset.lemmas[0].antonyms()])

>>> synset_synonyms(sunset)
set([Synset('sunset.n.01')])
>>> synset_antonyms(sunset)
set([Synset('dawn.n.01')])

And we can find related concepts by getting all the hyponyms of a word’s hypernynms.

def all_peers(synset):
    """Get the set of all peers of the synset (including the synset).
       If the synset has multiple hypernyms then the peers will be hyponyms of
       multiple synsets."""
    hypernyms = synset.hypernyms()
    peers = []
    for hypernym in hypernyms:
        peers += hypernym.hyponyms()
    return set(peers)

>>> all_peers(sunset)
set([Synset('zero_hour.n.01'), Synset('rush_hour.n.01'), Synset('early-morning_hour.n.01'), Synset('none.n.01'), Synset('midnight.n.01'), Synset('happy_hour.n.01'), Synset('dawn.n.01'), Synset('bedtime.n.01'), Synset('late-night_hour.n.01'), Synset('small_hours.n.01'), Synset('noon.n.01'), Synset('sunset.n.01'), Synset('twilight.n.01'), Synset('mealtime.n.01'), Synset('canonical_hour.n.01'), Synset('closing_time.n.01')])

We use sets here so that common ancestors and children appear only once, and to allow for boolean set operations on concepts. It’s trivial to get the the word (or words) for a synset.

def synsets_words(synsets):
    """Get the set of strings for the words represented by the synsets"""
    return set([synset_word(synset) for synset in synsets])

>>> synsets_words(all_hyponyms(dog))
set(['rottweiler', 'bull mastiff', 'belgian sheepdog', 'courser', 'brabancon griffon', 'toy terrier', 'fox terrier', 'sennenhunde', 'standard poodle', 'saluki', 'pointer', 'toy spaniel', 'setter', 'giant schnauzer', 'housedog', 'papillon', 'american foxhound', 'weimaraner', 'cocker spaniel', 'basenji', 'beagle', ...

WordNet has part/whole, group and substance relationships.

>>> body = make_synset('body')
>>> body.part_meronyms()
[Synset('arm.n.01'), Synset('articulatory_system.n.01'), Synset('body_substance.n.01'), Synset('cavity.n.04'), Synset('circulatory_system.n.01'), Synset('crotch.n.02'), Synset('digestive_system.n.01'), Synset('endocrine_system.n.01'), Synset('head.n.01'), Synset('leg.n.01'), Synset('lymphatic_system.n.01'), Synset('musculoskeletal_system.n.01'), Synset('neck.n.01'), Synset('nervous_system.n.01'), Synset('pressure_point.n.01'), Synset('respiratory_system.n.01'), Synset('sensory_system.n.02'), Synset('torso.n.01'), Synset('vascular_system.n.01')]

>>> dog.member_holonyms()
[Synset('canis.n.01'), Synset('pack.n.06')]

>>> wood = make_synset('wood')
>>> wood.substance_holonyms()
[Synset('beam.n.02'), Synset('chopping_block.n.01'), Synset('lumber.n.01'), Synset('spindle.n.02')]
>>> wood.substance_meronyms()
[Synset('lignin.n.01')]

We can use hypernyms to classify words into domains using WordNet, but there’s an existing domain classification system in the form of WordNet Domains. It can be downloaded here. Code for using this can be found on Stack Overflow. But it doesn’t seem to work with nltk 3.0 (the synset numbers don’t match).

And there’s a sentiment score system for WordNet in the form of SentiWordNet. There’s an interface for it in WordNet 3.0.

def make_senti_synset(word, category='n', number='01'):
    """Conveniently make a senti_synset"""
    number = int(number)
    return swn.senti_synset('%s.%s.%02i' % (word, category, number))

def synsets_sentiments(synsets):
    """Return the objs, pos, neg and pos - neg score sums for the synsets"""
    pos = 0.0
    obj = 0.0
    neg = 0.0
    for synset in synsets:
        try:
            pos += synset.pos_score()
            obj += synset.obj_score()
            neg += synset.neg_score()
        except AttributeError, e:
            pass
    return obj, pos, neg, pos - neg

>>> happy = make_senti_synset('happy', 'a')
>>> happy.pos_score()
0.875
>>> happy.neg_score()
0.0
>>> happy.obj_score()
0.125

synsets_sentiments([make_senti_synset(word, 'a') for word in 'happy sad angry heavy light depressing'.split()])
(2.5, 1.5, 2.0, -0.5)

Not every word has a sentiment score, hence the try/except block in synsets_sentiments.

WordNet is sensitive to senses and it’s hard to automatically resolve senses when processing arbitrary text. When generating text and using WordNet to find words, it’s important (and easier) to set the correct sense for the synset.

>>> colour = make_synset('colour', 'n', 6)
>>> all_hyponyms(colour)
set([Synset('chrome_red.n.01'), Synset('primary_color.n.01'), Synset('light_brown.n.01'), Synset('sallowness.n.01'), Synset('hazel.n.04'), Synset('iron-grey.n.01'), Synset('olive_green.n.01'), Synset('tan.n.02'), Synset('pastel.n.01'), Synset('coal_black.n.01'), Synset('pinkness.n.01'), Synset('vandyke_brown.n.01'), Synset('beige.n.01'), Synset('blue.n.01'), Synset('shade.n.02'), Synset('achromatic_color.n.01'), Synset('whiteness.n.03'), Synset('coral.n.01'), Synset('chromatism.n.02'), Synset('apatetic_coloration.n.01'), ...

This gives concepts on different levels. Maybe if we try the peers of a colour.

>>> all_peers(make_synset('red'))
set([Synset('red.n.01'), Synset('pastel.n.01'), Synset('purple.n.01'), Synset('green.n.01'), Synset('olive.n.05'), Synset('complementary_color.n.01'), Synset('brown.n.01'), Synset('blue.n.01'), Synset('blond.n.02'), Synset('yellow.n.01'), Synset('orange.n.02'), Synset('pink.n.01'), Synset('salmon.n.04')])

OK maybe if we try the children of a concept.

>>> all_hyponyms(make_synset('chromatic_color'))
set([Synset('chrome_red.n.01'), Synset('light_brown.n.01'), Synset('hazel.n.04'), Synset('olive_green.n.01'), Synset('tan.n.02'), Synset('pastel.n.01'), Synset('pinkness.n.01')

Perhaps the leaf nodes.

def _recurse_leaf_hyponyms(synset, leaf_hyponyms):
    synset_hyponyms = synset.hyponyms()
    if synset_hyponyms:
        for hyponym in synset_hyponyms:
            _recurse_all_hyponyms(hyponym, leaf_hyponyms)
    else:
        leaf_hyponyms += synset

def leaf_hyponyms(synset):
    """Get the set of leaf nodes from the tree of hyponyms under the synset"""
    hyponyms = []
    _recurse_leaf_hyponyms(synset, hyponyms)
    return set(hyponyms)

>>> leaf_hyponyms(make_synset('chromatic_color'))
set([Synset('taupe.n.01'), Synset('snuff-color.n.01'), Synset('chrome_red.n.01'), Synset('light_brown.n.01'), Synset('hazel.n.04'), Synset('olive_drab.n.01'), Synset('old_gold.n.01'), Synset('chocolate.n.03'), Synset('yellowish_pink.n.01'), Synset('yellowish_brown.n.01'), Synset('tyrian_purple.n.02'), ...

That looks good. All colours, no intermediate concepts.

We can use this set of words to choose colours, or to categorize words as colours.

I hope this demonstrates that WordNet can be a very useful resource for Generative Art and Digital Humanities projects.