I have a pet project which, to do correctly, would require me to calculate the
similarity of every word pair in WordNet (or whatever corpus I use, but WordNet
seems as good as any and better than most). My intention is to find the least
similar words in English, and the gloss vector method seems to provide the most
satisfactory results, both in accuracy and fine-grained differentiation.

Preliminary tests show that only between 0.1 and 1 percent of the pairs will
actually score zero, and that at least half of those involve the same few
words (about 10% of the words that appear in any zero-scoring pair), so I
think I should be able to rank the least similar pairs adequately without too
many tied scores. Obviously this is a big undertaking (some 40-60k hours'
worth), so I want to get it right the first time, and I have a few questions
for whoever feels qualified to answer:

I'm curious what causes a score of precisely zero. I've read a bit about the
vector method, but I'm no expert. There don't seem to be any scores between
about 0.0002 and an exact 0. Even 0.1% of pairs is still about 30 million
zeros, so is there anything I can do to get finer differentiation among them
(without sacrificing accuracy, as dropping stop words would)?
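To make the question concrete, here's my (possibly naive) understanding of the only way a cosine-based measure can produce an exact zero: the two gloss vectors share no nonzero dimension at all, i.e. they are orthogonal. A toy sketch, with invented dimensions and weights:

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors given as {dimension: weight} dicts."""
    dot = sum(w * v.get(d, 0.0) for d, w in u.items())
    if dot == 0.0:
        return 0.0  # orthogonal: no shared nonzero dimension
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

# Two gloss vectors with no co-occurrence dimension in common score exactly 0,
# regardless of their magnitudes; any single shared dimension lifts the score
# above zero, which would explain the gap below ~0.0002.
a = {"engine": 2.0, "wheel": 1.0}
b = {"petal": 3.0, "stem": 1.0}
c = {"wheel": 0.5, "road": 2.0}
print(cosine(a, b))      # 0.0
print(cosine(a, c) > 0)  # True
```

If that's right, the zeros are a hard floor of the representation rather than rounding, and finer differentiation would need a richer vector space (more dimensions surviving the stoplist), not more numeric precision.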

Regarding stop words, has anyone compiled a general purpose stoplist that they
feel is more complete than the sample one?

Am I right in thinking that even words and senses that share a gloss (synonyms)
would not necessarily score the same in a vector similarity comparison due to
having different first-order connections, and thus a different "big
concatenated gloss" (is there a different term for this, or should I just call
it a BCG)? Ideally I would want to skip words that I know will get identical
scores.
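To illustrate what I mean, a toy sketch (all glosses and relations invented): two synonyms carry an identical gloss of their own, but the concatenated gloss built from first-order relations differs as soon as their relation neighbourhoods differ.

```python
# Invented example data -- not real WordNet glosses or relations.
glosses = {
    "car": "a motor vehicle with four wheels",
    "auto": "a motor vehicle with four wheels",  # identical gloss: synonyms
    "wheel": "a circular frame that revolves on an axle",
    "motor": "a machine that converts power into motion",
}
# hypothetical first-order connections (hypernyms, meronyms, ...)
relations = {
    "car": ["wheel"],
    "auto": ["motor"],
}

def concatenated_gloss(word):
    """Own gloss plus the glosses of first-order relation neighbours."""
    return " ".join([glosses[word]] + [glosses[r] for r in relations.get(word, [])])

print(glosses["car"] == glosses["auto"])                        # True
print(concatenated_gloss("car") == concatenated_gloss("auto"))  # False
```

So my reading is that sharing a gloss does not guarantee sharing a score, and the only safe pairs to skip would be ones whose whole relation neighbourhood coincides.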

How many people would be genuinely interested in a complete vector similarity
matrix? For my own purposes I do not need to save all the results, only the
lowest scores, however I do have to calculate them all, and so it would be a
shame to throw them out if others could use them. The problem is, a complete
matrix would be at least 250 GB of data if I save the scores at full
precision as doubles (half that with single-precision floats). It would take
some effort to keep that much data, as even the large computing resources I
have access to only allow allocations of up to 200 GB. So I don't want to do
this unless it would actually be appreciated. Also, such a large amount of
data would not be very convenient to work with, so I'm not sure of its
utility.
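For anyone checking my sizing, the back-of-the-envelope arithmetic is just the pair count times bytes per score. The item count n below is an assumption (word forms vs. word senses changes it a lot), and any per-pair indexing overhead would push the total higher than the raw-score figure:

```python
# Sizing sketch: n is an assumed item count, not an exact WordNet statistic.
n = 155_000
pairs = n * (n - 1) // 2       # unordered pairs, diagonal excluded
bytes_double = pairs * 8       # 64-bit doubles
bytes_float = pairs * 4        # 32-bit floats
print(f"{pairs:,} pairs")      # 12,012,422,500 pairs
print(f"{bytes_double / 1e9:.0f} GB as doubles, {bytes_float / 1e9:.0f} GB as floats")
```

Storing pair identifiers alongside the scores, or counting senses rather than word forms, is what drives it up toward (and past) the 250 GB mark.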

From reading other posts here, it's clear there isn't much to be done to
speed up the calculations other than choosing a faster measure such as Res,
or cutting down my dataset; is there? Is anyone perchance working on a
wn-similarity version in C or some faster language? I've tried using perlcc
to no avail, and I didn't really expect that to work in the first place.
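One thing that does help, in case it's useful to anyone attempting the same: since each pair's score is independent of every other pair's, the job shards trivially across processes or cluster nodes. A sketch of the partitioning, with a cheap placeholder standing in for the real (expensive) gloss-vector call:

```python
# Sketch only: similarity() is a placeholder, and the word list is a toy.
from multiprocessing import Pool

words = ["cat", "dog", "tree", "car", "idea", "run"]

def similarity(a, b):
    """Placeholder for the real measure: Jaccard overlap of character sets."""
    return len(set(a) & set(b)) / len(set(a) | set(b))

def score_shard(shard_id, nshards=4):
    """Score every k-th unordered pair, so shards partition the pair space."""
    out = []
    k = 0
    for i in range(len(words)):
        for j in range(i + 1, len(words)):
            if k % nshards == shard_id:
                out.append((words[i], words[j], similarity(words[i], words[j])))
            k += 1
    return out

if __name__ == "__main__":
    with Pool(4) as pool:
        shards = pool.map(score_shard, range(4))
    print(sum(len(s) for s in shards))  # 15: every C(6,2) pair scored exactly once
```

It doesn't make any single comparison faster, but it divides the wall-clock time by however many cores or nodes are available.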

All that being said, Similarity is a great module and I'd really like to thank
Ted Pedersen, Siddharth Patwardhan and all the others involved in creating it.
And of course the Princeton WordNet guys too.
