I have a pet project which, to do correctly, would require me to calculate the similarity of every word pair in WordNet (or whatever corpus I use, but WordNet seems as good as any and better than most). My intention is to find the least similar words in English, and the gloss vector method seems to provide the most satisfactory results, both in accuracy and fine-grained differentiation.
Preliminary tests show that only between 0.1 and 1 percent of the pairs will actually score zero, and that at least fifty percent of those will involve the same few words (about 10% of the words involved in any zero-score pair), so I think I should be able to adequately rank the least similar pairs without too many tied scores. Obviously this is a big undertaking (some 40-60k hours' worth), so I want to get it right the first time, and I have a few questions for whoever feels qualified to answer:

1. What causes a score of precisely zero? I have read some on the vector method, but am no expert. There don't seem to be any scores between roughly 0.0002 and an exact 0. Even just 0.1% is still about 30 million zeros, so is there anything I can do to get even finer differentiation (without sacrificing accuracy, as dropping stop words would)?

2. On stop words: has anyone compiled a general-purpose stoplist that they feel is more complete than the sample one?

3. Am I right in thinking that even words and senses that share a gloss (synonyms) would not necessarily score the same in a vector similarity comparison, because they have different first-order connections and thus a different "big concatenated gloss" (is there a proper term for this, or should I just call it a BCG)? Ideally I would want to skip words that I know will get identical scores.

4. How many people would be genuinely interested in a complete vector similarity matrix? For my own purposes I do not need to save all the results, only the lowest scores; however, I do have to calculate them all, so it would be a shame to throw them out if others could use them. The problem is that a complete matrix would be at least 250 GB of data if I save the scores at full precision as doubles (half that if I used single-precision floats). It would be some effort to keep that much data, as even the large computing resources I have access to only allow allocation of up to 200 GB.
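On question 1, my understanding (hedged, as I'm working from the published description of the gloss-vector measure rather than the module internals) is that the final score is the cosine of two co-occurrence vectors, so an *exact* zero can only arise when the two vectors are orthogonal, i.e. they share no nonzero dimension at all. A minimal Python sketch, with made-up toy vectors, shows why there is a gap between "tiny" and "exactly zero":

```python
# Illustrative sketch only, not the WordNet::Similarity implementation.
# Gloss vectors are modeled as sparse {dimension: weight} dicts; the
# score is the cosine of the two vectors. Any shared dimension yields a
# strictly positive dot product, so scores either clear some small
# positive floor or are exactly 0.
import math

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts of dimension -> weight)."""
    dot = sum(w * v[d] for d, w in u.items() if d in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)

# Overlapping dimensions ("fur"): small but strictly positive score.
a = {"cat": 1.0, "fur": 0.5}
b = {"dog": 1.0, "fur": 0.1}
print(cosine(a, b))  # small positive value

# No shared dimensions: the dot product, and hence the score, is exactly 0.
c = {"quark": 1.0, "lepton": 2.0}
print(cosine(a, c))  # 0.0
```

If that reading is right, the gap below ~0.0002 would just reflect the smallest co-occurrence weights in the corpus, and the only way to split the zeros further would be to enlarge the vectors (e.g. a bigger co-occurrence window or corpus) so fewer pairs end up fully orthogonal.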
So I don't want to do this unless it would actually be appreciated. Also, such a large amount of data would not be very convenient to work with, so I'm not sure of its utility.

From reading other posts here, it's clear there isn't much to be done to speed up the calculations other than choosing a faster measure such as Res, or cutting down my dataset... is there? Is anyone perchance working on a WordNet::Similarity version in C or some other faster language? I've tried using perlcc to no avail, though I didn't really expect that to work in the first place.

All that being said, Similarity is a great module, and I'd really like to thank Ted Pedersen, Siddharth Patwardhan, and all the others involved in creating it. And of course the Princeton WordNet guys too.
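As an aside on keeping only the lowest scores without ever materializing the full matrix: a bounded heap lets you stream through all pairs in O(1) memory per kept entry. A minimal Python sketch (the `scored_pairs` stream and pair format are hypothetical, standing in for whatever the scoring loop emits):

```python
# Sketch: retain only the n least-similar pairs while streaming scores,
# so the full 250 GB matrix never needs to be stored. heapq is a min-heap,
# so we push negated scores to keep the *largest* kept score at the root
# and evict it whenever a smaller score arrives.
import heapq

def lowest_pairs(scored_pairs, n=1000):
    """Return the n (score, pair) entries with the smallest scores, ascending."""
    heap = []  # bounded max-heap via negated scores
    for score, pair in scored_pairs:
        if len(heap) < n:
            heapq.heappush(heap, (-score, pair))
        elif -score > heap[0][0]:  # new score is below the worst one kept
            heapq.heapreplace(heap, (-score, pair))
    return sorted((-s, p) for s, p in heap)

# Toy usage: a stream of (score, (word1, word2)) tuples.
stream = [(0.9, ("cat", "dog")), (0.0, ("cat", "quark")), (0.4, ("dog", "run"))]
print(lowest_pairs(stream, n=2))  # the two lowest-scoring pairs
```

With ties as common as the zero counts suggest, the cutoff could alternatively be a score threshold rather than a fixed n, but the streaming idea is the same either way.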

