There is a very nice idea that Kevin Knight and Daniel Marcu discuss when
talking about unsupervised learning. In their domain, this has to do with
Machine Translation and word alignment. They present an exercise where you
manually align words in two "languages" that are (apparently) made-up,
nonsensical languages. You can see the full exercise here:

http://www.isi.edu/natural-language/mt/aimag97.ps

In general, this exercise forces you to see the language the way the
computer sees it when doing unsupervised learning, that is, without
any additional background or real-world knowledge. This is in fact a very
useful way to think about what SenseClusters is trying to do, because it
likewise does not use any real-world or domain knowledge. It relies
strictly on the text.

So, suppose we see contexts like:

The big yellow dog is nuts.
My cat went crazy.
The computer fell off the shelf.

As humans, we can (possibly) cluster these based on the fact that we know
cats and dogs are animals, and that a computer is inanimate. We also
know that nuts and crazy are synonyms. But this of course gives us a
false impression as to the ease of the problem. If you convert each word
type into a random string, then your data really looks like this:

Zyx clrg xlll ark abd daf.
afaf weoi ckjl jkl.
Zyx clllll jkfdjaffd qwv zyx jlkdf.

In fact, now we can see more clearly what SenseClusters is dealing with
(and it's a mess :). Based on what we see above, the only similarity
between the contexts is zyx, which is our new way of saying "the". But
problems abound; for example, daf and jkl (nuts and crazy) are
unrecognizable to us as synonyms.

So, I think it might be very useful to convert your data from time to
time into a form like this, where you can't rely on your world knowledge
to make distinctions. Then, try to cluster the data. You'll see what a
tough job SenseClusters is sometimes faced with. :)

You know what, I like this idea so much I think we'll write a little
program to do this for Senseval-2 formatted data. We'll keep you posted on
that.
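
Until then, here is a rough sketch of the idea in Python for plain text
(one context per line), not for Senseval-2 formatted data. Everything in
it is an assumption made for illustration: the name make_scrambler, the
seed and length parameters, treating a "word" as a run of ASCII letters,
and folding case so that "The" and "the" count as the same word type. It
is not part of SenseClusters.

```python
import random
import re
import string


def make_scrambler(seed=None, min_len=3, max_len=8):
    """Return (scramble, mapping): scramble() replaces each word type
    with a random lowercase string, reusing the same string every time
    the same word type appears. All names here are hypothetical."""
    rng = random.Random(seed)
    mapping = {}   # word type (lowercased) -> random string
    used = set()   # random strings already handed out

    def scramble_word(word):
        key = word.lower()  # assumption: case is not significant
        if key not in mapping:
            # Draw random strings until we get one not used before,
            # so two different word types never collide.
            while True:
                length = rng.randint(min_len, max_len)
                candidate = "".join(
                    rng.choice(string.ascii_lowercase) for _ in range(length)
                )
                if candidate not in used:
                    break
            used.add(candidate)
            mapping[key] = candidate
        return mapping[key]

    def scramble(text):
        # Replace runs of ASCII letters; punctuation and spacing are kept.
        return re.sub(r"[A-Za-z]+", lambda m: scramble_word(m.group(0)), text)

    return scramble, mapping


if __name__ == "__main__":
    scramble, mapping = make_scrambler(seed=1)
    for line in [
        "The big yellow dog is nuts.",
        "My cat went crazy.",
        "The computer fell off the shelf.",
    ]:
        print(scramble(line))
```

The mapping is kept for the whole run rather than per line, since the
only thing left to cluster on is which strings repeat across contexts.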

Enjoy,
Ted

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

