As you likely know, SenseClusters was originally designed to carry out
word sense discrimination. The typical use of SenseClusters is as follows:
Given 100 sentences that contain the word "line", use SenseClusters to
identify the different contexts in which "line" occurs, and thereby make
some judgements about the different meanings of "line" that might exist.

Recently we've become aware that SenseClusters is ideally suited for the
problem of name discrimination. This is the task of trying to identify
the different people associated with the same name.

In name discrimination we might be given 100 sentences or paragraphs that
contain various forms of the name "John Smith". The object of name
discrimination is to identify how many "John Smith"s are being referred
to, and which instances refer to which "John Smith". If you happen to
know which "John Smith" each instance refers to, then you could
create a key and run the standard SenseClusters evaluation procedure
to see how well the clustering found the "true" results.
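
To make the idea of scoring against a key concrete, here is a minimal
sketch in Python. It is not the SenseClusters evaluation code; it only
illustrates one common way to score a clustering, by trying every
one-to-one assignment of clusters to people and keeping the best. The
function name and the sample data are made up for illustration, and the
brute-force search is only practical for a handful of clusters.

```python
from itertools import permutations

def best_mapping_accuracy(cluster_ids, gold_labels):
    """Accuracy under the best one-to-one mapping of clusters to gold labels.

    Hypothetical helper for illustration; brute force over all mappings.
    """
    clusters = sorted(set(cluster_ids))
    labels = sorted(set(gold_labels))
    # If there are more clusters than people, pad with None so that the
    # extra clusters map to "nobody" and count as errors.
    padded = labels + [None] * max(0, len(clusters) - len(labels))
    best = 0
    for perm in permutations(padded, len(clusters)):
        mapping = dict(zip(clusters, perm))
        correct = sum(
            1 for c, g in zip(cluster_ids, gold_labels) if mapping[c] == g
        )
        best = max(best, correct)
    return best / len(gold_labels)

# Five instances, two discovered clusters, key says who each one really is.
clusters = [0, 0, 1, 1, 1]
key = ["A", "A", "B", "B", "A"]
accuracy = best_mapping_accuracy(clusters, key)
```

Here cluster 0 lines up with person A and cluster 1 with person B, so
four of the five instances agree with the key.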

So if you think of the actual people as "meanings" and their names as the
surface form of a word that appears in a corpus, you can see that this
is essentially the word sense discrimination problem in a different form.

One nice thing about name discrimination as a problem is that it's quite
easy to create test data. Suppose you have a corpus of text. Pick out two
names that occur in that text (e.g., George Bush and John Kerry) and
replace those names with "PersonX". Then cluster the instances of PersonX,
and see how closely they correspond to the actual clusters of John Kerry
and George Bush. (This is analogous to the creation of pseudo-words, as is
done in disambiguation experiments.) Given this framework, it would be
very easy to create a key of the correct assignments of instances (just
use the name prior to converting to PersonX) and thereby evaluate the
clustering algorithm. Again, this is something that SenseClusters does
easily.
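
The PersonX conflation described above could be sketched roughly as
follows. This is only an illustration in Python, not part of
SenseClusters; the function name is made up, and it assumes each name
appears in one fixed spelling. Sentences that mention both names (or
neither) are skipped, since they have no single correct answer.

```python
import re

def make_pseudo_name_data(sentences, names, placeholder="PersonX"):
    """Replace each of the given names with a placeholder and keep a key.

    Hypothetical helper: returns (instances, key), where key[i] is the
    original name that was hidden in instances[i].
    """
    # Longest names first so a longer name wins over any shorter overlap.
    ordered = sorted(names, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(n) for n in ordered))
    instances, key = [], []
    for sentence in sentences:
        found = set(pattern.findall(sentence))
        if len(found) != 1:
            continue  # no name, or both names: not a usable instance
        key.append(found.pop())
        instances.append(pattern.sub(placeholder, sentence))
    return instances, key

sentences = [
    "George Bush spoke in Ohio.",
    "John Kerry visited Iowa.",
    "No names here.",
    "George Bush met John Kerry.",
]
instances, key = make_pseudo_name_data(sentences, ["George Bush", "John Kerry"])
```

The instances list is what gets clustered, and the key list holds the
correct assignments to evaluate against.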

We are in the process of creating some name discrimination data that
can be experimented with. It's quite simple to run such experiments. In
fact, once you have the data, it's exactly like running any other
SenseClusters experiment.

For example, if you have instances that contain various forms of a
particular name, you can define those forms as valid target tokens. Target
words are found within the <head> </head> tag, and they can be
thought of as defining the "center" of the context that is being clustered.
In other words, we are interested in determining the contexts in which a
particular target word occurs.

SenseClusters supports very flexible token handling, so you could have
target word forms in your instances such as:

John S.
John Smith
Mr. Smith
Mr. John Smith
etc.

Thus, you don't need to regularize or normalize the names before running;
you can process them as they appear in a corpus.
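
As a rough illustration of treating several surface forms as one target,
the Python sketch below wraps each of the forms listed above in
<head> </head> tags. It is an assumption-laden example, not the mechanism
SenseClusters itself uses: the pattern and the function name are invented
here, and a real corpus would need a more careful pattern.

```python
import re

# Longest alternatives first, so "Mr. John Smith" is not split into
# "Mr." plus a separately tagged "John Smith".
NAME_FORMS = re.compile(
    r"\b(?:Mr\. John Smith|John Smith|Mr\. Smith|John S\.)(?!\w)"
)

def tag_targets(text):
    """Wrap every recognized name form in <head> </head> tags."""
    return NAME_FORMS.sub(lambda m: "<head>" + m.group(0) + "</head>", text)

tagged = tag_targets("Mr. John Smith met John S. and Mr. Smith.")
```

The names are left exactly as they appear in the text; only the tags are
added around them.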

We'll pursue this a bit more, but we wanted to mention that this is
something we are now using SenseClusters for, and to let you know that
we should have some sample data that you can experiment with in the
near future.

Let us know what you think about this, especially if you know where we can
get any "standard" sets of name disambiguation data, or if you have done
name disambiguation/discrimination using other techniques.

Cordially,
Ted

Download SenseClusters from: http://senseclusters.sourceforge.net
(The most current version is 0.51)

--
Ted Pedersen
http://www.d.umn.edu/~tpederse

