As you likely know, SenseClusters was originally designed to carry out word sense discrimination. The typical use of SenseClusters is as follows: Given 100 sentences that contain the word "line", use SenseClusters to identify the different contexts in which line occurs, and thereby make some judgements about the different meanings of "line" that might exist.
Recently we've become aware that SenseClusters is well suited to the problem of name discrimination. This is the task of trying to identify the different people associated with the same name. In name discrimination we might be given 100 sentences or paragraphs that contain various forms of the name "John Smith". The object of name discrimination is to identify how many "John Smith"s are being referred to, and which instances refer to which "John Smith". If you happen to know which "John Smith" each instance refers to, then you could create a key and run the standard SenseClusters evaluation procedure to see how well the clustering recovered the "true" results.

So if you think of the actual people as "meanings" and their names as the surface form of a word that appears in a corpus, you can see that this is essentially the word sense discrimination problem in a different form.

One nice thing about name discrimination as a problem is that it's quite easy to create test data. Suppose you have a corpus of text. Pick out two names that occur in that text (e.g., George Bush and John Kerry) and replace those names with "PersonX". Then cluster the instances of PersonX, and see how closely the resulting clusters correspond to the actual John Kerry and George Bush instances. (This is analogous to the creation of pseudo words, as is done in disambiguation experiments.) Given this framework, it is very easy to create a key of the correct assignments of instances (just use the name that appeared prior to converting to PersonX) and thereby evaluate the clustering algorithm. Again, this is something that SenseClusters easily does.

We are in the process of creating some name discrimination data that can be experimented with. It's quite simple to run such experiments. In fact, once you have the data, it's exactly like running any other SenseClusters experiment. For example, if you have instances that contain various forms of a particular name, you can define those forms as valid target tokens.
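As a rough illustration of the PersonX test data construction described above, here is a minimal Python sketch. It is not part of SenseClusters: the function name make_pseudo_name_data and its parameters are made up for this example, and it only marks the target with <head> tags rather than producing the full input format that SenseClusters expects. Contexts that mention none of the names, or more than one of them, are skipped so that each instance has exactly one correct answer in the key.

```python
import re


def make_pseudo_name_data(contexts, names, placeholder="PersonX"):
    """Mask the given names with a placeholder and record the true name as a key.

    Hypothetical helper for illustration only; it is not a SenseClusters API.
    Returns (instances, key), where key[i] is the original name for instances[i].
    """
    pattern = re.compile("|".join(re.escape(name) for name in names))
    instances, key = [], []
    for text in contexts:
        found = set(pattern.findall(text))
        # Keep only contexts that mention exactly one of the names,
        # so the key assignment is unambiguous.
        if len(found) != 1:
            continue
        original = found.pop()
        # Mark the first mention as the target, then mask any later mentions
        # so the original name does not leak into the context.
        masked = pattern.sub(f"<head>{placeholder}</head>", text, count=1)
        masked = pattern.sub(placeholder, masked)
        instances.append(masked)
        key.append(original)
    return instances, key


if __name__ == "__main__":
    sample = [
        "George Bush spoke in Ohio. George Bush then left.",
        "John Kerry visited Boston.",
        "George Bush met John Kerry.",
        "No names here.",
    ]
    instances, key = make_pseudo_name_data(sample, ["George Bush", "John Kerry"])
    for instance, answer in zip(instances, key):
        print(answer, "|", instance)
```

The key list plays the role of the answer key mentioned above: after clustering the masked instances, each cluster can be compared against the names recorded here.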
Target words are found within the <head> </head> tag, and they can be thought of as defining the "center" of the context that is being clustered. In other words, we are interested in determining the contexts in which a particular target word occurs. SenseClusters supports very flexible token handling, so you could have target word forms in your instances such as:

    John S.
    John Smith
    Mr. Smith
    Mr. John Smith

and so on. Thus, you don't need to regularize or normalize the names before running; you can process them as they appear in a corpus.

We'll pursue this a bit more, but we wanted to mention that this is something we are now using SenseClusters for, and to let you know that we should have some sample data for you to experiment with in the near future.

Let us know what you think about this, especially if you know where we can get any "standard" sets of name disambiguation data, or if you have done name disambiguation/discrimination using other techniques.

Cordially,
Ted

Download SenseClusters from: http://senseclusters.sourceforge.net
(The most current version is 0.51)

--
Ted Pedersen
http://www.d.umn.edu/~tpederse
