Hi Scott,

I think in general you would not be able to build something comparable
to WordNet via automated means (be it SenseClusters or anything else).
Automatic construction of WordNets from raw data remains an important
but as yet unsolved problem in Natural Language Processing.

I'd suggest trying both SenseRelate and SenseClusters out on English
to see whether a knowledge-based (SenseRelate) or distributional
(SenseClusters) approach is better for your purposes, and then
investigating options for the other languages you might want to work
with.

SenseClusters will work with text from pretty much any language,
whereas SenseRelate will only work with English. However, SenseRelate
takes advantage of the word meanings given in WordNet and deals
directly with those meanings, while SenseClusters instead clusters
your raw text, from which you can draw various conclusions. It's a
nice example of the contrast between the deeper but more limited
coverage of knowledge-based methods and the broader coverage of
shallower data-based methods.
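
To make that contrast concrete, here is a rough toy sketch in Python. It
is not how either package is implemented: the two-sense GLOSSES table is
a made-up stand-in for WordNet, the tagger is a simplified Lesk-style
gloss overlap, and the clustering is a plain greedy bag-of-words
grouping. All function names and the 0.2 threshold are my own
assumptions for illustration.

```python
from collections import Counter
import math

# Hypothetical mini sense inventory standing in for WordNet glosses.
GLOSSES = {
    "bank#1": "financial institution that accepts deposits and lends money",
    "bank#2": "sloping land beside a river or lake",
}
STOP = {"the", "a", "of", "and", "or", "that", "to", "in", "on", "at", "i", "we"}


def tokens(text, target=None):
    """Lowercase, split on whitespace, drop stop words and the target word."""
    return [w for w in text.lower().split() if w not in STOP and w != target]


def knowledge_based_tag(context, target="bank"):
    """Pick the sense whose gloss shares the most words with the context."""
    ctx = set(tokens(context, target))
    return max(GLOSSES, key=lambda sense: len(ctx & set(tokens(GLOSSES[sense]))))


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def distributional_clusters(contexts, target="bank", threshold=0.2):
    """Greedy one-pass clustering of contexts; needs no sense inventory."""
    clusters = []  # each entry: (centroid Counter, list of context indices)
    for i, text in enumerate(contexts):
        vec = Counter(tokens(text, target))
        best = max(clusters, key=lambda c: cosine(vec, c[0]), default=None)
        if best is not None and cosine(vec, best[0]) >= threshold:
            best[0].update(vec)   # fold this context into the centroid
            best[1].append(i)
        else:
            clusters.append((vec, [i]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    contexts = [
        "i deposited money at the bank",
        "the bank lends money to customers",
        "we fished on the river bank",
        "the river bank was muddy",
    ]
    for text in contexts:
        print(knowledge_based_tag(text), "<-", text)
    print(distributional_clusters(contexts))
```

The tagger returns a named sense but only for words that are in the
inventory; the clustering works on any language's text but only gives
you unlabeled groups of contexts to interpret yourself.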

I know this doesn't really answer your question, other than to say
that I think many people would like to be able to do what you propose.
There is research on those topics; it's just pretty tough. :)

Cordially,
Ted

On Fri, Dec 5, 2008 at 1:36 PM, Scott Salley
<[EMAIL PROTECTED]> wrote:
>
> If I understand correctly, the WordNet::SenseRelate::AllWords is based on
> the information already entered (by humans) in WordNet. I'll use that to
> evaluate the effectiveness of decreasing perplexity (increasing speed).
>
> But if that is successful, I'm going to want to work with a bunch of other
> languages (around 20).
>
> Could I use the headed format, for each word, to build something like
> WordNet for each language, then use something like
> WordNet::SenseRelate::AllWords based on that data to tag words with
> meanings?
>
> What are the time/space requirements for using the headed format to determine
> senses for a given word? ('bank', for example, appears in around 5000 blog
> posts out of around 1 million blog posts (12 million sentences in total).)
>
>
>
>> Date: Fri, 5 Dec 2008 11:30:46 -0600
>> From: [EMAIL PROTECTED]
>> To: [email protected]
>> Subject: Re: [Senseclusters-users] How to tag large amounts of text.
>>
>> Hi Scott,
>>
>> Sounds like an interesting project - in the headless format, the goal
>> of SenseClusters is to cluster your blog posts based on their
>> contextual similarity (rather than tagging individual words with
>> meanings). In the headed format, the goal is to cluster a particular
>> word or phrase based on its contextual similarity with other
>> occurrences of that same word or phrase. Neither of these sounds like
>> what you want to do (at least as I understand it...)
>>
>> If you want to tag words with meanings, you might want to try out
>> WordNet::SenseRelate::AllWords.
>>
>> http://senserelate.sourceforge.net
>>
>> I hope this helps. Let me know if I've misunderstood something too...
>>
>> Cordially,
>> Ted
>>
>> On Fri, Dec 5, 2008 at 11:15 AM, Scott Salley
>> <[EMAIL PROTECTED]> wrote:
>> > I have gigabytes of text (blog posts) I am using for creating a statistical
>> > language model (SLM) for an embedded LVCSR -- speech recognition on a cell
>> > phone for writing email messages or sms.
>> >
>> > I would like to tag the words in this text with identifiers to distinguish
>> > meanings of words and hopefully result in a lower perplexity score for the
>> > SLM.
>> >
>> > As a first experiment I called discriminate.pl on a 1gig portion of this
>> > text (converted to the senseval headless format) so I could start matching
>> > documentation with reality. This text seems like it's going to require more
>> > resources to process than I'd like to devote.
>> >
>> > Can someone suggest how I should go about tagging words in a large corpus?
>> > I'm working my way through the documentation, but that is going slowly.
>> >
>> >
>> > Note that I'm willing to share the data (I got it from the web in the first
>> > place), but I don't have bandwidth for allowing everyone to download it. I
>> > archived it as directories for each blogger (from the US), each blog post as
>> > a file. I'm not sure of the actual amount of English text, but the fraction
>> > I use for experiments is 1gig and I used around 1/5 of the data.
>> >
>> >
>> >
>> > ------------------------------------------------------------------------------
>> > SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas,
>> > Nevada.
>> > The future of the web can't happen without you. Join us at MIX09 to help
>> > pave the way to the Next Web now. Learn more and register at
>> >
>> > http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
>> > _______________________________________________
>> > senseclusters-users mailing list
>> > [email protected]
>> > https://lists.sourceforge.net/lists/listinfo/senseclusters-users
>> >
>> >
>>
>>
>>
>> --
>> Ted Pedersen
>> http://www.d.umn.edu/~tpederse
>>
>>
>
>
>



-- 
Ted Pedersen
http://www.d.umn.edu/~tpederse

