If I understand correctly, WordNet::SenseRelate::AllWords is based on the
information already entered (by humans) in WordNet. I'll use it to evaluate
how effective sense tagging is at decreasing perplexity (increasing speed).
But if that is successful, I'm going to want to work with a number of other
languages (around 20).
Could I use the headed format, for each word, to build something like
WordNet for each language, and then use something like
WordNet::SenseRelate::AllWords on that data to tag words with
meanings?
What are the time/space requirements for using the headed format to determine
senses for a given word? ('bank', for example, appears in around 5000 blog posts
out of around 1 million blog posts (12 million sentences in total).)
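To make the question concrete, here is a minimal sketch in Python of the per-word extraction step I have in mind: stream over sentences, keep only those containing the target word, and mark that occurrence as the head. The function names (mark_head, headed_contexts) are hypothetical and not part of SenseClusters, and the <head> markup is only meant to suggest the headed format, not to reproduce the exact Senseval-2 schema that SenseClusters expects.

```python
import re
from typing import Iterable, Iterator, Optional


def mark_head(sentence: str, target: str) -> Optional[str]:
    """Wrap the first whole-word occurrence of `target` in <head> tags.

    Returns None when the sentence does not contain the target word.
    Matching is case-insensitive and respects word boundaries, so
    'riverbank' does not count as an occurrence of 'bank'.
    """
    pattern = re.compile(r"\b" + re.escape(target) + r"\b", re.IGNORECASE)
    match = pattern.search(sentence)
    if match is None:
        return None
    return (
        sentence[: match.start()]
        + "<head>" + match.group(0) + "</head>"
        + sentence[match.end():]
    )


def headed_contexts(sentences: Iterable[str], target: str) -> Iterator[str]:
    """Yield head-marked contexts one at a time (hypothetical helper).

    Because this is a generator, the corpus is never held in memory:
    only the current sentence is.
    """
    for sentence in sentences:
        marked = mark_head(sentence, target)
        if marked is not None:
            yield marked


if __name__ == "__main__":
    sample = [
        "I went to the bank today.",
        "The riverbank was muddy.",
        "Bank fees are high.",
    ]
    for context in headed_contexts(sample, "bank"):
        print(context)
```

This step is a single linear pass over the corpus with constant memory, so it should not be the bottleneck; what I can't estimate is the cost of clustering the resulting contexts (roughly 5000 posts' worth for 'bank').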
> Date: Fri, 5 Dec 2008 11:30:46 -0600
> From: [EMAIL PROTECTED]
> To: [email protected]
> Subject: Re: [Senseclusters-users] How to tag large amounts of text.
>
> Hi Scott,
>
> Sounds like an interesting project - in the headless format, the goal
> of SenseClusters is to cluster your blog posts based on their
> contextual similarity (rather than tagging individual words with
> meanings). In the headed format, the goal is to cluster a particular
> word or phrase based on its contextual similarity with other
> occurrences of that same word or phrase. Neither of these sound like
> what you want to do (at least as I understand it...)
>
> If you want to tag words with meanings, you might want to try out
> WordNet::SenseRelate::AllWords.
>
> http://senserelate.sourceforge.net
>
> I hope this helps. Let me know if I've misunderstood something too...
>
> Cordially,
> Ted
>
> On Fri, Dec 5, 2008 at 11:15 AM, Scott Salley
> <[EMAIL PROTECTED]> wrote:
> > I have gigabytes of text (blog posts) I am using for creating a statistical
> > language model (SLM) for an embedded LVCSR -- speech recognition on a cell
> > phone for writing email messages or sms.
> >
> > I would like to tag the words in this text with identifiers to distinguish
> > meanings of words and hopefully result in a lower perplexity score for the
> > SLM.
> >
> > As a first experiment I called discriminate.pl on a 1gig portion of this
> > text (converted to the senseval headless format) so I could start matching
> > documentation with reality. This text seems like it's going to require more
> > resources to process than I'd like to devote.
> >
> > Can someone suggest how I should go about tagging words in a large corpus?
> > I'm working my way through the documentation, but that is going slowly.
> >
> >
> > Note that I'm willing to share the data (I got it from the web in the first
> > place), but I don't have bandwidth for allowing everyone to download it. I
> > archived it as directories for each blogger (from the US), each blog post as
> > a file. I'm not sure of the actual amount of English text, but the fraction
> > I use for experiments is 1gig and I used around 1/5 of the data.
> >
> >
> > ------------------------------------------------------------------------------
> > SF.Net email is Sponsored by MIX09, March 18-20, 2009 in Las Vegas, Nevada.
> > The future of the web can't happen without you. Join us at MIX09 to help
> > pave the way to the Next Web now. Learn more and register at
> > http://ad.doubleclick.net/clk;208669438;13503038;i?http://2009.visitmix.com/
> > _______________________________________________
> > senseclusters-users mailing list
> > [email protected]
> > https://lists.sourceforge.net/lists/listinfo/senseclusters-users
> >
> >
>
>
>
> --
> Ted Pedersen
> http://www.d.umn.edu/~tpederse
>