I have gigabytes of text (blog posts) that I am using to build a statistical 
language model (SLM) for an embedded LVCSR system -- speech recognition on a 
cell phone for writing email messages or SMS.

I would like to tag the words in this text with identifiers that distinguish 
the different meanings of a word, hopefully resulting in a lower perplexity 
score for the SLM.
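(For concreteness, here is the metric I mean. This is just a minimal sketch of
add-alpha smoothed unigram perplexity in Python, not part of my toolchain; the
function name and smoothing choice are mine. The idea is that if sense-tagged
tokens are more predictable, this number goes down.)

```python
import math
from collections import Counter

def unigram_perplexity(train_tokens, test_tokens, vocab_size, alpha=1.0):
    """Perplexity of an add-alpha smoothed unigram model on test_tokens.

    Lower is better: sense tagging helps if the tagged tokens make the
    test text more predictable under the trained model.
    """
    counts = Counter(train_tokens)
    total = len(train_tokens)
    log_prob = 0.0
    for tok in test_tokens:
        # Add-alpha (Laplace when alpha=1) smoothed probability of this token.
        p = (counts[tok] + alpha) / (total + alpha * vocab_size)
        log_prob += math.log2(p)
    # Perplexity = 2 ** (average negative log2 probability per token).
    return 2 ** (-log_prob / len(test_tokens))
```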

As a first experiment I ran discriminate.pl on a 1 GB portion of this text 
(converted to the headless Senseval format) so I could start matching the 
documentation against reality. Processing this much text seems to require 
more resources than I'd like to devote.
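(My conversion step is roughly the following sketch. The exact tag and
attribute names are my reading of Senseval-2-style examples, with no <head>
tag marking a target word -- hence "headless" -- so please check them against
the SenseClusters documentation before trusting this; the function name and
the `lexelt` value are placeholders of mine.)

```python
import html

def to_headless_senseval(posts, lexelt="blog.text"):
    """Wrap raw blog posts in Senseval-2-style instance/context markup,
    with no <head> tag (the 'headless' variant).

    NOTE: tag names and attributes here are assumptions based on
    Senseval-2 examples, not taken from the SenseClusters docs.
    """
    out = ['<corpus lang="en">', '<lexelt item="%s">' % lexelt]
    for i, text in enumerate(posts):
        out.append('<instance id="%s.%d">' % (lexelt, i))
        out.append('<context>')
        # Escape &, <, > so the raw post text cannot break the markup.
        out.append(html.escape(text))
        out.append('</context>')
        out.append('</instance>')
    out.append('</lexelt>')
    out.append('</corpus>')
    return '\n'.join(out)
```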

Can someone suggest how I should go about tagging words in a corpus this 
large? I'm working my way through the documentation, but that is going slowly.


Note that I'm willing to share the data (I got it from the web in the first 
place), but I don't have the bandwidth to let everyone download it. It is 
archived as one directory per blogger (all from the US), with each blog post 
as a separate file. I'm not sure of the total amount of English text, but the 
1 GB I use for experiments is around 1/5 of the data.


_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users