I have gigabytes of text (blog posts) that I am using to build a statistical
language model (SLM) for an embedded LVCSR system -- speech recognition on a
cell phone for writing email or SMS messages.
I would like to tag the words in this text with sense identifiers to
distinguish word meanings, and hopefully end up with a lower perplexity score
for the SLM.
As a first experiment I ran discriminate.pl on a 1 GB portion of this text
(converted to the headless Senseval format) so I could start matching the
documentation against reality. Processing this much text seems to require more
resources than I'd like to devote.
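For concreteness, here is roughly how I did the conversion step. This is a
minimal sketch under my own assumptions about the headless layout -- one
<instance> with a <context> per blog post, no <answer> or <head> tags -- and
all function and element names here are mine, not from the SenseClusters
documentation:

```python
import os
from xml.sax.saxutils import escape

def posts_to_headless_senseval(root_dir, out_path, lexelt="blogpost"):
    """Wrap each blog-post file as one headless Senseval-2-style instance:
    a <context> holding the raw text, with no <answer> and no <head> tags.
    Assumes the layout described below: root_dir/<blogger>/<post-file>.
    Returns the number of instances written."""
    n = 0
    with open(out_path, "w", encoding="utf-8") as out:
        out.write('<corpus lang="english">\n<lexelt item="%s">\n' % lexelt)
        for blogger in sorted(os.listdir(root_dir)):
            blogger_dir = os.path.join(root_dir, blogger)
            if not os.path.isdir(blogger_dir):
                continue
            for post in sorted(os.listdir(blogger_dir)):
                with open(os.path.join(blogger_dir, post),
                          encoding="utf-8", errors="replace") as f:
                    text = f.read().strip()
                if not text:
                    continue  # skip empty posts
                n += 1
                # escape() keeps &, <, > from breaking the XML
                out.write('<instance id="%s.%d">\n<context>\n%s\n'
                          '</context>\n</instance>\n'
                          % (lexelt, n, escape(text)))
        out.write("</lexelt>\n</corpus>\n")
    return n
```

The output grows roughly linearly with the input, so on a 1 GB slice the
conversion itself is cheap; it is the clustering afterwards that is expensive.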
Can someone suggest how I should go about sense-tagging words in a large
corpus? I'm working my way through the documentation, but progress is slow.
Note that I'm willing to share the data (I got it from the web in the first
place), but I don't have the bandwidth to let everyone download it. It is
archived as one directory per blogger (all from the US), with each blog post
in its own file. I'm not sure of the total amount of English text, but the
portion I use for experiments is 1 GB, which is around 1/5 of the data.
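In case it helps, this is the kind of thing I mean by taking 1/5 of the data:
a stable per-blogger sample, so the same subset comes out on every run. The
function name and the hash-bucket trick are just my sketch, nothing from any
toolkit:

```python
import hashlib
import os

def sample_bloggers(root_dir, fraction=0.2):
    """Return a stable ~fraction of the blogger directories under root_dir.
    Each directory name is hashed into one of 100 buckets, and a blogger is
    kept when its bucket falls below fraction*100, so the selection is
    deterministic across runs and machines."""
    chosen = []
    for blogger in sorted(os.listdir(root_dir)):
        if not os.path.isdir(os.path.join(root_dir, blogger)):
            continue
        bucket = int(hashlib.md5(blogger.encode("utf-8")).hexdigest(), 16) % 100
        if bucket < fraction * 100:
            chosen.append(blogger)
    return chosen
```

Sampling whole bloggers rather than individual posts keeps each author's
posts together, which matters if the experiments are per-writer.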
_______________________________________________
senseclusters-users mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/senseclusters-users