hi joseph, i'm very much interested in stuff like that. although i'm not a mahout guru, i'd be very glad to have a working sample, because i can see very useful things...
i'm working with large thesauri in skos-format and am sure i could use working solutions in a couple of projects.

keep up
wkr
www.turnguard.com/turnguard

----- Original Message ----
From: Joseph Turian <[email protected]>
To: [email protected]
Sent: Sat, November 6, 2010 3:11:38 AM
Subject: Mahout to find semantically related terms over a large vocabulary (>1M)?

I'm organizing a bakeoff, if you want to show off some Mahout skills and do a controlled comparison of Mahout to other people's approaches:

Let's say I have several hundred million documents, which are very short (only a few words). There are several million terms in the vocabulary. What is the fastest way to find the top-k semantically related terms for each term in the vocabulary?

If you just want to hear the results, join this group:
http://groups.google.com/group/metaoptimize-challenge-announce

If you actually want to hack some data, read this blog post:
http://metaoptimize.com/blog/2010/11/05/nlp-challenge-find-semantically-related-terms-over-a-large-vocabulary-1m/

It would be really cool to see participation from the Mahout community in a Mahout demo, to get a controlled comparison to other implementations.

Best,
Joseph
