You could try using Apache Mahout to cluster the messages into groups of
similar ones based on text features. That should be doable. Given the
groups, you could manually extract questions (the clusters with the most
threads are probably the most frequently asked). Also, if you get this to
work well, it could be a useful tool for other projects. It would be a fun
exercise anyway...
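
To make the clustering idea a bit more concrete, here is a rough sketch in
plain Python. It is not Mahout and does not use any Mahout API: it swaps in a
simple greedy "leader" clustering over TF-IDF cosine similarity as a stand-in
for whatever algorithm Mahout would run. The function names, the 0.3
threshold, and the sample messages are all made up for illustration. Lines
starting with ">" are skipped so that quoted text from older messages does not
dominate the features.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens, ignoring quoted lines (those starting with '>')."""
    lines = [ln for ln in text.splitlines() if not ln.lstrip().startswith(">")]
    return re.findall(r"[a-z0-9]+", " ".join(lines).lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict of term -> weight) per document."""
    token_lists = [tokenize(d) for d in docs]
    # Document frequency: in how many messages does each term appear?
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    n = len(docs)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        # Smoothed IDF so that terms present in every message keep some weight.
        vectors.append({
            t: c * (math.log((1 + n) / (1 + doc_freq[t])) + 1.0)
            for t, c in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    return dot / (norm_a * norm_b)


def cluster_messages(docs, threshold=0.3):
    """Greedy leader clustering (hypothetical stand-in, not a Mahout algorithm).

    Each message joins the cluster whose first member (the leader) is most
    similar to it, provided the similarity reaches the threshold; otherwise it
    starts a new cluster. Returns lists of message indices, largest first.
    """
    vectors = tfidf_vectors(docs)
    clusters = []
    for i, vec in enumerate(vectors):
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, vectors[cluster[0]])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    # Biggest clusters first: these are the candidate frequently asked topics.
    return sorted(clusters, key=len, reverse=True)


# Made-up sample messages, for illustration only.
messages = [
    "How do I increase the region server heap size?",
    "What is the right region server heap size setting?",
    "Compaction is slow on my table",
    "Why is major compaction so slow on a big table?",
    "How to set region server heap size",
]
clusters = cluster_messages(messages)
for cluster in clusters:
    print(len(cluster), [messages[i] for i in cluster])
```

The largest clusters come out first, so someone could read the top few by
hand and write the FAQ entries from them. On a real archive the threshold
would need tuning, and a proper algorithm such as k-means would likely give
better groups than this single pass.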

I am starting to toy with Mahout for another pet project. Once I get more 
comfortable with it, I might be able to take this on (not a promise).

I think automatic question extraction is quite an ambitious goal.

Friso



On 1 Mar 2011, at 19:12, Stack wrote:

> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
> <[email protected]> wrote:
>>> Do you have something in mind?  Could we be making better use of the
>>> sematext summaries?
>> 
>> Hm... we already index HBase and other Digests on search-hadoop.com.
>> I was thinking more along the lines of mining the ML archives and doing
>> automatic Q&A extraction.
>> I don't know how difficult it would be.  Maybe the input would be too noisy
>> (people don't ask proper questions, answers are not full sentences, quote
>> characters prefixing lines from old messages add a layer of complexity...),
>> but that's what I thought you might have meant.
>> 
> 
> That'd be a nice addition to the docs.  Our FAQ is in need of
> updating.  This would be a nice undertaking if someone was up for
> taking it on.
> St.Ack
