I found it.  The student in question was named Stefan Henß.  See here for
details:
http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[email protected]%3E

The results were surprisingly good given how simple the techniques
are.


On Wed, Mar 2, 2011 at 12:39 PM, Ted Dunning <[email protected]> wrote:

> I have looked but can't find the postings by a student who recently posted
> about their FAQ extraction program.  The results were pretty good in terms
> of precision and the extracted answers were very nice.  The methods used
> were quite simple.
>
> Does anybody else remember this interchange?  Did it not occur here?  Did I
> imagine it?
>
> On Wed, Mar 2, 2011 at 12:30 PM, Andrew Look <[email protected]> wrote:
>
>> Is there any easy way to export this data from sematext / stack overflow?
>> Or is web crawling/scraping the way to go here?
>>
>> This is a good use case for Mahout. I've been looking for a problem to
>> play around with in Mahout :)
>>
>>
>> On 3/2/11 1:05 AM, "Friso van Vollenhoven" <[email protected]>
>> wrote:
>>
>> > You could try using Apache Mahout to at least cluster the messages into
>> > groups of similar ones based on text features. That should be doable.
>> > Given the groups, you could manually extract questions (the clusters with
>> > the most threads could be the most frequently asked). Also, if you manage
>> > to get this to work nicely, it could be a nice tool for other projects as
>> > well. Would be a fun exercise anyway...
>> >
>> > I am starting to toy with Mahout for another pet project. Once I get more
>> > comfortable with it, I might be able to take this on (not a promise).
>> >
>> > I think automatic question extraction is quite an ambitious goal.
>> >
>> > Friso
>> >
>> >
>> >
>> > On 1 Mar 2011, at 19:12, Stack wrote:
>> >
>> >> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> >> <[email protected]> wrote:
>> >>>> Do you have something in mind?  Could we be making better use of the
>> >>>> sematext summaries?
>> >>>
>> >>> Hm... we already index HBase and other Digests on search-hadoop.com.
>> >>> I was thinking more along the lines of mining the ML archives and doing
>> >>> automatic Q&A extraction.
>> >>> I don't know how difficult it would be.  Maybe the input would be too
>> >>> noisy (people don't ask proper questions, answers are not full sentences,
>> >>> quote characters prefixing lines from old messages add a layer of
>> >>> complexity...), but that's what I thought you might have meant.
>> >>>
>> >>
>> >> That'd be a nice addition to the docs.  Our FAQ is in need of
>> >> updating.  This would be a nice undertaking if someone was up for
>> >> taking it on.
>> >> St.Ack
>> >
>> >
>>
>>
>
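
The quote-prefix noise Otis mentions (lines from old messages prefixed with
`>` characters) is one part of the problem that can be sketched quickly. What
follows is only a rough illustration, assuming plain-text bodies that use `>`
as the sole quote marker. The names `split_quote` and `new_text` are
hypothetical helpers invented for this sketch; they are not part of Mahout or
of any mail library.

```python
import re

# A leading run of '>' quote markers, each optionally followed by whitespace.
# Assumption: '>' is the only quote marker used in the archive.
QUOTE_PREFIX = re.compile(r"^(?:>\s*)+")


def split_quote(line):
    """Return (depth, text): the number of leading '>' markers and the remainder."""
    match = QUOTE_PREFIX.match(line)
    if not match:
        return 0, line
    prefix = match.group(0)
    return prefix.count(">"), line[len(prefix):]


def new_text(body):
    """Keep only unquoted lines (depth 0), i.e. what the sender wrote in this message."""
    return [text for depth, text in map(split_quote, body.splitlines()) if depth == 0]
```

Calling `new_text` on a message body drops everything inherited from earlier
messages, which is probably the text you would want to feed into clustering.
A known limitation: an unquoted line that happens to start with `>` (for
example `>= 5`) is misread as quoted text.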
