I found it. The student in question was named Stefan Henß. See here for details: http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[email protected]%3E
The results were quite surprisingly good for how simple the techniques used are.

On Wed, Mar 2, 2011 at 12:39 PM, Ted Dunning <[email protected]> wrote:

> I have looked but can't find the postings by a student who recently posted
> about their FAQ extraction program. The results were pretty good in terms
> of precision and the extracted answers were very nice. The methods used
> were quite simple.
>
> Does anybody else remember this interchange? Did it not occur here? Did I
> imagine it?
>
> On Wed, Mar 2, 2011 at 12:30 PM, Andrew Look <[email protected]> wrote:
>
>> Is there any easy way to export this data from sematext / stack overflow?
>> Or is web crawling/scraping the way to go here?
>>
>> This is a good use case for Mahout, I've been looking for a problem to
>> play around on mahout with :)
>>
>>
>> On 3/2/11 1:05 AM, "Friso van Vollenhoven" <[email protected]>
>> wrote:
>>
>> > You could try using Apache Mahout to at least cluster the messages
>> > into groups of similar ones based on text features. That should be
>> > doable. Given the groups, you could manually extract questions (the
>> > clusters with most threads could be the most frequently asked). Also,
>> > if you manage to get this to work nicely, it could be a nice tool for
>> > other projects as well. Would be a fun exercise anyways...
>> >
>> > I am starting to toy with Mahout for another pet project. Once I get
>> > more comfortable with it, I might be able to take this on (not a
>> > promise).
>> >
>> > I think automatic question extraction is a quite ambitious goal.
>> >
>> > Friso
>> >
>> >
>> >
>> > On 1 mrt 2011, at 19:12, Stack wrote:
>> >
>> >> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> >> <[email protected]> wrote:
>> >>>> Do you have something in mind? Could we be making better use of the
>> >>>> sematext summaries?
>> >>>
>> >>> Hm... we already index HBase and other Digests on search-hadoop.com.
>> >>> I was thinking more along the lines of mining the ML archives and
>> >>> doing automatic Q&A extraction.
>> >>> I don't know how difficult it would be. Maybe the input would be too
>> >>> noisy (people don't ask proper questions, answers are not full
>> >>> sentences, quote characters prefixing lines from old messages add a
>> >>> layer of complexity...), but that's what I thought you might have
>> >>> meant.
>> >>>
>> >>
>> >> That'd be a nice addition to the docs. Our FAQ is in need of
>> >> updating. This would be a nice undertaking if someone was up for
>> >> taking it on.
>> >> St.Ack
