Interesting, faqcluster.com does appear to be a successful application of Mahout. Categorizing by question + all replies seems like a smart approach.
Do you think that choosing the best answer based on the cosine similarity of included terms is the best way to go? Is choosing a single answer even the best approach? It seems that in many cases, a coherent answer to a question only emerges after a number of people have replied to the question at hand.

For instance, at http://faqcluster.com/question-521113443 :

* Q: Is there a SVM classifier implemented in Mahout?
* A: See also o.a.m.classifier.sgd.TrainNewsGroups

Meanwhile in the source conversation, a number of useful pieces of information (even additional questions) are divulged between the question and faqcluster’s chosen answer: http://lucene.472066.n3.nabble.com/SVM-classifier-td2028948.html

* Q1: Is there a SVM classifier implemented in Mahout?
* A1: No. But the SGD classifier should have similar characteristics. There is also a rough draft of an SVM implementation available as a patch.
* Q2: Where can I know more about SGD classifier? mahout wiki did not help :(
* A2i: https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression Sorry about that. What queries did you use?
* A2ii: See also o.a.m.classifier.sgd.TrainNewsGroups

Looking at this, I think that an interesting approach to extracting the most useful information from a thread would be to take the original question and all replies, and form an adjacency matrix:

Match(Q1, A1) = (SVM, classifier)
Match(Q1, Q2) = (classifier, mahout)
Match(A1, Q2) = (SGD, classifier)
...

This way coherent responses could be chained together in order to aggregate more useful information, while people replying on tangents or spamming would tend to get left out. Thoughts on how Mahout might help create such an adjacency matrix? Obviously cosine similarity would still give the distances between each pair of replies in a given thread, but it seems like having some way of weighting each term’s specificity would help too, i.e.
SGD or SVM are more specific than classifier, and classifier is more specific than Mahout, since we’re looking at the Mahout mailing list...

- Andrew

On 3/14/11 6:39 PM, "Ted Dunning" <[email protected]> wrote:

I found it. The student in question was named Stefan Henß. See here for details:

http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[email protected]%3E

The results were surprisingly good for how simple the techniques used are.

On Wed, Mar 2, 2011 at 12:39 PM, Ted Dunning <[email protected]> wrote:

I have looked but can't find the postings by a student who recently posted about their FAQ extraction program. The results were pretty good in terms of precision, and the extracted answers were very nice. The methods used were quite simple.

Does anybody else remember this interchange? Did it not occur here? Did I imagine it?

On Wed, Mar 2, 2011 at 12:30 PM, Andrew Look <[email protected]> wrote:

Is there an easy way to export this data from Sematext / Stack Overflow? Or is web crawling/scraping the way to go here? This is a good use case for Mahout; I've been looking for a problem to play around with Mahout on :)

On 3/2/11 1:05 AM, "Friso van Vollenhoven" <[email protected]> wrote:

> You could try using Apache Mahout to at least cluster the messages into groups
> of similar ones based on text features. That should be doable. Given the
> groups, you could manually extract questions (the clusters with the most threads
> could be the most frequently asked). Also, if you manage to get this to work
> nicely, it could be a nice tool for other projects as well. Would be a fun
> exercise anyways...
>
> I am starting to toy with Mahout for another pet project. Once I get more
> comfortable with it, I might be able to take this on (not a promise).
>
> I think automatic question extraction is a quite ambitious goal.
>
> Friso
>
>
> On 1 Mar 2011, at 19:12, Stack wrote:
>
>> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> <[email protected]> wrote:
>>>> Do you have something in mind? Could we be making better use of the
>>>> sematext summaries?
>>>
>>> Hm... we already index HBase and other Digests on search-hadoop.com .
>>> I was thinking more along the lines of mining the ML archives and doing
>>> automatic Q&A extraction.
>>> I don't know how difficult it would be. Maybe the input would be too noisy
>>> (people don't ask proper questions, answers are not full sentences, quote
>>> characters prefixing lines from old messages add a layer of complexity...),
>>> but that's what I thought you might have meant.
>>
>> That'd be a nice addition to the docs. Our FAQ is in need of
>> updating. This would be a nice undertaking if someone was up for
>> taking it on.
>> St.Ack
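Friso's suggestion above, clustering the messages into groups of similar ones based on text features, is the kind of job Mahout's clustering support (e.g. k-means or canopy clustering) handles at scale. As a toy, single-machine illustration of the idea only: the sample questions and the 0.3 similarity threshold below are made up for this sketch, and it uses raw term counts rather than anything Mahout prescribes.

```python
import math
import re
from collections import Counter

# Made-up sample messages; a real run would use the mailing-list archive.
msgs = [
    "Is there a SVM classifier implemented in Mahout?",
    "Is there an SVM implementation in Mahout?",
    "How do I run k-means clustering on text documents?",
    "What input format does k-means clustering expect?",
]

def vec(text):
    # Raw term counts; no idf weighting, for brevity.
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    dot = sum(c * b.get(t, 0) for t, c in a.items())
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Greedy canopy-style pass: a message joins the first cluster whose seed
# it resembles closely enough, otherwise it seeds a new cluster.
T = 0.3
clusters = []  # list of (seed_vector, member_indices)
for i, m in enumerate(msgs):
    v = vec(m)
    for seed, members in clusters:
        if cosine(v, seed) >= T:
            members.append(i)
            break
    else:
        clusters.append((v, [i]))
```

With these toy inputs the two SVM questions land in one cluster and the two k-means questions in another; the largest clusters would then be the candidates for the most frequently asked questions, as Friso suggests.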

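Andrew's proposal above (pairwise cosine similarity over the replies in a thread, with each term weighted by its specificity) can be sketched in a few lines. This is a minimal illustration in Python rather than Mahout's Java API: the message texts are paraphrased from the SVM thread, and the tokenizer, the log(N/df)+1 idf variant, and the 0.1 chaining threshold are arbitrary choices for the sketch.

```python
import math
import re
from collections import Counter

# Paraphrased from the SVM thread above: Q1, A1, Q2, A2.
messages = [
    "Is there a SVM classifier implemented in Mahout?",
    "No, but the SGD classifier should have similar characteristics.",
    "Where can I learn more about the SGD classifier? The Mahout wiki did not help.",
    "See the logistic regression page, and also o.a.m.classifier.sgd.TrainNewsGroups.",
]

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

docs = [Counter(tokenize(m)) for m in messages]
n = len(docs)

# Specificity weighting: rare terms (svm, sgd) get a higher idf than
# terms common across the whole thread (classifier, mahout).
df = Counter()
for d in docs:
    df.update(d.keys())
idf = {t: math.log(n / df[t]) + 1.0 for t in df}

def tfidf(d):
    return {t: c * idf[t] for t, c in d.items()}

def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

vecs = [tfidf(d) for d in docs]

# The adjacency matrix over all replies in the thread.
adj = [[cosine(vecs[i], vecs[j]) for j in range(n)] for i in range(n)]

# Chain pairs whose similarity clears a threshold; tangents and spam,
# being dissimilar to everything else, tend to fall below it.
THRESHOLD = 0.1
chains = [(i, j) for i in range(n) for j in range(i + 1, n)
          if adj[i][j] > THRESHOLD]
```

With these four messages, the A1/Q2 pair clears the threshold on the strength of the shared specific terms (sgd, classifier), which is the chaining behavior described above; the Q1/A2 pair, linked mainly by the thread-wide terms, does not.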