Interesting, faqcluster.com does appear to be a successful application of 
Mahout. Categorizing by question plus all replies seems like a smart approach.

Do you think that choosing the best answer based on the cosine similarity of 
included terms is the best way to go?  Is choosing a single answer even the 
best approach? It seems that in many cases, a coherent answer to a question 
emerges only after a number of people have replied to the question at hand.

For instance, at http://faqcluster.com/question-521113443 :
    * Q: Is there a SVM classifier implemented in Mahout?
    * A: See also o.a.m.classifier.sgd.TrainNewsGroups

In the source conversation, however, a number of useful pieces of information 
(and even additional questions) come up between the question and 
faqcluster's chosen answer: 
http://lucene.472066.n3.nabble.com/SVM-classifier-td2028948.html
    * Q1: Is there a SVM classifier implemented in Mahout?
    * A1: No. But the SGD classifier should have similar characteristics. There 
is also a rough draft of an SVM implementation available as a patch.
    * Q2: where can I know more about SGD classifier? mahout wiki did not help :(
    * A2i: https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression  
Sorry about that.  What queries did you use?
    * A2ii: See also o.a.m.classifier.sgd.TrainNewsGroups

Looking at this, I think an interesting approach to extracting the most 
useful information from a thread would be to take the original question and 
all replies, and form an adjacency matrix of shared terms:

Match(Q1, A1) = (SVM, classifier)
Match(Q1, Q2) = (classifier, mahout)
Match(A1, Q2) = (SGD, classifier)
...

This way coherent responses could be chained together in order to aggregate 
more useful information, while people replying on tangents or spamming would 
tend to get left out.
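To make the idea concrete, here is a minimal sketch in plain Python (not Mahout); the message texts are abbreviated from the thread above, and the stopword list and tokenization are simplified assumptions:

```python
import re
from itertools import combinations

# Illustrative stopword list; a real implementation would use a fuller one.
STOPWORDS = {"is", "there", "a", "an", "the", "in", "but", "should", "have",
             "also", "of", "as", "can", "i", "about", "where", "more",
             "did", "not"}

def terms(text):
    """Lowercase, tokenize, and drop stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower())
            if t not in STOPWORDS}

# Abbreviated versions of the thread's messages.
messages = {
    "Q1": "Is there a SVM classifier implemented in Mahout?",
    "A1": "No. But the SGD classifier should have similar characteristics. "
          "There is also a rough draft of an SVM implementation available "
          "as a patch.",
    "Q2": "Where can I know more about SGD classifier? Mahout wiki did not help",
}

# Adjacency: the set of shared content terms between every pair of messages.
adjacency = {
    (a, b): terms(messages[a]) & terms(messages[b])
    for a, b in combinations(messages, 2)
}

for pair, shared in adjacency.items():
    print(pair, sorted(shared))
```

With these toy messages the printed edges mirror the Match(...) pairs above; a real version would want stemming (so "implemented" and "implementation" match) and weighted terms rather than raw set intersection.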

Thoughts on how Mahout might help create such an adjacency matrix? Obviously 
cosine similarity would still provide the distances between replies in a given 
thread, but it seems like having some way of weighting each term's specificity 
would help too, e.g. SGD or SVM are more specific than classifier, and 
classifier is more specific than Mahout, since we're looking at the Mahout 
mailing list...
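One standard way to get that specificity weighting is TF-IDF: terms that occur in fewer messages get higher weight, so a list-wide word like "mahout" contributes almost nothing. A rough sketch in plain Python (the toy corpus and the plain log-IDF formula are illustrative assumptions, not Mahout's implementation):

```python
import math
from collections import Counter

# Toy corpus standing in for one thread's messages.
docs = [
    "is there a svm classifier implemented in mahout".split(),
    "the sgd classifier in mahout should have similar characteristics".split(),
    "where can i learn more about sgd in mahout".split(),
]
n = len(docs)

def idf(term):
    """Inverse document frequency: rarer terms score higher."""
    df = sum(1 for d in docs if term in d)
    return math.log(n / df) if df else 0.0

def tfidf(doc):
    """Sparse TF-IDF vector for one message."""
    return {t: c * idf(t) for t, c in Counter(doc).items()}

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

vecs = [tfidf(d) for d in docs]
print(idf("mahout"))  # appears in every message on a Mahout list -> 0.0
print(idf("svm"))     # appears in only one message -> the highest weight
print(cosine(vecs[0], vecs[1]))
```

Under this weighting, thread-wide terms like "mahout" drop out of the pairwise distances entirely, which is exactly the effect wanted for a single-project mailing list.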

- Andrew





On 3/14/11 6:39 PM, "Ted Dunning" <[email protected]> wrote:

I found it.  The student in question was named Stefan Henß.  See here for 
details: 
http://mail-archives.apache.org/mod_mbox/mahout-user/201102.mbox/%[email protected]%3E

The results were surprisingly good given how simple the techniques used are.


On Wed, Mar 2, 2011 at 12:39 PM, Ted Dunning <[email protected]> wrote:
I have looked but can't find the postings by a student who recently posted 
about their FAQ extraction program.  The results were pretty good in terms of 
precision and the extracted answers were very nice.  The methods used were 
quite simple.

Does anybody else remember this interchange?  Did it not occur here?  Did I 
imagine it?

On Wed, Mar 2, 2011 at 12:30 PM, Andrew Look <[email protected]> wrote:
Is there any easy way to export this data from sematext / stack overflow?
Or is web crawling/scraping the way to go here?

This is a good use case for Mahout; I've been looking for a problem to try
Mahout out on :)


On 3/2/11 1:05 AM, "Friso van Vollenhoven" <[email protected]>
wrote:

> You could try using Apache Mahout to at least cluster the messages into groups
> of similar ones based on text features. That should be doable. Given the
> groups, you could manually extract questions (the clusters with most threads
> could be the most frequently asked). Also, if you manage to get this to work
> nicely, it could be a nice tool for other projects as well. Would be a fun
> exercise anyways...
>
> I am starting to toy with Mahout for another pet project. Once I get more
> comfortable with it, I might be able to take this on (not a promise).
>
> I think automatic question extraction is a quite ambitious goal.
>
> Friso
>
>
>
> On 1 Mar 2011, at 19:12, Stack wrote:
>
>> On Tue, Mar 1, 2011 at 10:03 AM, Otis Gospodnetic
>> <[email protected]> wrote:
>>>> Do you have  something in mind?  Could we be making better use of the
>>>> sematext  summaries?
>>>
>>> Hm... we already index HBase and other Digests on search-hadoop.com.
>>> I was thinking more along the lines of mining the ML archives and doing
>>> automatic Q&A extraction.
>>> I don't know how difficult it would be.  Maybe the input would be too noisy
>>> (people don't ask proper questions, answers are not full sentences, quote
>>> characters prefixing lines from old messages add a layer of complexity...),
>>> but
>>> that's what I thought you might have meant.
>>>
>>
>> That'd be a nice addition to the docs.  Our FAQ is in need of
>> updating.  This would be a nice undertaking if someone was up for
>> taking it on.
>> St.Ack
>
>



