Re: Automatically extracted Mahout FAQs

Stefan Henß Wed, 23 Feb 2011 22:56:24 -0800

Hi Bruce,

Currently the answer selection is quite simple. We assume that a sophisticated answer has a quite firm use of the domain's terminology. So if someone has a high density of terms like "mahout", "hadoop", "clustering", "svn", "classifier" in his response to a Mahout-related question, we hope he knows what he is talking about and gives straight pointers to solutions etc. A model of the domain's terminology is given by the cluster (bag of words) the question/answer was assigned to, so what we basically do is calculate the cosine similarity between the bag of words and the answer. High similarity - hopefully sophisticated. Of course there are some smaller additions, like decreasing the score if the reply is by the same user as the question, but the similarity measure is the core idea.
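
A minimal sketch of this scoring in Python, just to illustrate the idea. It is not the actual implementation: the tokenizer, the function names, the term weights in the usage note and the same_author_penalty factor of 0.5 are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase the text and keep alphanumeric runs.
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    # a and b map terms to weights (bags of words).
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_answer(cluster_terms, answer_text, same_author=False,
                 same_author_penalty=0.5):
    # cluster_terms: the cluster's bag of words (term -> weight).
    # same_author_penalty is a hypothetical factor; the mail only says
    # that the score is decreased for self-replies, not by how much.
    answer_vec = Counter(tokenize(answer_text))
    score = cosine_similarity(cluster_terms, answer_vec)
    if same_author:
        score *= same_author_penalty
    return score
```

With a cluster such as {"mahout": 3, "hadoop": 2, "clustering": 2}, a reply like "Run the Mahout clustering job on Hadoop" scores higher than "Thanks, I will try that tomorrow", which shares no terms with the cluster and scores 0.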

For the categorization we use the question as well as all replies to it as one single document, which we then give to the clustering algorithm. Assuming that the replies are not spam or the like, this gives a far more precise characterization of the terminology, simply due to the amount of text.
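
Building these documents could look roughly like the following Python sketch. The conversation layout (a dict with "question" and "replies" keys) and the function names are hypothetical, chosen only for the example; the clustering step itself is not shown.

```python
def conversation_to_document(question, replies):
    # Merge the question and all of its replies into one text, so the
    # clustering algorithm sees the whole conversation as one document.
    return "\n".join([question, *replies])


def build_corpus(conversations):
    # conversations: iterable of dicts with "question" and "replies" keys
    # (an assumed layout, not the real data format).
    return [
        conversation_to_document(c["question"], c["replies"])
        for c in conversations
    ]
```

Each entry of the resulting list is one document for the clustering step.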


On 23.02.2011 06:26, Bruce Dou wrote:

How do you find which answer is the best or most relevant?
How do you do the categorization? Based on the terms in the question?

On Wed, Feb 23, 2011 at 1:15 PM, Stefan Henß
<[email protected]>  wrote:

Hi everybody,

I'm currently doing research for my bachelor thesis on how to automatically
extract FAQs from unstructured data.

For this I've built a system automatically performing the following:
- Load thousands of conversations from forums and mailing lists (ignoring
the categories there).
- Build a categorization based solely on the conversations' texts (by
clustering).
- Pick the best modelled categories as basis for one FAQ each.
- For each question (first entry in a conversation) find the best reply from
its answers.
- Select the most relevant and well formatted question/answer-pairs for each
FAQ.

Most of the steps almost completely rely on the data from the categorization
step which is obtained using the latent Dirichlet allocation model.

For the evaluation part I'd like to ask you to have a look at one or two
FAQs and maybe give some comments on how well the questions matched the FAQ's
title, how relevant they were etc.


Here's the direct link to the Mahout FAQs: http://faqcluster.com/mahout-data

(There are some other interesting FAQs as well at http://faqcluster.com/)


Thanks for your help

Stefan
