Hi Bruce,
currently the answer selection is quite simple. We assume that a
sophisticated answer makes firm use of the domain's terminology.
So if someone has a high density of terms like "mahout", "hadoop",
"clustering", "svn", "classifier" in their response to a Mahout-related
question, we hope they know what they are talking about and give
straight pointers to solutions etc. A model of the domain's terminology
is given by the cluster (bag of words) the question/answer was assigned
to, so what we basically do is calculate the cosine similarity between
that bag of words and the answer. High similarity: hopefully
sophisticated. Of course there are some smaller additions, like
decreasing the score if the reply is by the same user who asked the
question, but the similarity measure is the core idea.
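
To make the idea concrete, here is a minimal Python sketch of such a
scoring. It is not our actual code: the function names, the simple
regex tokenizer, the example cluster weights and the penalty factor of
0.5 are all assumptions made up for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercased alphanumeric tokens; a real system would likely also
    # remove stop words and apply stemming (assumption, not shown here).
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    # a and b map term -> weight (Counter or plain dict).
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def score_answer(answer_text, cluster_terms, same_author=False, penalty=0.5):
    # cluster_terms: bag of words (term -> weight) of the cluster the
    # conversation was assigned to.
    score = cosine_similarity(Counter(tokenize(answer_text)), cluster_terms)
    # Hypothetical penalty for replies written by the question's author.
    return score * penalty if same_author else score


# Made-up cluster weights for a Mahout-related topic.
cluster = {"mahout": 3.0, "hadoop": 2.0, "clustering": 2.0, "classifier": 1.0}
good = score_answer("Use the Mahout clustering job on Hadoop", cluster)
vague = score_answer("Thanks, I will try that later", cluster)
```

With these toy inputs the terminology-dense reply scores higher than
the vague one, which shares no terms with the cluster.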
For the categorization we treat the question together with all replies
to it as one single document, which we then give to the clustering
algorithm. Assuming that the replies are not spam or the like, this
gives a far more precise characterization of the terminology, simply
due to the amount of text.
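
As a small sketch of that preprocessing step (again only an
illustration; the function name and the way a conversation is passed in
are assumptions, not our real data model):

```python
def conversation_to_document(question, replies):
    # Merge the question and all of its replies into one text, which
    # is then used as a single input document for the clustering.
    parts = [question] + list(replies)
    # Skip empty or whitespace-only entries.
    return "\n".join(p.strip() for p in parts if p and p.strip())


doc = conversation_to_document(
    "How do I run clustering on Hadoop?",
    ["Have a look at the Mahout k-means job.", "  ", "See the wiki."],
)
```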
On 23.02.2011 06:26, Bruce Dou wrote:
How do you find which answer is the best or most relevant?
How do you do the categorization? Based on the terms in the question?
On Wed, Feb 23, 2011 at 1:15 PM, Stefan Henß
<[email protected]> wrote:
Hi everybody,
I'm currently doing research for my bachelor thesis on how to automatically
extract FAQs from unstructured data.
For this I've built a system that automatically performs the following:
- Load thousands of conversations from forums and mailing lists (ignoring
the categories there).
- Build categorization solely based on the conversation's texts (by
clustering).
- Pick the best modelled categories as basis for one FAQ each.
- For each question (first entry in a conversation) find the best reply from
its answers.
- Select the most relevant and well-formatted question/answer pairs for each
FAQ.
Most of the steps rely almost completely on the data from the categorization
step, which is obtained using the latent Dirichlet allocation model.
For the evaluation part I'd like to ask you to have a look at one or two
FAQs and maybe give some comments on how well the questions matched the FAQ's
title, how relevant they were etc.
Here's the direct link to the Mahout FAQs: http://faqcluster.com/mahout-data
(There are some other interesting FAQs as well at http://faqcluster.com/)
Thanks for your help
Stefan