On 24.02.2011 08:11, Bruce Dou wrote:
On Thu, Feb 24, 2011 at 2:52 PM, Stefan Henß
<[email protected]>  wrote:
Hi Bruce,

currently the answer selection is quite simple. We assume that a
sophisticated answer makes quite firm use of the domain's terminology. So if
someone has a high density of terms like "mahout", "hadoop", "clustering",
"svn", "classifier" in his response to a Mahout-related question, we hope he
knows what he is talking about and gives straight pointers to solutions
etc. A model of the domain's terminology is given by the cluster (bag of
words) the question/answer was assigned to, so what we basically do is
calculate the cosine similarity between the bag of words and the answer.
High similarity: hopefully sophisticated. Of course there are some smaller
additions, like decreasing the score if the reply is by the same user who
asked the question, but the similarity measure is the core idea.
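
To make the idea concrete, here is a minimal sketch (in Python, with made-up
names; not the actual thesis code) of scoring an answer by cosine similarity
against the cluster's term weights, with an assumed penalty when the answer
comes from the question's author:

import math
import re
from collections import Counter

def tokenize(text):
    # Very rough tokenization: lowercase and split on non-word characters.
    return [t for t in re.split(r"\W+", text.lower()) if t]

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors (dicts).
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def score_answer(cluster_terms, answer_text, answer_author, question_author):
    # cluster_terms: Counter/dict of term weights for the FAQ/cluster the
    # question was assigned to. The 0.5 penalty factor is an assumption
    # for illustration, not a value from the thesis.
    similarity = cosine(cluster_terms, Counter(tokenize(answer_text)))
    if answer_author == question_author:
        similarity *= 0.5
    return similarity

An answer dense in cluster terms ("hadoop", "conf", ...) then scores close
to 1, while off-topic chatter scores close to 0.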
I do not think the best answer can be found this way.

For example:
Q.How to install hadoop in Linux?
A.<commands list>

And the answer may not include the terms from the question; since the
answer is written in response to the question, those terms are often omitted.
The terms are not compared with the terms in the question but with the term weights of the FAQ the question is assigned to. If such a question is frequently asked, terms like "hadoop", "install", "linux" as well as terms from the answers such as "bin", "conf", ... should have a high weight for the FAQ. Just a list of commands would score very high due to the density of important keywords (as there is no noise etc.).

But I agree, there are probably still (much?) better approaches to answer selection. However, as this is done as a bachelor thesis, the time is limited and the focus is set, i.e. how well one approach (LDA) is applicable to the whole task of FAQ extraction.
For the categorization we use the question as well as all replies to it as
one single document, which we then give to the clustering algorithm. Assuming
the replies are not spam or the like, this gives a far more precise
characterization of the terminology, simply due to the amount of text.
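
As a small illustration (hypothetical field names, not the actual
implementation), flattening a conversation into one clustering document could
look like this:

def conversation_to_document(conversation):
    # conversation: list of entries, each with a "text" field; the first
    # entry is assumed to be the question, the rest are the replies.
    return "\n".join(entry["text"] for entry in conversation)

The resulting string is what goes into the clustering/LDA step as a single
document.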
For categorization, the problem will be: either we define the terms
ourselves, or we generate them from the questions' content. If we generate
them, there will be lots of noise.

Sure, there is a lot of noise and it's important to remove it (stopwords, detecting stack traces etc.). We have already observed quite stupid categories due to noise. But one of the questions of this research is how to automatically extract and present important information from a very large data set, and human-defined terms would already require this information to be known to some extent (and defining them would also be time-consuming). That's why we are giving the fully automatic approach a try :)
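
For illustration only (the actual filters in the thesis may differ), such a
cleaning step might drop stack-trace-like lines and stopwords before
clustering:

import re

# Tiny illustrative stopword list; a real one would be much larger.
STOPWORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "it"}

# Lines like "at org.apache.mahout.Foo.bar(Foo.java:42)" are typical for
# Java stack traces and carry little topical information.
STACK_TRACE_LINE = re.compile(
    r"^\s*(at\s+[\w.$]+\(.*\)|Caused by:.*|\w+(\.\w+)+(Exception|Error).*)$")

def clean(text):
    # Remove stack-trace-like lines, then tokenize and drop stopwords.
    lines = [l for l in text.splitlines() if not STACK_TRACE_LINE.match(l)]
    tokens = re.split(r"\W+", " ".join(lines).lower())
    return [t for t in tokens if t and t not in STOPWORDS]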
On 23.02.2011 06:26, Bruce Dou wrote:
How do you find which answer is the best or most relevant?
How do you do the categorization? Based on the terms in the question?

On Wed, Feb 23, 2011 at 1:15 PM, Stefan Henß
<[email protected]>    wrote:
Hi everybody,

I'm currently doing research for my bachelor thesis on how to automatically
extract FAQs from unstructured data.

For this I've built a system automatically performing the following:
- Load thousands of conversations from forums and mailing lists (don't mind the categories there).
- Build a categorization solely based on the conversations' texts (by clustering).
- Pick the best modelled categories as the basis for one FAQ each.
- For each question (the first entry in a conversation), find the best reply from its answers.
- Select the most relevant and well-formatted question/answer pairs for each FAQ.

Most of the steps almost completely rely on the data from the categorization
step, which is obtained using the latent Dirichlet allocation model.
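
If anyone wants to play with the categorization idea, here is a rough,
self-contained sketch using scikit-learn's LDA (chosen only for illustration;
the documents and the number of topics below are made up and say nothing
about the actual thesis setup):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# One flattened string per conversation (question + all replies).
documents = [
    "how to install hadoop on linux bin conf start-all.sh namenode",
    "kmeans clustering with mahout vectors dictionary seqdirectory",
    "training a naive bayes classifier with mahout on hdfs",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)

# n_components is the number of topics / FAQ candidates (made-up value).
lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topics = lda.fit_transform(counts)

# Each conversation is assigned to its dominant topic; each topic's top
# terms act as the bag of words the answer scoring compares against.
print("dominant topic per document:", doc_topics.argmax(axis=1))
terms = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [terms[i] for i in topic.argsort()[::-1][:5]]
    print("topic", k, "->", top)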

For the evaluation part I'd like to ask you to have a look at one or two
FAQs and maybe give some comments on how well the questions matched the
FAQ's title, how relevant they were, etc.


Here's the direct link to the Mahout FAQs:
http://faqcluster.com/mahout-data

(There are some other interesting FAQs as well at http://faqcluster.com/)


Thanks for your help

Stefan





