On Thu, Feb 24, 2011 at 2:52 PM, Stefan Henß
<[email protected]> wrote:
> Hi Bruce,
>
> currently the answer selection is quite simple. We assume that a
> sophisticated answer makes firm use of the domain's terminology. So if
> someone has a high density of terms like "mahout", "hadoop", "clustering",
> "svn", "classifier" in his response to a Mahout-related question, we hope he
> knows what he is talking about and gives straight pointers to solutions
> etc. A model of the domain's terminology is given by the cluster (bag of
> words) the question/answer was assigned to, so what we basically do is
> calculate the cosine similarity between the bag of words and the answer.
> High similarity: hopefully sophisticated. Of course there are some smaller
> additions, like decreasing the score if the reply is by the same user who
> asked the question, but the similarity measure is the core idea.

I do not think the best answer can be found this way.

For example:
Q. How to install Hadoop on Linux?
A. <commands list>

The answer may not include the terms used in the question: because the
answer builds on the question, those terms are usually omitted.
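
To make the concern concrete, here is a minimal sketch of the scoring as I
understand it from the description above: cosine similarity between a
cluster's bag of words and an answer. The term weights, the two example
answers, and the names tokenize and cosine_similarity are all made up for
illustration; this is not the actual faqcluster.com code.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(a, b):
    """Cosine similarity between two term-count mappings."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical cluster model: invented term weights, not real data.
cluster_terms = Counter(
    {"hadoop": 5, "install": 3, "linux": 3, "cluster": 2, "mahout": 2}
)

# Two invented answers to "How to install Hadoop on Linux?".
wordy_answer = "To install Hadoop on Linux, see the Hadoop install guide for your cluster."
command_answer = "tar xzf archive.tar.gz && cd archive && ./configure && make"

wordy_score = cosine_similarity(cluster_terms, Counter(tokenize(wordy_answer)))
command_score = cosine_similarity(cluster_terms, Counter(tokenize(command_answer)))

print("wordy answer:  ", round(wordy_score, 3))
print("command answer:", round(command_score, 3))
```

Under these assumptions the command-only answer shares no term with the
cluster and scores zero, although it may be the more useful reply.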

>
> For the categorization we treat the question and all replies to it as
> one single document, which we then give to the clustering algorithm. On the
> assumption that the replies are not spam or the like, this gives a far more
> precise characterization of the terminology simply due to the amount of
> text.

For categorization, the problem is whether we define the terms ourselves
or generate them from the content of the questions.
If they are generated, there will be a lot of noise.
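
A small sketch of what I mean by noise, assuming the question and its
replies are simply joined into one document and the terms are counted. The
conversation text and the helper names are invented for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def conversation_document(question, replies):
    """Join a question and all of its replies into one document string."""
    return "\n".join([question] + list(replies))


# Hypothetical conversation, not taken from any real mailing list.
question = "How to install hadoop in Linux?"
replies = [
    "You need the Java runtime and then the hadoop package.",
    "Thanks, the install worked after I set the path.",
]

doc = conversation_document(question, replies)
term_counts = Counter(tokenize(doc))

# The most frequent generated "term" is a function word, not a domain term.
print(term_counts.most_common(1))
```

Here "the" outranks "hadoop" and "install", so terms generated directly
from the text need some filtering before they describe a category.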

>
> On 23.02.2011 06:26, Bruce Dou wrote:
>>
>> How do you find which answer is the best or most relevant?
>> How do you do the categorization? Is it based on the terms in the question?
>>
>> On Wed, Feb 23, 2011 at 1:15 PM, Stefan Henß
>> <[email protected]>  wrote:
>>>
>>> Hi everybody,
>>>
>>> I'm currently doing research for my bachelor thesis on how to
>>> automatically extract FAQs from unstructured data.
>>>
>>> For this I've built a system automatically performing the following:
>>> - Load thousands of conversations from forums and mailing lists (don't
>>>   mind the categories there).
>>> - Build categorization solely based on the conversation's texts (by
>>>   clustering).
>>> - Pick the best modelled categories as basis for one FAQ each.
>>> - For each question (first entry in a conversation) find the best reply
>>>   from its answers.
>>> - Select the most relevant and well formatted question/answer-pairs for
>>>   each FAQ.
>>>
>>> Most of the steps almost completely rely on the data from the
>>> categorization step which is obtained using the latent Dirichlet
>>> allocation model.
>>>
>>> For the evaluation part I'd like to ask you to have a look at one or
>>> two FAQs and maybe give some comments on how well the questions matched
>>> the FAQ's title, how relevant they were etc.
>>>
>>>
>>> Here's the direct link to the Mahout FAQs:
>>> http://faqcluster.com/mahout-data
>>>
>>> (There are some other interesting FAQs as well at http://faqcluster.com/)
>>>
>>>
>>> Thanks for your help
>>>
>>> Stefan
>>>
>>
>>
>
>



-- 
A decathlon Drupal developer & programmer
http://blog.eood.cn/
