On Thu, Feb 24, 2011 at 2:52 PM, Stefan Henß <[email protected]> wrote:
> Hi Bruce,
>
> currently the answer selection is quite simple. We assume that a
> sophisticated answer makes firm use of the domain's terminology. So if
> someone has a high density of terms like "mahout", "hadoop", "clustering",
> "svn", "classifier" in his response to a mahout-related question, we hope he
> knows what he is talking about and gives straight pointers to solutions
> etc. A model of the domain's terminology is given by the cluster (bag of
> words) the question/answer was assigned to, so what we basically do is
> calculate the cosine similarity between the bag of words and the answer.
> High similarity - hopefully sophisticated. Of course there are some smaller
> additions, like decreasing the score if the reply is by the same user as the
> question, but the similarity measure is the core idea.
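
To make the scoring idea above concrete, here is a minimal sketch in Python. It is not the code behind faqcluster.com. The function names (`tokenize`, `cosine_similarity`, `score_answer`), the tokenizer, and the penalty factor of 0.5 for same-author replies are all assumptions made for illustration. The cluster's bag of words is assumed to be a plain mapping from term to weight.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens (assumed tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(topic_weights, answer_text):
    """Cosine similarity between a cluster's term weights and an answer's term counts."""
    answer_counts = Counter(tokenize(answer_text))
    dot = sum(w * answer_counts.get(term, 0) for term, w in topic_weights.items())
    topic_norm = math.sqrt(sum(w * w for w in topic_weights.values()))
    answer_norm = math.sqrt(sum(c * c for c in answer_counts.values()))
    if topic_norm == 0 or answer_norm == 0:
        return 0.0  # an empty cluster or an empty answer has no similarity
    return dot / (topic_norm * answer_norm)


def score_answer(topic_weights, answer_text, answer_author, question_author,
                 same_author_penalty=0.5):
    """Similarity score, reduced when the asker replies to their own question.

    The penalty factor is a hypothetical value, not one taken from the thread.
    """
    score = cosine_similarity(topic_weights, answer_text)
    if answer_author == question_author:
        score *= same_author_penalty
    return score
```

For example, `score_answer({"mahout": 1.0, "hadoop": 1.0}, "Run mahout on hadoop", "bob", "alice")` scores a reply by comparing its word counts with the cluster's terms. A reply that shares no terms with the cluster scores 0.0.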

I do not think the best answer can be found this way. For example:

Q. How to install hadoop on Linux?
A. <commands list>

The answer may not include the terms from the question: because the answer
builds on the question, those terms are usually omitted.

>
> For the categorization we use the question as well as all replies to it as
> one single document, which we then give to the clustering algorithm. On the
> assumption that the replies are not spam or alike, this gives a far more
> precise characterization of the terminology simply due to the amount of
> text.

For categorization, the question is whether to define the terms ourselves or
to generate them from the questions' content. If they are generated, there
will be a lot of noise.

>
> On 23.02.2011 06:26, Bruce Dou wrote:
>>
>> How to find which answer is the best or relevant?
>> How to do categorization? Based on the terms in the question?
>>
>> On Wed, Feb 23, 2011 at 1:15 PM, Stefan Henß
>> <[email protected]> wrote:
>>>
>>> Hi everybody,
>>>
>>> I'm currently doing research for my bachelor thesis on how to
>>> automatically extract FAQs from unstructured data.
>>>
>>> For this I've built a system automatically performing the following:
>>> - Load thousands of conversations from forums and mailing lists (don't
>>>   mind the categories there).
>>> - Build categorization solely based on the conversation's texts (by
>>>   clustering).
>>> - Pick the best modelled categories as basis for one FAQ each.
>>> - For each question (first entry in a conversation) find the best reply
>>>   from its answers.
>>> - Select the most relevant and well formatted question/answer-pairs for
>>>   each FAQ.
>>>
>>> Most of the steps almost completely rely on the data from the
>>> categorization step which is obtained using the latent Dirichlet
>>> allocation model.
>>>
>>> For the evaluation part I'd like to ask you to have a look at one or
>>> two FAQs and maybe give some comments on how well the questions matched
>>> the FAQ's title, how relevant they were etc.
>>>
>>>
>>> Here's the direct link to the Mahout FAQs:
>>> http://faqcluster.com/mahout-data
>>>
>>> (There are some other interesting FAQs as well at http://faqcluster.com/)
>>>
>>>
>>> Thanks for your help
>>>
>>> Stefan
>>>
>>
>>
>
>

--
A decathlon Drupal developer & programmer
http://blog.eood.cn/
