Nice, very interesting to see and read!
On Wed, Feb 23, 2011 at 5:15 AM, Stefan Henß <[email protected]> wrote:
> Hi everybody,
>
> I'm currently doing research for my bachelor thesis on how to automatically
> extract FAQs from unstructured data.
>
> For this I've built a system automatically performing the following:
> - Load thousands of conversations from forums and mailing lists (don't mind
>   the categories there).
> - Build categorization solely based on the conversation's texts (by
>   clustering).
> - Pick the best modelled categories as basis for one FAQ each.
> - For each question (first entry in a conversation) find the best reply from
>   its answers.
> - Select the most relevant and well formatted question/answer-pairs for each
>   FAQ.
>
> Most of the steps almost completely rely on the data from the categorization
> step which is obtained using the latent Dirichlet allocation model.
>
> For the evaluation part I'd like to ask you for having a look at one or two
> FAQs and maybe give some comments on how far the questions matched the FAQ's
> title, how relevant they were etc.
>
> Here's the direct link to the Mahout FAQs: http://faqcluster.com/mahout-data
>
> (There are some other interesting FAQs as well at http://faqcluster.com/)
>
> Thanks for your help
>
> Stefan
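
For anyone trying to picture how the later steps could lean on the topic model, here is a minimal sketch. It is not Stefan's implementation. It assumes each post already has a topic mixture (the kind of per-document distribution an LDA model produces), groups conversations by the question's dominant topic, and picks the reply whose mixture is closest to the question's. The Hellinger distance, the function names, and the example numbers are all assumptions made for illustration.

```python
import math

def dominant_topic(theta):
    """Return the index of the highest-weight topic in a topic mixture."""
    return max(range(len(theta)), key=lambda k: theta[k])

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (0 means identical)."""
    return math.sqrt(0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2
                               for a, b in zip(p, q)))

def best_reply(question_theta, reply_thetas):
    """Return the index of the reply whose topic mixture is closest to the question's."""
    return min(range(len(reply_thetas)),
               key=lambda i: hellinger(question_theta, reply_thetas[i]))

# Hypothetical topic mixtures over 3 topics; real values would come from a fitted model.
conversations = [
    {"question": [0.8, 0.1, 0.1], "replies": [[0.2, 0.7, 0.1], [0.7, 0.2, 0.1]]},
    {"question": [0.1, 0.1, 0.8], "replies": [[0.1, 0.2, 0.7]]},
]

# Group question/answer pairs by the question's dominant topic (one FAQ per topic).
faqs = {}
for conv in conversations:
    topic = dominant_topic(conv["question"])
    reply_idx = best_reply(conv["question"], conv["replies"])
    faqs.setdefault(topic, []).append((conv["question"], conv["replies"][reply_idx]))
```

A real system would presumably also score formatting and relevance before selecting pairs, as the list above mentions; this sketch only covers the topic-similarity part.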
