This is very nice work! If you achieved this level of accuracy without direct editing, that is very impressive. In reading through the Mahout and Math questions, I noted a few issues with quoting and a few complete failures, but the good answers were very good. I think the quoting issues could be reduced by measuring how much of each message's text matches the previous items in the thread. Small n-grams are very effective for this and avoid the need for full edit distance calculations. For the failed cases, even a small amount of community feedback would suffice to knock out the bad answers. I think the ratio of high-quality to low-quality answers is definitely high enough to make the FAQs worth looking at. If the ratio were reversed, I think users would not find it worth the time to look.
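
To make the n-gram idea concrete, here is a rough sketch in Python. It is not taken from your system: the function names, the choice of character 5-grams, and the 0.8 threshold are all assumptions for illustration and would need tuning on real threads. A line of a reply is treated as quoted when most of its n-grams already occur in earlier messages of the same thread.

```python
def char_ngrams(text, n=5):
    """Return the set of character n-grams of text, ignoring case and extra whitespace."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def quoted_fraction(line, earlier_messages, n=5):
    """Fraction of the line's n-grams that already occur in earlier messages."""
    grams = char_ngrams(line, n)
    if not grams:
        return 0.0
    seen = set()
    for message in earlier_messages:
        seen |= char_ngrams(message, n)
    return len(grams & seen) / len(grams)


def strip_quoted_lines(reply, earlier_messages, n=5, threshold=0.8):
    """Drop lines of reply whose text mostly repeats earlier messages in the thread.

    The threshold is an illustrative guess, not a tuned value.
    """
    kept = []
    for line in reply.splitlines():
        # Remove leading quote markers such as "> " before comparing.
        candidate = line.lstrip("> ").strip()
        if candidate and quoted_fraction(candidate, earlier_messages, n) >= threshold:
            continue  # looks like quoted material, skip it
        kept.append(line)
    return "\n".join(kept)
```

Because the comparison is on sets of short substrings, it tolerates re-wrapped lines and small edits to the quoted text without computing an edit distance. It also does not rely on the ">" markers being present, since some mail clients drop them.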
I do note that there are a very small number of questions that have been answered compared to the number that I have seen go by on the mailing list. Is that because you are being very cautious about keeping precision high?

Finally, some questions:

a) do you use any sort of measure to determine how well written the questions and answers are?

b) is this a dead-end school project or do you plan to continue with it?

On Tue, Feb 22, 2011 at 9:15 PM, Stefan Henß <[email protected]> wrote:

> Hi everybody,
>
> I'm currently doing research for my bachelor thesis on how to automatically
> extract FAQs from unstructured data.
>
> For this I've built a system automatically performing the following:
> - Load thousands of conversations from forums and mailing lists (don't mind
> the categories there).
> - Build categorization solely based on the conversation's texts (by
> clustering).
> - Pick the best modelled categories as basis for one FAQ each.
> - For each question (first entry in a conversation) find the best reply
> from its answers.
> - Select the most relevant and well formatted question/answer-pairs for
> each FAQ.
>
> Most of the steps almost completely rely on the data from the
> categorization step which is obtained using the latent Dirichlet allocation
> model.
>
> For the evaluation part I'd like to ask you for having a look at one or two
> FAQs and maybe give some comments on how far the questions matched the FAQ's
> title, how relevant they were etc.
>
>
> Here's the direct link to the Mahout FAQs:
> http://faqcluster.com/mahout-data
>
> (There are some other interesting FAQs as well at http://faqcluster.com/)
>
>
> Thanks for your help
>
> Stefan
>
