Interesting project. :)

I didn't get a very strong sense of correlation between the topic
categories and the questions in them. For example,
http://faqcluster.com/couchdb-replication-couch-databases-database
"Questions & Answers about Couchdb, Couch, Replication, Databases and Database."

Had the following question:
http://faqcluster.com/question1996757514
"I'm looking for a recommendation for ruby gem that will enable me to use couchdb from rails. I'd like to have couch documents be modeled by ActiveRecord."

This didn't have any mention of replication (or databases), so I can
only guess that it was clustering on "couch" or "couchdb".

Do you do any screening of common terms from the clustering? I'd imagine
that if you looked at the user@couchdb mailing list, you could find a
list of very common terms (like couch, couchdb, database, etc.) and
discard or ignore those when trying to cluster the messages (in the same
way that words like "the" and "and" shouldn't be used). Basically, a
per-mailing-list set of generic terms.

The questions and answers themselves seemed to be a nice, readable "I
have X problem" "here is an answer" pair, so that was cool. :)

HTH,
Eli

On Tue, Feb 22, 2011 at 8:24 PM, Stefan Henß <[email protected]> wrote:
> Hi everybody,
>
> I'm currently doing research for my bachelor thesis on how to automatically
> extract FAQs from unstructured data.
>
> For this I've built a system automatically performing the following:
> - Load thousands of conversations from forums and mailing lists (don't mind
> the categories there).
> - Build categorization solely based on the conversation's texts (by
> clustering).
> - Pick the best modelled categories as basis for one FAQ each.
> - For each question (first entry in a conversation) find the best reply from
> its answers.
> - Select the most relevant and well formatted question/answer-pairs for each
> FAQ.
>
> For the evaluation part I'd like to ask you for having a look at one or two
> FAQs and maybe give some comments on how far the questions matched the FAQ's
> title, how relevant they were etc.
>
>
> Here's the direct link to the CouchDB FAQs:
> http://faqcluster.com/couchdb-view-document-doc-couch
>
> And here a quite good example in my opinion:
> http://faqcluster.com/question1516894006
>
> (There are some other interesting FAQs as well at http://faqcluster.com/)
>
>
> Thanks for your help
>
> Stefan
>
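
P.S. In case it helps, here is a rough sketch of the "per-mailing-list set of generic terms" idea from my reply. It is only an illustration, not how faqcluster works: the function names (`generic_terms`, `strip_generic`), the 50% document-frequency cutoff, and the sample messages are all made up for the example. The idea is to count in how many messages of one list each term appears, and to ignore terms that show up in more than a chosen fraction of them before clustering.

```python
import re
from collections import Counter

TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return TOKEN_RE.findall(text.lower())


def generic_terms(messages, max_doc_fraction=0.5):
    """Return terms that occur in more than max_doc_fraction of the messages.

    The 0.5 default is an arbitrary assumption; a real list would need tuning.
    """
    doc_freq = Counter()
    for message in messages:
        # Count each term once per message (document frequency).
        doc_freq.update(set(tokenize(message)))
    cutoff = max_doc_fraction * len(messages)
    return {term for term, count in doc_freq.items() if count > cutoff}


def strip_generic(message, generic):
    """Tokenize a message and drop the list-specific generic terms."""
    return [term for term in tokenize(message) if term not in generic]


# Made-up sample messages standing in for one mailing list.
messages = [
    "How do I set up couchdb replication between two servers?",
    "Which ruby gem lets me use couchdb from rails?",
    "My couchdb view returns no rows for this document",
    "Is continuous replication in couchdb safe after a restart?",
]

generic = generic_terms(messages)
print(sorted(generic))
print(strip_generic(messages[1], generic))
```

With these four sample messages only "couchdb" crosses the cutoff, so the Rails question would be clustered on terms like "ruby", "gem" and "rails" rather than on the list name. The same effect can probably be had from an existing vectorizer that supports a maximum document frequency, if the clustering library has one.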
