Hello Stefan,

On Tue, Mar 8, 2011 at 9:54 PM, Stefan Henß <[email protected]> wrote:
> Hi everybody,
>
> I'm currently doing research for my bachelor thesis on how to automatically
> extract FAQs from unstructured data.
>
> For this I've built a system automatically performing the following:
> - Load thousands of conversations from forums and mailing lists (don't mind
>   the categories there, don't discriminate between sources).
> - Build new categorization solely based on the conversation's texts (by
>   clustering).
> - Pick the best modelled categories as basis for one FAQ each.
> - For each question (first entry in a thread) find the best reply from its
>   answers.
> - Select the most relevant and well formatted question/answer-pairs for each
>   FAQ.
>
> For the evaluation I'm interested in expert's perceptions of the results,
> e.g. if the questions are relevant, correctly answered, etc.

I think the clusters contain pretty much the correct set of emails, so well
done! I assume the answers to the questions are correct because you take the
second mail as the answer to the first, is that right?

What is confusing in the answers is that it is sometimes quite hard to see
where an answer starts and stops, perhaps because most of the time we
comment inline in emails.

Out of curiosity: what did you use for the clustering? Did you look at or use
Mahout for it?

> Also as I'll release a paper about the approach I'd be happy if you could
> rate one or two questions (stars on the details pages) so I'd have some
> statistics to present.

Done. Will it be a publicly available release?

Regards
Ard

>
>
> Here's the direct link to the Jackrabbit FAQs:
> http://faqcluster.com/jackrabbit-node-jcr-repository-apache
>
> (There are some other interesting FAQs as well at http://faqcluster.com/)
>
>
> Thanks for your help
>
> Stefan
>

-- 
Hippo
Europe • Amsterdam Oosteinde 11 • 1017 WT Amsterdam • +31 (0)20 522 4466
USA • San Francisco 755 Baywood Drive, Second Floor • Petaluma, CA. 94954 • +1 877 414 4776 (toll free)
Canada • Montréal 5369 Boulevard St-Laurent #430 • Montréal QC H2T 1S5 • +1 (514) 316 8966
www.onehippo.com • www.onehippo.org • [email protected]
