This is very nice work!

If you achieved this level of accuracy without any hand editing, that is
very impressive.  Reading through the Mahout and Math questions, I noticed
a few problems with quoted text and a few complete failures, but the good
answers were very good.  I think the quoting problems could be reduced by
measuring how much of each message's text matches the earlier messages in
the thread.  Small n-grams are very effective for this and avoid the need
for full edit distance calculations.  For the failed cases, even a small
amount of community feedback would suffice to knock out the bad answers.  I
think the ratio of high quality answers to low quality answers is
definitely high enough to make the FAQs worth looking at.  If the ratio
were reversed, I think users would not find it worth their time.
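
To make the n-gram suggestion concrete, here is a minimal sketch of what I
have in mind.  It is only an illustration: the function names, the choice of
character trigrams, and the 0.8 threshold are my own assumptions, not
anything taken from your system.

```python
def char_ngrams(text, n=3):
    """Return the set of character n-grams in text (lowercased, whitespace collapsed)."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def quoted_fraction(line, earlier_messages, n=3):
    """Fraction of the line's n-grams that also occur in any earlier message."""
    grams = char_ngrams(line, n)
    if not grams:
        return 0.0  # blank or very short lines are never treated as quotes
    seen = set()
    for message in earlier_messages:
        seen |= char_ngrams(message, n)
    return len(grams & seen) / len(grams)


def strip_quoted_lines(reply, earlier_messages, threshold=0.8, n=3):
    """Drop lines of reply that mostly repeat text from earlier messages.

    The threshold is a hypothetical starting point and would need tuning.
    """
    kept = [
        line for line in reply.splitlines()
        # Remove leading '>' and spaces so quote markers do not hide a match.
        if quoted_fraction(line.lstrip("> "), earlier_messages, n) < threshold
    ]
    return "\n".join(kept)
```

Calling strip_quoted_lines(reply_text, [question_text]) would return the
reply with the lines that mostly repeat the question removed.  Set overlap
like this costs roughly linear time in the length of the text, which is why
it is so much cheaper than edit distance.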

I do note that only a very small number of questions have been answered
compared to the number that I have seen go by on the mailing list.  Is that
because you are being very cautious about keeping precision high?

Finally, some questions:

a) Do you use any measure of how well written the questions and answers
are?

b) Is this a dead-end school project, or do you plan to continue with it?

On Tue, Feb 22, 2011 at 9:15 PM, Stefan Henß <[email protected]> wrote:

> Hi everybody,
>
> I'm currently doing research for my bachelor thesis on how to automatically
> extract FAQs from unstructured data.
>
> For this I've built a system automatically performing the following:
> - Load thousands of conversations from forums and mailing lists (don't mind
> the categories there).
> - Build categorization solely based on the conversation's texts (by
> clustering).
> - Pick the best modelled categories as basis for one FAQ each.
> - For each question (first entry in a conversation) find the best reply
> from its answers.
> - Select the most relevant and well formatted question/answer-pairs for
> each FAQ.
>
> Most of the steps almost completely rely on the data from the
> categorization step which is obtained using the latent Dirichlet allocation
> model.
>
> For the evaluation part I'd like to ask you for having a look at one or two
> FAQs and maybe give some comments on how far the questions matched the FAQ's
> title, how relevant they were etc.
>
>
> Here's the direct link to the Mahout FAQs:
> http://faqcluster.com/mahout-data
>
> (There are some other interesting FAQs as well at http://faqcluster.com/)
>
>
> Thanks for your help
>
> Stefan
>
