Hi Isabel,
I guess this task depends very much on the domain you want to apply
it to. For instance, discussion about products like Mahout should be
fairly concentrated in one place (this mailing list) and frequented
mostly by advanced users. So I don't expect many duplicates (and I
didn't find many in my evaluations either). And even if a significant
number exist, how would you identify them? I assume the more complex
the topics, the more ways there are to express them. That's why I
focus first on good categorization and a reasonable selection of
entries (tough enough in some cases), and not so much on the
"frequent" in FAQ.
But for the commercial sector (e.g. consumer electronics) I think this
could work. Given a very large database of inquiries (mail support,
call center logs, ...), hierarchical clustering, and fine-grained
settings, there should be clusters (or rather groups/topics) of near
duplicates at the bottom level, and you simply order them by size.
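To make the idea concrete, here is a rough sketch in plain Python. It
is not Mahout code and makes several assumptions: word-set Jaccard
similarity stands in for a real text distance, a single similarity
threshold (0.6, chosen arbitrarily) stands in for the "fine-grained
settings", and the names tokens, jaccard, near_duplicate_groups and
top_faq are made up for illustration. Merging everything above the
threshold corresponds to cutting a single-link hierarchical clustering
near the bottom level.

```python
from collections import defaultdict


def tokens(text):
    """Lowercased word set; a stand-in for real text preprocessing."""
    cleaned = text.lower().replace("?", " ").replace(".", " ")
    return frozenset(cleaned.split())


def jaccard(a, b):
    """Jaccard similarity of two token sets (1.0 means identical)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_groups(inquiries, threshold=0.6):
    """Group inquiries whose similarity is >= threshold (single link).

    Quadratic in the number of inquiries, so for illustration only.
    Returns the groups ordered by size, largest first.
    """
    parent = list(range(len(inquiries)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    sets = [tokens(q) for q in inquiries]
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = defaultdict(list)
    for i, q in enumerate(inquiries):
        groups[find(i)].append(q)
    return sorted(groups.values(), key=len, reverse=True)


def top_faq(inquiries, n=10, threshold=0.6):
    """The n largest near-duplicate groups, i.e. a "Top n" candidate list."""
    return near_duplicate_groups(inquiries, threshold)[:n]
```

Calling top_faq(list_of_inquiries) returns up to ten groups, largest
first; one representative per group would then have to be picked by
hand or by some other heuristic.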
Stefan
On 23.02.2011 14:08, Isabel Drost wrote:
On Wed, 23 Feb 11 Sean Owen wrote:
Nice, very interesting to see and read!
Very interesting indeed. Wondering whether a "Top 10" of the most
frequently asked questions could be created that way as well.
Isabel