* mlb: comments

On Thu, Aug 13, 2009 at 9:39 AM, Stanislaw Osinski <stac...@gmail.com> wrote:
> Hi,
>
> On Tue, Aug 11, 2009 at 22:19, Mark Bennett <mbenn...@ideaeng.com> wrote:
>
> > Carrot2 has several pluggable algorithms to choose from, though I have no
> > evidence that they're "better" than Lucene's. Where TF/IDF is sort of a
> > one step algebraic calculation, some clustering algorithms use iterative
> > approaches, etc.
>
> I'm not sure if I completely follow the way in which you'd like to use
> Carrot2 for scoring -- would you cluster the whole index? Carrot2 was
> designed to be a post-retrieval clustering algorithm and optimized to
> cluster small sets of documents (up to ~1000) in real time. All processing
> is performed in-memory, which limits Carrot2's applicability to really
> large sets of documents.
>
> S.

* mlb: I agree with all of your assertions, but...

There are comments in the Solr materials about having an option to cluster based on the entire document set, and some warning about this being atypical and possibly slow. And from what you're saying, for a big enough docset, it might go from "slow" to "impossible"; I'm not sure.

And so my question was, *if* you were willing to spend that much time and effort to cluster all the text of all the documents (and if it were even possible), would the result perform better than the standard TF/IDF techniques?

In the application I'm considering, the queries tend to be longer than average, more like full sentences or more. And they tend to be of a question and answer nature. I've seen references in several search engines that Q&A search sometimes benefits from alternative search techniques.

And, from a separate email, the IDF part of the standard similarity may be causing a problem, so I'm casting a wide net for other ideas.

Just brainstorming here... :-)

So, given that, did you have any thoughts on it, Stanislaw?

Mark