* mlb: comments

On Thu, Aug 13, 2009 at 9:39 AM, Stanislaw Osinski <stac...@gmail.com> wrote:

> Hi,
>
> On Tue, Aug 11, 2009 at 22:19, Mark Bennett <mbenn...@ideaeng.com> wrote:
>
> > Carrot2 has several pluggable algorithms to choose from, though I have no
> > evidence that they're "better" than Lucene's.  Where TF/IDF is sort of a
> > one step algebraic calculation, some clustering algorithms use iterative
> > approaches, etc.
>
>
> I'm not sure if I completely follow the way in which you'd like to use
> Carrot2 for scoring -- would you cluster the whole index? Carrot2 was
> designed to be a post-retrieval clustering algorithm and optimized to
> cluster small sets of documents (up to ~1000) in real time. All processing
> is performed in-memory, which limits Carrot2's applicability to really
> large sets of documents.
>
> S.
>

* mlb: I agree with all of your assertions, but...

There are comments in the Solr materials about having an option to cluster
based on the entire document set, with a warning that this is atypical and
possibly slow.  And from what you're saying, for a big enough docset it
might go from "slow" to "impossible"; I'm not sure.

And so my question was, *if* you were willing to spend that much time and
effort to cluster all the text of all the documents (and if it were even
possible), would the result perform better than the standard TF/IDF
techniques?

In the application I'm considering, the queries tend to be longer than
average, more like full sentences or more.  And they tend to be of a
question-and-answer nature.  I've seen references for several search engines
suggesting that Q&A search sometimes benefits from alternative search
techniques.  And, from a separate email, the IDF part of the standard
similarity may be causing a problem, so I'm casting a wide net for other
ideas.  Just brainstorming here... :-)
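
To make the IDF concern a bit more concrete, here is a rough Python sketch of
how a long, sentence-like query gets weighted.  It is only an illustration,
not Lucene's actual scoring code: the idf formula is the classic
1 + ln(numDocs / (docFreq + 1)) form, everything else in the real similarity
(tf, norms, coord, query normalization) is left out, and the corpus size and
document frequencies are made-up numbers.

```python
import math

def idf(doc_freq, num_docs):
    # Classic-style idf: 1 + ln(numDocs / (docFreq + 1)).
    # Simplified; real scoring combines this with tf, norms, etc.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def idf_shares(query_terms, doc_freqs, num_docs):
    """Return each distinct query term's share of the total idf weight."""
    weights = {t: idf(doc_freqs.get(t, 0), num_docs) for t in set(query_terms)}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}

# Hypothetical corpus statistics, invented for this example.
NUM_DOCS = 100_000
DOC_FREQS = {
    "how": 60_000, "do": 70_000, "i": 80_000, "my": 50_000,
    "reset": 900, "router": 120, "password": 2_500,
}

query = "how do i reset my router password".split()
shares = idf_shares(query, DOC_FREQS, NUM_DOCS)

# Print terms from most to least influential.
for term, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {share:.2%}")
```

With numbers like these, the few rare terms carry most of the weight and the
rest of the sentence contributes very little, which may or may not be what
you want for question-style queries.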

So, given that, did you have any thoughts on it, Stanislaw?
Mark
