On Aug 13, 2009, at 1:29 PM, Mark Bennett wrote:

* mlb: comments

On Thu, Aug 13, 2009 at 9:39 AM, Stanislaw Osinski <stac...@gmail.com> wrote:

Hi,

On Tue, Aug 11, 2009 at 22:19, Mark Bennett <mbenn...@ideaeng.com> wrote:

Carrot2 has several pluggable algorithms to choose from, though I have no
evidence that they're "better" than Lucene's. Where TF/IDF is sort of a
one-step algebraic calculation, some clustering algorithms use iterative
approaches, etc.


I'm not sure if I completely follow the way in which you'd like to use
Carrot2 for scoring -- would you cluster the whole index? Carrot2 was
designed to be a post-retrieval clustering algorithm and optimized to
cluster small sets of documents (up to ~1000) in real time. All processing
is performed in-memory, which limits Carrot2's applicability to really
large sets of documents.

S.
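
To make the post-retrieval pattern described above concrete, here is a
minimal Python sketch. It assumes a ranked list of hits already exists,
caps it at a small in-memory set, and groups the hits. The names
cluster_hits and MAX_HITS are hypothetical, and the grouping rule (each
hit's most frequent term) is a deliberately naive stand-in for a real
clustering algorithm; it is not how Carrot2 works internally.

```python
from collections import Counter, defaultdict

MAX_HITS = 1000  # assumed cap, echoing the "~1000 documents" figure above


def cluster_hits(hits, max_hits=MAX_HITS):
    """Group the top hits in memory by their most frequent term.

    `hits` is a ranked list of (doc_id, text) pairs from an earlier search.
    The grouping rule is a naive placeholder, not a Carrot2 algorithm.
    """
    clusters = defaultdict(list)
    # Only the top `max_hits` results are touched, never the whole index.
    for doc_id, text in hits[:max_hits]:
        # Crude tokenization: lowercase, split on whitespace, skip short words.
        terms = [t for t in text.lower().split() if len(t) > 3]
        label = Counter(terms).most_common(1)[0][0] if terms else "(other)"
        clusters[label].append(doc_id)
    return dict(clusters)
```

The point of the sketch is only the shape of the pipeline: search first,
then cluster a bounded result set, so memory use depends on max_hits
rather than on the size of the index.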


* mlb: I agree with all of your assertions, but...

There are comments in the Solr materials about having an option to cluster
based on the entire document set, and some warning about this being
atypical and possibly slow. And from what you're saying, for a big enough
docset, it might go from "slow" to "impossible"; I'm not sure.

Those comments refer to an as-yet-unimplemented feature that will allow for
pluggable background clustering, using something like Mahout to cluster the
whole collection and then return the results later upon request.



And so my question was, *if* you were willing to spend that much time and
effort to cluster all the text of all the documents (and if it were even
possible), would the result perform better than the standard TF/IDF
techniques?

In the application I'm considering, the queries tend to be longer than
average, more like full sentences or more. And they tend to be of a
question-and-answer nature. I've seen references in several search engines
that Q&A search sometimes benefits from alternative search techniques. And,
from a separate email, the IDF part of the standard similarity may be
causing a problem, so I'm casting a wide net for other ideas. Just
brainstorming here... :-)

QA has a lot of factors at play. I can't recall anyone using clustering as
a way of doing the initial passage retrieval, though it's been a few years
since I kept up with that literature.

You can of course turn off or downplay IDF if that is an issue. I think
payloads can also play a useful role in QA (or Lucene's new Attribute
capabilities, but I won't quite go there yet), because you could store
term-level information (POS tags often play a role in helping QA, as does
parsing information).
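
As a rough illustration of what "turn off or downplay IDF" means
numerically, here is a small Python sketch. It loosely follows the shape
of the classic tf/idf formula (square root of term frequency times a
log-based idf), but it is not Lucene's scoring code: the function name
score and the idf_weight knob are assumptions made up for this example.
In Lucene itself the change would normally be made in a custom Similarity.

```python
import math
from collections import Counter


def score(query_terms, doc_terms, doc_freq, num_docs, idf_weight=1.0):
    """Sum of sqrt(tf) * idf**idf_weight over the query terms.

    idf_weight=1.0 keeps IDF as is, values between 0 and 1 downplay it,
    and 0.0 switches it off so every matching term counts by tf alone.
    `doc_freq` maps a term to the number of documents containing it.
    """
    tf = Counter(doc_terms)
    total = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # term does not occur in this document
        idf = 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1))
        total += math.sqrt(tf[term]) * idf ** idf_weight
    return total
```

With a sentence-length query, the full-IDF score is dominated by the few
rare terms, while a lower idf_weight lets the common words of the question
contribute more evenly. Whether that helps a given Q&A collection is
something to measure rather than assume.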


--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:
http://www.lucidimagination.com/search
