Re: Solr 1.4 Clustering / mlt AS search?

Grant Ingersoll Thu, 13 Aug 2009 13:25:10 -0700


On Aug 13, 2009, at 1:29 PM, Mark Bennett wrote:

* mlb: comments
On Thu, Aug 13, 2009 at 9:39 AM, Stanislaw Osinski <stac...@gmail.com> wrote:
Hi,
On Tue, Aug 11, 2009 at 22:19, Mark Bennett <mbenn...@ideaeng.com> wrote:
Carrot2 has several pluggable algorithms to choose from, though I have no
evidence that they're "better" than Lucene's. Where TF/IDF is sort of a
one-step algebraic calculation, some clustering algorithms use iterative
approaches, etc.
I'm not sure if I completely follow the way in which you'd like to use
Carrot2 for scoring -- would you cluster the whole index? Carrot2 was
designed to be a post-retrieval clustering algorithm and optimized to
cluster small sets of documents (up to ~1000) in real time. All processing
is performed in-memory, which limits Carrot2's applicability to really
large sets of documents.

S.
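
To make the "post-retrieval" idea concrete, here is a rough sketch, assuming a plain list of hit texts already returned by a search. It is not one of Carrot2's algorithms: the greedy single-pass grouping, the `cluster_hits` name, and the `max_docs` and `threshold` parameters are all illustrative assumptions. The point is only that the work happens in memory on a capped number of top hits, never on the whole index.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster_hits(hits, max_docs=1000, threshold=0.3):
    """Greedily group the top hits by term overlap, entirely in memory.

    hits: list of document texts in ranked order (hypothetical input).
    max_docs: cap on how many hits are clustered (assumed, after the ~1000 figure above).
    threshold: minimum cosine similarity to join an existing cluster (assumed).
    Returns a list of clusters, each a list of hit indices.
    """
    clusters = []  # each entry: (summed term counts, member indices)
    for i, text in enumerate(hits[:max_docs]):
        vec = Counter(text.lower().split())
        best, best_sim = None, threshold
        for cluster in clusters:
            sim = cosine(vec, cluster[0])
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append((vec, [i]))
        else:
            best[0].update(vec)   # fold the new document into the cluster's term counts
            best[1].append(i)
    return [members for _, members in clusters]
```

Because every hit is compared against every cluster held in memory, the cost grows with the number of hits, which is why this style of clustering suits a result page rather than a full collection.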
* mlb: I agree with all of your assertions, but...
There are comments in the Solr materials about having an option to cluster
based on the entire document set, and some warning about this being atypical
and possibly slow. And from what you're saying, for a big enough docset, it
might go from "slow" to "impossible", I'm not sure.

Those comments are referring to a yet unimplemented feature that will allow for pluggable background clustering using something like Mahout to cluster the whole collection and then return back the results later upon request.

And so my question was, *if* you were willing to spend that much time and
effort to cluster all the text of all the documents (and if it were even
possible), would the result perform better than the standard TF/IDF
techniques?

In the application I'm considering, the queries tend to be longer than
average, more like full sentences or more.  And they tend to be of a
question and answer nature. I've seen references in several search engines
that QandA search sometimes benefits from alternative search techniques.
And, from a separate email, the IDF part of the standard similarity may be
causing a problem, so I'm casting a wide net for other ideas.  Just
brainstorming here... :-)

QA has a lot of factors at play. I can't recall anyone using clustering as a way of doing the initial passage retrieval, but it's been a few years since I kept up with that literature.

You can of course turn off or downplay IDF if that is an issue. I think payloads can also play a useful role in QA (or Lucene's new Attribute capabilities, but I won't quite go there yet), because you could store term-level information (POS often plays a role in helping QA, as does parsing information).
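
As a minimal sketch of what turning off or downplaying IDF means for ranking, the toy scorer below uses a square-root term frequency and a `1 + log(numDocs / (docFreq + 1))` IDF, loosely in the style of classic TF/IDF weighting. It is not Lucene's actual scoring code; the `score` function and its `idf_power` knob are assumptions made for illustration. In Lucene itself this kind of change would normally go into a custom Similarity, which is not shown here.

```python
import math
from collections import Counter


def score(query_terms, doc_terms, doc_freq, num_docs, idf_power=1.0):
    """Toy TF/IDF-style score for one document.

    idf_power is a hypothetical knob: 1.0 keeps full IDF, 0.0 turns IDF off,
    and values in between downplay it.
    """
    tf = Counter(doc_terms)
    total = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # term does not occur in this document
        idf = 1.0 + math.log(num_docs / (doc_freq.get(term, 0) + 1.0))
        total += math.sqrt(tf[term]) * (idf ** idf_power)
    return total
```

With long, sentence-like queries, full IDF lets a single rare term dominate the score, while `idf_power=0` ranks purely on how often the query terms occur. For example, with `doc_freq = {"the": 900, "payload": 5}` and `num_docs = 1000`, a document containing "payload" once outranks one containing "the" four times under full IDF, and the order flips when IDF is off.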



--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using Solr/Lucene:

http://www.lucidimagination.com/search
