Federated relevance ranking

2011-06-02 Thread Clint Gilbert

Hi everyone,

I searched the list archives, but couldn't find a question that closely
matches mine.

The project I'm working on is designed to allow searching a distributed
collection of data repositories.  Currently, we index each repository to
build a central Lucene index.  This works ok, but for practical (the
central index is getting very large) and architectural (decentralization
is a design goal) reasons, we'd like to distribute the index.

In the past, we had a basic federation system in place: when a user
submitted a query, the query was broadcast to each data repository,
which had its own independent Lucene index.  Results from each repo were
aggregated in reverse order.

The problem was, of course, that since each index was constructed
independently of all the others, and documents are distributed in the
repos unevenly, it was impossible to rank the results from all the
indices in a meaningful way.  We basically punted and interleaved
results, which gave a bad user experience, hence the temporary
switch to a central index.
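
To make the problem concrete, here is a minimal sketch (in Python, for
brevity) of why scores from independently built indices do not line up.
It assumes the classic Lucene TF-IDF inverse document frequency,
idf = 1 + ln(numDocs / (docFreq + 1)); the repository names, sizes, and
counts are made up for illustration.

```python
import math


def idf(doc_freq, num_docs):
    """Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical per-repository statistics for the SAME query term.
repos = {
    "small_repo": {"num_docs": 1_000, "doc_freq": 9},
    "large_repo": {"num_docs": 1_000_000, "doc_freq": 499_999},
}

# Each repository computes IDF from its own counts only.
local_idf = {
    name: idf(stats["doc_freq"], stats["num_docs"])
    for name, stats in repos.items()
}

for name, value in local_idf.items():
    print(f"{name}: idf = {value:.3f}")
```

The term is rare in one repository and common in the other, so each
index weights it very differently, and a raw score from one is not on
the same scale as a raw score from the other.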

So, what options exist for searching distributed collections of Lucene
indices and ranking results meaningfully?

Katta seems promising, but I don't know enough about it yet.  It also
seems to want to open its own ports for RPC.  I'd prefer something that
could tunnel over HTTP to minimize firewall drama.  (We will have tens
and then hundreds of data repos running in separate locations.)

We're also considering a home-grown scheme involving normalizing the
denominators of all the index components in all our indices, based on
the sums of counts obtained from all the indices.  This feels like
re-inventing the wheel, and it's not clear to me yet that the low-level
manipulation of indices that we'd need to do is even possible.
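
A rough sketch of what I mean, outside of Lucene entirely (Python, for
brevity): each repository reports its document count and per-term
document frequencies, the counts are summed, and every hit is scored
with the resulting global IDF before merging.  The scoring is a
simplified classic TF-IDF (sqrt(tf) * idf^2, no norms or boosts), and
all names and numbers here are hypothetical; this does not show how to
get Lucene itself to use the global statistics.

```python
import heapq
import math
from collections import Counter


def idf(doc_freq, num_docs):
    """Classic Lucene-style IDF: 1 + ln(numDocs / (docFreq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def global_idf(shard_stats):
    """shard_stats: list of (num_docs, {term: doc_freq}), one per index."""
    total_docs = sum(num_docs for num_docs, _ in shard_stats)
    total_df = Counter()
    for _, doc_freqs in shard_stats:
        total_df.update(doc_freqs)  # sums the counts term by term
    return {term: idf(df, total_docs) for term, df in total_df.items()}


def score(term_freqs, idf_by_term):
    """Simplified TF-IDF: sum of sqrt(tf) * idf^2 over matching terms."""
    return sum(
        math.sqrt(tf) * idf_by_term[term] ** 2
        for term, tf in term_freqs.items()
        if term in idf_by_term
    )


def merge_top_k(hits_per_shard, idf_by_term, k):
    """Rescore every hit with the shared IDF and keep the k best."""
    all_hits = [hit for hits in hits_per_shard for hit in hits]
    return heapq.nlargest(k, all_hits, key=lambda hit: score(hit[1], idf_by_term))


# Hypothetical statistics and hits from two repositories.
shard_stats = [
    (1_000, {"protein": 9, "kinase": 500}),
    (1_000_000, {"protein": 499_990, "kinase": 1_000}),
]
hits_per_shard = [
    [("a1", {"protein": 4}), ("a2", {"kinase": 1})],
    [("b1", {"protein": 1, "kinase": 1}), ("b2", {"protein": 9})],
]

idf_by_term = global_idf(shard_stats)
ranked = merge_top_k(hits_per_shard, idf_by_term, k=3)
print([doc_id for doc_id, _ in ranked])
```

Because every hit is scored against the same summed counts, the merged
order no longer depends on which repository a document happens to live
in.  The cost is an extra round trip to collect the statistics (or a
periodically refreshed cache of them).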

Any suggestions for distributing indices while ranking results well are
very welcome!

-
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org



Re: Federated relevance ranking

2011-06-02 Thread Clint Gilbert

Thank you very much for your reply.  Yeah, our indexes (indices?)
contain different types and amounts of data. :( The data being indexed
is all the same format - RDF - but it describes different numbers and
kinds of things.

What is your gut feeling on whether or not it's a good idea for us to
roll our own?  Katta is a contender, but we already have a fairly
complex system, and adding anything Hadoop-related feels like it might
push us over a tipping point into the realm of unwieldy overcomplexity.
But this is a hard problem after all, so some amount of complexity is
inevitable.

On 06/02/2011 07:05 PM, Erick Erickson wrote:
> As you've found out, raw scores certainly aren't comparable across
> different indexes *unless* the documents are fairly evenly distributed.
> You're talking large indexes here, so if the documents are balanced
> across all your indexes, the results should be pretty comparable. This
> presupposes that the indexes share a common schema and that the
> distributions of terms are "close enough to identical" to be truly
> comparable. It also supposes that your indexes are similar in
> character: it wouldn't work if one of your indexes had, say, metadata
> from videos and another had scholarly journal articles.
>
> Otherwise, there's work going on in Solr that might help, although I
> don't know when that'll be available.
>
> Other than that, I don't know what to suggest. It's not an easy
> problem, or Solr/Lucene would already have solved it... sigh.
> 
> Best
> Erick
> 
