Re: Cores and and ranking (search quality)

johnmunir Tue, 10 Mar 2015 13:26:43 -0700

Thanks Walter.

The design decision I'm trying to make is this: if I use multiple cores, will 
my ranking be impacted compared to using a single core?

I have records to index and each record can be grouped into object-types, such 
as object-A, object-B, object-C, etc.  I have a total of 30 (maybe more) 
object-types.  There may be only 10 records of object-A, but 10 million records 
of object-B or 1 million of object-C, etc.  I need to be able to search against 
a single object-type and / or across all object-types.

From my past experience, in a single core setup, if I have two identical 
records and I search on the term "XYZ" that matches one of the records, the 
second record ranks right next to the other (because it too contains "XYZ").  
This is good and is the expected behavior.  If I want to limit my search to an 
object-type, I AND "XYZ" with that object-type.  So all is well.

What I'm considering for my new design is to use multiple cores and distributed 
search.  I am considering creating a core for each object-type: core-A will 
hold records from object-A, core-B will hold records from object-B, etc.  
Before I can make a decision on this design, I need to know how ranking will be 
impacted.
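
For illustration, the one-core-per-object-type layout could be queried roughly 
as sketched below, using Solr's "shards" request parameter for the distributed 
case.  The host, the core names and the "id" field are assumptions made up for 
this sketch, not details from the actual setup.

```python
from urllib.parse import urlencode

# Hypothetical host and core names, for illustration only.
SOLR_HOST = "localhost:8983"
CORES = ["core-A", "core-B", "core-C"]


def build_search_url(term, cores):
    """Build a select URL that searches one core or fans out across several."""
    # "id" is an assumed unique-key field name; "score" asks Solr to return the score.
    params = {"q": term, "fl": "id,score"}
    if len(cores) > 1:
        # The shards parameter lists every core the request is distributed to.
        params["shards"] = ",".join(f"{SOLR_HOST}/solr/{core}" for core in cores)
    # The request goes to the first core, which coordinates the distributed search.
    return f"http://{SOLR_HOST}/solr/{cores[0]}/select?{urlencode(params)}"


# Search a single object-type by querying only its core.
single = build_search_url("XYZ", ["core-B"])
# Search across all object-types with one distributed request.
everything = build_search_url("XYZ", CORES)

print(single)
print(everything)
```

Searching a single object-type then means querying only that core, while 
searching across all object-types means one request that lists every core.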

Going back to my earlier example: suppose I have 2 identical records, one of 
which went to core-A, which has 10 records, and the other to core-B, which has 
10 million records.  If I now use distributed search to search across all 
cores on the term "XYZ" (just like in the single core case), it will match 
both records, but will those two records be ranked next to each other just 
like in the single core case?  If not, which will rank higher, the one from 
core-A or the one from core-B?

My concern is that using multiple cores and distributed search means giving up 
rank quality when records are not distributed evenly across cores.  If so, then 
maybe this is not a design I can use.

- MJ

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Tuesday, March 10, 2015 2:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Cores and and ranking (search quality)

On Mar 10, 2015, at 10:17 AM, johnmu...@aol.com wrote:

> If I have two cores, one core has 10 docs another has 100,000 docs.  I then 
> submit two docs that are 100% identical (with the exception of the unique-ID 
> fields, which is stored but not indexed) one to each core.  The question is, 
> during search, will both of those docs rank near each other or not? […]
> 
> Put another way: do docs from the smaller core (the one that has only 10 
> docs) rank higher or lower compared to docs from the larger core (the one 
> with 100,000 docs)?

These are not quite the same question.

tf.idf ranking depends on the other documents in the collection (the idf term). 
With 10 docs, the document frequency statistics are effectively random noise, 
so the ranking is unpredictable.

Identical documents should rank identically, but whether they are higher or 
lower in the two cores depends on the rest of the docs.
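
A rough sketch of that dependence, assuming an idf of the form used by Lucene's 
classic TF-IDF similarity, 1 + ln(numDocs / (docFreq + 1)), and assuming each 
core scores with only its own local statistics.  The score is deliberately 
simplified (no length norms or query normalization), and every count below is 
an invented number for illustration.

```python
import math


def classic_idf(num_docs, doc_freq):
    """idf in the style of Lucene's classic TF-IDF: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def simplified_score(term_freq, num_docs, doc_freq):
    """Simplified single-term score: sqrt(tf) * idf^2, ignoring norms."""
    return math.sqrt(term_freq) * classic_idf(num_docs, doc_freq) ** 2


# The same document (the term occurs once) indexed into different cores.
# All document counts and document frequencies are made-up numbers.
score_small_core = simplified_score(1, num_docs=10, doc_freq=1)
score_large_rare = simplified_score(1, num_docs=10_000_000, doc_freq=1_000)
score_large_common = simplified_score(1, num_docs=10_000_000, doc_freq=5_000_000)

print(f"10-doc core:                     {score_small_core:.2f}")
print(f"10M-doc core, term is rare:      {score_large_rare:.2f}")
print(f"10M-doc core, term is common:    {score_large_common:.2f}")
```

Under these assumptions the identical document scores above or below its copy 
in the 10-doc core depending only on how common the term is among the other 
documents in the large core.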

idf statistics don’t settle down until at least 10K docs. You still sometimes 
see anomalies under a million documents. 

What design decision do you need to make? We can probably answer that for you.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)
