Thanks Walter.  This explains a lot.

- MJ

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org] 
Sent: Tuesday, March 10, 2015 4:41 PM
To: solr-user@lucene.apache.org
Subject: Re: Cores and and ranking (search quality)

If the documents are distributed randomly across shards/cores, then the 
statistics will be similar in each core and the results will be similar.

If the documents are distributed semantically (say, by topic or type), the 
statistics of each core will be skewed towards that set of documents and the 
results could be quite different.

Assume I have tech support documents and I put all the LaserJet docs in one 
core. That term is very common in that core (poor idf) and rare in other cores 
(strong idf). But for the query “laserjet”, all the good answers are in the 
LaserJet-specific core, where they will be scored low.

An identical document that mentions “LaserJet” once will score fairly low in 
the LaserJet-specific collection and fairly high in the other collection.
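
To make the LaserJet example concrete, here is a minimal sketch of per-core idf. It assumes the classic Lucene-style formula idf = 1 + ln(N / (df + 1)); the core sizes and document frequencies are made-up numbers for illustration, not measurements.

```python
import math

def idf(num_docs, doc_freq):
    # Classic Lucene-style idf (assumed here): 1 + ln(N / (df + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical per-core statistics for the term "laserjet"
laserjet_core = {"num_docs": 100_000, "doc_freq": 90_000}  # term is very common
other_core = {"num_docs": 100_000, "doc_freq": 50}         # term is rare

idf_in_laserjet_core = idf(**laserjet_core)
idf_in_other_core = idf(**other_core)

# The same term gets a much weaker weight in the core where it is common
print(f"LaserJet core idf: {idf_in_laserjet_core:.2f}")
print(f"Other core idf:    {idf_in_other_core:.2f}")
```

With purely local statistics, the same document would pick up the weaker weight in the LaserJet core and the stronger one elsewhere, which is the skew described above.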

Global IDF fixes this by using corpus-wide statistics. That’s how we ran 
Infoseek and Ultraseek in the late 1990s.
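
As a rough sketch of what corpus-wide statistics mean here: sum the document counts and document frequencies across cores, then compute one idf that every core uses. The formula is the classic Lucene-style idf again (an assumption), and the core names and counts are hypothetical.

```python
import math

def idf(num_docs, doc_freq):
    # Classic Lucene-style idf (assumed here): 1 + ln(N / (df + 1))
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical per-core statistics for one term
cores = {
    "laserjet": {"num_docs": 100_000, "doc_freq": 90_000},
    "other": {"num_docs": 100_000, "doc_freq": 50},
}

# Local idf: each core only sees its own documents
local_idfs = {name: idf(c["num_docs"], c["doc_freq"]) for name, c in cores.items()}

# Global idf: pool the counts so every core scores with the same weight
total_docs = sum(c["num_docs"] for c in cores.values())
total_df = sum(c["doc_freq"] for c in cores.values())
global_idf = idf(total_docs, total_df)

print(local_idfs)
print(global_idf)
```

Because both cores would use the single pooled value, an identical document gets the same term weight no matter which core holds it.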

Random allocation to cores avoids it.

If you have significant traffic directed to one object type AND you need peak 
performance, you may want to segregate your cores by object type. Otherwise, 
I’d let SolrCloud spread them around randomly and filter based on an object 
type field. That should work well for most purposes.
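
A minimal sketch of the filter approach, assuming a hypothetical collection named `records` and a hypothetical field named `object_type`. It only builds the request URL using Solr's standard `q` and `fq` parameters; it does not send anything.

```python
from urllib.parse import urlencode

def build_select_url(base_url, query, object_type=None):
    # q carries the scored query; fq restricts the result set without
    # contributing to the score
    params = [("q", query), ("wt", "json")]
    if object_type is not None:
        # "object_type" is a hypothetical field name for this sketch
        params.append(("fq", f'object_type:"{object_type}"'))
    return f"{base_url}/select?{urlencode(params)}"

# Hypothetical host and collection name
base = "http://localhost:8983/solr/records"

url_all = build_select_url(base, "XYZ")              # search all object types
url_a = build_select_url(base, "XYZ", "object-A")    # restrict to one type

print(url_all)
print(url_a)
```

Since the filter does not change scoring, the ranking within one object type still uses the statistics of the whole randomly distributed collection.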

Any core with fewer than 1000 records is likely to give somewhat mysterious 
results. A word that is common in English, like “next”, will only be in one 
document and will score too high. A less-common word, like “unreasonably”, will 
be in 20 and will score low. You need lots of docs for the language statistics 
to even out.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Mar 10, 2015, at 1:23 PM, johnmu...@aol.com wrote:

> Thanks Walter.
> 
> The design decision I'm trying to solve is this: using multiple cores, will 
> my ranking be impacted vs. using single core?
> 
> I have records to index and each record can be grouped into object-types, 
> such as object-A, object-B, object-C, etc.  I have a total of 30 (maybe more) 
> object-types.  There may be only 10 records of object-A, but 10 million 
> records of object-B or 1 million of object-C, etc.  I need to be able to 
> search against a single object-type and / or across all object-types.
> 
> From my past experience, in a single core setup, if I have two identical 
> records, and I search on the term "XYZ" that matches one of the records, the 
> second record ranks right next to the other (because it too contains "XYZ").  
> This is good and is the expected behavior.  If I want to limit my search to 
> an object-type, I AND "XYZ" with that object-type.  So all is well.
> 
> What I'm considering for my new design is to use multiple cores and 
> distributed search.  I am considering creating a core for each object-type: 
> core-A will hold records from object-A, core-B will hold records from 
> object-B, etc.  Before I can make a decision on this design, I need to know 
> how ranking will be impacted.
> 
> Going back to my earlier example: if I have 2 identical records, one of them 
> went to core-A which has 10 records, and the other went to core-B which has 
> 10 million records, using distributed search, if I now search across all 
> cores on the term "XYZ" (just like in the single core case), it will match 
> both of those records all right, but will those two records be ranked next to 
> each other just like in the single core case?  If not, which will rank 
> higher, the one from core-A or the one from core-B?
> 
> My concern is, using multi-cores and distributed search means I will give up 
> on rank quality when records are not distributed across cores evenly.  If so, 
> then maybe this is not a design I can use.
> 
> - MJ
> 
> -----Original Message-----
> From: Walter Underwood [mailto:wun...@wunderwood.org] 
> Sent: Tuesday, March 10, 2015 2:39 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Cores and and ranking (search quality)
> 
> On Mar 10, 2015, at 10:17 AM, johnmu...@aol.com wrote:
> 
>> If I have two cores, one core has 10 docs another has 100,000 docs.  I then 
>> submit two docs that are 100% identical (with the exception of the unique-ID 
>> field, which is stored but not indexed), one to each core.  The question is, 
>> during search, will both of those docs rank near each other or not? […]
>> 
>> Put another way: do docs from the smaller core (the one that has only 10 docs) 
>> rank higher or lower compared to docs from the larger core (the one with 
>> 100,000 docs)?
> 
> These are not quite the same question.
> 
> tf.idf ranking depends on the other documents in the collection (the idf 
> term). With 10 docs, the document frequency statistics are effectively random 
> noise, so the ranking is unpredictable.
> 
> Identical documents should rank identically, but whether they are higher or 
> lower in the two cores depends on the rest of the docs.
> 
> idf statistics don’t settle down until at least 10K docs. You still sometimes 
> see anomalies under a million documents. 
> 
> What design decision do you need to make? We can probably answer that for you.
> 
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
> 
