Thanks, Walter. The design decision I'm trying to make is this: if I use multiple cores, will my ranking be impacted compared with using a single core?
I have records to index, and each record belongs to an object-type, such as object-A, object-B, object-C, etc. I have a total of 30 (maybe more) object-types. There may be only 10 records of object-A, but 10 million records of object-B or 1 million of object-C, etc. I need to be able to search against a single object-type and/or across all object-types.

From my past experience, in a single-core setup, if I have two identical records and I search on the term "XYZ" that matches one of the records, the second record ranks right next to the other (because it too contains "XYZ"). This is good and is the expected behavior. If I want to limit my search to an object-type, I AND "XYZ" with that object-type. So all is well.

What I'm considering for my new design is multiple cores with distributed search. I would create a core for each object-type: core-A will hold records from object-A, core-B will hold records from object-B, etc.

Before I can make a decision on this design, I need to know how ranking will be impacted.

Going back to my earlier example: say I have 2 identical records, one of which went to core-A, which has 10 records, and the other to core-B, which has 10 million records. If I now use distributed search to search across all cores on the term "XYZ" (just like in the single-core case), it will match both of those records all right, but will those two records be ranked next to each other just like in the single-core case? If not, which will rank higher, the one from core-A or the one from core-B?

My concern is that using multiple cores and distributed search means giving up rank quality when records are not distributed evenly across cores. If so, then maybe this is not a design I can use.
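
To make the concern concrete, here is a minimal sketch; it is not Solr code. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(numDocs / (docFreq + 1)), and it assumes each core scores with only its own local statistics. The function name and all document counts and frequencies are made-up values for illustration.

```python
import math


def classic_idf(doc_freq, num_docs):
    """idf in the style of Lucene's classic TF-IDF similarity (assumed here):
    1 + ln(num_docs / (doc_freq + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


# Hypothetical case: "XYZ" appears in exactly one record in each core.
idf_core_a = classic_idf(doc_freq=1, num_docs=10)          # small core
idf_core_b = classic_idf(doc_freq=1, num_docs=10_000_000)  # large core

# Same two records indexed together in a single core instead.
idf_single = classic_idf(doc_freq=2, num_docs=10_000_010)

print(f"core-A idf:      {idf_core_a:.3f}")
print(f"core-B idf:      {idf_core_b:.3f}")
print(f"single-core idf: {idf_single:.3f}")
```

Under those assumptions the same term gets a very different idf weight in each core, so two identical records would not necessarily end up next to each other once the per-core results are merged. In the single-core case both records share one idf value, which is why they rank together there.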

- MJ

-----Original Message-----
From: Walter Underwood [mailto:wun...@wunderwood.org]
Sent: Tuesday, March 10, 2015 2:39 PM
To: solr-user@lucene.apache.org
Subject: Re: Cores and and ranking (search quality)

On Mar 10, 2015, at 10:17 AM, johnmu...@aol.com wrote:

> If I have two cores, one core has 10 docs another has 100,000 docs. I then
> submit two docs that are 100% identical (with the exception of the unique-ID
> fields, which is stored but not indexed) one to each core. The question is,
> during search, will both of those docs rank near each other or not? […]
>
> Put another way: are docs from the smaller core (the one has 10 docs only)
> rank higher or lower compared to docs from the larger core (the one with
> 100,000) docs?

These are not quite the same question.

tf.idf ranking depends on the other documents in the collection (the idf term). With 10 docs, the document frequency statistics are effectively random noise, so the ranking is unpredictable.

Identical documents should rank identically, but whether they are higher or lower in the two cores depends on the rest of the docs.

idf statistics don't settle down until at least 10K docs. You still sometimes see anomalies under a million documents.

What design decision do you need to make? We can probably answer that for you.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/ (my blog)