Re: Cores and and ranking (search quality)

2015-03-12 Thread Erick Erickson
SOLR-1632 will certainly help. But trying to predict whether your core
A or core B will appear first doesn't really seem like a good use of
time. If you actually have a setup like you describe, add debug=all
to your query on both cores and you'll see all the gory detail of how
the scores are calculated, providing a definitive answer in _your_
situation.
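
A minimal sketch of building such a request for both cores, assuming two hypothetical core URLs (`coreA` and `coreB` on `localhost:8983`), a placeholder query `xyz`, and a unique key field named `id`. Only `q`, `fl`, `wt` and `debug` are Solr request parameters; everything else is illustrative:

```python
from urllib.parse import urlencode

# Hypothetical core URLs; substitute your own hosts and core names.
CORES = [
    "http://localhost:8983/solr/coreA",
    "http://localhost:8983/solr/coreB",
]

def debug_query_url(core_url, query):
    """Build a /select URL that asks Solr to explain how each score was computed."""
    params = {
        "q": query,
        "fl": "id,score",  # assumed unique key field "id", plus the score
        "debug": "all",    # ask for the full score explanation
        "wt": "json",
    }
    return core_url + "/select?" + urlencode(params)

urls = [debug_query_url(core, "xyz") for core in CORES]
for url in urls:
    print(url)
```

Fetching each URL and comparing the explanation for the same document in the two responses shows which score factors differ between the cores.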

Best,
Erick




Re: Cores and and ranking (search quality)

2015-03-11 Thread johnmunir
Thanks Walter.  This explains a lot.

- MJ


Re: Cores and and ranking (search quality)

2015-03-10 Thread Shawn Heisey
On 3/10/2015 11:17 AM, johnmu...@aol.com wrote:
 If I have two cores, one core has 10 docs another has 100,000 docs.  I then 
 submit two docs that are 100% identical (with the exception of the unique-ID 
 fields, which is stored but not indexed) one to each core.  The question is, 
 during search, will both of those docs rank near each other or not?  If so, 
 this is great because it will behave the same as if I had one core and index 
 both docs to this single core.  If not, which core's doc will rank higher and 
 how far apart the two docs be from each other in the ranking?

 Put another way: are docs from the smaller core (the one has 10 docs only) 
 rank higher or lower compared to docs from the larger core (the one with 
 100,000) docs?

Without specific knowledge about the document in question as well as all
the other documents, this is impossible to answer, except to say that
the relative ranking position is likely to be different.  Dropping back
to general info:

The overall term frequency and inverse document frequency (TF-IDF) in
the 100,000 document index will very likely be quite a lot different
than in the 10 document index.  That will affect ranking order. 
Sometimes users are surprised by the results they get, but it is very
rare to find a bug in Lucene scoring.

In addition to the debug parameter that Erick told you about, here are a
couple of classes you could investigate at the source code level for
more information about ranking:

http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/similarities/Similarity.html
http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/similarities/DefaultSimilarity.html

Here's info that is more general, and from a much earlier Lucene version:

https://lucene.apache.org/core/3_6_2/scoring.html

I have my Solr install configured to use the BM25 similarity.

http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/similarities/BM25Similarity.html
http://en.wikipedia.org/wiki/Okapi_BM25

SOLR-1632 aims to make TF-IDF the same across multiple cores as you
would get if you only had one core.  I do not know enough about it to
know whether it is EXACTLY the same, or only an approximation ... but in
a search context, 100 percent precise calculation is rarely required. 
When you drop that as a requirement, search becomes easier and a LOT faster.

Thanks,
Shawn



Re: Cores and and ranking (search quality)

2015-03-10 Thread johnmunir
Thanks Erick for trying to help, I really appreciate it.  Unfortunately, I'm 
still stuck.

There are times when one must know the inner workings and behavior of the 
software to make a design decision, and this is one of them.  If I knew the 
inner workings of Solr, I would not be asking.  In addition, I'm in the design 
process, so I'm not able to fully test.  Besides, my test could be invalid 
because I may not set it up right, given my lack of understanding of the inner 
workings of Solr.

Given this, I hope you don't mind me asking again.

Suppose I have two cores, one with 10 docs and the other with 100,000 docs, and 
I submit two docs that are 100% identical (with the exception of the unique-ID 
field, which is stored but not indexed), one to each core.  The question is, 
during search, will both of those docs rank near each other or not?  If so, 
this is great because it will behave the same as if I had one core and indexed 
both docs into it.  If not, which core's doc will rank higher, and how far 
apart will the two docs be from each other in the ranking?

Put another way: do docs from the smaller core (the one with only 10 docs) rank 
higher or lower than docs from the larger core (the one with 100,000 docs)?

Thanks!

-- MJ





Re: Cores and and ranking (search quality)

2015-03-10 Thread johnmunir
Thanks Walter.

The design decision I'm trying to make is this: if I use multiple cores, will 
my ranking be impacted compared to using a single core?

I have records to index and each record can be grouped into object-types, such 
as object-A, object-B, object-C, etc.  I have a total of 30 (maybe more) 
object-types.  There may be only 10 records of object-A, but 10 million records 
of object-B or 1 million of object-C, etc.  I need to be able to search against 
a single object-type and / or across all object-types.

From my past experience, in a single-core setup, if I have two identical 
records and I search on the term XYZ that matches one of the records, the 
second record ranks right next to the other (because it too contains XYZ).  
This is good and is the expected behavior.  If I want to limit my search to an 
object-type, I AND XYZ with that object-type.  So all is well.

What I'm considering for my new design is to use multiple cores and distributed 
search.  I am considering creating a core for each object-type: core-A will 
hold records from object-A, core-B will hold records from object-B, etc.  
Before I can make a decision on this design, I need to know how ranking will be 
impacted.

Going back to my earlier example: suppose I have 2 identical records, one of 
which went to core-A, which has 10 records, and the other to core-B, which has 
10 million records.  If I now use distributed search to search across all cores 
on the term XYZ (just like in the single-core case), it will match both of 
those records all right, but will those two records be ranked next to each 
other just like in the single-core case?  If not, which will rank higher, the 
one from core-A or the one from core-B?

My concern is that using multiple cores and distributed search means I will 
give up rank quality when records are not distributed evenly across cores.  If 
so, then maybe this is not a design I can use.

- MJ




Re: Cores and and ranking (search quality)

2015-03-10 Thread Walter Underwood
If the documents are distributed randomly across shards/cores, then the 
statistics will be similar in each core and the results will be similar.

If the documents are distributed semantically (say, by topic or type), the 
statistics of each core will be skewed towards that set of documents and the 
results could be quite different.

Assume I have tech support documents and I put all the LaserJet docs in one 
core. That term is very common in that core (poor idf) and rare in other cores 
(strong idf). But for the query “laserjet”, all the good answers are in the 
LaserJet-specific core, where they will be scored low.

An identical document that mentions “LaserJet” once will score fairly low in 
the LaserJet-specific collection and fairly high in the other collection.
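
As a rough illustration of the effect described above, here is a sketch that computes only the idf factor per core, using the formula from Lucene's classic TF-IDF similarity, `1 + ln(numDocs / (docFreq + 1))`. The core sizes and document frequencies are made-up numbers, and a real score also involves term frequency and length normalization, which are left out here:

```python
import math

def classic_idf(num_docs, doc_freq):
    """idf factor of Lucene's classic TF-IDF similarity: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Made-up statistics for the term "laserjet" in two cores of equal size.
laserjet_core = classic_idf(num_docs=10_000, doc_freq=9_500)  # term in almost every doc
other_core = classic_idf(num_docs=10_000, doc_freq=20)        # term is rare

print(f"idf in LaserJet core: {laserjet_core:.2f}")
print(f"idf in other core:    {other_core:.2f}")
```

With these assumed numbers the term gets a much smaller idf in the LaserJet core than in the other core, so an identical document matching only on that term would score lower in the core where the good answers live.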

Global IDF fixes this by using corpus-wide statistics. That’s how we ran 
Infoseek and Ultraseek in the late 1990s.
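
The idea can be sketched as follows: add up the per-core document counts and document frequencies first, then compute one idf from the totals. This is only an illustration of corpus-wide statistics with hypothetical numbers, not a description of how SOLR-1632 or any particular engine implements it:

```python
import math

def classic_idf(num_docs, doc_freq):
    """idf factor of Lucene's classic TF-IDF similarity: 1 + ln(N / (df + 1))."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

# Hypothetical per-core statistics for one term:
# (documents in the core, documents in the core that contain the term)
core_stats = {
    "core-A": (10, 1),
    "core-B": (100_000, 50),
}

def global_idf(stats):
    """Compute a single idf from statistics summed over all cores."""
    total_docs = sum(n for n, _ in stats.values())
    total_df = sum(df for _, df in stats.values())
    return classic_idf(total_docs, total_df)

local = {name: classic_idf(n, df) for name, (n, df) in core_stats.items()}
shared = global_idf(core_stats)

for name, value in local.items():
    print(f"{name}: local idf {value:.2f}, global idf {shared:.2f}")
```

Because every core then scores the term with the same idf, two identical documents receive the same idf contribution no matter which core holds them.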

Random allocation to cores avoids it.

If you have significant traffic directed to one object type AND you need peak 
performance, you may want to segregate your cores by object type. Otherwise, 
I’d let SolrCloud spread them around randomly and filter based on an object 
type field. That should work well for most purposes.
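
A sketch of the "filter on an object type field" approach, assuming a hypothetical collection URL and a hypothetical indexed field named `object_type`. `q` and `fq` are the standard Solr query and filter-query parameters:

```python
from urllib.parse import urlencode

# Hypothetical collection URL; substitute your own.
BASE_URL = "http://localhost:8983/solr/records"

def typed_search_url(base_url, query, object_type=None):
    """Search all records, or only one object type when object_type is given."""
    params = [("q", query), ("wt", "json")]
    if object_type is not None:
        # Restrict the result set with a filter query on the assumed field.
        params.append(("fq", f"object_type:{object_type}"))
    return base_url + "/select?" + urlencode(params)

all_types_url = typed_search_url(BASE_URL, "XYZ")
one_type_url = typed_search_url(BASE_URL, "XYZ", object_type="object-A")
print(all_types_url)
print(one_type_url)
```

Both requests go to the same index, so the term statistics behind the ranking are the same whether or not the filter is applied.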

Any core with fewer than 1,000 records is likely to give somewhat mysterious 
results. A word that is common in English, like “next”, might appear in only 
one document and score too high. A less-common word, like “unreasonably”, might 
appear in 20 and score low. You need lots of docs for the language statistics 
to even out.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Cores and and ranking (search quality)

2015-03-10 Thread Walter Underwood
On Mar 10, 2015, at 10:17 AM, johnmu...@aol.com wrote:

 If I have two cores, one core has 10 docs another has 100,000 docs.  I then 
 submit two docs that are 100% identical (with the exception of the unique-ID 
 fields, which is stored but not indexed) one to each core.  The question is, 
 during search, will both of those docs rank near each other or not? […]
 
 Put another way: are docs from the smaller core (the one has 10 docs only) 
 rank higher or lower compared to docs from the larger core (the one with 
 100,000) docs?

These are not quite the same question.

tf.idf ranking depends on the other documents in the collection (the idf term). 
With 10 docs, the document frequency statistics are effectively random noise, 
so the ranking is unpredictable.

Identical documents should rank identically, but whether they are higher or 
lower in the two cores depends on the rest of the docs.

idf statistics don’t settle down until at least 10K docs. You still sometimes 
see anomalies under a million documents. 
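
One way to see why small cores are noisy, as a back-of-the-envelope sketch: assume each document contains a given term independently with probability p (a simplifying binomial model, not something Lucene computes). The observed document frequency then has a relative standard error of sqrt((1 - p) / (p * N)) for a core of N documents. The 1% term rate below is an assumed value:

```python
import math

def relative_std_error(term_rate, num_docs):
    """Relative standard error of the observed document frequency (binomial model)."""
    return math.sqrt((1.0 - term_rate) / (term_rate * num_docs))

TERM_RATE = 0.01  # assumed: the term occurs in 1% of all documents

errors = {n: relative_std_error(TERM_RATE, n) for n in (10, 10_000, 1_000_000)}
for n, err in errors.items():
    print(f"{n:>9} docs: typical deviation of df is about {err:.0%} of its expected value")
```

Under this assumption the deviation is several times the expected value at 10 documents, about a tenth of it at 10,000, and about a hundredth at a million, which is in line with the rule of thumb above.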

What design decision do you need to make? We can probably answer that for you.

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)



Re: Cores and and ranking (search quality)

2015-03-09 Thread johnmunir
(reposting this to see if anyone can help)


Help me understand this better (regarding ranking).

Suppose I have two docs that are 100% identical with the exception of uid 
(which is stored but not indexed).  In a single-core setup, I search on xyz and 
those 2 docs end up ranking as #1 and #2.  When I switch over to a two-core 
setup, doc-A goes to core-A (which has 10 records) and doc-B goes to core-B 
(which has 100,000 records).

Now, are you saying that in the 2-core setup, if I search on xyz (just like in 
the single-core setup), this time I will not see doc-A and doc-B as #1 and #2 
in the ranking?  That is, are you saying doc-A may now be somewhere at the top 
or bottom, far away from doc-B?  If so, which will be #1: doc-A off core-A 
(that has 10 records) or doc-B off core-B (that has 100,000 records)?

If I got all this right, are you saying SOLR-1632 will fix this issue such that 
the end result will now be as if I had 1 core?

- MJ





RE: Cores and and ranking (search quality)

2015-03-06 Thread johnmunir
Help me understand this better (regarding ranking).

Suppose I have two docs that are 100% identical with the exception of uid 
(which is stored but not indexed).  In a single-core setup, I search on xyz and 
those 2 docs end up ranking as #1 and #2.  When I switch over to a two-core 
setup, doc-A goes to core-A (which has 10 records) and doc-B goes to core-B 
(which has 100,000 records).

Now, are you saying that in the 2-core setup, if I search on xyz (just like in 
the single-core setup), this time I will not see doc-A and doc-B as #1 and #2 
in the ranking?  That is, are you saying doc-A may now be somewhere at the top 
or bottom, far away from doc-B?  If so, which will be #1: doc-A off core-A 
(that has 10 records) or doc-B off core-B (that has 100,000 records)?

If I got all this right, are you saying SOLR-1632 will fix this issue such that 
the end result will now be as if I had 1 core?

- MJ





RE: Cores and and ranking (search quality)

2015-03-05 Thread Markus Jelsma
Hello - faceting will be the same, and distributed more-like-this is also 
possible since 5.0; there is a working patch for 4.10.3. Regular search will 
work as well since 5.0 because of distributed IDF, which you need to enable 
manually. Behaviour will not be the same if you rely on average document length 
statistics, which is the case when you use BM25 instead of the default TF-IDF 
similarity. Solr will do the result merging, so everything is transparent. 
Awesome!

Markus 
 
 


Re: Cores and and ranking (search quality)

2015-03-05 Thread Toke Eskildsen
On Thu, 2015-03-05 at 14:34 +0100, johnmu...@aol.com wrote:
 My question is this: if I put my data in multiple cores and use
 distributed search will the ranking be different if I had all my data
 in a single core?

Yes, it will be different. The practical impact depends on how
homogeneous your data are across the shards and how large your shards
are. If you have small and dissimilar shards, your ranking will suffer a
lot.

Work is being done to remedy this:
https://issues.apache.org/jira/browse/SOLR-1632

 Also, will facet and more-like-this quality / result be the same?

It is not formally guaranteed, but for most practical purposes, faceting
on multi-shards will give you the same results as single-shards.

I don't know about more-like-this. My guess is that it will be affected
in the same way that standard searches are.

 Also, reading the distributed search wiki
 (http://wiki.apache.org/solr/DistributedSearch) it looks like Solr does
 the search and result merging (all I have to do is issue a search), is
 this correct?

Yes. From a user-perspective, searches are no different.
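
For illustration, a distributed request is an ordinary query with a `shards` parameter listing the cores to search. The sketch below assumes all three cores from the original question live on a hypothetical `localhost:8983`; the host and the entry-point core are assumptions:

```python
from urllib.parse import urlencode

# Core names taken from the question; the host is an assumption.
SHARDS = [
    "localhost:8983/solr/core_1",
    "localhost:8983/solr/core_2",
    "localhost:8983/solr/core_3",
]

def distributed_query_url(entry_core_url, query, shards):
    """Build a /select URL that asks one core to query all shards and merge the results."""
    params = {
        "q": query,
        "shards": ",".join(shards),  # comma-separated list of host:port/path entries
        "wt": "json",
    }
    return entry_core_url + "/select?" + urlencode(params)

url = distributed_query_url("http://localhost:8983/solr/core_1", "xyz", SHARDS)
print(url)
```

The core that receives the request fans the query out to every listed shard and returns one merged, ranked result list.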

- Toke Eskildsen, State and University Library, Denmark




Cores and and ranking (search quality)

2015-03-05 Thread johnmunir
Hi,

I have data that I will index and search.  This data is well defined, such 
that I can index it into a single core or multiple cores like so: 
core_1:Jan2015, core_2:Feb2015, core_3:Mar2015, etc.

My question is this: if I put my data in multiple cores and use distributed 
search, will the ranking be different than if I had all my data in a single 
core?  If yes, how will it be different?  Also, will facet and more-like-this 
quality / results be the same?

Also, reading the distributed search wiki 
(http://wiki.apache.org/solr/DistributedSearch) it looks like Solr does the 
search and result merging (all I have to do is issue a search), is this correct?

Thanks!

- MJ