Re: SolrCloud different score for same document on different replicas.

2017-01-13 Thread Morten Bøgeskov
On Thu, 5 Jan 2017 16:31:35 +
Charlie Hull  wrote:

> On 05/01/2017 13:30, Morten Bøgeskov wrote:
> >
> >
> > Hi.
> >
> > We've got a SolrCloud which is sharded and has a replication factor of
> > 2.
> >
> > The 2 replicas of a shard may look like this:
> >
> > Num Docs:5401023
> > Max Doc:6388614
> > Deleted Docs:987591
> >
> >
> > Num Docs:5401023
> > Max Doc:5948122
> > Deleted Docs:547099
> >
> > We've seen >10% difference in Max Doc at times with the same Num Docs.
> > Our use case is a few documents that are searched and many small ones
> > that are filtered against (often updated multiple times a day), so the
> > difference in deleted docs isn't surprising.
> >
> > This results in a different score for a document depending on which
> > replica it comes from. As I see it: it has to do with the different
> > maxDoc value when calculating idf.
> >
> > This in turn alters a specific document's position in the search
> > result over reloads. This is quite confusing (duplicates in pagination).
> >
> > What is the trick to getting a homogeneous score from different replicas?
> > We've tried using ExactStatsCache & ExactSharedStatsCache, but that
> > didn't seem to make any difference.
> >
> > Any hints to this will be greatly appreciated.
> >
> 
> This was one of the things we looked at during our recent Lucene London
> Hackday (see item 3) https://github.com/flaxsearch/london-hackday-2016
> 
> I'm not sure there is a way to get a homogeneous score - this patch tries
> to keep you connected to the same replica during a session so you don't 
> see results jumping over pagination.
> 

Sorry for the late reply.

I went with a new search handler that inherits from SearchHandler.
It hashes the query and uses the hash to select which replicas to put
in the shards parameter (only if it's a cloud and a distributed query
where shards isn't already set), then passes the request on to the
original handler.
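
The replica selection described above can be sketched outside Solr
roughly as follows. This is only a Python illustration of the hashing
idea, not the actual handler (which is a Java subclass of
SearchHandler); the cluster layout, the core URLs and the function name
pick_shards are made up for the example.

```python
import hashlib

# Hypothetical cluster layout: shard name -> replica core URLs (made up for this sketch).
CLUSTER = {
    "shard1": ["host1:8983/solr/coll_shard1_replica1",
               "host2:8983/solr/coll_shard1_replica2"],
    "shard2": ["host3:8983/solr/coll_shard2_replica1",
               "host4:8983/solr/coll_shard2_replica2"],
}


def pick_shards(query, cluster=CLUSTER):
    """Return a value for the `shards` parameter with one replica per shard.

    The same query string always hashes to the same replicas, so repeated
    requests for it are scored against the same indexes.
    """
    # Use a stable digest; Python's built-in hash() is randomized per process.
    digest = int(hashlib.md5(query.encode("utf-8")).hexdigest(), 16)
    chosen = []
    for shard in sorted(cluster):
        replicas = cluster[shard]
        chosen.append(replicas[digest % len(replicas)])
    return ",".join(chosen)


print(pick_shards("title:solr"))
```

Presumably only the parts of the request that define the result set
should go into the hash; paging parameters such as start and rows would
have to be left out, or each page could land on a different replica.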

Given sufficiently diverse end-user queries, this spreads the load
roughly equally across the cloud. It could put a skewed load on some
nodes if a query suddenly becomes very popular, or if you have a
default query on an opening page (quite unlikely in our use case).
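
A small simulation can illustrate the load argument. It is a sketch
under assumptions: two replicas per shard, an MD5-based hash, and
synthetic query strings that stand in for real traffic.

```python
import hashlib
from collections import Counter


def replica_index(query, replica_count=2):
    """Map a query string to a replica index with a stable hash."""
    digest = int(hashlib.md5(query.encode("utf-8")).hexdigest(), 16)
    return digest % replica_count


# 10,000 distinct (synthetic) queries: the load splits roughly evenly.
diverse = Counter(replica_index(f"q={i}") for i in range(10_000))

# One popular query repeated 10,000 times: everything lands on one replica.
popular = Counter(replica_index("q=*:*") for _ in range(10_000))

print("diverse:", dict(diverse))
print("popular:", dict(popular))
```

The second counter has a single entry, which is the skew that a default
or suddenly popular query would cause.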

Thanks for the input.


-- 
 Morten Bøgeskov 



Re: SolrCloud different score for same document on different replicas.

2017-01-09 Thread Morten Bøgeskov
On Fri, 6 Jan 2017 10:45:02 -0600
Webster Homer  wrote:

> I was seeing something like this, and it turned out to be a problem with
> our autoCommit and autoSoftCommit settings. We had overly aggressive
> settings that eventually started failing with errors around too many
> warming searchers etc...
> 
> You can test this by doing a commit and seeing if the replicas start
> returning consistent results
> 

A commit changes nothing, since the number of deleted documents doesn't
change much.
Optimize makes ranking consistent across replicas for the time being,
until enough updates have hit the shard that the number of deleted
documents diverges again (deletes in the largest segment take some time
to be pruned, since that needs a merge). Optimizing hourly is not
really an option.





-- 
 Morten Bøgeskov 



Re: SolrCloud different score for same document on different replicas.

2017-01-06 Thread Webster Homer
I was seeing something like this, and it turned out to be a problem with
our autoCommit and autoSoftCommit settings. We had overly aggressive
settings that eventually started failing with errors around too many
warming searchers etc...

You can test this by doing a commit and seeing if the replicas start
returning consistent results
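
One way to run that check is to query each replica core directly and
compare the scores. The sketch below assumes Solr's distrib=false
parameter to keep a request on a single core; the host names, core
names and sample scores are invented, and fetching the responses is
left out.

```python
from urllib.parse import urlencode


def replica_query_url(core_url, query):
    """Build a select URL that queries one core only (no distributed fan-out)."""
    params = {"q": query, "fl": "id,score", "distrib": "false", "wt": "json"}
    return f"{core_url}/select?{urlencode(params)}"


def score_differences(docs_a, docs_b, tolerance=1e-6):
    """Return the ids whose scores differ between two replicas' result lists."""
    scores_b = {d["id"]: d["score"] for d in docs_b}
    return [d["id"] for d in docs_a
            if d["id"] in scores_b
            and abs(d["score"] - scores_b[d["id"]]) > tolerance]


# Made-up results standing in for the `response.docs` list of each replica.
replica_a = [{"id": "doc1", "score": 9.76}, {"id": "doc2", "score": 3.10}]
replica_b = [{"id": "doc1", "score": 9.69}, {"id": "doc2", "score": 3.10}]

url = replica_query_url("http://host1:8983/solr/coll_shard1_replica1", "title:solr")
print(url)
print(score_differences(replica_a, replica_b))
```

Running the comparison before and after a commit would show whether
the commit brings the replicas back in line.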




Re: SolrCloud different score for same document on different replicas.

2017-01-05 Thread Charlie Hull

On 05/01/2017 13:30, Morten Bøgeskov wrote:



> Hi.
>
> We've got a SolrCloud which is sharded and has a replication factor of
> 2.
>
> The 2 replicas of a shard may look like this:
>
> Num Docs:5401023
> Max Doc:6388614
> Deleted Docs:987591
>
>
> Num Docs:5401023
> Max Doc:5948122
> Deleted Docs:547099
>
> We've seen >10% difference in Max Doc at times with the same Num Docs.
> Our use case is a few documents that are searched and many small ones
> that are filtered against (often updated multiple times a day), so the
> difference in deleted docs isn't surprising.
>
> This results in a different score for a document depending on which
> replica it comes from. As I see it: it has to do with the different
> maxDoc value when calculating idf.
>
> This in turn alters a specific document's position in the search
> result over reloads. This is quite confusing (duplicates in pagination).
>
> What is the trick to getting a homogeneous score from different replicas?
> We've tried using ExactStatsCache & ExactSharedStatsCache, but that
> didn't seem to make any difference.
>
> Any hints to this will be greatly appreciated.



This was one of the things we looked at during our recent Lucene London
Hackday (see item 3) https://github.com/flaxsearch/london-hackday-2016


I'm not sure there is a way to get a homogeneous score - this patch tries
to keep you connected to the same replica during a session so you don't 
see results jumping over pagination.


Cheers

Charlie


--
Charlie Hull
Flax - Open Source Enterprise Search

tel/fax: +44 (0)8700 118334
mobile:  +44 (0)7767 825828
web: www.flax.co.uk


RE: SolrCloud different score for same document on different replicas.

2017-01-05 Thread Markus Jelsma
Hello - you need a custom similarity that uses docCount instead of
maxDoc as the total document count when calculating IDF. I believe this
was fixed in some version, but I'm not sure.
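
As a rough illustration of why that total matters, the sketch below
plugs the two maxDoc values from this thread into a classic TF-IDF
style formula, idf = 1 + ln(N / (docFreq + 1)). The formula is an
assumption (the exact one depends on the similarity and the Lucene
version), and the docFreq of 1000 is hypothetical and taken to be the
same on both replicas, which may not hold in practice.

```python
import math


def classic_idf(doc_freq, doc_total):
    """Classic TF-IDF style idf: 1 + ln(doc_total / (doc_freq + 1))."""
    return 1.0 + math.log(doc_total / (doc_freq + 1))


DOC_FREQ = 1000  # hypothetical number of documents containing the term

# maxDoc values of the two replicas quoted in this thread.
idf_a = classic_idf(DOC_FREQ, 6388614)
idf_b = classic_idf(DOC_FREQ, 5948122)

print(f"replica A idf: {idf_a:.4f}")
print(f"replica B idf: {idf_b:.4f}")
print(f"difference:    {idf_a - idf_b:.4f}")
```

With equal docFreq the gap between the two idf values is
ln(6388614 / 5948122) whatever the term, so every term in a query is
scored slightly higher on the replica with more deleted documents.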

Markus
 
-Original message-
> From:Morten Bøgeskov 
> Sent: Thursday 5th January 2017 14:33
> To: solr-user@lucene.apache.org
> Subject: SolrCloud different score for same document on different replicas.
> 
> 
> 
> Hi.
> 
> We've got a SolrCloud which is sharded and has a replication factor of
> 2.
> 
> The 2 replicas of a shard may look like this:
> 
> Num Docs:5401023
> Max Doc:6388614
> Deleted Docs:987591
> 
> 
> Num Docs:5401023
> Max Doc:5948122
> Deleted Docs:547099
> 
> We've seen >10% difference in Max Doc at times with the same Num Docs.
> Our use case is a few documents that are searched and many small ones
> that are filtered against (often updated multiple times a day), so the
> difference in deleted docs isn't surprising.
> 
> This results in a different score for a document depending on which
> replica it comes from. As I see it: it has to do with the different
> maxDoc value when calculating idf.
> 
> This in turn alters a specific document's position in the search
> result over reloads. This is quite confusing (duplicates in pagination).
> 
> What is the trick to getting a homogeneous score from different replicas?
> We've tried using ExactStatsCache & ExactSharedStatsCache, but that
> didn't seem to make any difference.
> 
> Any hints to this will be greatly appreciated.
> 
> -- 
>  Morten Bøgeskov 
> 
>