Re: Out of sync deletions causing differing IDF

Upayavira Thu, 04 Aug 2016 15:36:58 -0700

Thanks for both of these. We'll give them both a try and see what
difference they make.


Upayavira

On Thu, 4 Aug 2016, at 12:27 PM, Erick Erickson wrote:
> Upayavira:
> 
> bq: I would have expected that, because the data is being indexed
> concurrently across replicas, that the pattern of delete/merge would be
> similar across replicas.
> 
> Except for the pesky timing issue. The autocommit timer starts when a
> request is received, so it won't expire at the same wall-clock time on
> all servers, and the replicas may therefore not have the same docs in
> the same segments. It would be _really nice_ if they did, because then
> we wouldn't have to fall back to full replication so often for recovery.
> 
> I think there's a JIRA out there for trying to coordinate all the commits
> across replicas in a shard, but I can't find it on a quick look.
> 
> Would distributed IDF help here?
> https://issues.apache.org/jira/browse/SOLR-1632 (even though this is
> really old, it's in 5.0+)
> 
> Best,
> Erick
> 
> On Thu, Aug 4, 2016 at 5:12 AM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> > Hello - your similarity should rely on numDocs instead of maxDoc; that
> > solves the problem. I believe it is already fixed in trunk, but I am
> > not sure.
> > Markus
> >
> > -----Original message-----
> >> From:Upayavira <upayav...@odoko.co.uk>
> >> Sent: Thursday 4th August 2016 13:59
> >> To: solr-user@lucene.apache.org
> >> Subject: Out of sync deletions causing differing IDF
> >>
> >> We have a system that sees a reasonable number of changes on a
> >> daily basis (maybe 60m docs, and around 1m updates per day). Using
> >> SolrCloud, the data is split into 10 shards and those shards are replicated.
> >>
> >> What we are finding is that the number of deletions is causing differing
> >> maxDocs across the different replicas, and that is causing significantly
> >> different IDF values between replicas of the same shard, giving
> >> different scores and thus different orders depending upon which replica
> >> we hit.
> >>
> >> I would have expected that, because the data is being indexed
> >> concurrently across replicas, that the pattern of delete/merge would be
> >> similar across replicas, but that doesn't seem to be the case in
> >> practice.
> >>
> >> We could, of course, optimise the index to merge down to a single
> >> segment. This would clear all deletes out, but would leave us in a worse
> >> place for the future, as now most of our deletes would be concentrated
> >> into a single large segment.
> >>
> >> Has anyone seen this sort of thing before, and does anyone have
> >> suggested strategies as to how to encourage IDF values into a similar
> >> range across replicas?
> >>
> >> Upayavira
> >>
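
For anyone reading this thread later, here is a minimal sketch of the
arithmetic behind the problem. It assumes the classic TF-IDF formula
idf = 1 + ln(N / (docFreq + 1)), where N is either maxDoc (which still
counts deleted docs until they are merged away) or numDocs (live docs
only). The class name IdfDrift and every number in it are hypothetical,
and docFreq is held equal on both replicas to keep the example small;
this is an illustration, not Solr's actual scoring code.

```java
public class IdfDrift {
    // Classic TF-IDF style inverse document frequency:
    //   idf = 1 + ln(docCount / (docFreq + 1))
    // docCount is whichever count the similarity is given: maxDoc
    // (includes not-yet-merged deletes) or numDocs (live docs only).
    static double idf(long docFreq, long docCount) {
        return 1.0 + Math.log(docCount / (double) (docFreq + 1));
    }

    public static void main(String[] args) {
        // All numbers below are hypothetical.
        long docFreq = 50_000;               // docs containing the term (simplified: same on both replicas)
        long numDocs = 6_000_000;            // live docs in the shard, identical on both replicas
        long maxDocA = numDocs + 200_000;    // replica A: few unmerged deletes
        long maxDocB = numDocs + 1_500_000;  // replica B: many unmerged deletes

        // With maxDoc, the two replicas disagree on the IDF of the same term.
        System.out.printf("maxDoc-based:  A=%.4f  B=%.4f%n",
                idf(docFreq, maxDocA), idf(docFreq, maxDocB));

        // With numDocs, both replicas compute the same value.
        System.out.printf("numDocs-based: A=%.4f  B=%.4f%n",
                idf(docFreq, numDocs), idf(docFreq, numDocs));
    }
}
```

The replica carrying more unmerged deletes reports a larger maxDoc and so
a higher IDF for the same term, which is enough to reorder results
between replicas. Basing the calculation on numDocs removes that
particular source of drift, as suggested earlier in the thread.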
