Thanks for both of these. We'll give them both a try and see what difference they make.
Upayavira

On Thu, 4 Aug 2016, at 12:27 PM, Erick Erickson wrote:
> Upayavira:
>
> bq: I would have expected that, because the data is being indexed
> concurrently across replicas, that the pattern of delete/merge would be
> similar across replicas.
>
> Except for the pesky timing issue. The timers start for autocommit when a
> request is received. So the time the autocommit timer expires won't be
> the same wall-clock time on all servers and thus may not have the same
> docs in the same segments. It would be _really nice_ if they did, because
> then we wouldn't have to fall back to full replication so often for
> recovery.
>
> I think there's a JIRA out there for trying to coordinate all the commits
> across replicas in a shard, but I can't find it on a quick look.
>
> Would distributed IDF help here?
> https://issues.apache.org/jira/browse/SOLR-1632 (even though this is
> really old, it's in 5.0+)
>
> Best,
> Erick
>
> On Thu, Aug 4, 2016 at 5:12 AM, Markus Jelsma
> <markus.jel...@openindex.io> wrote:
> > Hello - your similarity should rely on numDoc instead, it solves the
> > problem. I believe it is already fixed in trunk, but i am not sure.
> > Markus
> >
> > -----Original message-----
> >> From:Upayavira <upayav...@odoko.co.uk>
> >> Sent: Thursday 4th August 2016 13:59
> >> To: solr-user@lucene.apache.org
> >> Subject: Out of sync deletions causing differing IDF
> >>
> >> We have a system that has a reasonable number of changes going on on a
> >> daily basis (maybe 60m docs, and around 1m updates per day). Using Solr
> >> Cloud, the data is split into 10 shards and those shards are replicated.
> >>
> >> What we are finding is that the number of deletions is causing differing
> >> maxDocs across the different replicas, and that is causing significantly
> >> different IDF values between replicas of the same shard, giving
> >> different scores and thus different orders depending upon which replica
> >> we hit.
> >>
> >> I would have expected that, because the data is being indexed
> >> concurrently across replicas, that the pattern of delete/merge would be
> >> similar across replicas, but that doesn't seem to be the case in
> >> practice.
> >>
> >> We could, of course, optimise the index to merge down to a single
> >> segment. This would clear all deletes out, but would leave us in a worse
> >> place for the future, as now most of our deletes would be concentrated
> >> into a single large segment.
> >>
> >> Has anyone seen this sort of thing before, and does anyone have
> >> suggested strategies as to how to encourage IDF values into a similar
> >> range across replicas?
> >>
> >> Upayavira
> >>
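
To make the effect described in the original post concrete, here is a rough sketch of how unmerged deletes can move IDF apart on two replicas that hold the same live documents. It assumes the classic Lucene TF-IDF formula, idf = 1 + ln(N / (df + 1)), with N taken from maxDoc (which still counts deleted documents until a merge drops them) and df also still counting deleted documents. All the numbers and the names `classic_idf`, `idf_on_replica`, `replica_a` and `replica_b` are made up for illustration; they are not taken from the thread.

```python
import math


def classic_idf(doc_freq: int, doc_count: int) -> float:
    """Classic Lucene-style idf: 1 + ln(doc_count / (doc_freq + 1))."""
    return 1.0 + math.log(doc_count / (doc_freq + 1))


# Hypothetical shard: both replicas hold the same live documents.
LIVE_DOCS = 6_000_000        # live docs in the shard (assumed)
LIVE_DOCS_WITH_TERM = 50_000  # live docs containing the query term (assumed)


def idf_on_replica(deleted_docs: int, deleted_docs_with_term: int) -> float:
    """idf as seen by one replica whose segments still carry deleted docs."""
    max_doc = LIVE_DOCS + deleted_docs  # deleted docs still counted
    doc_freq = LIVE_DOCS_WITH_TERM + deleted_docs_with_term
    return classic_idf(doc_freq, max_doc)


# Replica A has merged most of its deletes away; replica B has not (assumed).
replica_a = idf_on_replica(deleted_docs=200_000, deleted_docs_with_term=1_000)
replica_b = idf_on_replica(deleted_docs=1_500_000, deleted_docs_with_term=20_000)

# What both replicas would report if only live documents were counted.
live_only = classic_idf(LIVE_DOCS_WITH_TERM, LIVE_DOCS)

print(f"replica A idf: {replica_a:.4f}")
print(f"replica B idf: {replica_b:.4f}")
print(f"live-only idf: {live_only:.4f}")
```

The two replicas disagree only because of deletes that have not yet been merged out, which is why the suggestions in the thread (basing the similarity on a live-document count, or sharing statistics via distributed IDF) aim at the statistics rather than at the merge schedule.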