I'm using Solr 7.7.1 with 12 shards and router:{"field":"route", "name":"compositeId"}, and I find that real-time get only returns results if I specify the leader core URL directly. Most of the time I see no results.
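One thing that may be worth trying (an assumption on my part, not something established in this thread): since the collection routes on a field other than the uniqueKey (router.field=route), a node receiving /get?id=... may not be able to derive the owning shard from the id alone, which would fit the symptom of only the leader core URL working. A minimal SolrJ sketch that passes _route_ so the request can be forwarded to the right shard; the ZooKeeper address, collection name, id, and route value are all hypothetical:

    import java.util.Collections;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class RouteAwareGet {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
                client.setDefaultCollection("mycollection");
                // _route_ carries the value of the router.field ("route") for
                // this document, so the /get can reach the shard that owns it.
                ModifiableSolrParams params = new ModifiableSolrParams();
                params.set("_route_", "tenantA");
                SolrDocument doc = client.getById("doc-123", params);
                System.out.println(doc == null ? "doc: null" : doc);
            }
        }
    }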
On Thu, 11 Oct 2018 at 23:41, Chris Ulicny <culicny@iq.media> wrote:

We are relatively far behind with this one. The collections that we experience the problem on are currently running on 6.3.0. If it's easy enough for you to upgrade, it might be worth a try, but I didn't see any changes to the RealTimeGet in either of the 7.4/7.5 change logs after a cursory glance.

Due to the volume and number of different processes that use it, this cluster requires more coordination to reindex and upgrade. So it's currently the last one on our plan to get upgraded to 7.X (or 8.X if timing allows).

On Thu, Oct 11, 2018 at 8:22 AM sgaron cse <sgaron....@gmail.com> wrote:

Hey Chris,

Which version of SOLR are you running? I was thinking of maybe trying another version to see if it fixes the issue.

On Thu, Oct 11, 2018 at 8:11 AM Chris Ulicny <culicny@iq.media> wrote:

We've also run into that issue of not being able to reproduce it outside of running production loads.

However, we haven't been encountering the problem in live production quite as much as we used to, and I think that might be because the /get requests are being spread out a little more evenly over the running interval, which is due to other process changes.

If I get any new information, I'll update as well.

Thanks for your help.

On Wed, Oct 10, 2018 at 10:53 AM sgaron cse <sgaron....@gmail.com> wrote:

I haven't found a way to reproduce the problem other than running our entire set of code. I've also been trying different things to make sure the problem is not from my end, and so far I haven't managed to fix it by changing my code. It has to be a race condition somewhere, but I just can't put my finger on it.

I'll message back if I find a way to reproduce.

On Wed, Oct 10, 2018 at 10:48 AM Erick Erickson <erickerick...@gmail.com> wrote:

Well, assigning a bogus version that generates a 409 error and then immediately doing an RTG on the doc doesn't fail for me either, 18 million tries later. So I'm afraid I haven't a clue where to go from here. Unless we can somehow find a way to generate this failure, I'm going to drop it for the foreseeable future.

Erick

On Tue, Oct 9, 2018 at 7:39 AM Erick Erickson <erickerick...@gmail.com> wrote:

Hmmmm. I wonder if a version conflict or perhaps other failure can somehow cause this. It shouldn't be very hard to add that to my test setup, just randomly add a _version_ field value.

Erick

On Mon, Oct 1, 2018 at 8:20 AM Erick Erickson <erickerick...@gmail.com> wrote:

Thanks. I'll be away for the rest of the week, so won't be able to try anything more....

On Mon, Oct 1, 2018 at 5:10 AM Chris Ulicny <culicny@iq.media> wrote:

In our case, we are heavily indexing in the collection while the /get requests are happening, which is what we assumed was causing this very rare behavior.
However, we have experienced the problem for a collection where the following happens in sequence, with minutes in between each step:

1. Document id=1 is indexed
2. Document successfully retrieved with /get?id=1
3. Document failed to be retrieved with /get?id=1
4. Document successfully retrieved with /get?id=1

We haven't looked at the issue in a while, so I don't have the exact timing of that sequence on hand right now. I'll try to find an actual example, although I'm relatively certain it was multiple minutes in between each of those requests. However, our autocommit (and soft commit) times are 60s for both collections.

I think the following two are probably the biggest differences for our setup, besides the version difference (v6.3.0):

> index to this collection, perhaps not at a high rate
> separate the machines running solr from the one doing any querying or indexing

The clients are on 3 hosts separate from the Solr instances. The total number of threads that are making updates and making /get requests is around 120-150, about 40-50 per host. Each of our two collections gets an average of 500 requests per second constantly for ~5 minutes, and then the number slowly tapers off to essentially 0 after ~15 minutes.

Every thread attempts to make the same series of requests:

-- Update with "_version_=-1". If successful, no other requests are made.
-- On 409 Conflict failure, it makes a /get request for the id.
-- On doc:null failure, the client handles the error and moves on.

Combining this with the previous series of /get requests, we end up with situations where an update fails as expected, but the subsequent /get request fails to retrieve the existing document:

1. Thread 1 updates id=1 successfully
2. Thread 2 tries to update id=1, fails (409)
3. Thread 2 tries to get id=1, succeeds

...Minutes later...

4. Thread 3 tries to update id=1, fails (409)
5. Thread 3 tries to get id=1, fails (doc:null)

...Minutes later...

6. Thread 4 tries to update id=1, fails (409)
7. Thread 4 tries to get id=1, succeeds

As Steven mentioned, it happens very, very rarely. We tried to recreate it in a more controlled environment, but ran into the same issue that you are, Erick: every simplified situation we ran produced no problems. Since it's not a large issue for us and happens very rarely, we stopped trying to recreate it.
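For concreteness, here is a minimal SolrJ sketch of the request series described above. The assumptions are mine, not the thread's: a single HttpSolrClient against a hypothetical core URL, and the 409 surfacing client-side as a SolrException whose code() is 409. Erick's repro variant differs only in forcing the conflict with a bogus positive _version_ instead of -1.

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrException;
    import org.apache.solr.common.SolrInputDocument;

    public class CreateOnlyThenGet {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "1");
                // _version_=-1 means "apply only if the doc does NOT already
                // exist"; visibility of the add is left to the 60s autocommit.
                doc.addField("_version_", -1L);
                try {
                    client.add(doc);
                    // success: no further requests, per the client pattern
                } catch (SolrException e) {
                    if (e.code() == SolrException.ErrorCode.CONFLICT.code) {
                        // 409 Conflict: the doc exists, so fetch it via RTG.
                        SolrDocument existing = client.getById("1");
                        if (existing == null) {
                            // The rare failure discussed here: the 409 said
                            // the doc exists, but /get returned doc:null.
                            System.out.println("409 then doc:null for id=1");
                        }
                    } else {
                        throw e;
                    }
                }
            }
        }
    }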
On Sun, Sep 30, 2018 at 9:16 PM Erick Erickson <erickerick...@gmail.com> wrote:

57 million queries later, with constant indexing going on, 9 dummy collections in the mix, and the main collection I'm querying having 2 shards with 2 replicas each, I have no errors.

So unless the code doesn't look like it exercises any similar path, I'm not sure what more I can test. "It works on my machine" ;)

Here's my querying code; does it look like what you're seeing?

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    // Main.allStop, Main.eoeCounter, Main.start, rand, and numFormatter
    // are defined in the surrounding test harness.
    while (Main.allStop.get() == false) {
        try (SolrClient client = new HttpSolrClient.Builder()
                // ("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")
                .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {

            String lower = Integer.toString(rand.nextInt(1_000_000));
            SolrDocument rsp = client.getById(lower);
            if (rsp == null) {
                System.out.println("Got a null response!");
                Main.allStop.set(true);
                continue; // skip the checks below, rsp would NPE
            }

            rsp = client.getById(lower);

            if (rsp.get("id").equals(lower) == false) {
                System.out.println("Got an invalid response, looking for "
                        + lower + " got: " + rsp.get("id"));
                Main.allStop.set(true);
            }
            long queries = Main.eoeCounter.incrementAndGet();
            if ((queries % 100_000) == 0) {
                long seconds = (System.currentTimeMillis() - Main.start) / 1000;
                System.out.println("Query count: " + numFormatter.format(queries)
                        + ", rate is " + numFormatter.format(queries / seconds) + " QPS");
            }
        } catch (Exception cle) {
            cle.printStackTrace();
            Main.allStop.set(true);
        }
    }

On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson <erickerick...@gmail.com> wrote:

Steve:

bq. Basically, one core had data in it that should belong to another core. Here's my question about this: Is it possible that two requests to the /get API coming in at the same time would get confused and either both get the same result or results get inverted?

Well, that shouldn't be happening; these are all supposed to be thread-safe calls....
All things are possible of course ;)

If two replicas of the same shard have different documents, that could account for what you're seeing, meanwhile begging the question of why that is the case, since it should never be true for a quiescent index. Technically there _are_ conditions where this is true on a very temporary basis: commits on the leader and follower can trigger at different wall-clock times. Say your soft commit (or hard-commit-with-openSearcher-true) interval is 10 seconds. It should never be the case that s1r1 and s1r2 are out of sync 10 seconds after the last update was sent. This doesn't seem likely from what you've described, though...

Hmmmm. I guess one other thing I can set up is to have a bunch of dummy collections laying around. Currently I have only the active one, and if there's some code path whereby the RTG request goes to a replica of a different collection, my test setup wouldn't reproduce it.

Currently, I'm running a 2-shard, 1-replica setup, so if there's some way that the replicas get out of sync, that wouldn't show either.

So I'm starting another run with these changes:
> opening a new connection each query
> switched so the collection I'm querying is 2x2
> added some dummy collections that are empty

One nit: while "core" is exactly correct, when we talk about a core that's part of a collection we try to use "replica", to be clear we're talking about a core with some added characteristics, i.e. we're in SolrCloud-land. No big deal of course....

Best,
Erick
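As an aside, one way to test Erick's out-of-sync hypothesis directly is to query each replica core of the same shard with distrib=false and compare what each one returns. A short sketch under assumed names (the core URLs and query are hypothetical, not from the thread):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ReplicaSyncCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical core URLs for the two replicas of shard1.
            String[] coreUrls = {
                "http://host1:8983/solr/eoe_shard1_replica_n1",
                "http://host2:8983/solr/eoe_shard1_replica_n2"
            };
            for (String url : coreUrls) {
                try (HttpSolrClient client = new HttpSolrClient.Builder(url).build()) {
                    SolrQuery q = new SolrQuery("id:1");
                    q.set("distrib", "false"); // query only this core, no fan-out
                    QueryResponse rsp = client.query(q);
                    System.out.println(url + " -> numFound="
                            + rsp.getResults().getNumFound());
                }
            }
        }
    }

If the two cores report different numFound values minutes after the last update, the replicas really are out of sync and the RTG behavior follows from that.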
On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <apa...@elyograg.org> wrote:

On 9/28/2018 8:11 PM, sgaron cse wrote:
> @Shawn
> We're running two instances on one machine for two reasons:
> 1. The box has plenty of resources (48 cores / 256GB RAM), and since I was
> reading that it's not recommended to use more than 31GB of heap in SOLR, we
> figured 96GB for keeping index data in OS cache + 31GB of heap per instance
> was a good idea.

Do you know that these Solr instances actually DO need 31GB of heap, or are you following advice from somewhere saying "use one quarter of your memory as the heap size"? That advice is not in the Solr documentation, and never will be. Figuring out the right heap size requires experimentation.

https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F

How big (on disk) are each of these nine cores, and how many documents are in each one? Which of them is in each Solr instance? With that information, we can make a *guess* about how big your heap should be. Figuring out whether the guess is correct generally requires careful analysis of a GC log.

> 2. We're in a testing phase, so we wanted a SolrCloud configuration; we will
> most likely have a much bigger deployment once going to production. In prod
> right now, we currently run a six-machine Riak cluster. Riak is a key/value
> document store and has SOLR built in for search, but we are trying to push
> the key/value aspect of Riak inside SOLR. That way we would have one less
> piece to worry about in our system.

Solr is not a database. It is not intended to be a data repository. All of its optimizations (most of which are actually in Lucene) are geared towards search. While technically it can be a key-value store, that is not what it was MADE for. Software actually designed for that role is going to be much better than Solr as a key-value store.

> When I say null document, I mean the /get API returns: {doc: null}
>
> The problem is definitely not always there. We also have large periods of
> time (a few hours) where we have no problems. I'm just extremely hesitant to
> retry when I get a null document because in some cases getting a null
> document is a valid outcome. Our caching layer relies heavily on this, for
> example. If I were to retry every null, I'd pay a big penalty in
> performance.

I've just done a little test with the 7.5.0 techproducts example. It looks like returning doc:null actually is how the RTG handler says it didn't find the document. This seems very wrong to me, but I didn't design it, and that response needs SOME kind of format.

Have you done any testing to see whether the standard searching handler (typically /select, but many other URL paths are possible) returns results when RTG doesn't? Do you know for these failures whether the document has been committed or not?
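A sketch of the cross-check Shawn is suggesting, under my own assumptions (the core URL and id are hypothetical, and /select only sees committed documents, so a disagreement here is only meaningful once the 60s autocommit has passed):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class RtgSelectCrossCheck {
        public static void main(String[] args) throws Exception {
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8981/solr/eoe").build()) {
                String id = "1";
                SolrDocument doc = client.getById(id); // RTG via /get
                if (doc == null) {
                    // RTG says doc:null; ask /select for the same id.
                    long numFound = client.query(new SolrQuery("id:" + id))
                            .getResults().getNumFound();
                    if (numFound > 0) {
                        // /select sees a committed copy that RTG missed:
                        // exactly the discrepancy worth logging here.
                        System.out.println("RTG null but /select found " + numFound);
                    }
                }
            }
        }
    }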
> As for your last comment, part of our testing phase is also testing the
> limits. Our framework has auto-scaling built in, so if we have a burst of
> requests, the system will automatically spin up more clients. We're pushing
> 10% of our production system to that test server to see how it will handle
> it.

To spin up another replica, Solr must copy all of its index data from the leader replica. Not only can this take a long time if the index is big, but it will put a lot of extra I/O load on the machine(s) with the leader roles. So performance will actually be WORSE before it gets better when you spin up another replica, and if the index is big, that condition will persist for quite a while. Copying the index data will be constrained by the speed of your network and by the speed of your disks. Often the disks are slower than the network, but that is not always the case.

Thanks,
Shawn