I'm using Solr 7.7.1 with 12 shards and router:{"field":"route", "name":"compositeId"}, and I find that real-time get only returns results if I specify the leader core URL directly. Most of the time I see no results.
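One thing that may be worth trying (an assumption on my part, not something established in this thread): since the collection routes on a field other than the uniqueKey (router.field=route), a node receiving /get?id=... may not be able to derive the owning shard from the id alone, which would fit the symptom of only the leader core URL working. A minimal SolrJ sketch that passes _route_ so the request can be forwarded to the right shard; the ZooKeeper address, collection name, id, and route value are all hypothetical:

    import java.util.Collections;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class RouteAwareGet {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
                client.setDefaultCollection("mycollection");
                // _route_ carries the value of the router.field ("route") for
                // this document, so the /get can reach the shard that owns it.
                ModifiableSolrParams params = new ModifiableSolrParams();
                params.set("_route_", "tenantA");
                SolrDocument doc = client.getById("doc-123", params);
                System.out.println(doc == null ? "doc: null" : doc);
            }
        }
    }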
On Thu, 11 Oct 2018 at 23:41, Chris Ulicny <culicny@iq.media> wrote:

We are relatively far behind with this one. The collections that we experience the problem on are currently running on 6.3.0. If it's easy enough for you to upgrade, it might be worth a try, but I didn't see any changes to the RealTimeGet in either of the 7.4/7.5 change logs after a cursory glance.

Due to the volume and number of different processes that use it, this cluster requires more coordination to reindex and upgrade. So it's currently the last one on our plan to get upgraded to 7.X (or 8.X if timing allows).

On Thu, Oct 11, 2018 at 8:22 AM sgaron cse <sgaron....@gmail.com> wrote:

Hey Chris,

Which version of SOLR are you running? I was thinking of maybe trying another version to see if it fixes the issue.

On Thu, Oct 11, 2018 at 8:11 AM Chris Ulicny <culicny@iq.media> wrote:

We've also run into that issue of not being able to reproduce it outside of running production loads.

However, we haven't been encountering the problem in live production quite as much as we used to, and I think that might be because the /get requests are being spread out a little more evenly over the running interval, which is due to other process changes.

If I get any new information, I'll update as well.

Thanks for your help.

On Wed, Oct 10, 2018 at 10:53 AM sgaron cse <sgaron....@gmail.com> wrote:

I haven't found a way to reproduce the problem other than running our entire set of code. I've also been trying different things to make sure the problem is not from my end, and so far I haven't managed to fix it by changing my code. It has to be a race condition somewhere, but I just can't put my finger on it.

I'll message back if I find a way to reproduce.

On Wed, Oct 10, 2018 at 10:48 AM Erick Erickson <erickerick...@gmail.com> wrote:

Well, assigning a bogus version that generates a 409 error and then immediately doing an RTG on the doc doesn't fail for me either, 18 million tries later. So I'm afraid I haven't a clue where to go from here. Unless we can somehow find a way to generate this failure, I'm going to drop it for the foreseeable future.

Erick

On Tue, Oct 9, 2018 at 7:39 AM Erick Erickson <erickerick...@gmail.com> wrote:

Hmmmm. I wonder if a version conflict or perhaps other failure can somehow cause this. It shouldn't be very hard to add that to my test setup, just randomly add a _version_ field value.

Erick

On Mon, Oct 1, 2018 at 8:20 AM Erick Erickson <erickerick...@gmail.com> wrote:

Thanks. I'll be away for the rest of the week, so won't be able to try anything more....

On Mon, Oct 1, 2018 at 5:10 AM Chris Ulicny <culicny@iq.media> wrote:

In our case, we are heavily indexing in the collection while the /get requests are happening, which is what we assumed was causing this very rare behavior.
However, we have experienced the problem for a collection where the following happens in sequence, with minutes in between each step:

1. Document id=1 is indexed
2. Document successfully retrieved with /get?id=1
3. Document failed to be retrieved with /get?id=1
4. Document successfully retrieved with /get?id=1

We haven't looked at the issue in a while, so I don't have the exact timing of that sequence on hand right now. I'll try to find an actual example, although I'm relatively certain it was multiple minutes in between each of those requests. However, our autocommit (and soft commit) times are 60s for both collections.

I think the following two are probably the biggest differences for our setup, besides the version difference (v6.3.0):

> index to this collection, perhaps not at a high rate
> separate the machines running solr from the one doing any querying or indexing

The clients are on 3 hosts separate from the Solr instances. The total number of threads that are making updates and making /get requests is around 120-150, about 40-50 per host. Each of our two collections gets an average of 500 requests per second constantly for ~5 minutes, and then the number slowly tapers off to essentially 0 after ~15 minutes.

Every thread attempts to make the same series of requests:

-- Update with "_version_=-1". If successful, no other requests are made.
-- On 409 Conflict failure, it makes a /get request for the id.
-- On doc:null failure, the client handles the error and moves on.

Combining this with the previous series of /get requests, we end up with situations where an update fails as expected, but the subsequent /get request fails to retrieve the existing document:

1. Thread 1 updates id=1 successfully
2. Thread 2 tries to update id=1, fails (409)
3. Thread 2 tries to get id=1, succeeds

...Minutes later...

4. Thread 3 tries to update id=1, fails (409)
5. Thread 3 tries to get id=1, fails (doc:null)

...Minutes later...

6. Thread 4 tries to update id=1, fails (409)
7. Thread 4 tries to get id=1, succeeds

As Steven mentioned, it happens very, very rarely. We tried to recreate it in a more controlled environment, but ran into the same issue that you are, Erick: every simplified situation we ran produced no problems. Since it's not a large issue for us and happens very rarely, we stopped trying to recreate it.
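For concreteness, here is a minimal SolrJ sketch of the request series described above. The assumptions are mine, not the thread's: a single HttpSolrClient against a hypothetical core URL, and the 409 surfacing client-side as a SolrException whose code() is 409. Erick's repro variant differs only in forcing the conflict with a bogus positive _version_ instead of -1.

    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;
    import org.apache.solr.common.SolrException;
    import org.apache.solr.common.SolrInputDocument;

    public class CreateOnlyThenGet {
        public static void main(String[] args) throws Exception {
            try (HttpSolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8983/solr/mycollection").build()) {
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "1");
                // _version_=-1 means "apply only if the doc does NOT already
                // exist"; visibility of the add is left to the 60s autocommit.
                doc.addField("_version_", -1L);
                try {
                    client.add(doc);
                    // success: no further requests, per the client pattern
                } catch (SolrException e) {
                    if (e.code() == SolrException.ErrorCode.CONFLICT.code) {
                        // 409 Conflict: the doc exists, so fetch it via RTG.
                        SolrDocument existing = client.getById("1");
                        if (existing == null) {
                            // The rare failure discussed here: the 409 said
                            // the doc exists, but /get returned doc:null.
                            System.out.println("409 then doc:null for id=1");
                        }
                    } else {
                        throw e;
                    }
                }
            }
        }
    }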
On Sun, Sep 30, 2018 at 9:16 PM Erick Erickson <erickerick...@gmail.com> wrote:

57 million queries later, with constant indexing going on, 9 dummy collections in the mix, and the main collection I'm querying having 2 shards with 2 replicas each, I have no errors.

So unless the code doesn't look like it exercises any similar path, I'm not sure what more I can test. "It works on my machine" ;)

Here's my querying code; does it look like what you're seeing?

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    // Main.allStop, Main.eoeCounter, Main.start, rand, and numFormatter
    // are defined in the surrounding test harness.
    while (Main.allStop.get() == false) {
        try (SolrClient client = new HttpSolrClient.Builder()
                // ("http://my-solr-server:8981/solr/eoe_shard1_replica_n4")
                .withBaseSolrUrl("http://localhost:8981/solr/eoe").build()) {

            String lower = Integer.toString(rand.nextInt(1_000_000));
            SolrDocument rsp = client.getById(lower);
            if (rsp == null) {
                System.out.println("Got a null response!");
                Main.allStop.set(true);
                continue; // skip the checks below, rsp would NPE
            }

            rsp = client.getById(lower);

            if (rsp.get("id").equals(lower) == false) {
                System.out.println("Got an invalid response, looking for "
                        + lower + " got: " + rsp.get("id"));
                Main.allStop.set(true);
            }
            long queries = Main.eoeCounter.incrementAndGet();
            if ((queries % 100_000) == 0) {
                long seconds = (System.currentTimeMillis() - Main.start) / 1000;
                System.out.println("Query count: " + numFormatter.format(queries)
                        + ", rate is " + numFormatter.format(queries / seconds) + " QPS");
            }
        } catch (Exception cle) {
            cle.printStackTrace();
            Main.allStop.set(true);
        }
    }

On Sat, Sep 29, 2018 at 12:46 PM Erick Erickson <erickerick...@gmail.com> wrote:

Steve:

bq. Basically, one core had data in it that should belong to another core. Here's my question about this: Is it possible that two requests to the /get API coming in at the same time would get confused and either both get the same result or results get inverted?

Well, that shouldn't be happening; these are all supposed to be thread-safe calls....
All things are possible of course ;)

If two replicas of the same shard have different documents, that could account for what you're seeing, meanwhile begging the question of why that is the case, since it should never be true for a quiescent index. Technically there _are_ conditions where this is true on a very temporary basis: commits on the leader and follower can trigger at different wall-clock times. Say your soft commit (or hard-commit-with-openSearcher-true) interval is 10 seconds. It should never be the case that s1r1 and s1r2 are out of sync 10 seconds after the last update was sent. This doesn't seem likely from what you've described, though...

Hmmmm. I guess one other thing I can set up is to have a bunch of dummy collections laying around. Currently I have only the active one, and if there's some code path whereby the RTG request goes to a replica of a different collection, my test setup wouldn't reproduce it.

Currently, I'm running a 2-shard, 1-replica setup, so if there's some way that the replicas get out of sync, that wouldn't show either.

So I'm starting another run with these changes:
> opening a new connection each query
> switched so the collection I'm querying is 2x2
> added some dummy collections that are empty

One nit: while "core" is exactly correct, when we talk about a core that's part of a collection we try to use "replica", to be clear we're talking about a core with some added characteristics, i.e. we're in SolrCloud-land. No big deal of course....

Best,
Erick
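As an aside, one way to test Erick's out-of-sync hypothesis directly is to query each replica core of the same shard with distrib=false and compare what each one returns. A short sketch under assumed names (the core URLs and query are hypothetical, not from the thread):

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class ReplicaSyncCheck {
        public static void main(String[] args) throws Exception {
            // Hypothetical core URLs for the two replicas of shard1.
            String[] coreUrls = {
                "http://host1:8983/solr/eoe_shard1_replica_n1",
                "http://host2:8983/solr/eoe_shard1_replica_n2"
            };
            for (String url : coreUrls) {
                try (HttpSolrClient client = new HttpSolrClient.Builder(url).build()) {
                    SolrQuery q = new SolrQuery("id:1");
                    q.set("distrib", "false"); // query only this core, no fan-out
                    QueryResponse rsp = client.query(q);
                    System.out.println(url + " -> numFound="
                            + rsp.getResults().getNumFound());
                }
            }
        }
    }

If the two cores report different numFound values minutes after the last update, the replicas really are out of sync and the RTG behavior follows from that.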
On Sat, Sep 29, 2018 at 8:28 AM Shawn Heisey <apa...@elyograg.org> wrote:

On 9/28/2018 8:11 PM, sgaron cse wrote:
> @Shawn
> We're running two instances on one machine for two reasons:
> 1. The box has plenty of resources (48 cores / 256GB RAM), and since I was
> reading that it's not recommended to use more than 31GB of heap in SOLR, we
> figured 96GB for keeping index data in OS cache + 31GB of heap per instance
> was a good idea.

Do you know that these Solr instances actually DO need 31GB of heap, or are you following advice from somewhere saying "use one quarter of your memory as the heap size"? That advice is not in the Solr documentation, and never will be. Figuring out the right heap size requires experimentation.

https://wiki.apache.org/solr/SolrPerformanceProblems#How_much_heap_space_do_I_need.3F

How big (on disk) are each of these nine cores, and how many documents are in each one? Which of them is in each Solr instance? With that information, we can make a *guess* about how big your heap should be. Figuring out whether the guess is correct generally requires careful analysis of a GC log.

> 2. We're in a testing phase, so we wanted a SolrCloud configuration; we will
> most likely have a much bigger deployment once going to production. In prod
> right now, we currently run a six-machine Riak cluster. Riak is a key/value
> document store and has SOLR built in for search, but we are trying to push
> the key/value aspect of Riak inside SOLR. That way we would have one less
> piece to worry about in our system.

Solr is not a database. It is not intended to be a data repository. All of its optimizations (most of which are actually in Lucene) are geared towards search. While technically it can be a key-value store, that is not what it was MADE for. Software actually designed for that role is going to be much better than Solr as a key-value store.

> When I say null document, I mean the /get API returns: {doc: null}
>
> The problem is definitely not always there. We also have large periods of
> time (a few hours) where we have no problems. I'm just extremely hesitant to
> retry when I get a null document because in some cases getting a null
> document is a valid outcome. Our caching layer relies heavily on this, for
> example. If I were to retry every null, I'd pay a big penalty in
> performance.

I've just done a little test with the 7.5.0 techproducts example. It looks like returning doc:null actually is how the RTG handler says it didn't find the document. This seems very wrong to me, but I didn't design it, and that response needs SOME kind of format.

Have you done any testing to see whether the standard searching handler (typically /select, but many other URL paths are possible) returns results when RTG doesn't? Do you know for these failures whether the document has been committed or not?
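A sketch of the cross-check Shawn is suggesting, under my own assumptions (the core URL and id are hypothetical, and /select only sees committed documents, so a disagreement here is only meaningful once the 60s autocommit has passed):

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrDocument;

    public class RtgSelectCrossCheck {
        public static void main(String[] args) throws Exception {
            try (SolrClient client = new HttpSolrClient.Builder(
                    "http://localhost:8981/solr/eoe").build()) {
                String id = "1";
                SolrDocument doc = client.getById(id); // RTG via /get
                if (doc == null) {
                    // RTG says doc:null; ask /select for the same id.
                    long numFound = client.query(new SolrQuery("id:" + id))
                            .getResults().getNumFound();
                    if (numFound > 0) {
                        // /select sees a committed copy that RTG missed:
                        // exactly the discrepancy worth logging here.
                        System.out.println("RTG null but /select found " + numFound);
                    }
                }
            }
        }
    }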
> As for your last comment, part of our testing phase is also testing the
> limits. Our framework has auto-scaling built in, so if we have a burst of
> requests, the system will automatically spin up more clients. We're pushing
> 10% of our production system to that test server to see how it will handle
> it.

To spin up another replica, Solr must copy all of its index data from the leader replica. Not only can this take a long time if the index is big, but it will put a lot of extra I/O load on the machine(s) with the leader roles. So performance will actually be WORSE before it gets better when you spin up another replica, and if the index is big, that condition will persist for quite a while. Copying the index data will be constrained by the speed of your network and by the speed of your disks. Often the disks are slower than the network, but that is not always the case.

Thanks,
Shawn