Thank you Erick.
Our Solr version is 4.10 and we are not doing any sorting or faceting.

I am trying to find ways to investigate this problem, so I have a few more
questions about the steps normally taken in such situations.
(I searched for some of these on the Internet but could not find anything
good.)
Any pointers provided here will help us resolve this a little more quickly.


1) Is there a conclusive way to detect memory leaks?
  How does Solr ensure with each release that there are no memory leaks?
  With a 24 GB heap (-Xmx parameter), I now sometimes see GC pauses of about
  1 second.
  It looks like we will need to scale it down.
  Total VM memory is 92 GB and Solr is the only process running on it.


2) How can I tell whether the ZooKeeper connectivity to Solr is poor?
  What commands/steps are normally used to resolve this?
  Does Solr have metrics that expose ZooKeeper interaction statistics?


3) In a span of 9 hours, I see:
  4 times: java.net.SocketException: Connection reset
  32 times: java.net.SocketTimeoutException: Read timed out

And several other exceptions that ultimately bring a whole shard down
(leader is recovery-failed and replica is down).
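
The counts in question 3 can be gathered with a small script along these
lines. This is only a sketch: the regular expression and the sample lines are
my own assumptions, not a log format guaranteed by Solr or Tomcat.

```python
import re
from collections import Counter

# Matches fully qualified Java exception/error class names such as
# java.net.SocketTimeoutException. The pattern is an assumption for
# illustration, not a Solr or Tomcat log specification.
EXCEPTION_RE = re.compile(
    r"\b((?:[a-z_][\w$]*\.)+[A-Z][\w$]*(?:Exception|Error))\b"
)

def count_exceptions(lines):
    """Map each exception class name to the number of lines mentioning it."""
    counts = Counter()
    for line in lines:
        # A set ensures a name repeated within one line is counted once.
        for name in set(EXCEPTION_RE.findall(line)):
            counts[name] += 1
    return counts

if __name__ == "__main__":
    # Hypothetical sample lines; in practice, pass an open log file instead.
    sample = [
        "ERROR java.net.SocketException: Connection reset",
        "ERROR java.net.SocketTimeoutException: Read timed out",
        "ERROR java.net.SocketTimeoutException: Read timed out",
    ]
    for name, n in count_exceptions(sample).most_common():
        print(n, name)
```

Passing an open file object instead of the sample list works the same way,
since the function only iterates over lines.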

I understand that the above information might not be sufficient to get the
full picture.
But in case someone has resolved or debugged these issues before,
please share your experience.
It would be of great help to me.
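
For question 1, one thing I can do on my side is summarize the pauses
recorded in gc.log. A rough sketch, assuming the JVM is started with
-XX:+PrintGCApplicationStoppedTime so that the log contains "Total time for
which application threads were stopped" lines; the function name and the
1-second threshold are only illustrative.

```python
import re

# Line written by HotSpot when -XX:+PrintGCApplicationStoppedTime is enabled.
# Assumption: the flag is on; without it these lines are absent from gc.log.
STOPPED_RE = re.compile(
    r"Total time for which application threads were stopped: ([0-9.]+) seconds"
)

def summarize_pauses(lines, threshold_s=1.0):
    """Return (count, total_seconds, max_seconds, count_at_or_over_threshold)."""
    pauses = [float(m.group(1)) for m in map(STOPPED_RE.search, lines) if m]
    if not pauses:
        return 0, 0.0, 0.0, 0
    over = sum(1 for p in pauses if p >= threshold_s)
    return len(pauses), sum(pauses), max(pauses), over

if __name__ == "__main__":
    # Hypothetical sample lines standing in for a real gc.log.
    sample = [
        "Total time for which application threads were stopped: 0.5 seconds",
        "Total time for which application threads were stopped: 1.25 seconds",
    ]
    count, total, longest, over = summarize_pauses(sample)
    print(count, total, longest, over)
```

Comparing the longest pause against the ZooKeeper session timeout configured
for Solr (zkClientTimeout) should show whether GC alone could explain nodes
being marked as down.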

Thanks,
SG
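
P.S. For question 2, the only direct check I know of is ZooKeeper's
four-letter commands ("ruok", "stat", "mntr") sent to its client port. Below
is a minimal sketch of how I would script that. The helper names are made up,
the host and port 2181 in the comment are assumptions about a default setup,
and newer ZooKeeper releases may require these commands to be whitelisted.

```python
import socket

def four_letter_word(host, port, cmd, timeout_s=5.0):
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout_s) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        # ZooKeeper closes the connection after replying, so read until EOF.
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")

def parse_mntr(text):
    """Parse 'mntr' output (tab-separated key/value lines) into a dict."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition("\t")
        if sep:
            stats[key.strip()] = value.strip()
    return stats

# Example (assumes a ZooKeeper node reachable at localhost:2181):
#   print(four_letter_word("localhost", 2181, "ruok"))
#   print(parse_mntr(four_letter_word("localhost", 2181, "mntr")))
```

A healthy node answers "ruok" with "imok", and the "mntr" output includes
latency and outstanding-request figures that could be tracked over time.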





On Sun, Dec 4, 2016 at 8:59 AM, Erick Erickson <erickerick...@gmail.com>
wrote:

> All of this is consistent with not having a properly
> tuned Solr instance wrt # documents, usage
> pattern, memory allocated to the JVM, GC
> settings and the like.
>
> Your leader issues can be explained by long
> GC pauses too. Zookeeper periodically pings
> each replica it knows about and if the response
> times out (due to GC in this case) then Zookeeper
> thinks the node has gone away and marks
> it as "down". Similarly when a leader forwards
> an update to a follower and the request times
> out, the leader will mark the follower as down.
> Do this enough and the state of the cluster gets
> "interesting".
>
> You still haven't told us what version of Solr
> you're using, the "Version" you took from
> the core stats is the version of the _index_,
> not Solr.
>
> You have almost 200M documents on
> a single core. That's definitely on the high side,
> although I've seen that work. Assuming
> you aren't doing things like faceting and
> sorting and the like on non docValues fields.
>
> As others have pointed out, the link you
> provided doesn't provide much in the way of
> any "smoking guns" as far as a memory
> leak is concerned.
>
> I've certainly seen situations where memory
> required by Solr is close to the total memory
> allocated to the JVM for instance. Then the GC
> cycle kicks in and recovers just enough to
> go on for a very brief time before going into another
> GC cycle resulting in very poor performance.
>
> So overall this looks like you need to do some
> serious tuning of your Solr instances, take a
> hard look at how you're using your physical
> machines. You specify that these are VMs,
> but how many VMs are you running per box?
> How much JVM have you allocated for each?
> How much total physical memory do you have
> to work with per box?
>
> Even if you provide the answers to the above
> questions, there's not much we can do to
> help you resolve your issues assuming it's
> simply inappropriate sizing. I'd really recommend
> you create a stress environment so you can
> test different scenarios to become confident about
> your expected performance, here's a blog on the
> subject:
>
> https://lucidworks.com/blog/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
>
> Best,
> Erick
>
> On Sat, Dec 3, 2016 at 8:46 PM, S G <sg.online.em...@gmail.com> wrote:
> > The symptom we see is that the java clients querying Solr see response
> > times in 10s of seconds (not milliseconds).
> > And on the tomcat's gc.log file (where Solr is running), we see very bad GC
> > pauses - threads being paused for 0.5 seconds per second approximately.
> >
> > Some numbers for the Solr Cloud:
> >
> > *Overall infrastructure:*
> > - Only one collection
> > - 16 VMs used
> > - 8 shards (1 leader and 1 replica per shard - each core on separate VM)
> >
> > *Overview from one core:*
> > - Num Docs:193,623,388
> > - Max Doc:230,577,696
> > - Heap Memory Usage:231,217,880
> > - Deleted Docs:36,954,308
> > - Version:2,357,757
> > - Segment Count:37
> >
> > *Stats from QueryHandler/select*
> > - requests:78,557
> > - errors:358
> > - timeouts:0
> > - totalTime:1,639,975.27
> > - avgRequestsPerSecond:2.62
> > - 5minRateReqsPerSecond:1.39
> > - 15minRateReqsPerSecond:1.64
> > - avgTimePerRequest:20.87
> > - medianRequestTime:0.70
> > - 75thPcRequestTime:1.11
> > - 95thPcRequestTime:191.76
> >
> > *Stats from QueryHandler/update*
> > - requests:33,555
> > - errors:0
> > - timeouts:0
> > - totalTime:227,870.58
> > - avgRequestsPerSecond:1.12
> > - 5minRateReqsPerSecond:1.16
> > - 15minRateReqsPerSecond:1.23
> > - avgTimePerRequest:6.79
> > - medianRequestTime:3.16
> > - 75thPcRequestTime:5.27
> > - 95thPcRequestTime:9.33
> >
> > And yet the Solr clients are reporting timeouts and very long read times.
> >
> > Plus, on every server, we are seeing lots of exceptions.
> > For example:
> >
> > Between 8:06:55 PM and 8:21:36 PM, exceptions are:
> >
> > 1) Request says it is coming from leader, but we are the leader:
> > update.distrib=FROMLEADER&distrib.from=HOSTB_ca_1_1456430020/&wt=javabin&version=2
> >
> > 2) org.apache.solr.common.SolrException: Request says it is coming from
> > leader, but we are the leader
> >
> > 3) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 4) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 5) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 6) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 7) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> > available to handle this request. Zombie server list:
> > [HOSTA_ca_1_1456429897]
> >
> > 8) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: No live SolrServers
> > available to handle this request. Zombie server list:
> > [HOSTA_ca_1_1456429897]
> >
> > 9) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 10) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 11) org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > 12) null:org.apache.solr.common.SolrException:
> > org.apache.solr.client.solrj.SolrServerException: Tried one server for read
> > operation and it timed out, so failing fast
> >
> > Why are we seeing so many timeouts then, and why such huge response times
> > on the client?
> >
> > Thanks
> > SG
> >
> >
> >
> > On Sat, Dec 3, 2016 at 4:19 PM, <billnb...@gmail.com> wrote:
> >
> >> What tool is that? I would like to run those stats on my Solr instance.
> >>
> >> Bill Bell
> >> Sent from mobile
> >>
> >>
> >> > On Dec 2, 2016, at 4:49 PM, Shawn Heisey <apa...@elyograg.org> wrote:
> >> >
> >> >> On 12/2/2016 12:01 PM, S G wrote:
> >> >> This post shows some stats on Solr which indicate that there might be a
> >> >> memory leak in there.
> >> >>
> >> >> http://stackoverflow.com/questions/40939166/is-this-a-memory-leak-in-solr
> >> >>
> >> >> Can someone please help to debug this?
> >> >> It might be a very good step in making Solr stable if we can fix this.
> >> >
> >> > +1 to what Walter said.
> >> >
> >> > I replied earlier on the stackoverflow question.
> >> >
> >> > FYI -- your 95th percentile request time of about 16 milliseconds is NOT
> >> > something that I would characterize as "very high."  I would *love* to
> >> > have statistics that good.
> >> >
> >> > Even your 99th percentile request time is not much more than a full
> >> > second.  If a search takes a couple of seconds, most users will not
> >> > really care, and some might not even notice.  It's when a large
> >> > percentage of queries start taking several seconds that complaints start
> >> > coming in.  On your system, 99 percent of your queries are completing in
> >> > 1.3 seconds or less, and 95 percent of them are less than 17
> >> > milliseconds.  That sounds quite good to me.
> >> >
> >> > In my experience, the time it takes for the browser to receive the
> >> > search result page and render it is a significant part of the total time
> >> > to see results, and often dwarfs the time spent getting info from Solr.
> >> >
> >> > Here's some numbers from Solr in my organization:
> >> >
> >> > requests:               4102054
> >> > errors:                 364894
> >> > timeouts:               49
> >> > totalTime:              799446287.45041
> >> > avgRequestsPerSecond:   1.2375565828793849
> >> > 5minRateReqsPerSecond:  0.8444329508327961
> >> > 15minRateReqsPerSecond: 0.8631197328073346
> >> > avgTimePerRequest:      194.88926460997587
> >> > medianRequestTime:      20.8566605
> >> > 75thPcRequestTime:      85.51328849999999
> >> > 95thPcRequestTime:      2202.277466549999
> >> > 99thPcRequestTime:      5280.375381280002
> >> > 999thPcRequestTime:     6866.020122961001
> >> >
> >> > The numbers above come from a distributed index that contains 167
> >> > million documents and takes up about 200GB of disk space across two
> >> > machines.
> >> >
> >> > requests:               192683
> >> > errors:                 124
> >> > timeouts:               0
> >> > totalTime:              199380421.985073
> >> > avgRequestsPerSecond:   0.042222722771354554
> >> > 5minRateReqsPerSecond:  0.00800545427600684
> >> > 15minRateReqsPerSecond: 0.017521222412364163
> >> > avgTimePerRequest:      1034.7587591280653
> >> > medianRequestTime:      541.591858
> >> > 75thPcRequestTime:      1683.83246125
> >> > 95thPcRequestTime:      5644.542019949997
> >> > 99thPcRequestTime:      9445.592394760004
> >> > 999thPcRequestTime:     14602.166640771007
> >> >
> >> > These numbers are from an index with about 394 million documents, taking
> >> > up nearly 500GB of disk space.  This index is also distributed on
> >> > multiple machines.
> >> >
> >> > Are you experiencing any problems other than what you perceive as slow
> >> > queries?  I asked some other questions on stackoverflow.  In particular,
> >> > I'd like to know the total memory on the server, the total number of
> >> > documents (maxDoc and numDoc) you're handling with this server, as well
> >> > as the total index size.  What do your queries look like?  What version
> >> > and vendor of Java are you using?  Can you share your config/schema?
> >> >
> >> > A memory leak is very unlikely, unless your Java or your operating
> >> > system is broken.  I can't say for sure that it's not happening, but
> >> > it's just not something we see around here.
> >> >
> >> > Here's what I have collected on performance issues in Solr.  This page
> >> > does mostly concern itself with memory, though it touches briefly on
> >> > other topics:
> >> >
> >> > https://wiki.apache.org/solr/SolrPerformanceProblems
> >> >
> >> > Thanks,
> >> > Shawn
> >> >
> >>
>
