The last time we tried, it took as little as 2 minutes to happen.  It
basically happens under high query load (peak user hours during the day).
When we reduce functionality by disabling most searches, it stabilizes, so
it really does seem tied to high query load.  Our ingest rate is fairly low.

It happens no matter how many nodes in the shard are up.


Joe


On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky <j...@basetechnology.com>
wrote:

> When you restart, how long does it take to hit the problem? And how much
> query or update activity is happening in that time? Is there any other
> activity showing up in the log?
>
> If you bring up only a single node in that problematic shard, do you still
> see the problem?
>
> -- Jack Krupansky
>
> -----Original Message----- From: Joe Gresock
> Sent: Saturday, May 31, 2014 9:34 AM
> To: solr-user@lucene.apache.org
> Subject: Uneven shard heap usage
>
>
> Hi folks,
>
> I'm trying to figure out why one shard of an evenly-distributed 3-shard
> cluster would suddenly start running out of heap space after 9+ months of
> stable performance.  We're using the "!" delimiter in our ids (composite-id
> routing) to distribute the documents, and indeed the disk sizes of our
> shards are very similar (31-32GB on disk per replica).
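>
> For reference, here is a minimal SolrJ sketch of indexing a document with
> a "!"-routed id.  The ZooKeeper address, collection name, and field names
> below are just placeholders, not our real setup:
>
> import org.apache.solr.client.solrj.impl.CloudSolrServer;
> import org.apache.solr.common.SolrInputDocument;
>
> public class RoutingSketch {
>   public static void main(String[] args) throws Exception {
>     // placeholder ZooKeeper address and collection name
>     CloudSolrServer server = new CloudSolrServer("zk1:2181");
>     server.setDefaultCollection("collection1");
>
>     SolrInputDocument doc = new SolrInputDocument();
>     // everything before "!" is hashed to pick the shard, so ids that
>     // share a prefix land on the same shard
>     doc.addField("id", "someprefix!doc42");
>     doc.addField("title_s", "example document");
>     server.add(doc);
>     server.commit();
>     server.shutdown();
>   }
> }
>
> With an even spread of prefixes this should give an even spread of
> documents, which matches the similar on-disk sizes we see.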
>
> Our setup is:
> * 9 VMs with 16GB RAM, 8 vcpus (a 4:1 oversubscription ratio, so basically
> 2 physical CPUs each), and 24GB of disk
> * 3 shards with 3 replicas per shard (1 leader plus 2 additional replicas).
> We reserve a 10GB heap for each Solr instance.
> * 3 ZooKeeper VMs, which are very stable
>
> Since the trouble started, we've been monitoring all 9 instances with
> jvisualvm.  Shards 2 and 3 keep a steady amount of heap in use, showing
> essentially flat lines (with some minor GC activity); they sit at 4-5GB of
> heap, and when we force a GC from jvisualvm they drop to about 1GB.  Shard
> 1's heap usage, however, climbs quickly on a steep slope, and eventually we
> see concurrent mode failures in the GC logs, forcing us to restart the
> instances once they can do nothing but GC.
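>
> Incidentally, the heap and GC numbers that jvisualvm graphs come from the
> standard JMX memory and GC beans, so they can also be polled headlessly.
> A rough sketch, assuming remote JMX is enabled on the Solr JVMs (the host
> and port here are made up):
>
> import java.lang.management.GarbageCollectorMXBean;
> import java.lang.management.ManagementFactory;
> import java.lang.management.MemoryMXBean;
> import javax.management.remote.JMXConnector;
> import javax.management.remote.JMXConnectorFactory;
> import javax.management.remote.JMXServiceURL;
>
> public class HeapPoll {
>   public static void main(String[] args) throws Exception {
>     // placeholder host/port for one Solr instance's remote JMX endpoint
>     JMXServiceURL url = new JMXServiceURL(
>         "service:jmx:rmi:///jndi/rmi://solr-host:18983/jmxrmi");
>     try (JMXConnector jmx = JMXConnectorFactory.connect(url)) {
>       MemoryMXBean mem = ManagementFactory.newPlatformMXBeanProxy(
>           jmx.getMBeanServerConnection(),
>           ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
>       System.out.printf("heap used: %d MB%n",
>           mem.getHeapMemoryUsage().getUsed() / (1024 * 1024));
>       // cumulative collection counts and times per collector (e.g. CMS)
>       for (GarbageCollectorMXBean gc :
>           ManagementFactory.getPlatformMXBeans(
>               jmx.getMBeanServerConnection(), GarbageCollectorMXBean.class)) {
>         System.out.printf("%s: %d collections, %d ms total%n",
>             gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
>       }
>     }
>   }
> }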
>
> We've tried to rule out physical host problems by moving all 3 Shard 1
> replicas to different, underutilized hosts; however, we still see the same
> problem.  We'll keep working on ruling out infrastructure issues, but I
> wanted to ask a couple of questions here in case they're relevant:
>
> * Does it make sense that all the replicas of one shard would have heap
> problems while the replicas of the other shards do not, assuming a fairly
> even data distribution?
> * One thing we changed recently was to make all of our fields stored
> instead of only half of them, in order to support atomic updates.  Can
> stored fields, even though they are lazily loaded, cause problems like
> this?  (A rough sketch of the kind of atomic update we mean is below.)
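>
> For context, by "atomic updates" we mean the standard partial updates with
> "set"/"add" modifiers.  A rough SolrJ sketch of such an update (the id,
> field name, and addresses are placeholders):
>
> import java.util.Collections;
> import org.apache.solr.client.solrj.impl.CloudSolrServer;
> import org.apache.solr.common.SolrInputDocument;
>
> public class AtomicUpdateSketch {
>   public static void main(String[] args) throws Exception {
>     CloudSolrServer server = new CloudSolrServer("zk1:2181"); // placeholder
>     server.setDefaultCollection("collection1");               // placeholder
>
>     SolrInputDocument doc = new SolrInputDocument();
>     doc.addField("id", "someprefix!doc42");
>     // "set" replaces only this field; Solr rebuilds the rest of the
>     // document from its stored fields, which is why every field needs to
>     // be stored for atomic updates to be lossless
>     doc.addField("status_s", Collections.singletonMap("set", "reviewed"));
>     server.add(doc);
>     server.commit();
>     server.shutdown();
>   }
> }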
>
> Thanks for any input,
> Joe
>
>
>
>
>
> --
> I know what it is to be in need, and I know what it is to have plenty.  I
> have learned the secret of being content in any and every situation,
> whether well fed or hungry, whether living in plenty or in want.  I can do
> all this through him who gives me strength.    *-Philippians 4:12-13*
>



-- 
I know what it is to be in need, and I know what it is to have plenty.  I
have learned the secret of being content in any and every situation,
whether well fed or hungry, whether living in plenty or in want.  I can do
all this through him who gives me strength.    *-Philippians 4:12-13*
