Hi Joe,

Are you sure all 3 shards are roughly the same size, and how do you know?
Can you share what you run/see that shows you that?
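For example, here's a rough sketch of the kind of check I mean (the node
URLs are placeholders, and the field names are the ones the CoreAdmin
STATUS call normally returns -- adjust for your version). It prints doc
counts and index size for every core on every node, so you can compare the
three shards side by side:

import requests  # pip install requests; node URLs below are placeholders

NODES = [
    "http://solr-node-1:8983/solr",
    "http://solr-node-2:8983/solr",
    "http://solr-node-3:8983/solr",
    # ...one entry per Solr node in the cluster
]

for node in NODES:
    # CoreAdmin STATUS lists every core hosted on the node along with its
    # index stats (numDocs, maxDoc, size on disk).
    cores = requests.get(
        node + "/admin/cores", params={"action": "STATUS", "wt": "json"}
    ).json()["status"]
    for name, info in cores.items():
        idx = info.get("index", {})
        print(f"{node} {name}: {idx.get('numDocs')} docs, {idx.get('size')}")

If the doc counts and sizes come out roughly even across the shards, the
data distribution itself probably isn't the culprit.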
Are you sure queries are evenly distributed? Something like SPM
<http://sematext.com/spm/> should give you insight into that. How big are
your caches?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Sat, May 31, 2014 at 5:54 PM, Joe Gresock <jgres...@gmail.com> wrote:

> Interesting thought about the routing. Our document ids are in 3 parts:
>
> <10-digit identifier>!<epoch timestamp>!<format>
>
> e.g., 5/12345678!130000025603!TEXT
>
> Each object has an identifier, and there may be multiple versions of the
> object, hence the timestamp. We like to be able to pull back all of the
> versions of an object at once, hence the routing scheme.
>
> The nature of the identifier is that a great many of them begin with a
> certain number. I'd be interested to know more about the hashing scheme
> used for the document routing. Perhaps the first character gives it more
> weight as to which shard it lands in?
>
> It seems strange that certain of the most highly-searched documents would
> happen to fall on this shard, but you may be onto something. We'll scrape
> through some non-distributed queries and see what we can find.
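Regarding the hashing question: with the compositeId router, the whole
prefix before the "!" is run through a MurmurHash3-style hash to pick the
shard, so the first character doesn't get any special weight -- but every
version of a given identifier does land on the same shard. Here's a rough
sketch for sanity-checking how a sample of identifiers spreads across three
equal slices of the 32-bit hash space (illustrative only: it uses the mmh3
package, which may not bit-match Solr's exact hash, plus made-up sample
prefixes; the authoritative shard ranges live in your clusterstate):

import mmh3  # pip install mmh3; approximates, but may not bit-match, Solr's hash

NUM_SHARDS = 3

def bucket_for_prefix(prefix: str) -> int:
    """Bucket a route prefix (the part before '!') into one of NUM_SHARDS
    equal slices of the unsigned 32-bit hash space -- illustrative only."""
    h = mmh3.hash(prefix) & 0xFFFFFFFF  # force unsigned 32-bit
    return h * NUM_SHARDS // 2**32

# Hypothetical 10-digit identifiers that all start with the same digit --
# a murmur-style hash still scatters them across the buckets.
sample_prefixes = [f"50000{n:05d}" for n in range(10000)]

counts = [0] * NUM_SHARDS
for p in sample_prefixes:
    counts[bucket_for_prefix(p)] += 1
print(counts)  # expect roughly even buckets despite the shared leading digit

If your real identifiers bucket evenly too, the skew is more likely in which
documents get queried than in how many land on each shard.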
> On Sat, May 31, 2014 at 1:47 PM, Erick Erickson <erickerick...@gmail.com>
> wrote:
>
> > This is very weird.
> >
> > Are you sure that all the Java versions are identical? And all the JVM
> > parameters are the same? Grasping at straws here.
> >
> > More grasping at straws: I'm a little suspicious that you are using
> > routing. You say that the indexes are about the same size, but is it
> > possible that your routing is somehow loading the problem shard
> > abnormally? By that I mean somehow the documents on that shard are
> > different, or have a drastically higher number of hits than the other
> > shards?
> >
> > You can fire queries at shards with &distrib=false and NOT have them go
> > to other shards; perhaps if you can isolate the problem queries, that
> > might shed some light on the problem.
> >
> > Best
> > er...@baffled.com
> >
> >
> > On Sat, May 31, 2014 at 8:33 AM, Joe Gresock <jgres...@gmail.com> wrote:
> >
> > > It has taken as little as 2 minutes to happen the last time we tried.
> > > It basically happens upon high query load (peak user hours during the
> > > day). When we reduce functionality by disabling most searches, it
> > > stabilizes. So it really is only on high query load. Our ingest rate
> > > is fairly low.
> > >
> > > It happens no matter how many nodes in the shard are up.
> > >
> > > Joe
> > >
> > >
> > > On Sat, May 31, 2014 at 11:04 AM, Jack Krupansky
> > > <j...@basetechnology.com> wrote:
> > >
> > > > When you restart, how long does it take to hit the problem? And how
> > > > much query or update activity is happening in that time? Is there
> > > > any other activity showing up in the log?
> > > >
> > > > If you bring up only a single node in that problematic shard, do
> > > > you still see the problem?
> > > >
> > > > -- Jack Krupansky
> > > >
> > > > -----Original Message----- From: Joe Gresock
> > > > Sent: Saturday, May 31, 2014 9:34 AM
> > > > To: solr-user@lucene.apache.org
> > > > Subject: Uneven shard heap usage
> > > >
> > > > Hi folks,
> > > >
> > > > I'm trying to figure out why one shard of an evenly-distributed
> > > > 3-shard cluster would suddenly start running out of heap space,
> > > > after 9+ months of stable performance. We're using the "!"
> > > > delimiter in our ids to distribute the documents, and indeed the
> > > > disk sizes of our shards are very similar (31-32GB on disk per
> > > > replica).
> > > >
> > > > Our setup is:
> > > > 9 VMs with 16GB RAM, 8 vcpus (with a 4:1 oversubscription ratio, so
> > > > basically 2 physical CPUs), 24GB disk
> > > > 3 shards, 3 replicas per shard (1 leader, 2 replicas, whatever). We
> > > > reserve 10g heap for each solr instance.
> > > > Also 3 zookeeper VMs, which are very stable
> > > >
> > > > Since the troubles started, we've been monitoring all 9 with
> > > > jvisualvm, and shards 2 and 3 keep a steady amount of heap space
> > > > reserved, always showing horizontal lines (with some minor gc).
> > > > They're using 4-5GB heap, and when we force gc using jvisualvm,
> > > > they drop to 1GB usage. Shard 1, however, quickly shows a steep
> > > > slope, and eventually has concurrent mode failures in the gc logs,
> > > > requiring us to restart the instances when they can no longer do
> > > > anything but gc.
> > > >
> > > > We've tried ruling out physical host problems by moving all 3
> > > > Shard 1 replicas to different hosts that are underutilized, but we
> > > > still get the same problem. We'll keep working on ruling out
> > > > infrastructure issues, but I wanted to ask the questions here in
> > > > case they make sense:
> > > >
> > > > * Does it make sense that all the replicas on one shard of a
> > > > cluster would have heap problems, when the other shard replicas do
> > > > not, assuming a fairly even data distribution?
> > > > * One thing we changed recently was to make all of our fields
> > > > stored, instead of only half of them. This was to support atomic
> > > > updates. Can stored fields, even though lazily loaded, cause
> > > > problems like this?
> > > >
> > > > Thanks for any input,
> > > > Joe
> > > >
> > > > --
> > > > I know what it is to be in need, and I know what it is to have
> > > > plenty. I have learned the secret of being content in any and
> > > > every situation, whether well fed or hungry, whether living in
> > > > plenty or in want. I can do all this through him who gives me
> > > > strength. *-Philippians 4:12-13*
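P.S. When you scrape through non-distributed queries, something like this
might help -- a rough sketch (core URLs and the query are placeholders;
point each entry at one replica of a shard) that fires the same query at
each shard with distrib=false and prints hits and QTime side by side:

import requests  # core URLs and the query below are placeholders

# One replica per shard -- hitting a core directly with distrib=false keeps
# the search on that shard only.
SHARD_CORES = {
    "shard1": "http://solr-node-1:8983/solr/collection1_shard1_replica1",
    "shard2": "http://solr-node-2:8983/solr/collection1_shard2_replica1",
    "shard3": "http://solr-node-3:8983/solr/collection1_shard3_replica1",
}

QUERY = "text:example"  # substitute one of your heavy production queries

for shard, core in SHARD_CORES.items():
    resp = requests.get(
        core + "/select",
        params={"q": QUERY, "rows": 0, "wt": "json", "distrib": "false"},
    ).json()
    print(f"{shard}: numFound={resp['response']['numFound']} "
          f"QTime={resp['responseHeader']['QTime']}ms")

If the heavy queries consistently hit far more documents on shard 1, that
would line up with the uneven heap usage you're seeing.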