Sorry, my bad. For SolrCloud, soft commits are enabled (every 15 seconds). I
do a hard commit from an external cron task via curl every 15 minutes.
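Concretely, the commit setup is roughly this (HOST and COLLECTION below are
just placeholders, not our real names):

    <!-- solrconfig.xml: soft commit every 15 seconds -->
    <updateHandler class="solr.DirectUpdateHandler2">
      <autoSoftCommit>
        <maxTime>15000</maxTime>
      </autoSoftCommit>
      <!-- no autoCommit block; the 15-minute hard commit comes from cron,
           roughly: curl 'http://HOST:8983/solr/COLLECTION/update?commit=true' -->
    </updateHandler>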

The version I'm using for the SolrCloud setup is 4.4.0.

Document cache warm-up times are 0ms.
Filter cache warm-up times are between 3 and 7 seconds.
Query result cache warm-up times are between 0 and 2 seconds.

I haven't tried disabling the caches; I'll give that a try and see what
happens.
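My understanding is that disabling them is just a matter of commenting the
cache entries out of the <query> section of solrconfig.xml (or shrinking them
to nothing), something like:

    <query>
      <!-- caches effectively disabled: entries commented out (setting
           size/autowarmCount to 0 should amount to much the same thing)
      <documentCache class="solr.LRUCache" size="1024" initialSize="1024"/>
      <filterCache class="solr.FastLRUCache" size="512" initialSize="512" autowarmCount="128"/>
      <queryResultCache class="solr.LRUCache" size="128" initialSize="128" autowarmCount="32"/>
      -->
    </query>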

This isn't a static index; we are continuously indexing documents into it.
We're keeping up with our normal update load, which involves updating a
percentage of the documents (thousands of documents, not hundreds).




On 19 September 2013 20:33, Shreejay Nair <shreej...@gmail.com> wrote:

> Hi Neil,
>
> Although you haven't mentioned it, just wanted to confirm - do you have
> soft commits enabled?
>
> Also, what's the version of Solr you are using for the SolrCloud setup?
> 4.0.0 had lots of memory and ZK-related issues. What's the warmup time for
> your caches? Have you tried disabling the caches?
>
> Is this a static index, or are documents added continuously?
>
> The answers to these questions might help us pinpoint the issue...
>
> On Thursday, September 19, 2013, Neil Prosser wrote:
>
> > Apologies for the giant email. Hopefully it makes sense.
> >
> > We've been trying out SolrCloud to solve some scalability issues with our
> > current setup and have run into problems. I'd like to describe our current
> > setup, our queries and the sort of load we see and am hoping someone might
> > be able to spot the massive flaw in the way I've been trying to set things
> > up.
> >
> > We currently run Solr 4.0.0 in the old-style Master/Slave replication. We
> > have five slaves, each running CentOS with 96GB of RAM, 24 cores and with
> > 48GB assigned to the JVM heap. Disks aren't crazy fast (i.e. not SSDs) but
> > aren't slow either. Our GC parameters aren't particularly exciting, just
> > -XX:+UseConcMarkSweepGC. Java version is 1.7.0_11.
> >
> > Our index size ranges between 144GB and 200GB (when we optimise it back
> > down, since we've had bad experiences with large cores). We've got just
> > over 37M documents; some are smallish but most range between 1000-6000
> > bytes. We regularly update documents so large portions of the index will be
> > touched, leading to a maxDocs value of around 43M.
> >
> > Query load ranges between 400req/s to 800req/s across the five slaves
> > throughout the day, increasing and decreasing gradually over a period of
> > hours, rather than bursting.
> >
> > Most of our documents have upwards of twenty fields. We use different
> > fields to store territory-variant values (we have around 30 territories)
> > and also boost based on the values in some of these fields (integer ones).
> >
> > So an average query can do a range filter by two of the territory-variant
> > fields, filter by a non-territory-variant field. Facet by a field or two
> > (may be territory-variant). Bring back the values of 60 fields. Boost query
> > on field values of a non-territory-variant field. Boost by values of two
> > territory-variant fields. Dismax query on up to 20 fields (with boosts) and
> > phrase boost on those fields too. They're pretty big queries. We don't do
> > any index-time boosting. We try to keep things dynamic so we can alter our
> > boosts on-the-fly.
> >
> > Another common query is to list documents with a given set of IDs and
> > select documents with a common reference and order them by one of their
> > fields.
> >
> > Auto-commit every 30 minutes. Replication polls every 30 minutes.
> >
> > Document cache:
> >   * initialSize - 32768
> >   * size - 32768
> >
> > Filter cache:
> >   * autowarmCount - 128
> >   * initialSize - 8192
> >   * size - 8192
> >
> > Query result cache:
> >   * autowarmCount - 128
> >   * initialSize - 8192
> >   * size - 8192
> >
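For reference, those settings correspond to roughly the following in the
master/slave solrconfig.xml (a sketch: the cache classes shown are the usual
defaults rather than copied from our config, and MASTER_HOST/CORE are
placeholders):

    <query>
      <documentCache    class="solr.LRUCache"     size="32768" initialSize="32768"/>
      <filterCache      class="solr.FastLRUCache" size="8192"  initialSize="8192" autowarmCount="128"/>
      <queryResultCache class="solr.LRUCache"     size="8192"  initialSize="8192" autowarmCount="128"/>
    </query>

    <updateHandler class="solr.DirectUpdateHandler2">
      <!-- hard commit every 30 minutes -->
      <autoCommit>
        <maxTime>1800000</maxTime>
      </autoCommit>
    </updateHandler>

    <!-- slave side: poll the master every 30 minutes -->
    <requestHandler name="/replication" class="solr.ReplicationHandler">
      <lst name="slave">
        <str name="masterUrl">http://MASTER_HOST:8983/solr/CORE</str>
        <str name="pollInterval">00:30:00</str>
      </lst>
    </requestHandler>
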
> > After a replicated core has finished downloading (probably while it's
> > warming) we see requests which usually take around 100ms taking over 5s. GC
> > logs show concurrent mode failure.
> >
> > I was wondering whether anyone can help with sizing the boxes required to
> > split this index down into shards for use with SolrCloud and roughly how
> > much memory we should be assigning to the JVM. Everything I've read
> > suggests that running with a 48GB heap is way too high but every attempt
> > I've made to reduce the cache sizes seems to wind up causing out-of-memory
> > problems. Even dropping all cache sizes by 50% and reducing the heap by 50%
> > caused problems.
> >
> > I've already tried SolrCloud with 10 shards (around 3.7M documents per
> > shard, each with one replica) and kept the cache sizes low:
> >
> > Document cache:
> >   * initialSize - 1024
> >   * size - 1024
> >
> > Filter cache:
> >   * autowarmCount - 128
> >   * initialSize - 512
> >   * size - 512
> >
> > Query result cache:
> >   * autowarmCount - 32
> >   * initialSize - 128
> >   * size - 128
> >
> > Even when running on six machines in AWS with SSDs, 24GB heap (out of 60GB
> > memory) and four shards on two boxes and three on the rest I still see
> > concurrent mode failure. This looks like it's causing ZooKeeper to mark the
> > node as down and things begin to struggle.
> >
> > Is concurrent mode failure just something that will inevitably happen or is
> > it avoidable by dropping the CMSInitiatingOccupancyFraction?
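
For concreteness, by "dropping" it I mean making CMS start collecting earlier
by pinning the threshold with something along these lines (75 is just an
illustrative value, not something we've tested):

    -XX:+UseConcMarkSweepGC
    -XX:CMSInitiatingOccupancyFraction=75
    -XX:+UseCMSInitiatingOccupancyOnly
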
> >
> > If anyone has anything that might shove me in the right direction I'd be
> > very grateful. I'm wondering whether our set-up will just never work and
> > maybe we're expecting too much.
> >
> > Many thanks,
> >
> > Neil
> >
>
