On 4/19/2013 1:34 AM, John Nielsen wrote:
> Well, to consume 120GB of RAM with a 120GB index, you would have to query
> over every single GB of data.
> 
> If you only actually query over, say, 500MB of the 120GB data in your dev
> environment, you would only use 500MB worth of RAM for caching. Not 120GB

What you are saying is essentially true, although I would not be
surprised to learn that even a single query would read a few gigabytes
from a 120GB index, assuming that you start after a server reboot.  The
next query would re-use a lot of the data cached by the first query and
return much faster.
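As a rough way to see that effect for yourself, here is a minimal sketch that times the same query twice right after a restart; the second run should come back noticeably faster because the OS disk cache (and Solr's own caches) are now warm.  The host, core name, and query below are placeholders, not anything from this thread.

    import time
    import urllib.request

    # Hypothetical Solr endpoint; substitute your own host, core, and query.
    URL = "http://localhost:8983/solr/mycore/select?q=*:*&rows=10"

    def timed_query(url):
        start = time.time()
        with urllib.request.urlopen(url) as resp:
            resp.read()  # pull the whole response so the work really happens
        return time.time() - start

    cold = timed_query(URL)  # first run after a reboot: mostly disk reads
    warm = timed_query(URL)  # second run: served largely from cache
    print("cold: %.3fs  warm: %.3fs" % (cold, warm))

Keep in mind the second run also benefits from Solr's internal query caches, so this only shows the combined warm-up effect, not the disk cache in isolation.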

> On Fri, Apr 19, 2013 at 7:55 AM, David Parks <davidpark...@yahoo.com> wrote:
>> Question: if I had 1 server with 60GB of memory and 120GB index, would solr
>> make full use of the 60GB of memory? Thus trimming disk access in half. Or
>> is it an all-or-nothing thing?  In a dev environment, I didn't notice SOLR
>> consuming the full 5GB of RAM assigned to it with a 120GB index.

Solr would likely cause the OS to use most or all of that memory.  It's
not an all-or-nothing thing.  The first few queries will load a big
chunk of the index into the disk cache, and each additional query will
load a little more.  60GB of RAM will be significantly better than
12GB.  With only 12GB, it is extremely likely that a given query will
read a section of the index that pushes the data required for the next
query out of the disk cache, so that data has to be re-read from disk,
and so on in a never-ending cycle.  That is far less likely if you have
enough RAM for half your index rather than a tenth.  Operating system
disk caches are pretty good at figuring out which data is needed
frequently, and if the cache is big enough, that data can easily be
kept in it.
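If you want to watch this happen, here is a minimal, Linux-only sketch that reports the kernel's disk cache before and after you run some warm-up queries.  It assumes /proc/meminfo is available, and the "Cached" figure covers everything the kernel has cached, not just the index files, so treat it as a rough indicator only.

    # Linux-only sketch: how much memory the kernel is using as disk cache.
    def cached_bytes():
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("Cached:"):
                    return int(line.split()[1]) * 1024  # value is in kB
        raise RuntimeError("Cached: not found in /proc/meminfo")

    before = cached_bytes()
    input("Run some representative queries against Solr, then press Enter...")
    after = cached_bytes()
    print("disk cache grew by %.1f GB" % ((after - before) / 2**30))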

An ideal setup would have enough RAM to cache the entire index.
Depending on your schema, you may find that in production the disk
cache only ends up holding somewhere between half and two thirds of
your index.  The 60GB figure you have quoted above *MIGHT* be enough to
make things work really well with a 120GB index, but I always tell
people that if they want top performance, they should buy enough RAM to
cache the whole thing.
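A quick back-of-the-envelope check is to compare the on-disk size of the index directory against the machine's total RAM; if the index (plus the JVM heap and everything else on the box) fits comfortably under MemTotal, the whole thing can stay in the disk cache.  A minimal sketch, with a placeholder index path and the same Linux-only /proc/meminfo assumption as above:

    import os

    # Placeholder path; point this at your actual Lucene index directory.
    INDEX_DIR = "/var/solr/data/mycore/data/index"

    def dir_size(path):
        total = 0
        for root, _dirs, files in os.walk(path):
            for name in files:
                total += os.path.getsize(os.path.join(root, name))
        return total

    def mem_total_bytes():
        with open("/proc/meminfo") as f:
            for line in f:
                if line.startswith("MemTotal:"):
                    return int(line.split()[1]) * 1024
        raise RuntimeError("MemTotal: not found")

    index = dir_size(INDEX_DIR)
    ram = mem_total_bytes()
    print("index: %.1f GB, RAM: %.1f GB" % (index / 2**30, ram / 2**30))
    if ram > index:
        print("the whole index could fit in the disk cache (heap permitting)")
    else:
        print("only part of the index can be cached at once")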

You might have a combination of query patterns and data that results in
more of the index needing to be cached than I have seen on my own
setup.  You are likely to add documents continuously.  You may learn
that your schema doesn't cover your needs, so you have to modify it to
tokenize more aggressively, or you may need to copy fields so you can
analyze the same data more than one way.  These things will make your
index bigger.  If your query volume grows or gets more varied, more of
your index is likely to end up in the disk cache.

I would not recommend going into production with an index that has no
redundancy.  If you buy quality hardware with redundancy in storage,
dual power supplies, and ECC memory, catastrophic failures are rare, but
they DO happen.  The motherboard or an entire RAM chip could suddenly
die.  Someone might accidentally hit the power switch on the server and
shut it down.  Someone working in the rack might fall and pull out both
power cords trying to catch themselves.  The latter scenarios cause
only temporary problems, but your users will probably notice.

Thanks,
Shawn
