Re: SolrCloud loadbalancing, replication, and failover

2014-07-31 Thread Shawn Heisey
On 7/31/2014 12:58 AM, shuss...@del.aithent.com wrote:
 Thanks for giving a great explanation about the memory requirements. Could you
 tell me what parameters I need to change in my solrconfig.xml to
 handle a large index size? What are the optimal values that I need to use?

 My indexed data size is 65 GB (for 8.6 million documents) and I have 48
 GB of RAM on my server. Whenever I perform delta-indexing, the server becomes
 unresponsive while updating the index.

 Following are the changes that I made in solrconfig.xml after going through the net:
 <writeLockTimeout>6</writeLockTimeout>
 <ramBufferSizeMB>256</ramBufferSizeMB>
 <useCompoundFile>false</useCompoundFile>
 <maxBufferedDocs>1000</maxBufferedDocs>

 <mergePolicy class="org.apache.lucene.index.TieredMergePolicy">
   <int name="maxMergeAtOnce">10</int>
   <int name="segmentsPerTier">10</int>
 </mergePolicy>

 <mergeFactor>10</mergeFactor>
 <mergeScheduler class="org.apache.lucene.index.ConcurrentMergeScheduler"/>

 <lockType>simple</lockType>
 <unlockOnStartup>true</unlockOnStartup>

 <updateHandler class="solr.DirectUpdateHandler2">
   <autoCommit>
     <maxDocs>15000</maxDocs>
     <openSearcher>true</openSearcher>
   </autoCommit>
   <updateLog>
     <str name="dir">${solr.data.dir:}</str>
   </updateLog>
 </updateHandler>

 So, please provide your valuable suggestion on this problem

You replied directly to me, not to the list.  I am redirecting this back
to the list.

One of the first things that I would do is change openSearcher to false
in your autoCommit settings.  This will mean that you must take care of
commits yourself when you index, to make documents visible.  If you want
any more suggestions, we'll need to see the entire solrconfig.xml file.
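
If it helps, here is a minimal SolrJ sketch of that pattern; the core URL is
illustrative, and if you use the DataImportHandler you can get the same effect
by passing commit=true on the delta-import request. The idea is that autoCommit
with openSearcher=false handles durability, while your indexing code issues the
single visibility commit at the end of each batch:

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexThenCommit {
    public static void main(String[] args) throws Exception {
        // illustrative core URL; point this at the core you delta-index into
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");
        solr.add(doc);   // repeat for the whole delta batch; autoCommit handles durability

        // with <openSearcher>false</openSearcher>, nothing becomes visible until a
        // commit that opens a new searcher; do that exactly once per batch
        solr.commit(true, true);   // commit(waitFlush, waitSearcher)
        solr.shutdown();
    }
}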

The fact that you don't have enough RAM to cache your whole index could
be a problem.  If 8.6 million documents results in 65GB of index, then
your documents are probably quite large, and that can lead to other
possible challenges, because it usually means that a lot of work must be
done to index a single document.  There are also probably a lot of terms
to match when querying.

I do not know how much of your 48GB has been allocated to the java heap,
which takes away from memory that the operating system can use to cache
index files.

Thanks,
Shawn



Re: SolrCloud loadbalancing, replication, and failover

2013-04-21 Thread Erick Erickson
One note to add. There's been lots of discussion here about
index size, which is a slippery concept. To wit:
Look at your index directory, specifically the *.fdt and *.fdx files.
That's where the verbatim copy of your data is held, i.e.
whatever you specify with 'stored=true'. It is almost totally irrelevant
to memory needs for searching; that data is only accessed after the
final set of documents has been assembled and the fl list is
being populated for them.

So, an index with 39G of stored data and 1G for the rest has much
different memory requirements than 1G of stored data and 39G for
the rest, where "the rest" == the searchable part that can be
held in RAM.

Then there's the fact that the actual data in the index doesn't
include dynamic structures required for navigating that data, so just
because your non-stored data consumes 10G of data on your disk
doesn't mean it'll actually all fit in 10G of memory.

Quick example. Each filter cache entry consists of a key (the filter
query itself) plus a bitset of maxDoc/8 bytes. So an index with 64M docs
will require 8MB per entry (ignoring some overhead). Not bad so far. But
now I keep issuing unfortunate filter queries that use NOW, so each
one requires an additional 8MB of memory. And this is a static index,
so we never open new readers. And I've configured my filter cache to hold
1,000,000 entries (I have seen this). Works fine in my test environment
where I'm bouncing the server pretty frequently, but now I put it in my
production environment and it starts blowing up with OOM errors after
running for a while.
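
One common way out of that trap, assuming a date field (called timestamp here
purely for illustration) and a stock SolrJ client, is to round NOW in the
filter, so every request in the same day produces the identical fq string and
re-uses one cached bitset instead of minting a new entry per query:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class RoundedFilterQuery {
    public static void main(String[] args) throws Exception {
        // illustrative core URL
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrQuery q = new SolrQuery("some user query");
        // An unrounded NOW creates a brand-new filter cache entry (maxDoc/8 bytes)
        // on every request:
        //   q.addFilterQuery("timestamp:[NOW-7DAYS TO NOW]");
        // Rounding to day granularity lets every request that day share one entry:
        q.addFilterQuery("timestamp:[NOW/DAY-7DAYS TO NOW/DAY+1DAY]");

        System.out.println(solr.query(q).getResults().getNumFound());
        solr.shutdown();
    }
}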

So try. Measure. Rinse, repeat. <G>

Best
Erick

On Fri, Apr 19, 2013 at 10:33 PM, David Parks davidpark...@yahoo.com wrote:
 Again, thank you for this incredible information, I feel on much firmer
 footing now. I'm going to test distributing this across 10 servers,
 borrowing a Hadoop cluster temporarily, and see how it does with enough
 memory to have the whole index cached. But I'm thinking that we'll try the
 SSD route as our index will probably rest in the 1/2 terabyte range
 eventually; there's still a lot of active development.

 I guess the RAM disk would work in our case also, as we only index in
 batches, and eventually I'd like to do that off of Solr and just update the
 index (I'm presuming this is doable in solr cloud, but I haven't put it to
 task yet). If I could purpose Hadoop to index the shards, that would be
 ideal, though I haven't quite figured out how to go about it yet.

 David


 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: Friday, April 19, 2013 9:42 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud loadbalancing, replication, and failover

 On 4/19/2013 3:48 AM, David Parks wrote:
 The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has
 dark grey allocation of 602MB, and light grey of an additional 108MB,
 for a JVM total of 710MB allocated. If I understand correctly, Solr
 memory utilization is
 *not* for caching (unless I configured document caches or some of the
 other cache options in Solr, which don't seem to apply in this case,
 and I haven't altered from their defaults).

 Right.  Solr does have caches, but they serve specific purposes.  The OS is
 much better at general large-scale caching than Solr is.  Solr caches get
 cleared (and possibly re-warmed) whenever you issue a commit on your index
 that makes new documents visible.

 So assuming this box was dedicated to 1 solr instance/shard. What JVM
 heap should I set? Does that matter? 24GB JVM heap? Or keep it lower
 and ensure the OS cache has plenty of room to operate? (this is an
 Ubuntu 12.10 server instance).

 The JVM heap to use is highly dependent on the nature of your queries, the
 number of documents, the number of unique terms, etc.  The best thing to do
 is try it out with a relatively large heap, see how much memory actually
 gets used inside the JVM.  The jvisualvm and jconsole tools will give you
 nice graphs of JVM memory usage.  The jstat program will give you raw
 numbers on the commandline that you'll need to add to get the full picture.
 Due to the garbage collection model that Java uses, what you'll see is a
 sawtooth pattern - memory usage goes up to max heap, then garbage collection
 reduces it to the actual memory used.
  Generally speaking, you want to have more heap available than the low
 point of that sawtooth pattern.  If that low point is around 3GB when you
 are hitting your index hard with queries and updates, then you would want to
 give Solr a heap of 4 to 6 GB.

 Would I be wise to just put the index on a RAM disk and guarantee
 performance?  Assuming I installed sufficient RAM?

 A RAM disk is a very good way to guarantee performance - but RAM disks are
 ephemeral.  Reboot or have an OS crash and it's gone, you'll have to
 reindex.  Also remember that you actually need at *least* twice the size of
 your index so that Solr (Lucene) has enough room to do merges, and the
 worst-case scenario is *three

Re: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread John Nielsen
Well, to consume 120GB of RAM with a 120GB index, you would have to query
over every single GB of data.

If you only actually query over, say, 500MB of the 120GB data in your dev
environment, you would only use 500MB worth of RAM for caching. Not 120GB


On Fri, Apr 19, 2013 at 7:55 AM, David Parks davidpark...@yahoo.com wrote:

 Wow! That was the most pointed, concise discussion of hardware requirements
 I've seen to date, and it's fabulously helpful, thank you Shawn!  We
 currently have 2 servers that I can dedicate about 12GB of ram to Solr on
 (we're moving to these 2 servers now). I can upgrade further if it's needed
 & justified, and your discussion helps me justify that such an upgrade is
 the right thing to do.

 So... If I move to 3 servers with 50GB of RAM each, using 3 shards, I
 should
 be in the free and clear then right?  This seems reasonable and doable.

 In this more extreme example the failover properties of solr cloud become
 more clear. I couldn't possibly run a replica shard without doubling the
 memory, so really replication isn't reasonable until I have double the
 hardware, then the load balancing scheme makes perfect sense. With 3
 servers, 50GB of RAM and 120GB index I should just backup the index
 directory I think.

 My previous thought to run replication just for failover would have actually
 resulted in LOWER performance because I would have halved the memory
 available to the master & replica. So the previous question is answered as
 well now.

 Question: if I had 1 server with 60GB of memory and 120GB index, would solr
 make full use of the 60GB of memory? Thus trimming disk access in half. Or
 is it an all-or-nothing thing?  In a dev environment, I didn't notice SOLR
 consuming the full 5GB of RAM assigned to it with a 120GB index.

 Dave


 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: Friday, April 19, 2013 11:51 AM
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud loadbalancing, replication, and failover

 On 4/18/2013 8:12 PM, David Parks wrote:
  I think I still don't understand something here.
 
  My concern right now is that query times are very slow for 120GB index
  (14s on avg), I've seen a lot of disk activity when running queries.
 
  I'm hoping that distributing that query across 2 servers is going to
  improve the query time, specifically I'm hoping that we can distribute
  that disk activity because we don't have great disks on there (yet).
 
  So, with disk IO being a factor in mind, running the query on one box,
 vs.
  across 2 *should* be a concern right?
 
  Admittedly, this is the first step in what will probably be many to
  try to work our query times down from 14s to what I want to be around 1s.

 I went through my mailing list archive to see what all you've said about
 your setup.  One thing that I can't seem to find is a mention of how much
 total RAM is in each of your servers.  I apologize if it was actually there
 and I overlooked it.

 In one email thread, you wanted to know whether Solr is CPU-bound or
 IO-bound.  Solr is heavily reliant on the index on disk, and disk I/O is
 the
 slowest piece of the puzzle. The way to get good performance out of Solr is
 to have enough memory that you can take the disk mostly out of the equation
 by having the operating system cache the index in RAM.  If you don't have
 enough RAM for that, then Solr becomes IO-bound, and your CPUs will be busy
 in iowait, unable to do much real work.  If you DO have enough RAM to cache
 all (or most) of your index, then Solr will be CPU-bound.

 With 120GB of total index data on each server, you would want at least
 128GB
 of RAM per server, assuming you are only giving 8-16GB of RAM to Solr, and
 that Solr is the only thing running on the machine.  If you have more
 servers and shards, you can reduce the per-server memory requirement
 because
 the amount of index data on each server would go down.  I am aware of the
 cost associated with this kind of requirement - each of my Solr servers has
 64GB.

 If you are sharing the server with another program, then you want to have
 enough RAM available for Solr's heap, Solr's data, the other program's
 heap,
 and the other program's data.  Some programs (like
 MySQL) completely skip the OS disk cache and instead do that caching
 themselves with heap memory that's actually allocated to the program.
 If you're using a program like that, then you wouldn't need to count its
 data.

 Using SSDs for storage can speed things up dramatically and may reduce the
 total memory requirement to some degree, but even an SSD is slower than
 RAM.
 The transfer speed of RAM is faster, and from what I understand, the
 latency
 is at least an order of magnitude quicker - nanoseconds vs microseconds.

 In another thread, you asked about how Google gets such good response
 times.
 Although Google's software probably works differently than Solr/Lucene,
 when
 it comes right down to it, all search engines do similar

RE: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread David Parks
Interesting. I'm trying to correlate this new understanding to what I see on
my servers.  I've got one server with 5GB dedicated to solr, solr dashboard
reports a 167GB index actually.

When I do many typical queries I see between 3MB and 9MB of disk reads
(watching iostat).

But solr's dashboard only shows 710MB of memory in use (this box has had
many hundreds of queries put through it, and has been up for 1 week). That
doesn't quite correlate with my understanding that Solr would cache the
index as much as possible. 

Should I be thinking that things aren't configured correctly here?

Dave


-Original Message-
From: John Nielsen [mailto:j...@mcb.dk] 
Sent: Friday, April 19, 2013 2:35 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

Well, to consume 120GB of RAM with a 120GB index, you would have to query
over every single GB of data.

If you only actually query over, say, 500MB of the 120GB data in your dev
environment, you would only use 500MB worth of RAM for caching. Not 120GB


On Fri, Apr 19, 2013 at 7:55 AM, David Parks davidpark...@yahoo.com wrote:

 Wow! That was the most pointed, concise discussion of hardware 
 requirements I've seen to date, and it's fabulously helpful, thank you 
 Shawn!  We currently have 2 servers that I can dedicate about 12GB of 
 ram to Solr on (we're moving to these 2 servers now). I can upgrade 
 further if it's needed & justified, and your discussion helps me
 justify that such an upgrade is the right thing to do.

 So... If I move to 3 servers with 50GB of RAM each, using 3 shards, I 
 should be in the free and clear then right?  This seems reasonable and 
 doable.

 In this more extreme example the failover properties of solr cloud 
 become more clear. I couldn't possibly run a replica shard without 
 doubling the memory, so really replication isn't reasonable until I 
 have double the hardware, then the load balancing scheme makes perfect 
 sense. With 3 servers, 50GB of RAM and 120GB index I should just 
 backup the index directory I think.

 My previous thought to run replication just for failover would have
 actually resulted in LOWER performance because I would have halved the 
 memory available to the master & replica. So the previous question is
 answered as well now.

 Question: if I had 1 server with 60GB of memory and 120GB index, would 
 solr make full use of the 60GB of memory? Thus trimming disk access in 
 half. Or is it an all-or-nothing thing?  In a dev environment, I 
 didn't notice SOLR consuming the full 5GB of RAM assigned to it with a
120GB index.

 Dave


 -Original Message-
 From: Shawn Heisey [mailto:s...@elyograg.org]
 Sent: Friday, April 19, 2013 11:51 AM
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud loadbalancing, replication, and failover

 On 4/18/2013 8:12 PM, David Parks wrote:
  I think I still don't understand something here.
 
  My concern right now is that query times are very slow for 120GB 
  index (14s on avg), I've seen a lot of disk activity when running
queries.
 
  I'm hoping that distributing that query across 2 servers is going to 
  improve the query time, specifically I'm hoping that we can 
  distribute that disk activity because we don't have great disks on there
(yet).
 
  So, with disk IO being a factor in mind, running the query on one 
  box,
 vs.
  across 2 *should* be a concern right?
 
  Admittedly, this is the first step in what will probably be many to 
  try to work our query times down from 14s to what I want to be around
1s.

 I went through my mailing list archive to see what all you've said 
 about your setup.  One thing that I can't seem to find is a mention of 
 how much total RAM is in each of your servers.  I apologize if it was 
 actually there and I overlooked it.

 In one email thread, you wanted to know whether Solr is CPU-bound or 
 IO-bound.  Solr is heavily reliant on the index on disk, and disk I/O 
 is the slowest piece of the puzzle. The way to get good performance 
 out of Solr is to have enough memory that you can take the disk mostly 
 out of the equation by having the operating system cache the index in 
 RAM.  If you don't have enough RAM for that, then Solr becomes 
 IO-bound, and your CPUs will be busy in iowait, unable to do much real 
 work.  If you DO have enough RAM to cache all (or most) of your index, 
 then Solr will be CPU-bound.

 With 120GB of total index data on each server, you would want at least 
 128GB of RAM per server, assuming you are only giving 8-16GB of RAM to 
 Solr, and that Solr is the only thing running on the machine.  If you 
 have more servers and shards, you can reduce the per-server memory 
 requirement because the amount of index data on each server would go 
 down.  I am aware of the cost associated with this kind of requirement 
 - each of my Solr servers has 64GB.

 If you are sharing the server with another program, then you want to 
 have enough RAM available for Solr's

Re: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread Shawn Heisey
On 4/19/2013 1:34 AM, John Nielsen wrote:
 Well, to consume 120GB of RAM with a 120GB index, you would have to query
 over every single GB of data.
 
 If you only actually query over, say, 500MB of the 120GB data in your dev
 environment, you would only use 500MB worth of RAM for caching. Not 120GB

What you are saying is essentially true, although I would not be
surprised to learn that even a single query would read a few gigabytes
from a 120GB index, assuming that you start after a server reboot.  The
next query would re-use a lot of the data cached by the first query and
return much faster.

 On Fri, Apr 19, 2013 at 7:55 AM, David Parks davidpark...@yahoo.com wrote:
 Question: if I had 1 server with 60GB of memory and 120GB index, would solr
 make full use of the 60GB of memory? Thus trimming disk access in half. Or
 is it an all-or-nothing thing?  In a dev environment, I didn't notice SOLR
 consuming the full 5GB of RAM assigned to it with a 120GB index.

Solr would likely cause the OS to use most or all of that memory.  It's
not an all or nothing thing.  The first few queries will load a big
chunk, then each additional query will load a little more.  60GB of RAM
will be significantly better than 12GB.  With only 12GB, it is extremely
likely that a given query will read a section of the index that will
push the data required for the next query out of the disk cache, so it
will have to re-read it from the disk on the next query, and so on in a
never-ending cycle.  That is far less likely if you have enough RAM for
half your index rather than a tenth.  Operating system disk caches are
pretty good at figuring out which data is needed frequently.  If the
cache is big enough, that data can be kept in the cache easily.

An ideal setup would have enough RAM to cache the entire index.
Depending on your schema, you may find that the disk cache in production
only ends up caching somewhere between half and two thirds of your
index.  The 60GB figure you have quoted above *MIGHT* be enough to make
things work really well with a 120GB index, but I always tell people
that if they want top performance, they will buy enough RAM to cache the
whole thing.

You might have a combination of query pattern and data that results in
more of the index needing cache than I have seen on my setup.  You are
likely to add documents continuously.  You may learn that your schema
doesn't cover your needs, so you have to modify it to tokenize more
aggressively, or you may need to copy fields so you can analyze the same
data more than one way.  These things will make your index bigger.  If
your query volume grows or gets more varied, more of your index is
likely to end up in the disk cache.

I would not recommend going into production with an index that has no
redundancy.  If you buy quality hardware with redundancy in storage,
dual power supplies, and ECC memory, catastrophic failures are rare, but
they DO happen.  The motherboard or an entire RAM chip could suddenly
die.  Someone might accidentally hit the power switch on the server and
cause it to shut down.  They might be working in the rack, fall down,
and pull out both power cords in an attempt to catch themselves.  The
latter scenarios are a temporary problem, but your users will probably
notice.

Thanks,
Shawn



Re: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread Shawn Heisey
On 4/19/2013 2:15 AM, David Parks wrote:
 Interesting. I'm trying to correlate this new understanding to what I see on
 my servers.  I've got one server with 5GB dedicated to solr, solr dashboard
 reports a 167GB index actually.
 
 When I do many typical queries I see between 3MB and 9MB of disk reads
 (watching iostat).
 
 But solr's dashboard only shows 710MB of memory in use (this box has had
 many hundreds of queries put through it, and has been up for 1 week). That
 doesn't quite correlate with my understanding that Solr would cache the
 index as much as possible. 

There are two memory sections on the dashboard.  The one at the top
shows the operating system view of physical memory.  That is probably
showing virtually all of it in use.  Most UNIX platforms will show you
the same info with 'top' or 'free'.  Some of them, like Solaris, require
different tools.  I assume you're not using Windows, because you mention
iostat.

The other memory section is for the JVM, and that only covers the memory
used by Solr.  The dark grey section is the amount of Java heap memory
currently utilized by Solr and its servlet container.  The light grey
section represents the memory that the JVM has allocated from system
memory.  If any part of that bar is white, then Java has not yet
requested the maximum configured heap.  Typically a long-running Solr
install will have only dark and light grey, no white.

The operating system is what caches your index, not Solr.  The bulk of
your RAM should be unallocated.  With your index size, the OS will use
all unallocated RAM for the disk cache.  If a program requests some of
that RAM, the OS will instantly give it up.

Thanks,
Shawn



RE: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread David Parks
Ok, I understand better now.

The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has dark grey
allocation of 602MB, and light grey of an additional 108MB, for a JVM total
of 710MB allocated. If I understand correctly, Solr memory utilization is
*not* for caching (unless I configured document caches or some of the other
cache options in Solr, which don't seem to apply in this case, and I haven't
altered from their defaults).

So assuming this box was dedicated to 1 solr instance/shard. What JVM heap
should I set? Does that matter? 24GB JVM heap? Or keep it lower and ensure
the OS cache has plenty of room to operate? (this is an Ubuntu 12.10 server
instance).

Would I be wise to just put the index on a RAM disk and guarantee
performance?  Assuming I installed sufficient RAM?

Dave


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Friday, April 19, 2013 4:19 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On 4/19/2013 2:15 AM, David Parks wrote:
 Interesting. I'm trying to correlate this new understanding to what I 
 see on my servers.  I've got one server with 5GB dedicated to solr, 
 solr dashboard reports a 167GB index actually.
 
 When I do many typical queries I see between 3MB and 9MB of disk reads 
 (watching iostat).
 
 But solr's dashboard only shows 710MB of memory in use (this box has 
 had many hundreds of queries put through it, and has been up for 1 
 week). That doesn't quite correlate with my understanding that Solr 
 would cache the index as much as possible.

There are two memory sections on the dashboard.  The one at the top shows
the operating system view of physical memory.  That is probably showing
virtually all of it in use.  Most UNIX platforms will show you the same info
with 'top' or 'free'.  Some of them, like Solaris, require different tools.
I assume you're not using Windows, because you mention iostat.

The other memory section is for the JVM, and that only covers the memory
used by Solr.  The dark grey section is the amount of Java heap memory
currently utilized by Solr and its servlet container.  The light grey
section represents the memory that the JVM has allocated from system memory.
If any part of that bar is white, then Java has not yet requested the
maximum configured heap.  Typically a long-running Solr install will have
only dark and light grey, no white.

The operating system is what caches your index, not Solr.  The bulk of your
RAM should be unallocated.  With your index size, the OS will use all
unallocated RAM for the disk cache.  If a program requests some of that RAM,
the OS will instantly give it up.

Thanks,
Shawn



Re: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread Toke Eskildsen
On Fri, 2013-04-19 at 06:51 +0200, Shawn Heisey wrote:
 Using SSDs for storage can speed things up dramatically and may reduce
 the total memory requirement to some degree,

We have been using SSDs for several years in our servers. It is our
clear experience that "to some degree" should be replaced with "very much"
in the above.

Our current SSD-equipped servers each hold a total of 127GB of index
data spread over 3 instances. The machines each have 16GB of RAM, of
which about 7GB are left for disk cache.

We are the State and University Library, Denmark and our search engine
is the primary (and arguably only) way to locate resources for our
users. The average raw search time is 32ms for non-faceted queries and
616ms for heavy faceted (which is much too slow. Dang! I thought I fixed
that).

  but even an SSD is slower than RAM.  The transfer speed of RAM is faster,
 and from what I understand, the latency is at least an order of
 magnitude quicker - nanoseconds vs microseconds.

True, but you might as well argue that everyone should go for the
fastest CPU possible, as it will be, well, faster than the slower ones.

The question is almost never to get the fastest possible, but to get a
good price/performance tradeoff. I would argue that SSDs fit that bill
very well for a great deal of the "My search is too slow" threads that
are spun on this mailing list. Especially for larger indexes.

Regards,
Toke Eskildsen



RE: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread David Parks
Wow, thank you for those benchmarks Toke, that really gives me some firm 
footing to stand on in knowing what to expect and thinking out which path to 
venture down. It's tremendously appreciated!

Dave


-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] 
Sent: Friday, April 19, 2013 5:17 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On Fri, 2013-04-19 at 06:51 +0200, Shawn Heisey wrote:
 Using SSDs for storage can speed things up dramatically and may reduce 
 the total memory requirement to some degree,

We have been using SSDs for several years in our servers. It is our clear 
experience that "to some degree" should be replaced with "very much" in the
above.

Our current SSD-equipped servers each hold a total of 127GB of index data
spread over 3 instances. The machines each have 16GB of RAM, of which about 7GB
are left for disk cache.

We are the State and University Library, Denmark and our search engine is the 
primary (and arguably only) way to locate resources for our users. The average 
raw search time is 32ms for non-faceted queries and 616ms for heavy faceted 
(which is much too slow. Dang! I thought I fixed that).

  but even an SSD is slower than RAM.  The transfer speed of RAM is 
 faster, and from what I understand, the latency is at least an order 
 of magnitude quicker - nanoseconds vs microseconds.

True, but you might as well argue that everyone should go for the fastest CPU 
possible, as it will be, well, faster than the slower ones.

The question is almost never to get the fastest possible, but to get a good 
price/performance tradeoff. I would argue that SSDs fit that bill very well for 
a great deal of the "My search is too slow" threads that are spun on this
mailing list. Especially for larger indexes.

Regards,
Toke Eskildsen



Re: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread Shawn Heisey
On 4/19/2013 3:48 AM, David Parks wrote:
 The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has dark grey
 allocation of 602MB, and light grey of an additional 108MB, for a JVM total
 of 710MB allocated. If I understand correctly, Solr memory utilization is
 *not* for caching (unless I configured document caches or some of the other
 cache options in Solr, which don't seem to apply in this case, and I haven't
 altered from their defaults).

Right.  Solr does have caches, but they serve specific purposes.  The OS
is much better at general large-scale caching than Solr is.  Solr caches
get cleared (and possibly re-warmed) whenever you issue a commit on your
index that makes new documents visible.

 So assuming this box was dedicated to 1 solr instance/shard. What JVM heap
 should I set? Does that matter? 24GB JVM heap? Or keep it lower and ensure
 the OS cache has plenty of room to operate? (this is an Ubuntu 12.10 server
 instance).

The JVM heap to use is highly dependent on the nature of your queries,
the number of documents, the number of unique terms, etc.  The best
thing to do is try it out with a relatively large heap, see how much
memory actually gets used inside the JVM.  The jvisualvm and jconsole
tools will give you nice graphs of JVM memory usage.  The jstat program
will give you raw numbers on the commandline that you'll need to add to
get the full picture.  Due to the garbage collection model that Java
uses, what you'll see is a sawtooth pattern - memory usage goes up to
max heap, then garbage collection reduces it to the actual memory used.
 Generally speaking, you want to have more heap available than the low
point of that sawtooth pattern.  If that low point is around 3GB when
you are hitting your index hard with queries and updates, then you would
want to give Solr a heap of 4 to 6 GB.
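
If you want those raw numbers without a GUI, the same heap figures can also be
polled over JMX. A rough sketch, assuming Solr's JVM was started with remote
JMX enabled (for example -Dcom.sun.management.jmxremote.port=18983 plus the
usual authenticate/ssl flags for a test box); the host, port, and sampling loop
are all illustrative:

import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.lang.management.MemoryUsage;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class HeapWatcher {
    public static void main(String[] args) throws Exception {
        // must match the jmxremote.port that Solr's JVM was started with
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:18983/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection conn = connector.getMBeanServerConnection();
        MemoryMXBean mem = ManagementFactory.newPlatformMXBeanProxy(
                conn, ManagementFactory.MEMORY_MXBEAN_NAME, MemoryMXBean.class);
        // sample every 5 seconds; the lowest "used" value, seen right after a
        // garbage collection, is the sawtooth low point described above
        for (int i = 0; i < 120; i++) {
            MemoryUsage heap = mem.getHeapMemoryUsage();
            System.out.printf("used=%dMB max=%dMB%n",
                    heap.getUsed() / (1024 * 1024), heap.getMax() / (1024 * 1024));
            Thread.sleep(5000);
        }
        connector.close();
    }
}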

 Would I be wise to just put the index on a RAM disk and guarantee
 performance?  Assuming I installed sufficient RAM?

A RAM disk is a very good way to guarantee performance - but RAM disks
are ephemeral.  Reboot or have an OS crash and it's gone, you'll have to
reindex.  Also remember that you actually need at *least* twice the size
of your index so that Solr (Lucene) has enough room to do merges, and
the worst-case scenario is *three* times the index size.  Merging
happens during normal indexing, not just when you optimize.  If you have
enough RAM for three times your index size and it takes less than an
hour or two to rebuild the index, then a RAM disk might be a viable way
to go.  I suspect that this won't work for you.

Thanks,
Shawn



RE: SolrCloud loadbalancing, replication, and failover

2013-04-19 Thread David Parks
Again, thank you for this incredible information, I feel on much firmer
footing now. I'm going to test distributing this across 10 servers,
borrowing a Hadoop cluster temporarily, and see how it does with enough
memory to have the whole index cached. But I'm thinking that we'll try the
SSD route as our index will probably rest in the 1/2 terabyte range
eventually; there's still a lot of active development.

I guess the RAM disk would work in our case also, as we only index in
batches, and eventually I'd like to do that off of Solr and just update the
index (I'm presuming this is doable in solr cloud, but I haven't put it to
task yet). If I could purpose Hadoop to index the shards, that would be
ideal, though I haven't quite figured out how to go about it yet.

David


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Friday, April 19, 2013 9:42 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On 4/19/2013 3:48 AM, David Parks wrote:
 The Physical Memory is 90% utilized (21.18GB of 23.54GB). Solr has 
 dark grey allocation of 602MB, and light grey of an additional 108MB, 
 for a JVM total of 710MB allocated. If I understand correctly, Solr 
 memory utilization is
 *not* for caching (unless I configured document caches or some of the 
 other cache options in Solr, which don't seem to apply in this case, 
 and I haven't altered from their defaults).

Right.  Solr does have caches, but they serve specific purposes.  The OS is
much better at general large-scale caching than Solr is.  Solr caches get
cleared (and possibly re-warmed) whenever you issue a commit on your index
that makes new documents visible.

 So assuming this box was dedicated to 1 solr instance/shard. What JVM 
 heap should I set? Does that matter? 24GB JVM heap? Or keep it lower 
 and ensure the OS cache has plenty of room to operate? (this is an 
 Ubuntu 12.10 server instance).

The JVM heap to use is highly dependent on the nature of your queries, the
number of documents, the number of unique terms, etc.  The best thing to do
is try it out with a relatively large heap, see how much memory actually
gets used inside the JVM.  The jvisualvm and jconsole tools will give you
nice graphs of JVM memory usage.  The jstat program will give you raw
numbers on the commandline that you'll need to add to get the full picture.
Due to the garbage collection model that Java uses, what you'll see is a
sawtooth pattern - memory usage goes up to max heap, then garbage collection
reduces it to the actual memory used.
 Generally speaking, you want to have more heap available than the low
point of that sawtooth pattern.  If that low point is around 3GB when you
are hitting your index hard with queries and updates, then you would want to
give Solr a heap of 4 to 6 GB.

 Would I be wise to just put the index on a RAM disk and guarantee 
 performance?  Assuming I installed sufficient RAM?

A RAM disk is a very good way to guarantee performance - but RAM disks are
ephemeral.  Reboot or have an OS crash and it's gone, you'll have to
reindex.  Also remember that you actually need at *least* twice the size of
your index so that Solr (Lucene) has enough room to do merges, and the
worst-case scenario is *three* times the index size.  Merging happens during
normal indexing, not just when you optimize.  If you have enough RAM for
three times your index size and it takes less than an hour or two to rebuild
the index, then a RAM disk might be a viable way to go.  I suspect that this
won't work for you.

Thanks,
Shawn



SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread David Parks
Step 1: distribute processing

We have 2 servers on which we'll run 2 SolrCloud instances.

We'll define 2 shards so that both servers are busy for each request
(improving response time of the request).

 

Step 2: Failover

We would now like to ensure that if either of the servers goes down (we're
very unlucky with disks), that the other will be able to take over
automatically.

So we define 2 shards with a replication factor of 2.

 

So we have:

- Server 1: Shard 1, Replica 2

- Server 2: Shard 2, Replica 1

 

Question:

But in SolrCloud, replicas are active right? So isn't it now possible that
the load balancer will have Server 1 process *both* parts of a request,
after all, it has both shards due to the replication, right?
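
For reference, the two-shard, two-replica layout sketched above could be
created with a single Collections API call. A rough sketch in Java; the host,
port, and collection name are purely illustrative:

import java.net.HttpURLConnection;
import java.net.URL;

public class CreateTwoByTwo {
    public static void main(String[] args) throws Exception {
        // any node in the cluster can receive the CREATE call
        String createUrl = "http://server1:8983/solr/admin/collections"
                + "?action=CREATE&name=mycollection"
                + "&numShards=2&replicationFactor=2&maxShardsPerNode=2";
        HttpURLConnection conn = (HttpURLConnection) new URL(createUrl).openConnection();
        // a 200 means the overseer accepted the request and will place
        // 4 cores in total (2 shards x 2 replicas) across the 2 servers
        System.out.println("HTTP " + conn.getResponseCode());
        conn.disconnect();
    }
}

With only 2 nodes and maxShardsPerNode=2, each server ends up hosting one
replica of each shard, which is exactly the layout listed above.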



Re: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread Otis Gospodnetic
Correct. This is what you want if server 2 goes down.

Otis
Solr & ElasticSearch Support
http://sematext.com/
On Apr 18, 2013 3:11 AM, David Parks davidpark...@yahoo.com wrote:

 Step 1: distribute processing

 We have 2 servers on which we'll run 2 SolrCloud instances.

 We'll define 2 shards so that both servers are busy for each request
 (improving response time of the request).



 Step 2: Failover

 We would now like to ensure that if either of the servers goes down (we're
 very unlucky with disks), that the other will be able to take over
 automatically.

 So we define 2 shards with a replication factor of 2.



 So we have:

 - Server 1: Shard 1, Replica 2

 - Server 2: Shard 2, Replica 1



 Question:

 But in SolrCloud, replicas are active right? So isn't it now possible that
 the load balancer will have Server 1 process *both* parts of a request,
 after all, it has both shards due to the replication, right?




RE: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread David Parks
But my concern is this, when we have just 2 servers:
 - I want 1 to be able to take over in case the other fails, as you point
out.
 - But when *both* servers are up I don't want the SolrCloud load balancer
to have Shard1 and Replica2 do the work (as they would both reside on the
same physical server).

Does that make sense? I want *both* server1 & server2 sharing the processing
of every request, *and* I want the failover capability.

I'm probably missing some bit of logic here, but I want to be sure I
understand the architecture.

Dave
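
On the client side, a ZooKeeper-aware SolrJ client covers part of that picture:
it spreads top-level requests across whichever nodes are alive and drops a dead
node from rotation automatically, so SolrJ clients need no external load
balancer at all. A minimal sketch, with the ZooKeeper addresses and collection
name purely illustrative:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudAwareClient {
    public static void main(String[] args) throws Exception {
        // the client watches cluster state in ZooKeeper rather than talking
        // to one fixed Solr URL
        CloudSolrServer solr = new CloudSolrServer("zk1:2181,zk2:2181,zk3:2181");
        solr.setDefaultCollection("mycollection");
        solr.connect();

        QueryResponse rsp = solr.query(new SolrQuery("*:*"));
        System.out.println("hits: " + rsp.getResults().getNumFound());
        solr.shutdown();
    }
}

How the receiving node then fans the per-shard sub-requests out to replicas is
decided internally, which is why Server 1 can occasionally end up answering
both halves of a query.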



-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Thursday, April 18, 2013 8:13 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

Correct. This is what you want if server 2 goes down.

Otis
 Solr & ElasticSearch Support
http://sematext.com/
On Apr 18, 2013 3:11 AM, David Parks davidpark...@yahoo.com wrote:

 Step 1: distribute processing

 We have 2 servers on which we'll run 2 SolrCloud instances.

 We'll define 2 shards so that both servers are busy for each request 
 (improving response time of the request).



 Step 2: Failover

 We would now like to ensure that if either of the servers goes down 
 (we're very unlucky with disks), that the other will be able to take 
 over automatically.

 So we define 2 shards with a replication factor of 2.



 So we have:

 - Server 1: Shard 1, Replica 2

 - Server 2: Shard 2, Replica 1



 Question:

 But in SolrCloud, replicas are active right? So isn't it now possible 
 that the load balancer will have Server 1 process *both* parts of a 
 request, after all, it has both shards due to the replication, right?





Re: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread Timothy Potter
Hi Dave,

This sounds more like a budget / deployment issue vs. anything
architectural. You want 2 shards with replication so you either need
sufficient capacity on each of your 2 servers to host 2 Solr instances or
you need 4 servers. You need to avoid starving Solr of necessary RAM, disk
performance, and CPU regardless of how you lay out the cluster otherwise
performance will suffer. My guess is if each Solr had sufficient resources,
you wouldn't actually notice much difference in query performance.

Tim


On Thu, Apr 18, 2013 at 8:03 AM, David Parks davidpark...@yahoo.com wrote:

 But my concern is this, when we have just 2 servers:
  - I want 1 to be able to take over in case the other fails, as you point
 out.
  - But when *both* servers are up I don't want the SolrCloud load balancer
 to have Shard1 and Replica2 do the work (as they would both reside on the
 same physical server).

 Does that make sense? I want *both* server1 & server2 sharing the
 processing
 of every request, *and* I want the failover capability.

 I'm probably missing some bit of logic here, but I want to be sure I
 understand the architecture.

 Dave



 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
 Sent: Thursday, April 18, 2013 8:13 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud loadbalancing, replication, and failover

 Correct. This is what you want if server 2 goes down.

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Apr 18, 2013 3:11 AM, David Parks davidpark...@yahoo.com wrote:

  Step 1: distribute processing
 
  We have 2 servers on which we'll run 2 SolrCloud instances.
 
  We'll define 2 shards so that both servers are busy for each request
  (improving response time of the request).
 
 
 
  Step 2: Failover
 
  We would now like to ensure that if either of the servers goes down
  (we're very unlucky with disks), that the other will be able to take
  over automatically.
 
  So we define 2 shards with a replication factor of 2.
 
 
 
  So we have:
 
  - Server 1: Shard 1, Replica 2

  - Server 2: Shard 2, Replica 1
 
 
 
  Question:
 
  But in SolrCloud, replicas are active right? So isn't it now possible
  that the load balancer will have Server 1 process *both* parts of a
  request, after all, it has both shards due to the replication, right?
 
 




RE: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread David Parks
I think I still don't understand something here. 

My concern right now is that query times are very slow for 120GB index (14s
on avg), I've seen a lot of disk activity when running queries.

I'm hoping that distributing that query across 2 servers is going to improve
the query time, specifically I'm hoping that we can distribute that disk
activity because we don't have great disks on there (yet).

So, with disk IO being a factor in mind, running the query on one box, vs.
across 2 *should* be a concern right?

Admittedly, this is the first step in what will probably be many to try to
work our query times down from 14s to what I want to be around 1s.

Dave


-Original Message-
From: Timothy Potter [mailto:thelabd...@gmail.com] 
Sent: Thursday, April 18, 2013 9:16 PM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

Hi Dave,

This sounds more like a budget / deployment issue vs. anything
architectural. You want 2 shards with replication so you either need
sufficient capacity on each of your 2 servers to host 2 Solr instances or
you need 4 servers. You need to avoid starving Solr of necessary RAM, disk
performance, and CPU regardless of how you lay out the cluster otherwise
performance will suffer. My guess is if each Solr had sufficient resources,
you wouldn't actually notice much difference in query performance.

Tim


On Thu, Apr 18, 2013 at 8:03 AM, David Parks davidpark...@yahoo.com wrote:

 But my concern is this, when we have just 2 servers:
  - I want 1 to be able to take over in case the other fails, as you 
 point out.
  - But when *both* servers are up I don't want the SolrCloud load 
 balancer to have Shard1 and Replica2 do the work (as they would both 
 reside on the same physical server).

 Does that make sense? I want *both* server1 & server2 sharing the
 processing of every request, *and* I want the failover capability.

 I'm probably missing some bit of logic here, but I want to be sure I 
 understand the architecture.

 Dave



 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
 Sent: Thursday, April 18, 2013 8:13 PM
 To: solr-user@lucene.apache.org
 Subject: Re: SolrCloud loadbalancing, replication, and failover

 Correct. This is what you want if server 2 goes down.

 Otis
 Solr & ElasticSearch Support
 http://sematext.com/
 On Apr 18, 2013 3:11 AM, David Parks davidpark...@yahoo.com wrote:

  Step 1: distribute processing
 
  We have 2 servers on which we'll run 2 SolrCloud instances.
 
  We'll define 2 shards so that both servers are busy for each request 
  (improving response time of the request).
 
 
 
  Step 2: Failover
 
  We would now like to ensure that if either of the servers goes down 
  (we're very unlucky with disks), that the other will be able to take 
  over automatically.
 
  So we define 2 shards with a replication factor of 2.
 
 
 
  So we have:
 
  - Server 1: Shard 1, Replica 2

  - Server 2: Shard 2, Replica 1
 
 
 
  Question:
 
  But in SolrCloud, replicas are active right? So isn't it now 
  possible that the load balancer will have Server 1 process *both* 
  parts of a request, after all, it has both shards due to the
replication, right?
 
 





Re: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread Shawn Heisey
On 4/18/2013 8:12 PM, David Parks wrote:
 I think I still don't understand something here. 
 
 My concern right now is that query times are very slow for 120GB index (14s
 on avg), I've seen a lot of disk activity when running queries.
 
 I'm hoping that distributing that query across 2 servers is going to improve
 the query time, specifically I'm hoping that we can distribute that disk
 activity because we don't have great disks on there (yet).
 
 So, with disk IO being a factor in mind, running the query on one box, vs.
 across 2 *should* be a concern right?
 
 Admittedly, this is the first step in what will probably be many to try to
 work our query times down from 14s to what I want to be around 1s.

I went through my mailing list archive to see what all you've said about
your setup.  One thing that I can't seem to find is a mention of how
much total RAM is in each of your servers.  I apologize if it was
actually there and I overlooked it.

In one email thread, you wanted to know whether Solr is CPU-bound or
IO-bound.  Solr is heavily reliant on the index on disk, and disk I/O is
the slowest piece of the puzzle. The way to get good performance out of
Solr is to have enough memory that you can take the disk mostly out of
the equation by having the operating system cache the index in RAM.  If
you don't have enough RAM for that, then Solr becomes IO-bound, and your
CPUs will be busy in iowait, unable to do much real work.  If you DO
have enough RAM to cache all (or most) of your index, then Solr will be
CPU-bound.

With 120GB of total index data on each server, you would want at least
128GB of RAM per server, assuming you are only giving 8-16GB of RAM to
Solr, and that Solr is the only thing running on the machine.  If you
have more servers and shards, you can reduce the per-server memory
requirement because the amount of index data on each server would go
down.  I am aware of the cost associated with this kind of requirement -
each of my Solr servers has 64GB.

If you are sharing the server with another program, then you want to
have enough RAM available for Solr's heap, Solr's data, the other
program's heap, and the other program's data.  Some programs (like
MySQL) completely skip the OS disk cache and instead do that caching
themselves with heap memory that's actually allocated to the program.
If you're using a program like that, then you wouldn't need to count its
data.

Using SSDs for storage can speed things up dramatically and may reduce
the total memory requirement to some degree, but even an SSD is slower
than RAM.  The transfer speed of RAM is faster, and from what I
understand, the latency is at least an order of magnitude quicker -
nanoseconds vs microseconds.

In another thread, you asked about how Google gets such good response
times.  Although Google's software probably works differently than
Solr/Lucene, when it comes right down to it, all search engines do
similar jobs and have similar requirements.  I would imagine that Google
gets incredible response time because they have incredible amounts of
RAM at their disposal that keep the important bits of their index
instantly available.  They have thousands of servers in each data
center.  I once got a look at the extent of Google's hardware in one
data center - it was HUGE.  I couldn't get in to examine things closely,
they keep that stuff very locked down.

Thanks,
Shawn



RE: SolrCloud loadbalancing, replication, and failover

2013-04-18 Thread David Parks
Wow! That was the most pointed, concise discussion of hardware requirements
I've seen to date, and it's fabulously helpful, thank you Shawn!  We
currently have 2 servers that I can dedicate about 12GB of ram to Solr on
(we're moving to these 2 servers now). I can upgrade further if it's needed
& justified, and your discussion helps me justify that such an upgrade is
the right thing to do.

So... If I move to 3 servers with 50GB of RAM each, using 3 shards, I should
be in the free and clear then right?  This seems reasonable and doable.

In this more extreme example the failover properties of solr cloud become
more clear. I couldn't possibly run a replica shard without doubling the
memory, so really replication isn't reasonable until I have double the
hardware, then the load balancing scheme makes perfect sense. With 3
servers, 50GB of RAM and 120GB index I should just backup the index
directory I think.

My previous thought to run replication just for failover would have actually
resulted in LOWER performance because I would have halved the memory
available to the master & replica. So the previous question is answered as
well now.

Question: if I had 1 server with 60GB of memory and 120GB index, would solr
make full use of the 60GB of memory? Thus trimming disk access in half. Or
is it an all-or-nothing thing?  In a dev environment, I didn't notice SOLR
consuming the full 5GB of RAM assigned to it with a 120GB index.

Dave


-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: Friday, April 19, 2013 11:51 AM
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud loadbalancing, replication, and failover

On 4/18/2013 8:12 PM, David Parks wrote:
 I think I still don't understand something here. 
 
 My concern right now is that query times are very slow for 120GB index 
 (14s on avg), I've seen a lot of disk activity when running queries.
 
 I'm hoping that distributing that query across 2 servers is going to 
 improve the query time, specifically I'm hoping that we can distribute 
 that disk activity because we don't have great disks on there (yet).
 
 So, with disk IO being a factor in mind, running the query on one box, vs.
 across 2 *should* be a concern right?
 
 Admittedly, this is the first step in what will probably be many to 
 try to work our query times down from 14s to what I want to be around 1s.

I went through my mailing list archive to see what all you've said about
your setup.  One thing that I can't seem to find is a mention of how much
total RAM is in each of your servers.  I apologize if it was actually there
and I overlooked it.

In one email thread, you wanted to know whether Solr is CPU-bound or
IO-bound.  Solr is heavily reliant on the index on disk, and disk I/O is the
slowest piece of the puzzle. The way to get good performance out of Solr is
to have enough memory that you can take the disk mostly out of the equation
by having the operating system cache the index in RAM.  If you don't have
enough RAM for that, then Solr becomes IO-bound, and your CPUs will be busy
in iowait, unable to do much real work.  If you DO have enough RAM to cache
all (or most) of your index, then Solr will be CPU-bound.

With 120GB of total index data on each server, you would want at least 128GB
of RAM per server, assuming you are only giving 8-16GB of RAM to Solr, and
that Solr is the only thing running on the machine.  If you have more
servers and shards, you can reduce the per-server memory requirement because
the amount of index data on each server would go down.  I am aware of the
cost associated with this kind of requirement - each of my Solr servers has
64GB.

If you are sharing the server with another program, then you want to have
enough RAM available for Solr's heap, Solr's data, the other program's heap,
and the other program's data.  Some programs (like
MySQL) completely skip the OS disk cache and instead do that caching
themselves with heap memory that's actually allocated to the program.
If you're using a program like that, then you wouldn't need to count its
data.

Using SSDs for storage can speed things up dramatically and may reduce the
total memory requirement to some degree, but even an SSD is slower than RAM.
The transfer speed of RAM is faster, and from what I understand, the latency
is at least an order of magnitude quicker - nanoseconds vs microseconds.

In another thread, you asked about how Google gets such good response times.
Although Google's software probably works differently than Solr/Lucene, when
it comes right down to it, all search engines do similar jobs and have
similar requirements.  I would imagine that Google gets incredible response
time because they have incredible amounts of RAM at their disposal that keep
the important bits of their index instantly available.  They have thousands
of servers in each data center.  I once got a look at the extent of Google's
hardware in one data center - it was HUGE.  I couldn't get in to examine
things