Re: Solr memory reqs for time-sorted data

2018-09-07 Thread Shawn Heisey

On 9/7/2018 8:39 AM, Pavel Micka wrote:

I found on the wiki (https://wiki.apache.org/solr/SolrPerformanceProblems#RAM)
that the optimal amount of RAM for Solr is equal to the index size. This is,
let's say, the ideal case, where everything is in memory.

I wrote that page.


We plan to have a small installation with 2 nodes and 8 shards. The cluster
will hold 100M documents, and we expect each document to take about 5 kB to
index. With a fully in-memory index, this would mean the two nodes require
~500 GB of RAM, i.e. 2x 256 GB to have everything in memory. And those are
really big machines... Is this calculation still correct for recent Solr
versions?

And our problem is somewhat restricted: our data are time-based logs, and
searches are generally restricted to the last 3 months, which will match,
let's say, 10M documents. How will this affect Solr's memory requirements?
Will we still need the whole inverted index in memory? Or is there some
internal optimization which ensures that only part of it needs to be in
memory?

The questions:

1)  Is the 500GB memory requirement a correct assumption?


There are two things that Solr needs memory for.  One is Solr's heap, 
which is memory directly used by Solr itself.  The other is unused 
memory, which the operating system will use to cache data on disk.  Solr 
performance is helped dramatically by the latter kind of memory.


For *OPTIMAL* performance with a 500GB index, you need 500GB of memory 
for the OS to cache the data.  This is memory that is not used by 
programs, including Solr's heap.
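
As a rough worked example, assuming your 5 kB per document estimate reflects
the on-disk index size (these numbers are only an illustration):

  100,000,000 docs * 5 kB/doc        ~= 500 GB of index in total
  500 GB of index / 2 nodes          ~= 250 GB of index data per node
  250 GB cache + Solr heap + OS      >  256 GB of RAM per node

So 256 GB machines would fall just short of fully caching the index once you
subtract Solr's heap and the operating system itself.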


For *good* performance, it's rare that you will need enough memory to 
cache the entire index.  But I cannot tell you with any reliability how 
much of the index you must be able to cache.  Some people are doing fine 
with only a few percent of their index cached.  Others see terrible 
performance unless they can get 75 percent of the index cached.
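
To put those percentages in terms of this index (just arithmetic on the
500 GB estimate): a few percent cached is on the order of 10-25 GB across
the cluster, while 75 percent is roughly 375 GB.  Where your installation
falls in that range depends on your data and queries, which is why I can't
give you a number in advance.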



2)  Will the fact that we have time-based logs, with the majority of accesses
going to recent data only, help?


Yes, it most likely will help, and reduce your memory requirements.
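
For what it's worth, the usual way to take advantage of that access pattern
is to restrict queries with a time-range filter query, so that Lucene mostly
touches the recent segments and the OS only has to keep that recent slice of
the index cached.  A minimal SolrJ sketch of such a query (the collection
name "logs" and the date field name "timestamp" are assumptions, not taken
from your setup):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RecentLogsQuery {
    public static void main(String[] args) throws Exception {
        // Base URL and collection name are assumptions; adjust as needed.
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/logs").build();
        try {
            SolrQuery q = new SolrQuery("message:error");
            // Range filter on an assumed "timestamp" date field, limiting
            // the query to roughly the last 3 months.  The lower bound is
            // rounded to the day so the filter string stays stable across
            // requests and the filterCache entry can be reused.
            q.addFilterQuery("timestamp:[NOW/DAY-3MONTHS TO *]");
            q.setRows(10);
            QueryResponse rsp = client.query(q);
            System.out.println("Matches: " + rsp.getResults().getNumFound());
        } finally {
            client.close();
        }
    }
}

The important part is the filter query: putting the time restriction in an
fq rather than the main query string means the same cached filter can be
reused by every search against the 3-month window.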


3)  Is there a best practice for reducing the RAM required by Solr?


The biggest thing you can do is to reduce the size of the index, so 
there is less data that must be accessed for a query.  The page you 
referenced lists some things you might be able to do to reduce Solr's 
heap requirements.  If you reduce the heap requirements, then more of 
the server's memory is available for caching.


Thanks,
Shawn



Solr memory reqs for time-sorted data

2018-09-07 Thread Pavel Micka
Hi,

I found on the wiki (https://wiki.apache.org/solr/SolrPerformanceProblems#RAM)
that the optimal amount of RAM for Solr is equal to the index size. This is,
let's say, the ideal case, where everything is in memory.

We plan to have a small installation with 2 nodes and 8 shards. The cluster
will hold 100M documents, and we expect each document to take about 5 kB to
index. With a fully in-memory index, this would mean the two nodes require
~500 GB of RAM, i.e. 2x 256 GB to have everything in memory. And those are
really big machines... Is this calculation still correct for recent Solr
versions?

And our problem is somewhat restricted: our data are time-based logs, and
searches are generally restricted to the last 3 months, which will match,
let's say, 10M documents. How will this affect Solr's memory requirements?
Will we still need the whole inverted index in memory? Or is there some
internal optimization which ensures that only part of it needs to be in
memory?

The questions:

1)  Is the 500GB memory requirement a correct assumption?

2)  Will the fact that we have time-based logs, with the majority of accesses
going to recent data only, help?

3)  Is there a best practice for reducing the RAM required by Solr?



Thanks in advance!

Pavel


Side note:
We were thinking about partitioning the data with Time Routed Aliases, but
unfortunately we need to ensure disaster recovery over a poor network
connection, and TRA and Cross Data Center Replication are not compatible.
(CDCR requires a static number of cores, while TRA creates cores dynamically.)