On 1/8/2015 3:16 AM, Toke Eskildsen wrote:
> On Wed, 2015-01-07 at 22:26 +0100, Joseph Obernberger wrote:
>> Thank you Toke - yes - the data is indexed throughout the day. We are
>> handling very few searches - probably 50 a day; this is an R&D system.
> If your searches are in small bundles, you could pause the indexing
> flow while the searches are executed, for better performance.
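>
> Something along these lines could do the coordination (an untested
> sketch with made-up names, not tied to any Solr API; indexer threads
> share the read lock, a search burst takes the write lock so in-flight
> batches drain before the searches run):
>
>   import java.util.concurrent.locks.ReentrantReadWriteLock;
>
>   public class IndexPauser {
>       // Fair lock so a waiting search burst is not starved by a
>       // steady stream of index batches.
>       private final ReentrantReadWriteLock lock =
>               new ReentrantReadWriteLock(true);
>
>       // Indexer threads call this once per batch.
>       public void indexBatch(Runnable sendBatchToSolr) {
>           lock.readLock().lock();
>           try {
>               sendBatchToSolr.run();
>           } finally {
>               lock.readLock().unlock();
>           }
>       }
>
>       // Pauses new index batches while the searches execute.
>       public void runSearchBurst(Runnable searches) {
>           lock.writeLock().lock();
>           try {
>               searches.run();
>           } finally {
>               lock.writeLock().unlock();
>           }
>       }
>   }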
>> Our HDFS cache, I believe, is too small at 10GBytes per shard.
> That depends a lot on your corpus, your searches and the underlying
> storage. But with our current level of information, it is a good bet:
> 10GB of cache per 130GB (270GB?) of data is not a lot with spinning
> drives.
Yes - it would be 20GBytes of cache per 270GBytes of data, so roughly 7%
of the index fits in cache.
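
For reference, the block cache is sized in solrconfig.xml via the
HdfsDirectoryFactory. The parameter names below are from the Solr Ref
Guide; the values are just one way to arrive at 10GB per shard (slab
count x blocks per bank x 8KB block size):

  <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
    <str name="solr.hdfs.home">hdfs://namenode:8020/solr</str>
    <bool name="solr.hdfs.blockcache.enabled">true</bool>
    <!-- off-heap allocation; MaxDirectMemorySize must cover this -->
    <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
    <!-- 80 slabs x 16384 blocks x 8KB = 10GB of block cache -->
    <int name="solr.hdfs.blockcache.slab.count">80</int>
    <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
  </directoryFactory>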
>> Current parameters for running each shard are:
>> JAVA_OPTS="-XX:MaxDirectMemorySize=10g -XX:+UseLargePages -XX:NewRatio=3
>> [...]
>> -Xmx10752m"
> One Solr per shard? You could probably gain a bit by running one Solr
> per machine instead. Anyway, that is quite a high Xmx, but I presume
> you have measured the memory needs.
We've tried a lower Xmx, but we get OOM errors during faceting of large
datasets. Right now we're running two JVMs per physical box (2 shards
per box), but we're going to be changing that to one JVM and one shard
per box.
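
For what it's worth, the rough memory math per box with the current
settings (taking both figures from the JAVA_OPTS above):

  10 GB direct (block cache) + 10.5 GB heap (-Xmx10752m) ~= 20.5 GB/JVM
  x 2 JVMs per box                                       ~= 41 GB

so going to one JVM per box should free roughly 20GB that can go to a
bigger block cache or to the OS.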
>> I'd love to try SSDs, but don't have the budget at present to go that
>> route.
> We find the price/performance of SSDs + moderate RAM to be a much
> better deal than spinning drives + a lot of RAM, even when buying
> enterprise hardware. With consumer SSDs (which we use in our large
> server), the case for SSDs is even stronger. It all depends on the use
> pattern of course, but your setup with non-concurrent searches seems
> like it would fit well.
> Note: I am sure that RAM == index size would deliver very high
> performance. With enough RAM you could use tape to hold the index;
> whether it is cost effective is another matter.
Ha! Yes - our index is accessible via a 2400 baud modem, but we have
lots of cache! ;)
>> I'd really like to get the HDFS option to work well as it
>> reduces system complexity.
> That is very understandable. We examined the option of networked
> storage (Isilon) with underlying spindles, and it performed adequately
> for our needs up to 2-3TB of index data. Unfortunately the heavy random
> read load from Solr meant a noticeable degradation of other services
> using the networked storage. I am sure it could be solved with more
> centralized hardware, but in the end we found it cheaper and simpler to
> use local storage for search. This will of course differ across
> organizations and setups.
We're going to experiment with one shard per box and more RAM cache per
shard and see where that gets us; we'll also be adding more shards.
Thanks for the tips!
Interesting that you mention Isilon, as we're planning to do an eval of
their product this year where we'll be testing out their HDFS layer.
It's a potential way to balance compute and storage, since you can add
HDFS storage without adding compute.
> - Toke Eskildsen
-Joe