Re: Solr hardware memory question

2013-12-12 Thread Michael Della Bitta
Hello, Gil,

I'm wondering if you've been in touch with the Hathi Trust people, because
I imagine your use cases are somewhat similar.

They've done some blogging around getting digitized texts indexed at scale,
which is what I assume you're doing:

http://www.hathitrust.org/blogs/Large-scale-Search

Michael Della Bitta

Applications Developer

o: +1 646 532 3062  | c: +1 917 477 7906

appinions inc.

“The Science of Influence Marketing”

18 East 41st Street

New York, NY 10017

t: @appinions <https://twitter.com/Appinions> | g+:
plus.google.com/appinions<https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>


On Thu, Dec 12, 2013 at 5:10 AM, Hoggarth, Gil  wrote:

> Thanks for this - I haven't any previous experience with utilising SSDs in
> the way you suggest, so I guess I need to start learning! And thanks for
> the Danish-webscale URL, looks like very informed reading. (Yes, I think
> we're working in similar industries with similar constraints and
> expectations).
>
> Compiling my answers into one email: "Curious how many documents per
> shard you were planning? The number of documents per shard and field type
> will drive the amount of RAM needed to sort and facet."
> - Number of documents per shard: I think about 200 million. That's a bit
> of a rough estimate based on other Solrs we run, though, which I think means
> we hold a lot of data for each document, though I keep arguing to keep this
> to the truly required minimum. We also have many facets, some of which are
> pretty large (I'm stretching my understanding here, but I think most
> documents have many 'entries' in many facets, so these really hit us
> performance-wise.)
>
> I try to keep a 1-to-1 ratio of Solr nodes to CPUs with a few spare for
> the operating system. I utilise MMapDirectory to manage memory via the OS.
> So at this moment I'm guessing that we'll have 56 Solr-dedicated CPUs across
> 2 physical 32-CPU servers and _hopefully_ 256GB RAM on each. This would
> give 28 shards, each with 5GB Java memory (in Tomcat), leaving
> 126GB on each server for the OS and MMap. (I believe the Solr theory for
> this doesn't accurately work out but we can accept the edge cases where
> this will fail.)
>
> I can also see that our hardware requirements will depend on usage as
> well as the volume of data, and I've been pondering how best we can
> structure our index/es to facilitate a long term service (which means that,
> given it's a lot of data, I need to structure the data so that new usage
> doesn't require re-indexing.) But at this early stage, as people say, we
> need to prototype, test, profile etc. and to do that I need the hardware to
> run the trials (policy dictates that I buy the production hardware now,
> before profiling - I get to control much of the design and construction so
> I don't argue with this!)
>
> Thanks for all the comments everyone, all very much appreciated :)
> Gil
>
>
> -Original Message-
> From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
> Sent: 11 December 2013 12:02
> To: solr-user@lucene.apache.org
> Subject: Re: Solr hardware memory question
>
> On Tue, 2013-12-10 at 17:51 +0100, Hoggarth, Gil wrote:
> > We're probably going to be building a Solr service to handle a dataset
> > of ~60TB, which for our data and schema typically gives a Solr index
> > size of 1/10th - i.e., 6TB. Given there's a general rule that the
> > amount of hardware memory required should exceed the size of the Solr
> > index (exceeding it to also allow for the operating system etc.), how have
> > people handled this situation?
>
> By acknowledging that it is cheaper to buy SSDs instead of trying to
> compensate for slow spinning drives with excessive amounts of RAM.
>
> Our plans for an estimated 20TB of indexes out of 372TB of raw web data are
> to use SSDs controlled by a single machine with 512GB of RAM (or was it
> 256GB? I'll have to ask the hardware guys):
> https://sbdevel.wordpress.com/2013/12/06/danish-webscale/
>
> As always YMMV, and the numbers you quote elsewhere indicate that your
> queries are quite complex. You might want to do a bit of profiling to see
> if they are heavy enough to make the CPU the bottleneck.
>
> Regards,
> Toke Eskildsen, State and University Library, Denmark
>
>
>


RE: Solr hardware memory question

2013-12-12 Thread Toke Eskildsen
On Thu, 2013-12-12 at 11:10 +0100, Hoggarth, Gil wrote:
> Thanks for this - I haven't any previous experience with utilising SSDs
> in the way you suggest, so I guess I need to start learning!

There's a bit of a divide in the Lucene/Solr world on this. Everybody
agrees that SSDs in themselves are great for Lucene/Solr searches,
compared to a spinning-drive solution. How much better is another
matter and the issue gets confusing when RAM caching is factored in.

Some are also very concerned about the reliability of SSDs and the write
performance degradation without TRIM (you need to have a quite specific
setup to have TRIM enabled on a server with SSDs in RAID). Guessing that
your 6TB index is not heavily updated, the TRIM part should not be one
of your worries though.

At Statsbiblioteket, we have been using SSDs for our search servers
since 2008. That was back when random write performance was horrible and
a large drive was 64GB. As you have probably guessed, we are very much
in the SSD camp.

We have done some testing and for simple searches (i.e. a lot of IO and
comparatively little CPU usage), we have observed that SSDs + 10% index
size RAM for caching deliver something like 80% of pure RAM speed.
https://sbdevel.wordpress.com/2013/06/06/memory-is-overrated/

Your mileage will surely vary.

> [...] leaving 126GB on each server for the OS and MMap. [...]

So about the same as your existing 3TB setup? Seems like you will get
the same performance then. I must say that 1 minute response times would
be very hard to sell at our library, even for a special search only used
by a small and dedicated audience. Even your goal of 20 seconds seems
adverse to exploratory search.

May I be so frank as to suggest a course of action? Buy one ½ TB Samsung
840 EVO SSD, fill it with indexes and test it in a machine with 32GB of
RAM, thus matching the 1/20 index-size-to-RAM ratio that your servers will
have. Such a drive costs £250 on Amazon and the experiment would spare you
a lot of speculation and time.

Next, conclude that SSDs are the obvious choice and secure the 840 for
your workstation with reference to "further testing".
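
For anyone repeating that kind of experiment, a small timing harness along the
following lines makes the cold-versus-warm comparison concrete. This is only a
sketch under assumptions: the Solr URL, core name, and queries are placeholders
to be replaced with your own, and the cold run presumes the OS page cache has
been dropped first (e.g. "sync; echo 3 > /proc/sys/vm/drop_caches" as root).

    # Hypothetical sketch: time a handful of queries against a local Solr core.
    # Run once on a cold page cache, then again immediately for warm numbers.
    import time
    import urllib.parse
    import urllib.request

    SOLR = "http://localhost:8983/solr/collection1/select"        # assumed endpoint
    QUERIES = ["london", "content:library AND content:archive"]   # placeholder queries

    for q in QUERIES:
        params = urllib.parse.urlencode({"q": q, "rows": 10, "wt": "json"})
        start = time.perf_counter()
        with urllib.request.urlopen(SOLR + "?" + params) as resp:
            resp.read()
        print("%s: %.2f s" % (q, time.perf_counter() - start))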

> I can also see that our hardware requirements will depend on usage
> as well as the volume of data, and I've been pondering how best we can
> structure our index/es to facilitate a long term service (which means
> that, given it's a lot of data, I need to structure the data so that
> new usage doesn't require re-indexing.)

We definitely have this problem too. We have resigned ourselves to
re-indexing the data after some months of real-world usage.

Regards,
Toke Eskildsen, State and University Library, Denmark



RE: Solr hardware memory question

2013-12-12 Thread Hoggarth, Gil
Thanks for this - I haven't any previous experience with utilising SSDs in the 
way you suggest, so I guess I need to start learning! And thanks for the 
Danish-webscale URL, looks like very informed reading. (Yes, I think we're 
working in similar industries with similar constraints and expectations).

Compiling my answers into one email: "Curious how many documents per shard
you were planning? The number of documents per shard and field type will drive
the amount of RAM needed to sort and facet."
- Number of documents per shard: I think about 200 million. That's a bit of a
rough estimate based on other Solrs we run, though, which I think means we hold
a lot of data for each document, though I keep arguing to keep this to the
truly required minimum. We also have many facets, some of which are pretty
large (I'm stretching my understanding here, but I think most documents have
many 'entries' in many facets, so these really hit us performance-wise.)

I try to keep a 1-to-1 ratio of Solr nodes to CPUs with a few spare for the 
operating system. I utilise MMapDirectory to manage memory via the OS. So at 
this moment I'm guessing that we'll have 56 Solr-dedicated CPUs across 2 physical
32-CPU servers and _hopefully_ 256GB RAM on each. This would give 28 shards, and
each would have 5GB Java memory (in Tomcat), leaving 126GB on each server for
the OS and MMap. (I believe the Solr theory for this doesn't accurately work 
out but we can accept the edge cases where this will fail.)
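
As a back-of-the-envelope check on that split (purely illustrative, using the
figures above; note that 28 x 5GB of heap actually leaves about 116GB rather
than 126GB before per-JVM overhead is counted, and MMapDirectory reads the
index through the OS page cache rather than the Java heap):

    # Rough per-server memory budget sketch for the layout described above.
    total_ram_gb = 256         # RAM per physical server (hoped for)
    shards_per_server = 28     # one Solr core per dedicated CPU
    heap_per_shard_gb = 5      # -Xmx per shard, running in Tomcat

    heap_total = shards_per_server * heap_per_shard_gb    # 140 GB of Java heap
    page_cache = total_ram_gb - heap_total                 # ~116 GB for the OS and MMap
    print("heap %d GB, left for OS page cache ~%d GB" % (heap_total, page_cache))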

I can also see that our hardware requirements will depend on usage as well 
as the volume of data, and I've been pondering how best we can structure our 
index/es to facilitate a long term service (which means that, given it's a lot 
of data, I need to structure the data so that new usage doesn't require 
re-indexing.) But at this early stage, as people say, we need to prototype, 
test, profile etc. and to do that I need the hardware to run the trials (policy 
dictates that I buy the production hardware now, before profiling - I get to 
control much of the design and construction so I don't argue with this!) 

Thanks for all the comments everyone, all very much appreciated :)
Gil


-Original Message-
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk] 
Sent: 11 December 2013 12:02
To: solr-user@lucene.apache.org
Subject: Re: Solr hardware memory question

On Tue, 2013-12-10 at 17:51 +0100, Hoggarth, Gil wrote:
> We're probably going to be building a Solr service to handle a dataset 
> of ~60TB, which for our data and schema typically gives a Solr index 
> > size of 1/10th - i.e., 6TB. Given there's a general rule that the 
> > amount of hardware memory required should exceed the size of the Solr 
> > index (exceeding it to also allow for the operating system etc.), how have 
> people handled this situation?

By acknowledging that it is cheaper to buy SSDs instead of trying to compensate 
for slow spinning drives with excessive amounts of RAM. 

> Our plans for an estimated 20TB of indexes out of 372TB of raw web data are to 
use SSDs controlled by a single machine with 512GB of RAM (or was it 256GB? 
I'll have to ask the hardware guys):
https://sbdevel.wordpress.com/2013/12/06/danish-webscale/

> As always YMMV, and the numbers you quote elsewhere indicate that your queries 
> are quite complex. You might want to do a bit of profiling to see if they are 
heavy enough to make the CPU the bottleneck.

Regards,
Toke Eskildsen, State and University Library, Denmark




Re: Solr hardware memory question

2013-12-12 Thread Toke Eskildsen
On Thu, 2013-12-12 at 02:46 +0100, Joel Bernstein wrote:
> Curious how many documents per shard you were planning?

350-500 million, optimized to a single segment as the data are not
changing.

> The number of documents per shard and field type will drive the amount
> of RAM needed to sort and facet. 

Very true. It makes a lot of sense to separate RAM requirements for the
Lucene/Solr structures and OS-caching.

It seems that Gil is working on about the same project as we are, so I
will elaborate in this thread:

We would like to perform some sort of grouping on URL, so that the same
page harvested at different points in time is only displayed once. This
is probably the heaviest functionality as the cardinality of the field
will be near the number of documents.

For plain(er) faceting, things like MIME-type, harvest date and site
seem relevant. Those fields have lower cardinality and they are
single-valued, so the memory requirement is something like
  #docs * log2(#unique_values) bits
With 500M documents and 1000 unique values, that is about 600MB per shard.
With 20 shards, we are looking at 12GB per simple facet field.
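
That estimate is easy to reproduce; a short sketch of the arithmetic (the
formula above is an approximation for packed single-valued fields, not an
exact accounting of Lucene's structures):

    # Memory estimate for one single-valued facet field, per the
    # #docs * log2(#unique_values) bits approximation quoted above.
    import math

    def facet_field_mb(num_docs, num_unique_values):
        bits = num_docs * math.ceil(math.log2(num_unique_values))
        return bits / 8 / 1024 / 1024

    per_shard_mb = facet_field_mb(500_000_000, 1000)   # roughly 600 MB
    print("per shard: %.0f MB, 20 shards: %.1f GB"
          % (per_shard_mb, per_shard_mb * 20 / 1024))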

Regards,
Toke Eskildsen





Re: Solr hardware memory question

2013-12-11 Thread Otis Gospodnetic
Hi Gil,

I'd look at the number and type of fields you sort and facet on (this stuff
likes memory).
I'd keep in mind that heaps over 32 GB use bigger (uncompressed) pointers,
so maybe several smaller heaps are better than one big one.
You didn't mention the # of CPU cores, but keep that in mind when sharding.
 When a query comes in, you want to put all your CPU cores to work.
...

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Dec 10, 2013 at 11:51 AM, Hoggarth, Gil  wrote:

> We're probably going to be building a Solr service to handle a dataset
> of ~60TB, which for our data and schema typically gives a Solr index
> size of 1/10th - i.e., 6TB. Given there's a general rule that the
> amount of hardware memory required should exceed the size of the Solr
> index (exceeding it to also allow for the operating system etc.), how have
> people handled this situation? Do I really need, for example, 12 servers
> with 512GB RAM, or are there other techniques to handling this?
>
>
>
> Many thanks in advance for any general/conceptual/specific
> ideas/comments/answers!
>
> Gil
>
>
>
>
>
> Gil Hoggarth
>
> Web Archiving Technical Services Engineer
>
> The British Library, Boston Spa, West Yorkshire, LS23 7BQ
>
>


Re: Solr hardware memory question

2013-12-11 Thread Joel Bernstein
Curious how many documents per shard you were planning? The number of
documents per shard and field type will drive the amount of RAM needed to
sort and facet.


On Wed, Dec 11, 2013 at 7:02 AM, Toke Eskildsen wrote:

> On Tue, 2013-12-10 at 17:51 +0100, Hoggarth, Gil wrote:
> > We're probably going to be building a Solr service to handle a dataset
> > of ~60TB, which for our data and schema typically gives a Solr index
> > size of 1/10th - i.e., 6TB. Given there's a general rule that the
> > amount of hardware memory required should exceed the size of the Solr
> > index (exceeding it to also allow for the operating system etc.), how have
> > people handled this situation?
>
> By acknowledging that it is cheaper to buy SSDs instead of trying to
> compensate for slow spinning drives with excessive amounts of RAM.
>
> Our plans for an estimated 20TB of indexes out of 372TB of raw web data
> are to use SSDs controlled by a single machine with 512GB of RAM (or was
> it 256GB? I'll have to ask the hardware guys):
> https://sbdevel.wordpress.com/2013/12/06/danish-webscale/
>
> As always YMMV, and the numbers you quote elsewhere indicate that your
> queries are quite complex. You might want to do a bit of profiling to
> see if they are heavy enough to make the CPU the bottleneck.
>
> Regards,
> Toke Eskildsen, State and University Library, Denmark
>
>
>


-- 
Joel Bernstein
Search Engineer at Heliosearch


Re: Solr hardware memory question

2013-12-11 Thread Toke Eskildsen
On Tue, 2013-12-10 at 17:51 +0100, Hoggarth, Gil wrote:
> We're probably going to be building a Solr service to handle a dataset
> of ~60TB, which for our data and schema typically gives a Solr index
> size of 1/10th - i.e., 6TB. Given there's a general rule that the
> amount of hardware memory required should exceed the size of the Solr
> index (exceeding it to also allow for the operating system etc.), how have
> people handled this situation?

By acknowledging that it is cheaper to buy SSDs instead of trying to
compensate for slow spinning drives with excessive amounts of RAM. 

Our plans for an estimated 20TB of indexes out of 372TB of raw web data
are to use SSDs controlled by a single machine with 512GB of RAM (or was
it 256GB? I'll have to ask the hardware guys):
https://sbdevel.wordpress.com/2013/12/06/danish-webscale/

As always YMMV, and the numbers you quote elsewhere indicate that your
queries are quite complex. You might want to do a bit of profiling to
see if they are heavy enough to make the CPU the bottleneck.

Regards,
Toke Eskildsen, State and University Library, Denmark




Re: Solr hardware memory question

2013-12-10 Thread Ryan Cutter
Shawn's right that if you're going to scale this big you'd be very well
served to spend time getting the index as small as possible.  In my
experience if your searches require real-time random access reads (that is,
the entire index needs to be fast), you don't want to wait for HDD reads.

Getting everything in RAM is best but 6TB per replica (perhaps you'll want
more than 1 replica?) is a tall order.  SSDs are coming down in price.
 Flash memory tech is advancing quickly (Fusion-io and the like).

Sounds like an interesting use case!

Thanks, Ryan


On Tue, Dec 10, 2013 at 9:37 AM, Shawn Heisey  wrote:

> On 12/10/2013 9:51 AM, Hoggarth, Gil wrote:
> > We're probably going to be building a Solr service to handle a dataset
> > of ~60TB, which for our data and schema typically gives a Solr index
> > size of 1/10th - i.e., 6TB. Given there's a general rule that the
> > amount of hardware memory required should exceed the size of the Solr
> > index (exceeding it to also allow for the operating system etc.), how have
> > people handled this situation? Do I really need, for example, 12 servers
> > with 512GB RAM, or are there other techniques to handling this?
>
> That really depends on what kind of query volume you'll have and what
> kind of performance you want.  If your query volume is low and you can
> deal with slow individual queries, then you won't need that much memory.
>  If either of those requirements increases, you'd probably need more
> memory, up to the 6TB total -- or 12TB if you need to double the total
> index size for redundancy purposes.  If your index is constantly growing
> like most are, you need to plan for that too.
>
> Putting the entire index into RAM is required for *top* performance, but
> not for base functionality.  It might be possible to put only a fraction
> of your index into RAM.  Only testing can determine what you really need
> to obtain the performance you're after.
>
> Perhaps you've already done this, but you should try as much as possible
> to reduce your index size.  Store as few fields as possible, only just
> enough to build a search result list/grid and retrieve the full document
> from the canonical data store.  Save termvectors and docvalues on as few
> fields as possible.  If you can, reduce the number of terms produced by
> your analysis chains.
>
> Thanks,
> Shawn
>
>


RE: Solr hardware memory question

2013-12-10 Thread Hoggarth, Gil
Thanks Shawn. You're absolutely right about the performance balance,
though it's good to hear it from an experienced source (if you don't
mind me calling you that!) Fortunately we don't have a top performance
requirement, and we have a small audience so a low query volume. On
similar systems we're "managing" to just provide a Solr service with a
3TB index size on 160GB RAM, though we have scripts to handle the
occasionally necessary service restart when someone submits a more
exotic query. This, btw, gives a response time of ~45-90 seconds for
uncached queries. My question I suppose comes from my hope that we can
do away with the restart scripts as I doubt they help the Solr service
(as they can if necessary just kill processes and restart), and get to
response times < 20 seconds.

-Original Message-
From: Shawn Heisey [mailto:s...@elyograg.org] 
Sent: 10 December 2013 17:37
To: solr-user@lucene.apache.org
Subject: Re: Solr hardware memory question

On 12/10/2013 9:51 AM, Hoggarth, Gil wrote:
> We're probably going to be building a Solr service to handle a dataset

> of ~60TB, which for our data and schema typically gives a Solr index 
> size of 1/10th - i.e., 6TB. Given there's a general rule that the 
> amount of hardware memory required should exceed the size of the Solr 
> index (exceeding it to also allow for the operating system etc.), how have 
> people handled this situation? Do I really need, for example, 12 
> servers with 512GB RAM, or are there other techniques to handling
this?

That really depends on what kind of query volume you'll have and what
kind of performance you want.  If your query volume is low and you can
deal with slow individual queries, then you won't need that much memory.
 If either of those requirements increases, you'd probably need more
memory, up to the 6TB total -- or 12TB if you need to double the total
index size for redundancy purposes.  If your index is constantly growing
like most are, you need to plan for that too.

Putting the entire index into RAM is required for *top* performance, but
not for base functionality.  It might be possible to put only a fraction
of your index into RAM.  Only testing can determine what you really need
to obtain the performance you're after.

Perhaps you've already done this, but you should try as much as possible
to reduce your index size.  Store as few fields as possible, only just
enough to build a search result list/grid and retrieve the full document
from the canonical data store.  Save termvectors and docvalues on as few
fields as possible.  If you can, reduce the number of terms produced by
your analysis chains.

Thanks,
Shawn



Re: Solr hardware memory question

2013-12-10 Thread Shawn Heisey
On 12/10/2013 9:51 AM, Hoggarth, Gil wrote:
> We're probably going to be building a Solr service to handle a dataset
> of ~60TB, which for our data and schema typically gives a Solr index
> size of 1/10th - i.e., 6TB. Given there's a general rule that the
> amount of hardware memory required should exceed the size of the Solr
> index (exceeding it to also allow for the operating system etc.), how have
> people handled this situation? Do I really need, for example, 12 servers
> with 512GB RAM, or are there other techniques to handling this?

That really depends on what kind of query volume you'll have and what
kind of performance you want.  If your query volume is low and you can
deal with slow individual queries, then you won't need that much memory.
 If either of those requirements increases, you'd probably need more
memory, up to the 6TB total -- or 12TB if you need to double the total
index size for redundancy purposes.  If your index is constantly growing
like most are, you need to plan for that too.

Putting the entire index into RAM is required for *top* performance, but
not for base functionality.  It might be possible to put only a fraction
of your index into RAM.  Only testing can determine what you really need
to obtain the performance you're after.

Perhaps you've already done this, but you should try as much as possible
to reduce your index size.  Store as few fields as possible, only just
enough to build a search result list/grid and retrieve the full document
from the canonical data store.  Save termvectors and docvalues on as few
fields as possible.  If you can, reduce the number of terms produced by
your analysis chains.

Thanks,
Shawn



Solr hardware memory question

2013-12-10 Thread Hoggarth, Gil
We're probably going to be building a Solr service to handle a dataset
of ~60TB, which for our data and schema typically gives a Solr index
size of 1/10th - i.e., 6TB. Given there's a general rule that the
amount of hardware memory required should exceed the size of the Solr
index (exceeding it to also allow for the operating system etc.), how have
people handled this situation? Do I really need, for example, 12 servers
with 512GB RAM, or are there other techniques to handling this?
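
The "12 servers with 512GB" figure follows directly from the assumptions above
(an index of roughly 1/10th of the raw data, and enough RAM across the cluster
to cover the whole index); a quick sketch of that arithmetic:

    # Back-of-the-envelope sizing behind the question above.
    import math

    raw_data_tb = 60
    index_ratio = 0.10            # observed index-size to raw-data ratio
    ram_per_server_gb = 512

    index_tb = raw_data_tb * index_ratio                        # 6 TB of index
    servers = math.ceil(index_tb * 1024 / ram_per_server_gb)    # 12 servers
    print("index ~%d TB -> ~%d servers at %d GB RAM each" %
          (index_tb, servers, ram_per_server_gb))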

 

Many thanks in advance for any general/conceptual/specific
ideas/comments/answers!

Gil

 

 

Gil Hoggarth

Web Archiving Technical Services Engineer 

The British Library, Boston Spa, West Yorkshire, LS23 7BQ