Re: Solr hardware memory question
Hello, Gil. I'm wondering if you've been in touch with the HathiTrust people, because I imagine your use cases are somewhat similar. They've done some blogging around getting digitized texts indexed at scale, which is what I assume you're doing: http://www.hathitrust.org/blogs/Large-scale-Search

Michael Della Bitta
Applications Developer
o: +1 646 532 3062 | c: +1 917 477 7906
appinions inc. "The Science of Influence Marketing"
18 East 41st Street, New York, NY 10017
t: @appinions <https://twitter.com/Appinions> | g+: plus.google.com/appinions <https://plus.google.com/u/0/b/112002776285509593336/112002776285509593336/posts>
w: appinions.com <http://www.appinions.com/>

On Thu, Dec 12, 2013 at 5:10 AM, Hoggarth, Gil wrote:
> Thanks for this - I haven't any previous experience with utilising SSDs in
> the way you suggest, so I guess I need to start learning! And thanks for
> the Danish-webscale URL, looks like very informed reading. (Yes, I think
> we're working in similar industries with similar constraints and
> expectations.)
>
> Compiling my answers into one email: "Curious how many documents per
> shard you were planning? The number of documents per shard and field type
> will drive the amount of RAM needed to sort and facet."
> - Number of documents per shard, I think about 200 million. That's a bit
> of a rough estimate based on other Solrs we run, though. Which I think means
> we hold a lot of data for each document, though I keep arguing to keep this
> to the truly required minimum. We also have many facets, some of which are
> pretty large. (I'm stretching my understanding here, but I think most
> documents have many 'entries' in many facets, so these really hit us
> performance-wise.)
>
> I try to keep a 1-to-1 ratio of Solr nodes to CPUs with a few spare for
> the operating system. I utilise MMapDirectory to manage memory via the OS.
> So at this moment I'm guessing that we'll have 56 Solr-dedicated CPUs across
> 2 physical 32-CPU servers and _hopefully_ 256GB RAM on each. This would
> give 28 shards, and each would have 5GB Java memory (in Tomcat), leaving
> 126GB on each server for the OS and MMap. (I believe the Solr theory for
> this doesn't accurately work out, but we can accept the edge cases where
> this will fail.)
>
> I can also see that our hardware requirements will depend on usage as
> well as the volume of data, and I've been pondering how best we can
> structure our index/es to facilitate a long-term service (which means that,
> given it's a lot of data, I need to structure the data so that new usage
> doesn't require re-indexing). But at this early stage, as people say, we
> need to prototype, test, profile etc., and to do that I need the hardware to
> run the trials (policy dictates that I buy the production hardware now,
> before profiling - I get to control much of the design and construction, so
> I don't argue with this!)
>
> Thanks for all the comments everyone, all very much appreciated :)
> Gil
>
>
> -----Original Message-----
> From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
> Sent: 11 December 2013 12:02
> To: solr-user@lucene.apache.org
> Subject: Re: Solr hardware memory question
>
> On Tue, 2013-12-10 at 17:51 +0100, Hoggarth, Gil wrote:
> > We're probably going to be building a Solr service to handle a dataset
> > of ~60TB, which for our data and schema typically gives a Solr index
> > size of 1/10th - i.e., 6TB. Given there's a general rule that the
> > amount of hardware memory required should exceed the size of the Solr
> > index (exceed to also allow for the operating system etc.), how have
> > people handled this situation?
>
> By acknowledging that it is cheaper to buy SSDs instead of trying to
> compensate for slow spinning drives with excessive amounts of RAM.
> Our plans for an estimated 20TB of indexes out of 372TB of raw web data are
> to use SSDs controlled by a single machine with 512GB of RAM (or was it
> 256GB? I'll have to ask the hardware guys):
> https://sbdevel.wordpress.com/2013/12/06/danish-webscale/
>
> As always YMMV, and the numbers you quote elsewhere indicate that your
> queries are quite complex. You might want to do a bit of profiling to see
> if they are heavy enough to make the CPU the bottleneck.
>
> Regards,
> Toke Eskildsen, State and University Library, Denmark
RE: Solr hardware memory question
On Thu, 2013-12-12 at 11:10 +0100, Hoggarth, Gil wrote:
> Thanks for this - I haven't any previous experience with utilising SSDs
> in the way you suggest, so I guess I need to start learning!

There's a bit of a divide in the Lucene/Solr world on this. Everybody agrees that SSDs in themselves are great for Lucene/Solr searches, compared to a spinning-drives solution. How much better is another matter, and the issue gets confusing when RAM caching is factored in. Some are also very concerned about the reliability of SSDs and the write-performance degradation without TRIM (you need a quite specific setup to have TRIM enabled on a server with SSDs in RAID). Guessing that your 6TB index is not heavily updated, the TRIM part should not be one of your worries, though.

At Statsbiblioteket, we have been using SSDs for our search servers since 2008. That was back when random write performance was horrible and a large drive was 64GB. As you have probably guessed, we are very much in the SSD camp. We have done some testing, and for simple searches (i.e. a lot of IO and comparatively little CPU usage), we have observed that SSDs + 10% index-size RAM for caching deliver something like 80% of pure RAM speed.
https://sbdevel.wordpress.com/2013/06/06/memory-is-overrated/
Your mileage will surely vary.

> [...] leaving 126GB on each server for the OS and MMap. [...]

So about the same as your existing 3TB setup? Seems like you will get the same performance then. I must say that 1-minute response times would be very hard to sell at our library, even for a special search only used by a small and dedicated audience. Even your goal of 20 seconds seems adverse to exploratory search.

May I be so frank as to suggest a course of action? Buy one ½TB Samsung 840 EVO SSD, fill it with indexes and test it in a machine with 32GB of RAM, thus matching the 1/20 index-size RAM that your servers will have.
Such a drive costs £250 on Amazon, and the experiment would spare you a lot of speculation and time. Next, conclude that SSDs are the obvious choice and secure the 840 for your workstation with reference to "further testing".

> I can also see that our hardware requirements will also depend on usage
> as well as the volume of data, and I've been pondering how best we can
> structure our index/es to facilitate a long term service (which means
> that, given it's a lot of data, I need to structure the data so that
> new usage doesn't require re-indexing.)

We definitely have this problem too. We have resigned ourselves to re-indexing the data after some months of real-world usage.

Regards,
Toke Eskildsen, State and University Library, Denmark
RE: Solr hardware memory question
Thanks for this - I haven't any previous experience with utilising SSDs in the way you suggest, so I guess I need to start learning! And thanks for the Danish-webscale URL, looks like very informed reading. (Yes, I think we're working in similar industries with similar constraints and expectations.)

Compiling my answers into one email: "Curious how many documents per shard you were planning? The number of documents per shard and field type will drive the amount of RAM needed to sort and facet."
- Number of documents per shard, I think about 200 million. That's a bit of a rough estimate based on other Solrs we run, though. Which I think means we hold a lot of data for each document, though I keep arguing to keep this to the truly required minimum. We also have many facets, some of which are pretty large. (I'm stretching my understanding here, but I think most documents have many 'entries' in many facets, so these really hit us performance-wise.)

I try to keep a 1-to-1 ratio of Solr nodes to CPUs with a few spare for the operating system. I utilise MMapDirectory to manage memory via the OS. So at this moment I'm guessing that we'll have 56 Solr-dedicated CPUs across 2 physical 32-CPU servers and _hopefully_ 256GB RAM on each. This would give 28 shards, and each would have 5GB Java memory (in Tomcat), leaving 126GB on each server for the OS and MMap. (I believe the Solr theory for this doesn't accurately work out, but we can accept the edge cases where this will fail.)

I can also see that our hardware requirements will depend on usage as well as the volume of data, and I've been pondering how best we can structure our index/es to facilitate a long-term service (which means that, given it's a lot of data, I need to structure the data so that new usage doesn't require re-indexing). But at this early stage, as people say, we need to prototype, test, profile etc., and to do that I need the hardware to run the trials (policy dictates that I buy the production hardware now, before profiling - I get to control much of the design and construction, so I don't argue with this!)

Thanks for all the comments everyone, all very much appreciated :)
Gil

-----Original Message-----
From: Toke Eskildsen [mailto:t...@statsbiblioteket.dk]
Sent: 11 December 2013 12:02
To: solr-user@lucene.apache.org
Subject: Re: Solr hardware memory question

On Tue, 2013-12-10 at 17:51 +0100, Hoggarth, Gil wrote:
> We're probably going to be building a Solr service to handle a dataset
> of ~60TB, which for our data and schema typically gives a Solr index
> size of 1/10th - i.e., 6TB. Given there's a general rule that the
> amount of hardware memory required should exceed the size of the Solr
> index (exceed to also allow for the operating system etc.), how have
> people handled this situation?

By acknowledging that it is cheaper to buy SSDs instead of trying to compensate for slow spinning drives with excessive amounts of RAM.

Our plans for an estimated 20TB of indexes out of 372TB of raw web data are to use SSDs controlled by a single machine with 512GB of RAM (or was it 256GB? I'll have to ask the hardware guys):
https://sbdevel.wordpress.com/2013/12/06/danish-webscale/

As always YMMV, and the numbers you quote elsewhere indicate that your queries are quite complex. You might want to do a bit of profiling to see if they are heavy enough to make the CPU the bottleneck.

Regards,
Toke Eskildsen, State and University Library, Denmark
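[Editor's note: Gil's heap vs. page-cache split can be sanity-checked with a little arithmetic. This is a sketch assuming 28 shards per server at 5GB heap each; the email's own figure of 126GB free suggests slightly different inputs, so treat the numbers as illustrative of the calculation rather than authoritative.]

```python
def page_cache_left(total_ram_gb, shards_per_server, heap_per_shard_gb):
    """RAM left per server for the OS page cache (what MMapDirectory
    actually uses) once every shard's JVM heap is accounted for."""
    return total_ram_gb - shards_per_server * heap_per_shard_gb

# 2 servers, 256 GB RAM each, 28 shards per server at 5 GB heap apiece
left = page_cache_left(256, 28, 5)
print(f"{left} GB per server left for OS + mmap'ed index")  # 116 GB
```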
Re: Solr hardware memory question
On Thu, 2013-12-12 at 02:46 +0100, Joel Bernstein wrote:
> Curious how many documents per shard you were planning?

350-500 million, optimized to a single segment as the data are not changing.

> The number of documents per shard and field type will drive the amount
> of RAM needed to sort and facet.

Very true. It makes a lot of sense to separate RAM requirements for the Lucene/Solr structures and OS caching. It seems that Gil is working on about the same project as we are, so I will elaborate in this thread:

We would like to perform some sort of grouping on URL, so that the same page harvested at different points in time is only displayed once. This is probably the heaviest functionality, as the cardinality of the field will be near the number of documents.

For plain(er) faceting, things like MIME type, harvest date and site seem relevant. Those fields have lower cardinality, and they are single-valued, so the memory requirements are something like
#docs * log2(#unique_values) bits.
With 500M documents and 1000 values, that is 600MB. With 20 shards, we are looking at 12GB per simple facet field.

Regards,
Toke Eskildsen
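[Editor's note: Toke's back-of-the-envelope facet estimate can be reproduced as follows. The function name is mine; the formula is just the packed-ordinals bound quoted above, one ceil(log2(#unique)) bit ordinal per document.]

```python
import math

def facet_ram_bytes(num_docs, num_unique_values):
    """Estimate RAM for a single-valued facet field: one packed ordinal
    of ceil(log2(#unique_values)) bits per document."""
    bits_per_doc = math.ceil(math.log2(num_unique_values))
    return num_docs * bits_per_doc / 8

per_shard = facet_ram_bytes(500_000_000, 1000)  # 10 bits/doc -> 625 MB
total = 20 * per_shard                          # ~12.5 GB across 20 shards
print(f"{per_shard / 1e6:.0f} MB per shard, {total / 1e9:.1f} GB total")
```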
Re: Solr hardware memory question
Hi Gil,

I'd look at the number and type of fields you sort and facet on (this stuff likes memory). I'd keep in mind that heaps over 32GB use bigger pointers, so maybe more smaller heaps are better than one big one. You didn't mention the # of CPU cores, but keep that in mind when sharding. When a query comes in, you want to put all your CPU cores to work.

...

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Tue, Dec 10, 2013 at 11:51 AM, Hoggarth, Gil wrote:
> We're probably going to be building a Solr service to handle a dataset
> of ~60TB, which for our data and schema typically gives a Solr index
> size of 1/10th - i.e., 6TB. Given there's a general rule that the
> amount of hardware memory required should exceed the size of the Solr
> index (exceed to also allow for the operating system etc.), how have
> people handled this situation? Do I really need, for example, 12 servers
> with 512GB RAM, or are there other techniques for handling this?
>
> Many thanks in advance for any general/conceptual/specific
> ideas/comments/answers!
>
> Gil
>
> Gil Hoggarth
> Web Archiving Technical Services Engineer
> The British Library, Boston Spa, West Yorkshire, LS23 7BQ
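[Editor's note: the "heaps over 32GB use bigger pointers" point refers to the JVM's compressed ordinary object pointers (oops): below a roughly 32GB heap, object references cost 4 bytes; above it, 8. A rough sketch of the cost; the function and the reference count are illustrative, not from the thread.]

```python
def reference_overhead_gb(num_refs, heap_gb):
    """Approximate RAM consumed by object references at a given heap size.
    Compressed oops (4-byte references) apply up to a ~32 GB heap; beyond
    that, every reference costs 8 bytes."""
    ref_bytes = 4 if heap_gb < 32 else 8
    return num_refs * ref_bytes / 1024**3

# The same billion references cost twice the RAM once compressed oops are lost:
print(round(reference_overhead_gb(1_000_000_000, 31), 1))  # ~3.7 GB
print(round(reference_overhead_gb(1_000_000_000, 48), 1))  # ~7.5 GB
```

This is the arithmetic behind "more smaller heaps are better than one big one": several sub-32GB heaps keep the cheap 4-byte references.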
Re: Solr hardware memory question
Curious how many documents per shard you were planning? The number of documents per shard and field type will drive the amount of RAM needed to sort and facet.

On Wed, Dec 11, 2013 at 7:02 AM, Toke Eskildsen wrote:
> On Tue, 2013-12-10 at 17:51 +0100, Hoggarth, Gil wrote:
> > We're probably going to be building a Solr service to handle a dataset
> > of ~60TB, which for our data and schema typically gives a Solr index
> > size of 1/10th - i.e., 6TB. Given there's a general rule that the
> > amount of hardware memory required should exceed the size of the Solr
> > index (exceed to also allow for the operating system etc.), how have
> > people handled this situation?
>
> By acknowledging that it is cheaper to buy SSDs instead of trying to
> compensate for slow spinning drives with excessive amounts of RAM.
>
> Our plans for an estimated 20TB of indexes out of 372TB of raw web data
> are to use SSDs controlled by a single machine with 512GB of RAM (or was
> it 256GB? I'll have to ask the hardware guys):
> https://sbdevel.wordpress.com/2013/12/06/danish-webscale/
>
> As always YMMV, and the numbers you quote elsewhere indicate that your
> queries are quite complex. You might want to do a bit of profiling to
> see if they are heavy enough to make the CPU the bottleneck.
>
> Regards,
> Toke Eskildsen, State and University Library, Denmark

--
Joel Bernstein
Search Engineer at Heliosearch
Re: Solr hardware memory question
On Tue, 2013-12-10 at 17:51 +0100, Hoggarth, Gil wrote:
> We're probably going to be building a Solr service to handle a dataset
> of ~60TB, which for our data and schema typically gives a Solr index
> size of 1/10th - i.e., 6TB. Given there's a general rule that the
> amount of hardware memory required should exceed the size of the Solr
> index (exceed to also allow for the operating system etc.), how have
> people handled this situation?

By acknowledging that it is cheaper to buy SSDs instead of trying to compensate for slow spinning drives with excessive amounts of RAM.

Our plans for an estimated 20TB of indexes out of 372TB of raw web data are to use SSDs controlled by a single machine with 512GB of RAM (or was it 256GB? I'll have to ask the hardware guys):
https://sbdevel.wordpress.com/2013/12/06/danish-webscale/

As always YMMV, and the numbers you quote elsewhere indicate that your queries are quite complex. You might want to do a bit of profiling to see if they are heavy enough to make the CPU the bottleneck.

Regards,
Toke Eskildsen, State and University Library, Denmark
Re: Solr hardware memory question
Shawn's right that if you're going to scale this big you'd be very well served to spend time getting the index as small as possible.

In my experience, if your searches require real-time random-access reads (that is, the entire index needs to be fast), you don't want to wait for HDD disk reads. Getting everything in RAM is best, but 6TB per replica (perhaps you'll want more than 1 replica?) is a tall order. SSDs are coming down in price. Flash memory tech is advancing quickly (Fusion-io and the like).

Sounds like an interesting use case!

Thanks,
Ryan

On Tue, Dec 10, 2013 at 9:37 AM, Shawn Heisey wrote:
> On 12/10/2013 9:51 AM, Hoggarth, Gil wrote:
> > We're probably going to be building a Solr service to handle a dataset
> > of ~60TB, which for our data and schema typically gives a Solr index
> > size of 1/10th - i.e., 6TB. Given there's a general rule that the
> > amount of hardware memory required should exceed the size of the Solr
> > index (exceed to also allow for the operating system etc.), how have
> > people handled this situation? Do I really need, for example, 12 servers
> > with 512GB RAM, or are there other techniques for handling this?
>
> That really depends on what kind of query volume you'll have and what
> kind of performance you want. If your query volume is low and you can
> deal with slow individual queries, then you won't need that much memory.
> If either of those requirements increases, you'd probably need more
> memory, up to the 6TB total -- or 12TB if you need to double the total
> index size for redundancy purposes. If your index is constantly growing
> like most are, you need to plan for that too.
>
> Putting the entire index into RAM is required for *top* performance, but
> not for base functionality. It might be possible to put only a fraction
> of your index into RAM. Only testing can determine what you really need
> to obtain the performance you're after.
> Perhaps you've already done this, but you should try as much as possible
> to reduce your index size. Store as few fields as possible, only just
> enough to build a search result list/grid and retrieve the full document
> from the canonical data store. Save termvectors and docvalues on as few
> fields as possible. If you can, reduce the number of terms produced by
> your analysis chains.
>
> Thanks,
> Shawn
RE: Solr hardware memory question
Thanks, Shawn. You're absolutely right about the performance balance, though it's good to hear it from an experienced source (if you don't mind me calling you that!). Fortunately we don't have a top-performance requirement, and we have a small audience, so a low query volume.

On similar systems we're "managing" to just provide a Solr service with a 3TB index size on 160GB RAM, though we have scripts to handle the occasionally necessary service restart when someone submits a more exotic query. This, btw, gives a response time of ~45-90 seconds for uncached queries. My question, I suppose, comes from my hope that we can do away with the restart scripts, as I doubt they help the Solr service (they can, if necessary, just kill processes and restart), and get to response times < 20 seconds.

-----Original Message-----
From: Shawn Heisey [mailto:s...@elyograg.org]
Sent: 10 December 2013 17:37
To: solr-user@lucene.apache.org
Subject: Re: Solr hardware memory question

On 12/10/2013 9:51 AM, Hoggarth, Gil wrote:
> We're probably going to be building a Solr service to handle a dataset
> of ~60TB, which for our data and schema typically gives a Solr index
> size of 1/10th - i.e., 6TB. Given there's a general rule that the
> amount of hardware memory required should exceed the size of the Solr
> index (exceed to also allow for the operating system etc.), how have
> people handled this situation? Do I really need, for example, 12
> servers with 512GB RAM, or are there other techniques for handling this?

That really depends on what kind of query volume you'll have and what kind of performance you want. If your query volume is low and you can deal with slow individual queries, then you won't need that much memory. If either of those requirements increases, you'd probably need more memory, up to the 6TB total -- or 12TB if you need to double the total index size for redundancy purposes. If your index is constantly growing like most are, you need to plan for that too.
Putting the entire index into RAM is required for *top* performance, but not for base functionality. It might be possible to put only a fraction of your index into RAM. Only testing can determine what you really need to obtain the performance you're after.

Perhaps you've already done this, but you should try as much as possible to reduce your index size. Store as few fields as possible, only just enough to build a search result list/grid and retrieve the full document from the canonical data store. Save termvectors and docvalues on as few fields as possible. If you can, reduce the number of terms produced by your analysis chains.

Thanks,
Shawn
Re: Solr hardware memory question
On 12/10/2013 9:51 AM, Hoggarth, Gil wrote:
> We're probably going to be building a Solr service to handle a dataset
> of ~60TB, which for our data and schema typically gives a Solr index
> size of 1/10th - i.e., 6TB. Given there's a general rule that the
> amount of hardware memory required should exceed the size of the Solr
> index (exceed to also allow for the operating system etc.), how have
> people handled this situation? Do I really need, for example, 12 servers
> with 512GB RAM, or are there other techniques for handling this?

That really depends on what kind of query volume you'll have and what kind of performance you want. If your query volume is low and you can deal with slow individual queries, then you won't need that much memory. If either of those requirements increases, you'd probably need more memory, up to the 6TB total -- or 12TB if you need to double the total index size for redundancy purposes. If your index is constantly growing like most are, you need to plan for that too.

Putting the entire index into RAM is required for *top* performance, but not for base functionality. It might be possible to put only a fraction of your index into RAM. Only testing can determine what you really need to obtain the performance you're after.

Perhaps you've already done this, but you should try as much as possible to reduce your index size. Store as few fields as possible, only just enough to build a search result list/grid and retrieve the full document from the canonical data store. Save termvectors and docvalues on as few fields as possible. If you can, reduce the number of terms produced by your analysis chains.

Thanks,
Shawn
Solr hardware memory question
We're probably going to be building a Solr service to handle a dataset of ~60TB, which for our data and schema typically gives a Solr index size of 1/10th - i.e., 6TB. Given there's a general rule that the amount of hardware memory required should exceed the size of the Solr index (exceed to also allow for the operating system etc.), how have people handled this situation? Do I really need, for example, 12 servers with 512GB RAM, or are there other techniques for handling this?

Many thanks in advance for any general/conceptual/specific ideas/comments/answers!

Gil

Gil Hoggarth
Web Archiving Technical Services Engineer
The British Library, Boston Spa, West Yorkshire, LS23 7BQ
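[Editor's note: the "12 servers with 512GB" figure follows directly from the RAM-should-exceed-index rule of thumb quoted above. A quick sketch of the arithmetic; the variable names are mine.]

```python
import math

raw_data_tb = 60
index_tb = raw_data_tb / 10      # ~6 TB index at the observed 1/10 ratio
ram_per_server_gb = 512          # a hypothetical large-memory server

# Servers needed if the whole index is to fit in RAM across the cluster
servers = math.ceil(index_tb * 1024 / ram_per_server_gb)
print(servers)  # 12
```

The rest of the thread is largely about when this worst-case assumption can be relaxed (SSDs, low query volume, smaller indexes).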