RE: Hardware config for SOLR
Grant,

Thanks a lot for the answers. Please see my replies below.

> > 1) Should we do sharding or not? If we start without sharding, how hard
> > will it be to enable it? Is it just some config changes + the index
> > rebuild, or is it more?
>
> There will be operations setup, etc., and you'll have to add in the
> appropriate query stuff. Your install and requirements aren't that large,
> so I doubt you'll need sharding, but it always depends on your exact
> configuration. I've seen indexes as big as 80 million docs on a single
> machine, but the docs were smaller in size.
>
> > My personal opinion is to go without sharding at first and enable it
> > later if we do get a lot of documents.
>
> Sounds reasonable.

One more question: is it worth trying to keep the whole index in memory and sharding only when it no longer fits? To me that seems like a bit of overhead, but I may be very wrong here. What's a recommended ratio between the parts kept in RAM and on the HDDs?

> > 2) How should we organize our clusters to ensure redundancy? Should we
> > have 2 or more identical Masters (meaning that all the
> > updates/optimisations/etc. are done on every one of them)? An
> > alternative, afaik, is to reconfigure one slave to become the new
> > Master; how hard is that?
>
> I don't have a good answer here; maybe someone else can chime in. I know
> master failover is a concern, but I'm not sure how others handle it right
> now. It would be good to have people share their approaches. That being
> said, it seems reasonable to me to have identical masters.

I found this thread related to the issue:
http://www.nabble.com/High-Availability-deployment-to13094489.html#a13098729
I guess it depends on how easily we can fill the gap between the last commit and the moment the Master goes down. Most likely, we'll have to have 2 Masters.
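Coming back to question 1: enabling sharding later is, at query time, mostly a matter of adding Solr's `shards` parameter to each request. A minimal client-side sketch of the difference (the host names and core paths are hypothetical examples):

```python
# Sketch: turning a single-node Solr query into a sharded one.
# Host names below are made up for illustration.
from urllib.parse import urlencode

def solr_query_url(base, params, shards=None):
    """Build a Solr select URL; add the shards parameter if sharding is on."""
    params = dict(params)
    if shards:
        # Distributed search: Solr fans the query out to every listed shard
        # and merges the results before responding.
        params["shards"] = ",".join(shards)
    return base + "/select?" + urlencode(params)

single = solr_query_url("http://solr1:8983/solr", {"q": "title:lucene"})
sharded = solr_query_url(
    "http://solr1:8983/solr",
    {"q": "title:lucene"},
    shards=["solr1:8983/solr", "solr2:8983/solr"],
)
```

The index itself still has to be rebuilt so documents are partitioned across the shards; only the querying side is this simple.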
> > 3) Basically, we can get servers of two kinds:
> >
> > * Single Processor, Dual Core Opteron 2214HE
> > * 2 GB DDR2 SDRAM
> > * 1 x 250 GB (7200 RPM) SATA drive
> >
> > * Dual Processor, Quad Core 5335
> > * 16 GB Memory (Fully Buffered)
> > * 2 x 73 GB (10k RPM) 2.5" SAS drives, RAID 1
> >
> > The second - more powerful - one is more expensive, of course.
>
> Get as much RAM as you can afford. Surely there is an in-between machine
> as well that might balance cost and capabilities. The first machine
> seems a bit light, especially in memory.

Fair enough.

> > How can we take advantage of the multiprocessor/multicore servers? Is
> > there some special setup required to make, say, 2 instances of SOLR
> > run on the same server using different processors/cores?
>
> See the Core Admin stuff: http://wiki.apache.org/solr/CoreAdmin. Solr is
> thread-safe by design (so it's a bug if you hit issues). You can send it
> documents on multiple threads and it will be fine.

Hmmm, it seems that several cores are supposed to handle different indexes:
http://wiki.apache.org/solr/MultipleIndexes#head-e517417ef9b96e32168b2cf35ab6ff393f360d59
"Solr 1.3 added support for multiple Solr Cores in a single deployment of Solr -- each Solr Core has its own index. For more information please see CoreAdmin."
As we are going to have just one index, the only way to use them that I see is to configure a Master on core 1 and a Slave on core 2, or 2 slaves on 2 cores. Am I missing something here?

> > 4) Does it make much difference to get a more powerful Master? Or, on
> > the contrary, as the slaves will be queried more often, should they be
> > the better ones? Maybe just the HDDs for the slaves should be as fast
> > as possible?
>
> Depends on where your bottlenecks are. Are you getting a lot of queries
> or a lot of updates?

Both, but more queries than updates. Means we shouldn't neglect the slaves, I guess?

> As for HDDs, people have noted some nice speedups in Lucene using
> solid-state drives, if you can afford them.
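On Grant's point that Solr accepts concurrent updates: a feeder can simply split the document stream across worker threads, with no special Solr-side setup. A toy sketch where `post_batch` merely stands in for the actual HTTP POST to the update handler:

```python
# Sketch of a multi-threaded indexing feeder. post_batch is a stub;
# in a real feeder it would POST an <add> message to Solr's /update.
from concurrent.futures import ThreadPoolExecutor

def post_batch(batch):
    # Stand-in for the HTTP POST; returns how many docs were "sent".
    return len(batch)

def index_concurrently(docs, batch_size=100, workers=4):
    """Split docs into batches and send them from several threads."""
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sent = sum(pool.map(post_batch, batches))
    return sent
```

This exploits the server's cores without running multiple Solr instances; the batch size and worker count are things to tune empirically.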
> Fast I/O is good if you're retrieving whole documents, but once things
> are warmed up, more RAM is most important, I think, as many things can
> be cached.
>
> > 5) How many slaves does it make sense to have per Master? What's
> > (roughly) the performance gain from 1 to 2, 2 to 3, etc.? When does
> > it stop making sense to add more slaves?
>
> I suppose it's when you can handle your peak load, but I don't have
> numbers. One of the keys is to test incrementally and see what makes
> sense for your scenario.

Right, the numbers given in other responses (thanks, Karl and Lars) look impressive, so we'll consider this option.

> > As far as I understand, it depends mainly on the size of the index.
> > However, I'd guess the time required to do a push to too many slaves
> > can be a problem too, correct?
>
> The biggest problem for slaves is if the master does an optimization,
> in which case the whole snapshot must be downloaded, whereas incremental
> additions can be handled by getting just the deltas.

Our initial idea is to send batch updates several times per day rather than individual real-time updates, then commit and run an optimization after that, as advised here:
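For what it's worth, that batch-then-optimize cycle amounts to three plain posts to Solr's XML update handler. A sketch that only builds the payloads (the field names are invented for illustration; sending them is an ordinary HTTP POST to /update):

```python
# Sketch: building the XML messages for a batch update cycle.
# Field names ("id", "title") are illustrative, not a real schema.
from xml.sax.saxutils import escape

def add_payload(docs):
    """Build an <add> message for a batch of {field: value} docs."""
    parts = ["<add>"]
    for doc in docs:
        fields = "".join(
            '<field name="%s">%s</field>' % (name, escape(str(val)))
            for name, val in doc.items()
        )
        parts.append("<doc>%s</doc>" % fields)
    parts.append("</add>")
    return "".join(parts)

COMMIT = "<commit/>"      # makes the batch visible to searchers
OPTIMIZE = "<optimize/>"  # merges segments; slaves then pull a full snapshot

payload = add_payload([{"id": "1", "title": "hello"}])
```

Posting `OPTIMIZE` only after the day's batches, as planned above, keeps the expensive full-snapshot replication to a few known times per day.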
Re: Hardware config for SOLR
Hi Andrey,

Responses inlined.

----- Original Message -----
From: Andrey Shulinskiy [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Sunday, September 21, 2008 11:23:00 PM
Subject: RE: Hardware config for SOLR

> 1) Should we do sharding or not?
>
> One more question: is it worth trying to keep the whole index in memory
> and sharding only when it no longer fits? To me that seems like a bit
> of overhead, but I may be very wrong here. What's a recommended ratio
> between the parts kept in RAM and on the HDDs?

It's well worth trying to keep the index buffered (i.e. in memory). Yes, once you can't fit the hot parts of the index in RAM, it's time to think about sharding (or buying more RAM). However, it's not as simple as comparing the index size to the RAM size, as not all parts of the index need to be cached.

> 2) How should we organize our clusters to ensure redundancy? Should we
> have 2 or more identical Masters (meaning that all the
> updates/optimisations/etc. are done on every one of them)? An
> alternative, afaik, is to reconfigure one slave to become the new
> Master; how hard is that?
> I found this thread related to the issue:
> http://www.nabble.com/High-Availability-deployment-to13094489.html#a13098729
> I guess it depends on how easily we can fill the gap between the last
> commit and the moment the Master goes down. Most likely, we'll have to
> have 2 Masters.

Or you could simply have 2 masters and index the same data on both of them. Then, in case #1 fails, you simply get your slaves to start copying from #2. You could have the slaves talk to the master via an LB VIP, so a change from #1 to #2 can be done quickly in the LB and the slaves don't have to be reconfigured. Or you could have the masters keep the index on some sort of shared storage (e.g. a SAN).

> Hmmm, it seems that several cores are supposed to handle different
> indexes:
> http://wiki.apache.org/solr/MultipleIndexes#head-e517417ef9b96e32168b2cf35ab6ff393f360d59
> "Solr 1.3 added support for multiple Solr Cores in a single deployment
> of Solr -- each Solr Core has its own index. For more information
> please see CoreAdmin."

Yes.
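On the LB VIP idea above: the same effect can be approximated in client configuration, with slaves holding an ordered list of masters and replicating from the first healthy one. A toy sketch (the health check here is a stub lambda; in practice the load balancer performs it):

```python
# Sketch of master failover selection. Master URLs are hypothetical.
def pick_master(masters, is_healthy):
    """Return the first healthy master to replicate from, else None."""
    for url in masters:
        if is_healthy(url):
            return url
    return None

masters = ["http://master1:8983/solr", "http://master2:8983/solr"]

# Simulate master1 being down: only master2 reports healthy.
chosen = pick_master(masters, lambda url: "master2" in url)
```

The VIP approach keeps this logic out of the slaves entirely, which is why it is the simpler operational choice.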
> As we are going to have just one index, the only way to use them that I
> see is to configure a Master on core 1 and a Slave on core 2, or 2
> slaves on 2 cores. Am I missing something here?

It sounds like you are talking about a single server hosting the master and the slave(s) on the same server. That's not what you want to do, though: the master and the slave(s) each live on their own server. But I think you are aware of this. You don't need to think about Solr multicore functionality if you have but a single index.

> 4) Does it make much difference to get a more powerful Master? Or, on
> the contrary, as the slaves will be queried more often, should they be
> the better ones? Maybe just the HDDs for the slaves should be as fast
> as possible?
>
> > Depends on where your bottlenecks are. Are you getting a lot of
> > queries or a lot of updates?
>
> Both, but more
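Returning to the earlier point that only the hot parts of the index need to fit in RAM: a back-of-the-envelope sketch, with entirely made-up numbers, of the kind of estimate involved:

```python
# Rough sizing sketch. All figures are illustrative guesses, not
# measurements; hot_fraction in particular must come from observation.
def fits_in_ram(index_gb, ram_gb, hot_fraction=0.5, jvm_overhead_gb=2):
    """Estimate whether the 'hot' part of the index fits in the OS cache.

    hot_fraction guesses how much of the index queries actually touch;
    jvm_overhead_gb is memory the Solr JVM itself keeps from the cache.
    """
    cache_gb = ram_gb - jvm_overhead_gb
    return index_gb * hot_fraction <= cache_gb

# e.g. a ~20 GB index on the 16 GB machine vs. a ~40 GB index on 8 GB:
ok_small = fits_in_ram(20, 16)
ok_large = fits_in_ram(40, 8)
```

When the estimate turns false, that is the signal to shard or add RAM, per the advice above.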
Re: Hardware config for SOLR
I have not worked with SSDs, though I've read all the good information that's been trickling to us from Denmark. One thing that I've been wondering all along is: what about writes? That is, what about writes wearing out the SSD? How quickly does that happen, and when it does happen, what are the symptoms? For example, does it happen after N write operations? Do writes start failing, and does one start getting IOExceptions in the case of Lucene and Solr?

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message -----
From: Karl Wettin [EMAIL PROTECTED]
To: solr-user@lucene.apache.org
Sent: Friday, September 19, 2008 6:15:53 PM
Subject: Re: Hardware config for SOLR

> 19 sep 2008 kl. 23.22 skrev Grant Ingersoll:
>
> > As for HDDs, people have noted some nice speedups in Lucene using
> > solid-state drives, if you can afford them.
>
> I've seen the average response time cut 5-10 times when switching to
> SSD. A 64 GB SSD starts at EUR 200, so it can be a lot cheaper to
> replace the disk than to get more servers, given you can fit your index
> on one of those.
>
> karl
Re: Hardware config for SOLR
> I have not worked with SSDs, though I've read all the good information
> that's been trickling to us from Denmark. One thing that I've been
> wondering all along is: what about writes? That is, what about writes
> wearing out the SSD? How quickly does that happen, and when it does
> happen, what are the symptoms? For example, does it happen after N
> write operations? Do writes start failing, and does one start getting
> IOExceptions in the case of Lucene and Solr?

With modern SSDs you get something in the region of 500,000 to 1,000,000 write cycles per memory cell. Additionally, they all use wear leveling, i.e. the writes are spread over the whole disk, so you can write to a file-system block many times more than that. One of the manufacturers of high-end SSDs [1] claims that at a sustained write rate of 50 GB per day their drives will last more than 140 years, i.e. it's much more likely that something else will fail first ;)

When the write cycles are exhausted, much the same thing happens as with a bad conventional disk: you'll see lots of write errors. If the wear leveling is perfect (i.e. all memory locations have exactly the same number of writes), it's even possible that the whole disk will fail at once.

Lars

[1] http://www.mtron.net
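Lars's figures make the endurance claim easy to sanity-check under the perfect-wear-leveling assumption (the numbers below are illustrative, taken from the low end of his cycle range, not the manufacturer's own):

```python
# Back-of-the-envelope SSD endurance estimate. Assumes ideal wear
# leveling: every cell receives an equal share of all writes.
def ssd_lifetime_years(capacity_gb, write_cycles_per_cell, gb_written_per_day):
    """Years until the write cycles are exhausted."""
    total_writable_gb = capacity_gb * write_cycles_per_cell
    return total_writable_gb / gb_written_per_day / 365.0

# A 64 GB drive, 500,000 cycles per cell, 50 GB written per day:
years = ssd_lifetime_years(64, 500_000, 50)
```

Even with these conservative inputs the result comfortably exceeds the 140-year claim, which is consistent with "something else will fail first".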
Re: Hardware config for SOLR
Inline below.

On Sep 17, 2008, at 6:32 PM, Andrey Shulinskiy wrote:

> Hello,
>
> First, some numbers we're expecting.
>
> - The average size of a doc: ~100K
> - The number of indexes: 1
> - The query response time we're looking for: 200-300 ms
> - The number of stored docs: 1st year: 500K-1M; 2nd year: 2-3M
> - The estimated number of concurrent users per second: 1st year: 15-25;
>   2nd year: 40-60
> - The estimated number of queries: 1st year: 15-25; 2nd year: 40-60
>
> Now the questions.
>
> 1) Should we do sharding or not? If we start without sharding, how hard
> will it be to enable it? Is it just some config changes + the index
> rebuild, or is it more?

There will be operations setup, etc., and you'll have to add in the appropriate query stuff. Your install and requirements aren't that large, so I doubt you'll need sharding, but it always depends on your exact configuration. I've seen indexes as big as 80 million docs on a single machine, but the docs were smaller in size.

> My personal opinion is to go without sharding at first and enable it
> later if we do get a lot of documents.

Sounds reasonable.

> 2) How should we organize our clusters to ensure redundancy? Should we
> have 2 or more identical Masters (meaning that all the
> updates/optimisations/etc. are done on every one of them)? An
> alternative, afaik, is to reconfigure one slave to become the new
> Master; how hard is that?

I don't have a good answer here; maybe someone else can chime in. I know master failover is a concern, but I'm not sure how others handle it right now. It would be good to have people share their approaches. That being said, it seems reasonable to me to have identical masters.

> 3) Basically, we can get servers of two kinds:
>
> * Single Processor, Dual Core Opteron 2214HE
> * 2 GB DDR2 SDRAM
> * 1 x 250 GB (7200 RPM) SATA drive
>
> * Dual Processor, Quad Core 5335
> * 16 GB Memory (Fully Buffered)
> * 2 x 73 GB (10k RPM) 2.5" SAS drives, RAID 1
>
> The second - more powerful - one is more expensive, of course.

Get as much RAM as you can afford.
Surely there is an in-between machine as well that might balance cost and capabilities. The first machine seems a bit light, especially in memory.

> How can we take advantage of the multiprocessor/multicore servers? Is
> there some special setup required to make, say, 2 instances of SOLR run
> on the same server using different processors/cores?

See the Core Admin stuff: http://wiki.apache.org/solr/CoreAdmin. Solr is thread-safe by design (so it's a bug if you hit issues). You can send it documents on multiple threads and it will be fine.

> 4) Does it make much difference to get a more powerful Master? Or, on
> the contrary, as the slaves will be queried more often, should they be
> the better ones? Maybe just the HDDs for the slaves should be as fast
> as possible?

Depends on where your bottlenecks are. Are you getting a lot of queries or a lot of updates? As for HDDs, people have noted some nice speedups in Lucene using solid-state drives, if you can afford them. Fast I/O is good if you're retrieving whole documents, but once things are warmed up, more RAM is most important, I think, as many things can be cached.

> 5) How many slaves does it make sense to have per Master? What's
> (roughly) the performance gain from 1 to 2, 2 to 3, etc.? When does it
> stop making sense to add more slaves?

I suppose it's when you can handle your peak load, but I don't have numbers. One of the keys is to test incrementally and see what makes sense for your scenario.

> As far as I understand, it depends mainly on the size of the index.
> However, I'd guess the time required to do a push to too many slaves
> can be a problem too, correct?

The biggest problem for slaves is if the master does an optimization, in which case the whole snapshot must be downloaded, whereas incremental additions can be handled by getting just the deltas.

HTH,
Grant

--
Grant Ingersoll
http://www.lucidimagination.com

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
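Grant's optimize-versus-delta point can be put in rough numbers: after an optimize every slave pulls the entire snapshot, while after an ordinary commit only the new segments move. A sketch with invented sizes, assuming the pushes share a single NIC on the master:

```python
# Rough replication-time estimate. Sizes and NIC speed are illustrative;
# real transfers also pay disk and protocol overhead not modeled here.
def push_seconds(transfer_gb, slaves, nic_gbit_per_s=1.0):
    """Time to replicate to all slaves over the master's shared NIC."""
    seconds_per_gb = 8.0 / nic_gbit_per_s  # 8 gigabits per gigabyte
    return transfer_gb * seconds_per_gb * slaves

full_snapshot = push_seconds(transfer_gb=50, slaves=4)  # after an optimize
delta_only = push_seconds(transfer_gb=0.5, slaves=4)    # after a plain commit
```

The gap between the two numbers is why scheduling optimizes for off-peak hours matters more as the slave count grows.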
Re: Hardware config for SOLR
19 sep 2008 kl. 23.22 skrev Grant Ingersoll:

> As for HDDs, people have noted some nice speedups in Lucene using
> solid-state drives, if you can afford them.

I've seen the average response time cut 5-10 times when switching to SSD. A 64 GB SSD starts at EUR 200, so it can be a lot cheaper to replace the disk than to get more servers, given you can fit your index on one of those.

karl
Re: Hardware config for SOLR
> > As for HDDs, people have noted some nice speedups in Lucene using
> > solid-state drives, if you can afford them.
>
> I've seen the average response time cut 5-10 times when switching to
> SSD. A 64 GB SSD starts at EUR 200, so it can be a lot cheaper to
> replace the disk than to get more servers, given you can fit your
> index on one of those.

For some concrete numbers, see http://wiki.statsbiblioteket.dk/summa/Hardware

Lars
Re: Hardware config for SOLR
I can't speak to a lot of this, but regarding the servers I'd go with the more powerful ones, if only for the amount of RAM. Your index will likely be larger than 1 gig, and with only 2 GB you'll have a lot of your index not stored in RAM, which will slow down your QPS.

Thanks for your time!

Matthew Runo
Software Engineer, Zappos.com
[EMAIL PROTECTED] - 702-943-7833

On Sep 17, 2008, at 3:32 PM, Andrey Shulinskiy wrote:

> 3) Basically, we can get servers of two kinds:
>
> * Single Processor, Dual Core Opteron 2214HE
> * 2 GB DDR2 SDRAM
> * 1 x 250 GB (7200 RPM) SATA drive
>
> * Dual Processor, Quad Core 5335
> * 16 GB Memory (Fully Buffered)
> * 2 x 73 GB (10k RPM) 2.5" SAS drives, RAID 1
>
> The second - more powerful - one is more expensive, of course.
RE: Hardware config for SOLR
Matthew,

Thanks, a very good point.

Andrey.

-----Original Message-----
From: Matthew Runo [mailto:[EMAIL PROTECTED]
Sent: Thursday, September 18, 2008 11:38 AM
To: solr-user@lucene.apache.org
Subject: Re: Hardware config for SOLR

> I can't speak to a lot of this, but regarding the servers I'd go with
> the more powerful ones, if only for the amount of RAM. Your index will
> likely be larger than 1 gig, and with only 2 GB you'll have a lot of
> your index not stored in RAM, which will slow down your QPS.
Hardware config for SOLR
Hello,

We're planning to use SOLR for our project and have some questions. I asked some questions yesterday and got no answers whatsoever; wondering if they didn't make sense, or if the e-mail was too long... :-) Anyway, I'll try to ask them again and hope for some answers this time. It's a very new experience for us, so any help is really appreciated.

First, some numbers we're expecting.

- The average size of a doc: ~100K
- The number of indexes: 1
- The query response time we're looking for: 200-300 ms
- The number of stored docs: 1st year: 500K-1M; 2nd year: 2-3M
- The estimated number of concurrent users per second: 1st year: 15-25; 2nd year: 40-60
- The estimated number of queries: 1st year: 15-25; 2nd year: 40-60

Now the questions.

1) Should we do sharding or not? If we start without sharding, how hard will it be to enable it? Is it just some config changes + the index rebuild, or is it more? My personal opinion is to go without sharding at first and enable it later if we do get a lot of documents.

2) How should we organize our clusters to ensure redundancy? Should we have 2 or more identical Masters (meaning that all the updates/optimisations/etc. are done on every one of them)? An alternative, afaik, is to reconfigure one slave to become the new Master; how hard is that?

3) Basically, we can get servers of two kinds:

* Single Processor, Dual Core Opteron 2214HE
* 2 GB DDR2 SDRAM
* 1 x 250 GB (7200 RPM) SATA drive

* Dual Processor, Quad Core 5335
* 16 GB Memory (Fully Buffered)
* 2 x 73 GB (10k RPM) 2.5" SAS drives, RAID 1

The second - more powerful - one is more expensive, of course.

How can we take advantage of the multiprocessor/multicore servers? Is there some special setup required to make, say, 2 instances of SOLR run on the same server using different processors/cores?

4) Does it make much difference to get a more powerful Master? Or, on the contrary, as the slaves will be queried more often, should they be the better ones? Maybe just the HDDs for the slaves should be as fast as possible?

5) How many slaves does it make sense to have per Master? What's (roughly) the performance gain from 1 to 2, 2 to 3, etc.? When does it stop making sense to add more slaves? As far as I understand, it depends mainly on the size of the index. However, I'd guess the time required to do a push to too many slaves can be a problem too, correct?

Thanks,
Andrey.