Re: Architecture and Capacity planning for large Solr index
Whether three shards will give you adequate throughput is not an answerable question in the abstract. Here's what I suggest: get a single box of the size you expect your servers to be and index 1/3 of your documents on it. Run stress tests. That's really the only way to be fairly sure your hardware is adequate.

As far as SANs are concerned, local storage is almost always better. I'd advise against trying to share one index among the slaves, SAN or not. And using the SAN for each slave's own copy seems unnecessary with storage as cheap as it is; what advantage do you see in this scenario?

Best
Erick

On Mon, Nov 21, 2011 at 3:18 PM, Rahul Warawdekar rahul.warawde...@gmail.com wrote:

Thanks Otis! Please ignore my earlier email, which does not have all the information. My business requirements have changed a bit. We now need one year of rolling data in Production, with the following details:

- Number of records: 1.2 million
- Solr index size for these records: approximately 200 - 220 GB (includes large attachments)
- Approx 250 users who will be searching the application, with a peak of 1 search request every 40 seconds

I am planning to address this using Solr distributed search on a VMware virtualized environment as follows:

1. Whole index split across 3 shards, with 3 masters and 6 slaves (load balanced)
2. Master configuration for each server: 4 CPUs, 16 GB RAM, 300 GB disk space
3. Slave configuration for each server: 4 CPUs, 16 GB RAM, 150 GB disk space
4. SAN instead of local storage for the Solr index

My questions are:

- Will 3 shards serve the purpose here?
- Is SAN a good option for storing a Solr index, given the high index volume?

On Mon, Nov 21, 2011 at 3:05 PM, Rahul Warawdekar rahul.warawde...@gmail.com wrote:

Thanks! My business requirements have changed a bit. We need one year of rolling data in Production. The index size for the same comes to approximately 200 - 220 GB.
I am planning to address this using Solr distributed search as follows:

1. Whole index split across 3 shards, with 3 masters and 6 slaves (load balanced)
2. Master configuration will be 4 CPU

On Tue, Oct 11, 2011 at 2:05 PM, Otis Gospodnetic otis_gospodne...@yahoo.com wrote:

Hi Rahul,

This is unfortunately not enough information for anyone to give you very precise answers, so I'll just give some rough ones:

* best disk: SSD :)
* CPU: multicore; depends on query complexity, concurrency, etc.
* sharded search and failover: start with SolrCloud; there are a couple of pages about it on the Wiki and at http://blog.sematext.com/2011/09/14/solr-digest-spring-summer-2011-part-2-solr-cloud-and-near-real-time-search/

Otis
Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
Lucene ecosystem search :: http://search-lucene.com/

From: Rahul Warawdekar rahul.warawde...@gmail.com
To: solr-user solr-user@lucene.apache.org
Sent: Tuesday, October 11, 2011 11:47 AM
Subject: Architecture and Capacity planning for large Solr index

Hi All,

I am working on a Solr search based project and would highly appreciate help/suggestions from you all regarding Solr architecture and capacity planning. Details of the project are as follows:

1. There are 2 databases from which data needs to be indexed and made searchable: Production and Archive.
2. The Production database will retain 6 months of data and archive data every month.
3. The Archive database will retain 3 years of data.
4. The database is SQL Server 2008 and the Solr version is 3.1.

The data to be indexed contains a huge volume of attachments (PDF, Word, Excel, etc.), approximately 200 GB per month. We are planning to do a full index every month (multithreaded) and incremental indexing on a daily basis. The Solr index size comes to approximately 25 GB per month.

If we were to use distributed search, what would be the best configuration for the Production and Archive indexes? What would be the best CPU/RAM/disk configuration?
How can I implement a failover mechanism for sharded searches?

Please let me know in case I need to share more information.

--
Thanks and Regards
Rahul A. Warawdekar
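Erick's suggestion, indexing a representative third of the documents on one candidate box and stress-testing it, can be scripted roughly as follows. This is a minimal sketch: the Solr URL and sample queries are placeholders, not anything from the thread, and a real test should replay representative production queries.

```python
# Minimal stress-test sketch for a single candidate Solr box.
# SOLR_URL and QUERIES are placeholders: point them at your test box
# and substitute representative production searches.
import time
import urllib.parse
import urllib.request
from concurrent.futures import ThreadPoolExecutor

SOLR_URL = "http://solr-test:8983/solr/select"  # hypothetical test host
QUERIES = ["status:open", "body:contract", "title:invoice"]  # placeholders

def percentile(latencies, pct):
    """Return the pct-th percentile (0-100) of a list of latencies."""
    ordered = sorted(latencies)
    index = min(len(ordered) - 1, int(len(ordered) * pct / 100))
    return ordered[index]

def run_query(query):
    """Issue one query and return its latency in seconds."""
    params = urllib.parse.urlencode({"q": query, "rows": 10, "wt": "json"})
    start = time.monotonic()
    with urllib.request.urlopen(f"{SOLR_URL}?{params}", timeout=30) as resp:
        resp.read()
    return time.monotonic() - start

def stress(concurrency, iterations=200):
    """Run `iterations` queries at the given concurrency; report qps and p95."""
    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(
            run_query, (QUERIES[i % len(QUERIES)] for i in range(iterations))))
    wall = time.monotonic() - started
    print(f"concurrency={concurrency} qps={iterations / wall:.1f} "
          f"p95={percentile(latencies, 95):.3f}s")

# Usage against a real test box:
#   for c in (1, 4, 8, 16):
#       stress(c)
```

Ramping concurrency while watching throughput and p95 latency shows where the single box saturates, which is the number that actually answers the "will 3 shards serve the purpose" question.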
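On the sharding and failover questions in the thread: pre-SolrCloud distributed search takes a `shards` parameter listing one host per shard, and failover across the load-balanced slave pairs is usually handled by a load balancer, though a simple client-side retry works too. A sketch, with placeholder hostnames matching the proposed 3-shard, 2-slaves-per-shard layout:

```python
# Sketch of a Solr 3.x distributed query with simple client-side failover.
# Hostnames are placeholders; a load balancer in front of each slave pair
# (or SolrCloud, in later versions) is the more common arrangement.
import urllib.parse
import urllib.request

# Two load-balanced slaves per shard (hypothetical hosts).
SHARD_REPLICAS = [
    ["shard1-slave1:8983/solr", "shard1-slave2:8983/solr"],
    ["shard2-slave1:8983/solr", "shard2-slave2:8983/solr"],
    ["shard3-slave1:8983/solr", "shard3-slave2:8983/solr"],
]

def build_url(query, choices):
    """Build a distributed-search URL using one replica per shard."""
    params = urllib.parse.urlencode(
        {"q": query, "shards": ",".join(choices), "wt": "json"})
    # The query is sent to the first chosen replica; it fans out to the rest.
    return f"http://{choices[0]}/select?{params}"

def search_with_failover(query):
    """Try the primary replica of each shard, then fall back to the backups."""
    for attempt in range(2):
        choices = [replicas[attempt % len(replicas)]
                   for replicas in SHARD_REPLICAS]
        try:
            with urllib.request.urlopen(build_url(query, choices),
                                        timeout=10) as resp:
                return resp.read()
        except OSError:
            continue  # network/HTTP failure: retry with the backup replicas
    raise RuntimeError("all replicas failed")
```

Note that plain distributed search has no built-in failover: if any host in the `shards` list is down the whole request fails, which is why the retry (or a load balancer per shard) is needed.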
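The thread also describes a monthly full index plus daily incremental indexing from SQL Server. Assuming the DataImportHandler is the import mechanism (the thread does not say which is used), those runs are typically triggered over HTTP, e.g. from cron. Host and core are placeholders; `dataimport` is the handler's conventional path:

```python
# Sketch: trigger DataImportHandler runs over HTTP. This assumes DIH is
# configured for the SQL Server import (an assumption, not stated in the
# thread); the master hostname is a placeholder.
import urllib.parse
import urllib.request

def dataimport_url(base, command):
    """Build a DIH request; command is 'full-import' or 'delta-import'."""
    params = urllib.parse.urlencode({"command": command, "commit": "true"})
    return f"{base}/dataimport?{params}"

def trigger(base, command):
    """Fire the import command and return the handler's status response."""
    with urllib.request.urlopen(dataimport_url(base, command),
                                timeout=30) as resp:
        return resp.read()

# Scheduled from cron on each shard's master:
#   trigger("http://solr-master1:8983/solr", "full-import")   # monthly rebuild
#   trigger("http://solr-master1:8983/solr", "delta-import")  # daily increment
```

The delta-import path needs deltaQuery/deltaImportQuery definitions in the DIH config so only rows changed since the last run are re-fetched from SQL Server.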