Re: Architecture and Capacity planning for large Solr index

2011-11-23 Thread Erick Erickson
Whether three shards will give you adequate throughput is not an
answerable question. Here's what I suggest. Get a single box
of the size you expect your servers to be and index 1/3 of your
documents on it. Run stress tests. That's really the only way to
be fairly sure your hardware is adequate.
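A minimal driver for such a stress test might look like the sketch below (Python; the Solr URL, queries, and worker count are placeholders you would replace with your own — the point is to measure latency percentiles under concurrency that matches your expected peak):

```python
import concurrent.futures
import time
import urllib.parse
import urllib.request

def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already-sorted list of latencies."""
    if not sorted_vals:
        raise ValueError("no samples")
    k = int(round(p / 100.0 * len(sorted_vals))) - 1
    return sorted_vals[max(0, min(len(sorted_vals) - 1, k))]

def query_once(base_url, q):
    """Issue one Solr query; return its latency in seconds."""
    params = urllib.parse.urlencode({"q": q, "wt": "json"})
    start = time.time()
    urllib.request.urlopen("%s/select?%s" % (base_url, params)).read()
    return time.time() - start

def run_load(base_url, queries, workers=10):
    """Fire the queries from a thread pool; return sorted latencies."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return sorted(pool.map(lambda q: query_once(base_url, q), queries))

# Usage against a real test box (illustrative host and queries):
#   lat = run_load("http://testbox:8983/solr", ["foo", "bar baz"] * 500)
#   print("p50=%.3fs p95=%.3fs" % (percentile(lat, 50), percentile(lat, 95)))
```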

As far as SANs are concerned, local storage is almost always
better. I'd advise against trying to share the index amongst
slaves, SAN or not. And using the SAN for each slave's copy
seems unnecessary with storage as cheap as it is. What
advantage do you see in this scenario?

Best
Erick

On Mon, Nov 21, 2011 at 3:18 PM, Rahul Warawdekar
rahul.warawde...@gmail.com wrote:
 Thanks Otis !
 Please ignore my earlier email, which did not have all the information.

 My business requirements have changed a bit.
 We now need one year rolling data in Production, with the following details
    - Number of records - 1.2 million
    - Solr index size for these records comes to approximately 200 - 220
 GB. (includes large attachments)
    - Approx 250 users who will be searching the application, with a peak of
 1 search request every 40 seconds.
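For rough sizing, those numbers imply a fairly modest query rate; a back-of-the-envelope check (assuming the peak means one request per user every 40 seconds):

```python
users = 250
seconds_between_requests = 40.0   # peak: one search per user every 40 s

peak_qps = users / seconds_between_requests
print("peak cluster QPS: %.2f" % peak_qps)               # 6.25

index_gb = 220.0                  # upper estimate of total index size
shards = 3
print("index per shard: %.1f GB" % (index_gb / shards))  # ~73.3
```

At roughly 6 QPS, index size rather than query concurrency looks like the dominant constraint here — though, as noted elsewhere in the thread, only a stress test will confirm that.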

 I am planning to address this using Solr distributed search on a VMWare
 virtualized environment as follows.

 1. Whole index to be split up between 3 shards, with 3 masters and 6 slaves
 (load balanced)

 2. Master configuration for each server is as follows
    - 4 CPUs
    - 16 GB RAM
    - 300 GB disk space

 3. Slave configuration for each server is as follows
    - 4 CPUs
    - 16 GB RAM
    - 150 GB disk space

 4. I am planning to use a SAN instead of local storage to store the Solr index.

 And my questions are as follows:
 Will 3 shards serve the purpose here?
 Is SAN a good option for storing the Solr index, given the high index volume?
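For reference, in Solr 3.x (pre-SolrCloud) a distributed search across the three shards is requested explicitly via the `shards` parameter; a sketch of such a request, with made-up host names:

```python
import urllib.parse

# Hypothetical slave endpoints, one per shard (each behind the load balancer).
shard_urls = [
    "solr-slave1:8983/solr",
    "solr-slave2:8983/solr",
    "solr-slave3:8983/solr",
]

params = urllib.parse.urlencode({
    "q": "contract attachment",
    "rows": 10,
    "shards": ",".join(shard_urls),  # Solr fans the query out and merges results
})
print("http://solr-slave1:8983/solr/select?" + params)
```

Note that 3.x distributed search has no built-in failover if a shard endpoint dies; the load balancer in front of each shard's slaves is what provides it, which matches the 6-slave layout proposed above.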




 On Mon, Nov 21, 2011 at 3:05 PM, Rahul Warawdekar 
 rahul.warawde...@gmail.com wrote:

 Thanks !

 My business requirements have changed a bit.
 We need one year rolling data in Production.
 The index size for the same comes to approximately 200 - 220 GB.
 I am planning to address this using Solr distributed search as follows.

 1. Whole index to be split up between 3 shards, with 3 masters and 6
 slaves (load balanced)
 2. Master configuration
  will be 4 CPU



 On Tue, Oct 11, 2011 at 2:05 PM, Otis Gospodnetic 
 otis_gospodne...@yahoo.com wrote:

 Hi Rahul,

 This is unfortunately not enough information for anyone to give you very
 precise answers, so I'll just give some rough ones:

 * best disk - SSD :)
 * CPU - multicore, depends on query complexity, concurrency, etc.
 * sharded search and failover - start with SolrCloud; there are a couple
 of pages about it on the Wiki, and see
 http://blog.sematext.com/2011/09/14/solr-digest-spring-summer-2011-part-2-solr-cloud-and-near-real-time-search/

 Otis
 
 Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
 Lucene ecosystem search :: http://search-lucene.com/


 
 From: Rahul Warawdekar rahul.warawde...@gmail.com
 To: solr-user solr-user@lucene.apache.org
 Sent: Tuesday, October 11, 2011 11:47 AM
 Subject: Architecture and Capacity planning for large Solr index
 
 Hi All,
 
 I am working on a Solr search based project, and would highly appreciate
 help/suggestions from you all regarding Solr architecture and capacity
 planning.
 Details of the project are as follows
 
 1. There are 2 databases from which data needs to be indexed and made
 searchable:
                 - Production
                 - Archive
 2. The Production database will retain 6 months of data and archive older
 data every month.
 3. The Archive database will retain 3 years of data.
 4. The database is SQL Server 2008 and the Solr version is 3.1.
 
 Data to be indexed contains a huge volume of attachments (PDF, Word,
 Excel, etc.), approximately 200 GB per month.
 We are planning to do a full index every month (multithreaded) and
 incremental indexing on a daily basis.
 The Solr index size is coming to approximately 25 GB per month.
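If the SQL Server data comes in through DataImportHandler (a common setup, though not stated here), the daily incremental pass is just a delta-import triggered over HTTP; the host and handler path below are illustrative:

```python
import urllib.parse

# Request a nightly cron job might send to each master (illustrative host).
base = "http://solr-master1:8983/solr/dataimport"
params = urllib.parse.urlencode({
    "command": "delta-import",  # incremental: only rows changed since last run
    "clean": "false",           # keep existing documents
    "commit": "true",           # commit when the import finishes
})
print(base + "?" + params)
```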
 
 If we were to use distributed search, what would be the best
 configuration
 for Production as well as Archive indexes?
 What would be the best CPU/RAM/Disk configuration?
 How can I implement a failover mechanism for sharded searches?
 
 Please let me know in case I need to share more information.
 
 
 --
 Thanks and Regards
 Rahul A. Warawdekar
 
 
 








