RE: Solr server requirements for 100+ million documents

2014-02-11 Thread Susheel Kumar
Hi Otis,

Just to confirm, the 3 servers you mean here are 2 for shards/nodes and 1 for 
Zookeeper. Is that correct?

Thanks,
Susheel

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Friday, January 24, 2014 5:21 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr server requirements for 100+ million documents

Hi Susheel,

Like Erick said, it's impossible to give precise recommendations, but making a 
few assumptions and combining them with experience (+ a licked finger in the 
air):
* 3 servers
* 32 GB
* 2+ CPU cores
* Linux

Assuming docs are not bigger than a few KB, that they are not being reindexed 
over and over, that you don't have a search rate higher than a few dozen QPS, 
that your queries are not a page long, etc., and that best practices are 
followed, the above should be sufficient.

I hope this helps.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Jan 24, 2014 at 1:10 PM, Susheel Kumar
<susheel.ku...@thedigitalgroup.net> wrote:

 Hi,

 Currently we are indexing 10 million documents from a database (10 DB data
 entities) & the index size is around 8 GB on a Windows virtual box. Indexing
 in one shot takes 12+ hours, while indexing in parallel in separate cores
 & merging them together takes 4+ hours.

 We are looking to scale to 100+ million documents and are looking for
 recommendations on server requirements for the parameters below for a
 production environment. There can be 200+ users performing searches at the
 same time.

 * No. of physical servers (considering SolrCloud)
 * Memory requirement
 * Processor requirement (# of cores)
 * Linux as OS, as opposed to Windows

 Thanks in advance.
 Susheel




Re: Solr server requirements for 100+ million documents

2014-02-11 Thread Otis Gospodnetic
Hi Susheel,

No, we wouldn't want to go with just 1 ZK. :)

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Feb 11, 2014 at 5:18 PM, Susheel Kumar
<susheel.ku...@thedigitalgroup.net> wrote:

 Hi Otis,

 Just to confirm, the 3 servers you mean here are 2 for shards/nodes and 1
 for Zookeeper. Is that correct?

 Thanks,
 Susheel



RE: Solr server requirements for 100+ million documents

2014-02-11 Thread Susheel Kumar
Thanks, Otis, for the quick reply. So for ZK do you recommend separate servers,
and if so, how many for an initial SolrCloud cluster setup?

-Original Message-
From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com] 
Sent: Tuesday, February 11, 2014 4:21 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr server requirements for 100+ million documents

Hi Susheel,

No, we wouldn't want to go with just 1 ZK. :)

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Tue, Feb 11, 2014 at 5:18 PM, Susheel Kumar  
susheel.ku...@thedigitalgroup.net wrote:

 Hi Otis,

 Just to confirm, the 3 servers you mean here are 2 for shards/nodes 
 and 1 for Zookeeper. Is that correct?

 Thanks,
 Susheel

 -Original Message-
 From: Otis Gospodnetic [mailto:otis.gospodne...@gmail.com]
 Sent: Friday, January 24, 2014 5:21 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr server requirements for 100+ million documents

 Hi Susheel,

 Like Erick said, it's impossible to give precise recommendations, but 
 making a few assumptions and combining them with experience (+ a 
 licked finger in the air):
 * 3 servers
 * 32 GB
 * 2+ CPU cores
 * Linux

 Assuming docs are not bigger than a few KB, that they are not being 
 reindexed over and over, that you don't have a search rate higher than 
 a few dozen QPS, assuming your queries are not a page long, etc. 
 assuming best practices are followed, the above should be sufficient.

 I hope this helps.

 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics Solr  
 Elasticsearch Support * http://sematext.com/


 On Fri, Jan 24, 2014 at 1:10 PM, Susheel Kumar  
 susheel.ku...@thedigitalgroup.net wrote:

  Hi,
 
  Currently we are indexing 10 million document from database (10 db 
  data
  entities)  index size is around 8 GB on windows virtual box. 
  Indexing in one shot taking 12+ hours while indexing parallel in 
  separate cores  merging them together taking 4+ hours.
 
  We are looking to scale to 100+ million documents and looking for 
  recommendation on servers requirements on below parameters for a 
  Production environment. There can be 200+ users performing search 
  same
 time.
 
  No of physical servers (considering solr cloud) Memory requirement 
  Processor requirement (# cores) Linux as OS oppose to windows
 
  Thanks in advance.
  Susheel
 
 



Re: Solr server requirements for 100+ million documents

2014-02-11 Thread svante karlsson
ZK needs a quorum to stay functional, so 3 servers handle one failure and 5
handle 2 node failures. If you run Solr with 1 replica per shard, then stick to
3 ZK nodes. If you use 2 replicas, use 5 ZK.
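
To make the arithmetic explicit: a quorum is a strict majority, floor(n/2) + 1,
and an ensemble of n nodes survives n - quorum failures. A tiny illustrative
Java check (my own sketch, not anything from the ZooKeeper API):

public class ZkQuorumMath {
    static int quorum(int n)    { return n / 2 + 1; }     // strict majority
    static int tolerated(int n) { return n - quorum(n); } // failures survived

    public static void main(String[] args) {
        for (int n : new int[] {1, 3, 5}) {
            // prints: 1 node -> survives 0, 3 -> survives 1, 5 -> survives 2
            System.out.printf("%d ZK nodes: quorum=%d, survives %d failure(s)%n",
                    n, quorum(n), tolerated(n));
        }
    }
}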








Re: Solr server requirements for 100+ million documents

2014-02-11 Thread Jason Hellman
Whether you use the same machines as Solr or separate machines is a matter of 
taste.

If you are the CTO, then you should make this decision.  If not, inform 
management that risk conditions are greater when you share function and control 
on a single piece of hardware.  A single failure of a replica + zookeeper node 
will be more impactful than a single failure of a replica *or* a zookeeper 
node.  Let them earn the big bucks to make the risk decision.

The good news is, zookeeper hardware can be extremely lightweight for Solr 
Cloud.  Commodity hardware should work just fine…and thus scaling to 5 nodes 
for zookeeper is not that hard at all.

Jason


On Feb 11, 2014, at 3:00 PM, svante karlsson <s...@csi.se> wrote:

 ZK needs a quorum to stay functional, so 3 servers handle one failure and 5
 handle 2 node failures. If you run Solr with 1 replica per shard, then stick
 to 3 ZK nodes. If you use 2 replicas, use 5 ZK.


Re: Solr server requirements for 100+ million documents

2014-02-11 Thread Shawn Heisey

On 2/11/2014 3:28 PM, Susheel Kumar wrote:

Thanks, Otis for quick reply. So for ZK do you recommend separate servers and 
if so how many for initial Solr cloud cluster setup.


In a minimal 3-server setup, all servers would run zookeeper and two of 
them would also run Solr. With this setup, you can survive the failure of 
any of those three machines, even if it dies completely.
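
As an illustration of that layout (hostnames are hypothetical; the zoo.cfg
entries and the zkHost property are the standard ZooKeeper-ensemble and
Solr 4.x SolrCloud settings):

# zoo.cfg, identical on all three machines (each also needs its own myid file)
server.1=host1:2888:3888
server.2=host2:2888:3888
server.3=host3:2888:3888

# Start Solr on host1 and host2 only, pointing at the full ZK ensemble
java -DzkHost=host1:2181,host2:2181,host3:2181 -jar start.jar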


If the third machine is only running zookeeper, two fast CPU cores and 
2GB of RAM would be plenty.  For 100 million documents, I would 
personally recommend at least 8 CPU cores on the machines running Solr, 
ideally provided by at least two separate physical CPUs.  Otis 
recommended 32GB of RAM as a starting point.  You would very likely want 
more.


One copy of my 90 million document index uses two servers to run all the 
shards.  Because I have two copies of the index, I have four servers.  
Each server has 64GB of RAM.  This is **NOT** running SolrCloud, but if 
it were, I would have zookeeper running on three of those servers.


Thanks,
Shawn



Re: Solr server requirements for 100+ million documents

2014-01-28 Thread Jack Krupansky
Lucene and Solr work best if the full index can be cached in OS memory. 
Sure, Lucene/Solr does work properly once the index no longer fits, but 
performance will drop off.

I would say that you could fit 100 million moderate-size documents on a 
single Solr server - provided that you give the OS enough RAM for the full 
Lucene index. That said, if you want to configure a SolrCloud cluster with 
shards, you can use more modest, commodity servers with less RAM, provided 
each server still fits its fraction of the total Lucene index in that 
server's OS memory (file cache). For scale: earlier in this thread, 10 
million documents produced roughly an 8 GB index, so 100 million similar 
documents implies on the order of 80 GB of index to keep cached, either on 
one big-RAM box or split across shards.

You may also need to add replicas for each shard to accommodate query load - 
proof-of-concept testing is needed to verify that. It is worth noting that 
sharding can improve total query performance, since each node only searches a 
fraction of the total data and those searches are done in parallel (since 
they are on different machines).


-- Jack Krupansky


RE: Solr server requirements for 100+ million documents

2014-01-28 Thread Susheel Kumar
Thanks, Jack. That helps.


Re: Solr server requirements for 100+ million documents

2014-01-28 Thread Jorge Luis Betancourt Gonzalez
Earlier on the list a sizing spreadsheet was mentioned. Given that you already 
have documents in an index, you could extract the needed information from your 
index and feed it into the spreadsheet, and it will probably give you a rough 
approximation of the hardware you'll be needing. Also, if I'm not mistaken, no 
SolrCloud approximation is provided by this “tool”.

Greetings!

On Jan 28, 2014, at 11:02 PM, Susheel Kumar <susheel.ku...@thedigitalgroup.net>
wrote:

 Thanks, Jack. That helps.
 

Re: Solr server requirements for 100+ million documents

2014-01-26 Thread Erick Erickson
Dumping the raw data would probably be a good idea. I guarantee you'll be
re-indexing the data several times as you change the schema to accommodate
different requirements...

But it may also be worth spending some time figuring out why the DB
access is slow. Sometimes one can tune that.

If you go the SolrJ route, you also have the possibility of setting up N clients
to work simultaneously, sometimes that'll help.

FWIW,
Erick

On Sat, Jan 25, 2014 at 11:06 PM, Susheel Kumar
<susheel.ku...@thedigitalgroup.net> wrote:
 Hi Kranti,

 Attached are the solrconfig & schema XML for review. I did run indexing with 
 just a few fields (5-6 fields) in schema.xml & kept the same db config, but 
 indexing still takes a similar time (on average 1 million records per hour), 
 which confirms that the bottleneck is the data acquisition, which in our 
 case is the Oracle database. I am thinking of not using DataImportHandler / 
 JDBC to get data from Oracle, but rather dumping data somehow from Oracle 
 using SQL loader and then indexing it. Any thoughts?

 Thnx

 -Original Message-
 From: Kranti Parisa [mailto:kranti.par...@gmail.com]
 Sent: Saturday, January 25, 2014 12:08 AM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr server requirements for 100+ million documents

 Can you post the complete solrconfig.xml and schema.xml files, to review 
 all of your settings that would impact your indexing performance?

 Thanks,
 Kranti K. Parisa
 http://www.linkedin.com/in/krantiparisa



 On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar
 <susheel.ku...@thedigitalgroup.net> wrote:

 Thanks, Svante. Your indexing speed using the db seems to be really fast. Can
 you please provide some more detail on how you are indexing db
 records. Is it thru DataImportHandler? And what database? Is that a
 local db?  We are indexing around 70 fields (60 multivalued), but data
 is not always populated in all fields. The average document size is 
 5-10 KB.

 -Original Message-
 From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf
 Of svante karlsson
 Sent: Friday, January 24, 2014 5:05 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr server requirements for 100+ million documents

 I just indexed 100 million db docs (records) with 22 fields (4
 multivalued) in 9524 sec using libcurl.
 11 million took 763 seconds, so the speed drops somewhat with
 increasing db size.

 We write 1000 docs (just an arbitrary number) in each request from two
 threads. If you will be using SolrCloud you will want more writer threads.

 The hardware is a single cheap HP DL320E GEN8 V2 1P E3-1220V3 with one
 SSD and 32GB, and Solr runs on Ubuntu 13.10 inside an ESXi virtual machine.

 /svante




 2014/1/24 Susheel Kumar <susheel.ku...@thedigitalgroup.net>

  Thanks, Erick for the info.
 
  For indexing I agree that more of the time is consumed in data acquisition,
  which in our case comes from the database.  For indexing currently we are
  using a manual process, i.e. the Solr dashboard Data Import, but we are now
  looking to automate.  How do you suggest we automate the indexing part?
  Do you recommend SolrJ, or should we try to automate using curl?
 
 
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: Friday, January 24, 2014 2:59 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Solr server requirements for 100+ million documents
 
  Can't be done with the information you provided, and can only be
  guessed at even with more comprehensive information.
 
  Here's why:
 
 
  http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/
 
  Also, at a guess, your indexing speed is so slow due to data
  acquisition; I rather doubt you're being limited by raw Solr indexing.
  If you're using SolrJ, try commenting out the
  server.add() bit and running again. My guess is that your indexing
  speed will be almost unchanged, in which case the data
  acquisition process is where you should concentrate efforts. As a
  comparison, I can index 11M Wikipedia docs on my laptop in 45
  minutes without any attempts at parallelization.
 
 
  Best,
  Erick
 
  On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar 
  susheel.ku...@thedigitalgroup.net wrote:
   Hi,
  
    Currently we are indexing 10 million documents from a database (10 DB
    data entities) & the index size is around 8 GB on a Windows virtual box.
    Indexing in one shot takes 12+ hours, while indexing in parallel in
    separate cores & merging them together takes 4+ hours.
   
    We are looking to scale to 100+ million documents and are looking for
    recommendations on server requirements for the parameters below for a
    production environment. There can be 200+ users performing searches at
    the same time.
   
    * No. of physical servers (considering SolrCloud)
    * Memory requirement
    * Processor requirement (# of cores)
    * Linux as OS, as opposed to Windows
   
    Thanks in advance.
    Susheel
  
 



RE: Solr server requirements for 100+ million documents

2014-01-26 Thread Susheel Kumar
Thank you, Erick, for your valuable inputs. Yes, we have to re-index data 
again & again. I'll look into the possibility of tuning DB access.

On SolrJ and automating the indexing (incremental as well as one-time), I want 
to get your opinion on the two points below. We will be indexing separate sets 
of tables with similar data structure.

- Should we use SolrJ and write Java programs that can be scheduled to trigger 
indexing on demand or on a schedule?

- Is using SolrJ a better idea even for searching than using SolrNet? Our 
frontend is in .NET, so we started using SolrNet, but I am afraid that down 
the road, when we scale to/support SolrCloud, using SolrJ would be better.


Thanks
Susheel

Re: Solr server requirements for 100+ million documents

2014-01-26 Thread Erick Erickson
1) That's what I'd do. For incremental updates you might have to
create a trigger on the main table and insert rows into another table
that is then used to do the incremental updates. This is particularly
relevant for deletes. Consider the case where you've ingested all your
data and then rows are deleted. Removing those same documents from Solr
requires either a) re-indexing everything, or b) getting all the docs
in Solr and comparing them with the rows in the DB, etc., which is
expensive, or c) recording the changes as above and just processing
deletes from the change table.

2) SolrJ is usually the most current. I don't know how much work
SolrNet gets. However, under the covers it's just HTTP calls, so since
you have access in either to just adding HTTP parameters, you should
be able to get the full functionality out of either. I _think_ that
I'd go with whatever you're most comfortable with.

Best,
Erick
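
A minimal SolrJ sketch of option c), assuming a hypothetical change table
SOLR_CHANGES(ID, OP) that the trigger fills; the HttpSolrServer / deleteById /
commit calls are the SolrJ 4.x API, while the JDBC URL and table names are
made up for illustration:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;
import java.util.ArrayList;
import java.util.List;
import org.apache.solr.client.solrj.impl.HttpSolrServer;

public class ChangeTableSync {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        try (Connection db = DriverManager.getConnection(
                "jdbc:oracle:thin:@dbhost:1521:ORCL", "user", "pass")) {
            // Collect the ids the trigger marked as deleted ('D')
            List<String> deletes = new ArrayList<String>();
            try (Statement st = db.createStatement();
                 ResultSet rs = st.executeQuery("SELECT id, op FROM solr_changes")) {
                while (rs.next()) {
                    if ("D".equals(rs.getString("op"))) {
                        deletes.add(rs.getString("id"));
                    }
                    // 'I'/'U' rows would be re-read from the source tables and re-added
                }
            }
            if (!deletes.isEmpty()) {
                solr.deleteById(deletes); // remove by uniqueKey
                solr.commit();
            }
            try (Statement st = db.createStatement()) {
                st.executeUpdate("DELETE FROM solr_changes"); // batch processed
            }
        }
        solr.shutdown();
    }
}

Run on a schedule (cron, or a ScheduledExecutorService), this also covers the
on-demand/scheduled indexing asked about above.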


Re: Solr server requirements for 100+ million documents

2014-01-26 Thread simon
Erick's probably too modest to say so ;=), but he wrote a great blog entry
on indexing with SolrJ -
http://searchhub.org/2012/02/14/indexing-with-solrj/ . I took the guts of
the code in that blog and easily customized it to write a very fast
indexer (content from MySQL; I excised all the Tika code as I am not using
it).

You should replace StreamingUpdateSolrServer with ConcurrentUpdateSolrServer
and experiment to find the optimal number of threads to configure.

-Simon
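
A bare-bones sketch of that swap under SolrJ 4.x; the URL, queue size, thread
count, and field names are illustrative values to experiment with:

import org.apache.solr.client.solrj.impl.ConcurrentUpdateSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class FastIndexer {
    public static void main(String[] args) throws Exception {
        // Buffers up to 10000 docs; 4 background threads drain the queue.
        // Tune both numbers against your own hardware, as Simon suggests.
        ConcurrentUpdateSolrServer solr = new ConcurrentUpdateSolrServer(
                "http://localhost:8983/solr/collection1", 10000, 4);
        for (int i = 0; i < 1000000; i++) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", Integer.toString(i));
            doc.addField("name", i); // placeholder fields
            solr.add(doc);           // buffered; sent in the background
        }
        solr.blockUntilFinished();   // drain the queue
        solr.commit();
        solr.shutdown();
    }
}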



Re: Solr server requirements for 100+ million documents

2014-01-25 Thread svante karlsson
We are using a postgres server on a different host (same hardware as the
test solr server). The reason we take the data from the postgres server is
that it is easy to automate testing, since we use the same server to produce
queries. In production we preload Solr from a CSV file from a Hive (Hadoop)
job and then only write updates (< 500 / sec). In our use case we use Solr as
a NoSQL database, since we really want to do SHOULD queries against all the
fields. The fields are typically very small text fields (30 chars), but
occasionally bigger; I don't think I have more than 128 chars on anything in
the whole dataset.

<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.1">
  <types>
    <fieldType name="uuid" class="solr.UUIDField" indexed="true"/>
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
    <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
    <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
  </types>
  <fields>
    <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false"/>
    <field name="name" type="int" indexed="true" stored="true"/>
    <field name="fieldA" type="string" indexed="true" stored="true"/>
    <field name="fieldB" type="string" indexed="true" stored="true"/>
    <field name="fieldC" type="int" indexed="true" stored="true"/>
    <field name="fieldD" type="int" indexed="true" stored="true"/>
    <field name="fieldE" type="int" indexed="true" stored="true"/>
    <field name="fieldF" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldG" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldH" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldI" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldJ" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldK" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldL" type="string" indexed="true" stored="true"/>
    <field name="fieldM" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="fieldN" type="string" indexed="true" stored="true"/>
    <field name="fieldO" type="string" indexed="false" stored="true" required="false"/>
    <field name="ts" type="long" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>id</uniqueKey>
  <solrQueryParser defaultOperator="OR"/>
</schema>
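
Given that schema and the OR default operator, a SHOULD-style query across the
fields is just a list of clauses, any of which may match (the values here are
placeholders):

curl 'http://localhost:8983/solr/collection1/select?q=fieldA:red+fieldF:dog+fieldC:42'

With defaultOperator="OR", each clause is optional (SHOULD, in Lucene terms),
and documents matching more clauses score higher.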






Re: Solr server requirements for 100+ million documents

2014-01-25 Thread svante karlsson
That got away a little early...

The inserter is a small C++ program that uses pglib to speak to postgres
and an http-client library that uses libcurl under the hood. The
inserter draws very little CPU, and we normally use 2 writer threads that
each post 1000 records at a time. It's very inefficient to post one at a
time, but I've not done any specific testing to know whether 1000 is better
than 500.

What we're doing now is trying to figure out how to get the query
performance up, since it's not where we need it to be, so we're not done
either...
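
For reference, that kind of batched write can be reproduced from a shell with
plain curl against Solr's standard /update handler (the URL and field names
are placeholders; a real batch would carry ~1000 doc elements rather than two):

curl 'http://localhost:8983/solr/collection1/update' \
  -H 'Content-Type: text/xml' --data-binary '
<add>
  <doc><field name="id">doc-1</field><field name="ts">1390600000</field></doc>
  <doc><field name="id">doc-2</field><field name="ts">1390600001</field></doc>
</add>'

# Commit once at the end of the run rather than per batch:
curl 'http://localhost:8983/solr/collection1/update?commit=true'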



Re: Solr server requirements for 100+ million documents

2014-01-25 Thread Erick Erickson
Hmmm, I'm always suspicious when I see a schema.xml with a lot of string
types. This is tangential to your question, but I thought I'd butt in anyway.

String types are totally unanalyzed. So if the input for a field is "I like
Strings", the only match will be "I like Strings". "I like strings" won't
match, due to the lower-case 's' in "strings"; "like" won't match since it
isn't the complete input.

You may already know this, but thought I'd point it out. For tokenized
searches, text_general is a good place to start. Pardon me if this is
repeating what you already know...

Lots of string types sometimes lead people with DB backgrounds to
search for *like*, which will be slow, FWIW.

Best,
Erick
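
For comparison, a tokenized alternative along the lines Erick suggests -
essentially the stock text_general type from the Solr 4.x example schema,
trimmed to its core analyzer chain:

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

With this type, "I like Strings" is indexed as the tokens [i] [like] [strings],
so queries for "like" or "strings" match; a string field matches only the exact
original value.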

On Sat, Jan 25, 2014 at 5:51 AM, svante karlsson s...@csi.se wrote:
 That got away a little early...

 The inserter is a small C++ program that uses pglib to speek to postgres
 and the a http-client library that uses libcurl under the hood. The
 inserter draws very little CPU and we normally use 2 writer threads that
 each posts 1000 records at a time. Its very inefficient to post one at a
 time but I've not done any specific testing to know if 1000 is better that
 500

 What we're doing now is trying to figure out how to get the query
 performance up since is not where we need it to be so we're not done
 either...


 2014/1/25 svante karlsson s...@csi.se

 We are using a postgres server on a different host (same hardware as the
 test solr server). The reason we take the data from the postgres server is
 that is easy to automate testing since we use the same server to produce
 queries. In production we preload the solr from a csv file from a hive
 (hadoop) job and then only write updates (  500 / sec ). In our usecase we
 use solr as NoSQL dabase since we really want to do SHOULD queries against
 all the fields. The fields are typically very small text fields (30 chars)
 but occasionally bigger but I don't think I have more than 128 chars on
 anything in the whole dataset.

 <?xml version="1.0" encoding="UTF-8" ?>
 <schema name="example" version="1.1">
   <types>
     <fieldType name="uuid" class="solr.UUIDField" indexed="true" />
     <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
     <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true"/>
     <fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0"/>
     <fieldType name="int" class="solr.TrieIntField" precisionStep="0" positionIncrementGap="0"/>
     <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
   </types>
   <fields>
     <field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
     <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
     <field name="name" type="int" indexed="true" stored="true"/>
     <field name="fieldA" type="string" indexed="true" stored="true"/>
     <field name="fieldB" type="string" indexed="true" stored="true"/>
     <field name="fieldC" type="int" indexed="true" stored="true"/>
     <field name="fieldD" type="int" indexed="true" stored="true"/>
     <field name="fieldE" type="int" indexed="true" stored="true"/>
     <field name="fieldF" type="string" indexed="true" stored="true" multiValued="true"/>
     <field name="fieldG" type="string" indexed="true" stored="true" multiValued="true"/>
     <field name="fieldH" type="string" indexed="true" stored="true" multiValued="true"/>
     <field name="fieldI" type="string" indexed="true" stored="true" multiValued="true"/>
     <field name="fieldJ" type="string" indexed="true" stored="true" multiValued="true"/>
     <field name="fieldK" type="string" indexed="true" stored="true" multiValued="true"/>
     <field name="fieldL" type="string" indexed="true" stored="true"/>
     <field name="fieldM" type="string" indexed="true" stored="true" multiValued="true"/>
     <field name="fieldN" type="string" indexed="true" stored="true"/>
     <field name="fieldO" type="string" indexed="false" stored="true" required="false" />
     <field name="ts" type="long" indexed="true" stored="true"/>
   </fields>
   <uniqueKey>id</uniqueKey>
   <solrQueryParser defaultOperator="OR"/>
 </schema>





 2014/1/25 Kranti Parisa kranti.par...@gmail.com

 can you post the complete solrconfig.xml file and schema.xml files to
 review all of your settings that would impact your indexing performance.

 Thanks,
 Kranti K. Parisa
 http://www.linkedin.com/in/krantiparisa



 On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar 
 susheel.ku...@thedigitalgroup.net wrote:

  Thanks, Svante. Your indexing speed using db seems to really fast. Can
 you
  please provide some more detail on how you are indexing db records. Is
 it
  thru DataImportHandler? And what database? Is that local db?  We are
  indexing around 70 fields (60 multivalued) but data is not populated
 always
  in all fields. The average size of document is in 5-10 kbs.
 
  -Original Message-
  From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of
  svante karlsson
  Sent: Friday, January 24, 2014 5:05 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Solr server requirements for 100+ million documents
 
  I just indexed 100 million db docs (records) with 22 fields (4
  multivalued) in 9524 sec using libcurl.



RE: Solr server requirements for 100+ million documents

2014-01-25 Thread Susheel Kumar
Hi Kranti,

Attached are the solrconfig & schema XML for review. I did run indexing with just 
a few fields (5-6) in schema.xml, keeping the same db config, but indexing 
still took about the same time (roughly 1 hour per million records), which 
confirms that the bottleneck is in data acquisition, which in our case is the 
Oracle database. I am thinking of not using DataImportHandler/JDBC to get data 
from Oracle, but rather dumping the data from Oracle somehow, e.g. with SQL*Loader, 
and then indexing the dump. Any thoughts?
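
One hedged way to do the dump-then-index route is to post the exported file
straight at Solr's CSV update handler, bypassing DIH/JDBC entirely. A minimal
SolrJ sketch; the URL, file name, and field layout are assumptions, not the
actual setup:

import java.io.File;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

public class CsvLoad {
    public static void main(String[] args) throws Exception {
        HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");
        // /update/csv is Solr's CSV update handler; dump.csv stands in for
        // whatever file the Oracle export job produces.
        ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/csv");
        req.addFile(new File("dump.csv"), "text/csv");
        req.setParam("header", "true");   // first line of the dump holds field names
        req.setParam("commit", "true");   // commit once, after the whole load
        solr.request(req);
    }
}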

Thnx

-Original Message-
From: Kranti Parisa [mailto:kranti.par...@gmail.com] 
Sent: Saturday, January 25, 2014 12:08 AM
To: solr-user@lucene.apache.org
Subject: Re: Solr server requirements for 100+ million documents

can you post the complete solrconfig.xml file and schema.xml files to review 
all of your settings that would impact your indexing performance.

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar  
susheel.ku...@thedigitalgroup.net wrote:

 Thanks, Svante. Your indexing speed using db seems to really fast. Can 
 you please provide some more detail on how you are indexing db 
 records. Is it thru DataImportHandler? And what database? Is that 
 local db?  We are indexing around 70 fields (60 multivalued) but data 
 is not populated always in all fields. The average size of document is in 
 5-10 kbs.

 -Original Message-
 From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf 
 Of svante karlsson
 Sent: Friday, January 24, 2014 5:05 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr server requirements for 100+ million documents

 I just indexed 100 million db docs (records) with 22 fields (4
 multivalued) in 9524 sec using libcurl.
 11 million took 763 seconds so the speed drops somewhat with 
 increasing dbsize.

 We write 1000 docs (just an arbitrary number) in each request from two 
 threads. If you will be using solrcloud you will want more writer threads.

 The hardware is a single cheap hp DL320E GEN8 V2 1P E3-1220V3 with one 
 SSD and 32GB and the solr runs on ubuntu 13.10 inside a esxi virtual machine.

 /svante




 2014/1/24 Susheel Kumar susheel.ku...@thedigitalgroup.net

  Thanks, Erick for the info.
 
  For indexing I agree the more time is consumed in data acquisition 
  which in our case from Database.  For indexing currently we are 
  using the manual process i.e. Solr dashboard Data Import but now 
  looking to automate.  How do you suggest to automate the index part. 
  Do you recommend to use SolrJ or should we try to automate using Curl?
 
 
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: Friday, January 24, 2014 2:59 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Solr server requirements for 100+ million documents
 
  Can't be done with the information you provided, and can only be 
  guessed at even with more comprehensive information.
 
  Here's why:
 
 
  http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-
  we
  -dont-have-a-definitive-answer/
 
  Also, at a guess, your indexing speed is so slow due to data 
  acquisition; I rather doubt you're being limited by raw Solr indexing.
  If you're using SolrJ, try commenting out the
  server.add() bit and running again. My guess is that your indexing 
  speed will be almost unchanged, in which case it's the data 
  acquisition process is where you should concentrate efforts. As a 
  comparison, I can index 11M Wikipedia docs on my laptop in 45 
  minutes without any attempts at parallelization.
 
 
  Best,
  Erick
 
  On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar  
  susheel.ku...@thedigitalgroup.net wrote:
   Hi,
  
   Currently we are indexing 10 million document from database (10 db 
   data
  entities) & index size is around 8 GB on windows virtual box. 
  Indexing in one shot taking 12+ hours while indexing parallel in 
  separate cores & merging them together taking 4+ hours.
  
   We are looking to scale to 100+ million documents and looking for
  recommendation on servers requirements on below parameters for a 
  Production environment. There can be 200+ users performing search 
  same
 time.
  
   No of physical servers (considering solr cloud) Memory requirement 
   Processor requirement (# cores) Linux as OS oppose to windows
  
   Thanks in advance.
   Susheel
  
 



solrconfig.xml
Description: solrconfig.xml


schema.xml
Description: schema.xml


Re: Solr server requirements for 100+ million documents

2014-01-24 Thread Erick Erickson
Can't be done with the information you provided, and can only
be guessed at even with more comprehensive information.

Here's why:

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Also, at a guess, your indexing speed is so slow due to data
acquisition; I rather doubt
you're being limited by raw Solr indexing. If you're using SolrJ, try
commenting out the
server.add() bit and running again. My guess is that your indexing
speed will be almost
unchanged, in which case the data acquisition process is where
you should concentrate
efforts. As a comparison, I can index 11M Wikipedia docs on my laptop
in 45 minutes without
any attempts at parallelization.
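
A hedged sketch of that experiment (RowSource is a hypothetical stand-in for
the JDBC fetch; SolrServer is the SolrJ base type): run it once with doAdd
true, once with it false, and compare the two timings.

import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.common.SolrInputDocument;

public class AcquisitionTimer {

    // Hypothetical stand-in for the real DB/DIH row fetch.
    interface RowSource {
        boolean hasNext();
        SolrInputDocument next();
    }

    static void timeRun(RowSource rows, SolrServer server, boolean doAdd)
            throws Exception {
        long start = System.currentTimeMillis();
        while (rows.hasNext()) {
            SolrInputDocument doc = rows.next();  // data acquisition + mapping
            if (doAdd) {
                server.add(doc);                  // the server.add() bit to disable
            }
        }
        System.out.println((doAdd ? "with" : "without") + " server.add(): "
                + (System.currentTimeMillis() - start) + " ms");
    }
}

If the two timings come out close, Solr itself is not the bottleneck.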


Best,
Erick

On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar
susheel.ku...@thedigitalgroup.net wrote:
 Hi,

 Currently we are indexing 10 million document from database (10 db data 
  entities) & index size is around 8 GB on windows virtual box. Indexing in one 
  shot taking 12+ hours while indexing parallel in separate cores & merging 
 them together taking 4+ hours.

 We are looking to scale to 100+ million documents and looking for 
 recommendation on servers requirements on below parameters for a Production 
 environment. There can be 200+ users performing search same time.

 No of physical servers (considering solr cloud)
 Memory requirement
 Processor requirement (# cores)
 Linux as OS oppose to windows

 Thanks in advance.
 Susheel



RE: Solr server requirements for 100+ million documents

2014-01-24 Thread Susheel Kumar
Thanks, Erick for the info.

For indexing, I agree that more of the time is consumed in data acquisition, which 
in our case comes from the database. For indexing we are currently using a manual 
process, i.e. the Solr dashboard Data Import, but are now looking to automate. How 
do you suggest automating the indexing part? Do you recommend using SolrJ, or 
should we try to automate using curl?


-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Friday, January 24, 2014 2:59 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr server requirements for 100+ million documents

Can't be done with the information you provided, and can only be guessed at 
even with more comprehensive information.

Here's why:

http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

Also, at a guess, your indexing speed is so slow due to data acquisition; I 
rather doubt you're being limited by raw Solr indexing. If you're using SolrJ, 
try commenting out the
server.add() bit and running again. My guess is that your indexing speed will 
be almost unchanged, in which case it's the data acquisition process is where 
you should concentrate efforts. As a comparison, I can index 11M Wikipedia docs 
on my laptop in 45 minutes without any attempts at parallelization.


Best,
Erick

On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar 
susheel.ku...@thedigitalgroup.net wrote:
 Hi,

 Currently we are indexing 10 million document from database (10 db data 
  entities) & index size is around 8 GB on windows virtual box. Indexing in one 
  shot taking 12+ hours while indexing parallel in separate cores & merging 
 them together taking 4+ hours.

 We are looking to scale to 100+ million documents and looking for 
 recommendation on servers requirements on below parameters for a Production 
 environment. There can be 200+ users performing search same time.

 No of physical servers (considering solr cloud) Memory requirement 
 Processor requirement (# cores) Linux as OS oppose to windows

 Thanks in advance.
 Susheel



Re: Solr server requirements for 100+ million documents

2014-01-24 Thread svante karlsson
I just indexed 100 million db docs (records) with 22 fields (4 multivalued)
in 9524 sec using libcurl.
11 million took 763 seconds, so the speed drops somewhat with increasing
DB size.

We write 1000 docs (just an arbitrary number) in each request from two
threads. If you will be using SolrCloud, you will want more writer threads.

The hardware is a single cheap HP DL320E GEN8 V2 1P E3-1220V3 with one SSD
and 32 GB of RAM, and Solr runs on Ubuntu 13.10 inside an ESXi virtual machine.

/svante
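
The inserter described above is C++ over libcurl; a rough SolrJ equivalent of
the same pattern (two writer threads, 1000 docs per request, one commit at the
end) might look like the sketch below. The URL, document count, and makeDoc()
helper are placeholders:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class BatchIndexer {
    static final int BATCH_SIZE = 1000;      // arbitrary, as noted above
    static final int WRITERS = 2;            // raise this for SolrCloud
    static final int TOTAL_DOCS = 11000000;  // placeholder document count

    public static void main(String[] args) throws Exception {
        // HttpSolrServer is thread-safe, so the writer threads share one instance.
        final HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/collection1");
        ExecutorService pool = Executors.newFixedThreadPool(WRITERS);
        for (int w = 0; w < WRITERS; w++) {
            final int offset = w;
            pool.submit(new Runnable() {
                public void run() {
                    List<SolrInputDocument> batch =
                            new ArrayList<SolrInputDocument>(BATCH_SIZE);
                    try {
                        for (int i = offset; i < TOTAL_DOCS; i += WRITERS) {
                            batch.add(makeDoc(i));
                            if (batch.size() == BATCH_SIZE) {
                                solr.add(batch);  // one HTTP POST per 1000 docs
                                batch.clear();
                            }
                        }
                        if (!batch.isEmpty()) solr.add(batch);
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.DAYS);
        solr.commit();  // single commit at the end, not per batch
    }

    // Placeholder document builder standing in for the real DB fetch.
    static SolrInputDocument makeDoc(int i) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", String.valueOf(i));
        doc.addField("ts", System.currentTimeMillis());
        return doc;
    }
}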




2014/1/24 Susheel Kumar susheel.ku...@thedigitalgroup.net

 Thanks, Erick for the info.

 For indexing I agree the more time is consumed in data acquisition which
 in our case from Database.  For indexing currently we are using the manual
 process i.e. Solr dashboard Data Import but now looking to automate.  How
 do you suggest to automate the index part. Do you recommend to use SolrJ or
 should we try to automate using Curl?


 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Friday, January 24, 2014 2:59 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr server requirements for 100+ million documents

 Can't be done with the information you provided, and can only be guessed
 at even with more comprehensive information.

 Here's why:


 http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we-dont-have-a-definitive-answer/

 Also, at a guess, your indexing speed is so slow due to data acquisition;
 I rather doubt you're being limited by raw Solr indexing. If you're using
 SolrJ, try commenting out the
 server.add() bit and running again. My guess is that your indexing speed
 will be almost unchanged, in which case it's the data acquisition process
 is where you should concentrate efforts. As a comparison, I can index 11M
 Wikipedia docs on my laptop in 45 minutes without any attempts at
 parallelization.


 Best,
 Erick

 On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar 
 susheel.ku...@thedigitalgroup.net wrote:
  Hi,
 
  Currently we are indexing 10 million document from database (10 db data
  entities) & index size is around 8 GB on windows virtual box. Indexing in
  one shot taking 12+ hours while indexing parallel in separate cores &
  merging them together taking 4+ hours.
 
  We are looking to scale to 100+ million documents and looking for
 recommendation on servers requirements on below parameters for a Production
 environment. There can be 200+ users performing search same time.
 
  No of physical servers (considering solr cloud) Memory requirement
  Processor requirement (# cores) Linux as OS oppose to windows
 
  Thanks in advance.
  Susheel
 



Re: Solr server requirements for 100+ million documents

2014-01-24 Thread Otis Gospodnetic
Hi Susheel,

Like Erick said, it's impossible to give precise recommendations, but
making a few assumptions and combining them with experience (+ a licked
finger in the air):
* 3 servers
* 32 GB
* 2+ CPU cores
* Linux

Assuming docs are not bigger than a few KB, that they are not being
reindexed over and over, that you don't have a search rate higher than a
few dozen QPS, assuming your queries are not a page long, etc. assuming
best practices are followed, the above should be sufficient.

I hope this helps.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Fri, Jan 24, 2014 at 1:10 PM, Susheel Kumar 
susheel.ku...@thedigitalgroup.net wrote:

 Hi,

 Currently we are indexing 10 million document from database (10 db data
  entities) & index size is around 8 GB on windows virtual box. Indexing in
  one shot taking 12+ hours while indexing parallel in separate cores &
  merging them together taking 4+ hours.

 We are looking to scale to 100+ million documents and looking for
 recommendation on servers requirements on below parameters for a Production
 environment. There can be 200+ users performing search same time.

 No of physical servers (considering solr cloud)
 Memory requirement
 Processor requirement (# cores)
 Linux as OS oppose to windows

 Thanks in advance.
 Susheel




RE: Solr server requirements for 100+ million documents

2014-01-24 Thread Susheel Kumar
Thanks, Svante. Your indexing speed using the db seems to be really fast. Can you 
please provide some more detail on how you are indexing db records? Is it through 
DataImportHandler? And what database? Is that a local db? We are indexing around 
70 fields (60 multivalued), but data is not always populated in all fields. The 
average document size is 5-10 KB.

-Original Message-
From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of svante 
karlsson
Sent: Friday, January 24, 2014 5:05 PM
To: solr-user@lucene.apache.org
Subject: Re: Solr server requirements for 100+ million documents

I just indexed 100 million db docs (records) with 22 fields (4 multivalued) in 
9524 sec using libcurl.
11 million took 763 seconds so the speed drops somewhat with increasing dbsize.

We write 1000 docs (just an arbitrary number) in each request from two threads. 
If you will be using solrcloud you will want more writer threads.

The hardware is a single cheap hp DL320E GEN8 V2 1P E3-1220V3 with one SSD and 
32GB and the solr runs on ubuntu 13.10 inside a esxi virtual machine.

/svante




2014/1/24 Susheel Kumar susheel.ku...@thedigitalgroup.net

 Thanks, Erick for the info.

 For indexing I agree the more time is consumed in data acquisition 
 which in our case from Database.  For indexing currently we are using 
 the manual process i.e. Solr dashboard Data Import but now looking to 
 automate.  How do you suggest to automate the index part. Do you 
 recommend to use SolrJ or should we try to automate using Curl?


 -Original Message-
 From: Erick Erickson [mailto:erickerick...@gmail.com]
 Sent: Friday, January 24, 2014 2:59 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr server requirements for 100+ million documents

 Can't be done with the information you provided, and can only be 
 guessed at even with more comprehensive information.

 Here's why:


 http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we
 -dont-have-a-definitive-answer/

 Also, at a guess, your indexing speed is so slow due to data 
 acquisition; I rather doubt you're being limited by raw Solr indexing. 
 If you're using SolrJ, try commenting out the
 server.add() bit and running again. My guess is that your indexing 
 speed will be almost unchanged, in which case it's the data 
 acquisition process is where you should concentrate efforts. As a 
 comparison, I can index 11M Wikipedia docs on my laptop in 45 minutes 
 without any attempts at parallelization.


 Best,
 Erick

 On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar  
 susheel.ku...@thedigitalgroup.net wrote:
  Hi,
 
  Currently we are indexing 10 million document from database (10 db 
  data
  entities) & index size is around 8 GB on windows virtual box. Indexing 
  in one shot taking 12+ hours while indexing parallel in separate cores 
  & merging them together taking 4+ hours.
 
  We are looking to scale to 100+ million documents and looking for
 recommendation on servers requirements on below parameters for a 
 Production environment. There can be 200+ users performing search same time.
 
  No of physical servers (considering solr cloud) Memory requirement 
  Processor requirement (# cores) Linux as OS oppose to windows
 
  Thanks in advance.
  Susheel
 



Re: Solr server requirements for 100+ million documents

2014-01-24 Thread Kranti Parisa
Can you post the complete solrconfig.xml and schema.xml files, so we can
review all of the settings that would impact your indexing performance?

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Sat, Jan 25, 2014 at 12:56 AM, Susheel Kumar 
susheel.ku...@thedigitalgroup.net wrote:

 Thanks, Svante. Your indexing speed using db seems to really fast. Can you
 please provide some more detail on how you are indexing db records. Is it
 thru DataImportHandler? And what database? Is that local db?  We are
 indexing around 70 fields (60 multivalued) but data is not populated always
 in all fields. The average size of document is in 5-10 kbs.

 -Original Message-
 From: saka.csi...@gmail.com [mailto:saka.csi...@gmail.com] On Behalf Of
 svante karlsson
 Sent: Friday, January 24, 2014 5:05 PM
 To: solr-user@lucene.apache.org
 Subject: Re: Solr server requirements for 100+ million documents

 I just indexed 100 million db docs (records) with 22 fields (4
 multivalued) in 9524 sec using libcurl.
 11 million took 763 seconds so the speed drops somewhat with increasing
 dbsize.

 We write 1000 docs (just an arbitrary number) in each request from two
 threads. If you will be using solrcloud you will want more writer threads.

 The hardware is a single cheap hp DL320E GEN8 V2 1P E3-1220V3 with one SSD
 and 32GB and the solr runs on ubuntu 13.10 inside a esxi virtual machine.

 /svante




 2014/1/24 Susheel Kumar susheel.ku...@thedigitalgroup.net

  Thanks, Erick for the info.
 
  For indexing I agree the more time is consumed in data acquisition
  which in our case from Database.  For indexing currently we are using
  the manual process i.e. Solr dashboard Data Import but now looking to
  automate.  How do you suggest to automate the index part. Do you
  recommend to use SolrJ or should we try to automate using Curl?
 
 
  -Original Message-
  From: Erick Erickson [mailto:erickerick...@gmail.com]
  Sent: Friday, January 24, 2014 2:59 PM
  To: solr-user@lucene.apache.org
  Subject: Re: Solr server requirements for 100+ million documents
 
  Can't be done with the information you provided, and can only be
  guessed at even with more comprehensive information.
 
  Here's why:
 
 
  http://searchhub.org/2012/07/23/sizing-hardware-in-the-abstract-why-we
  -dont-have-a-definitive-answer/
 
  Also, at a guess, your indexing speed is so slow due to data
  acquisition; I rather doubt you're being limited by raw Solr indexing.
  If you're using SolrJ, try commenting out the
  server.add() bit and running again. My guess is that your indexing
  speed will be almost unchanged, in which case it's the data
  acquisition process is where you should concentrate efforts. As a
  comparison, I can index 11M Wikipedia docs on my laptop in 45 minutes
  without any attempts at parallelization.
 
 
  Best,
  Erick
 
  On Fri, Jan 24, 2014 at 12:10 PM, Susheel Kumar 
  susheel.ku...@thedigitalgroup.net wrote:
   Hi,
  
   Currently we are indexing 10 million document from database (10 db
   data
  entities) & index size is around 8 GB on windows virtual box. Indexing
  in one shot taking 12+ hours while indexing parallel in separate cores
  & merging them together taking 4+ hours.
  
   We are looking to scale to 100+ million documents and looking for
  recommendation on servers requirements on below parameters for a
  Production environment. There can be 200+ users performing search same
 time.
  
   No of physical servers (considering solr cloud) Memory requirement
   Processor requirement (# cores) Linux as OS oppose to windows
  
   Thanks in advance.
   Susheel