Re: Indexing huge data onto solr

2020-05-26 Thread Erick Erickson
It Depends (tm). Often, you can create a single (albeit perhaps complex)
SQL query that does this for you and just process the response.

I’ve also seen situations where it’s possible to hold one of the tables 
in memory on the client and just use that rather than a separate query.

It depends on the characteristics of your particular database; your DBA
could probably help.
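
For the in-memory case, the skeleton is something like this rough sketch
(JDBC URL, table, column and field names are all made up, and the batch
size is arbitrary):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CachedLookupIndexer {
      public static void main(String[] args) throws Exception {
        try (Connection db = DriverManager.getConnection("jdbc:your-db-url-here");
             SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/yourcore").build()) {

          // Cache the small side of the join once; assumes it fits in client memory.
          Map<String, String> statusById = new HashMap<>();
          try (Statement st = db.createStatement();
               ResultSet rs = st.executeQuery("SELECT id, status FROM status_lookup")) {
            while (rs.next()) statusById.put(rs.getString("id"), rs.getString("status"));
          }

          // Stream the big table once and enrich each row from the cache
          // instead of running a second query per row.
          List<SolrInputDocument> batch = new ArrayList<>();
          try (Statement st = db.createStatement();
               ResultSet rs = st.executeQuery("SELECT id, name, status_id FROM big_table")) {
            while (rs.next()) {
              SolrInputDocument doc = new SolrInputDocument();
              doc.setField("id", rs.getString("id"));
              doc.setField("name", rs.getString("name"));
              doc.setField("status", statusById.get(rs.getString("status_id")));
              batch.add(doc);
              if (batch.size() == 1000) { solr.add(batch); batch.clear(); }
            }
          }
          if (!batch.isEmpty()) solr.add(batch);
          solr.commit();
        }
      }
    }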

Best,
Erick

> On May 25, 2020, at 11:56 PM, Srinivas Kashyap 
>  wrote:
> 
> Hi Erick,
> 
> Thanks for the below response. The link which you provided holds good if you
> have a single entity where you can join the tables and index them. But in our
> scenario, we have nested entities joining different tables, as shown below:
> 
> db-data-config.xml (the entity tags were stripped by the mail archive; the
> config is a parent entity with four nested child entities):
> 
>   parent entity (main table)
>     child entity (table 1 join table 2)
>     child entity (table 3 join table 4)
>     child entity (table 5 join table 6)
>     child entity (table 7 join table 8)
> Do you have any recommendations for running multiple SQLs and combining the
> results into a single Solr document that can be sent over SolrJ for indexing?
> 
> Say the parent entity has 100 documents: should I iterate over each parent
> tuple and execute the child entity SQL (with a WHERE condition on the parent)
> to create one Solr document? Won't that put more load on the database by
> executing more SQLs? Is there an optimal solution?
> 
> Thanks,
> Srinivas
> From: Erick Erickson 
> Sent: 22 May 2020 22:52
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing huge data onto solr
> 
> You have a lot more control over the speed and form of importing data if
> you just do the initial load in SolrJ. Here’s an example, taking the Tika
> parts out is easy:
> 
> https://lucidworks.com/post/indexing-with-solrj/
> 
> It’s especially instructive to comment out just the call to
> CloudSolrClient.add(doclist…). If
> that _still_ takes a long time, then your DB query is the root of the 
> problem. Even with 100M
> records, I’d be really surprised if Solr is the bottleneck, but the above 
> test will tell you
> where to go to try to speed things up.
> 
> Best,
> Erick
> 
>> On May 22, 2020, at 12:39 PM, Srinivas Kashyap <srini...@bamboorose.com.INVALID> wrote:
>> 
>> Hi All,
>> 
>> We are running Solr 8.4.1. We have a database table which has more than 100
>> million records. Till now we were using DIH to do a full-import on the
>> tables. But for this table, a full-import via DIH takes more than 3-4 days
>> to complete and also consumes a fair bit of JVM memory while running.
>>
>> Are there any speedier/alternative ways to load data onto this Solr core?
>>
>> P.S.: Only the initial data import is a problem; further updates/additions to
>> this core are done through SolrJ.
>> 
>> Thanks,
>> Srinivas
>> 



RE: Indexing huge data onto solr

2020-05-25 Thread Srinivas Kashyap
Hi Erick,

Thanks for the below response. The link which you provided holds good if you
have a single entity where you can join the tables and index them. But in our
scenario, we have nested entities joining different tables, as shown below:

db-data-config.xml (the entity tags were stripped by the mail archive; the
config is a parent entity with four nested child entities):

  parent entity (main table)
    child entity (table 1 join table 2)
    child entity (table 3 join table 4)
    child entity (table 5 join table 6)
    child entity (table 7 join table 8)
Do you have any recommendations for running multiple SQLs and combining the
results into a single Solr document that can be sent over SolrJ for indexing?

Say the parent entity has 100 documents: should I iterate over each parent
tuple and execute the child entity SQL (with a WHERE condition on the parent)
to create one Solr document? Won't that put more load on the database by
executing more SQLs? Is there an optimal solution?
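
Or would it be better to batch it, something like the sketch below (all table,
column and field names are made up), where each child SQL runs once per batch
of parent keys instead of once per parent?

    // Sketch only: one bulk query per child entity for a whole batch of parents,
    // grouped in memory by parent id, instead of one child query per parent row.
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.solr.common.SolrInputDocument;

    public class NestedEntityAssembler {

      // Run one child query for the whole batch of parent ids and group the rows.
      static Map<String, List<String>> childRowsFor(Connection db, String baseSql,
                                                    List<String> parentIds) throws SQLException {
        String in = String.join(",", Collections.nCopies(parentIds.size(), "?"));
        Map<String, List<String>> byParent = new HashMap<>();
        try (PreparedStatement ps = db.prepareStatement(baseSql + " WHERE parent_id IN (" + in + ")")) {
          for (int i = 0; i < parentIds.size(); i++) ps.setString(i + 1, parentIds.get(i));
          try (ResultSet rs = ps.executeQuery()) {
            while (rs.next()) {
              byParent.computeIfAbsent(rs.getString("parent_id"), k -> new ArrayList<>())
                      .add(rs.getString("value"));
            }
          }
        }
        return byParent;
      }

      static List<SolrInputDocument> assembleBatch(Connection db, List<String> parentIds)
          throws SQLException {
        // One bulk query per child entity (only two shown here) for the whole batch.
        Map<String, List<String>> childA = childRowsFor(db, "SELECT parent_id, value FROM child_a", parentIds);
        Map<String, List<String>> childB = childRowsFor(db, "SELECT parent_id, value FROM child_b", parentIds);

        List<SolrInputDocument> docs = new ArrayList<>();
        for (String id : parentIds) {
          SolrInputDocument doc = new SolrInputDocument();
          doc.setField("id", id);
          doc.setField("child_a_values", childA.getOrDefault(id, Collections.emptyList()));  // multi-valued field
          doc.setField("child_b_values", childB.getOrDefault(id, Collections.emptyList()));
          docs.add(doc);
        }
        return docs;  // send this batch over SolrJ, then move to the next page of parents
      }
    }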

Thanks,
Srinivas
From: Erick Erickson 
Sent: 22 May 2020 22:52
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data onto solr

You have a lot more control over the speed and form of importing data if
you just do the initial load in SolrJ. Here’s an example, taking the Tika
parts out is easy:

https://lucidworks.com/post/indexing-with-solrj/

It’s especially instructive to comment out just the call to
CloudSolrClient.add(doclist…). If
that _still_ takes a long time, then your DB query is the root of the problem. 
Even with 100M
records, I’d be really surprised if Solr is the bottleneck, but the above test 
will tell you
where to go to try to speed things up.

Best,
Erick

> On May 22, 2020, at 12:39 PM, Srinivas Kashyap <srini...@bamboorose.com.INVALID> wrote:
>
> Hi All,
>
> We are running Solr 8.4.1. We have a database table which has more than 100
> million records. Till now we were using DIH to do a full-import on the
> tables. But for this table, a full-import via DIH takes more than 3-4 days
> to complete and also consumes a fair bit of JVM memory while running.
>
> Are there any speedier/alternative ways to load data onto this Solr core?
>
> P.S.: Only the initial data import is a problem; further updates/additions to
> this core are done through SolrJ.
>
> Thanks,
> Srinivas
> 


Re: Indexing huge data onto solr

2020-05-22 Thread matthew sporleder
I can index (without nested entities, of course) 100M records in about
6-8 hours on a pretty low-powered machine using vanilla DIH -> MySQL,
so it is probably worth looking at why it is going slow before writing
your own indexer (which we are finally having to do).

On Fri, May 22, 2020 at 1:22 PM Erick Erickson  wrote:
>
> You have a lot more control over the speed and form of importing data if
> you just do the initial load in SolrJ. Here’s an example, taking the Tika
> parts out is easy:
>
> https://lucidworks.com/post/indexing-with-solrj/
>
> It’s especially instructive to comment out just the call to
> CloudSolrClient.add(doclist…). If
> that _still_ takes a long time, then your DB query is the root of the 
> problem. Even with 100M
> records, I’d be really surprised if Solr is the bottleneck, but the above 
> test will tell you
> where to go to try to speed things up.
>
> Best,
> Erick
>
> > On May 22, 2020, at 12:39 PM, Srinivas Kashyap 
> >  wrote:
> >
> > Hi All,
> >
> > We are running Solr 8.4.1. We have a database table which has more than 100
> > million records. Till now we were using DIH to do a full-import on the
> > tables. But for this table, a full-import via DIH takes more than 3-4 days
> > to complete and also consumes a fair bit of JVM memory while running.
> >
> > Are there any speedier/alternative ways to load data onto this Solr core?
> >
> > P.S.: Only the initial data import is a problem; further updates/additions to
> > this core are done through SolrJ.
> >
> > Thanks,
> > Srinivas
> > 
>


Re: Indexing huge data onto solr

2020-05-22 Thread Erick Erickson
You have a lot more control over the speed and form of importing data if
you just do the initial load in SolrJ. Here’s an example, taking the Tika
parts out is easy:

https://lucidworks.com/post/indexing-with-solrj/

It’s especially instructive to comment out just the call to
CloudSolrClient.add(doclist…). If
that _still_ takes a long time, then your DB query is the root of the problem. 
Even with 100M
records, I’d be really surprised if Solr is the bottleneck, but the above test 
will tell you
where to go to try to speed things up.
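
If it helps, the skeleton of that test looks roughly like this (collection
name, ZooKeeper address and the DB-side helpers are placeholders):

    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class BottleneckProbe {
      public static void main(String[] args) throws Exception {
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("localhost:2181"), Optional.empty()).build()) {
          client.setDefaultCollection("yourcollection");

          long start = System.currentTimeMillis();
          List<SolrInputDocument> doclist = new ArrayList<>();
          while (moreRowsFromDb()) {            // stand-in for your JDBC cursor loop
            doclist.add(nextDocFromDb());       // stand-in for building a doc from a row
            if (doclist.size() == 1000) {
              // client.add(doclist);           // <-- comment this out for the timing run
              doclist.clear();
            }
          }
          System.out.println("Elapsed without add(): "
              + (System.currentTimeMillis() - start) + " ms");
        }
      }

      // Placeholders for the data-acquisition side of your program.
      static boolean moreRowsFromDb() { return false; }
      static SolrInputDocument nextDocFromDb() { return new SolrInputDocument(); }
    }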

Best,
Erick

> On May 22, 2020, at 12:39 PM, Srinivas Kashyap 
>  wrote:
> 
> Hi All,
> 
> We are running Solr 8.4.1. We have a database table which has more than 100
> million records. Till now we were using DIH to do a full-import on the
> tables. But for this table, a full-import via DIH takes more than 3-4 days
> to complete and also consumes a fair bit of JVM memory while running.
>
> Are there any speedier/alternative ways to load data onto this Solr core?
>
> P.S.: Only the initial data import is a problem; further updates/additions to
> this core are done through SolrJ.
> 
> Thanks,
> Srinivas
> 



Indexing huge data onto solr

2020-05-22 Thread Srinivas Kashyap
Hi All,

We are running Solr 8.4.1. We have a database table which has more than 100
million records. Till now we were using DIH to do a full-import on the tables.
But for this table, a full-import via DIH takes more than 3-4 days to complete
and also consumes a fair bit of JVM memory while running.

Are there any speedier/alternative ways to load data onto this Solr core?

P.S.: Only the initial data import is a problem; further updates/additions to
this core are done through SolrJ.

Thanks,
Srinivas



Re: Indexing huge data

2014-03-08 Thread Rallavagu
Thanks for all the responses so far. Test runs so far do not suggest any
bottleneck with Solr as I continue to work on different approaches.
Collecting the data from the different sources seems to be consuming most of
the time.


On 3/7/14, 5:53 PM, Erick Erickson wrote:

Kranti and Susheel's approaches are certainly
reasonable assuming I bet right :).

Another strategy is to rack together N
indexing programs that simultaneously
feed Solr.

In any of these scenarios, the end goal is to get
Solr using up all the CPU cycles it can, _assuming_
that Solr isn't the bottleneck in the first place.

Best,
Erick

On Thu, Mar 6, 2014 at 6:38 PM, Kranti Parisa  wrote:

thats what I do. precreate JSONs following the schema, saving that in
MongoDB, this is part of the ETL process. after that, just dump the JSONs
into Solr using batching etc. with this you can do full and incremental
indexing as well.

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Thu, Mar 6, 2014 at 9:57 AM, Rallavagu  wrote:


Yeah. I have thought about spitting out JSON and run it against Solr using
parallel Http threads separately. Thanks.


On 3/5/14, 6:46 PM, Susheel Kumar wrote:


One more suggestion is to collect/prepare the data in CSV format (1-2
million sample depending on size) and then import data direct into Solr
using CSV handler & curl.  This will give you the pure indexing time & the
differences.

Thanks,
Susheel

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, March 05, 2014 8:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data

Here's the easiest thing to try to figure out where to concentrate your
energies. Just comment out the server.add call in your SolrJ program.
Well, and any commits you're doing from SolrJ.

My bet: Your program will run at about the same speed it does when you
actually index the docs, indicating that your problem is in the data
acquisition side. Of course the older I get, the more times I've been wrong
:).

You can also monitor the CPU usage on the box running Solr. I often see
it idling along < 30% when indexing, or even < 10%, again indicating that
the bottleneck is on the acquisition side.

Note I haven't mentioned any solutions, I'm a believer in identifying the
_problem_ before worrying about a solution.

Best,
Erick

On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky 
wrote:


Make sure you're not doing a commit on each individual document add.
Commit every few minutes or every few hundred or few thousand
documents is sufficient. You can set up auto commit in solrconfig.xml.

-- Jack Krupansky

-Original Message- From: Rallavagu
Sent: Wednesday, March 5, 2014 2:37 PM
To: solr-user@lucene.apache.org
Subject: Indexing huge data


All,

Wondering about best practices/common practices to index/re-index huge
amount of data in Solr. The data is about 6 million entries in the db
and other source (data is not located in one resource). Trying with
solrj based solution to collect data from difference resources to
index into Solr. It takes hours to index Solr.

Thanks in advance





Re: Indexing huge data

2014-03-07 Thread Erick Erickson
Kranti and Susheel's approaches are certainly
reasonable assuming I bet right :).

Another strategy is to rack together N
indexing programs that simultaneously
feed Solr.

In any of these scenarios, the end goal is to get
Solr using up all the CPU cycles it can, _assuming_
that Solr isn't the bottleneck in the first place.
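
One rough way to do that inside a single JVM, using the current SolrJ API
(URL, core name, and the slicing logic are placeholders), is a fixed thread
pool where each worker feeds its own slice of the source data:

    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ParallelFeeder {
      public static void main(String[] args) throws Exception {
        final int workers = 8;
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/yourcore").build()) {
          ExecutorService pool = Executors.newFixedThreadPool(workers);
          for (int slice = 0; slice < workers; slice++) {
            final int mySlice = slice;
            pool.submit(() -> {
              try {
                // Each worker fetches only its own slice of the source data,
                // e.g. WHERE MOD(id, workers) = mySlice on the DB side.
                for (List<SolrInputDocument> batch : fetchSlice(mySlice, workers)) {
                  solr.add(batch);   // HttpSolrClient is thread-safe; batched adds, no commit here
                }
              } catch (Exception e) {
                e.printStackTrace();
              }
            });
          }
          pool.shutdown();
          pool.awaitTermination(1, TimeUnit.DAYS);
          solr.commit();             // one commit once every worker is done
        }
      }

      // Placeholder: stream batches of documents for one slice of the source data.
      static Iterable<List<SolrInputDocument>> fetchSlice(int slice, int totalSlices) {
        return Collections.emptyList();
      }
    }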

Best,
Erick

On Thu, Mar 6, 2014 at 6:38 PM, Kranti Parisa  wrote:
> thats what I do. precreate JSONs following the schema, saving that in
> MongoDB, this is part of the ETL process. after that, just dump the JSONs
> into Solr using batching etc. with this you can do full and incremental
> indexing as well.
>
> Thanks,
> Kranti K. Parisa
> http://www.linkedin.com/in/krantiparisa
>
>
>
> On Thu, Mar 6, 2014 at 9:57 AM, Rallavagu  wrote:
>
>> Yeah. I have thought about spitting out JSON and run it against Solr using
>> parallel Http threads separately. Thanks.
>>
>>
>> On 3/5/14, 6:46 PM, Susheel Kumar wrote:
>>
>>> One more suggestion is to collect/prepare the data in CSV format (1-2
>>> million sample depending on size) and then import data direct into Solr
>>> using CSV handler & curl.  This will give you the pure indexing time & the
>>> differences.
>>>
>>> Thanks,
>>> Susheel
>>>
>>> -Original Message-
>>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>>> Sent: Wednesday, March 05, 2014 8:03 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Re: Indexing huge data
>>>
>>> Here's the easiest thing to try to figure out where to concentrate your
>>> energies. Just comment out the server.add call in your SolrJ program.
>>> Well, and any commits you're doing from SolrJ.
>>>
>>> My bet: Your program will run at about the same speed it does when you
>>> actually index the docs, indicating that your problem is in the data
>>> acquisition side. Of course the older I get, the more times I've been wrong
>>> :).
>>>
>>> You can also monitor the CPU usage on the box running Solr. I often see
>>> it idling along < 30% when indexing, or even < 10%, again indicating that
>>> the bottleneck is on the acquisition side.
>>>
>>> Note I haven't mentioned any solutions, I'm a believer in identifying the
>>> _problem_ before worrying about a solution.
>>>
>>> Best,
>>> Erick
>>>
>>> On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky 
>>> wrote:
>>>
>>>> Make sure you're not doing a commit on each individual document add.
>>>> Commit every few minutes or every few hundred or few thousand
>>>> documents is sufficient. You can set up auto commit in solrconfig.xml.
>>>>
>>>> -- Jack Krupansky
>>>>
>>>> -Original Message- From: Rallavagu
>>>> Sent: Wednesday, March 5, 2014 2:37 PM
>>>> To: solr-user@lucene.apache.org
>>>> Subject: Indexing huge data
>>>>
>>>>
>>>> All,
>>>>
>>>> Wondering about best practices/common practices to index/re-index huge
>>>> amount of data in Solr. The data is about 6 million entries in the db
>>>> and other source (data is not located in one resource). Trying with
>>>> solrj based solution to collect data from difference resources to
>>>> index into Solr. It takes hours to index Solr.
>>>>
>>>> Thanks in advance
>>>>
>>>


Re: Indexing huge data

2014-03-06 Thread Kranti Parisa
That's what I do: pre-create JSONs following the schema, saving them in
MongoDB; this is part of the ETL process. After that, just dump the JSONs
into Solr using batching etc. With this you can do full and incremental
indexing as well.
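
The dump step itself can be as simple as posting each exported JSON batch to
the JSON update handler, roughly like this (file name and core URL are
examples, and /update/json/docs assumes a reasonably recent Solr):

    import java.io.File;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class JsonDump {
      public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/yourcore").build()) {
          // One exported batch of pre-created JSON documents (e.g. dumped from MongoDB).
          ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update/json/docs");
          req.addFile(new File("batch-0001.json"), "application/json");
          req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
          solr.request(req);
        }
      }
    }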

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa



On Thu, Mar 6, 2014 at 9:57 AM, Rallavagu  wrote:

> Yeah. I have thought about spitting out JSON and run it against Solr using
> parallel Http threads separately. Thanks.
>
>
> On 3/5/14, 6:46 PM, Susheel Kumar wrote:
>
>> One more suggestion is to collect/prepare the data in CSV format (1-2
>> million sample depending on size) and then import data direct into Solr
>> using CSV handler & curl.  This will give you the pure indexing time & the
>> differences.
>>
>> Thanks,
>> Susheel
>>
>> -Original Message-
>> From: Erick Erickson [mailto:erickerick...@gmail.com]
>> Sent: Wednesday, March 05, 2014 8:03 PM
>> To: solr-user@lucene.apache.org
>> Subject: Re: Indexing huge data
>>
>> Here's the easiest thing to try to figure out where to concentrate your
>> energies. Just comment out the server.add call in your SolrJ program.
>> Well, and any commits you're doing from SolrJ.
>>
>> My bet: Your program will run at about the same speed it does when you
>> actually index the docs, indicating that your problem is in the data
>> acquisition side. Of course the older I get, the more times I've been wrong
>> :).
>>
>> You can also monitor the CPU usage on the box running Solr. I often see
>> it idling along < 30% when indexing, or even < 10%, again indicating that
>> the bottleneck is on the acquisition side.
>>
>> Note I haven't mentioned any solutions, I'm a believer in identifying the
>> _problem_ before worrying about a solution.
>>
>> Best,
>> Erick
>>
>> On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky 
>> wrote:
>>
>>> Make sure you're not doing a commit on each individual document add.
>>> Commit every few minutes or every few hundred or few thousand
>>> documents is sufficient. You can set up auto commit in solrconfig.xml.
>>>
>>> -- Jack Krupansky
>>>
>>> -Original Message- From: Rallavagu
>>> Sent: Wednesday, March 5, 2014 2:37 PM
>>> To: solr-user@lucene.apache.org
>>> Subject: Indexing huge data
>>>
>>>
>>> All,
>>>
>>> Wondering about best practices/common practices to index/re-index huge
>>> amount of data in Solr. The data is about 6 million entries in the db
>>> and other source (data is not located in one resource). Trying with
>>> solrj based solution to collect data from difference resources to
>>> index into Solr. It takes hours to index Solr.
>>>
>>> Thanks in advance
>>>
>>


Re: Indexing huge data

2014-03-06 Thread Rallavagu
Yeah. I have thought about spitting out JSON and running it against Solr
using parallel HTTP threads separately. Thanks.


On 3/5/14, 6:46 PM, Susheel Kumar wrote:

One more suggestion is to collect/prepare the data in CSV format (1-2 million sample 
depending on size) and then import data direct into Solr using CSV handler & curl.  
This will give you the pure indexing time & the differences.

Thanks,
Susheel

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, March 05, 2014 8:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data

Here's the easiest thing to try to figure out where to concentrate your 
energies. Just comment out the server.add call in your SolrJ program. Well, 
and any commits you're doing from SolrJ.

My bet: Your program will run at about the same speed it does when you actually 
index the docs, indicating that your problem is in the data acquisition side. 
Of course the older I get, the more times I've been wrong :).

You can also monitor the CPU usage on the box running Solr. I often see it idling 
along < 30% when indexing, or even < 10%, again indicating that the bottleneck 
is on the acquisition side.

Note I haven't mentioned any solutions, I'm a believer in identifying the 
_problem_ before worrying about a solution.

Best,
Erick

On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky  wrote:

Make sure you're not doing a commit on each individual document add.
Commit every few minutes or every few hundred or few thousand
documents is sufficient. You can set up auto commit in solrconfig.xml.

-- Jack Krupansky

-Original Message- From: Rallavagu
Sent: Wednesday, March 5, 2014 2:37 PM
To: solr-user@lucene.apache.org
Subject: Indexing huge data


All,

Wondering about best practices/common practices to index/re-index huge
amount of data in Solr. The data is about 6 million entries in the db
and other source (data is not located in one resource). Trying with
solrj based solution to collect data from difference resources to
index into Solr. It takes hours to index Solr.

Thanks in advance


Re: Indexing huge data

2014-03-06 Thread Rallavagu

Erick,

That helps so I can focus on the problem areas. Thanks.

On 3/5/14, 6:03 PM, Erick Erickson wrote:

Here's the easiest thing to try to figure out where to
concentrate your energies. Just comment out the
server.add call in your SolrJ program. Well, and any
commits you're doing from SolrJ.

My bet: Your program will run at about the same speed
it does when you actually index the docs, indicating that
your problem is in the data acquisition side. Of course
the older I get, the more times I've been wrong :).

You can also monitor the CPU usage on the box running
Solr. I often see it idling along < 30% when indexing, or
even < 10%, again indicating that the bottleneck is on the
acquisition side.

Note I haven't mentioned any solutions, I'm a believer in
identifying the _problem_ before worrying about a solution.

Best,
Erick

On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky  wrote:

Make sure you're not doing a commit on each individual document add. Commit
every few minutes or every few hundred or few thousand documents is
sufficient. You can set up auto commit in solrconfig.xml.

-- Jack Krupansky

-Original Message- From: Rallavagu
Sent: Wednesday, March 5, 2014 2:37 PM
To: solr-user@lucene.apache.org
Subject: Indexing huge data


All,

Wondering about best practices/common practices to index/re-index huge
amount of data in Solr. The data is about 6 million entries in the db
and other source (data is not located in one resource). Trying with
solrj based solution to collect data from difference resources to index
into Solr. It takes hours to index Solr.

Thanks in advance


RE: Indexing huge data

2014-03-05 Thread Susheel Kumar
One more suggestion is to collect/prepare the data in CSV format (a 1-2 million
record sample, depending on size) and then import the data directly into Solr
using the CSV handler & curl. This will give you the pure indexing time & the
differences.
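
If you prefer to stay in Java rather than curl, the equivalent with the
current SolrJ API is roughly this (file name, core URL and separator are
just examples):

    import java.io.File;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.request.AbstractUpdateRequest;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class CsvLoader {
      public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/yourcore").build()) {
          ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/update");
          req.addFile(new File("sample-2m.csv"), "text/csv");             // the prepared CSV sample
          req.setParam("separator", ",");                                 // CSV handler params as needed
          req.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true); // commit when the load finishes
          solr.request(req);  // time this call to get the pure indexing cost
        }
      }
    }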

Thanks,
Susheel  

-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com] 
Sent: Wednesday, March 05, 2014 8:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data

Here's the easiest thing to try to figure out where to concentrate your 
energies. Just comment out the server.add call in your SolrJ program. Well, 
and any commits you're doing from SolrJ.

My bet: Your program will run at about the same speed it does when you actually 
index the docs, indicating that your problem is in the data acquisition side. 
Of course the older I get, the more times I've been wrong :).

You can also monitor the CPU usage on the box running Solr. I often see it 
idling along < 30% when indexing, or even < 10%, again indicating that the 
bottleneck is on the acquisition side.

Note I haven't mentioned any solutions, I'm a believer in identifying the 
_problem_ before worrying about a solution.

Best,
Erick

On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky  wrote:
> Make sure you're not doing a commit on each individual document add. 
> Commit every few minutes or every few hundred or few thousand 
> documents is sufficient. You can set up auto commit in solrconfig.xml.
>
> -- Jack Krupansky
>
> -Original Message- From: Rallavagu
> Sent: Wednesday, March 5, 2014 2:37 PM
> To: solr-user@lucene.apache.org
> Subject: Indexing huge data
>
>
> All,
>
> Wondering about best practices/common practices to index/re-index huge 
> amount of data in Solr. The data is about 6 million entries in the db 
> and other source (data is not located in one resource). Trying with 
> solrj based solution to collect data from difference resources to 
> index into Solr. It takes hours to index Solr.
>
> Thanks in advance


Re: Indexing huge data

2014-03-05 Thread Erick Erickson
Here's the easiest thing to try to figure out where to
concentrate your energies. Just comment out the
server.add call in your SolrJ program. Well, and any
commits you're doing from SolrJ.

My bet: Your program will run at about the same speed
it does when you actually index the docs, indicating that
your problem is in the data acquisition side. Of course
the older I get, the more times I've been wrong :).

You can also monitor the CPU usage on the box running
Solr. I often see it idling along < 30% when indexing, or
even < 10%, again indicating that the bottleneck is on the
acquisition side.

Note I haven't mentioned any solutions, I'm a believer in
identifying the _problem_ before worrying about a solution.

Best,
Erick

On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky  wrote:
> Make sure you're not doing a commit on each individual document add. Commit
> every few minutes or every few hundred or few thousand documents is
> sufficient. You can set up auto commit in solrconfig.xml.
>
> -- Jack Krupansky
>
> -Original Message- From: Rallavagu
> Sent: Wednesday, March 5, 2014 2:37 PM
> To: solr-user@lucene.apache.org
> Subject: Indexing huge data
>
>
> All,
>
> Wondering about best practices/common practices to index/re-index huge
> amount of data in Solr. The data is about 6 million entries in the db
> and other source (data is not located in one resource). Trying with
> solrj based solution to collect data from difference resources to index
> into Solr. It takes hours to index Solr.
>
> Thanks in advance


Re: Indexing huge data

2014-03-05 Thread Jack Krupansky
Make sure you're not doing a commit on each individual document add. Commit 
every few minutes or every few hundred or few thousand documents is 
sufficient. You can set up auto commit in solrconfig.xml.
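
On the client side that boils down to something like this sketch (core URL
and field names are examples): batch the adds and use commitWithin, or the
autoCommit settings in solrconfig.xml, instead of calling commit() per
document:

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.solr.client.solrj.SolrClient;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class CommitFriendlyIndexer {
      public static void main(String[] args) throws Exception {
        try (SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/yourcore").build()) {
          List<SolrInputDocument> batch = new ArrayList<>();
          for (int i = 0; i < 100000; i++) {       // stand-in for your real document source
            SolrInputDocument doc = new SolrInputDocument();
            doc.setField("id", Integer.toString(i));
            batch.add(doc);
            if (batch.size() == 1000) {
              solr.add(batch, 60000);              // commitWithin 60s; no commit() per batch
              batch.clear();
            }
          }
          if (!batch.isEmpty()) solr.add(batch, 60000);
          solr.commit();                           // a single hard commit at the very end
        }
      }
    }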


-- Jack Krupansky

-Original Message- 
From: Rallavagu

Sent: Wednesday, March 5, 2014 2:37 PM
To: solr-user@lucene.apache.org
Subject: Indexing huge data

All,

Wondering about best practices/common practices to index/re-index huge
amount of data in Solr. The data is about 6 million entries in the db
and other source (data is not located in one resource). Trying with
solrj based solution to collect data from difference resources to index
into Solr. It takes hours to index Solr.

Thanks in advance 



Re: Indexing huge data

2014-03-05 Thread Otis Gospodnetic
Hi,

Each doc is 100K?  That's on the big side, yes, and the server seems on the
small side, yes.  Hence the "speed". :)

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Mar 5, 2014 at 3:37 PM, Rallavagu  wrote:

> Otis,
>
> Good points. I guess you are suggesting that it depends on the resources.
> The document is 100k each the pre processing server is a 2 cpu VM running
> with 4G RAM. So, that could be a "small" machine relatively to process such
> amount of data??
>
>
> On 3/5/14, 12:27 PM, Otis Gospodnetic wrote:
>
>> Hi,
>>
>> It depends.  Are docs huge or small? Server single core or 32 core?  Heap
>> big or small?  etc. etc.
>>
>> Otis
>> --
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>> On Wed, Mar 5, 2014 at 3:02 PM, Rallavagu  wrote:
>>
>>  It seems the latency is introduced by collecting the data from different
>>> sources and putting them together then actual Solr index. I would say all
>>> these activities are contributing equally though I would say So, is it
>>> normal to expect to run indexing to run for long? Wondering what to
>>> expect
>>> in such cases. Thanks.
>>>
>>> On 3/5/14, 11:47 AM, Otis Gospodnetic wrote:
>>>
>>>  Hi,

 6M is really not huge these days.  6B is big, though also still not huge
 any more.  What seems to be the bottleneck?  Solr or DB or network or
 something else?

 Otis
 --
 Performance Monitoring * Log Analytics * Search Analytics
 Solr & Elasticsearch Support * http://sematext.com/


 On Wed, Mar 5, 2014 at 2:37 PM, Rallavagu  wrote:

   All,

>
> Wondering about best practices/common practices to index/re-index huge
> amount of data in Solr. The data is about 6 million entries in the db
> and
> other source (data is not located in one resource). Trying with solrj
> based
> solution to collect data from difference resources to index into Solr.
> It
> takes hours to index Solr.
>
> Thanks in advance
>
>
>

>>


Re: Indexing huge data

2014-03-05 Thread Rallavagu

Otis,

Good points. I guess you are suggesting that it depends on the
resources. The documents are 100k each and the pre-processing server is a
2-CPU VM running with 4G of RAM. So that could be a relatively "small"
machine to process such an amount of data?



On 3/5/14, 12:27 PM, Otis Gospodnetic wrote:

Hi,

It depends.  Are docs huge or small? Server single core or 32 core?  Heap
big or small?  etc. etc.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Mar 5, 2014 at 3:02 PM, Rallavagu  wrote:


It seems the latency is introduced by collecting the data from different
sources and putting them together then actual Solr index. I would say all
these activities are contributing equally though I would say So, is it
normal to expect to run indexing to run for long? Wondering what to expect
in such cases. Thanks.

On 3/5/14, 11:47 AM, Otis Gospodnetic wrote:


Hi,

6M is really not huge these days.  6B is big, though also still not huge
any more.  What seems to be the bottleneck?  Solr or DB or network or
something else?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Mar 5, 2014 at 2:37 PM, Rallavagu  wrote:

  All,


Wondering about best practices/common practices to index/re-index huge
amount of data in Solr. The data is about 6 million entries in the db and
other source (data is not located in one resource). Trying with solrj
based
solution to collect data from difference resources to index into Solr. It
takes hours to index Solr.

Thanks in advance








Re: Indexing huge data

2014-03-05 Thread Otis Gospodnetic
Hi,

It depends.  Are docs huge or small? Server single core or 32 core?  Heap
big or small?  etc. etc.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Mar 5, 2014 at 3:02 PM, Rallavagu  wrote:

> It seems the latency is introduced by collecting the data from different
> sources and putting them together then actual Solr index. I would say all
> these activities are contributing equally though I would say So, is it
> normal to expect to run indexing to run for long? Wondering what to expect
> in such cases. Thanks.
>
> On 3/5/14, 11:47 AM, Otis Gospodnetic wrote:
>
>> Hi,
>>
>> 6M is really not huge these days.  6B is big, though also still not huge
>> any more.  What seems to be the bottleneck?  Solr or DB or network or
>> something else?
>>
>> Otis
>> --
>> Performance Monitoring * Log Analytics * Search Analytics
>> Solr & Elasticsearch Support * http://sematext.com/
>>
>>
>> On Wed, Mar 5, 2014 at 2:37 PM, Rallavagu  wrote:
>>
>>  All,
>>>
>>> Wondering about best practices/common practices to index/re-index huge
>>> amount of data in Solr. The data is about 6 million entries in the db and
>>> other source (data is not located in one resource). Trying with solrj
>>> based
>>> solution to collect data from difference resources to index into Solr. It
>>> takes hours to index Solr.
>>>
>>> Thanks in advance
>>>
>>>
>>


Re: Indexing huge data

2014-03-05 Thread Rallavagu
It seems the latency is introduced by collecting the data from the different
sources and putting it together, then doing the actual Solr indexing; I would
say all these activities are contributing equally. So, is it normal to expect
indexing to run this long? Wondering what to expect in such cases. Thanks.


On 3/5/14, 11:47 AM, Otis Gospodnetic wrote:

Hi,

6M is really not huge these days.  6B is big, though also still not huge
any more.  What seems to be the bottleneck?  Solr or DB or network or
something else?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Mar 5, 2014 at 2:37 PM, Rallavagu  wrote:


All,

Wondering about best practices/common practices to index/re-index huge
amount of data in Solr. The data is about 6 million entries in the db and
other source (data is not located in one resource). Trying with solrj based
solution to collect data from difference resources to index into Solr. It
takes hours to index Solr.

Thanks in advance





Re: Indexing huge data

2014-03-05 Thread Otis Gospodnetic
Hi,

6M is really not huge these days.  6B is big, though also still not huge
any more.  What seems to be the bottleneck?  Solr or DB or network or
something else?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Wed, Mar 5, 2014 at 2:37 PM, Rallavagu  wrote:

> All,
>
> Wondering about best practices/common practices to index/re-index huge
> amount of data in Solr. The data is about 6 million entries in the db and
> other source (data is not located in one resource). Trying with solrj based
> solution to collect data from difference resources to index into Solr. It
> takes hours to index Solr.
>
> Thanks in advance
>


Indexing huge data

2014-03-05 Thread Rallavagu

All,

Wondering about best practices/common practices to index/re-index a huge
amount of data in Solr. The data is about 6 million entries in the DB and
other sources (the data is not located in one resource). Trying a SolrJ-based
solution to collect data from the different resources to index into Solr. It
takes hours to index into Solr.


Thanks in advance