Re: Indexing huge data onto solr
It Depends (tm). Often you can create a single (albeit perhaps complex) SQL query that does this for you and just process the response. I've also seen situations where it's possible to hold one of the tables in memory on the client and use that rather than a separate query. It depends on the characteristics of your particular database; your DBA could probably help.

Best,
Erick

> On May 25, 2020, at 11:56 PM, Srinivas Kashyap wrote:
> [...]
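Erick's single-query suggestion can be sketched with plain JDBC. This is a minimal illustration only: the table and column names, the GROUP_CONCAT collapsing of child rows (MySQL syntax), and the '|' separator are all assumptions, and the actual SolrJ add is left as a comment so the sketch stays self-contained.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SingleQueryIndexer {
    // Hypothetical query: collapse each child join into one delimited
    // column so a single result-set row maps to one Solr document.
    static final String SQL =
        "SELECT p.id, p.name, " +
        "       GROUP_CONCAT(c1.val SEPARATOR '|') AS child1_vals " +
        "FROM parent p LEFT JOIN child1 c1 ON c1.parent_id = p.id " +
        "GROUP BY p.id, p.name";

    // Split a GROUP_CONCAT'ed column back into multi-valued field values.
    static List<String> splitMultiValue(String concatenated) {
        if (concatenated == null || concatenated.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(concatenated.split("\\|"));
    }

    static void index(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement();
             ResultSet rs = st.executeQuery(SQL)) {
            while (rs.next()) {
                List<String> childVals = splitMultiValue(rs.getString("child1_vals"));
                // In real code, build a SolrInputDocument from this row
                // (childVals becomes a multi-valued field) and batch it
                // to CloudSolrClient.add(...).
            }
        }
    }
}
```

Whether this beats nested DIH entities depends on how wide the concatenated columns get; GROUP_CONCAT has a configurable length cap (group_concat_max_len) that matters for large child sets.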
RE: Indexing huge data onto solr
Hi Erick,

Thanks for the below response. The link you provided holds good if you have a single entity where you can join the tables and index it. But in our scenario, we have nested entities joining different tables, as shown below:

db-data-config.xml:

  (table 1 join table 2)
  (table 3 join table 4)
  (table 5 join table 6)
  (table 7 join table 8)

Do you have any recommendations for running multiple SQLs and combining the results into a single Solr document that can be sent over SolrJ for indexing?

Say the parent entity has 100 documents. Should I iterate over each parent tuple and execute the child-entity SQLs (with a where condition on the parent) to create one Solr document? Won't that put more load on the database by executing more SQLs? Is there an optimum solution?

Thanks,
Srinivas

From: Erick Erickson
Sent: 22 May 2020 22:52
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data onto solr

[...]
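For what it's worth, the nested-entity layout described above appears to have lost its XML tags in transit. A hedged reconstruction of what such a db-data-config.xml typically looks like (entity names, columns, and queries are purely illustrative):

```xml
<dataConfig>
  <document>
    <entity name="parent"
            query="SELECT ... FROM table1 JOIN table2 ...">
      <entity name="child1"
              query="SELECT ... FROM table3 JOIN table4 WHERE parent_id = '${parent.ID}'"/>
      <entity name="child2"
              query="SELECT ... FROM table5 JOIN table6 WHERE parent_id = '${parent.ID}'"/>
      <entity name="child3"
              query="SELECT ... FROM table7 JOIN table8 WHERE parent_id = '${parent.ID}'"/>
    </entity>
  </document>
</dataConfig>
```

With this shape, DIH runs each child query once per parent row, which is exactly the per-parent query load the question is asking about.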
Re: Indexing huge data onto solr
I can index (without nested entities ofc ;) ) 100M records in about 6-8 hours on a pretty low-powered machine using vanilla DIH -> mysql, so it is probably worth looking at why it is going slow before writing your own indexer (which we are finally having to do).

On Fri, May 22, 2020 at 1:22 PM Erick Erickson wrote:
> [...]
Re: Indexing huge data onto solr
You have a lot more control over the speed and form of importing data if you just do the initial load in SolrJ. Here's an example; taking the Tika parts out is easy:

https://lucidworks.com/post/indexing-with-solrj/

It's especially instructive to comment out just the call to CloudSolrClient.add(doclist…); If that _still_ takes a long time, then your DB query is the root of the problem. Even with 100M records, I'd be really surprised if Solr is the bottleneck, but the above test will tell you where to go to try to speed things up.

Best,
Erick

> On May 22, 2020, at 12:39 PM, Srinivas Kashyap wrote:
> [...]
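A minimal sketch of the batched SolrJ load pattern this suggests. The partitioning helper is plain Java; the CloudSolrClient calls are only indicated in comments, since they require the SolrJ jar and a running cluster:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchedLoader {
    // Partition docs into fixed-size batches. In real code, each batch
    // would go to CloudSolrClient.add(batch), with a commit issued
    // periodically (or left to autoCommit) rather than per batch.
    static <T> List<List<T>> batches(List<T> docs, int size) {
        List<List<T>> out = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += size) {
            out.add(new ArrayList<>(docs.subList(i, Math.min(i + size, docs.size()))));
        }
        return out;
    }
}
```

Batch sizes in the hundreds to low thousands of documents are a common starting point; the right number depends on document size and heap.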
Indexing huge data onto solr
Hi All,

We are running Solr 8.4.1. We have a database table which has more than 100 million records. Till now we were using DIH to do full-import on the tables. But for this table, a full-import via DIH takes more than 3-4 days to complete, and it also consumes a fair bit of JVM memory while running.

Are there any speedier/alternate ways to load data onto this Solr core?

P.S: Only the initial data import is a problem; further updates/additions to this core are being done through SolrJ.

Thanks,
Srinivas

DISCLAIMER:
E-mails and attachments from Bamboo Rose, LLC are confidential.
If you are not the intended recipient, please notify the sender immediately by replying to the e-mail, and then delete it without making copies or using it in any way.
No representation is made that this email or any attachments are free of viruses. Virus scanning is recommended and is the responsibility of the recipient.

Disclaimer

The information contained in this communication from the sender is confidential. It is intended solely for use by the recipient and others authorized to receive it. If you are not the recipient, you are hereby notified that any disclosure, copying, distribution or taking action in relation of the contents of this information is strictly prohibited and may be unlawful.

This email has been scanned for viruses and malware, and may have been automatically archived by Mimecast Ltd, an innovator in Software as a Service (SaaS) for business. Providing a safer and more useful place for your human generated data. Specializing in; Security, archiving and compliance. To find out more visit the Mimecast website.
Re: Indexing huge data
Thanks for all the responses so far. Test runs so far do not suggest any bottleneck with Solr as I continue to work on different approaches. Collecting the data from the different sources seems to be consuming most of the time.

On 3/7/14, 5:53 PM, Erick Erickson wrote:
> [...]
Re: Indexing huge data
Kranti and Susheel's approaches are certainly reasonable, assuming I bet right :). Another strategy is to rack together N indexing programs that simultaneously feed Solr. In any of these scenarios, the end goal is to get Solr using up all the CPU cycles it can, _assuming_ that Solr isn't the bottleneck in the first place.

Best,
Erick

On Thu, Mar 6, 2014 at 6:38 PM, Kranti Parisa wrote:
> [...]
Re: Indexing huge data
That's what I do: pre-create JSONs following the schema and save them in MongoDB as part of the ETL process. After that, just dump the JSONs into Solr using batching etc. With this you can do full and incremental indexing as well.

Thanks,
Kranti K. Parisa
http://www.linkedin.com/in/krantiparisa

On Thu, Mar 6, 2014 at 9:57 AM, Rallavagu wrote:
> [...]
Re: Indexing huge data
Yeah. I have thought about spitting out JSON and running it against Solr using parallel HTTP threads separately. Thanks.

On 3/5/14, 6:46 PM, Susheel Kumar wrote:
> [...]
Re: Indexing huge data
Erick,

That helps, so I can focus on the problem areas. Thanks.

On 3/5/14, 6:03 PM, Erick Erickson wrote:
> [...]
RE: Indexing huge data
One more suggestion is to collect/prepare the data in CSV format (a 1-2 million sample, depending on size) and then import the data directly into Solr using the CSV handler & curl. This will give you the pure indexing time & the differences.

Thanks,
Susheel

-----Original Message-----
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, March 05, 2014 8:03 PM
To: solr-user@lucene.apache.org
Subject: Re: Indexing huge data

[...]
Re: Indexing huge data
Here's the easiest thing to try to figure out where to concentrate your energies: just comment out the server.add call in your SolrJ program. Well, and any commits you're doing from SolrJ.

My bet: your program will run at about the same speed it does when you actually index the docs, indicating that your problem is on the data acquisition side. Of course the older I get, the more times I've been wrong :).

You can also monitor the CPU usage on the box running Solr. I often see it idling along < 30% when indexing, or even < 10%, again indicating that the bottleneck is on the acquisition side.

Note I haven't mentioned any solutions; I'm a believer in identifying the _problem_ before worrying about a solution.

Best,
Erick

On Wed, Mar 5, 2014 at 4:29 PM, Jack Krupansky wrote:
> [...]
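Erick's comment-out experiment amounts to timing the indexing loop twice, once with the add call and once without, and comparing. A trivial harness, with the fetch and add steps left as placeholder comments:

```java
public class IndexTimer {
    // Time a runnable in milliseconds.
    static long timeMillis(Runnable r) {
        long start = System.nanoTime();
        r.run();
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        long acquisitionOnly = timeMillis(() -> {
            // for each batch: fetch rows from the DB and build docs,
            // but DO NOT call client.add(docs)
        });
        long withIndexing = timeMillis(() -> {
            // same loop, with client.add(docs) re-enabled
        });
        // If the two numbers are close, the acquisition side (DB/query)
        // is the bottleneck, not Solr.
        System.out.println(acquisitionOnly + " ms vs " + withIndexing + " ms");
    }
}
```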
Re: Indexing huge data
Make sure you're not doing a commit on each individual document add. Committing every few minutes, or every few hundred or few thousand documents, is sufficient. You can set up auto commit in solrconfig.xml.

-- Jack Krupansky

-----Original Message-----
From: Rallavagu
Sent: Wednesday, March 5, 2014 2:37 PM
To: solr-user@lucene.apache.org
Subject: Indexing huge data

All,

Wondering about best practices/common practices to index/re-index a huge amount of data in Solr. The data is about 6 million entries in the db and another source (the data is not located in one resource). Trying a solrj-based solution to collect data from the different resources to index into Solr. It takes hours to index into Solr.

Thanks in advance
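The auto-commit setup Jack mentions lives in solrconfig.xml's update handler section; a sketch, with illustrative interval values:

```xml
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- hard commit every 60s or 10,000 docs, whichever comes first -->
  <autoCommit>
    <maxTime>60000</maxTime>
    <maxDocs>10000</maxDocs>
    <!-- don't open a new searcher on hard commit during bulk loads -->
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>
```

With openSearcher=false, hard commits flush and fsync the index without paying the cost of warming a new searcher on every commit, which is what you want while bulk indexing.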
Re: Indexing huge data
Hi,

Each doc is 100K? That's on the big side, yes, and the server seems on the small side, yes. Hence the "speed". :)

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Wed, Mar 5, 2014 at 3:37 PM, Rallavagu wrote:
> Otis,
>
> Good points. I guess you are suggesting that it depends on the resources.
> The documents are 100k each, and the pre-processing server is a 2-CPU VM
> running with 4G RAM. So, that could be a "small" machine, relatively, to
> process such an amount of data?
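The "big docs, small server" point is easy to check with back-of-the-envelope math. A quick sketch, using the ~100 KB/doc and 6-million-doc figures from the thread:

```java
public class SizeEstimate {
    // total raw data in GB for `docs` documents of `kbPerDoc` kilobytes each
    static double totalGb(long docs, long kbPerDoc) {
        return docs * kbPerDoc * 1024.0 / (1024L * 1024 * 1024);
    }

    public static void main(String[] args) {
        // 6M docs x ~100 KB each must flow through the 4 GB pre-processing VM
        System.out.printf("~%.0f GB of raw document data%n",
                totalGb(6_000_000L, 100));   // ~572 GB
    }
}
```

Roughly 572 GB of raw data passing through a 2-CPU, 4 GB VM goes a long way toward explaining a multi-hour run, independent of anything Solr does.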
Re: Indexing huge data
Otis,

Good points. I guess you are suggesting that it depends on the resources. The documents are 100k each, and the pre-processing server is a 2-CPU VM running with 4G RAM. So, that could be a "small" machine, relatively, to process such an amount of data?

On 3/5/14, 12:27 PM, Otis Gospodnetic wrote:
> Hi,
>
> It depends. Are docs huge or small? Server single core or 32 core? Heap
> big or small? etc. etc.
>
> Otis
Re: Indexing huge data
Hi,

It depends. Are docs huge or small? Server single core or 32 core? Heap big or small? etc. etc.

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Wed, Mar 5, 2014 at 3:02 PM, Rallavagu wrote:
> It seems the latency is introduced by collecting the data from different
> sources and putting them together, then the actual Solr indexing. I would
> say all these activities are contributing equally. So, is it normal to
> expect indexing to run for this long? Wondering what to expect in such
> cases. Thanks.
Re: Indexing huge data
It seems the latency is introduced by collecting the data from different sources and putting them together, then the actual Solr indexing. I would say all these activities are contributing equally. So, is it normal to expect indexing to run for this long? Wondering what to expect in such cases. Thanks.

On 3/5/14, 11:47 AM, Otis Gospodnetic wrote:
> Hi,
>
> 6M is really not huge these days. 6B is big, though also still not huge
> any more. What seems to be the bottleneck? Solr or DB or network or
> something else?
>
> Otis
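If gathering data from the different sources dominates, overlapping those fetches usually helps more than tuning Solr. A minimal stdlib-only sketch of the pattern, where fetchFromSourceA/fetchFromSourceB are hypothetical stand-ins for the real per-source queries, and the returned batch is what you would hand to SolrJ's add():

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelCollector {
    // Hypothetical stand-ins for the per-document queries against each source;
    // in real code these would hit the DB / other backends.
    static String fetchFromSourceA(int id) { return "a" + id; }
    static String fetchFromSourceB(int id) { return "b" + id; }

    // Gather all pieces of each document concurrently, preserving input order.
    static List<String> collect(List<Integer> ids, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int id : ids) {
                // one task assembles one complete document from both sources
                futures.add(pool.submit(() ->
                        fetchFromSourceA(id) + "|" + fetchFromSourceB(id)));
            }
            List<String> docs = new ArrayList<>();
            for (Future<String> f : futures) docs.add(f.get());
            return docs;   // in real code: client.add(batch of SolrInputDocuments)
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

The thread-pool size should be tuned to what the source databases can tolerate; the point is only that per-document fetch latencies can overlap instead of accumulating serially.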
Re: Indexing huge data
Hi,

6M is really not huge these days. 6B is big, though also still not huge any more. What seems to be the bottleneck? Solr or DB or network or something else?

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/

On Wed, Mar 5, 2014 at 2:37 PM, Rallavagu wrote:
> All,
>
> Wondering about best practices/common practices to index/re-index a huge
> amount of data in Solr. The data is about 6 million entries in the db and
> other sources (the data is not located in one resource). Trying a
> solrj-based solution to collect data from different sources to index into
> Solr. It takes hours to index into Solr.
>
> Thanks in advance
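One way to answer the bottleneck question empirically is to time the pipeline with the Solr add step stubbed out and again with it enabled. A rough harness sketch; the fetch/add calls in the comments are hypothetical placeholders for your own code:

```java
public class Bottleneck {
    // one stage (or combination of stages) of the indexing pipeline
    interface Step { void run(int id); }

    // wall-clock milliseconds to run `step` over nDocs documents
    static long millisFor(int nDocs, Step step) {
        long start = System.nanoTime();
        for (int i = 0; i < nDocs; i++) step.run(i);
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        // fetchOnly: gather and assemble data but skip indexing entirely;
        // fetchAndAdd: the full pipeline including the Solr client call.
        long fetchOnly   = millisFor(1000, id -> { /* fetch(id) */ });
        long fetchAndAdd = millisFor(1000, id -> { /* fetch(id); add(id) */ });
        // If fetchOnly is close to fetchAndAdd, the data-gathering side
        // (DB/network) is the bottleneck, not Solr.
        System.out.println(fetchOnly + " ms vs " + fetchAndAdd + " ms");
    }
}
```

This is the same isolation test as commenting out the client.add() call in a SolrJ loader: whichever configuration barely changes the total time tells you where to spend tuning effort.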
Indexing huge data
All,

Wondering about best practices/common practices to index/re-index a huge amount of data in Solr. The data is about 6 million entries in the db and other sources (the data is not located in one resource). Trying a solrj-based solution to collect data from different sources to index into Solr. It takes hours to index into Solr.

Thanks in advance