Re: SolrCloudServer questions
I've seen the best indexing throughput when sending batches of documents rather than individual documents per request. You might try queueing on your indexing machines for a bit, then sending off a batch every N documents.
Thanks,
Greg
On Feb 1, 2014, at 6:49 PM, Software Dev static.void@gmail.com wrote:
> Also, if we are seeing a huge CPU spike on the leader when doing a bulk index, would changing any of the options help?
> On Sat, Feb 1, 2014 at 2:59 PM, Software Dev static.void@gmail.com wrote:
>> Our use case is that we have 3 indexing machines pulling off a Kafka queue, and they are all sending individual updates.
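The "queue for a bit, then send a batch every N documents" idea can be sketched in plain Java. This is only an illustration under assumptions: `BatchingBuffer` and its `sink` are hypothetical names, not part of SolrJ, and the sink stands in for whatever performs the bulk add (in SolrJ that would presumably be the `add` overload that takes a collection of documents).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/**
 * Hypothetical helper: collects items and hands them to a sink in
 * batches of a fixed size. The sink is a stand-in for the bulk add call.
 */
final class BatchingBuffer<T> {
    private final int batchSize;
    private final Consumer<List<T>> sink;
    private final List<T> pending = new ArrayList<>();

    BatchingBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Queues one item and flushes as soon as batchSize items are waiting. */
    synchronized void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is queued; call on shutdown so a partial batch is not stranded. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = new ArrayList<>(pending);
        pending.clear();
        sink.accept(batch);
    }
}
```

Each indexing machine would feed documents from its queue consumer into `add` and call `flush` when it stops. A time-based flush is left out here to keep the sketch short.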
Re: SolrCloudServer questions
Our use case is that we have 3 indexing machines pulling off a Kafka queue, and they are all sending individual updates.
On Fri, Jan 31, 2014 at 12:54 PM, Mark Miller markrmil...@gmail.com wrote:
> Just make sure parallelUpdates is set to true. If you want to load even faster, you can use the bulk add methods, or if you need more fine-grained responses, use the single add from multiple threads (though bulk add can also be done via multiple threads if you really want to try and push the max).
> - Mark
> http://about.me/markrmiller
Re: SolrCloudServer questions
Also, if we are seeing a huge CPU spike on the leader when doing a bulk index, would changing any of the options help?
SolrCloudServer questions
Can someone clarify what the following options are:
- updatesToLeaders
- shutdownLBHttpSolrServer
- parallelUpdates
Also, I remember that older versions of Solr had an efficient, more compact format that was used between SolrJ and Solr. Does this still exist in the latest version of Solr? If so, is it the default?
Thanks
Re: SolrCloudServer questions
I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore my response.
updatesToLeaders: Only send documents to shard leaders while indexing. This saves cross-talk between slaves and leaders, which results in more efficient document routing.
shutdownLBHttpSolrServer: CloudSolrServer uses an LBHttpSolrServer behind the scenes to distribute requests (that aren't updates sent directly to leaders). Where did you find this? I don't see it in the javadoc anywhere, but it is a boolean in the CloudSolrServer class. It looks like when you create a new CloudSolrServer and pass it your own LBHttpSolrServer, the boolean gets set to false and the CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut down.
parallelUpdates: The javadocs don't have any description for this one, but I checked out the code for CloudSolrServer, and if parallelUpdates is set it looks like it executes update statements to multiple shards at the same time.
I'm no dev but I can read, so please excuse any errors on my part.
Thanks,
Greg
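The ownership rule Greg reads out of the source (a client shuts down its load balancer only if it created it) can be modelled in a few lines of plain Java. This is a sketch of the pattern only: `LoadBalancer` and `CloudClient` are hypothetical stand-ins, not the real LBHttpSolrServer or CloudSolrServer classes.

```java
/** Hypothetical stand-in for a load-balancing client with a lifecycle. */
final class LoadBalancer {
    private boolean closed;

    void shutdown() {
        closed = true;
    }

    boolean isClosed() {
        return closed;
    }
}

/** Hypothetical stand-in for a cloud client that may or may not own its load balancer. */
final class CloudClient {
    private final LoadBalancer lb;
    // True only when this client created the load balancer itself.
    private final boolean shutdownLb;

    /** The client creates its own load balancer, so it is responsible for closing it. */
    CloudClient() {
        this.lb = new LoadBalancer();
        this.shutdownLb = true;
    }

    /** The caller supplies the load balancer, so the caller keeps responsibility for it. */
    CloudClient(LoadBalancer shared) {
        this.lb = shared;
        this.shutdownLb = false;
    }

    /** Closes the load balancer only if this client owns it. */
    void shutdown() {
        if (shutdownLb) {
            lb.shutdown();
        }
    }

    LoadBalancer loadBalancer() {
        return lb;
    }
}
```

The practical consequence, if the real class behaves as described in the thread, is that code passing in its own load balancer would also need to shut it down itself.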
Re: SolrCloudServer questions
On Jan 31, 2014, at 1:56 PM, Greg Walters greg.walt...@answers.com wrote:
> I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore my response.
> updatesToLeaders: Only send documents to shard leaders while indexing. This saves cross-talk between slaves and leaders, which results in more efficient document routing.
Right, but recently this has less of an effect because CloudSolrServer can now hash documents and send them directly to the right place. This option has become more historical. Just make sure you set the correct id field on the CloudSolrServer instance for this hashing to work (I think it defaults to “id”).
> shutdownLBHttpSolrServer: CloudSolrServer uses an LBHttpSolrServer behind the scenes to distribute requests (that aren't updates sent directly to leaders). Where did you find this? I don't see it in the javadoc anywhere, but it is a boolean in the CloudSolrServer class. It looks like when you create a new CloudSolrServer and pass it your own LBHttpSolrServer, the boolean gets set to false and the CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut down.
> parallelUpdates: The javadocs don't have any description for this one, but I checked out the code for CloudSolrServer, and if parallelUpdates is set it looks like it executes update statements to multiple shards at the same time.
Right, we should definitely add some javadoc, but this sends updates to shards in parallel rather than with a single thread. It can really increase update speed. It is still not as powerful as using CloudSolrServer from multiple threads, but it is a nice improvement nonetheless.
- Mark
http://about.me/markrmiller
> I'm no dev but I can read, so please excuse any errors on my part.
> Thanks,
> Greg
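The client-side routing Mark describes (hash each document's id, then send it straight to the shard that owns it) might look roughly like the following. This is a simplified sketch under stated assumptions: `ShardRouting` is a hypothetical class, and `String.hashCode()` with a modulo is a stand-in for Solr's actual router, which uses its own hash function and per-shard hash ranges.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration of grouping documents by target shard before sending. */
final class ShardRouting {
    /**
     * Maps an id to a shard index in [0, numShards).
     * Stand-in hash: Solr's real router does not use String.hashCode().
     */
    static int shardFor(String id, int numShards) {
        // floorMod keeps the result non-negative even when hashCode() is negative.
        return Math.floorMod(id.hashCode(), numShards);
    }

    /** Groups ids by shard so that each group can be sent to its shard in one request. */
    static Map<Integer, List<String>> groupByShard(List<String> ids, int numShards) {
        Map<Integer, List<String>> byShard = new TreeMap<>();
        for (String id : ids) {
            byShard.computeIfAbsent(shardFor(id, numShards), k -> new ArrayList<>()).add(id);
        }
        return byShard;
    }
}
```

With the groups built, sending them one after another corresponds to the single-threaded behavior, while sending them concurrently corresponds to what the thread describes for parallelUpdates. Hashing on the wrong field would put documents in the wrong group, which is presumably why the id field setting matters.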
Re: SolrCloudServer questions
Which, if any, of these settings would be beneficial when bulk uploading?
Re: SolrCloudServer questions
Just make sure parallelUpdates is set to true. If you want to load even faster, you can use the bulk add methods, or if you need more fine-grained responses, use the single add from multiple threads (though bulk add can also be done via multiple threads if you really want to try and push the max).
- Mark
http://about.me/markrmiller
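Mark's last option, bulk adds issued from multiple threads, can be sketched with a fixed thread pool. This is a minimal illustration, not SolrJ code: `ParallelBatchSender` and the `sender` callback are hypothetical, with the callback standing in for whatever performs one bulk add.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

/** Hypothetical helper that sends batches of documents from several threads. */
final class ParallelBatchSender {
    /** Splits docs into consecutive batches of at most batchSize items. */
    static <T> List<List<T>> partition(List<T> docs, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int i = 0; i < docs.size(); i += batchSize) {
            int end = Math.min(i + batchSize, docs.size());
            batches.add(new ArrayList<>(docs.subList(i, end)));
        }
        return batches;
    }

    /** Sends every batch through sender using a pool of the given size, then waits for all of them. */
    static <T> void sendAll(List<T> docs, int batchSize, int threads, Consumer<List<T>> sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> futures = new ArrayList<>();
            for (List<T> batch : partition(docs, batchSize)) {
                futures.add(pool.submit(() -> sender.accept(batch)));
            }
            for (Future<?> f : futures) {
                try {
                    f.get(); // surfaces any failure raised inside sender
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    throw new IllegalStateException("interrupted while waiting for a batch", e);
                } catch (ExecutionException e) {
                    throw new IllegalStateException("a batch failed", e.getCause());
                }
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

The sender must be safe to call from several threads at once. Batch size and thread count are tuning knobs to measure rather than guess, especially given the CPU spike on the leader mentioned earlier in the thread.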