Re: SolrCloudServer questions

2014-02-03 Thread Greg Walters
I've seen the best indexing throughput when sending batches of documents 
rather than one document per request. You might try queueing on your 
indexing machines for a bit, then sending off a batch every N documents.
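
A minimal sketch of that batching idea, in plain Java so it runs without SolrJ. BatchingBuffer and its sink callback are made-up names for illustration only; with SolrJ the sink would presumably wrap a bulk call such as add(Collection<SolrInputDocument>) on the client, and the batch size is something you would have to tune.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Buffers items and hands them to a sink in batches of a fixed size. */
final class BatchingBuffer<T> {
    private final int batchSize;
    // Hypothetical sink, e.g. docs -> server.add(docs) in a SolrJ client.
    private final Consumer<List<T>> sink;
    private List<T> pending = new ArrayList<>();

    BatchingBuffer(int batchSize, Consumer<List<T>> sink) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Queues one item and sends a batch once batchSize items are pending. */
    synchronized void add(T item) {
        pending.add(item);
        if (pending.size() >= batchSize) {
            flush();
        }
    }

    /** Sends whatever is pending; call on shutdown so a partial batch is not stranded. */
    synchronized void flush() {
        if (pending.isEmpty()) {
            return;
        }
        List<T> batch = pending;
        pending = new ArrayList<>();
        sink.accept(batch);
    }
}
```

Each indexing machine would call add() for every message it pulls off the queue and flush() when it stops, so the server sees one request per N documents instead of one per document.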

Thanks,
Greg




Re: SolrCloudServer questions

2014-02-01 Thread Software Dev
Our use case is that we have 3 indexing machines pulling off a Kafka queue, and
they are all sending individual updates.






Re: SolrCloudServer questions

2014-02-01 Thread Software Dev
Also, if we are seeing a huge CPU spike on the leader when doing a bulk
index, would changing any of these options help?







SolrCloudServer questions

2014-01-31 Thread Software Dev
Can someone clarify what the following options are:

- updatesToLeaders
- shutdownLBHttpSolrServer
- parallelUpdates

Also, I remember that older versions of Solr used an efficient, more compact
format between SolrJ and Solr. Does this still exist in the latest version of
Solr? If so, is it the default?

Thanks


Re: SolrCloudServer questions

2014-01-31 Thread Greg Walters
I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore my 
response.

 - updatesToLeaders

Only send documents to shard leaders while indexing. This saves cross-talk 
between slaves and leaders, which results in more efficient document routing.

 shutdownLBHttpSolrServer

CloudSolrServer uses an LBHttpSolrServer behind the scenes to distribute 
requests (those that aren't updates sent directly to leaders). Where did you 
find this? I don't see it in the javadoc anywhere, but it is a boolean in the 
CloudSolrServer class. It looks like when you create a new CloudSolrServer and 
pass it your own LBHttpSolrServer, the boolean gets set to false and the 
CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut down 
itself.
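
Roughly, this is the usual "only close what you created" pattern. The sketch below uses stand-in classes (LoadBalancer and CloudClient are invented names, not the real SolrJ classes) just to show how such a boolean would behave; it is my reading of the code, not a copy of it.

```java
/** Stand-in for a load-balancing client; not the real LBHttpSolrServer. */
final class LoadBalancer {
    private boolean shutDown = false;

    void shutdown() {
        shutDown = true;
    }

    boolean isShutDown() {
        return shutDown;
    }
}

/** Stand-in for a cloud client that may or may not own its load balancer. */
final class CloudClient {
    private final LoadBalancer lb;
    // Mirrors the idea behind shutdownLBHttpSolrServer: true only when we created lb.
    private final boolean shutdownLb;

    /** No load balancer supplied: create one and take responsibility for closing it. */
    CloudClient() {
        this.lb = new LoadBalancer();
        this.shutdownLb = true;
    }

    /** Caller supplied the load balancer: the caller stays responsible for closing it. */
    CloudClient(LoadBalancer lb) {
        this.lb = lb;
        this.shutdownLb = false;
    }

    void shutdown() {
        if (shutdownLb) {
            lb.shutdown();
        }
    }

    LoadBalancer loadBalancer() {
        return lb;
    }
}
```

So if you share one load balancer between several clients, shutting down one client leaves it running for the others, and you have to shut it down yourself at the end.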

 parallelUpdates

The javadocs don't have any description for this one, but I checked out the code 
for CloudSolrServer, and if parallelUpdates is set it looks like it executes 
update statements to multiple shards at the same time.

I'm no dev but I can read so please excuse any errors on my part.

Thanks,
Greg




Re: SolrCloudServer questions

2014-01-31 Thread Mark Miller


On Jan 31, 2014, at 1:56 PM, Greg Walters greg.walt...@answers.com wrote:

 I'm assuming you mean CloudSolrServer here. If I'm wrong please ignore my 
 response.
 
 -updatesToLeaders
 
 Only send documents to shard leaders while indexing. This saves cross-talk 
 between slaves and leaders which results in more efficient document routing.

Right, but recently this has less of an effect because CloudSolrServer can now 
hash documents and send them directly to the right place. This option has 
become mostly historical. Just make sure you set the correct id field on the 
CloudSolrServer instance for this hashing to work (I think it defaults to “id”).
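
To make the idea concrete, here is a rough sketch of client-side routing by id hash. It is not Solr's actual algorithm: Solr's router uses its own hash function and per-shard hash ranges, while this stand-in (IdHashRouter is an invented name) simply takes String.hashCode() modulo the shard count.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative router: maps a document id to a shard index by hashing the id. */
final class IdHashRouter {
    private final int shardCount;

    IdHashRouter(int shardCount) {
        if (shardCount < 1) {
            throw new IllegalArgumentException("shardCount must be >= 1");
        }
        this.shardCount = shardCount;
    }

    /** The same id always maps to the same shard index in [0, shardCount). */
    int shardFor(String id) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(id.hashCode(), shardCount);
    }

    /** Groups ids by target shard so each group can go straight to that shard's leader. */
    Map<Integer, List<String>> groupByShard(List<String> ids) {
        Map<Integer, List<String>> groups = new HashMap<>();
        for (String id : ids) {
            groups.computeIfAbsent(shardFor(id), k -> new ArrayList<>()).add(id);
        }
        return groups;
    }
}
```

The point is only that the client can work out the destination from the id field alone, which is why the id field has to be set correctly for the routing to land on the right shard.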

 
 shutdownLBHttpSolrServer
 
 CloudSolrServer uses a LBHttpSolrServer behind the scenes to distribute 
 requests (that aren't updates directly to leaders). Where did you find this? 
 I don't see this in the javadoc anywhere but it is a boolean in the 
 CloudSolrServer class. It looks like when you create a new CloudSolrServer 
 and pass it your own LBHttpSolrServer the boolean gets set to false and the 
 CloudSolrServer won't shut down the LBHttpSolrServer when it gets shut down.
 
 parallelUpdates
 
 The javadocs don't have any description for this one, but I checked out the 
 code for CloudSolrServer, and if parallelUpdates is set it looks like it 
 executes update statements to multiple shards at the same time.

Right, we should definitely add some javadoc, but this sends updates to shards 
in parallel rather than with a single thread. It can really increase update 
speed. It's still not as powerful as using CloudSolrServer from multiple 
threads, but a nice improvement nonetheless.
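
As a sketch of what "in parallel rather than with a single thread" means (plain Java, not the CloudSolrServer internals): once the documents are grouped per shard, each group goes out as its own task. ShardFanOut and the sender callback are hypothetical names standing in for the real per-shard request.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.function.BiConsumer;

final class ShardFanOut {
    /**
     * Sends each shard's batch as its own task and waits for all of them.
     * sender is a hypothetical callback (shard name, docs) standing in for the request.
     */
    static <T> void sendInParallel(Map<String, List<T>> batchesByShard,
                                   BiConsumer<String, List<T>> sender,
                                   ExecutorService pool)
            throws InterruptedException, ExecutionException {
        List<Future<?>> inFlight = new ArrayList<>();
        for (Map.Entry<String, List<T>> e : batchesByShard.entrySet()) {
            inFlight.add(pool.submit(() -> sender.accept(e.getKey(), e.getValue())));
        }
        for (Future<?> f : inFlight) {
            // get() rethrows a failed send wrapped in ExecutionException.
            f.get();
        }
    }
}
```

With a single thread the total time is the sum of the per-shard requests; fanned out like this it is closer to the slowest one, provided the pool has enough threads.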


- Mark

http://about.me/markrmiller

 
 



Re: SolrCloudServer questions

2014-01-31 Thread Software Dev
Which, if any, of these settings would be beneficial when bulk uploading?






Re: SolrCloudServer questions

2014-01-31 Thread Mark Miller
Just make sure parallelUpdates is set to true.

If you want to load even faster, you can use the bulk add methods, or if you 
need more fine-grained responses, use the single add from multiple threads 
(though bulk add can also be done via multiple threads if you really want to 
try and push the max).
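
A minimal sketch of the "bulk add from multiple threads" variant, in plain Java. ParallelBulkLoader and bulkAdd are invented names; bulkAdd stands in for a thread-safe bulk add call on the client. It assumes the queue is already filled (a one-shot load); a consumer that keeps pulling from a live queue would need a blocking poll instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

final class ParallelBulkLoader {
    /**
     * Drains the queue with several workers; each worker sends batches of up to batchSize items.
     * bulkAdd is a hypothetical thread-safe callback standing in for a bulk add call.
     */
    static <T> void load(BlockingQueue<T> queue, int threads, int batchSize,
                         Consumer<List<T>> bulkAdd) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                List<T> batch = new ArrayList<>(batchSize);
                // drainTo returns 0 once the queue is empty, which ends this worker.
                while (queue.drainTo(batch, batchSize) > 0) {
                    bulkAdd.accept(batch);
                    batch = new ArrayList<>(batchSize);
                }
            });
        }
        pool.shutdown();
        // The one-minute limit is arbitrary; the result is ignored in this sketch.
        pool.awaitTermination(1, TimeUnit.MINUTES);
    }
}
```

The trade-off is the one described above: fewer, larger requests and more concurrency, at the cost of coarser responses, since a failure is reported per batch rather than per document.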

- Mark

http://about.me/markrmiller
