Jonathan and Ryan,

Jonathan says “It is absolutely not going to help you if you're trying to lump 
queries together to reduce network & server overhead - in fact it'll do the 
opposite”, but I would note that the CQL3 spec says “The BATCH statement ... 
serves several purposes: 1. It saves network round-trips between the client and 
the server (and sometimes between the server coordinator and the replicas) when 
batching multiple updates.” Is the spec inaccurate? I mean, it seems in 
conflict with your statement.

See:
https://cassandra.apache.org/doc/cql3/CQL.html

I see the spec as gospel – if it’s not accurate, let’s propose a change to make 
it accurate.

The DataStax CQL doc is more nuanced: “Batching multiple statements can save 
network exchanges between the client/server and server coordinator/replicas. 
However, because of the distributed nature of Cassandra, spread requests across 
nearby nodes as much as possible to optimize performance. Using batches to 
optimize performance is usually not successful, as described in Using and 
misusing batches section. For information about the fastest way to load data, 
see "Cassandra: Batch loading without the Batch keyword."”

Maybe what we really need is a “client/driver-side batch”, which is simply a 
way to collect “batches” of operations in the client/driver and then let the 
driver determine what degree of batching and asynchronous operation is 
appropriate.
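To make that concrete, here is a rough sketch of what such a client/driver-side batch could look like, written against the DataStax Python driver. The ClientSideBatch helper, the contact point, keyspace, table, and flush threshold are all invented for illustration; a real driver feature could be much smarter about concurrency:

    from cassandra.cluster import Cluster

    class ClientSideBatch(object):
        """Hypothetical helper: collect statements on the client, then let
        the driver decide how to issue them (here: concurrent async requests)."""

        def __init__(self, session, max_pending=100):
            self.session = session
            self.max_pending = max_pending
            self.pending = []

        def add(self, statement, params):
            self.pending.append((statement, params))
            if len(self.pending) >= self.max_pending:
                self.flush()

        def flush(self):
            # Fan the collected statements out as async requests; a smarter
            # driver could tune the degree of parallelism to cluster load.
            futures = [self.session.execute_async(stmt, params)
                       for stmt, params in self.pending]
            self.pending = []
            for f in futures:
                f.result()  # raises if any individual write failed

    cluster = Cluster(['127.0.0.1'])          # assumed contact point
    session = cluster.connect('my_keyspace')  # assumed keyspace
    insert = session.prepare("INSERT INTO events (id, payload) VALUES (?, ?)")

    batch = ClientSideBatch(session)
    for i in range(500):
        batch.add(insert, (i, 'payload-%d' % i))
    batch.flush()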

It might also be nice to be able to ask the cluster what batch parameters are 
optimal for it, such as the number of mutations per batch and the number of 
simultaneous connections, and to have that answer be dynamic based on overall 
cluster load.

I would also note that the example in the spec has multiple inserts with 
different partition key values, which flies in the face of the admonition to 
refrain from using batches for server-side distribution of requests.
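To illustrate the distinction (DataStax Python driver again; the user_events table and its keys are made up): a batch whose statements all share one partition key is applied within a single replica set, whereas the spec's style of example mixes partition keys and leaves the coordinator to fan the writes out:

    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('my_keyspace')
    insert = session.prepare(
        "INSERT INTO user_events (user_id, event_id, data) VALUES (?, ?, ?)")

    # Single-partition batch: every statement targets user_id = 42, so the
    # whole batch is applied by one set of replicas.
    single = BatchStatement()
    single.add(insert, (42, 1, 'a'))
    single.add(insert, (42, 2, 'b'))
    session.execute(single)

    # Multi-partition batch (the spec-example style): different user_ids
    # force the coordinator to forward pieces to different replica sets.
    multi = BatchStatement()
    multi.add(insert, (42, 3, 'c'))
    multi.add(insert, (99, 1, 'd'))
    session.execute(multi)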

At a minimum, the CQL spec should make a clearer statement of intent and 
non-intent for BATCH.

-- Jack Krupansky

From: Jonathan Haddad 
Sent: Friday, December 12, 2014 12:58 PM
To: user@cassandra.apache.org ; Ryan Svihla 
Subject: Re: batch_size_warn_threshold_in_kb

The really important thing to take away from Ryan's original post is that 
batches are not there for performance.  The only case I consider batches to be 
useful for is when you absolutely need to know that several tables all get a 
mutation (via logged batches).  The use case for this is when you've got 
multiple tables that serve as different views of the same data.  It is 
absolutely not going to help you if you're trying to lump queries together to 
reduce network & server overhead - in fact it'll do the opposite.  If you're 
trying to do that, instead perform many async queries.  The overhead of batches 
in Cassandra is significant and you're going to hit a lot of problems if you 
use them excessively (timeouts / failures).

tl;dr: you probably don't want batch, you most likely want many async calls
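For what it's worth, here is a minimal sketch of both halves of that advice, using the DataStax Python driver with made-up table names: a logged batch for the keep-several-views-consistent case, and plain async queries when you just want throughput:

    from cassandra.cluster import Cluster
    from cassandra.query import BatchStatement, BatchType

    cluster = Cluster(['127.0.0.1'])
    session = cluster.connect('my_keyspace')

    by_id = session.prepare(
        "INSERT INTO users_by_id (id, email, name) VALUES (?, ?, ?)")
    by_email = session.prepare(
        "INSERT INTO users_by_email (email, id, name) VALUES (?, ?, ?)")

    # Legitimate use: a logged batch guarantees that both "views" of the
    # same user eventually receive the mutation, at the cost of the batchlog.
    batch = BatchStatement(batch_type=BatchType.LOGGED)
    batch.add(by_id, (1, 'a@example.com', 'Alice'))
    batch.add(by_email, ('a@example.com', 1, 'Alice'))
    session.execute(batch)

    # For throughput, don't lump unrelated writes into a batch; issue them
    # as async queries instead (throttle the in-flight count in real code).
    futures = [session.execute_async(by_id, (i, 'u%d@example.com' % i, 'User %d' % i))
               for i in range(2, 1000)]
    for f in futures:
        f.result()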


On Thu Dec 11 2014 at 11:15:00 PM Mohammed Guller <moham...@glassbeam.com> 
wrote:

  Ryan,

  Thanks for the quick response.



  I did see that JIRA before posting my question on this list. However, I 
didn't see any information about why 5kb+ of data would cause instability. 5kb, 
or even 50kb, seems too small. For example, if each mutation is 1000+ bytes, 
then with just 5 mutations you will hit that threshold. 



  In addition, Patrick is saying that he does not recommend more than 100 
mutations per batch. So why not warn users just on the # of mutations in a 
batch?



  Mohammed



  From: Ryan Svihla [mailto:rsvi...@datastax.com] 
  Sent: Thursday, December 11, 2014 12:56 PM
  To: user@cassandra.apache.org
  Subject: Re: batch_size_warn_threshold_in_kb



  Nothing magic, just put in there based on experience. You can find the story 
behind the original recommendation here:



  https://issues.apache.org/jira/browse/CASSANDRA-6487



  The key reasoning behind the recommendation comes from Patrick McFadden:


  "Yes that was in bytes. Just in my own experience, I don't recommend more 
than ~100 mutations per batch. Doing some quick math I came up with 5k as 100 x 
50 byte mutations.

  Totally up for debate."



  It's totally changeable; however, it's there in no small part because so many 
people mistake the BATCH keyword for a performance optimization, and this helps 
flag those cases of misuse.



  On Thu, Dec 11, 2014 at 2:43 PM, Mohammed Guller <moham...@glassbeam.com> 
wrote:

  Hi – 

  The cassandra.yaml file has a property called batch_size_warn_threshold_in_kb. 

  The default size is 5kb and, according to the comments in the yaml file, it is 
used to log a WARN on any batch size exceeding this value in kilobytes. It says 
caution should be taken when increasing this threshold, as it can lead to node 
instability.
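  For reference, the entry in cassandra.yaml looks roughly like this (the 
comment is a paraphrase of the description above; 5 is the default):

      # Log a WARN on any batch whose total size exceeds this many kilobytes.
      # Raise with caution; large batches can contribute to node instability.
      batch_size_warn_threshold_in_kb: 5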



  Does anybody know the significance of this magic number 5kb? Why would a 
higher number (say 10kb) lead to node instability?



  Mohammed 

  -- 
  Ryan Svihla
  Solution Architect

  DataStax is the fastest, most scalable distributed database technology, 
delivering Apache Cassandra to the world's most innovative enterprises. 
DataStax is built to be agile, always-on, and predictably scalable to any size. 
With more than 500 customers in 45 countries, DataStax is the database 
technology and transactional backbone of choice for the world's most innovative 
companies such as Netflix, Adobe, Intuit, and eBay. 

