There are two other options, both of which will be faster (and less likely to
impact read latencies) but require some app-side programming: if you're willing
to generate the sstables programmatically with CQLSSTableWriter or similar,
you can write the data as sstables up front instead of pushing it through the
normal write path.

Once you do that, you can:

1) stream them in with sstableloader (which will always send them to the
right replicas and handle renumbering the generation), or

2) manually figure out what the replicas are, rsync the files out, and call 
nodetool refresh

(If you google around you may see references to bulkSaveToCassandra, which
seems to be DSE's implementation of #1. If you're a DataStax customer you
could consider just using that; if you're not, you'll need to recreate it using
https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java
 )
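For reference, a minimal sketch of generating sstables with CQLSSTableWriter. The keyspace, table, schema, and output directory here are made-up placeholders, and exact builder options vary a bit between Cassandra versions, so treat this as a starting point rather than a drop-in implementation (it needs the cassandra-all jar on the classpath):

```java
import java.io.IOException;
import org.apache.cassandra.io.sstable.CQLSSTableWriter;

public class SSTableExport {
    public static void main(String[] args) throws IOException {
        // Hypothetical schema and paths -- substitute your own.
        String schema = "CREATE TABLE my_ks.my_table (id text PRIMARY KEY, value text)";
        String insert = "INSERT INTO my_ks.my_table (id, value) VALUES (?, ?)";

        // The output directory should follow the keyspace/table layout,
        // e.g. /tmp/sstables/my_ks/my_table, so sstableloader can pick it up.
        CQLSSTableWriter writer = CQLSSTableWriter.builder()
                .inDirectory("/tmp/sstables/my_ks/my_table")
                .forTable(schema)
                .using(insert)
                .build();

        // Rows are buffered and flushed to sstable files on disk;
        // nothing touches the cluster at this point.
        writer.addRow("key1", "value1");
        writer.addRow("key2", "value2");

        // close() finalizes the sstable files.
        writer.close();
    }
}
```

Once the files exist you'd stream them with something along the lines of `sstableloader -d <contact-point> /tmp/sstables/my_ks/my_table` (option 1), or rsync them to the right replicas' data directories and run `nodetool refresh my_ks my_table` (option 2).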



- Jeff

-- 
Jeff Jirsa


> On Jan 30, 2018, at 12:12 AM, Julien Moumne <jmou...@deezer.com> wrote:
> 
> Hello, I am looking for best practices for the following use case :
> 
> Once a day, we insert at the same time 10 full tables (several 100GiB each) 
> using Spark C* driver, without batching, with CL set to ALL.
> 
> Whether skinny rows or wide rows, data for a partition key is always 
> completely updated / overwritten, i.e. every command is an insert.
> 
> This imposes a great load on the cluster (huge CPU consumption), and this 
> load greatly impacts the constant reads we serve. Read latencies are fine the 
> rest of the time.
> 
> Is there any best practices we should follow to ease the load when importing 
> data into C* except
>  - reducing the number of concurrent writes and throughput on the driver side
>  - reducing the number of compaction threads and throughput on the cluster
> 
> In particular : 
>  - is there any evidence that writing multiple tables at the same time 
> produces more load than writing the tables one at a time when tables are 
> completely written at once such as we do?
>  - because of the heavy writes, we use STCS. Is it the best choice 
> considering data is completely overwritten once a day? Tables contain 
> collections and UDTs.
> 
> (We manage data expiration with TTL set to several days.
> We use SSDs.)
> 
> Thanks!
