Thank you all for your inputs.
Before trying advanced ideas, we'll first try reducing the number of Spark
executors and see if export times are still acceptable.
Two additional questions.
Would you know if there is evidence that inserting skinny rows in sorted
order (no batching) helps C*?
Also, in the case of wide rows, is there evidence that sorting clustering
keys within partition batches helps ease C*'s job?
The idea would be that the underlying storage structure (LSM trees?),
and/or other components, require less maintenance when data is inserted in
a mostly ordered way.
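To make the question concrete, here is a minimal, hypothetical sketch of what "inserting in a mostly ordered way" would mean on the client side: group rows by partition key and sort clustering keys within each partition before issuing the inserts. The `order_rows` helper and the `(partition_key, clustering_key, value)` tuple shape are made up for illustration; note also that Cassandra orders partitions by Murmur3 token of the partition key, not lexicographically, so this only guarantees clustering-key order within a partition.

```python
# Hypothetical sketch: client-side ordering of rows before inserting into C*.
# Whether this actually eases memtable/SSTable maintenance is exactly the
# open question above; this only shows the ordering itself.
from itertools import groupby
from operator import itemgetter

def order_rows(rows):
    """Group rows by partition key and sort clustering keys within each group.

    `rows` is an iterable of (partition_key, clustering_key, value) tuples
    (an illustrative shape, not a real driver API). Yields one list of rows
    per partition, clustering keys ascending, ready to insert partition by
    partition.
    """
    # Sort by partition key first, then clustering key within the partition.
    ordered = sorted(rows, key=itemgetter(0, 1))
    for _, group in groupby(ordered, key=itemgetter(0)):
        yield list(group)

# Example: three rows across two partitions, arriving out of order.
batches = list(order_rows([
    ("p2", 3, "c"),
    ("p1", 2, "b"),
    ("p1", 1, "a"),
]))
# batches[0] is partition "p1" with clustering keys 1 then 2,
# batches[1] is partition "p2".
```

Since writes first land in the memtable, which keeps rows sorted in memory anyway, any benefit of pre-sorting would presumably show up elsewhere (e.g. fewer overlapping SSTables), which is why I'm asking whether there is measured evidence.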
On 30 January 2018 at 16:41, Jeff Jirsa <jji...@gmail.com> wrote:
> Two other options, both of which will be faster (and less likely to impact
> read latencies) but require some app-side programming: generate the
> sstables programmatically with CQLSSTableWriter or similar. Once you do
> that, you can:
> 1) stream them in with the sstableloader (which will always send them to
> the right replicas and handle renumbering the generation), or
> 2) manually figure out what the replicas are, rsync the files out, and
> call nodetool refresh
> (If you google around you may see references to bulkSaveToCassandra,
> which seems to be DSE’s implementation of #1 - if you’re a datastax
> customer you could consider just using that; if you’re not, you’ll need to
> recreate it using
> https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/CQLSSTableWriter.java )
> - Jeff
> Jeff Jirsa
> On Jan 30, 2018, at 12:12 AM, Julien Moumne <jmou...@deezer.com> wrote:
> Hello, I am looking for best practices for the following use case:
> Once a day, we insert 10 full tables at the same time (several hundred GiB
> each) using the Spark C* driver, without batching, with CL set to ALL.
> Whether skinny rows or wide rows, data for a partition key is always
> completely updated / overwritten, i.e. every command is an insert.
> This imposes a great load on the cluster (huge CPU consumption), and this
> load greatly impacts the constant reads we have. Read latencies are fine
> the rest of the time.
> Are there any best practices we should follow to ease the load when
> importing data into C*, other than:
> - reducing the number of concurrent writes and throughput on the driver
> - reducing the number of compaction threads and throughput on the cluster
> In particular :
> - is there any evidence that writing multiple tables at the same time
> produces more load than writing the tables one at a time, given that each
> table is completely written in one pass, as we do?
> - because of the heavy writes, we use STCS. Is it the best choice
> considering data is completely overwritten once a day? Tables contain
> collections and UDTs.
> (We manage data expiration with TTL set to several days.
> We use SSDs.)
Software Engineering - Data Science
12 rue d'Athènes 75009 Paris - France