We use Spark2Cassandra (this fork works with C* 3.0:
https://github.com/leoromanovsky/Spark2Cassandra ).
SSTables are streamed to Cassandra by Spark2Cassandra (so you need to open port
7000 accordingly). During the benchmark we used 25 EMR nodes, but in production we
use fewer nodes to be gentler with Cassandra.
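
In case it helps, here is a minimal sketch of what such a job can look like with
Spark2Cassandra's RDD extension. The import path and the bulkLoadToCass signature
follow the upstream project's README and may differ slightly in the fork above; the
host, keyspace, and table names are placeholders:

import org.apache.spark.{SparkConf, SparkContext}
// Spark2Cassandra enriches RDDs with bulkLoadToCass() through this import
// (package path as documented in the upstream README; adjust for the fork).
import com.github.jparkie.spark.cassandra.rdd._

object BulkLoadSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("sstable-bulk-load")
      // Contact point for the target cluster. The executors also need to reach
      // the storage port (7000) because SSTables are streamed, not inserted via CQL.
      .set("spark.cassandra.connection.host", "cassandra-seed.example.com")

    val sc = new SparkContext(conf)

    // Placeholder data: tuples matching the table's columns, as with saveToCassandra.
    val rows = sc.parallelize(Seq(("key1", 1), ("key2", 2)))

    // Writes SSTables locally on each executor, then streams them to Cassandra.
    rows.bulkLoadToCass(
      keyspaceName = "my_keyspace",
      tableName    = "my_table"
    )

    sc.stop()
  }
}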
Best,
Romain
On Tuesday, 6 February 2018 at 16:05:16 UTC+1, Julien Moumne
<[email protected]> wrote:
This does look like a very viable solution. Thanks.
Could you give us some pointers/documentation on:
- How can we build such SSTables using Spark jobs? Maybe https://github.com/Netflix/sstable-adaptor ?
- How do we send these tables to Cassandra? Does a simple SCP work?
- What is the recommended size for SSTables when the data does not fit in a single executor?
On 5 February 2018 at 18:40, Romain Hardouin <[email protected]>
wrote:
Hi Julien,
We have such a use case on some clusters. If you want to insert big batches at a
fast pace, the only viable solution is to generate SSTables on the Spark side and
stream them to C*. The last time we benchmarked such a job, we achieved 1.3 million
partitions inserted per second on a 3-node C* test cluster - which is
impossible with regular inserts.
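
To give an idea of what "generate SSTables on the Spark side" means in practice,
here is a rough sketch (not our exact job) using Cassandra's CQLSSTableWriter inside
a foreachPartition. The schema, output path, and column types are placeholders, and
the resulting directories still have to be streamed to the cluster afterwards (e.g.
with sstableloader or equivalent streaming code):

import java.io.File
import java.util.UUID
import org.apache.cassandra.io.sstable.CQLSSTableWriter

// Called with rdd.foreachPartition(writeSSTables): each executor writes its own
// set of SSTables to a local directory, which is then streamed to the cluster.
def writeSSTables(rows: Iterator[(String, Int)]): Unit = {
  // One output directory per Spark partition; layout is <dir>/<keyspace>/<table>.
  val outputDir = new File(s"/tmp/sstables/${UUID.randomUUID()}/my_keyspace/my_table")
  outputDir.mkdirs()

  val writer = CQLSSTableWriter.builder()
    .inDirectory(outputDir)
    .forTable(
      """CREATE TABLE my_keyspace.my_table (
        |  id    text PRIMARY KEY,
        |  value int
        |)""".stripMargin)
    .using("INSERT INTO my_keyspace.my_table (id, value) VALUES (?, ?)")
    .build()

  try {
    rows.foreach { case (id, value) =>
      // addRow takes Java objects, hence the explicit boxing of the int.
      writer.addRow(id, Int.box(value))
    }
  } finally {
    writer.close()
  }
}

Note that CQLSSTableWriter comes from the Cassandra codebase (the cassandra-all
artifact), so it has to be available on the executors' classpath.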
Best,
Romain
On Monday, 5 February 2018 at 03:54:09 UTC+1, kurt greaves
<[email protected]> wrote:
> Would you know if there is evidence that inserting skinny rows in sorted order
> (no batching) helps C*?
This won't have any effect as each insert will be handled separately by the
coordinator (or a different coordinator, even). Sorting is also very unlikely
to help even if you did batch.
> Also, in the case of wide rows, is there evidence that sorting clustering keys
> within partition batches helps ease C*'s job?
No evidence, seems very unlikely.
--
Julien MOUMNÉ
Software Engineering - Data Science
Mail: [email protected] rue d'Athènes 75009 Paris - France