Hello,

When doing analytics in Spark, a common pattern is to load the whole table into memory, or to filter on some columns. This works well for column-oriented files (e.g. Parquet), but it seems to be a huge anti-pattern in C*: most common Spark operations end up as either (a) a query without a partition key (a full table scan), or (b) a filter on a non-clustering column. A naive implementation of either will read all SSTables from disk multiple times, in random order (for different keys), resulting in horrible cache performance.
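For concreteness, here is a sketch of the two shapes using the spark-cassandra-connector (the keyspace, table, and column names are hypothetical, and this assumes a reachable cluster, so it is illustrative rather than something I have benchmarked):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._ // spark-cassandra-connector

object ScanShapes {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("scan-shapes")
      .set("spark.cassandra.connection.host", "127.0.0.1") // assumed local node
    val sc = new SparkContext(conf)

    // (a) No partition key restriction: the connector splits the full token
    // range across executors, so every SSTable on every node gets read,
    // and the predicate is evaluated Spark-side after the fact.
    val fullScan = sc.cassandraTable("ks", "events")
      .filter(row => row.getInt("value") > 10)

    // (b) Pushing the predicate down with .where(...) on a non-clustering,
    // non-indexed column: C* still has to scan everything server-side
    // (ALLOW FILTERING semantics), so the I/O pattern is much the same.
    val pushedDown = sc.cassandraTable("ks", "events")
      .where("value > 10")

    println(fullScan.count() + pushedDown.count())
  }
}
```

In both cases the work is proportional to the whole table rather than to the matching rows, which is the cache behaviour described above.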
Does the DataStax driver do any smart tricks to optimize for this?

Cheers,
Eugene