Re: Question on efficient loading from Cassandra

Nikolai Tikhonov Thu, 27 Jul 2017 10:28:37 -0700

Hello,

>So, there is a reduction in elapsed time. Correct?


I don't think that holds in every case. If you have a significant number
of nodes (for example, 20 nodes with 4 cores each), then within a short period
Ignite will be querying Cassandra from ~80 threads. I'm not sure that
such a high load will give better performance than one thread per node. BTW, do
you know whether Cassandra caches queries?
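
To make the arithmetic concrete, here is a rough sketch in plain Java. It
assumes one loader thread per core on every node, and that every node runs
every query passed to loadCache(). The class and method names are made up
for illustration and are not part of Ignite.

```java
// Back-of-the-envelope estimate only; not an Ignite API.
final class LoadEstimate {
    /** Queries that may hit Cassandra at once, assuming one thread per core. */
    static int concurrentQueries(int nodes, int coresPerNode) {
        return nodes * coresPerNode;
    }

    /** Total query executions, assuming every node runs every query. */
    static int totalExecutions(int nodes, int queries) {
        return nodes * queries;
    }

    public static void main(String[] args) {
        // 20 nodes with 4 cores each, as in the example above.
        System.out.println(concurrentQueries(20, 4));
        // 20 nodes each running a hypothetical 16 per-partition queries.
        System.out.println(totalExecutions(20, 16));
    }
}
```

With one thread per node the first figure would drop to 20, which is the
comparison I have in mind.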


On Thu, Jul 27, 2017 at 3:03 AM, Roger Fischer (CW) <[email protected]>
wrote:

> Hello,
>
>
>
> what is the best way to efficiently load data from a backing store, like
> Cassandra. I am looking for a solution that minimizes work in Ignite and
> Cassandra.
>
>
>
> As I understand:
>
>
>
> The simplest way is to call loadCache() with a single select statement.
>
> cache.loadCache(null, "select * from a_table where a_date_time >= '2017-07-25 10:00:00'");
>
>
>
> Is it correct that:
>
> 1) Each Ignite node gets the same loadCache() request.
>
> 2) Each Ignite node sends the same query to Cassandra.
>
> 3) Each Ignite node gets all matched objects (rows) back from Cassandra.
>
> 4) Each Ignite node stores only the objects for which it has the primary
> partition, or a backup partition.
>
>
>
> Unless I misunderstand, this simple approach has the following
> inefficiencies:
>
> a) Cassandra executes the same query multiple times, once for each Ignite
> node.
>
> b) The query results are transferred multiple times, once for each Ignite
> node.
>
> c) The Ignite node gets a lot of data which it does not need (it has
> neither the primary nor a backup partition).
>
> d) Each Cassandra node has to query all partitions.
>
>
>
> loadCache() supports multiple queries. This allows the query to be broken
> down, ideally (for this case) into one query per Cassandra partition.
>
>
>
> cache.loadCache(null,
>     "select * from a_table where partition_key = 0 and a_date_time >= '2017-07-25 10:00:00'",
>     "select * from a_table where partition_key = 1 and a_date_time >= '2017-07-25 10:00:00'",
>     …);
>
>
>
> This optimizes the Cassandra query, as each query is constrained to one
> Cassandra partition.
>
>
>
> But, I think, each node still needs to execute each query. Thus none of
> the other inefficiencies are eliminated.
>
>
>
> I believe that, when multiple cores (worker threads) are available, the
> Ignite nodes will execute multiple queries in parallel. So, there is a
> reduction in elapsed time. Correct?
>
>
>
> Now, is there any way to avoid having Cassandra execute the same query
> multiple times and transfer the data multiple times?
>
>
>
> One approach would be that an Ignite node modifies the query so that it
> only includes the partitions for which it has the primary or a backup
> partition. That eliminates some duplication, but may not result in
> efficient queries in Cassandra.
>
>
>
> Another approach is that Ignite forwards objects for which it is not the
> primary or does not have a backup (similar to when an application does a
> put()). That would optimize the Cassandra query, but require additional
> communications between Ignite nodes.
>
>
>
> What if Ignite and Cassandra partitions were aligned? Then queries could
> be created that only return data relevant to the node and only query a
> subset of Cassandra partitions. But this does not seem practical for a
> generalized system (I think).
>
>
>
> Any other suggestions?
>
>
>
> Thanks…
>
>
>
> Roger
>
>
>
> PS: The use case for this is to use Ignite as an SQL cache for a large
> data set in the Cassandra DB. The most recent data is pre-loaded (and
> updated) in Ignite. When older data is required, it is loaded first into
> Ignite, and then processed. It is this dynamic loading that should be quick
> (and efficient).
>
>
>
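
The per-partition queries quoted above could be generated rather than
written by hand. The following is a minimal sketch in plain Java, assuming
integer partition keys numbered from 0 and the table and column names used in
the question (a_table, partition_key, a_date_time). The class name is
hypothetical, and the string concatenation is for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; not part of Ignite or the Cassandra driver.
final class PartitionQueries {
    /** Builds one query string per partition key in [0, partitionCount). */
    static String[] build(int partitionCount, String since) {
        List<String> queries = new ArrayList<>();
        for (int key = 0; key < partitionCount; key++) {
            queries.add("select * from a_table where partition_key = " + key
                + " and a_date_time >= '" + since + "'");
        }
        return queries.toArray(new String[0]);
    }

    public static void main(String[] args) {
        for (String query : build(2, "2017-07-25 10:00:00")) {
            System.out.println(query);
        }
        // The array could then be handed to loadCache as its varargs, e.g.
        // cache.loadCache(null, (Object[]) build(2, "2017-07-25 10:00:00"));
    }
}
```

This only shortens the call site. Each node would still execute every query
in the list, so the duplication described in the question remains.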
