Re: Question on efficient loading from Cassandra

Igor Rudyak Sun, 30 Jul 2017 18:00:39 -0700

Hi Nikolai,

As of now, the Ignite-Cassandra module always executes the same CQL query on
each node while doing loadCache(...).


But your assumptions are right and there is a ticket for this:
https://issues.apache.org/jira/browse/IGNITE-3962
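
To make the cost of the current behavior concrete, here is a minimal
simulation in plain Java (no Ignite or Cassandra dependency). It is only a
sketch: the modulo-based primaryNode() and the "next nodes in the ring hold
the backups" rule are hypothetical stand-ins for Ignite's real affinity
function, and the node, row, and backup counts are made-up inputs.

```java
class LoadCacheCostSketch {
    /** Hypothetical stand-in for the affinity function: maps a row key to its primary node. */
    static int primaryNode(int key, int nodes) {
        return Math.floorMod(key, nodes);
    }

    /** True if the node holds the key as primary or as one of the backups (assumed: next nodes in the ring). */
    static boolean owns(int node, int key, int nodes, int backups) {
        int distance = Math.floorMod(node - primaryNode(key, nodes), nodes);
        return distance <= backups;
    }

    /** Returns {rows transferred from the store, rows actually kept}, assuming backups < nodes. */
    static long[] simulate(int nodes, int rows, int backups) {
        long transferred = 0;
        long stored = 0;
        for (int node = 0; node < nodes; node++) {
            for (int key = 0; key < rows; key++) {
                transferred++;                 // every node receives every matching row
                if (owns(node, key, nodes, backups)) {
                    stored++;                  // but keeps only the rows it owns
                }
            }
        }
        return new long[] {transferred, stored};
    }

    public static void main(String[] args) {
        long[] result = simulate(4, 1000, 1);
        System.out.println("rows transferred: " + result[0]);
        System.out.println("rows stored:      " + result[1]);
    }
}
```

Under these assumptions, the rows transferred grow with the number of nodes,
while the rows kept grow only with the number of backups.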


Igor



On Thu, Jul 27, 2017 at 10:28 AM, Nikolai Tikhonov <[email protected]>
wrote:

> Hello,
>
> >So, there is a reduction in elapsed time. Correct?
>
> I think that is not correct in every case. If you have a significant number
> of nodes (for example, 20 nodes with 4 cores each), then in a short period of
> time Ignite will be querying Cassandra from ~80 threads. I'm not sure that
> this high load will bring more performance than one thread per node. BTW, do
> you know whether Cassandra caches queries?
>
>
> On Thu, Jul 27, 2017 at 3:03 AM, Roger Fischer (CW) <[email protected]>
> wrote:
>
>> Hello,
>>
>>
>>
>> What is the best way to efficiently load data from a backing store, like
>> Cassandra? I am looking for a solution that minimizes work in Ignite and
>> Cassandra.
>>
>>
>>
>> As I understand:
>>
>>
>>
>> The simplest way is to call loadCache() with a single select statement.
>>
>> cache.loadCache(null, "select * from a_table where a_date_time >= '2017-07-25 10:00:00';");
>>
>>
>>
>> Is it correct that:
>>
>> 1) Each Ignite node gets the same loadCache() request.
>>
>> 2) Each Ignite node sends the same query to Cassandra.
>>
>> 3) Each Ignite node gets all matched objects (rows) back from Cassandra.
>>
>> 4) Each Ignite node stores only the objects for which it has the primary
>> partition, or a backup partition.
>>
>>
>>
>> Unless I misunderstand, this simple approach has the following
>> inefficiencies:
>>
>> a) Cassandra executes the same query multiple times, once for each Ignite
>> node.
>>
>> b) The query results are transferred multiple times, once for each Ignite
>> node.
>>
>> c) The Ignite node gets a lot of data which it does not need (it has neither
>> the primary nor a backup partition).
>>
>> d) Each Cassandra node has to query all partitions.
>>
>>
>>
>> loadCache() supports multiple queries. This allows the query to be broken
>> down, ideally (for this case) into one query per Cassandra partition.
>>
>>
>>
>> cache.loadCache(null,
>>     "select * from a_table where partition_key = 0 and a_date_time >= '2017-07-25 10:00:00';",
>>     "select * from a_table where partition_key = 1 and a_date_time >= '2017-07-25 10:00:00';",
>>     …)
>>
>>
>>
>> This optimizes the Cassandra query, as each query is constrained to one
>> Cassandra partition.
>>
>>
>>
>> But, I think, each node still needs to execute each query. Thus none of
>> the other inefficiencies are eliminated.
>>
>>
>>
>> I believe that, when multiple cores (worker threads) are available, the
>> Ignite nodes will execute multiple queries in parallel. So, there is a
>> reduction in elapsed time. Correct?
>>
>>
>>
>> Now, is there any way to avoid Cassandra having to execute the same
>> query multiple times, and the data being transferred multiple times?
>>
>>
>>
>> One approach would be that an Ignite node modifies the query so that it
>> only includes the partitions for which it has the primary or a backup
>> partition. That eliminates some duplication, but may not result in
>> efficient queries in Cassandra.
>>
>>
>>
>> Another approach is that Ignite forwards objects for which it is not the
>> primary or does not have a backup (similar to when an application does a
>> put()). That would optimize the Cassandra query, but require additional
>> communications between Ignite nodes.
>>
>>
>>
>> What if Ignite and Cassandra partitions were aligned? Then queries could
>> be created that only return data relevant to the node and only query a
>> subset of Cassandra partitions. But this does not seem practical for a
>> generalized system (I think).
>>
>>
>>
>> Any other suggestions?
>>
>>
>>
>> Thanks…
>>
>>
>>
>> Roger
>>
>>
>>
>> PS: The use case for this is to use Ignite as an SQL cache for a large
>> data set in the Cassandra DB. The most recent data is pre-loaded (and
>> updated) in Ignite. When older data is required, it is loaded first into
>> Ignite, and then processed. It is this dynamic loading that should be quick
>> (and efficient).
>>
>>
>>
>
>
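
The per-partition variant of loadCache() discussed in the quoted message can
be sketched as a small helper that builds one CQL statement per partition key
value. This is only an illustration: the table and column names (a_table,
partition_key, a_date_time) are the ones used in the thread, while the
PartitionQueryBuilder class, the buildQueries() method, and the partition
count are hypothetical. The inputs are assumed to be trusted, since the
statements are built by plain string concatenation.

```java
class PartitionQueryBuilder {
    /** Builds one CQL statement per partition_key value in [0, partitions). */
    static String[] buildQueries(String table, int partitions, String since) {
        String[] queries = new String[partitions];
        for (int p = 0; p < partitions; p++) {
            queries[p] = "select * from " + table
                + " where partition_key = " + p
                + " and a_date_time >= '" + since + "';";
        }
        return queries;
    }

    public static void main(String[] args) {
        String[] queries = buildQueries("a_table", 3, "2017-07-25 10:00:00");
        for (String query : queries) {
            System.out.println(query);
        }
        // With a real IgniteCache instance, the array would be passed as the
        // varargs of loadCache, for example:
        //   cache.loadCache(null, (Object[]) queries);
    }
}
```

As noted at the top of the thread, each node would still execute every one of
these statements, so this narrows each Cassandra query but does not remove the
duplication across nodes.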
