Hi Nikolai,

As of now, the Ignite-Cassandra module always executes the same CQL query on each node during loadCache(...).
But your assumptions are right, and there is a ticket for this:
https://issues.apache.org/jira/browse/IGNITE-3962

Igor

On Thu, Jul 27, 2017 at 10:28 AM, Nikolai Tikhonov <[email protected]> wrote:

> Hello,
>
> > So, there is a reduction in elapsed time. Correct?
>
> I think that this is not correct in every case. If you have a significant
> number of nodes (for example, 20 nodes with 4 cores each), then within a
> short period of time Ignite will be querying Cassandra from ~80 threads.
> I'm not sure that this high load will bring more performance than one
> thread per node. BTW, do you know whether Cassandra caches queries?
>
> On Thu, Jul 27, 2017 at 3:03 AM, Roger Fischer (CW) <[email protected]>
> wrote:
>
>> Hello,
>>
>> What is the best way to efficiently load data from a backing store, like
>> Cassandra? I am looking for a solution that minimizes work in Ignite and
>> Cassandra.
>>
>> As I understand:
>>
>> The simplest way is to call loadCache() with a single select statement.
>>
>> cache.loadCache(null, "select * from a_table where a_date_time >=
>> '2017-07-25 10:00:00';")
>>
>> Is it correct that:
>>
>> 1) Each Ignite node gets the same loadCache() request.
>>
>> 2) Each Ignite node sends the same query to Cassandra.
>>
>> 3) Each Ignite node gets all matched objects (rows) back from Cassandra.
>>
>> 4) Each Ignite node stores only the objects for which it has the primary
>> partition, or a backup partition.
>>
>> Unless I misunderstand, this simple approach has the following
>> inefficiencies:
>>
>> a) Cassandra executes the same query multiple times, once for each Ignite
>> node.
>>
>> b) The query results are transferred multiple times, once for each Ignite
>> node.
>>
>> c) The Ignite node gets a lot of data which it does not need (has neither
>> primary nor backup partition).
>>
>> d) Each Cassandra node has to query all partitions.
>>
>> loadCache() supports multiple queries.
>> This allows the query to be broken
>> down, ideally (for this case) into one query per Cassandra partition.
>>
>> cache.loadCache(null, "select * from a_table where partition_key = 0 and
>> a_date_time >= '2017-07-25 10:00:00';", "select * from a_table where
>> partition_key = 1 and a_date_time >= '2017-07-25 10:00:00';", …)
>>
>> This optimizes the Cassandra query, as each query is constrained to one
>> Cassandra partition.
>>
>> But, I think, each node still needs to execute each query. Thus none of
>> the other inefficiencies are eliminated.
>>
>> I believe that, when multiple cores (worker threads) are available, the
>> Ignite nodes will execute multiple queries in parallel. So, there is a
>> reduction in elapsed time. Correct?
>>
>> Now, is there any way to avoid Cassandra having to execute the same
>> query multiple times, and the data being transferred multiple times?
>>
>> One approach would be that an Ignite node modifies the query so that it
>> only includes the partitions for which it has the primary or a backup
>> partition. That eliminates some duplication, but may not result in
>> efficient queries in Cassandra.
>>
>> Another approach is that Ignite forwards objects for which it is not the
>> primary or does not have a backup (similar to when an application does a
>> put()). That would optimize the Cassandra query, but require additional
>> communication between Ignite nodes.
>>
>> What if Ignite and Cassandra partitions were aligned? Then queries could
>> be created that only return data relevant to the node and only query a
>> subset of Cassandra partitions. But this does not seem practical for a
>> generalized system (I think).
>>
>> Any other suggestions?
>>
>> Thanks…
>>
>> Roger
>>
>> PS: The use case for this is to use Ignite as an SQL cache for a large
>> data set in the Cassandra DB. The most recent data is pre-loaded (and
>> updated) in Ignite.
>> When older data is required, it is loaded first into
>> Ignite, and then processed. It is this dynamic loading that should be quick
>> (and efficient).
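
The per-partition variant of loadCache() discussed in the thread can be sketched roughly as follows. This is only a minimal illustration, assuming integer partition keys numbered from 0 and the table and column names from Roger's example. The class name PartitionedLoadQueries, the method buildQueries, and the partitionCount parameter are hypothetical and not part of Ignite. The sketch only builds the query strings; as noted in the reply above, each Ignite node currently still executes every query passed to loadCache().

```java
import java.util.ArrayList;
import java.util.List;

public class PartitionedLoadQueries {
    /**
     * Builds one CQL statement per partition key value (hypothetical helper).
     * Each statement is restricted to a single Cassandra partition and to rows
     * at or after the given timestamp. Assumes partition keys are the integers
     * 0 .. partitionCount - 1, as in the example from the thread.
     */
    public static String[] buildQueries(int partitionCount, String since) {
        List<String> queries = new ArrayList<>();
        for (int key = 0; key < partitionCount; key++) {
            queries.add("select * from a_table where partition_key = " + key
                + " and a_date_time >= '" + since + "';");
        }
        return queries.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] queries = buildQueries(4, "2017-07-25 10:00:00");
        for (String query : queries) {
            System.out.println(query);
        }
        // With a running Ignite node and a Cassandra-backed cache, the array
        // could be passed as the varargs of IgniteCache.loadCache:
        //   cache.loadCache(null, (Object[]) queries);
    }
}
```

Generating the statements in a loop keeps the call site short when the number of partitions is large, but it does not by itself reduce the number of times Cassandra executes each query.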
