Hello, what is the best way to efficiently load data from a backing store such as Cassandra? I am looking for a solution that minimizes work in both Ignite and Cassandra.
As I understand it, the simplest way is to call loadCache() with a single select statement:

    cache.loadCache(null, "select * from a_table where a_date_time >= '2017-07-25 10:00:00';")

Is it correct that:
1) Each Ignite node gets the same loadCache() request.
2) Each Ignite node sends the same query to Cassandra.
3) Each Ignite node gets all matched objects (rows) back from Cassandra.
4) Each Ignite node stores only the objects for which it has the primary partition or a backup partition.

Unless I misunderstand, this simple approach has the following inefficiencies:
a) Cassandra executes the same query multiple times, once for each Ignite node.
b) The query results are transferred multiple times, once for each Ignite node.
c) Each Ignite node receives a lot of data that it does not need (data for which it has neither a primary nor a backup partition).
d) Each Cassandra node has to query all partitions.

loadCache() supports multiple queries. This allows the query to be broken down, ideally (for this case) into one query per Cassandra partition:

    cache.loadCache(null,
        "select * from a_table where partition_key = 0 and a_date_time >= '2017-07-25 10:00:00';",
        "select * from a_table where partition_key = 1 and a_date_time >= '2017-07-25 10:00:00';",
        ...)

This optimizes the Cassandra side, as each query is constrained to one Cassandra partition. But, I think, each Ignite node still needs to execute every query, so none of the other inefficiencies (a to c) are eliminated. I believe that, when multiple cores (worker threads) are available, each Ignite node will execute multiple queries in parallel, so there is a reduction in elapsed time. Correct?

Now, is there any way to avoid Cassandra executing the same query multiple times and the data being transferred multiple times?

One approach would be for each Ignite node to modify the query so that it only includes the partitions for which that node has the primary or a backup partition.
That eliminates some duplication, but may not result in efficient queries in Cassandra.

Another approach is for an Ignite node to forward the objects for which it holds neither the primary nor a backup partition (similar to what happens when an application does a put()). That would optimize the Cassandra query, but would require additional communication between Ignite nodes.

What if Ignite and Cassandra partitions were aligned? Then queries could be created that return only data relevant to the node and query only a subset of Cassandra partitions. But this does not seem practical for a generalized system (I think).

Any other suggestions?

Thanks...
Roger

PS: The use case for this is to use Ignite as an SQL cache for a large data set in the Cassandra DB. The most recent data is pre-loaded (and updated) in Ignite. When older data is required, it is first loaded into Ignite and then processed. It is this dynamic loading that should be quick (and efficient).
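
For illustration, the one-query-per-Cassandra-partition variant mentioned above could be generated rather than written out by hand. The following is only a minimal sketch: the class PartitionQueries and its build() method are hypothetical helpers, and it assumes that partition_key takes the integer values 0 to partitionCount - 1 and that the timestamp comes from a trusted source (it is concatenated into the query text without escaping). The table and column names are the ones used in the examples above.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: builds one CQL query per Cassandra partition key. */
final class PartitionQueries {

    private PartitionQueries() {
    }

    /**
     * Returns one query per partition key in the range 0..partitionCount-1.
     * Assumption: partition_key is an integer column with exactly these values.
     */
    static String[] build(String since, int partitionCount) {
        if (partitionCount <= 0) {
            throw new IllegalArgumentException("partitionCount must be positive");
        }
        List<String> queries = new ArrayList<>(partitionCount);
        for (int key = 0; key < partitionCount; key++) {
            // 'since' is inserted verbatim, so it must be a trusted value.
            queries.add("select * from a_table where partition_key = " + key
                    + " and a_date_time >= '" + since + "';");
        }
        return queries.toArray(new String[0]);
    }

    public static void main(String[] args) {
        // Print the generated queries for four assumed partitions.
        for (String query : build("2017-07-25 10:00:00", 4)) {
            System.out.println(query);
        }
    }
}
```

With Ignite on the classpath, the resulting array could then be handed to the cache as cache.loadCache(null, (Object[]) queries). This only removes the hand-written query list; every Ignite node would still run every query, so inefficiencies a) to c) remain.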
