Hello,

> So, there is a reduction in elapsed time. Correct?
I don't think that holds in every case. If you have a significant number of nodes (for example, 20 nodes with 4 cores each), then within a short period of time Ignite will be querying Cassandra from ~80 threads. I'm not sure that this high load will give better performance than one thread per node.

BTW, do you know whether Cassandra caches queries?

On Thu, Jul 27, 2017 at 3:03 AM, Roger Fischer (CW) <[email protected]> wrote:

> Hello,
>
> What is the best way to efficiently load data from a backing store, like
> Cassandra? I am looking for a solution that minimizes work in Ignite and
> Cassandra.
>
> As I understand:
>
> The simplest way is to call loadCache() with a single select statement.
>
> cache.loadCache(null, "select * from a_table where a_date_time >= '2017-07-25 10:00:00'");
>
> Is it correct that:
>
> 1) Each Ignite node gets the same loadCache() request.
> 2) Each Ignite node sends the same query to Cassandra.
> 3) Each Ignite node gets all matched objects (rows) back from Cassandra.
> 4) Each Ignite node stores only the objects for which it has the primary
>    partition, or a backup partition.
>
> Unless I misunderstand, this simple approach has the following
> inefficiencies:
>
> a) Cassandra executes the same query multiple times, once for each Ignite
>    node.
> b) The query results are transferred multiple times, once for each Ignite
>    node.
> c) The Ignite node gets a lot of data which it does not need (has neither
>    primary nor backup partition).
> d) Each Cassandra node has to query all partitions.
>
> loadCache() supports multiple queries. This allows the query to be broken
> down, ideally (for this case) into one query per Cassandra partition.
>
> cache.loadCache(null,
>     "select * from a_table where partition_key = 0 and a_date_time >= '2017-07-25 10:00:00'",
>     "select * from a_table where partition_key = 1 and a_date_time >= '2017-07-25 10:00:00'",
>     …)
>
> This optimizes the Cassandra query, as each query is constrained to one
> Cassandra partition.
>
> But, I think, each node still needs to execute each query. Thus none of
> the other inefficiencies are eliminated.
>
> I believe that, when multiple cores (worker threads) are available, the
> Ignite nodes will execute multiple queries in parallel. So, there is a
> reduction in elapsed time. Correct?
>
> Now, is there any way to avoid that Cassandra has to execute the same
> query multiple times, and that the data is transferred multiple times?
>
> One approach would be that an Ignite node modifies the query so that it
> only includes the partitions for which it has the primary or a backup
> partition. That eliminates some duplication, but may not result in
> efficient queries in Cassandra.
>
> Another approach is that Ignite forwards objects for which it is not the
> primary or does not have a backup (similar to when an application does a
> put()). That would optimize the Cassandra query, but require additional
> communications between Ignite nodes.
>
> What if Ignite and Cassandra partitions were aligned? Then queries could
> be created that only return data relevant to the node and only query a
> subset of Cassandra partitions. But this seems not practical for a
> generalized system (I think).
>
> Any other suggestions?
>
> Thanks…
>
> Roger
>
> PS: The use case for this is to use Ignite as an SQL cache for a large
> data set in the Cassandra DB. The most recent data is pre-loaded (and
> updated) in Ignite. When older data is required, it is loaded first into
> Ignite, and then processed. It is this dynamic loading that should be quick
> (and efficient).
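
For reference, the one-query-per-partition idea from the quoted message could be sketched roughly as below. This is only an illustration under stated assumptions: PartitionQueries and its build() method are hypothetical helpers (not part of Ignite or its Cassandra module), the table and column names are taken from the example in the question, and partition_key is assumed to take the integer values 0..n-1. The code only builds the CQL strings; handing them to loadCache() is shown in a comment.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper, not part of Ignite: builds one CQL query per partition key value. */
final class PartitionQueries {
    private PartitionQueries() {
    }

    /**
     * Returns one query per partition key value in [0, partitions).
     * Assumes an integer partition key column and the a_date_time column from the question.
     */
    static String[] build(String table, String keyColumn, int partitions, String since) {
        if (partitions < 1) {
            throw new IllegalArgumentException("partitions must be >= 1");
        }
        List<String> queries = new ArrayList<>(partitions);
        for (int p = 0; p < partitions; p++) {
            queries.add("select * from " + table
                + " where " + keyColumn + " = " + p
                + " and a_date_time >= '" + since + "'");
        }
        return queries.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String[] queries = build("a_table", "partition_key", 4, "2017-07-25 10:00:00");
        for (String q : queries) {
            System.out.println(q);
        }
        // With an IgniteCache backed by the Cassandra store, the strings would be
        // passed as the varargs of loadCache(), for example:
        // cache.loadCache(null, (Object[]) queries);
    }
}
```

Note that this only narrows each individual Cassandra query. As discussed above, every Ignite node would still execute all of the queries, so the duplicated work and data transfer remain.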
