Hi all, I'm seeking a fuller understanding of how Apache Ignite manages datasets, both for indexes and for the underlying data.
In particular, I'm looking at what practical constraints exist for overall data size (beyond the obvious "how much memory do you have?"), and what the functional characteristics are when working near those constraint boundaries. My assumptions (corrections welcome) include:

- The underlying objects (the Value part of a cache entry) do not need to be in-memory on any cache node to execute an indexed query, though performance naturally suffers if they were evicted from the cache.
- The indexed keys need to be in-memory for all indexed lookups. If the referenced Value is not in-memory, it will be loaded from the backing store when that value is needed, via load(key).
- Indexed keys do not need to be in-memory for table-scan queries to work, but loadCache() (?) is called to bring these data into memory, which may result in eviction of other values. Once the queries on these data are complete, the keys (at least) will tend to remain in-memory (how can they be forcibly removed?).

In this latter case, can large datasets be queried, with earlier records in the dataset progressively evicted to make room for later records (e.g. SUM(x) GROUP BY y)?

A sample use case might include a set of metadata objects (megabytes to gigabytes, in various Ignite caches) and a much larger set of operational metrics with fine-grained slicing, or even fully-granular facts (GB/TB/PB). In this use case, the metadata might well have "hot" subsets that (we hope) are not evicted by an LFU cache, as well as some less-frequently-used data; meanwhile, the operational metrics may also have tiers, even to the extent that the least-frequently-used metrics should be evicted after a rather short idle time, recovering both Value memory and Key memory.

Given that scenario, can "small" data and "big" data co-exist within an Ignite cluster, and are there any particular techniques needed to assure operational performance, particularly for keeping hot data hot, when total data size exceeds total available memory?
Two more specific questions:

- a) Can "indexed" queries be executed across datasets that need to be loaded with loadCache(), or would they execute as table-scans?
- b) Would such a query run incrementally, with progressive eviction of data, in the big-data case? I'm unclear on the sequence of data-loading vs. data-scanning: are they parallel operations, or would we expect the data-loading phase to block the data-scanning phase?

Hopefully these questions and the sample scenario are clear enough to get experienced perspective and input from y'all... thanks in advance.

R
