Hi all, I'm seeking a fuller understanding of how Apache Ignite manages datasets, both for indexes and for the underlying data.
In particular, I'm looking at what practical constraints exist for overall data size (beyond the obvious "how much memory do you have?"), and what the functional characteristics are when working near those constraint boundaries. My assumptions (corrections welcome) include:

- The underlying objects (the Value part of a cache entry) do not need to be in-memory on any cache node to execute an indexed query, though performance naturally suffers if they were evicted from the cache.
- The indexed keys need to be in-memory for all indexed lookups. If the referenced Value is not in-memory, it will be loaded from the backing store when that value is needed, via load(key).
- Indexed keys do not need to be in-memory for table-scan queries to work, but loadCache() (?) is called to bring these data into memory, which may result in eviction of other values. Once the queries on these data are complete, the keys (at least) will tend to remain in-memory (how can they be forcibly removed?).

In this latter case, can large datasets be queried, with earlier records in the dataset progressively evicted to make room for later records (e.g. SUM(x) GROUP BY y)?

A sample use case might include a set of metadata objects (megabytes to gigabytes, in various Ignite caches) and a much larger set of operational metrics with fine-grained slicing, or even fully-granular facts (GB/TB/PB). In this use case, the metadata might well have "hot" subsets that (we hope) are not evicted by an LFU cache, as well as some less-frequently-used data; meanwhile, the operational metrics may also have tiers, even to the extent that the least-frequently-used metrics should be evicted after a rather short idle time, recovering both Value memory and Key memory.

Given that scenario, can "small" data and "big" data co-exist within an Ignite cluster, and are there any particular techniques needed to assure operational performance, particularly for keeping hot data hot, when total data size exceeds total available memory?
Two more specific questions:

- a) Can "indexed" queries be executed across datasets that need to be loaded with loadCache(), or would they execute as table-scans?
- b) Would such a query run incrementally, with progressive eviction of data, in the big-data case? I'm unclear on the sequence of data-loading vs. data-scanning: are they parallel operations, or would we expect the data-loading phase to block the data-scanning phase?

Hopefully these questions and the sample scenario are clear enough to get experienced perspective and input from y'all... thanks in advance.

R
