We've been running a large, high-volume, low-latency Ignite Compute Grid in
Production for a few months now and, in general, things are going quite well.
But we see a handful of large spikes a day in our application (approx. 1 per
Thus far, we have been at a loss to explain what is causing them.

Of course, we ruled out the usual suspects immediately.
*  Garbage collection (using G1) is excellent -- with few pauses, a max
pause of 100ms, and typical pauses < 40ms.
*  We also see no Host activity that correlates with the Spikes, including
CPU, Network, or Disk I/O.
*  We can find no noisy neighbors in AWS (CPU steal time, etc.).
*  We see no evident Thread blocking.
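
For completeness, stalls that never show up in the GC log (for example, time spent reaching a safepoint) could be checked with a small watchdog thread. The following is only a sketch of that idea, not something taken from our code base: `PauseDetector`, its interval, and its threshold are hypothetical names and values, and it uses nothing beyond the JDK.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical helper: a thread that sleeps for a fixed interval and records how late it wakes up. */
final class PauseDetector implements Runnable {
    private final long intervalMs;
    private final long thresholdMs;
    private final AtomicLong maxStallMs = new AtomicLong();
    private volatile boolean running = true;

    PauseDetector(long intervalMs, long thresholdMs) {
        this.intervalMs = intervalMs;
        this.thresholdMs = thresholdMs;
    }

    /** Stall = how much longer than the requested interval the sleep actually took (never negative). */
    public static long stallMs(long startNanos, long endNanos, long intervalMs) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMs - intervalMs);
    }

    @Override
    public void run() {
        while (running) {
            long start = System.nanoTime();
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long stall = stallMs(start, System.nanoTime(), intervalMs);
            maxStallMs.accumulateAndGet(stall, Math::max);
            if (stall >= thresholdMs) {
                // Log with a wall-clock timestamp so it can be lined up with the Spikes.
                System.err.printf("%tT process stall of %d ms%n", System.currentTimeMillis(), stall);
            }
        }
    }

    public void stop() { running = false; }

    public long maxStallMs() { return maxStallMs.get(); }
}
```

Started as a daemon thread on each Node (say, a 10 ms interval and a 50 ms threshold, both arbitrary), it would show whether the whole JVM freezes at the time of a Spike, regardless of the cause.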

The symptoms are always the same. 
*  It occurs on a handful of Nodes (typically 2 out of 38 Nodes) 
*  We always see ComputeTaskTimeoutCheckedException ERRORs.
*  It results in a handful of failed Requests (Timeouts) to our Clients. 
     *  10 to 30, out of 1000s of concurrent Requests in the Grid
     *  Each Request is for a batch of 100s of "computations" that are
map/reduced onto the Grid.
     *  So, in all, a very small fraction of computations is
affected (0.00X %).
*  It is very often the same two Nodes involved in the Spikes!!
     *  We run in AWS, so the Nodes are different for each deployment.

When a Spike occurs 
*  We see nothing odd in the logs.
*  All the Exceptions we see are symptoms of the Spike and not the cause.
*  We did occasionally see some Ignite socketWriteTimeouts, but we increased
the timeout to work around this, and they are gone now.
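
For reference, the change was along these lines. This is a sketch rather than our actual configuration: the helper class and method names are hypothetical, and the value passed in is up to the caller (we are not quoting our production value here).

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

/** Hypothetical helper that raises the communication SPI socket write timeout. */
final class CommTimeoutConfig {
    static IgniteConfiguration withSocketWriteTimeout(IgniteConfiguration cfg, long timeoutMs) {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();
        // Socket write timeout for node-to-node communication, in milliseconds.
        commSpi.setSocketWriteTimeout(timeoutMs);
        cfg.setCommunicationSpi(commSpi);
        return cfg;
    }
}
```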

Our Grid:
*  Has millions of cache entries in 9 separate caches.
*  Has caches that all use the same Affinity key, and thus, all cache access
is local to a given Node, once computation is mapped there.
*  Is predominantly read-only.
*  Has 1 Primary and 3 Backups for each cache.
*  Allows read access to the Backups for computation.
*  Has caches that are primed via Kafka streams, using DataStreamers.
     *  After the initial priming, incoming data is a relative trickle.
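
To make the setup above concrete, each cache is configured roughly as follows. This is a simplified sketch: the class name and cache name are placeholders, PARTITIONED mode is assumed, and the Affinity key mapping and the Kafka/DataStreamer wiring are left out.

```java
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.configuration.CacheConfiguration;

/** Hypothetical helper that builds one of the cache configurations described above. */
final class GridCacheConfig {
    static CacheConfiguration<Object, Object> cacheConfig(String cacheName) {
        CacheConfiguration<Object, Object> cfg = new CacheConfiguration<>(cacheName);
        cfg.setCacheMode(CacheMode.PARTITIONED); // assumption: partitioned caches
        cfg.setBackups(3);                       // 1 Primary + 3 Backups
        cfg.setReadFromBackup(true);             // reads may be served from a local Backup copy
        return cfg;
    }
}
```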

So, finally, my questions:

I strongly suspect that some sort of cache rebalancing, or something similar,
is occurring and causing this behavior.
*  Is this logical?
*  How would I validate this?
*  Is there any logging (or Interceptors) that we could enable to track
whether my hunch is correct?
*  We appear to see fewer Spikes with fewer Backups and fewer Nodes. Does
that make sense?
*  Is there any tuning (config) that could help eliminate the problem?
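
As a starting point for the logging question, the kind of thing I have in mind is a local listener for the cache rebalance events, roughly as sketched below. It is untested on our Grid; the class and method names are hypothetical, and it assumes the two event types are enabled on every Node, since events are not recorded unless they are included in the configuration.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.events.CacheRebalancingEvent;
import org.apache.ignite.events.EventType;

/** Hypothetical helper that logs when rebalancing starts and stops for a cache on the local Node. */
final class RebalanceLogging {
    /** Must be applied to the configuration before the Node starts. */
    static void enableRebalanceEvents(IgniteConfiguration cfg) {
        cfg.setIncludeEventTypes(
            EventType.EVT_CACHE_REBALANCE_STARTED,
            EventType.EVT_CACHE_REBALANCE_STOPPED);
    }

    /** Registers a listener that prints a timestamped line per rebalance start/stop. */
    static void logRebalancing(Ignite ignite) {
        ignite.events().localListen(evt -> {
            String cache = evt instanceof CacheRebalancingEvent
                ? ((CacheRebalancingEvent) evt).cacheName()
                : "n/a";
            System.err.printf("%tT %s cache=%s%n", evt.timestamp(), evt.name(), cache);
            return true; // keep listening
        }, EventType.EVT_CACHE_REBALANCE_STARTED, EventType.EVT_CACHE_REBALANCE_STOPPED);
    }
}
```

If the timestamps of these lines line up with the Spikes on the two affected Nodes, that would support the hunch; if no such events fire around a Spike, rebalancing is probably not the cause.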

Any wisdom that the Mailing List may have would be greatly appreciated.

Thanks much, 
-- Chris 

