Greetings,
We've been running a large, high-volume, low-latency Ignite Compute Grid in
Production for a few months now, and, in general, things are going quite
well.
But we see a handful of large latency spikes a day in our application
(approx. 1 per hour), and thus far we've been at a loss to explain what is
causing them.
Of course, we ruled out the usual suspects immediately.
* Garbage collection (using G1) is excellent -- with few pauses, a max
pause of 100ms, and typical pauses < 40ms.
* We also see no Host activity that correlates with the Spikes, including
CPU, Network, or Disk I/O
* And we can find no noisy neighbors in AWS (CPU Stolen, etc)
* We see no evident Thread blocking
The symptoms are always the same.
* It occurs on a handful of Nodes (typically 2 out of 38 Nodes)
* We always see ComputeTaskTimeoutCheckedException ERRORs.
* It results in a handful of failed Requests (Timeouts) to our Clients.
  * 10 to 30, out of 1000s of concurrent Requests in the Grid
  * Each Request is for a batch of 100s of "computations" that are
    map/reduced onto the Grid.
  * So, in all, a very small number of computations are affected
    (0.00X %)
* It is very often the same two Nodes involved in the Spikes!!
* We run in AWS, so the Nodes are different for each deployment.
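To make the "same two Nodes" observation concrete, one option is to tally, per Node, how many Spikes included at least one timeout on that Node. The following is only a minimal sketch: it assumes the Node ids can be extracted from the ComputeTaskTimeoutCheckedException log lines, and the class and method names (SpikeNodeTally, recordSpike, nodesByFrequency) are hypothetical helpers, not part of Ignite.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: counts, per node, the spikes in which that node logged a timeout. */
final class SpikeNodeTally {
    private final Map<String, Integer> spikesPerNode = new HashMap<>();

    /** Records one spike; nodeIds are the nodes that logged timeouts during it. */
    void recordSpike(List<String> nodeIds) {
        // De-duplicate so a node is counted at most once per spike.
        for (String id : new HashSet<>(nodeIds)) {
            spikesPerNode.merge(id, 1, Integer::sum);
        }
    }

    /** Number of spikes in which the given node was involved. */
    int count(String nodeId) {
        return spikesPerNode.getOrDefault(nodeId, 0);
    }

    /** Node ids ordered by descending spike count; ties are broken by id. */
    List<String> nodesByFrequency() {
        List<String> ids = new ArrayList<>(spikesPerNode.keySet());
        ids.sort((a, b) -> {
            int byCount = Integer.compare(spikesPerNode.get(b), spikesPerNode.get(a));
            return byCount != 0 ? byCount : a.compareTo(b);
        });
        return ids;
    }
}
```

If the same pair of Nodes dominates the top of that list within a deployment, that would point at something specific to those Nodes (for example the partitions they own) rather than at Grid-wide load.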
When a Spike occurs:
* We see nothing odd in the logs.
* All the Exceptions we see are symptoms of the Spike and not the cause.
* We did occasionally see some Ignite socketWriteTimeouts, but we increased
the timeout to work around this, and they are gone now.
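For reference, the socket write timeout is set on the communication SPI. This is a sketch of that kind of change, not the exact configuration in use: the 10-second value is a placeholder assumption (the value actually chosen is not stated here), and the class name is hypothetical.

```java
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.spi.communication.tcp.TcpCommunicationSpi;

public class CommunicationTimeoutConfig {
    /** Builds a configuration with a raised socket write timeout. */
    public static IgniteConfiguration configure() {
        TcpCommunicationSpi commSpi = new TcpCommunicationSpi();

        // Placeholder value in milliseconds; the real setting is an assumption here.
        commSpi.setSocketWriteTimeout(10_000L);

        IgniteConfiguration cfg = new IgniteConfiguration();
        cfg.setCommunicationSpi(commSpi);
        return cfg;
    }
}
```

Raising this timeout hides the symptom rather than explaining it, so the earlier write timeouts may still be a clue to the underlying cause.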
Our Grid:
* Has millions of cache entries in 9 separate caches.
* Has caches that all use the same Affinity key, and thus, all cache access
is local to a given Node, once computation is mapped there.
* Is predominantly read-only.
* Has 1 Primary and 3 Backups
* Allows read access to the Backups for computation.
* Has its caches primed via Kafka streams, using DataStreamers.
* Receives only a relative trickle of data after the initial priming.
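To make the setup above easier to reason about, here is a rough sketch of the kind of cache configuration it describes. It is an illustration under assumptions: the key class, its field names, and the PARTITIONED cache mode are not taken from the actual deployment.

```java
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.affinity.AffinityKeyMapped;
import org.apache.ignite.configuration.CacheConfiguration;

public class ReadMostlyCacheConfig {
    /** Hypothetical key: every cache shares the same affinity field. */
    public static final class EntryKey {
        private final String id;

        // Entries with the same affinityId are co-located on the same node.
        @AffinityKeyMapped
        private final String affinityId;

        public EntryKey(String id, String affinityId) {
            this.id = id;
            this.affinityId = affinityId;
        }
    }

    /** One primary copy plus three backups, with reads allowed from backups. */
    public static CacheConfiguration<EntryKey, Object> cacheConfig(String name) {
        CacheConfiguration<EntryKey, Object> ccfg = new CacheConfiguration<>(name);
        ccfg.setCacheMode(CacheMode.PARTITIONED); // assumed cache mode
        ccfg.setBackups(3);
        ccfg.setReadFromBackup(true);
        return ccfg;
    }
}
```

With 3 Backups, each Node holds copies of many partitions beyond the ones it owns as Primary, which is why any partition movement would touch several Nodes at once.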
So, finally, my questions:
I strongly suspect that some sort of cache rebalancing, or some such, is
occurring and causing this behavior.
* Is this plausible?
* How would I validate this?
* Is there any logging (or Interceptors) that we could enable to track
whether my hunch is correct?
* We appear to see fewer Spikes with fewer Backups and fewer Nodes. Does
that make sense?
* Is there any tuning (config) that could help eliminate the problem?
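One way the hunch might be checked is to enable Ignite's rebalance and discovery events on each Node and log them with timestamps, then compare those against the Spike times. The following is a minimal sketch under assumptions: it starts a Node with an otherwise default configuration and prints to stdout, whereas a real deployment would merge the event types into its existing configuration and use its own logger. The class name is hypothetical.

```java
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.events.Event;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

public class RebalanceEventLogger {
    // Rebalance start/stop plus topology changes that can trigger rebalancing.
    private static final int[] TRACKED = {
        EventType.EVT_CACHE_REBALANCE_STARTED,
        EventType.EVT_CACHE_REBALANCE_STOPPED,
        EventType.EVT_NODE_JOINED,
        EventType.EVT_NODE_LEFT,
        EventType.EVT_NODE_FAILED
    };

    public static void main(String[] args) {
        IgniteConfiguration cfg = new IgniteConfiguration();

        // Cache events are disabled by default and must be listed explicitly.
        cfg.setIncludeEventTypes(TRACKED);

        Ignite ignite = Ignition.start(cfg);

        IgnitePredicate<Event> listener = evt -> {
            // Timestamp first, so lines can be matched against Spike times.
            System.out.printf("%d %s %s%n", evt.timestamp(), evt.name(), evt.message());
            return true; // returning true keeps the listener registered
        };

        // Listen only for events raised on this local Node.
        ignite.events().localListen(listener, TRACKED);
    }
}
```

If no rebalance or topology events appear around the Spikes, rebalancing is probably not the cause. Enabling events adds some overhead, so it seems safest to track only these few types.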
Any wisdom that the Mailing List may have would be greatly appreciated.
Thanks much,
-- Chris