Re: Intermittent Spikes in Response Time

2018-04-30 Thread ezhuravlev
Hi Chris,

How do you map compute tasks to nodes?

Well, it's possible that 2 nodes in your cluster always store more data than
the others, and that's why you see these spikes: depending on your affinity
key, too much data could be collocated on those nodes. You can check this by
calling IgniteCache.localSize on each node and comparing the results.
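
For example, something like this (a rough sketch, not tested; the cache name
"myCache" is a placeholder) would print the primary entry count held by every
server node:

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CachePeekMode;
import org.apache.ignite.lang.IgniteCallable;

public class LocalSizeCheck {
    public static void main(String[] args) {
        Ignite ignite = Ignition.start();

        // Ask every server node for the number of primary entries it holds locally.
        ignite.compute(ignite.cluster().forServers())
            .broadcast((IgniteCallable<String>) () -> {
                Ignite local = Ignition.localIgnite();
                int size = local.cache("myCache").localSize(CachePeekMode.PRIMARY);
                return local.cluster().localNode().id() + " -> " + size;
            })
            .forEach(System.out::println);
    }
}

If the counts are consistently skewed towards the same 2 nodes, that would
point at the affinity key.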

Also, I would recommend checking Ignite.affinity(...).mapPartitionsToNodes(...),
just to make sure that all nodes hold roughly the same number of partitions.
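
Something along these lines (again just a sketch, same placeholder cache name)
would show how many primary partitions each node owns:

import java.util.Collection;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

import org.apache.ignite.Ignite;
import org.apache.ignite.cache.affinity.Affinity;
import org.apache.ignite.cluster.ClusterNode;

public class PartitionDistribution {
    static void print(Ignite ignite) {
        Affinity<Object> aff = ignite.affinity("myCache");

        // All partition ids of the cache.
        Collection<Integer> parts = IntStream.range(0, aff.partitions())
            .boxed()
            .collect(Collectors.toList());

        // Primary node for each partition.
        Map<Integer, ClusterNode> mapping = aff.mapPartitionsToNodes(parts);

        // Count partitions per node; the spread should be roughly even.
        mapping.values().stream()
            .collect(Collectors.groupingBy(ClusterNode::id, Collectors.counting()))
            .forEach((nodeId, cnt) -> System.out.println(nodeId + " -> " + cnt));
    }
}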

Evgenii





Re: Intermittent Spikes in Response Time

2018-04-14 Thread piyush
It's hard to tell without seeing the actual code.

In the meantime, try these JVM parameters:

-Xss256K  ;; limit thread stack size
-XX:MaxGCPauseMillis=25  ;; target a max GC pause of 25ms

More tuning settings here:

https://apacheignite.readme.io/docs/jvm-and-system-tuning
 





Intermittent Spikes in Response Time

2018-04-13 Thread Chris Berry
Greetings,

We've been running a large, high-volume, low-latency Ignite Compute Grid in
Production for a few months now, and, in general, things are going quite
well.
But we see a handful of large spikes a day in our application (approx. 1 per
hour), and thus far we've been at a loss to explain what is causing them.

Of course, we ruled out the usual suspects immediately.
*  Garbage collection (using G1) is excellent -- with few pauses, a max
pause of 100ms, and typical pauses < 40ms.
*  We also see no Host activity that correlates with the Spikes, including
CPU, Network, or Disk I/O
*  And we can find no noisy neighbors in AWS (CPU steal, etc.)
*  We see no evident Thread blocking 

The symptoms are always the same. 
*  It occurs on a handful of Nodes (typically 2 out of 38 Nodes) 
*  We always see ComputeTaskTimeoutCheckedException ERRORs.
*  It results in a handful of failed Requests (Timeouts) to our Clients. 
 *  10 to 30, out of 1000s of concurrent Requests in the Grid
 *  Each Request is for a batch of 100s of "computations" that are
map/reduced onto the Grid.
 *  So, in all, a very small fraction of computations is affected (0.00X %)
*  It is very often the same two Nodes involved in the Spikes!!
 *  We run in AWS, so the Nodes are different for each deployment.

When a Spike occurs:
*  We see nothing odd in the logs.
*  All the Exceptions we see are symptoms of the Spike and not the cause.
*  We did occasionally see some Ignite socketWriteTimeouts, but we increased
the timeout to work around this, and they are gone now.

Our Grid:
*  Has millions of cache entries in 9 separate caches.
*  Has caches that all use the same Affinity key, so all cache access is
local to a given Node once a computation is mapped there.
*  Is predominantly read-only.
*  Has 1 Primary and 3 Backups.
*  Allows read access to the Backups for computation (a rough sketch of this
configuration follows this list).
*  Caches are primed via Kafka streams, using DataStreamers. 
 *  After the initial priming, data is a relative trickle.
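
For reference, our setup is roughly along the lines of the following sketch
(simplified, not our actual code; the key and cache names are placeholders):

import java.util.Map;

import org.apache.ignite.Ignite;
import org.apache.ignite.IgniteDataStreamer;
import org.apache.ignite.cache.CacheMode;
import org.apache.ignite.cache.affinity.AffinityKeyMapped;
import org.apache.ignite.configuration.CacheConfiguration;

public class GridSetupSketch {
    // All 9 caches are keyed by an object carrying the same affinity field,
    // so related entries are collocated on the same node.
    static class EntryKey {
        String entryId;

        @AffinityKeyMapped
        String accountId; // placeholder for our shared affinity key

        EntryKey(String entryId, String accountId) {
            this.entryId = entryId;
            this.accountId = accountId;
        }
    }

    static CacheConfiguration<EntryKey, Object> cacheConfig(String name) {
        return new CacheConfiguration<EntryKey, Object>(name)
            .setCacheMode(CacheMode.PARTITIONED)
            .setBackups(3)             // 1 Primary + 3 Backups
            .setReadFromBackup(true);  // computations may read backup copies
    }

    // Priming (e.g. from a Kafka consumer): consumed batches go through a DataStreamer.
    static void prime(Ignite ignite, String cacheName, Map<EntryKey, Object> batch) {
        try (IgniteDataStreamer<EntryKey, Object> streamer = ignite.dataStreamer(cacheName)) {
            streamer.addData(batch);
        }
    }
}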

So. Finally. My questions:

I strongly suspect that some sort of cache rebalancing, or some such, is
occurring and causing this behavior??
*  Is this logical??
*  How would I validate this??
*  Is there any logging (or Interceptors) that we could enable to track
whether my hunch is correct?? (see the sketch after these questions)
*  We appear to see fewer Spikes -- with fewer Backups and fewer Nodes. Does
that make sense??
*  Is there any tuning (config) that could help eliminate the problem??
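
For what it's worth, the kind of tracking I have in mind is something like the
sketch below (assuming the standard rebalance events; they would also need to
be enabled via IgniteConfiguration.setIncludeEventTypes):

import org.apache.ignite.Ignite;
import org.apache.ignite.events.Event;
import org.apache.ignite.events.EventType;
import org.apache.ignite.lang.IgnitePredicate;

public class RebalanceTracker {
    static void install(Ignite ignite) {
        // Log every rebalance start/stop on the local node with a timestamp,
        // so Spikes can be correlated against rebalancing activity.
        ignite.events().localListen((IgnitePredicate<Event>) evt -> {
            System.out.println(System.currentTimeMillis() + " " + evt.name() + ": " + evt.message());
            return true; // keep listening
        },
        EventType.EVT_CACHE_REBALANCE_STARTED,
        EventType.EVT_CACHE_REBALANCE_STOPPED);
    }
}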

Any wisdom that the Mailing List may have would be greatly appreciated.

Thanks much, 
-- Chris 


