Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Dmitriy_Sorokin
I tested the solution based on a Java thread, and the GC logs contain the
same pause values that the Java thread detected.


My log (contains pauses > 100ms):
[2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW pause: 507 milliseconds.
[2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW pause: 5595 milliseconds.
[2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW pause: 3262 milliseconds.
[2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW pause: 1737 milliseconds.

GC log:
gridgain@dell-5580-92zc8h2:~$ cat ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
2017-11-20T17:33:27.608+0300: 0,116: Total time for which application threads were stopped: 0,845 seconds, Stopping threads took: 0,246 seconds
2017-11-20T17:33:27.667+0300: 0,175: Total time for which application threads were stopped: 0,0001072 seconds, Stopping threads took: 0,252 seconds
2017-11-20T17:33:28.822+0300: 1,330: Total time for which application threads were stopped: 0,5001082 seconds, Stopping threads took: 0,178 seconds // GOT!
2017-11-20T17:33:34.521+0300: 7,030: Total time for which application threads were stopped: 5,5856603 seconds, Stopping threads took: 0,229 seconds // GOT!
2017-11-20T17:33:37.896+0300: 10,405: Total time for which application threads were stopped: 3,2595700 seconds, Stopping threads took: 0,223 seconds // GOT!
2017-11-20T17:33:39.714+0300: 12,222: Total time for which application threads were stopped: 1,7337123 seconds, Stopping threads took: 0,121 seconds // GOT!




--
Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/


Facility to detect long STW pauses and other system response degradations

2017-11-17 Thread Dmitriy_Sorokin
Hi, Igniters!

This discussion thread is related to
https://issues.apache.org/jira/browse/IGNITE-6171.

Currently AI (Apache Ignite) has no tools for monitoring JVM performance,
for example the impact of GC (e.g. STW pauses) on the operation of a node.
I think we should add this functionality.

1) It is useful to know when STW duration has increased, or when any other
situation leads to similar consequences.
This will allow system administrators to resolve issues before they become
problems.

I propose to add a special thread that records the current time every N
milliseconds and checks the difference from the last recorded value.
The maximum and total pause values for a certain period can be published
as dedicated metrics available through JMX.
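
The idea above can be sketched roughly as follows. This is only an
illustration, not code from Ignite or from the IGNITE-6171 patch: the class
name PauseDetector, its fields and the threshold parameter are all
hypothetical, and publishing the values through JMX is left out.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch of the proposed pause-detection thread (not Ignite code). */
final class PauseDetector implements Runnable {
    private final long intervalMs;   // N: how often the thread wakes up
    private final long thresholdMs;  // only pauses longer than this are counted
    private final AtomicLong maxPauseMs = new AtomicLong();
    private final AtomicLong totalPauseMs = new AtomicLong();

    PauseDetector(long intervalMs, long thresholdMs) {
        this.intervalMs = intervalMs;
        this.thresholdMs = thresholdMs;
    }

    /**
     * Takes the time elapsed between two consecutive wake-ups and returns
     * the estimated pause: whatever exceeds the planned sleep interval.
     */
    long record(long elapsedMs) {
        long pause = Math.max(0, elapsedMs - intervalMs);
        if (pause > thresholdMs) {
            totalPauseMs.addAndGet(pause);
            maxPauseMs.accumulateAndGet(pause, Math::max);
        }
        return pause;
    }

    @Override public void run() {
        long last = System.nanoTime();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long now = System.nanoTime();
            long pause = record((now - last) / 1_000_000);
            if (pause > thresholdMs)
                System.err.println("Possible too long STW pause: " + pause + " milliseconds.");
            last = now;
        }
    }

    long maxPauseMs() { return maxPauseMs.get(); }
    long totalPauseMs() { return totalPauseMs.get(); }
}
```

Such a detector would be started as a daemon thread, for example
new Thread(new PauseDetector(10, 100)). Note that it cannot tell a GC pause
from any other reason the thread was not scheduled in time, so the values
are estimates of "STW or similar" pauses.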

2) If the pause reaches a critical value, we need to stop the node without
waiting for the end of the pause.

The thread (from the first part of the proposed solution) is able to
estimate the pause duration, but only after the pause has ended.
So we need an external thread (in another JVM, or native) that is able to
recognize that the pause duration has passed the critical mark.

We can estimate the duration of an STW (or similar) pause by
 a) reading the value updated by the first thread in some way (e.g. via JMX,
shared memory or a shared file)
 or
 b) using JVM diagnostic tools. Does anybody know of cross-platform
solutions?
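
As a rough illustration of option a) with a shared file: the monitored JVM
periodically writes a timestamp, and an external process reads it and
compares its age with the critical value. Everything here is an assumption
made for the sketch (the class name HeartbeatFile, the plain-text file
format, the method names); it is not an existing Ignite API.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical heartbeat exchange through a shared file (not Ignite code). */
final class HeartbeatFile {
    /** Called by the thread inside the monitored JVM every N milliseconds. */
    static void writeHeartbeat(Path file, long nowMs) throws IOException {
        // Files.write is not atomic; a real implementation would need to
        // handle a reader observing a partially written file.
        Files.write(file, Long.toString(nowMs).getBytes(StandardCharsets.UTF_8));
    }

    /** Called by the external watcher (another JVM or a native process). */
    static long readHeartbeat(Path file) throws IOException {
        String text = new String(Files.readAllBytes(file), StandardCharsets.UTF_8);
        return Long.parseLong(text.trim());
    }

    /** True if the last heartbeat is older than the critical pause duration. */
    static boolean isCritical(long lastHeartbeatMs, long nowMs, long criticalMs) {
        return nowMs - lastHeartbeatMs > criticalMs;
    }
}
```

The watcher would poll readHeartbeat and, once isCritical returns true, stop
the node while the pause is still in progress. Both sides must use the same
clock, for example System.currentTimeMillis() on the same host.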

Feel free to suggest ideas or tips, especially about the second part of the
proposal.

Thoughts?


