Re: Facility to detect long STW pauses and other system response degradations

2017-11-27 Thread Vladimir Ozerov
> some duration using only JVM API? On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <andrewkor...@hotmail.com> wrote:

Re: Facility to detect long STW pauses and other system response degradations

2017-11-22 Thread Anton Vinogradov
nt means of detecting a struggling process out of the box. SRE/Operations teams usually know how to monitor JVMs and can handle killing of such processes themselves.

Re: Facility to detect long STW pauses and other system response degradations

2017-11-22 Thread Vladimir Ozerov
f detecting a struggling process out of the box. SRE/Operations teams usually know how to monitor JVMs and can handle killing of such processes themselves. The feature adds no value, just complexity

Re: Facility to detect long STW pauses and other system response degradations

2017-11-22 Thread Anton Vinogradov
> The feature adds no value, just complexity (and more configuration parameters (!), as if Ignite didn’t have enough of them already). Regards, Andrey

Re: Facility to detect long STW pauses and other system response degradations

2017-11-21 Thread Vladimir Ozerov
ers (!), as if Ignite didn’t have enough of them already). Regards, Andrey. From: Denis Magda <dma...@apache.org> Sent: Monday, November 20, 2017 3:10 PM Subject: Re: Facility to detec

Re: Facility to detect long STW pauses and other system response degradations

2017-11-21 Thread Anton Vinogradov
nfiguration parameters (!), as if Ignite didn’t have enough of them already). Regards, Andrey. From: Denis Magda <dma...@apache.org> Sent: Monday, November 20, 2017 3:10 PM Subject: Re: Facility to detect long STW pauses and other syste

Re: Facility to detect long STW pauses and other system response degradations

2017-11-21 Thread Andrey Kornev
(!) — as if Ignite didn’t have enough of them already). Regards, Andrey _ From: Denis Magda <dma...@apache.org> Sent: Monday, November 20, 2017 3:10 PM Subject: Re: Facility to detect long STW pauses and other system response degradations To: <dev@ignite.apache.or

Re: Facility to detect long STW pauses and other system response degradations

2017-11-21 Thread Дмитрий Сорокин
Don't forget that high CPU utilization can occur for reasons other than GC STW, and GC log parsing will not help us in that case. Tue, Nov 21, 2017 at 13:06, Anton Vinogradov [via Apache Ignite Developers]: > Denis, > 1. Totally for a separate

Re: Facility to detect long STW pauses and other system response degradations

2017-11-21 Thread Anton Vinogradov
Denis, > 1. Totally for a separate native process that will handle the monitoring of an Ignite process. The watchdog process can simply start a JVM tool like jstat and parse its GC logs: https://dzone.com/articles/how-monitor-java-garbage

Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Denis Magda
My 2 cents. 1. Totally for a separate native process that will handle the monitoring of an Ignite process. The watchdog process can simply start a JVM tool like jstat and parse its GC logs: https://dzone.com/articles/how-monitor-java-garbage

Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Anton Vinogradov
Yakov, the issue is https://issues.apache.org/jira/browse/IGNITE-6171. We split the issue into: #1 STW duration metrics; #2 external monitoring that allows stopping a node during STW. > Testing GC pause with java thread is a bit strange and can give info only after GC pause finishes. That's OK, since that is #1. On

Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Dmitriy_Sorokin
I have tested the solution with a Java thread, and the GC logs contained the same pause values for thread stops as those detected by the Java thread. My log (contains pauses > 100 ms): [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW pause: 507 milliseconds. [2017-11-20 17:33:34,522][WARN

Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Yakov Zhdanov
Guys, how about having 2 native threads - one calling some java method, another one monitoring that the first one is active and is not stuck on safepoint (which points to GC pause)? Testing GC pause with java thread is a bit strange and can give info only after GC pause finishes. Native threads

Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Dmitry Pavlov
Yes, we need some timestamp from Java code. But I think a JVM thread could update the timestamp with delays not related to GC, which would have the same effect as with IgniteUtils#currentTimeMillis(). Does this new test compare the difference between Java timestamps with the GC logs? Mon, Nov 20, 2017 at 16:39, Anton

Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Anton Vinogradov
Dmitriy, > Sleeping Java Thread IMO is not an option, because thread can be in TIMED_WAITING longer than timeout. That's the only idea we have, and, according to tests, it works! > Did I understand correctly that the native stream is proposed? And our goal now is to select best

Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Dmitry Pavlov
A sleeping Java thread is IMO not an option, because the thread can be in TIMED_WAITING longer than the timeout. Did I understand correctly that the native stream is proposed? And our goal now is to select the best framework for this? Can we limit this option to several popular OSes (Win, Linux), and do

Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Anton Vinogradov
Igniters, since no one rejected the proposal, let's start with part one. > I propose to add a special thread that will record current time every N milliseconds and check the difference with the latest recorded value. > The maximum and total pause values for a certain period can be published in
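
The quoted proposal (a thread that records the current time every N milliseconds and compares it with the previous value) could be sketched roughly as below. This is only an illustration under stated assumptions, not the code from IGNITE-6171: the class name `PauseDetector`, its methods, and the threshold parameter are hypothetical. As noted elsewhere in the thread, a gap measured this way also includes scheduling delays unrelated to GC, and it is only visible after the pause has ended.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical sketch; names are illustrative and are not Ignite API. */
final class PauseDetector implements Runnable {
    private final long intervalMs;   // N: how long the thread sleeps between checks
    private final long thresholdMs;  // assumed cut-off: shorter gaps are ignored
    private final AtomicLong maxPauseMs = new AtomicLong();
    private final AtomicLong totalPauseMs = new AtomicLong();

    PauseDetector(long intervalMs, long thresholdMs) {
        this.intervalMs = intervalMs;
        this.thresholdMs = thresholdMs;
    }

    /** Estimated pause: elapsed time beyond the requested sleep, never negative. */
    static long pauseMs(long prevNanos, long nowNanos, long intervalMs) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(nowNanos - prevNanos);
        return Math.max(0L, elapsedMs - intervalMs);
    }

    /** Updates the max and total counters and logs a warning for long gaps. */
    void record(long ms) {
        if (ms < thresholdMs)
            return;
        totalPauseMs.addAndGet(ms);
        maxPauseMs.accumulateAndGet(ms, Math::max);
        System.err.println("Possible too long STW pause: " + ms + " milliseconds.");
    }

    long maxPauseMs() { return maxPauseMs.get(); }

    long totalPauseMs() { return totalPauseMs.get(); }

    @Override public void run() {
        long prev = System.nanoTime();
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(intervalMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop
                return;
            }
            long now = System.nanoTime();
            record(pauseMs(prev, now, intervalMs));
            prev = now;
        }
    }
}
```

Such a detector would typically run on a daemon thread, for example `Thread t = new Thread(new PauseDetector(10, 100)); t.setDaemon(true); t.start();`, with the max and total counters read periodically by whatever publishes the metrics. `System.nanoTime()` is used here instead of wall-clock time so that clock adjustments are not mistaken for pauses.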