Re: Facility to detect long STW pauses and other system response degradations

2017-11-27 Thread Vladimir Ozerov
Makes sense.

On Wed, Nov 22, 2017 at 4:54 PM, Anton Vinogradov <avinogra...@gridgain.com>
wrote:

> Vova,
>
> Monitoring systems works not how you expected, they LOVE incremental
> metrics :)
>
> > 1) When did these events appear?
>
> Monitoring gets metrics each N seconds.
>
> 01.00 got 0, 0
> 02.00 got 1, 20 -> means between 02.00 and 01.00 was 1 STW with duration
> 20
> 03.00 got 3, 100  -> means between 03.00 and 02.00 was 2 STW with total
> duration 80
>
> Good monitoring will record this values and show you graph when you decide
> to check this value.
> So, you'll see "0,1,2" and "0,20,80" at 03.01
>
> > 2) How duration is distributed? Was 10 pauses 10 seconds each, or 9 short
> > pauses of 1 sec and one critical pause of 90s?
>
> So, previous probe was 0, 0 and we got it minute ago.
> Now we have 10, 100. It means we have 10 STW with total duration 100 last
> minute.
>
> But, it case we set interval to 10 seconds we'll get
> 1, 20
> 4, 20
> 0, 0
> 4, 10
> 1, 50
> 0, 0
>
> So, precision depends on check interval. And it's up to administration team
> what interval to choose.
>
> > May be a kind of sliding window plus min/max values will do better job.
>
> That's the monitoring system job (eg. zabbix)
>
>
> On Wed, Nov 22, 2017 at 4:10 PM, Vladimir Ozerov <voze...@gridgain.com>
> wrote:
>
> > The question is how administrator should interpret these numbers. Ok, I
> > opened JMX console and see that there was 10 long GC events, which took
> 100
> > seconds.
> > 1) When did these events appear? Over the last day, which is more or less
> > OK, or over the last 10 minutes, so my server is nearly dead?
> > 2) How duration is distributed? Was 10 pauses 10 seconds each, or 9 short
> > pauses of 1 sec and one critical pause of 90s?
> >
> > May be a kind of sliding window plus min/max values will do better job.
> >
> > On Wed, Nov 22, 2017 at 1:07 PM, Anton Vinogradov <
> > avinogra...@gridgain.com>
> > wrote:
> >
> > > Vova,
> > >
> > > 1) We can gain collections info from GarbageCollectorMXBean
> > > But it provides only
> > > - collectionCount
> > > - collectionTime.
> > >
> > > This is very interesting metrics, but they tell us nothing about long
> > STW.
> > > Long STW means we have huge latency during STW and we should find the
> > > reason and solve it.
> > >
> > > 2) So, we're working on new incremental metrics
> > > - total amount of STW longer than XXX
> > > - total duration of STW longer than XXX
> > >
> > > which shows us JVM/GC health situation.
> > >
> > > Is it answer to your question?
> > >
> > > On Tue, Nov 21, 2017 at 9:05 PM, Vladimir Ozerov <voze...@gridgain.com
> >
> > > wrote:
> > >
> > > > Anton,
> > > >
> > > > The question is why user may need so precise measurement? I share
> > > Andrey’s
> > > > opinion - cannot understand the value.
> > > >
> > > > вт, 21 нояб. 2017 г. в 19:33, Anton Vinogradov <
> > avinogra...@gridgain.com
> > > >:
> > > >
> > > > > Andrey,
> > > > >
> > > > > >  JVM provides sufficient means of detecting a struggling process
> > out
> > > of
> > > > > the box.
> > > > >
> > > > > Could you point to some articles describing how to detect STW
> > exceeding
> > > > > some duration using only JVM API?
> > > > >
> > > > > On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <
> > > andrewkor...@hotmail.com
> > > > >
> > > > > wrote:
> > > > >
> > > > > > My 2 cents. Don’t do it. JVM provides sufficient means of
> > detecting a
> > > > > > struggling process out of the box. SRE/Operations teams usually
> > know
> > > > how
> > > > > to
> > > > > > monitor JVMs and can handle killing of such processes themselves.
> > > > > >
> > > > > > The feature adds no value, just complexity (and more
> configuration
> > > > > > parameters (!) — as if Ignite didn’t have enough of them
> already).
> > > > > >
> > > > > > Regards,
> > > > > > Andrey
> > > > > > _
> > > > > > From: Denis Magda <dma...@apache

Re: Facility to detect long STW pauses and other system response degradations

2017-11-22 Thread Anton Vinogradov
Vova,

Monitoring systems don't work the way you expect; they LOVE incremental
metrics :)

> 1) When did these events appear?

Monitoring polls metrics every N seconds.

01.00 got 0, 0
02.00 got 1, 20 -> means between 01.00 and 02.00 there was 1 STW pause with duration 20
03.00 got 3, 100 -> means between 02.00 and 03.00 there were 2 STW pauses with total
duration 80

A good monitoring system will record these values and show you a graph whenever
you decide to check them.
So, at 03.01 you'll see "0, 1, 2" and "0, 20, 80".

> 2) How duration is distributed? Was 10 pauses 10 seconds each, or 9 short
> pauses of 1 sec and one critical pause of 90s?

So, the previous probe was 0, 0 and we got it a minute ago.
Now we have 10, 100. That means we had 10 STW pauses with a total duration of 100
during the last minute.

But in case we set the interval to 10 seconds, we'll get
1, 20
4, 20
0, 0
4, 10
1, 50
0, 0

So, precision depends on the check interval, and it's up to the administration
team to choose that interval.

> May be a kind of sliding window plus min/max values will do better job.

That's the monitoring system's job (e.g. Zabbix).
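
For illustration, a minimal sketch of what the poller on the monitoring side does
with such cumulative counters (StwSample and StwDeltaPoller are hypothetical names,
not an existing Ignite or Zabbix API; it only assumes the node exposes a cumulative
pause count and total duration, as proposed):

// Turns cumulative (count, totalDuration) STW counters into per-interval deltas,
// the way a monitoring system (e.g. Zabbix) graphs them.
final class StwSample {
    final long count;      // number of long STW pauses since JVM start
    final long durationMs; // total duration of those pauses, in ms

    StwSample(long count, long durationMs) {
        this.count = count;
        this.durationMs = durationMs;
    }
}

final class StwDeltaPoller {
    private StwSample prev = new StwSample(0, 0);

    /** Called every N seconds with the freshly read cumulative values. */
    void onProbe(StwSample cur) {
        long pauses = cur.count - prev.count;               // e.g. 0, 1, 2 in the example above
        long durationMs = cur.durationMs - prev.durationMs; // e.g. 0, 20, 80

        System.out.printf("Last interval: %d STW pause(s), %d ms total%n", pauses, durationMs);

        prev = cur;
    }
}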


On Wed, Nov 22, 2017 at 4:10 PM, Vladimir Ozerov <voze...@gridgain.com>
wrote:

> The question is how administrator should interpret these numbers. Ok, I
> opened JMX console and see that there was 10 long GC events, which took 100
> seconds.
> 1) When did these events appear? Over the last day, which is more or less
> OK, or over the last 10 minutes, so my server is nearly dead?
> 2) How duration is distributed? Was 10 pauses 10 seconds each, or 9 short
> pauses of 1 sec and one critical pause of 90s?
>
> May be a kind of sliding window plus min/max values will do better job.
>
> On Wed, Nov 22, 2017 at 1:07 PM, Anton Vinogradov <
> avinogra...@gridgain.com>
> wrote:
>
> > Vova,
> >
> > 1) We can gain collections info from GarbageCollectorMXBean
> > But it provides only
> > - collectionCount
> > - collectionTime.
> >
> > This is very interesting metrics, but they tell us nothing about long
> STW.
> > Long STW means we have huge latency during STW and we should find the
> > reason and solve it.
> >
> > 2) So, we're working on new incremental metrics
> > - total amount of STW longer than XXX
> > - total duration of STW longer than XXX
> >
> > which shows us JVM/GC health situation.
> >
> > Is it answer to your question?
> >
> > On Tue, Nov 21, 2017 at 9:05 PM, Vladimir Ozerov <voze...@gridgain.com>
> > wrote:
> >
> > > Anton,
> > >
> > > The question is why user may need so precise measurement? I share
> > Andrey’s
> > > opinion - cannot understand the value.
> > >
> > > вт, 21 нояб. 2017 г. в 19:33, Anton Vinogradov <
> avinogra...@gridgain.com
> > >:
> > >
> > > > Andrey,
> > > >
> > > > >  JVM provides sufficient means of detecting a struggling process
> out
> > of
> > > > the box.
> > > >
> > > > Could you point to some articles describing how to detect STW
> exceeding
> > > > some duration using only JVM API?
> > > >
> > > > On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <
> > andrewkor...@hotmail.com
> > > >
> > > > wrote:
> > > >
> > > > > My 2 cents. Don’t do it. JVM provides sufficient means of
> detecting a
> > > > > struggling process out of the box. SRE/Operations teams usually
> know
> > > how
> > > > to
> > > > > monitor JVMs and can handle killing of such processes themselves.
> > > > >
> > > > > The feature adds no value, just complexity (and more configuration
> > > > > parameters (!) — as if Ignite didn’t have enough of them already).
> > > > >
> > > > > Regards,
> > > > > Andrey
> > > > > _
> > > > > From: Denis Magda <dma...@apache.org>
> > > > > Sent: Monday, November 20, 2017 3:10 PM
> > > > > Subject: Re: Facility to detect long STW pauses and other system
> > > response
> > > > > degradations
> > > > > To: <dev@ignite.apache.org>
> > > > >
> > > > >
> > > > > My 2 cents.
> > > > >
> > > > > 1. Totally for a separate native process that will handle the
> > > monitoring
> > > > > of an Ignite process. The watchdog process can simply start a JVM
> > tool
> > > > like
> > > > > jstat and parse its GC logs: https://dzone.com/articles/
> 

Re: Facility to detect long STW pauses and other system response degradations

2017-11-22 Thread Vladimir Ozerov
The question is how an administrator should interpret these numbers. OK, I
opened the JMX console and see that there were 10 long GC events, which took 100
seconds.
1) When did these events appear? Over the last day, which is more or less
OK, or over the last 10 minutes, meaning my server is nearly dead?
2) How is the duration distributed? Were there 10 pauses of 10 seconds each, or 9
short pauses of 1 second and one critical pause of 90 seconds?

Maybe a kind of sliding window plus min/max values would do a better job.
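
For what it's worth, the sliding-window variant could be sketched roughly like
this (class and method names are illustrative, not an existing Ignite metric):

import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the sliding-window idea: keep only the pauses seen during the last
// windowMs, so max/total always describe a recent period instead of the whole
// JVM lifetime.
final class SlidingStwWindow {
    private final long windowMs;
    private final Deque<long[]> pauses = new ArrayDeque<>(); // {timestamp, durationMs}

    SlidingStwWindow(long windowMs) {
        this.windowMs = windowMs;
    }

    synchronized void record(long durationMs) {
        pauses.addLast(new long[] {System.currentTimeMillis(), durationMs});
        evictOld();
    }

    synchronized long maxPauseMs() {
        evictOld();
        long max = 0;
        for (long[] p : pauses)
            max = Math.max(max, p[1]);
        return max;
    }

    private void evictOld() {
        long cutoff = System.currentTimeMillis() - windowMs;
        while (!pauses.isEmpty() && pauses.peekFirst()[0] < cutoff)
            pauses.pollFirst();
    }
}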

On Wed, Nov 22, 2017 at 1:07 PM, Anton Vinogradov <avinogra...@gridgain.com>
wrote:

> Vova,
>
> 1) We can gain collections info from GarbageCollectorMXBean
> But it provides only
> - collectionCount
> - collectionTime.
>
> This is very interesting metrics, but they tell us nothing about long STW.
> Long STW means we have huge latency during STW and we should find the
> reason and solve it.
>
> 2) So, we're working on new incremental metrics
> - total amount of STW longer than XXX
> - total duration of STW longer than XXX
>
> which shows us JVM/GC health situation.
>
> Is it answer to your question?
>
> On Tue, Nov 21, 2017 at 9:05 PM, Vladimir Ozerov <voze...@gridgain.com>
> wrote:
>
> > Anton,
> >
> > The question is why user may need so precise measurement? I share
> Andrey’s
> > opinion - cannot understand the value.
> >
> > вт, 21 нояб. 2017 г. в 19:33, Anton Vinogradov <avinogra...@gridgain.com
> >:
> >
> > > Andrey,
> > >
> > > >  JVM provides sufficient means of detecting a struggling process out
> of
> > > the box.
> > >
> > > Could you point to some articles describing how to detect STW exceeding
> > > some duration using only JVM API?
> > >
> > > On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <
> andrewkor...@hotmail.com
> > >
> > > wrote:
> > >
> > > > My 2 cents. Don’t do it. JVM provides sufficient means of detecting a
> > > > struggling process out of the box. SRE/Operations teams usually know
> > how
> > > to
> > > > monitor JVMs and can handle killing of such processes themselves.
> > > >
> > > > The feature adds no value, just complexity (and more configuration
> > > > parameters (!) — as if Ignite didn’t have enough of them already).
> > > >
> > > > Regards,
> > > > Andrey
> > > > _
> > > > From: Denis Magda <dma...@apache.org>
> > > > Sent: Monday, November 20, 2017 3:10 PM
> > > > Subject: Re: Facility to detect long STW pauses and other system
> > response
> > > > degradations
> > > > To: <dev@ignite.apache.org>
> > > >
> > > >
> > > > My 2 cents.
> > > >
> > > > 1. Totally for a separate native process that will handle the
> > monitoring
> > > > of an Ignite process. The watchdog process can simply start a JVM
> tool
> > > like
> > > > jstat and parse its GC logs: https://dzone.com/articles/
> > > > how-monitor-java-garbage <https://dzone.com/articles/
> > > > how-monitor-java-garbage>
> > > >
> > > > 2. As for the STW handling, I would make a possible reaction more
> > > generic.
> > > > Let’s define a policy (enumeration) that will define how to deal with
> > an
> > > > unstable node. The events might be as follows - kill a node, restart
> a
> > > > node, trigger a custom script using Runtime.exec or other methods.
> > > >
> > > > What’d you think? Specifically on point 2.
> > > >
> > > > —
> > > > Denis
> > > >
> > > > > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <
> > > avinogra...@gridgain.com>
> > > > wrote:
> > > > >
> > > > > Yakov,
> > > > >
> > > > > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> > > > >
> > > > > We split issue to
> > > > > #1 STW duration metrics
> > > > > #2 External monitoring allows to stop node during STW
> > > > >
> > > > >> Testing GC pause with java thread is
> > > > >> a bit strange and can give info only after GC pause finishes.
> > > > >
> > > > > That's ok since it's #1
> > > > >
> > > > > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <
> > > > sbt.sorokin@gmail.com>

Re: Facility to detect long STW pauses and other system response degradations

2017-11-22 Thread Anton Vinogradov
Vova,

1) We can obtain collection info from GarbageCollectorMXBean,
but it provides only:
- collectionCount
- collectionTime

These are very interesting metrics, but they tell us nothing about long STW pauses.
A long STW pause means we have huge latency during the pause, and we should find the
reason and fix it.

2) So, we're working on new incremental metrics:
- total count of STW pauses longer than XXX
- total duration of STW pauses longer than XXX

which show us the JVM/GC health situation.

Does that answer your question?
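
For reference, a minimal sketch of what GarbageCollectorMXBean gives today; it also
shows why the two cumulative counters alone cannot reveal a single long pause:

import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

public class GcMxBeanProbe {
    public static void main(String[] args) {
        // Only cumulative values are exposed per collector.
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            System.out.printf("%s: collectionCount=%d, collectionTime=%d ms%n",
                gc.getName(), gc.getCollectionCount(), gc.getCollectionTime());
        }
        // 10 collections / 1000 ms could be ten 100 ms pauses, or one 900 ms pause plus
        // nine short ones - the distribution of pause lengths is not observable here.
    }
}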

On Tue, Nov 21, 2017 at 9:05 PM, Vladimir Ozerov <voze...@gridgain.com>
wrote:

> Anton,
>
> The question is why user may need so precise measurement? I share Andrey’s
> opinion - cannot understand the value.
>
> вт, 21 нояб. 2017 г. в 19:33, Anton Vinogradov <avinogra...@gridgain.com>:
>
> > Andrey,
> >
> > >  JVM provides sufficient means of detecting a struggling process out of
> > the box.
> >
> > Could you point to some articles describing how to detect STW exceeding
> > some duration using only JVM API?
> >
> > On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <andrewkor...@hotmail.com
> >
> > wrote:
> >
> > > My 2 cents. Don’t do it. JVM provides sufficient means of detecting a
> > > struggling process out of the box. SRE/Operations teams usually know
> how
> > to
> > > monitor JVMs and can handle killing of such processes themselves.
> > >
> > > The feature adds no value, just complexity (and more configuration
> > > parameters (!) — as if Ignite didn’t have enough of them already).
> > >
> > > Regards,
> > > Andrey
> > > _____
> > > From: Denis Magda <dma...@apache.org>
> > > Sent: Monday, November 20, 2017 3:10 PM
> > > Subject: Re: Facility to detect long STW pauses and other system
> response
> > > degradations
> > > To: <dev@ignite.apache.org>
> > >
> > >
> > > My 2 cents.
> > >
> > > 1. Totally for a separate native process that will handle the
> monitoring
> > > of an Ignite process. The watchdog process can simply start a JVM tool
> > like
> > > jstat and parse its GC logs: https://dzone.com/articles/
> > > how-monitor-java-garbage <https://dzone.com/articles/
> > > how-monitor-java-garbage>
> > >
> > > 2. As for the STW handling, I would make a possible reaction more
> > generic.
> > > Let’s define a policy (enumeration) that will define how to deal with
> an
> > > unstable node. The events might be as follows - kill a node, restart a
> > > node, trigger a custom script using Runtime.exec or other methods.
> > >
> > > What’d you think? Specifically on point 2.
> > >
> > > —
> > > Denis
> > >
> > > > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <
> > avinogra...@gridgain.com>
> > > wrote:
> > > >
> > > > Yakov,
> > > >
> > > > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> > > >
> > > > We split issue to
> > > > #1 STW duration metrics
> > > > #2 External monitoring allows to stop node during STW
> > > >
> > > >> Testing GC pause with java thread is
> > > >> a bit strange and can give info only after GC pause finishes.
> > > >
> > > > That's ok since it's #1
> > > >
> > > > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <
> > > sbt.sorokin@gmail.com>
> > > > wrote:
> > > >
> > > >> I have tested solution with java-thread and GC logs had contain same
> > > pause
> > > >> values of thread stopping which was detected by java-thread.
> > > >>
> > > >>
> > > >> My log (contains pauses > 100ms):
> > > >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long
> STW
> > > >> pause: 507 milliseconds.
> > > >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long
> STW
> > > >> pause: 5595 milliseconds.
> > > >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long
> STW
> > > >> pause: 3262 milliseconds.
> > > >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long
> STW
> > > >> pause: 1737 milliseconds.
> > > >>
> > > >> GC log:
> > > >> gridgain@dell-5580-92zc8h2:~$ cat
> > > >> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log

Re: Facility to detect long STW pauses and other system response degradations

2017-11-21 Thread Vladimir Ozerov
Anton,

The question is why a user may need such a precise measurement. I share Andrey’s
opinion - I cannot understand the value.

Tue, Nov 21, 2017 at 19:33, Anton Vinogradov <avinogra...@gridgain.com>:

> Andrey,
>
> >  JVM provides sufficient means of detecting a struggling process out of
> the box.
>
> Could you point to some articles describing how to detect STW exceeding
> some duration using only JVM API?
>
> On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <andrewkor...@hotmail.com>
> wrote:
>
> > My 2 cents. Don’t do it. JVM provides sufficient means of detecting a
> > struggling process out of the box. SRE/Operations teams usually know how
> to
> > monitor JVMs and can handle killing of such processes themselves.
> >
> > The feature adds no value, just complexity (and more configuration
> > parameters (!) — as if Ignite didn’t have enough of them already).
> >
> > Regards,
> > Andrey
> > _____
> > From: Denis Magda <dma...@apache.org>
> > Sent: Monday, November 20, 2017 3:10 PM
> > Subject: Re: Facility to detect long STW pauses and other system response
> > degradations
> > To: <dev@ignite.apache.org>
> >
> >
> > My 2 cents.
> >
> > 1. Totally for a separate native process that will handle the monitoring
> > of an Ignite process. The watchdog process can simply start a JVM tool
> like
> > jstat and parse its GC logs: https://dzone.com/articles/
> > how-monitor-java-garbage <https://dzone.com/articles/
> > how-monitor-java-garbage>
> >
> > 2. As for the STW handling, I would make a possible reaction more
> generic.
> > Let’s define a policy (enumeration) that will define how to deal with an
> > unstable node. The events might be as follows - kill a node, restart a
> > node, trigger a custom script using Runtime.exec or other methods.
> >
> > What’d you think? Specifically on point 2.
> >
> > —
> > Denis
> >
> > > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <
> avinogra...@gridgain.com>
> > wrote:
> > >
> > > Yakov,
> > >
> > > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> > >
> > > We split issue to
> > > #1 STW duration metrics
> > > #2 External monitoring allows to stop node during STW
> > >
> > >> Testing GC pause with java thread is
> > >> a bit strange and can give info only after GC pause finishes.
> > >
> > > That's ok since it's #1
> > >
> > > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <
> > sbt.sorokin@gmail.com>
> > > wrote:
> > >
> > >> I have tested solution with java-thread and GC logs had contain same
> > pause
> > >> values of thread stopping which was detected by java-thread.
> > >>
> > >>
> > >> My log (contains pauses > 100ms):
> > >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW
> > >> pause: 507 milliseconds.
> > >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW
> > >> pause: 5595 milliseconds.
> > >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW
> > >> pause: 3262 milliseconds.
> > >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW
> > >> pause: 1737 milliseconds.
> > >>
> > >> GC log:
> > >> gridgain@dell-5580-92zc8h2:~$ cat
> > >> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> > >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which application
> > >> threads were stopped: 0,845 seconds, Stopping threads took:
> > 0,246
> > >> seconds
> > >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which application
> > >> threads were stopped: 0,0001072 seconds, Stopping threads took:
> > 0,252
> > >> seconds
> > >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which application
> > >> threads were stopped: 0,5001082 seconds, Stopping threads took:
> > 0,178
> > >> seconds // GOT!
> > >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which application
> > >> threads were stopped: 5,5856603 seconds, Stopping threads took:
> > 0,229
> > >> seconds // GOT!
> > >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which application
> > >> threads were stopped: 3,2595700 seconds, Stopping threads took:
> > 0,223
> > >> seconds // GOT!
> > >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which application
> > >> threads were stopped: 1,7337123 seconds, Stopping threads took:
> > 0,121
> > >> seconds // GOT!
> > >>
> > >>
> > >>
> > >>
> > >> --
> > >> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
> > >>
> >
> >
> >
> >
>


Re: Facility to detect long STW pauses and other system response degradations

2017-11-21 Thread Anton Vinogradov
Andrey,

>  JVM provides sufficient means of detecting a struggling process out of
the box.

Could you point me to some articles describing how to detect an STW pause exceeding
some duration using only the JVM API?

On Tue, Nov 21, 2017 at 7:17 PM, Andrey Kornev <andrewkor...@hotmail.com>
wrote:

> My 2 cents. Don’t do it. JVM provides sufficient means of detecting a
> struggling process out of the box. SRE/Operations teams usually know how to
> monitor JVMs and can handle killing of such processes themselves.
>
> The feature adds no value, just complexity (and more configuration
> parameters (!) — as if Ignite didn’t have enough of them already).
>
> Regards,
> Andrey
> _
> From: Denis Magda <dma...@apache.org>
> Sent: Monday, November 20, 2017 3:10 PM
> Subject: Re: Facility to detect long STW pauses and other system response
> degradations
> To: <dev@ignite.apache.org>
>
>
> My 2 cents.
>
> 1. Totally for a separate native process that will handle the monitoring
> of an Ignite process. The watchdog process can simply start a JVM tool like
> jstat and parse its GC logs: https://dzone.com/articles/
> how-monitor-java-garbage <https://dzone.com/articles/
> how-monitor-java-garbage>
>
> 2. As for the STW handling, I would make a possible reaction more generic.
> Let’s define a policy (enumeration) that will define how to deal with an
> unstable node. The events might be as follows - kill a node, restart a
> node, trigger a custom script using Runtime.exec or other methods.
>
> What’d you think? Specifically on point 2.
>
> —
> Denis
>
> > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <avinogra...@gridgain.com>
> wrote:
> >
> > Yakov,
> >
> > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> >
> > We split issue to
> > #1 STW duration metrics
> > #2 External monitoring allows to stop node during STW
> >
> >> Testing GC pause with java thread is
> >> a bit strange and can give info only after GC pause finishes.
> >
> > That's ok since it's #1
> >
> > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <
> sbt.sorokin@gmail.com>
> > wrote:
> >
> >> I have tested solution with java-thread and GC logs had contain same
> pause
> >> values of thread stopping which was detected by java-thread.
> >>
> >>
> >> My log (contains pauses > 100ms):
> >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW
> >> pause: 507 milliseconds.
> >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW
> >> pause: 5595 milliseconds.
> >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW
> >> pause: 3262 milliseconds.
> >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW
> >> pause: 1737 milliseconds.
> >>
> >> GC log:
> >> gridgain@dell-5580-92zc8h2:~$ cat
> >> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which application
> >> threads were stopped: 0,845 seconds, Stopping threads took:
> 0,246
> >> seconds
> >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which application
> >> threads were stopped: 0,0001072 seconds, Stopping threads took:
> 0,252
> >> seconds
> >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which application
> >> threads were stopped: 0,5001082 seconds, Stopping threads took:
> 0,178
> >> seconds // GOT!
> >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which application
> >> threads were stopped: 5,5856603 seconds, Stopping threads took:
> 0,229
> >> seconds // GOT!
> >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which application
> >> threads were stopped: 3,2595700 seconds, Stopping threads took:
> 0,223
> >> seconds // GOT!
> >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which application
> >> threads were stopped: 1,7337123 seconds, Stopping threads took:
> 0,121
> >> seconds // GOT!
> >>
> >>
> >>
> >>
> >> --
> >> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
> >>
>
>
>
>


Re: Facility to detect long STW pauses and other system response degradations

2017-11-21 Thread Andrey Kornev
My 2 cents. Don’t do it. JVM provides sufficient means of detecting a 
struggling process out of the box. SRE/Operations teams usually know how to 
monitor JVMs and can handle killing of such processes themselves.

The feature adds no value, just complexity (and more configuration parameters 
(!) — as if Ignite didn’t have enough of them already).

Regards,
Andrey
_
From: Denis Magda <dma...@apache.org>
Sent: Monday, November 20, 2017 3:10 PM
Subject: Re: Facility to detect long STW pauses and other system response 
degradations
To: <dev@ignite.apache.org>


My 2 cents.

1. Totally for a separate native process that will handle the monitoring of an 
Ignite process. The watchdog process can simply start a JVM tool like jstat and 
parse its GC logs: https://dzone.com/articles/how-monitor-java-garbage 
<https://dzone.com/articles/how-monitor-java-garbage>

2. As for the STW handling, I would make a possible reaction more generic. 
Let’s define a policy (enumeration) that will define how to deal with an 
unstable node. The events might be as follows - kill a node, restart a node, 
trigger a custom script using Runtime.exec or other methods.

What’d you think? Specifically on point 2.

—
Denis

> On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <avinogra...@gridgain.com> 
> wrote:
>
> Yakov,
>
> Issue is https://issues.apache.org/jira/browse/IGNITE-6171
>
> We split issue to
> #1 STW duration metrics
> #2 External monitoring allows to stop node during STW
>
>> Testing GC pause with java thread is
>> a bit strange and can give info only after GC pause finishes.
>
> That's ok since it's #1
>
> On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <sbt.sorokin@gmail.com>
> wrote:
>
>> I have tested solution with java-thread and GC logs had contain same pause
>> values of thread stopping which was detected by java-thread.
>>
>>
>> My log (contains pauses > 100ms):
>> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW
>> pause: 507 milliseconds.
>> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW
>> pause: 5595 milliseconds.
>> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW
>> pause: 3262 milliseconds.
>> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW
>> pause: 1737 milliseconds.
>>
>> GC log:
>> gridgain@dell-5580-92zc8h2:~$ cat
>> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
>> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which application
>> threads were stopped: 0,845 seconds, Stopping threads took: 0,246
>> seconds
>> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which application
>> threads were stopped: 0,0001072 seconds, Stopping threads took: 0,252
>> seconds
>> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which application
>> threads were stopped: 0,5001082 seconds, Stopping threads took: 0,178
>> seconds // GOT!
>> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which application
>> threads were stopped: 5,5856603 seconds, Stopping threads took: 0,229
>> seconds // GOT!
>> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which application
>> threads were stopped: 3,2595700 seconds, Stopping threads took: 0,223
>> seconds // GOT!
>> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which application
>> threads were stopped: 1,7337123 seconds, Stopping threads took: 0,121
>> seconds // GOT!
>>
>>
>>
>>
>> --
>> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
>>





Re: Facility to detect long STW pauses and other system response degradations

2017-11-21 Thread Дмитрий Сорокин
Don't forget that high CPU utilization can occur for reasons other
than GC STW pauses, and GC log parsing will not help us in that case.


Tue, Nov 21, 2017 at 13:06, Anton Vinogradov [via Apache Ignite
Developers] <ml+s2346864n24497...@n4.nabble.com>:

> Denis,
>
> > 1. Totally for a separate native process that will handle the monitoring
> of an Ignite process. The watchdog process can simply start a JVM tool
> like
> jstat and parse its GC logs: https://dzone.com/articles/
> how-monitor-java-garbage <https://dzone.com/articles/
> how-monitor-java-garbage>
> Different GC and even same GC at different OS/JVM produce different logs.
> That's not easy to parse them. But, since http://gceasy.io can do that,
> it
> looks to be possible, somehow :) .
> Do you know any libs or solutions allows to do this at realtime?
>
> > 2. As for the STW handling, I would make a possible reaction more
> generic. Let’s define a policy (enumeration) that will define how to deal
> with an unstable node. The events might be as follows - kill a node,
> restart a node, trigger a custom script using Runtime.exec or other
> methods.
> Yes, it should be similar to segmentation policy + custom script
> execution.
>
>
> On Tue, Nov 21, 2017 at 2:10 AM, Denis Magda <[hidden email]
> <http:///user/SendEmail.jtp?type=node=24497=0>> wrote:
>
> > My 2 cents.
> >
> > 1. Totally for a separate native process that will handle the monitoring
> > of an Ignite process. The watchdog process can simply start a JVM tool
> like
> > jstat and parse its GC logs: https://dzone.com/articles/
> > how-monitor-java-garbage <https://dzone.com/articles/
> > how-monitor-java-garbage>
> >
> > 2. As for the STW handling, I would make a possible reaction more
> generic.
> > Let’s define a policy (enumeration) that will define how to deal with an
> > unstable node. The events might be as follows - kill a node, restart a
> > node, trigger a custom script using Runtime.exec or other methods.
> >
> > What’d you think? Specifically on point 2.
> >
> > —
> > Denis
> >
> > > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov <[hidden email]
> <http:///user/SendEmail.jtp?type=node=24497=1>>
> > wrote:
> > >
> > > Yakov,
> > >
> > > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> > >
> > > We split issue to
> > > #1 STW duration metrics
> > > #2 External monitoring allows to stop node during STW
> > >
> > >> Testing GC pause with java thread is
> > >> a bit strange and can give info only after GC pause finishes.
> > >
> > > That's ok since it's #1
> > >
> > > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <
> > [hidden email] <http:///user/SendEmail.jtp?type=node=24497=2>>
> > > wrote:
> > >
> > >> I have tested solution with java-thread and GC logs had contain same
> > pause
> > >> values of thread stopping which was detected by java-thread.
> > >>
> > >>
> > >> My log (contains pauses > 100ms):
> > >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long
> STW
> > >> pause: 507 milliseconds.
> > >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long
> STW
> > >> pause: 5595 milliseconds.
> > >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long
> STW
> > >> pause: 3262 milliseconds.
> > >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long
> STW
> > >> pause: 1737 milliseconds.
> > >>
> > >> GC log:
> > >> gridgain@dell-5580-92zc8h2:~$ cat
> > >> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> > >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which application
> > >> threads were stopped: 0,845 seconds, Stopping threads took:
> > 0,246
> > >> seconds
> > >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which application
> > >> threads were stopped: 0,0001072 seconds, Stopping threads took:
> > 0,252
> > >> seconds
> > >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which application
> > >> threads were stopped: 0,5001082 seconds, Stopping threads took:
> > 0,178
> > >> seconds// GOT!
> > >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which application
> > >> threads were stopped: 5,5856603 seconds, Stopping threads took:
> > 0,0000229

Re: Facility to detect long STW pauses and other system response degradations

2017-11-21 Thread Anton Vinogradov
Denis,

> 1. Totally for a separate native process that will handle the monitoring
of an Ignite process. The watchdog process can simply start a JVM tool like
jstat and parse its GC logs: https://dzone.com/articles/
how-monitor-java-garbage 
Different GCs, and even the same GC on different OS/JVM combinations, produce
different logs. It's not easy to parse them. But since http://gceasy.io can do
that, it looks to be possible, somehow :).
Do you know any libs or solutions that allow doing this in real time?
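
As a rough illustration of why it's doable but fiddly, here is a sketch that only
understands the HotSpot -XX:+PrintGCApplicationStoppedTime lines quoted earlier in
this thread; every other GC/JVM log format would need its own pattern, which is
exactly the problem:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: extract stopped-time values from HotSpot logs produced with
// -XX:+PrintGCApplicationStoppedTime. The decimal separator may be '.' or ','
// depending on the locale, as in the logs in this thread.
public class StoppedTimeParser {
    private static final Pattern STOPPED = Pattern.compile(
        "Total time for which application threads were stopped: ([0-9]+[.,][0-9]+) seconds");

    public static void main(String[] args) throws Exception {
        try (BufferedReader rd = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = rd.readLine()) != null) {
                Matcher m = STOPPED.matcher(line);
                if (m.find()) {
                    double seconds = Double.parseDouble(m.group(1).replace(',', '.'));
                    if (seconds > 0.1) // report pauses longer than 100 ms
                        System.out.printf("Long STW pause: %.0f ms%n", seconds * 1000);
                }
            }
        }
    }
}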

> 2. As for the STW handling, I would make a possible reaction more
generic. Let’s define a policy (enumeration) that will define how to deal
with an unstable node. The events might be as follows - kill a node,
restart a node, trigger a custom script using Runtime.exec or other methods.
Yes, it should be similar to the segmentation policy, plus custom script execution.


On Tue, Nov 21, 2017 at 2:10 AM, Denis Magda  wrote:

> My 2 cents.
>
> 1. Totally for a separate native process that will handle the monitoring
> of an Ignite process. The watchdog process can simply start a JVM tool like
> jstat and parse its GC logs: https://dzone.com/articles/
> how-monitor-java-garbage  how-monitor-java-garbage>
>
> 2. As for the STW handling, I would make a possible reaction more generic.
> Let’s define a policy (enumeration) that will define how to deal with an
> unstable node. The events might be as follows - kill a node, restart a
> node, trigger a custom script using Runtime.exec or other methods.
>
> What’d you think? Specifically on point 2.
>
> —
> Denis
>
> > On Nov 20, 2017, at 6:47 AM, Anton Vinogradov 
> wrote:
> >
> > Yakov,
> >
> > Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> >
> > We split issue to
> > #1 STW duration metrics
> > #2 External monitoring allows to stop node during STW
> >
> >> Testing GC pause with java thread is
> >> a bit strange and can give info only after GC pause finishes.
> >
> > That's ok since it's #1
> >
> > On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin <
> sbt.sorokin@gmail.com>
> > wrote:
> >
> >> I have tested solution with java-thread and GC logs had contain same
> pause
> >> values of thread stopping which was detected by java-thread.
> >>
> >>
> >> My log (contains pauses > 100ms):
> >> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW
> >> pause: 507 milliseconds.
> >> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW
> >> pause: 5595 milliseconds.
> >> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW
> >> pause: 3262 milliseconds.
> >> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW
> >> pause: 1737 milliseconds.
> >>
> >> GC log:
> >> gridgain@dell-5580-92zc8h2:~$ cat
> >> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> >> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which application
> >> threads were stopped: 0,845 seconds, Stopping threads took:
> 0,246
> >> seconds
> >> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which application
> >> threads were stopped: 0,0001072 seconds, Stopping threads took:
> 0,252
> >> seconds
> >> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which application
> >> threads were stopped: 0,5001082 seconds, Stopping threads took:
> 0,178
> >> seconds// GOT!
> >> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which application
> >> threads were stopped: 5,5856603 seconds, Stopping threads took:
> 0,229
> >> seconds// GOT!
> >> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which application
> >> threads were stopped: 3,2595700 seconds, Stopping threads took:
> 0,223
> >> seconds// GOT!
> >> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which application
> >> threads were stopped: 1,7337123 seconds, Stopping threads took:
> 0,121
> >> seconds// GOT!
> >>
> >>
> >>
> >>
> >> --
> >> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
> >>
>
>


Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Denis Magda
My 2 cents.

1. I'm totally for a separate native process that will handle the monitoring of an 
Ignite process. The watchdog process can simply start a JVM tool like jstat and 
parse its GC logs: https://dzone.com/articles/how-monitor-java-garbage 


2. As for the STW handling, I would make the possible reaction more generic. 
Let’s define a policy (enumeration) that will define how to deal with an 
unstable node. The options might be as follows: kill a node, restart a node, or 
trigger a custom script using Runtime.exec or other methods.
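
A rough sketch of what such a policy could look like (the enum and class names here
are purely illustrative, not part of Ignite's configuration):

enum StwReactionPolicy {
    KILL_NODE,      // halt the JVM immediately
    RESTART_NODE,   // stop the node and start it again
    CUSTOM_SCRIPT   // run an operator-supplied script
}

final class StwReactionHandler {
    private final StwReactionPolicy policy;
    private final String script; // only used with CUSTOM_SCRIPT

    StwReactionHandler(StwReactionPolicy policy, String script) {
        this.policy = policy;
        this.script = script;
    }

    /** Invoked when a pause longer than the critical threshold is detected. */
    void onCriticalPause(long pauseMs) throws java.io.IOException {
        switch (policy) {
            case KILL_NODE:
                Runtime.getRuntime().halt(130); // hard stop, skips shutdown hooks
                break;

            case RESTART_NODE:
                // Restart is deployment-specific (systemd, supervisor, custom launcher).
                break;

            case CUSTOM_SCRIPT:
                Runtime.getRuntime().exec(new String[] {script, Long.toString(pauseMs)});
                break;
        }
    }
}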

What’d you think? Specifically on point 2.

—
Denis

> On Nov 20, 2017, at 6:47 AM, Anton Vinogradov  
> wrote:
> 
> Yakov,
> 
> Issue is https://issues.apache.org/jira/browse/IGNITE-6171
> 
> We split issue to
> #1 STW duration metrics
> #2 External monitoring allows to stop node during STW
> 
>> Testing GC pause with java thread is
>> a bit strange and can give info only after GC pause finishes.
> 
> That's ok since it's #1
> 
> On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin 
> wrote:
> 
>> I have tested solution with java-thread and GC logs had contain same pause
>> values of thread stopping which was detected by java-thread.
>> 
>> 
>> My log (contains pauses > 100ms):
>> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW
>> pause: 507 milliseconds.
>> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW
>> pause: 5595 milliseconds.
>> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW
>> pause: 3262 milliseconds.
>> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW
>> pause: 1737 milliseconds.
>> 
>> GC log:
>> gridgain@dell-5580-92zc8h2:~$ cat
>> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
>> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which application
>> threads were stopped: 0,845 seconds, Stopping threads took: 0,246
>> seconds
>> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which application
>> threads were stopped: 0,0001072 seconds, Stopping threads took: 0,252
>> seconds
>> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which application
>> threads were stopped: 0,5001082 seconds, Stopping threads took: 0,178
>> seconds// GOT!
>> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which application
>> threads were stopped: 5,5856603 seconds, Stopping threads took: 0,229
>> seconds// GOT!
>> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which application
>> threads were stopped: 3,2595700 seconds, Stopping threads took: 0,223
>> seconds// GOT!
>> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which application
>> threads were stopped: 1,7337123 seconds, Stopping threads took: 0,121
>> seconds// GOT!
>> 
>> 
>> 
>> 
>> --
>> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
>> 



Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Anton Vinogradov
Yakov,

Issue is https://issues.apache.org/jira/browse/IGNITE-6171

We split the issue into:
#1 STW duration metrics
#2 External monitoring that allows stopping a node during STW

> Testing GC pause with java thread is
> a bit strange and can give info only after GC pause finishes.

That's OK, since it's about #1.

On Mon, Nov 20, 2017 at 5:45 PM, Dmitriy_Sorokin 
wrote:

> I have tested solution with java-thread and GC logs had contain same pause
> values of thread stopping which was detected by java-thread.
>
>
> My log (contains pauses > 100ms):
> [2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW
> pause: 507 milliseconds.
> [2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW
> pause: 5595 milliseconds.
> [2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW
> pause: 3262 milliseconds.
> [2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW
> pause: 1737 milliseconds.
>
> GC log:
> gridgain@dell-5580-92zc8h2:~$ cat
> ./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
> 2017-11-20T17:33:27.608+0300: 0,116: Total time for which application
> threads were stopped: 0,845 seconds, Stopping threads took: 0,246
> seconds
> 2017-11-20T17:33:27.667+0300: 0,175: Total time for which application
> threads were stopped: 0,0001072 seconds, Stopping threads took: 0,252
> seconds
> 2017-11-20T17:33:28.822+0300: 1,330: Total time for which application
> threads were stopped: 0,5001082 seconds, Stopping threads took: 0,178
> seconds// GOT!
> 2017-11-20T17:33:34.521+0300: 7,030: Total time for which application
> threads were stopped: 5,5856603 seconds, Stopping threads took: 0,229
> seconds// GOT!
> 2017-11-20T17:33:37.896+0300: 10,405: Total time for which application
> threads were stopped: 3,2595700 seconds, Stopping threads took: 0,223
> seconds// GOT!
> 2017-11-20T17:33:39.714+0300: 12,222: Total time for which application
> threads were stopped: 1,7337123 seconds, Stopping threads took: 0,121
> seconds// GOT!
>
>
>
>
> --
> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
>


Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Dmitriy_Sorokin
I have tested the solution with a Java thread, and the GC logs contain the same
thread-stopping pause values as were detected by the Java thread.


My log (contains pauses > 100ms):
[2017-11-20 17:33:28,822][WARN ][Thread-1][root] Possible too long STW
pause: 507 milliseconds.
[2017-11-20 17:33:34,522][WARN ][Thread-1][root] Possible too long STW
pause: 5595 milliseconds.
[2017-11-20 17:33:37,896][WARN ][Thread-1][root] Possible too long STW
pause: 3262 milliseconds.
[2017-11-20 17:33:39,714][WARN ][Thread-1][root] Possible too long STW
pause: 1737 milliseconds.

GC log:
gridgain@dell-5580-92zc8h2:~$ cat
./dev/ignite-logs/gc-2017-11-20_17-33-27.log | grep Total
2017-11-20T17:33:27.608+0300: 0,116: Total time for which application
threads were stopped: 0,845 seconds, Stopping threads took: 0,246
seconds
2017-11-20T17:33:27.667+0300: 0,175: Total time for which application
threads were stopped: 0,0001072 seconds, Stopping threads took: 0,252
seconds
2017-11-20T17:33:28.822+0300: 1,330: Total time for which application
threads were stopped: 0,5001082 seconds, Stopping threads took: 0,178
seconds// GOT!
2017-11-20T17:33:34.521+0300: 7,030: Total time for which application
threads were stopped: 5,5856603 seconds, Stopping threads took: 0,229
seconds// GOT!
2017-11-20T17:33:37.896+0300: 10,405: Total time for which application
threads were stopped: 3,2595700 seconds, Stopping threads took: 0,223
seconds// GOT!
2017-11-20T17:33:39.714+0300: 12,222: Total time for which application
threads were stopped: 1,7337123 seconds, Stopping threads took: 0,121
seconds// GOT!




--
Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/


Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Yakov Zhdanov
Guys, how about having 2 native threads - one calling some Java method, the
other monitoring that the first one is active and is not stuck on a
safepoint (which points to a GC pause)? Testing a GC pause with a Java thread is
a bit strange and can give info only after the GC pause finishes. Native
threads can give info immediately. This was suggested by V. Ozerov some
time ago. Vova, I believe we already have a ticket. Can you please provide
a link?


--Yakov


Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Dmitry Pavlov
Yes, we need some timestamp from Java code. But I think a JVM thread could
update the timestamp with delays not related to GC, which would have the same
effect as IgniteUtils#currentTimeMillis().

Does this new test compare the result of the Java timestamp difference with the
GC logs?

Mon, Nov 20, 2017 at 16:39, Anton Vinogradov :

> Dmitriy,
>
> > Sleeping Java Thread IMO is not an option, because thread can be in
> > Timed_Watiting logner than timeout.
>
> That's the only one idea we have, and, according to tests, it works!
>
> > Did I understand correctly that the native stream is proposed? And our
> goal
> > now is to select best framework for this?
>
> That's one of possible cases.
> We can replace native thread by another JVM, this should solve
> compatibility issues.
>
>
> On Mon, Nov 20, 2017 at 4:24 PM, Dmitry Pavlov 
> wrote:
>
> > Sleeping Java Thread IMO is not an option, because thread can be in
> > Timed_Watiting logner than timeout.
> >
> > Did I understand correctly that the native stream is proposed? And our
> goal
> > now is to select best framework for this?
> >
> > Can we limit this oppotunity with several popular OS (Win,Linux), and do
> > not implement this feature for all operation systems?
> >
> >
> > пн, 20 нояб. 2017 г. в 14:55, Anton Vinogradov  >:
> >
> > > Igniters,
> > >
> > > Since no one rejected proposal, let's start from part one.
> > >
> > > > I propose to add a special thread that will record current time
> every N
> > > > milliseconds and check the difference with the latest recorded value.
> > > > The maximum and total pause values for a certain period can be
> > published
> > > in
> > > > the special metrics available through JMX.
> > >
> > > On Fri, Nov 17, 2017 at 4:08 PM, Dmitriy_Sorokin <
> > > sbt.sorokin@gmail.com>
> > > wrote:
> > >
> > > > Hi, Igniters!
> > > >
> > > > This discussion thread related to
> > > > https://issues.apache.org/jira/browse/IGNITE-6171.
> > > >
> > > > Currently there are no JVM performance monitoring tools in AI, for
> > > example
> > > > the impact of GC (eg STW) on the operation of the node. I think we
> > should
> > > > add this functionality.
> > > >
> > > > 1) It is useful to know that STW duration increased or any other
> > > situations
> > > > leads to similar consequences.
> > > > This will allow system administrators to solve issues prior they
> become
> > > > problems.
> > > >
> > > > I propose to add a special thread that will record current time
> every N
> > > > milliseconds and check the difference with the latest recorded value.
> > > > The maximum and total pause values for a certain period can be
> > published
> > > in
> > > > the special metrics available through JMX.
> > > >
> > > > 2) If the pause reaches a critical value, we need to stop the node,
> > > without
> > > > waiting for end of the pause.
> > > >
> > > > The thread (from the first part of the proposed solution) is able to
> > > > estimate the pause duration, but only after its completion.
> > > > So, we need an external thread (in another JVM or native) that is
> able
> > to
> > > > recognize that the pause duration has passed the critical mark.
> > > >
> > > > We can estimate (STW or similar) pause duration by
> > > >  a) reading value updated by the first thread, somehow (eg via JMX,
> > shmem
> > > > or
> > > > shared file)
> > > >  or
> > > >  b) by using JVM diagnostic tools. Does anybody know crossplatform
> > > > solutions?
> > > >
> > > > Feel free to suggest ideas or tips, especially about second part of
> > > > proposal.
> > > >
> > > > Thoughts?
> > > >
> > > >
> > > >
> > > > --
> > > > Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
> > > >
> > >
> >
>


Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Anton Vinogradov
Dmitriy,

> Sleeping Java Thread IMO is not an option, because thread can be in
> Timed_Watiting logner than timeout.

That's the only idea we have, and according to tests, it works!

> Did I understand correctly that the native stream is proposed? And our
goal
> now is to select best framework for this?

That's one of the possible options.
We can replace the native thread with another JVM; this should solve the
compatibility issues.


On Mon, Nov 20, 2017 at 4:24 PM, Dmitry Pavlov 
wrote:

> Sleeping Java Thread IMO is not an option, because thread can be in
> Timed_Watiting logner than timeout.
>
> Did I understand correctly that the native stream is proposed? And our goal
> now is to select best framework for this?
>
> Can we limit this oppotunity with several popular OS (Win,Linux), and do
> not implement this feature for all operation systems?
>
>
> пн, 20 нояб. 2017 г. в 14:55, Anton Vinogradov :
>
> > Igniters,
> >
> > Since no one rejected proposal, let's start from part one.
> >
> > > I propose to add a special thread that will record current time every N
> > > milliseconds and check the difference with the latest recorded value.
> > > The maximum and total pause values for a certain period can be
> published
> > in
> > > the special metrics available through JMX.
> >
> > On Fri, Nov 17, 2017 at 4:08 PM, Dmitriy_Sorokin <
> > sbt.sorokin@gmail.com>
> > wrote:
> >
> > > Hi, Igniters!
> > >
> > > This discussion thread related to
> > > https://issues.apache.org/jira/browse/IGNITE-6171.
> > >
> > > Currently there are no JVM performance monitoring tools in AI, for
> > example
> > > the impact of GC (eg STW) on the operation of the node. I think we
> should
> > > add this functionality.
> > >
> > > 1) It is useful to know that STW duration increased or any other
> > situations
> > > leads to similar consequences.
> > > This will allow system administrators to solve issues prior they become
> > > problems.
> > >
> > > I propose to add a special thread that will record current time every N
> > > milliseconds and check the difference with the latest recorded value.
> > > The maximum and total pause values for a certain period can be
> published
> > in
> > > the special metrics available through JMX.
> > >
> > > 2) If the pause reaches a critical value, we need to stop the node,
> > without
> > > waiting for end of the pause.
> > >
> > > The thread (from the first part of the proposed solution) is able to
> > > estimate the pause duration, but only after its completion.
> > > So, we need an external thread (in another JVM or native) that is able
> to
> > > recognize that the pause duration has passed the critical mark.
> > >
> > > We can estimate (STW or similar) pause duration by
> > >  a) reading value updated by the first thread, somehow (eg via JMX,
> shmem
> > > or
> > > shared file)
> > >  or
> > >  b) by using JVM diagnostic tools. Does anybody know crossplatform
> > > solutions?
> > >
> > > Feel free to suggest ideas or tips, especially about second part of
> > > proposal.
> > >
> > > Thoughts?
> > >
> > >
> > >
> > > --
> > > Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
> > >
> >
>


Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Dmitry Pavlov
A sleeping Java thread IMO is not an option, because the thread can stay in
TIMED_WAITING longer than the timeout.

Did I understand correctly that a native thread is proposed? And that our goal
now is to select the best framework for this?

Can we limit this capability to several popular OSes (Windows, Linux) and not
implement this feature for all operating systems?


Mon, Nov 20, 2017 at 14:55, Anton Vinogradov :

> Igniters,
>
> Since no one rejected proposal, let's start from part one.
>
> > I propose to add a special thread that will record current time every N
> > milliseconds and check the difference with the latest recorded value.
> > The maximum and total pause values for a certain period can be published
> in
> > the special metrics available through JMX.
>
> On Fri, Nov 17, 2017 at 4:08 PM, Dmitriy_Sorokin <
> sbt.sorokin@gmail.com>
> wrote:
>
> > Hi, Igniters!
> >
> > This discussion thread related to
> > https://issues.apache.org/jira/browse/IGNITE-6171.
> >
> > Currently there are no JVM performance monitoring tools in AI, for
> example
> > the impact of GC (eg STW) on the operation of the node. I think we should
> > add this functionality.
> >
> > 1) It is useful to know that STW duration increased or any other
> situations
> > leads to similar consequences.
> > This will allow system administrators to solve issues prior they become
> > problems.
> >
> > I propose to add a special thread that will record current time every N
> > milliseconds and check the difference with the latest recorded value.
> > The maximum and total pause values for a certain period can be published
> in
> > the special metrics available through JMX.
> >
> > 2) If the pause reaches a critical value, we need to stop the node,
> without
> > waiting for end of the pause.
> >
> > The thread (from the first part of the proposed solution) is able to
> > estimate the pause duration, but only after its completion.
> > So, we need an external thread (in another JVM or native) that is able to
> > recognize that the pause duration has passed the critical mark.
> >
> > We can estimate (STW or similar) pause duration by
> >  a) reading value updated by the first thread, somehow (eg via JMX, shmem
> > or
> > shared file)
> >  or
> >  b) by using JVM diagnostic tools. Does anybody know crossplatform
> > solutions?
> >
> > Feel free to suggest ideas or tips, especially about second part of
> > proposal.
> >
> > Thoughts?
> >
> >
> >
> > --
> > Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
> >
>


Re: Facility to detect long STW pauses and other system response degradations

2017-11-20 Thread Anton Vinogradov
Igniters,

Since no one rejected the proposal, let's start with part one.

> I propose to add a special thread that will record current time every N
> milliseconds and check the difference with the latest recorded value.
> The maximum and total pause values for a certain period can be published
in
> the special metrics available through JMX.

On Fri, Nov 17, 2017 at 4:08 PM, Dmitriy_Sorokin 
wrote:

> Hi, Igniters!
>
> This discussion thread related to
> https://issues.apache.org/jira/browse/IGNITE-6171.
>
> Currently there are no JVM performance monitoring tools in AI, for example
> the impact of GC (eg STW) on the operation of the node. I think we should
> add this functionality.
>
> 1) It is useful to know that STW duration increased or any other situations
> leads to similar consequences.
> This will allow system administrators to solve issues prior they become
> problems.
>
> I propose to add a special thread that will record current time every N
> milliseconds and check the difference with the latest recorded value.
> The maximum and total pause values for a certain period can be published in
> the special metrics available through JMX.
>
> 2) If the pause reaches a critical value, we need to stop the node, without
> waiting for end of the pause.
>
> The thread (from the first part of the proposed solution) is able to
> estimate the pause duration, but only after its completion.
> So, we need an external thread (in another JVM or native) that is able to
> recognize that the pause duration has passed the critical mark.
>
> We can estimate (STW or similar) pause duration by
>  a) reading value updated by the first thread, somehow (eg via JMX, shmem
> or
> shared file)
>  or
>  b) by using JVM diagnostic tools. Does anybody know crossplatform
> solutions?
>
> Feel free to suggest ideas or tips, especially about second part of
> proposal.
>
> Thoughts?
>
>
>
> --
> Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/
>


Facility to detect long STW pauses and other system response degradations

2017-11-17 Thread Dmitriy_Sorokin
Hi, Igniters!

This discussion thread related to
https://issues.apache.org/jira/browse/IGNITE-6171.

Currently there are no JVM performance-monitoring tools in AI (Apache Ignite)
showing, for example, the impact of GC (e.g. STW pauses) on the operation of a
node. I think we should add this functionality.

1) It is useful to know that STW duration has increased, or that any other
situation has led to similar consequences.
This will allow system administrators to address issues before they become
problems.

I propose to add a special thread that will record the current time every N
milliseconds and check the difference with the latest recorded value.
The maximum and total pause values for a certain period can be published in
special metrics available through JMX.
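
A rough sketch of such a thread (class and field names are illustrative; the
measured gap can also include non-GC scheduling delays, so the value is an
estimate rather than an exact STW duration):

import java.util.concurrent.atomic.AtomicLong;

// Sketch of the proposed in-process detector: a daemon thread records the current
// time every PERIOD_MS; if consecutive wake-ups are further apart than that, the
// whole JVM (including this thread) was most likely stopped for the difference.
public class PauseWatcher implements Runnable {
    private static final long PERIOD_MS = 10;

    // To be published, e.g., through a JMX MBean as the proposed metrics.
    final AtomicLong maxPauseMs = new AtomicLong();
    final AtomicLong totalPauseMs = new AtomicLong();

    // Last wake-up time; an external watchdog could also read this (see part 2).
    volatile long lastHeartbeat = System.currentTimeMillis();

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            try {
                Thread.sleep(PERIOD_MS);
            }
            catch (InterruptedException ignored) {
                return;
            }

            long now = System.currentTimeMillis();
            long pause = now - lastHeartbeat - PERIOD_MS; // delay beyond the normal period

            lastHeartbeat = now;

            if (pause > 0) {
                totalPauseMs.addAndGet(pause);
                maxPauseMs.accumulateAndGet(pause, Math::max);
            }
        }
    }

    public static PauseWatcher start() {
        PauseWatcher w = new PauseWatcher();
        Thread t = new Thread(w, "stw-pause-watcher");
        t.setDaemon(true);
        t.start();
        return w;
    }
}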

2) If the pause reaches a critical value, we need to stop the node without
waiting for the end of the pause.

The thread (from the first part of the proposed solution) is able to
estimate the pause duration, but only after its completion. 
So, we need an external thread (in another JVM or native) that is able to
recognize that the pause duration has passed the critical mark.

We can estimate the (STW or similar) pause duration by
 a) reading the value updated by the first thread, somehow (e.g. via JMX, shmem or
a shared file),
 or
 b) using JVM diagnostic tools. Does anybody know cross-platform
solutions? (A sketch of option (a) follows below.)
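
For option (a), a separate watchdog process (another JVM) could look roughly like
this, assuming the in-process thread from part 1 keeps writing its last wake-up
timestamp into a shared file (the file path, threshold and reaction are illustrative):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Sketch of an external watchdog for option (a): the monitored node periodically
// writes System.currentTimeMillis() into a shared file; this process declares the
// node unresponsive while the pause is still in progress, once the value gets stale.
// Both processes run on the same machine, so their wall clocks are comparable.
public class ExternalStwWatchdog {
    public static void main(String[] args) throws Exception {
        Path heartbeatFile = Paths.get(args[0]);        // e.g. /var/run/ignite/node.heartbeat
        long criticalPauseMs = Long.parseLong(args[1]); // e.g. 5000

        while (true) {
            String content = new String(Files.readAllBytes(heartbeatFile)).trim();
            long silenceMs = System.currentTimeMillis() - Long.parseLong(content);

            if (silenceMs > criticalPauseMs) {
                System.err.println("Node unresponsive for " + silenceMs + " ms, reacting now");
                // The reaction itself is policy-specific: kill the node's PID,
                // restart it, or run a custom script.
                break;
            }

            Thread.sleep(200);
        }
    }
}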

Feel free to suggest ideas or tips, especially about second part of
proposal.

Thoughts?



--
Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/