Re: Critical worker threads liveness checking drawbacks

2018-12-19 Thread Dmitriy Pavlov
Hi,

Sorry for being too formal here, but IGNITE-10003
 is in progress.

Also, I've tried to find anything related to it in the list. So according
to the list, no one was asking to include.

Sincerely,
Dmitriy Pavlov

ср, 19 дек. 2018 г. в 13:24, Nikolay Izhikov :

> Hello, Alexey.
>
> No, we don't include this ticket to 2.7.
> Should we?
>
> ср, 19 дек. 2018 г. в 12:55, Alexey Goncharuk  >:
>
> > Folks, why did not we include IGNITE-10003 to ignite-2.7 release scope?
> > This causes an Ignite node to be stopped by default when checkpoint read
> > lock acquire times out. I expect a lot of Ignite 2.7 users will be
> affected
> > by this mistake.
> >
> > We should at least update the documentation and make users aware of a
> > workaround.
> >
> > чт, 25 окт. 2018 г. в 16:35, Alexey Goncharuk <
> alexey.goncha...@gmail.com
> > >:
> >
> > > Andrey,
> > >
> > > I still see that checkpoint read lock acquisition raises a
> > CRITICAL_ERROR,
> > > which by default will shut down local node. As far as I remember, we
> > > decided that by default thread timeout should not trigger node failure.
> > > Now, however, it does, because we ignore SYSTEM_WORKER_BLOCKED events
> in
> > > default configuration.
> > >
> > > Should we introduce another critical failure type
> > > CHECKPOINT_READ_LOCK_BLOCKED or use SYSTEM_WORKER_BLOCKED for
> checkpoint
> > > read lock acquire failure?
> > >
> > > --AG
> > >
> > > пт, 12 окт. 2018 г. в 8:29, Andrey Kuznetsov :
> > >
> > >> Igniters,
> > >>
> > >> Now I spot blocking / long-running code arising from
> > >> {{GridDhtPartitionsExchangeFuture#init}} calls in partition-exchanger
> > >> thread, see [1]. Ideally, all blocking operations along all possible
> > code
> > >> paths should be guarded implicitly from critical failure detector to
> > avoid
> > >> the thread from being considered blocked. There is a pull request [2]
> > that
> > >> provides shallow solution. I didn't change code outside
> > >> {{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by
> any
> > >> upcoming change. Also, I didn't touch the code runnable by threads
> other
> > >> than partition-exchanger. So I have a number of guarded sections that
> > are
> > >> wider than they could be, and this potentially hides issues from
> failure
> > >> detector. Does this PR make sense? Or maybe it's better to exclude
> > >> partition-exchanger from critical threads registry at all?
> > >>
> > >> [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > >> [2] https://github.com/apache/ignite/pull/4962
> > >>
> > >>
> > >> пт, 28 сент. 2018 г. в 18:56, Maxim Muzafarov :
> > >>
> > >> > Andrey, Andrey
> > >> >
> > >> > > Thanks for being attentive! It's definitely a typo. Could you
> please
> > >> > create
> > >> > > an issue?
> > >> >
> > >> > I've created an issue [1] and prepared PR [2].
> > >> > Please, review this change.
> > >> >
> > >> > [1] https://issues.apache.org/jira/browse/IGNITE-9723
> > >> > [2] https://github.com/apache/ignite/pull/4862
> > >> >
> > >> > On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov 
> > wrote:
> > >> >
> > >> > > Config option + mbean access. Does that make sense?
> > >> > >
> > >> > > Yakov
> > >> > >
> > >> > > On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov  >
> > >> > wrote:
> > >> > >
> > >> > > > Then it should be config option.
> > >> > > >
> > >> > > > пт, 28 сент. 2018 г. в 13:15, Andrey Gura :
> > >> > > >
> > >> > > > > Guys,
> > >> > > > >
> > >> > > > > why we need both config option and system property? I believe
> > one
> > >> way
> > >> > > is
> > >> > > > > enough.
> > >> > > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <
> > >> > nizhi...@apache.org>
> > >> > > > > wrote:
> > >> > > > > >
> > >> > > > > > Ticket created -
> > >> https://issues.apache.org/jira/browse/IGNITE-9737
> > >> > > > > >
> > >> > > > > > Fixed version is 2.7.
> > >> > > > > >
> > >> > > > > > В Пт, 28/09/2018 в 11:41 +0300, Alexey Goncharuk пишет:
> > >> > > > > > > Nikolay, I agree, a user should be able to disable both
> > thread
> > >> > > > liveness
> > >> > > > > > > check and checkpoint read lock timeout check from config
> > and a
> > >> > > system
> > >> > > > > > > property.
> > >> > > > > > >
> > >> > > > > > > пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov <
> > >> > nizhi...@apache.org
> > >> > > >:
> > >> > > > > > >
> > >> > > > > > > > Hello, Igniters.
> > >> > > > > > > >
> > >> > > > > > > > I found that this feature can't be disabled from config.
> > >> > > > > > > > The only way to disable it is from JMX bean.
> > >> > > > > > > >
> > >> > > > > > > > I think it very dangerous: If we have some corner case
> or
> > a
> > >> bug
> > >> > > in
> > >> > > > > this
> > >> > > > > > > > Watch Dog it can make Ignite unusable.
> > >> > > > > > > > I propose to implement possibility to disable this
> feature
> > >> > both -
> > >> > > > > from
> > >> > > > > > > > config and from JVM options.
> > >> > > > > > > >
> > >> > > > 

Re: Critical worker threads liveness checking drawbacks

2018-12-19 Thread Nikolay Izhikov
Hello, Alexey.

No, we don't include this ticket to 2.7.
Should we?

ср, 19 дек. 2018 г. в 12:55, Alexey Goncharuk :

> Folks, why did not we include IGNITE-10003 to ignite-2.7 release scope?
> This causes an Ignite node to be stopped by default when checkpoint read
> lock acquire times out. I expect a lot of Ignite 2.7 users will be affected
> by this mistake.
>
> We should at least update the documentation and make users aware of a
> workaround.
>
> чт, 25 окт. 2018 г. в 16:35, Alexey Goncharuk  >:
>
> > Andrey,
> >
> > I still see that checkpoint read lock acquisition raises a
> CRITICAL_ERROR,
> > which by default will shut down local node. As far as I remember, we
> > decided that by default thread timeout should not trigger node failure.
> > Now, however, it does, because we ignore SYSTEM_WORKER_BLOCKED events in
> > default configuration.
> >
> > Should we introduce another critical failure type
> > CHECKPOINT_READ_LOCK_BLOCKED or use SYSTEM_WORKER_BLOCKED for checkpoint
> > read lock acquire failure?
> >
> > --AG
> >
> > пт, 12 окт. 2018 г. в 8:29, Andrey Kuznetsov :
> >
> >> Igniters,
> >>
> >> Now I spot blocking / long-running code arising from
> >> {{GridDhtPartitionsExchangeFuture#init}} calls in partition-exchanger
> >> thread, see [1]. Ideally, all blocking operations along all possible
> code
> >> paths should be guarded implicitly from critical failure detector to
> avoid
> >> the thread from being considered blocked. There is a pull request [2]
> that
> >> provides shallow solution. I didn't change code outside
> >> {{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by any
> >> upcoming change. Also, I didn't touch the code runnable by threads other
> >> than partition-exchanger. So I have a number of guarded sections that
> are
> >> wider than they could be, and this potentially hides issues from failure
> >> detector. Does this PR make sense? Or maybe it's better to exclude
> >> partition-exchanger from critical threads registry at all?
> >>
> >> [1] https://issues.apache.org/jira/browse/IGNITE-9710
> >> [2] https://github.com/apache/ignite/pull/4962
> >>
> >>
> >> пт, 28 сент. 2018 г. в 18:56, Maxim Muzafarov :
> >>
> >> > Andrey, Andrey
> >> >
> >> > > Thanks for being attentive! It's definitely a typo. Could you please
> >> > create
> >> > > an issue?
> >> >
> >> > I've created an issue [1] and prepared PR [2].
> >> > Please, review this change.
> >> >
> >> > [1] https://issues.apache.org/jira/browse/IGNITE-9723
> >> > [2] https://github.com/apache/ignite/pull/4862
> >> >
> >> > On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov 
> wrote:
> >> >
> >> > > Config option + mbean access. Does that make sense?
> >> > >
> >> > > Yakov
> >> > >
> >> > > On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov 
> >> > wrote:
> >> > >
> >> > > > Then it should be config option.
> >> > > >
> >> > > > пт, 28 сент. 2018 г. в 13:15, Andrey Gura :
> >> > > >
> >> > > > > Guys,
> >> > > > >
> >> > > > > why we need both config option and system property? I believe
> one
> >> way
> >> > > is
> >> > > > > enough.
> >> > > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <
> >> > nizhi...@apache.org>
> >> > > > > wrote:
> >> > > > > >
> >> > > > > > Ticket created -
> >> https://issues.apache.org/jira/browse/IGNITE-9737
> >> > > > > >
> >> > > > > > Fixed version is 2.7.
> >> > > > > >
> >> > > > > > В Пт, 28/09/2018 в 11:41 +0300, Alexey Goncharuk пишет:
> >> > > > > > > Nikolay, I agree, a user should be able to disable both
> thread
> >> > > > liveness
> >> > > > > > > check and checkpoint read lock timeout check from config
> and a
> >> > > system
> >> > > > > > > property.
> >> > > > > > >
> >> > > > > > > пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov <
> >> > nizhi...@apache.org
> >> > > >:
> >> > > > > > >
> >> > > > > > > > Hello, Igniters.
> >> > > > > > > >
> >> > > > > > > > I found that this feature can't be disabled from config.
> >> > > > > > > > The only way to disable it is from JMX bean.
> >> > > > > > > >
> >> > > > > > > > I think it very dangerous: If we have some corner case or
> a
> >> bug
> >> > > in
> >> > > > > this
> >> > > > > > > > Watch Dog it can make Ignite unusable.
> >> > > > > > > > I propose to implement possibility to disable this feature
> >> > both -
> >> > > > > from
> >> > > > > > > > config and from JVM options.
> >> > > > > > > >
> >> > > > > > > > What do you think?
> >> > > > > > > >
> >> > > > > > > > В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет:
> >> > > > > > > > > Maxim,
> >> > > > > > > > >
> >> > > > > > > > > Thanks for being attentive! It's definitely a typo.
> Could
> >> you
> >> > > > > please
> >> > > > > > > >
> >> > > > > > > > create
> >> > > > > > > > > an issue?
> >> > > > > > > > >
> >> > > > > > > > > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov <
> >> > > > maxmu...@gmail.com
> >> > > > > >:
> >> > > > > > > > >
> >> > > > > > > > > > Folks,
> >> > > > > > > > > >
> >> > > > > > > > > > I've found in 

Re: Critical worker threads liveness checking drawbacks

2018-12-19 Thread Alexey Goncharuk
Folks, why did not we include IGNITE-10003 to ignite-2.7 release scope?
This causes an Ignite node to be stopped by default when checkpoint read
lock acquire times out. I expect a lot of Ignite 2.7 users will be affected
by this mistake.

We should at least update the documentation and make users aware of a
workaround.

чт, 25 окт. 2018 г. в 16:35, Alexey Goncharuk :

> Andrey,
>
> I still see that checkpoint read lock acquisition raises a CRITICAL_ERROR,
> which by default will shut down local node. As far as I remember, we
> decided that by default thread timeout should not trigger node failure.
> Now, however, it does, because we ignore SYSTEM_WORKER_BLOCKED events in
> default configuration.
>
> Should we introduce another critical failure type
> CHECKPOINT_READ_LOCK_BLOCKED or use SYSTEM_WORKER_BLOCKED for checkpoint
> read lock acquire failure?
>
> --AG
>
> пт, 12 окт. 2018 г. в 8:29, Andrey Kuznetsov :
>
>> Igniters,
>>
>> Now I spot blocking / long-running code arising from
>> {{GridDhtPartitionsExchangeFuture#init}} calls in partition-exchanger
>> thread, see [1]. Ideally, all blocking operations along all possible code
>> paths should be guarded implicitly from critical failure detector to avoid
>> the thread from being considered blocked. There is a pull request [2] that
>> provides shallow solution. I didn't change code outside
>> {{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by any
>> upcoming change. Also, I didn't touch the code runnable by threads other
>> than partition-exchanger. So I have a number of guarded sections that are
>> wider than they could be, and this potentially hides issues from failure
>> detector. Does this PR make sense? Or maybe it's better to exclude
>> partition-exchanger from critical threads registry at all?
>>
>> [1] https://issues.apache.org/jira/browse/IGNITE-9710
>> [2] https://github.com/apache/ignite/pull/4962
>>
>>
>> пт, 28 сент. 2018 г. в 18:56, Maxim Muzafarov :
>>
>> > Andrey, Andrey
>> >
>> > > Thanks for being attentive! It's definitely a typo. Could you please
>> > create
>> > > an issue?
>> >
>> > I've created an issue [1] and prepared PR [2].
>> > Please, review this change.
>> >
>> > [1] https://issues.apache.org/jira/browse/IGNITE-9723
>> > [2] https://github.com/apache/ignite/pull/4862
>> >
>> > On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov  wrote:
>> >
>> > > Config option + mbean access. Does that make sense?
>> > >
>> > > Yakov
>> > >
>> > > On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov 
>> > wrote:
>> > >
>> > > > Then it should be config option.
>> > > >
>> > > > пт, 28 сент. 2018 г. в 13:15, Andrey Gura :
>> > > >
>> > > > > Guys,
>> > > > >
>> > > > > why we need both config option and system property? I believe one
>> way
>> > > is
>> > > > > enough.
>> > > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <
>> > nizhi...@apache.org>
>> > > > > wrote:
>> > > > > >
>> > > > > > Ticket created -
>> https://issues.apache.org/jira/browse/IGNITE-9737
>> > > > > >
>> > > > > > Fixed version is 2.7.
>> > > > > >
>> > > > > > В Пт, 28/09/2018 в 11:41 +0300, Alexey Goncharuk пишет:
>> > > > > > > Nikolay, I agree, a user should be able to disable both thread
>> > > > liveness
>> > > > > > > check and checkpoint read lock timeout check from config and a
>> > > system
>> > > > > > > property.
>> > > > > > >
>> > > > > > > пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov <
>> > nizhi...@apache.org
>> > > >:
>> > > > > > >
>> > > > > > > > Hello, Igniters.
>> > > > > > > >
>> > > > > > > > I found that this feature can't be disabled from config.
>> > > > > > > > The only way to disable it is from JMX bean.
>> > > > > > > >
>> > > > > > > > I think it very dangerous: If we have some corner case or a
>> bug
>> > > in
>> > > > > this
>> > > > > > > > Watch Dog it can make Ignite unusable.
>> > > > > > > > I propose to implement possibility to disable this feature
>> > both -
>> > > > > from
>> > > > > > > > config and from JVM options.
>> > > > > > > >
>> > > > > > > > What do you think?
>> > > > > > > >
>> > > > > > > > В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет:
>> > > > > > > > > Maxim,
>> > > > > > > > >
>> > > > > > > > > Thanks for being attentive! It's definitely a typo. Could
>> you
>> > > > > please
>> > > > > > > >
>> > > > > > > > create
>> > > > > > > > > an issue?
>> > > > > > > > >
>> > > > > > > > > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov <
>> > > > maxmu...@gmail.com
>> > > > > >:
>> > > > > > > > >
>> > > > > > > > > > Folks,
>> > > > > > > > > >
>> > > > > > > > > > I've found in `GridCachePartitionExchangeManager:2684`
>> [1]
>> > > > > (master
>> > > > > > > >
>> > > > > > > > branch)
>> > > > > > > > > > exchange future wrapped
>> > > > > > > > > > with double `blockingSectionEnd` method. Is it correct?
>> I
>> > > just
>> > > > > want to
>> > > > > > > > > > understand this change and
>> > > > > > > > > > how should I use this in the future.
>> > > > > > > > > >
>> > > > > > > > > > Should I file a 

Re: Critical worker threads liveness checking drawbacks

2018-10-25 Thread Alexey Goncharuk
Andrey,

I still see that checkpoint read lock acquisition raises a CRITICAL_ERROR,
which by default will shut down local node. As far as I remember, we
decided that by default thread timeout should not trigger node failure.
Now, however, it does, because we ignore SYSTEM_WORKER_BLOCKED events in
default configuration.

Should we introduce another critical failure type
CHECKPOINT_READ_LOCK_BLOCKED or use SYSTEM_WORKER_BLOCKED for checkpoint
read lock acquire failure?

--AG

пт, 12 окт. 2018 г. в 8:29, Andrey Kuznetsov :

> Igniters,
>
> Now I spot blocking / long-running code arising from
> {{GridDhtPartitionsExchangeFuture#init}} calls in partition-exchanger
> thread, see [1]. Ideally, all blocking operations along all possible code
> paths should be guarded implicitly from critical failure detector to avoid
> the thread from being considered blocked. There is a pull request [2] that
> provides shallow solution. I didn't change code outside
> {{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by any
> upcoming change. Also, I didn't touch the code runnable by threads other
> than partition-exchanger. So I have a number of guarded sections that are
> wider than they could be, and this potentially hides issues from failure
> detector. Does this PR make sense? Or maybe it's better to exclude
> partition-exchanger from critical threads registry at all?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-9710
> [2] https://github.com/apache/ignite/pull/4962
>
>
> пт, 28 сент. 2018 г. в 18:56, Maxim Muzafarov :
>
> > Andrey, Andrey
> >
> > > Thanks for being attentive! It's definitely a typo. Could you please
> > create
> > > an issue?
> >
> > I've created an issue [1] and prepared PR [2].
> > Please, review this change.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-9723
> > [2] https://github.com/apache/ignite/pull/4862
> >
> > On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov  wrote:
> >
> > > Config option + mbean access. Does that make sense?
> > >
> > > Yakov
> > >
> > > On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov 
> > wrote:
> > >
> > > > Then it should be config option.
> > > >
> > > > пт, 28 сент. 2018 г. в 13:15, Andrey Gura :
> > > >
> > > > > Guys,
> > > > >
> > > > > why we need both config option and system property? I believe one
> way
> > > is
> > > > > enough.
> > > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <
> > nizhi...@apache.org>
> > > > > wrote:
> > > > > >
> > > > > > Ticket created -
> https://issues.apache.org/jira/browse/IGNITE-9737
> > > > > >
> > > > > > Fixed version is 2.7.
> > > > > >
> > > > > > В Пт, 28/09/2018 в 11:41 +0300, Alexey Goncharuk пишет:
> > > > > > > Nikolay, I agree, a user should be able to disable both thread
> > > > liveness
> > > > > > > check and checkpoint read lock timeout check from config and a
> > > system
> > > > > > > property.
> > > > > > >
> > > > > > > пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov <
> > nizhi...@apache.org
> > > >:
> > > > > > >
> > > > > > > > Hello, Igniters.
> > > > > > > >
> > > > > > > > I found that this feature can't be disabled from config.
> > > > > > > > The only way to disable it is from JMX bean.
> > > > > > > >
> > > > > > > > I think it very dangerous: If we have some corner case or a
> bug
> > > in
> > > > > this
> > > > > > > > Watch Dog it can make Ignite unusable.
> > > > > > > > I propose to implement possibility to disable this feature
> > both -
> > > > > from
> > > > > > > > config and from JVM options.
> > > > > > > >
> > > > > > > > What do you think?
> > > > > > > >
> > > > > > > > В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет:
> > > > > > > > > Maxim,
> > > > > > > > >
> > > > > > > > > Thanks for being attentive! It's definitely a typo. Could
> you
> > > > > please
> > > > > > > >
> > > > > > > > create
> > > > > > > > > an issue?
> > > > > > > > >
> > > > > > > > > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov <
> > > > maxmu...@gmail.com
> > > > > >:
> > > > > > > > >
> > > > > > > > > > Folks,
> > > > > > > > > >
> > > > > > > > > > I've found in `GridCachePartitionExchangeManager:2684`
> [1]
> > > > > (master
> > > > > > > >
> > > > > > > > branch)
> > > > > > > > > > exchange future wrapped
> > > > > > > > > > with double `blockingSectionEnd` method. Is it correct? I
> > > just
> > > > > want to
> > > > > > > > > > understand this change and
> > > > > > > > > > how should I use this in the future.
> > > > > > > > > >
> > > > > > > > > > Should I file a new issue to fix this? I think here
> > > > > > > >
> > > > > > > > `blockingSectionBegin`
> > > > > > > > > > method should be used.
> > > > > > > > > >
> > > > > > > > > > -
> > > > > > > > > > blockingSectionEnd();
> > > > > > > > > >
> > > > > > > > > > try {
> > > > > > > > > > resVer = exchFut.get(exchTimeout,
> > TimeUnit.MILLISECONDS);
> > > > > > > > > > } finally {
> > > > > > > > > > blockingSectionEnd();
> > > > > > > > > > }
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > 

Re: Critical worker threads liveness checking drawbacks

2018-10-11 Thread Andrey Kuznetsov
Igniters,

Now I spot blocking / long-running code arising from
{{GridDhtPartitionsExchangeFuture#init}} calls in partition-exchanger
thread, see [1]. Ideally, all blocking operations along all possible code
paths should be guarded implicitly from critical failure detector to avoid
the thread from being considered blocked. There is a pull request [2] that
provides shallow solution. I didn't change code outside
{{GridDhtPartitionsExchangeFuture}}, otherwise it could be broken by any
upcoming change. Also, I didn't touch the code runnable by threads other
than partition-exchanger. So I have a number of guarded sections that are
wider than they could be, and this potentially hides issues from failure
detector. Does this PR make sense? Or maybe it's better to exclude
partition-exchanger from critical threads registry at all?

[1] https://issues.apache.org/jira/browse/IGNITE-9710
[2] https://github.com/apache/ignite/pull/4962


пт, 28 сент. 2018 г. в 18:56, Maxim Muzafarov :

> Andrey, Andrey
>
> > Thanks for being attentive! It's definitely a typo. Could you please
> create
> > an issue?
>
> I've created an issue [1] and prepared PR [2].
> Please, review this change.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-9723
> [2] https://github.com/apache/ignite/pull/4862
>
> On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov  wrote:
>
> > Config option + mbean access. Does that make sense?
> >
> > Yakov
> >
> > On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov 
> wrote:
> >
> > > Then it should be config option.
> > >
> > > пт, 28 сент. 2018 г. в 13:15, Andrey Gura :
> > >
> > > > Guys,
> > > >
> > > > why we need both config option and system property? I believe one way
> > is
> > > > enough.
> > > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov <
> nizhi...@apache.org>
> > > > wrote:
> > > > >
> > > > > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
> > > > >
> > > > > Fixed version is 2.7.
> > > > >
> > > > > В Пт, 28/09/2018 в 11:41 +0300, Alexey Goncharuk пишет:
> > > > > > Nikolay, I agree, a user should be able to disable both thread
> > > liveness
> > > > > > check and checkpoint read lock timeout check from config and a
> > system
> > > > > > property.
> > > > > >
> > > > > > пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov <
> nizhi...@apache.org
> > >:
> > > > > >
> > > > > > > Hello, Igniters.
> > > > > > >
> > > > > > > I found that this feature can't be disabled from config.
> > > > > > > The only way to disable it is from JMX bean.
> > > > > > >
> > > > > > > I think it very dangerous: If we have some corner case or a bug
> > in
> > > > this
> > > > > > > Watch Dog it can make Ignite unusable.
> > > > > > > I propose to implement possibility to disable this feature
> both -
> > > > from
> > > > > > > config and from JVM options.
> > > > > > >
> > > > > > > What do you think?
> > > > > > >
> > > > > > > В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет:
> > > > > > > > Maxim,
> > > > > > > >
> > > > > > > > Thanks for being attentive! It's definitely a typo. Could you
> > > > please
> > > > > > >
> > > > > > > create
> > > > > > > > an issue?
> > > > > > > >
> > > > > > > > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov <
> > > maxmu...@gmail.com
> > > > >:
> > > > > > > >
> > > > > > > > > Folks,
> > > > > > > > >
> > > > > > > > > I've found in `GridCachePartitionExchangeManager:2684` [1]
> > > > (master
> > > > > > >
> > > > > > > branch)
> > > > > > > > > exchange future wrapped
> > > > > > > > > with double `blockingSectionEnd` method. Is it correct? I
> > just
> > > > want to
> > > > > > > > > understand this change and
> > > > > > > > > how should I use this in the future.
> > > > > > > > >
> > > > > > > > > Should I file a new issue to fix this? I think here
> > > > > > >
> > > > > > > `blockingSectionBegin`
> > > > > > > > > method should be used.
> > > > > > > > >
> > > > > > > > > -
> > > > > > > > > blockingSectionEnd();
> > > > > > > > >
> > > > > > > > > try {
> > > > > > > > > resVer = exchFut.get(exchTimeout,
> TimeUnit.MILLISECONDS);
> > > > > > > > > } finally {
> > > > > > > > > blockingSectionEnd();
> > > > > > > > > }
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > [1]
> > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > > >
> > > >
> > >
> >
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > > > > > > > >
> > > > > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <
> > > > daradu...@gmail.com>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > Andrey Gura, thank you for the answer!
> > > > > > > > > >
> > > > > > > > > > I agree that wrapping of 'init' method reduces the profit
> > of
> > > > watchdog
> > > > > > > > > > service in case of PME worker, but in other cases, we
> > should
> > > > wrap all
> > > > > > > > > > possible long sections on GridDhtPartitionExchangeFuture.
> > For
> > > > 

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Maxim Muzafarov
Andrey, Andrey

> Thanks for being attentive! It's definitely a typo. Could you please
create
> an issue?

I've created an issue [1] and prepared PR [2].
Please, review this change.

[1] https://issues.apache.org/jira/browse/IGNITE-9723
[2] https://github.com/apache/ignite/pull/4862

On Fri, 28 Sep 2018 at 16:58 Yakov Zhdanov  wrote:

> Config option + mbean access. Does that make sense?
>
> Yakov
>
> On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov  wrote:
>
> > Then it should be config option.
> >
> > пт, 28 сент. 2018 г. в 13:15, Andrey Gura :
> >
> > > Guys,
> > >
> > > why we need both config option and system property? I believe one way
> is
> > > enough.
> > > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov 
> > > wrote:
> > > >
> > > > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
> > > >
> > > > Fixed version is 2.7.
> > > >
> > > > В Пт, 28/09/2018 в 11:41 +0300, Alexey Goncharuk пишет:
> > > > > Nikolay, I agree, a user should be able to disable both thread
> > liveness
> > > > > check and checkpoint read lock timeout check from config and a
> system
> > > > > property.
> > > > >
> > > > > пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov  >:
> > > > >
> > > > > > Hello, Igniters.
> > > > > >
> > > > > > I found that this feature can't be disabled from config.
> > > > > > The only way to disable it is from JMX bean.
> > > > > >
> > > > > > I think it very dangerous: If we have some corner case or a bug
> in
> > > this
> > > > > > Watch Dog it can make Ignite unusable.
> > > > > > I propose to implement possibility to disable this feature both -
> > > from
> > > > > > config and from JVM options.
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет:
> > > > > > > Maxim,
> > > > > > >
> > > > > > > Thanks for being attentive! It's definitely a typo. Could you
> > > please
> > > > > >
> > > > > > create
> > > > > > > an issue?
> > > > > > >
> > > > > > > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov <
> > maxmu...@gmail.com
> > > >:
> > > > > > >
> > > > > > > > Folks,
> > > > > > > >
> > > > > > > > I've found in `GridCachePartitionExchangeManager:2684` [1]
> > > (master
> > > > > >
> > > > > > branch)
> > > > > > > > exchange future wrapped
> > > > > > > > with double `blockingSectionEnd` method. Is it correct? I
> just
> > > want to
> > > > > > > > understand this change and
> > > > > > > > how should I use this in the future.
> > > > > > > >
> > > > > > > > Should I file a new issue to fix this? I think here
> > > > > >
> > > > > > `blockingSectionBegin`
> > > > > > > > method should be used.
> > > > > > > >
> > > > > > > > -
> > > > > > > > blockingSectionEnd();
> > > > > > > >
> > > > > > > > try {
> > > > > > > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > > > > > > } finally {
> > > > > > > > blockingSectionEnd();
> > > > > > > > }
> > > > > > > >
> > > > > > > >
> > > > > > > > [1]
> > > > > > > >
> > > > > > > >
> > > > > >
> > > > > >
> > >
> >
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > > > > > > >
> > > > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <
> > > daradu...@gmail.com>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Andrey Gura, thank you for the answer!
> > > > > > > > >
> > > > > > > > > I agree that wrapping of 'init' method reduces the profit
> of
> > > watchdog
> > > > > > > > > service in case of PME worker, but in other cases, we
> should
> > > wrap all
> > > > > > > > > possible long sections on GridDhtPartitionExchangeFuture.
> For
> > > example
> > > > > > > > > 'onCacheChangeRequest' method or
> > > > > > > > > 'cctx.affinity().onCacheChangeRequest' inside because it
> may
> > > take
> > > > > > > > > significant time (reproducer attached).
> > > > > > > > >
> > > > > > > > > I only want to point out a possible issue which may allow
> to
> > > end-user
> > > > > > > > > halt the Ignite cluster accidentally.
> > > > > > > > >
> > > > > > > > > I'm sure that PME experts know how to fix this issue
> > properly.
> > > > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <
> > ag...@apache.org
> > > >
> > > > > >
> > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Vyacheslav,
> > > > > > > > > >
> > > > > > > > > > Exchange worker is strongly tied with
> > > > > > > > > > GridDhtPartitionExchangeFuture#init and it is ok.
> Exchange
> > > worker
> > > > > >
> > > > > > also
> > > > > > > > > > shouldn't be blocked for long time but in reality it
> > > happens.It
> > > > > >
> > > > > > also
> > > > > > > > > > means that your change doesn't make sense.
> > > > > > > > > >
> > > > > > > > > > What actually make sense it is identification of places
> > which
> > > > > > > > > > intentionally blocking. May be some places/actions should
> > be
> > > > > >
> > > > > > braced by
> > > > > > > > > > 

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Yakov Zhdanov
Config option + mbean access. Does that make sense?

Yakov

On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov  wrote:

> Then it should be config option.
>
> пт, 28 сент. 2018 г. в 13:15, Andrey Gura :
>
> > Guys,
> >
> > why we need both config option and system property? I believe one way is
> > enough.
> > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov 
> > wrote:
> > >
> > > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
> > >
> > > Fixed version is 2.7.
> > >
> > > В Пт, 28/09/2018 в 11:41 +0300, Alexey Goncharuk пишет:
> > > > Nikolay, I agree, a user should be able to disable both thread
> liveness
> > > > check and checkpoint read lock timeout check from config and a system
> > > > property.
> > > >
> > > > пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov :
> > > >
> > > > > Hello, Igniters.
> > > > >
> > > > > I found that this feature can't be disabled from config.
> > > > > The only way to disable it is from JMX bean.
> > > > >
> > > > > I think it very dangerous: If we have some corner case or a bug in
> > this
> > > > > Watch Dog it can make Ignite unusable.
> > > > > I propose to implement possibility to disable this feature both -
> > from
> > > > > config and from JVM options.
> > > > >
> > > > > What do you think?
> > > > >
> > > > > В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет:
> > > > > > Maxim,
> > > > > >
> > > > > > Thanks for being attentive! It's definitely a typo. Could you
> > please
> > > > >
> > > > > create
> > > > > > an issue?
> > > > > >
> > > > > > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov <
> maxmu...@gmail.com
> > >:
> > > > > >
> > > > > > > Folks,
> > > > > > >
> > > > > > > I've found in `GridCachePartitionExchangeManager:2684` [1]
> > (master
> > > > >
> > > > > branch)
> > > > > > > exchange future wrapped
> > > > > > > with double `blockingSectionEnd` method. Is it correct? I just
> > want to
> > > > > > > understand this change and
> > > > > > > how should I use this in the future.
> > > > > > >
> > > > > > > Should I file a new issue to fix this? I think here
> > > > >
> > > > > `blockingSectionBegin`
> > > > > > > method should be used.
> > > > > > >
> > > > > > > -
> > > > > > > blockingSectionEnd();
> > > > > > >
> > > > > > > try {
> > > > > > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > > > > > } finally {
> > > > > > > blockingSectionEnd();
> > > > > > > }
> > > > > > >
> > > > > > >
> > > > > > > [1]
> > > > > > >
> > > > > > >
> > > > >
> > > > >
> >
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > > > > > >
> > > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <
> > daradu...@gmail.com>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Andrey Gura, thank you for the answer!
> > > > > > > >
> > > > > > > > I agree that wrapping of 'init' method reduces the profit of
> > watchdog
> > > > > > > > service in case of PME worker, but in other cases, we should
> > wrap all
> > > > > > > > possible long sections on GridDhtPartitionExchangeFuture. For
> > example
> > > > > > > > 'onCacheChangeRequest' method or
> > > > > > > > 'cctx.affinity().onCacheChangeRequest' inside because it may
> > take
> > > > > > > > significant time (reproducer attached).
> > > > > > > >
> > > > > > > > I only want to point out a possible issue which may allow to
> > end-user
> > > > > > > > halt the Ignite cluster accidentally.
> > > > > > > >
> > > > > > > > I'm sure that PME experts know how to fix this issue
> properly.
> > > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura <
> ag...@apache.org
> > >
> > > > >
> > > > > wrote:
> > > > > > > > >
> > > > > > > > > Vyacheslav,
> > > > > > > > >
> > > > > > > > > Exchange worker is strongly tied with
> > > > > > > > > GridDhtPartitionExchangeFuture#init and it is ok. Exchange
> > worker
> > > > >
> > > > > also
> > > > > > > > > shouldn't be blocked for long time but in reality it
> > happens.It
> > > > >
> > > > > also
> > > > > > > > > means that your change doesn't make sense.
> > > > > > > > >
> > > > > > > > > What actually make sense it is identification of places
> which
> > > > > > > > > intentionally blocking. May be some places/actions should
> be
> > > > >
> > > > > braced by
> > > > > > > > > blocking guards.
> > > > > > > > >
> > > > > > > > > If you have failing tests please make sure that your
> > > > >
> > > > > failureHandler is
> > > > > > > > > NoOpFailureHandler or any other handler with
> > ignoreFailureTypes =
> > > > > > > > > [CRITICAL_WORKER_BLOCKED].
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <
> > > > > > >
> > > > > > > daradu...@gmail.com>
> > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > Hi Igniters!
> > > > > > > > > >
> > > > > > > > > > Thank you for this important improvement!
> > > > > > > > > >
> > > > > > > > > > I've looked 

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Vladimir Ozerov
Then it should be config option.

пт, 28 сент. 2018 г. в 13:15, Andrey Gura :

> Guys,
>
> why we need both config option and system property? I believe one way is
> enough.
> On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov 
> wrote:
> >
> > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
> >
> > Fixed version is 2.7.
> >
> > В Пт, 28/09/2018 в 11:41 +0300, Alexey Goncharuk пишет:
> > > Nikolay, I agree, a user should be able to disable both thread liveness
> > > check and checkpoint read lock timeout check from config and a system
> > > property.
> > >
> > > пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov :
> > >
> > > > Hello, Igniters.
> > > >
> > > > I found that this feature can't be disabled from config.
> > > > The only way to disable it is from JMX bean.
> > > >
> > > > I think it very dangerous: If we have some corner case or a bug in
> this
> > > > Watch Dog it can make Ignite unusable.
> > > > I propose to implement possibility to disable this feature both -
> from
> > > > config and from JVM options.
> > > >
> > > > What do you think?
> > > >
> > > > В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет:
> > > > > Maxim,
> > > > >
> > > > > Thanks for being attentive! It's definitely a typo. Could you
> please
> > > >
> > > > create
> > > > > an issue?
> > > > >
> > > > > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov  >:
> > > > >
> > > > > > Folks,
> > > > > >
> > > > > > I've found in `GridCachePartitionExchangeManager:2684` [1]
> (master
> > > >
> > > > branch)
> > > > > > exchange future wrapped
> > > > > > with double `blockingSectionEnd` method. Is it correct? I just
> want to
> > > > > > understand this change and
> > > > > > how should I use this in the future.
> > > > > >
> > > > > > Should I file a new issue to fix this? I think here
> > > >
> > > > `blockingSectionBegin`
> > > > > > method should be used.
> > > > > >
> > > > > > -
> > > > > > blockingSectionEnd();
> > > > > >
> > > > > > try {
> > > > > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > > > > } finally {
> > > > > > blockingSectionEnd();
> > > > > > }
> > > > > >
> > > > > >
> > > > > > [1]
> > > > > >
> > > > > >
> > > >
> > > >
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > > > > >
> > > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur <
> daradu...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > Andrey Gura, thank you for the answer!
> > > > > > >
> > > > > > > I agree that wrapping of 'init' method reduces the profit of
> watchdog
> > > > > > > service in case of PME worker, but in other cases, we should
> wrap all
> > > > > > > possible long sections on GridDhtPartitionExchangeFuture. For
> example
> > > > > > > 'onCacheChangeRequest' method or
> > > > > > > 'cctx.affinity().onCacheChangeRequest' inside because it may
> take
> > > > > > > significant time (reproducer attached).
> > > > > > >
> > > > > > > I only want to point out a possible issue which may allow to
> end-user
> > > > > > > halt the Ignite cluster accidentally.
> > > > > > >
> > > > > > > I'm sure that PME experts know how to fix this issue properly.
> > > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura  >
> > > >
> > > > wrote:
> > > > > > > >
> > > > > > > > Vyacheslav,
> > > > > > > >
> > > > > > > > Exchange worker is strongly tied with
> > > > > > > > GridDhtPartitionExchangeFuture#init and it is ok. Exchange
> worker
> > > >
> > > > also
> > > > > > > > shouldn't be blocked for long time but in reality it
> happens.It
> > > >
> > > > also
> > > > > > > > means that your change doesn't make sense.
> > > > > > > >
> > > > > > > > What actually make sense it is identification of places which
> > > > > > > > intentionally blocking. May be some places/actions should be
> > > >
> > > > braced by
> > > > > > > > blocking guards.
> > > > > > > >
> > > > > > > > If you have failing tests please make sure that your
> > > >
> > > > failureHandler is
> > > > > > > > NoOpFailureHandler or any other handler with
> ignoreFailureTypes =
> > > > > > > > [CRITICAL_WORKER_BLOCKED].
> > > > > > > >
> > > > > > > >
> > > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <
> > > > > >
> > > > > > daradu...@gmail.com>
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > Hi Igniters!
> > > > > > > > >
> > > > > > > > > Thank you for this important improvement!
> > > > > > > > >
> > > > > > > > > I've looked through implementation and noticed that
> > > > > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped
> in
> > > >
> > > > blocked
> > > > > > > > > section. This means it easy to halt the node in case of
> > > >
> > > > longrunning
> > > > > > > > > actions during PME, for example when we create a cache with
> > > > > > > > > StoreFactrory which connect to 3rd party DB.
> > > > > > > > >
> > > > > > > > > I'm not sure that it is the right 

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Andrey Gura
Guys,

why we need both config option and system property? I believe one way is enough.
On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov  wrote:
>
> Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737
>
> Fixed version is 2.7.
>
> В Пт, 28/09/2018 в 11:41 +0300, Alexey Goncharuk пишет:
> > Nikolay, I agree, a user should be able to disable both thread liveness
> > check and checkpoint read lock timeout check from config and a system
> > property.
> >
> > пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov :
> >
> > > Hello, Igniters.
> > >
> > > I found that this feature can't be disabled from config.
> > > The only way to disable it is from JMX bean.
> > >
> > > I think it very dangerous: If we have some corner case or a bug in this
> > > Watch Dog it can make Ignite unusable.
> > > I propose to implement possibility to disable this feature both - from
> > > config and from JVM options.
> > >
> > > What do you think?
> > >
> > > В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет:
> > > > Maxim,
> > > >
> > > > Thanks for being attentive! It's definitely a typo. Could you please
> > >
> > > create
> > > > an issue?
> > > >
> > > > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov :
> > > >
> > > > > Folks,
> > > > >
> > > > > I've found in `GridCachePartitionExchangeManager:2684` [1] (master
> > >
> > > branch)
> > > > > exchange future wrapped
> > > > > with double `blockingSectionEnd` method. Is it correct? I just want to
> > > > > understand this change and
> > > > > how should I use this in the future.
> > > > >
> > > > > Should I file a new issue to fix this? I think here
> > >
> > > `blockingSectionBegin`
> > > > > method should be used.
> > > > >
> > > > > -
> > > > > blockingSectionEnd();
> > > > >
> > > > > try {
> > > > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > > > } finally {
> > > > > blockingSectionEnd();
> > > > > }
> > > > >
> > > > >
> > > > > [1]
> > > > >
> > > > >
> > >
> > > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > > > >
> > > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur 
> > > > > wrote:
> > > > >
> > > > > > Andrey Gura, thank you for the answer!
> > > > > >
> > > > > > I agree that wrapping of 'init' method reduces the profit of 
> > > > > > watchdog
> > > > > > service in case of PME worker, but in other cases, we should wrap 
> > > > > > all
> > > > > > possible long sections on GridDhtPartitionExchangeFuture. For 
> > > > > > example
> > > > > > 'onCacheChangeRequest' method or
> > > > > > 'cctx.affinity().onCacheChangeRequest' inside because it may take
> > > > > > significant time (reproducer attached).
> > > > > >
> > > > > > I only want to point out a possible issue which may allow to 
> > > > > > end-user
> > > > > > halt the Ignite cluster accidentally.
> > > > > >
> > > > > > I'm sure that PME experts know how to fix this issue properly.
> > > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura 
> > >
> > > wrote:
> > > > > > >
> > > > > > > Vyacheslav,
> > > > > > >
> > > > > > > Exchange worker is strongly tied with
> > > > > > > GridDhtPartitionExchangeFuture#init and it is ok. Exchange worker
> > >
> > > also
> > > > > > > shouldn't be blocked for long time but in reality it happens.It
> > >
> > > also
> > > > > > > means that your change doesn't make sense.
> > > > > > >
> > > > > > > What actually make sense it is identification of places which
> > > > > > > intentionally blocking. May be some places/actions should be
> > >
> > > braced by
> > > > > > > blocking guards.
> > > > > > >
> > > > > > > If you have failing tests please make sure that your
> > >
> > > failureHandler is
> > > > > > > NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > > > > > > [CRITICAL_WORKER_BLOCKED].
> > > > > > >
> > > > > > >
> > > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <
> > > > >
> > > > > daradu...@gmail.com>
> > > > > > wrote:
> > > > > > > >
> > > > > > > > Hi Igniters!
> > > > > > > >
> > > > > > > > Thank you for this important improvement!
> > > > > > > >
> > > > > > > > I've looked through implementation and noticed that
> > > > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in
> > >
> > > blocked
> > > > > > > > section. This means it easy to halt the node in case of
> > >
> > > longrunning
> > > > > > > > actions during PME, for example when we create a cache with
> > > > > > > > StoreFactrory which connect to 3rd party DB.
> > > > > > > >
> > > > > > > > I'm not sure that it is the right behavior.
> > > > > > > >
> > > > > > > > I filled the issue [1] and prepared the PR [2] with reproducer
> > >
> > > and
> > > > > >
> > > > > > possible fix.
> > > > > > > >
> > > > > > > > Andrey, could you please look at and confirm that it makes 
> > > > > > > > sense?
> > > > > > > >
> > > > > > > > [1] 

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Nikolay Izhikov
Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737

Fixed version is 2.7.

В Пт, 28/09/2018 в 11:41 +0300, Alexey Goncharuk пишет:
> Nikolay, I agree, a user should be able to disable both thread liveness
> check and checkpoint read lock timeout check from config and a system
> property.
> 
> пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov :
> 
> > Hello, Igniters.
> > 
> > I found that this feature can't be disabled from config.
> > The only way to disable it is from JMX bean.
> > 
> > I think it very dangerous: If we have some corner case or a bug in this
> > Watch Dog it can make Ignite unusable.
> > I propose to implement possibility to disable this feature both - from
> > config and from JVM options.
> > 
> > What do you think?
> > 
> > В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет:
> > > Maxim,
> > > 
> > > Thanks for being attentive! It's definitely a typo. Could you please
> > 
> > create
> > > an issue?
> > > 
> > > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov :
> > > 
> > > > Folks,
> > > > 
> > > > I've found in `GridCachePartitionExchangeManager:2684` [1] (master
> > 
> > branch)
> > > > exchange future wrapped
> > > > with double `blockingSectionEnd` method. Is it correct? I just want to
> > > > understand this change and
> > > > how should I use this in the future.
> > > > 
> > > > Should I file a new issue to fix this? I think here
> > 
> > `blockingSectionBegin`
> > > > method should be used.
> > > > 
> > > > -
> > > > blockingSectionEnd();
> > > > 
> > > > try {
> > > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > > } finally {
> > > > blockingSectionEnd();
> > > > }
> > > > 
> > > > 
> > > > [1]
> > > > 
> > > > 
> > 
> > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > > > 
> > > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur 
> > > > wrote:
> > > > 
> > > > > Andrey Gura, thank you for the answer!
> > > > > 
> > > > > I agree that wrapping of 'init' method reduces the profit of watchdog
> > > > > service in case of PME worker, but in other cases, we should wrap all
> > > > > possible long sections on GridDhtPartitionExchangeFuture. For example
> > > > > 'onCacheChangeRequest' method or
> > > > > 'cctx.affinity().onCacheChangeRequest' inside because it may take
> > > > > significant time (reproducer attached).
> > > > > 
> > > > > I only want to point out a possible issue which may allow to end-user
> > > > > halt the Ignite cluster accidentally.
> > > > > 
> > > > > I'm sure that PME experts know how to fix this issue properly.
> > > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura 
> > 
> > wrote:
> > > > > > 
> > > > > > Vyacheslav,
> > > > > > 
> > > > > > Exchange worker is strongly tied with
> > > > > > GridDhtPartitionExchangeFuture#init and it is ok. Exchange worker
> > 
> > also
> > > > > > shouldn't be blocked for long time but in reality it happens.It
> > 
> > also
> > > > > > means that your change doesn't make sense.
> > > > > > 
> > > > > > What actually make sense it is identification of places which
> > > > > > intentionally blocking. May be some places/actions should be
> > 
> > braced by
> > > > > > blocking guards.
> > > > > > 
> > > > > > If you have failing tests please make sure that your
> > 
> > failureHandler is
> > > > > > NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > > > > > [CRITICAL_WORKER_BLOCKED].
> > > > > > 
> > > > > > 
> > > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <
> > > > 
> > > > daradu...@gmail.com>
> > > > > wrote:
> > > > > > > 
> > > > > > > Hi Igniters!
> > > > > > > 
> > > > > > > Thank you for this important improvement!
> > > > > > > 
> > > > > > > I've looked through implementation and noticed that
> > > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in
> > 
> > blocked
> > > > > > > section. This means it easy to halt the node in case of
> > 
> > longrunning
> > > > > > > actions during PME, for example when we create a cache with
> > > > > > > StoreFactrory which connect to 3rd party DB.
> > > > > > > 
> > > > > > > I'm not sure that it is the right behavior.
> > > > > > > 
> > > > > > > I filled the issue [1] and prepared the PR [2] with reproducer
> > 
> > and
> > > > > 
> > > > > possible fix.
> > > > > > > 
> > > > > > > Andrey, could you please look at and confirm that it makes sense?
> > > > > > > 
> > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <
> > 
> > stku...@gmail.com>
> > > > > 
> > > > > wrote:
> > > > > > > > 
> > > > > > > > Denis,
> > > > > > > > 
> > > > > > > > I've created the ticket [1] with short description of the
> > > > > 
> > > > > functionality.
> > > > > > > > 
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> 

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Alexey Goncharuk
Nikolay, I agree, a user should be able to disable both thread liveness
check and checkpoint read lock timeout check from config and a system
property.

пт, 28 сент. 2018 г. в 11:30, Nikolay Izhikov :

> Hello, Igniters.
>
> I found that this feature can't be disabled from config.
> The only way to disable it is from JMX bean.
>
> I think it very dangerous: If we have some corner case or a bug in this
> Watch Dog it can make Ignite unusable.
> I propose to implement possibility to disable this feature both - from
> config and from JVM options.
>
> What do you think?
>
> В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет:
> > Maxim,
> >
> > Thanks for being attentive! It's definitely a typo. Could you please
> create
> > an issue?
> >
> > чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov :
> >
> > > Folks,
> > >
> > > I've found in `GridCachePartitionExchangeManager:2684` [1] (master
> branch)
> > > exchange future wrapped
> > > with double `blockingSectionEnd` method. Is it correct? I just want to
> > > understand this change and
> > > how should I use this in the future.
> > >
> > > Should I file a new issue to fix this? I think here
> `blockingSectionBegin`
> > > method should be used.
> > >
> > > -
> > > blockingSectionEnd();
> > >
> > > try {
> > > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > > } finally {
> > > blockingSectionEnd();
> > > }
> > >
> > >
> > > [1]
> > >
> > >
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > >
> > > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur 
> > > wrote:
> > >
> > > > Andrey Gura, thank you for the answer!
> > > >
> > > > I agree that wrapping of 'init' method reduces the profit of watchdog
> > > > service in case of PME worker, but in other cases, we should wrap all
> > > > possible long sections on GridDhtPartitionExchangeFuture. For example
> > > > 'onCacheChangeRequest' method or
> > > > 'cctx.affinity().onCacheChangeRequest' inside because it may take
> > > > significant time (reproducer attached).
> > > >
> > > > I only want to point out a possible issue which may allow to end-user
> > > > halt the Ignite cluster accidentally.
> > > >
> > > > I'm sure that PME experts know how to fix this issue properly.
> > > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura 
> wrote:
> > > > >
> > > > > Vyacheslav,
> > > > >
> > > > > Exchange worker is strongly tied with
> > > > > GridDhtPartitionExchangeFuture#init and it is ok. Exchange worker
> also
> > > > > shouldn't be blocked for long time but in reality it happens.It
> also
> > > > > means that your change doesn't make sense.
> > > > >
> > > > > What actually make sense it is identification of places which
> > > > > intentionally blocking. May be some places/actions should be
> braced by
> > > > > blocking guards.
> > > > >
> > > > > If you have failing tests please make sure that your
> failureHandler is
> > > > > NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > > > > [CRITICAL_WORKER_BLOCKED].
> > > > >
> > > > >
> > > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <
> > >
> > > daradu...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > > Hi Igniters!
> > > > > >
> > > > > > Thank you for this important improvement!
> > > > > >
> > > > > > I've looked through implementation and noticed that
> > > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in
> blocked
> > > > > > section. This means it easy to halt the node in case of
> longrunning
> > > > > > actions during PME, for example when we create a cache with
> > > > > > StoreFactrory which connect to 3rd party DB.
> > > > > >
> > > > > > I'm not sure that it is the right behavior.
> > > > > >
> > > > > > I filled the issue [1] and prepared the PR [2] with reproducer
> and
> > > >
> > > > possible fix.
> > > > > >
> > > > > > Andrey, could you please look at and confirm that it makes sense?
> > > > > >
> > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov <
> stku...@gmail.com>
> > > >
> > > > wrote:
> > > > > > >
> > > > > > > Denis,
> > > > > > >
> > > > > > > I've created the ticket [1] with short description of the
> > > >
> > > > functionality.
> > > > > > >
> > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > > > > >
> > > > > > >
> > > > > > > пн, 24 сент. 2018 г. в 17:46, Denis Magda :
> > > > > > >
> > > > > > > > Andrey K. and G.,
> > > > > > > >
> > > > > > > > Thanks, do we have a documentation ticket created? Prachi
> > >
> > > (copied)
> > > > can help
> > > > > > > > with the documentation.
> > > > > > > >
> > > > > > > > --
> > > > > > > > Denis
> > > > > > > >
> > > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura <
> ag...@apache.org>
> > > >
> > > > wrote:
> > > > > > > >
> > > > > > > > > Andrey,

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Nikolay Izhikov
Hello, Igniters.

I found that this feature can't be disabled from config.
The only way to disable it is from JMX bean.

I think it very dangerous: If we have some corner case or a bug in this Watch 
Dog it can make Ignite unusable.
I propose to implement possibility to disable this feature both - from config 
and from JVM options.

What do you think?

В Чт, 27/09/2018 в 16:14 +0300, Andrey Kuznetsov пишет:
> Maxim,
> 
> Thanks for being attentive! It's definitely a typo. Could you please create
> an issue?
> 
> чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov :
> 
> > Folks,
> > 
> > I've found in `GridCachePartitionExchangeManager:2684` [1] (master branch)
> > exchange future wrapped
> > with double `blockingSectionEnd` method. Is it correct? I just want to
> > understand this change and
> > how should I use this in the future.
> > 
> > Should I file a new issue to fix this? I think here `blockingSectionBegin`
> > method should be used.
> > 
> > -
> > blockingSectionEnd();
> > 
> > try {
> > resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> > } finally {
> > blockingSectionEnd();
> > }
> > 
> > 
> > [1]
> > 
> > https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
> > 
> > On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur 
> > wrote:
> > 
> > > Andrey Gura, thank you for the answer!
> > > 
> > > I agree that wrapping of 'init' method reduces the profit of watchdog
> > > service in case of PME worker, but in other cases, we should wrap all
> > > possible long sections on GridDhtPartitionExchangeFuture. For example
> > > 'onCacheChangeRequest' method or
> > > 'cctx.affinity().onCacheChangeRequest' inside because it may take
> > > significant time (reproducer attached).
> > > 
> > > I only want to point out a possible issue which may allow to end-user
> > > halt the Ignite cluster accidentally.
> > > 
> > > I'm sure that PME experts know how to fix this issue properly.
> > > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura  wrote:
> > > > 
> > > > Vyacheslav,
> > > > 
> > > > Exchange worker is strongly tied with
> > > > GridDhtPartitionExchangeFuture#init and it is ok. Exchange worker also
> > > > shouldn't be blocked for long time but in reality it happens.It also
> > > > means that your change doesn't make sense.
> > > > 
> > > > What actually make sense it is identification of places which
> > > > intentionally blocking. May be some places/actions should be braced by
> > > > blocking guards.
> > > > 
> > > > If you have failing tests please make sure that your failureHandler is
> > > > NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > > > [CRITICAL_WORKER_BLOCKED].
> > > > 
> > > > 
> > > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <
> > 
> > daradu...@gmail.com>
> > > wrote:
> > > > > 
> > > > > Hi Igniters!
> > > > > 
> > > > > Thank you for this important improvement!
> > > > > 
> > > > > I've looked through implementation and noticed that
> > > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in blocked
> > > > > section. This means it easy to halt the node in case of longrunning
> > > > > actions during PME, for example when we create a cache with
> > > > > StoreFactrory which connect to 3rd party DB.
> > > > > 
> > > > > I'm not sure that it is the right behavior.
> > > > > 
> > > > > I filled the issue [1] and prepared the PR [2] with reproducer and
> > > 
> > > possible fix.
> > > > > 
> > > > > Andrey, could you please look at and confirm that it makes sense?
> > > > > 
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > > [2] https://github.com/apache/ignite/pull/4845
> > > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov 
> > > 
> > > wrote:
> > > > > > 
> > > > > > Denis,
> > > > > > 
> > > > > > I've created the ticket [1] with short description of the
> > > 
> > > functionality.
> > > > > > 
> > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > > > > 
> > > > > > 
> > > > > > пн, 24 сент. 2018 г. в 17:46, Denis Magda :
> > > > > > 
> > > > > > > Andrey K. and G.,
> > > > > > > 
> > > > > > > Thanks, do we have a documentation ticket created? Prachi
> > 
> > (copied)
> > > can help
> > > > > > > with the documentation.
> > > > > > > 
> > > > > > > --
> > > > > > > Denis
> > > > > > > 
> > > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura 
> > > 
> > > wrote:
> > > > > > > 
> > > > > > > > Andrey,
> > > > > > > > 
> > > > > > > > finally your change is merged to master branch. Congratulations
> > > 
> > > and
> > > > > > > > thank you very much! :)
> > > > > > > > 
> > > > > > > > I think that the next step is feature that will allow signal
> > > 
> > > about
> > > > > > > > blocked threads to the monitoring tools via MXBean.
> > > > > > > > 
> > > > > > > > I hope you will continue development of this feature and
> > 
> > provide
> > > your
> > > > > > > > vision in new JIRA 

Re: Critical worker threads liveness checking drawbacks

2018-09-27 Thread Andrey Kuznetsov
Maxim,

Thanks for being attentive! It's definitely a typo. Could you please create
an issue?

чт, 27 сент. 2018 г. в 16:00, Maxim Muzafarov :

> Folks,
>
> I've found in `GridCachePartitionExchangeManager:2684` [1] (master branch)
> exchange future wrapped
> with double `blockingSectionEnd` method. Is it correct? I just want to
> understand this change and
> how should I use this in the future.
>
> Should I file a new issue to fix this? I think here `blockingSectionBegin`
> method should be used.
>
> -
> blockingSectionEnd();
>
> try {
> resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
> } finally {
> blockingSectionEnd();
> }
>
>
> [1]
>
> https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684
>
> On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur 
> wrote:
>
> > Andrey Gura, thank you for the answer!
> >
> > I agree that wrapping of 'init' method reduces the profit of watchdog
> > service in case of PME worker, but in other cases, we should wrap all
> > possible long sections on GridDhtPartitionExchangeFuture. For example
> > 'onCacheChangeRequest' method or
> > 'cctx.affinity().onCacheChangeRequest' inside because it may take
> > significant time (reproducer attached).
> >
> > I only want to point out a possible issue which may allow to end-user
> > halt the Ignite cluster accidentally.
> >
> > I'm sure that PME experts know how to fix this issue properly.
> > On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura  wrote:
> > >
> > > Vyacheslav,
> > >
> > > Exchange worker is strongly tied with
> > > GridDhtPartitionExchangeFuture#init and it is ok. Exchange worker also
> > > shouldn't be blocked for long time but in reality it happens.It also
> > > means that your change doesn't make sense.
> > >
> > > What actually make sense it is identification of places which
> > > intentionally blocking. May be some places/actions should be braced by
> > > blocking guards.
> > >
> > > If you have failing tests please make sure that your failureHandler is
> > > NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > > [CRITICAL_WORKER_BLOCKED].
> > >
> > >
> > > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur <
> daradu...@gmail.com>
> > wrote:
> > > >
> > > > Hi Igniters!
> > > >
> > > > Thank you for this important improvement!
> > > >
> > > > I've looked through implementation and noticed that
> > > > GridDhtPartitionsExchangeFuture#init has not been wrapped in blocked
> > > > section. This means it easy to halt the node in case of longrunning
> > > > actions during PME, for example when we create a cache with
> > > > StoreFactrory which connect to 3rd party DB.
> > > >
> > > > I'm not sure that it is the right behavior.
> > > >
> > > > I filled the issue [1] and prepared the PR [2] with reproducer and
> > possible fix.
> > > >
> > > > Andrey, could you please look at and confirm that it makes sense?
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > > [2] https://github.com/apache/ignite/pull/4845
> > > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov 
> > wrote:
> > > > >
> > > > > Denis,
> > > > >
> > > > > I've created the ticket [1] with short description of the
> > functionality.
> > > > >
> > > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > > >
> > > > >
> > > > > пн, 24 сент. 2018 г. в 17:46, Denis Magda :
> > > > >
> > > > > > Andrey K. and G.,
> > > > > >
> > > > > > Thanks, do we have a documentation ticket created? Prachi
> (copied)
> > can help
> > > > > > with the documentation.
> > > > > >
> > > > > > --
> > > > > > Denis
> > > > > >
> > > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura 
> > wrote:
> > > > > >
> > > > > > > Andrey,
> > > > > > >
> > > > > > > finally your change is merged to master branch. Congratulations
> > and
> > > > > > > thank you very much! :)
> > > > > > >
> > > > > > > I think that the next step is feature that will allow signal
> > about
> > > > > > > blocked threads to the monitoring tools via MXBean.
> > > > > > >
> > > > > > > I hope you will continue development of this feature and
> provide
> > your
> > > > > > > vision in new JIRA issue.
> > > > > > >
> > > > > > >
> > > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <
> > stku...@gmail.com>
> > > > > > > wrote:
> > > > > > > >
> > > > > > > > David, Maxim!
> > > > > > > >
> > > > > > > > Thanks a lot for you ideas. Unfortunately, I can't adopt all
> > of them
> > > > > > > right
> > > > > > > > now: the scope is much broader than the scope of the change I
> > > > > > implement.
> > > > > > > I
> > > > > > > > have had a talk to a group of Ignite commiters, and we agreed
> > to
> > > > > > complete
> > > > > > > > the change as follows.
> > > > > > > > - Blocking instructions in system-critical which may
> resonably
> > last
> > > > > > long
> > > > > > > > should be explicitly excluded from the monitoring.
> > > > > > > 

Re: Critical worker threads liveness checking drawbacks

2018-09-27 Thread Maxim Muzafarov
Folks,

I've found in `GridCachePartitionExchangeManager:2684` [1] (master branch)
exchange future wrapped
with double `blockingSectionEnd` method. Is it correct? I just want to
understand this change and
how should I use this in the future.

Should I file a new issue to fix this? I think here `blockingSectionBegin`
method should be used.

-
blockingSectionEnd();

try {
resVer = exchFut.get(exchTimeout, TimeUnit.MILLISECONDS);
} finally {
blockingSectionEnd();
}


[1]
https://github.com/apache/ignite/blob/master/modules/core/src/main/java/org/apache/ignite/internal/processors/cache/GridCachePartitionExchangeManager.java#L2684

On Wed, 26 Sep 2018 at 22:47 Vyacheslav Daradur  wrote:

> Andrey Gura, thank you for the answer!
>
> I agree that wrapping of 'init' method reduces the profit of watchdog
> service in case of PME worker, but in other cases, we should wrap all
> possible long sections on GridDhtPartitionExchangeFuture. For example
> 'onCacheChangeRequest' method or
> 'cctx.affinity().onCacheChangeRequest' inside because it may take
> significant time (reproducer attached).
>
> I only want to point out a possible issue which may allow to end-user
> halt the Ignite cluster accidentally.
>
> I'm sure that PME experts know how to fix this issue properly.
> On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura  wrote:
> >
> > Vyacheslav,
> >
> > Exchange worker is strongly tied with
> > GridDhtPartitionExchangeFuture#init and it is ok. Exchange worker also
> > shouldn't be blocked for long time but in reality it happens.It also
> > means that your change doesn't make sense.
> >
> > What actually make sense it is identification of places which
> > intentionally blocking. May be some places/actions should be braced by
> > blocking guards.
> >
> > If you have failing tests please make sure that your failureHandler is
> > NoOpFailureHandler or any other handler with ignoreFailureTypes =
> > [CRITICAL_WORKER_BLOCKED].
> >
> >
> > On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur 
> wrote:
> > >
> > > Hi Igniters!
> > >
> > > Thank you for this important improvement!
> > >
> > > I've looked through implementation and noticed that
> > > GridDhtPartitionsExchangeFuture#init has not been wrapped in blocked
> > > section. This means it easy to halt the node in case of longrunning
> > > actions during PME, for example when we create a cache with
> > > StoreFactrory which connect to 3rd party DB.
> > >
> > > I'm not sure that it is the right behavior.
> > >
> > > I filled the issue [1] and prepared the PR [2] with reproducer and
> possible fix.
> > >
> > > Andrey, could you please look at and confirm that it makes sense?
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > > [2] https://github.com/apache/ignite/pull/4845
> > > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov 
> wrote:
> > > >
> > > > Denis,
> > > >
> > > > I've created the ticket [1] with short description of the
> functionality.
> > > >
> > > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > > >
> > > >
> > > > пн, 24 сент. 2018 г. в 17:46, Denis Magda :
> > > >
> > > > > Andrey K. and G.,
> > > > >
> > > > > Thanks, do we have a documentation ticket created? Prachi (copied)
> can help
> > > > > with the documentation.
> > > > >
> > > > > --
> > > > > Denis
> > > > >
> > > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura 
> wrote:
> > > > >
> > > > > > Andrey,
> > > > > >
> > > > > > finally your change is merged to master branch. Congratulations
> and
> > > > > > thank you very much! :)
> > > > > >
> > > > > > I think that the next step is feature that will allow signal
> about
> > > > > > blocked threads to the monitoring tools via MXBean.
> > > > > >
> > > > > > I hope you will continue development of this feature and provide
> your
> > > > > > vision in new JIRA issue.
> > > > > >
> > > > > >
> > > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov <
> stku...@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > David, Maxim!
> > > > > > >
> > > > > > > Thanks a lot for you ideas. Unfortunately, I can't adopt all
> of them
> > > > > > right
> > > > > > > now: the scope is much broader than the scope of the change I
> > > > > implement.
> > > > > > I
> > > > > > > have had a talk to a group of Ignite commiters, and we agreed
> to
> > > > > complete
> > > > > > > the change as follows.
> > > > > > > - Blocking instructions in system-critical which may resonably
> last
> > > > > long
> > > > > > > should be explicitly excluded from the monitoring.
> > > > > > > - Failure handlers should have a setting to suppress some
> failures on
> > > > > > > per-failure-type basis.
> > > > > > > According to this I have updated the implementation: [1]
> > > > > > >
> > > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > > >
> > > > > > > пн, 10 сент. 2018 г. в 22:35, David Harvey <
> syssoft...@gmail.com>:
> > > > > > >
> > > > > > > > When I've done this before,I've needed to find the oldest
> 

Re: Critical worker threads liveness checking drawbacks

2018-09-26 Thread Vyacheslav Daradur
Andrey Gura, thank you for the answer!

I agree that wrapping of 'init' method reduces the profit of watchdog
service in case of PME worker, but in other cases, we should wrap all
possible long sections on GridDhtPartitionExchangeFuture. For example
'onCacheChangeRequest' method or
'cctx.affinity().onCacheChangeRequest' inside because it may take
significant time (reproducer attached).

I only want to point out a possible issue which may allow to end-user
halt the Ignite cluster accidentally.

I'm sure that PME experts know how to fix this issue properly.
On Wed, Sep 26, 2018 at 10:28 PM Andrey Gura  wrote:
>
> Vyacheslav,
>
> Exchange worker is strongly tied with
> GridDhtPartitionExchangeFuture#init and it is ok. Exchange worker also
> shouldn't be blocked for long time but in reality it happens.It also
> means that your change doesn't make sense.
>
> What actually make sense it is identification of places which
> intentionally blocking. May be some places/actions should be braced by
> blocking guards.
>
> If you have failing tests please make sure that your failureHandler is
> NoOpFailureHandler or any other handler with ignoreFailureTypes =
> [CRITICAL_WORKER_BLOCKED].
>
>
> On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur  
> wrote:
> >
> > Hi Igniters!
> >
> > Thank you for this important improvement!
> >
> > I've looked through implementation and noticed that
> > GridDhtPartitionsExchangeFuture#init has not been wrapped in blocked
> > section. This means it easy to halt the node in case of longrunning
> > actions during PME, for example when we create a cache with
> > StoreFactrory which connect to 3rd party DB.
> >
> > I'm not sure that it is the right behavior.
> >
> > I filled the issue [1] and prepared the PR [2] with reproducer and possible 
> > fix.
> >
> > Andrey, could you please look at and confirm that it makes sense?
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-9710
> > [2] https://github.com/apache/ignite/pull/4845
> > On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov  wrote:
> > >
> > > Denis,
> > >
> > > I've created the ticket [1] with short description of the functionality.
> > >
> > > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> > >
> > >
> > > пн, 24 сент. 2018 г. в 17:46, Denis Magda :
> > >
> > > > Andrey K. and G.,
> > > >
> > > > Thanks, do we have a documentation ticket created? Prachi (copied) can 
> > > > help
> > > > with the documentation.
> > > >
> > > > --
> > > > Denis
> > > >
> > > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura  wrote:
> > > >
> > > > > Andrey,
> > > > >
> > > > > finally your change is merged to master branch. Congratulations and
> > > > > thank you very much! :)
> > > > >
> > > > > I think that the next step is feature that will allow signal about
> > > > > blocked threads to the monitoring tools via MXBean.
> > > > >
> > > > > I hope you will continue development of this feature and provide your
> > > > > vision in new JIRA issue.
> > > > >
> > > > >
> > > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov 
> > > > > wrote:
> > > > > >
> > > > > > David, Maxim!
> > > > > >
> > > > > > Thanks a lot for you ideas. Unfortunately, I can't adopt all of them
> > > > > right
> > > > > > now: the scope is much broader than the scope of the change I
> > > > implement.
> > > > > I
> > > > > > have had a talk to a group of Ignite commiters, and we agreed to
> > > > complete
> > > > > > the change as follows.
> > > > > > - Blocking instructions in system-critical which may resonably last
> > > > long
> > > > > > should be explicitly excluded from the monitoring.
> > > > > > - Failure handlers should have a setting to suppress some failures 
> > > > > > on
> > > > > > per-failure-type basis.
> > > > > > According to this I have updated the implementation: [1]
> > > > > >
> > > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > > >
> > > > > > пн, 10 сент. 2018 г. в 22:35, David Harvey :
> > > > > >
> > > > > > > When I've done this before,I've needed to find the oldest  thread,
> > > > and
> > > > > kill
> > > > > > > the node running that.   From a language standpoint, Maxim's 
> > > > > > > "without
> > > > > > > progress" better than "heartbeat".   For example, what I'm most
> > > > > interested
> > > > > > > in on a distributed system is which thread started the work it has
> > > > not
> > > > > > > completed the earliest, and when did that thread last make forward
> > > > > > > process. You don't want to kill a node because a thread is
> > > > waiting
> > > > > on a
> > > > > > > lock held by a thread that went off-node and has not gotten a
> > > > response.
> > > > > > > If you don't understand the dependency relationships, you will 
> > > > > > > make
> > > > > > > incorrect recovery decisions.
> > > > > > >
> > > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov 
> > > > > > > 
> > > > > > > wrote:
> > > > > > >
> > > > > > > > I think we should find exact answers to these questions:
> > > > > > > >  1. What 

Re: Critical worker threads liveness checking drawbacks

2018-09-26 Thread Andrey Gura
Vyacheslav,

Exchange worker is strongly tied with
GridDhtPartitionExchangeFuture#init and it is ok. Exchange worker also
shouldn't be blocked for long time but in reality it happens.It also
means that your change doesn't make sense.

What actually make sense it is identification of places which
intentionally blocking. May be some places/actions should be braced by
blocking guards.

If you have failing tests please make sure that your failureHandler is
NoOpFailureHandler or any other handler with ignoreFailureTypes =
[CRITICAL_WORKER_BLOCKED].


On Wed, Sep 26, 2018 at 9:43 PM Vyacheslav Daradur  wrote:
>
> Hi Igniters!
>
> Thank you for this important improvement!
>
> I've looked through implementation and noticed that
> GridDhtPartitionsExchangeFuture#init has not been wrapped in blocked
> section. This means it easy to halt the node in case of longrunning
> actions during PME, for example when we create a cache with
> StoreFactrory which connect to 3rd party DB.
>
> I'm not sure that it is the right behavior.
>
> I filled the issue [1] and prepared the PR [2] with reproducer and possible 
> fix.
>
> Andrey, could you please look at and confirm that it makes sense?
>
> [1] https://issues.apache.org/jira/browse/IGNITE-9710
> [2] https://github.com/apache/ignite/pull/4845
> On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov  wrote:
> >
> > Denis,
> >
> > I've created the ticket [1] with short description of the functionality.
> >
> > [1] https://issues.apache.org/jira/browse/IGNITE-9679
> >
> >
> > пн, 24 сент. 2018 г. в 17:46, Denis Magda :
> >
> > > Andrey K. and G.,
> > >
> > > Thanks, do we have a documentation ticket created? Prachi (copied) can 
> > > help
> > > with the documentation.
> > >
> > > --
> > > Denis
> > >
> > > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura  wrote:
> > >
> > > > Andrey,
> > > >
> > > > finally your change is merged to master branch. Congratulations and
> > > > thank you very much! :)
> > > >
> > > > I think that the next step is feature that will allow signal about
> > > > blocked threads to the monitoring tools via MXBean.
> > > >
> > > > I hope you will continue development of this feature and provide your
> > > > vision in new JIRA issue.
> > > >
> > > >
> > > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov 
> > > > wrote:
> > > > >
> > > > > David, Maxim!
> > > > >
> > > > > Thanks a lot for you ideas. Unfortunately, I can't adopt all of them
> > > > right
> > > > > now: the scope is much broader than the scope of the change I
> > > implement.
> > > > I
> > > > > have had a talk to a group of Ignite commiters, and we agreed to
> > > complete
> > > > > the change as follows.
> > > > > - Blocking instructions in system-critical which may resonably last
> > > long
> > > > > should be explicitly excluded from the monitoring.
> > > > > - Failure handlers should have a setting to suppress some failures on
> > > > > per-failure-type basis.
> > > > > According to this I have updated the implementation: [1]
> > > > >
> > > > > [1] https://github.com/apache/ignite/pull/4089
> > > > >
> > > > > пн, 10 сент. 2018 г. в 22:35, David Harvey :
> > > > >
> > > > > > When I've done this before,I've needed to find the oldest  thread,
> > > and
> > > > kill
> > > > > > the node running that.   From a language standpoint, Maxim's 
> > > > > > "without
> > > > > > progress" better than "heartbeat".   For example, what I'm most
> > > > interested
> > > > > > in on a distributed system is which thread started the work it has
> > > not
> > > > > > completed the earliest, and when did that thread last make forward
> > > > > > process. You don't want to kill a node because a thread is
> > > waiting
> > > > on a
> > > > > > lock held by a thread that went off-node and has not gotten a
> > > response.
> > > > > > If you don't understand the dependency relationships, you will make
> > > > > > incorrect recovery decisions.
> > > > > >
> > > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov 
> > > > > > wrote:
> > > > > >
> > > > > > > I think we should find exact answers to these questions:
> > > > > > >  1. What `critical` issue exactly is?
> > > > > > >  2. How can we find critical issues?
> > > > > > >  3. How can we handle critical issues?
> > > > > > >
> > > > > > > First,
> > > > > > >  - Ignore uninterruptable actions (e.g. worker\service shutdown)
> > > > > > >  - Long I/O operations (should be a configurable timeout for each
> > > > type of
> > > > > > > usage)
> > > > > > >  - Infinite loops
> > > > > > >  - Stalled\deadlocked threads (and\or too many parked threads,
> > > > exclude
> > > > > > I/O)
> > > > > > >
> > > > > > > Second,
> > > > > > >  - The working queue is without progress (e.g. disco, exchange
> > > > queues)
> > > > > > >  - Work hasn't been completed since the last heartbeat (checking
> > > > > > > milestones)
> > > > > > >  - Too many system resources used by a thread for the long period
> > > of
> > > > time
> > > > > > > (allocated memory, CPU)
> > > > > > >  

Re: Critical worker threads liveness checking drawbacks

2018-09-26 Thread Vyacheslav Daradur
Hi Igniters!

Thank you for this important improvement!

I've looked through implementation and noticed that
GridDhtPartitionsExchangeFuture#init has not been wrapped in blocked
section. This means it easy to halt the node in case of longrunning
actions during PME, for example when we create a cache with
StoreFactrory which connect to 3rd party DB.

I'm not sure that it is the right behavior.

I filled the issue [1] and prepared the PR [2] with reproducer and possible fix.

Andrey, could you please look at and confirm that it makes sense?

[1] https://issues.apache.org/jira/browse/IGNITE-9710
[2] https://github.com/apache/ignite/pull/4845
On Mon, Sep 24, 2018 at 9:46 PM Andrey Kuznetsov  wrote:
>
> Denis,
>
> I've created the ticket [1] with short description of the functionality.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-9679
>
>
> пн, 24 сент. 2018 г. в 17:46, Denis Magda :
>
> > Andrey K. and G.,
> >
> > Thanks, do we have a documentation ticket created? Prachi (copied) can help
> > with the documentation.
> >
> > --
> > Denis
> >
> > On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura  wrote:
> >
> > > Andrey,
> > >
> > > finally your change is merged to master branch. Congratulations and
> > > thank you very much! :)
> > >
> > > I think that the next step is feature that will allow signal about
> > > blocked threads to the monitoring tools via MXBean.
> > >
> > > I hope you will continue development of this feature and provide your
> > > vision in new JIRA issue.
> > >
> > >
> > > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov 
> > > wrote:
> > > >
> > > > David, Maxim!
> > > >
> > > > Thanks a lot for you ideas. Unfortunately, I can't adopt all of them
> > > right
> > > > now: the scope is much broader than the scope of the change I
> > implement.
> > > I
> > > > have had a talk to a group of Ignite commiters, and we agreed to
> > complete
> > > > the change as follows.
> > > > - Blocking instructions in system-critical which may resonably last
> > long
> > > > should be explicitly excluded from the monitoring.
> > > > - Failure handlers should have a setting to suppress some failures on
> > > > per-failure-type basis.
> > > > According to this I have updated the implementation: [1]
> > > >
> > > > [1] https://github.com/apache/ignite/pull/4089
> > > >
> > > > пн, 10 сент. 2018 г. в 22:35, David Harvey :
> > > >
> > > > > When I've done this before,I've needed to find the oldest  thread,
> > and
> > > kill
> > > > > the node running that.   From a language standpoint, Maxim's "without
> > > > > progress" better than "heartbeat".   For example, what I'm most
> > > interested
> > > > > in on a distributed system is which thread started the work it has
> > not
> > > > > completed the earliest, and when did that thread last make forward
> > > > > process. You don't want to kill a node because a thread is
> > waiting
> > > on a
> > > > > lock held by a thread that went off-node and has not gotten a
> > response.
> > > > > If you don't understand the dependency relationships, you will make
> > > > > incorrect recovery decisions.
> > > > >
> > > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov 
> > > > > wrote:
> > > > >
> > > > > > I think we should find exact answers to these questions:
> > > > > >  1. What `critical` issue exactly is?
> > > > > >  2. How can we find critical issues?
> > > > > >  3. How can we handle critical issues?
> > > > > >
> > > > > > First,
> > > > > >  - Ignore uninterruptable actions (e.g. worker\service shutdown)
> > > > > >  - Long I/O operations (should be a configurable timeout for each
> > > type of
> > > > > > usage)
> > > > > >  - Infinite loops
> > > > > >  - Stalled\deadlocked threads (and\or too many parked threads,
> > > exclude
> > > > > I/O)
> > > > > >
> > > > > > Second,
> > > > > >  - The working queue is without progress (e.g. disco, exchange
> > > queues)
> > > > > >  - Work hasn't been completed since the last heartbeat (checking
> > > > > > milestones)
> > > > > >  - Too many system resources used by a thread for the long period
> > of
> > > time
> > > > > > (allocated memory, CPU)
> > > > > >  - Timing fields associated with each thread status exceeded a
> > > maximum
> > > > > time
> > > > > > limit.
> > > > > >
> > > > > > Third (not too many options here),
> > > > > >  - `log everything` should be the default behaviour in all these
> > > cases,
> > > > > > since it may be difficult to find the cause after the restart.
> > > > > >  - Wait some interval of time and kill the hanging node (cluster
> > > should
> > > > > be
> > > > > > configured stable enough)
> > > > > >
> > > > > > Questions,
> > > > > >  - Not sure, but can workers miss their heartbeat deadlines if CPU
> > > loads
> > > > > up
> > > > > > to 80%-90%? Bursts of momentary overloads can be
> > > > > > expected behaviour as a normal part of system operations.
> > > > > >  - Why do we decide that critical thread should monitor each other?
> > > For
> > > > > > instance, if all 

Re: Critical worker threads liveness checking drawbacks

2018-09-24 Thread Andrey Kuznetsov
Denis,

I've created the ticket [1] with short description of the functionality.

[1] https://issues.apache.org/jira/browse/IGNITE-9679


пн, 24 сент. 2018 г. в 17:46, Denis Magda :

> Andrey K. and G.,
>
> Thanks, do we have a documentation ticket created? Prachi (copied) can help
> with the documentation.
>
> --
> Denis
>
> On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura  wrote:
>
> > Andrey,
> >
> > finally your change is merged to master branch. Congratulations and
> > thank you very much! :)
> >
> > I think that the next step is feature that will allow signal about
> > blocked threads to the monitoring tools via MXBean.
> >
> > I hope you will continue development of this feature and provide your
> > vision in new JIRA issue.
> >
> >
> > On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov 
> > wrote:
> > >
> > > David, Maxim!
> > >
> > > Thanks a lot for you ideas. Unfortunately, I can't adopt all of them
> > right
> > > now: the scope is much broader than the scope of the change I
> implement.
> > I
> > > have had a talk to a group of Ignite commiters, and we agreed to
> complete
> > > the change as follows.
> > > - Blocking instructions in system-critical which may resonably last
> long
> > > should be explicitly excluded from the monitoring.
> > > - Failure handlers should have a setting to suppress some failures on
> > > per-failure-type basis.
> > > According to this I have updated the implementation: [1]
> > >
> > > [1] https://github.com/apache/ignite/pull/4089
> > >
> > > пн, 10 сент. 2018 г. в 22:35, David Harvey :
> > >
> > > > When I've done this before,I've needed to find the oldest  thread,
> and
> > kill
> > > > the node running that.   From a language standpoint, Maxim's "without
> > > > progress" better than "heartbeat".   For example, what I'm most
> > interested
> > > > in on a distributed system is which thread started the work it has
> not
> > > > completed the earliest, and when did that thread last make forward
> > > > process. You don't want to kill a node because a thread is
> waiting
> > on a
> > > > lock held by a thread that went off-node and has not gotten a
> response.
> > > > If you don't understand the dependency relationships, you will make
> > > > incorrect recovery decisions.
> > > >
> > > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov 
> > > > wrote:
> > > >
> > > > > I think we should find exact answers to these questions:
> > > > >  1. What `critical` issue exactly is?
> > > > >  2. How can we find critical issues?
> > > > >  3. How can we handle critical issues?
> > > > >
> > > > > First,
> > > > >  - Ignore uninterruptable actions (e.g. worker\service shutdown)
> > > > >  - Long I/O operations (should be a configurable timeout for each
> > type of
> > > > > usage)
> > > > >  - Infinite loops
> > > > >  - Stalled\deadlocked threads (and\or too many parked threads,
> > exclude
> > > > I/O)
> > > > >
> > > > > Second,
> > > > >  - The working queue is without progress (e.g. disco, exchange
> > queues)
> > > > >  - Work hasn't been completed since the last heartbeat (checking
> > > > > milestones)
> > > > >  - Too many system resources used by a thread for the long period
> of
> > time
> > > > > (allocated memory, CPU)
> > > > >  - Timing fields associated with each thread status exceeded a
> > maximum
> > > > time
> > > > > limit.
> > > > >
> > > > > Third (not too many options here),
> > > > >  - `log everything` should be the default behaviour in all these
> > cases,
> > > > > since it may be difficult to find the cause after the restart.
> > > > >  - Wait some interval of time and kill the hanging node (cluster
> > should
> > > > be
> > > > > configured stable enough)
> > > > >
> > > > > Questions,
> > > > >  - Not sure, but can workers miss their heartbeat deadlines if CPU
> > loads
> > > > up
> > > > > to 80%-90%? Bursts of momentary overloads can be
> > > > > expected behaviour as a normal part of system operations.
> > > > >  - Why do we decide that critical thread should monitor each other?
> > For
> > > > > instance, if all the tasks were blocked and unable to run,
> > > > > node reset would never occur. As for me, a better solution is
> to
> > use
> > > > a
> > > > > separate monitor thread or pool (maybe both with software
> > > > > and hardware checks) that not only checks heartbeats but
> > monitors the
> > > > > other system as well.
> > > > >
> > > > > On Mon, 10 Sep 2018 at 00:07 David Harvey 
> > wrote:
> > > > >
> > > > > > It would be safer to restart the entire cluster than to remove
> the
> > last
> > > > > > node for a cache that should be redundant.
> > > > > >
> > > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura 
> wrote:
> > > > > >
> > > > > > > Hi,
> > > > > > >
> > > > > > > I agree with Yakov that we can provide some option that manage
> > worker
> > > > > > > liveness checker behavior in case of observing that some worker
> > is
> > > > > > > blocked too long.
> > > > > > > At least it will  some workaround for cases when 

Re: Critical worker threads liveness checking drawbacks

2018-09-24 Thread Denis Magda
Andrey K. and G.,

Thanks, do we have a documentation ticket created? Prachi (copied) can help
with the documentation.

--
Denis

On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura  wrote:

> Andrey,
>
> finally your change is merged to master branch. Congratulations and
> thank you very much! :)
>
> I think that the next step is feature that will allow signal about
> blocked threads to the monitoring tools via MXBean.
>
> I hope you will continue development of this feature and provide your
> vision in new JIRA issue.
>
>
> On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov 
> wrote:
> >
> > David, Maxim!
> >
> > Thanks a lot for you ideas. Unfortunately, I can't adopt all of them
> right
> > now: the scope is much broader than the scope of the change I implement.
> I
> > have had a talk to a group of Ignite commiters, and we agreed to complete
> > the change as follows.
> > - Blocking instructions in system-critical which may resonably last long
> > should be explicitly excluded from the monitoring.
> > - Failure handlers should have a setting to suppress some failures on
> > per-failure-type basis.
> > According to this I have updated the implementation: [1]
> >
> > [1] https://github.com/apache/ignite/pull/4089
> >
> > пн, 10 сент. 2018 г. в 22:35, David Harvey :
> >
> > > When I've done this before,I've needed to find the oldest  thread, and
> kill
> > > the node running that.   From a language standpoint, Maxim's "without
> > > progress" better than "heartbeat".   For example, what I'm most
> interested
> > > in on a distributed system is which thread started the work it has not
> > > completed the earliest, and when did that thread last make forward
> > > process. You don't want to kill a node because a thread is waiting
> on a
> > > lock held by a thread that went off-node and has not gotten a response.
> > > If you don't understand the dependency relationships, you will make
> > > incorrect recovery decisions.
> > >
> > > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov 
> > > wrote:
> > >
> > > > I think we should find exact answers to these questions:
> > > >  1. What `critical` issue exactly is?
> > > >  2. How can we find critical issues?
> > > >  3. How can we handle critical issues?
> > > >
> > > > First,
> > > >  - Ignore uninterruptable actions (e.g. worker\service shutdown)
> > > >  - Long I/O operations (should be a configurable timeout for each
> type of
> > > > usage)
> > > >  - Infinite loops
> > > >  - Stalled\deadlocked threads (and\or too many parked threads,
> exclude
> > > I/O)
> > > >
> > > > Second,
> > > >  - The working queue is without progress (e.g. disco, exchange
> queues)
> > > >  - Work hasn't been completed since the last heartbeat (checking
> > > > milestones)
> > > >  - Too many system resources used by a thread for the long period of
> time
> > > > (allocated memory, CPU)
> > > >  - Timing fields associated with each thread status exceeded a
> maximum
> > > time
> > > > limit.
> > > >
> > > > Third (not too many options here),
> > > >  - `log everything` should be the default behaviour in all these
> cases,
> > > > since it may be difficult to find the cause after the restart.
> > > >  - Wait some interval of time and kill the hanging node (cluster
> should
> > > be
> > > > configured stable enough)
> > > >
> > > > Questions,
> > > >  - Not sure, but can workers miss their heartbeat deadlines if CPU
> loads
> > > up
> > > > to 80%-90%? Bursts of momentary overloads can be
> > > > expected behaviour as a normal part of system operations.
> > > >  - Why do we decide that critical thread should monitor each other?
> For
> > > > instance, if all the tasks were blocked and unable to run,
> > > > node reset would never occur. As for me, a better solution is to
> use
> > > a
> > > > separate monitor thread or pool (maybe both with software
> > > > and hardware checks) that not only checks heartbeats but
> monitors the
> > > > other system as well.
> > > >
> > > > On Mon, 10 Sep 2018 at 00:07 David Harvey 
> wrote:
> > > >
> > > > > It would be safer to restart the entire cluster than to remove the
> last
> > > > > node for a cache that should be redundant.
> > > > >
> > > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura  wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I agree with Yakov that we can provide some option that manage
> worker
> > > > > > liveness checker behavior in case of observing that some worker
> is
> > > > > > blocked too long.
> > > > > > At least it will  some workaround for cases when node fails is
> too
> > > > > > annoying.
> > > > > >
> > > > > > Backups count threshold sounds good but I don't understand how it
> > > will
> > > > > > help in case of cluster hanging.
> > > > > >
> > > > > > The simplest solution here is alert in cases of blocking of some
> > > > > > critical worker (we can improve WorkersRegistry for this purpose
> and
> > > > > > expose list of blocked workers) and optionally call system
> configured
> > > > > > 

Re: Critical worker threads liveness checking drawbacks

2018-09-24 Thread Andrey Gura
Andrey,

finally your change is merged to master branch. Congratulations and
thank you very much! :)

I think that the next step is feature that will allow signal about
blocked threads to the monitoring tools via MXBean.

I hope you will continue development of this feature and provide your
vision in new JIRA issue.


On Tue, Sep 11, 2018 at 6:54 PM Andrey Kuznetsov  wrote:
>
> David, Maxim!
>
> Thanks a lot for you ideas. Unfortunately, I can't adopt all of them right
> now: the scope is much broader than the scope of the change I implement. I
> have had a talk to a group of Ignite commiters, and we agreed to complete
> the change as follows.
> - Blocking instructions in system-critical which may resonably last long
> should be explicitly excluded from the monitoring.
> - Failure handlers should have a setting to suppress some failures on
> per-failure-type basis.
> According to this I have updated the implementation: [1]
>
> [1] https://github.com/apache/ignite/pull/4089
>
> пн, 10 сент. 2018 г. в 22:35, David Harvey :
>
> > When I've done this before,I've needed to find the oldest  thread, and kill
> > the node running that.   From a language standpoint, Maxim's "without
> > progress" better than "heartbeat".   For example, what I'm most interested
> > in on a distributed system is which thread started the work it has not
> > completed the earliest, and when did that thread last make forward
> > process. You don't want to kill a node because a thread is waiting on a
> > lock held by a thread that went off-node and has not gotten a response.
> > If you don't understand the dependency relationships, you will make
> > incorrect recovery decisions.
> >
> > On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov 
> > wrote:
> >
> > > I think we should find exact answers to these questions:
> > >  1. What `critical` issue exactly is?
> > >  2. How can we find critical issues?
> > >  3. How can we handle critical issues?
> > >
> > > First,
> > >  - Ignore uninterruptable actions (e.g. worker\service shutdown)
> > >  - Long I/O operations (should be a configurable timeout for each type of
> > > usage)
> > >  - Infinite loops
> > >  - Stalled\deadlocked threads (and\or too many parked threads, exclude
> > I/O)
> > >
> > > Second,
> > >  - The working queue is without progress (e.g. disco, exchange queues)
> > >  - Work hasn't been completed since the last heartbeat (checking
> > > milestones)
> > >  - Too many system resources used by a thread for the long period of time
> > > (allocated memory, CPU)
> > >  - Timing fields associated with each thread status exceeded a maximum
> > time
> > > limit.
> > >
> > > Third (not too many options here),
> > >  - `log everything` should be the default behaviour in all these cases,
> > > since it may be difficult to find the cause after the restart.
> > >  - Wait some interval of time and kill the hanging node (cluster should
> > be
> > > configured stable enough)
> > >
> > > Questions,
> > >  - Not sure, but can workers miss their heartbeat deadlines if CPU loads
> > up
> > > to 80%-90%? Bursts of momentary overloads can be
> > > expected behaviour as a normal part of system operations.
> > >  - Why do we decide that critical thread should monitor each other? For
> > > instance, if all the tasks were blocked and unable to run,
> > > node reset would never occur. As for me, a better solution is to use
> > a
> > > separate monitor thread or pool (maybe both with software
> > > and hardware checks) that not only checks heartbeats but monitors the
> > > other system as well.
> > >
> > > On Mon, 10 Sep 2018 at 00:07 David Harvey  wrote:
> > >
> > > > It would be safer to restart the entire cluster than to remove the last
> > > > node for a cache that should be redundant.
> > > >
> > > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura  wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > I agree with Yakov that we can provide some option that manage worker
> > > > > liveness checker behavior in case of observing that some worker is
> > > > > blocked too long.
> > > > > At least it will  some workaround for cases when node fails is too
> > > > > annoying.
> > > > >
> > > > > Backups count threshold sounds good but I don't understand how it
> > will
> > > > > help in case of cluster hanging.
> > > > >
> > > > > The simplest solution here is alert in cases of blocking of some
> > > > > critical worker (we can improve WorkersRegistry for this purpose and
> > > > > expose list of blocked workers) and optionally call system configured
> > > > > failure processor. BTW, failure processor can be extended in order to
> > > > > perform any checks (e.g. backup count) and decide whether it should
> > > > > stop node or not.
> > > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov 
> > > > wrote:
> > > > > >
> > > > > > David, Yakov, I understand your fears. But liveness checks deal
> > with
> > > > > > _critical_ conditions, i.e. when such a condition is met we
> > conclude
> > > > the
> > > 

Re: Critical worker threads liveness checking drawbacks

2018-09-11 Thread Andrey Kuznetsov
David, Maxim!

Thanks a lot for you ideas. Unfortunately, I can't adopt all of them right
now: the scope is much broader than the scope of the change I implement. I
have had a talk to a group of Ignite commiters, and we agreed to complete
the change as follows.
- Blocking instructions in system-critical which may resonably last long
should be explicitly excluded from the monitoring.
- Failure handlers should have a setting to suppress some failures on
per-failure-type basis.
According to this I have updated the implementation: [1]

[1] https://github.com/apache/ignite/pull/4089

пн, 10 сент. 2018 г. в 22:35, David Harvey :

> When I've done this before,I've needed to find the oldest  thread, and kill
> the node running that.   From a language standpoint, Maxim's "without
> progress" better than "heartbeat".   For example, what I'm most interested
> in on a distributed system is which thread started the work it has not
> completed the earliest, and when did that thread last make forward
> process. You don't want to kill a node because a thread is waiting on a
> lock held by a thread that went off-node and has not gotten a response.
> If you don't understand the dependency relationships, you will make
> incorrect recovery decisions.
>
> On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov 
> wrote:
>
> > I think we should find exact answers to these questions:
> >  1. What `critical` issue exactly is?
> >  2. How can we find critical issues?
> >  3. How can we handle critical issues?
> >
> > First,
> >  - Ignore uninterruptable actions (e.g. worker\service shutdown)
> >  - Long I/O operations (should be a configurable timeout for each type of
> > usage)
> >  - Infinite loops
> >  - Stalled\deadlocked threads (and\or too many parked threads, exclude
> I/O)
> >
> > Second,
> >  - The working queue is without progress (e.g. disco, exchange queues)
> >  - Work hasn't been completed since the last heartbeat (checking
> > milestones)
> >  - Too many system resources used by a thread for the long period of time
> > (allocated memory, CPU)
> >  - Timing fields associated with each thread status exceeded a maximum
> time
> > limit.
> >
> > Third (not too many options here),
> >  - `log everything` should be the default behaviour in all these cases,
> > since it may be difficult to find the cause after the restart.
> >  - Wait some interval of time and kill the hanging node (cluster should
> be
> > configured stable enough)
> >
> > Questions,
> >  - Not sure, but can workers miss their heartbeat deadlines if CPU loads
> up
> > to 80%-90%? Bursts of momentary overloads can be
> > expected behaviour as a normal part of system operations.
> >  - Why do we decide that critical thread should monitor each other? For
> > instance, if all the tasks were blocked and unable to run,
> > node reset would never occur. As for me, a better solution is to use
> a
> > separate monitor thread or pool (maybe both with software
> > and hardware checks) that not only checks heartbeats but monitors the
> > other system as well.
> >
> > On Mon, 10 Sep 2018 at 00:07 David Harvey  wrote:
> >
> > > It would be safer to restart the entire cluster than to remove the last
> > > node for a cache that should be redundant.
> > >
> > > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura  wrote:
> > >
> > > > Hi,
> > > >
> > > > I agree with Yakov that we can provide some option that manage worker
> > > > liveness checker behavior in case of observing that some worker is
> > > > blocked too long.
> > > > At least it will  some workaround for cases when node fails is too
> > > > annoying.
> > > >
> > > > Backups count threshold sounds good but I don't understand how it
> will
> > > > help in case of cluster hanging.
> > > >
> > > > The simplest solution here is alert in cases of blocking of some
> > > > critical worker (we can improve WorkersRegistry for this purpose and
> > > > expose list of blocked workers) and optionally call system configured
> > > > failure processor. BTW, failure processor can be extended in order to
> > > > perform any checks (e.g. backup count) and decide whether it should
> > > > stop node or not.
> > > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov 
> > > wrote:
> > > > >
> > > > > David, Yakov, I understand your fears. But liveness checks deal
> with
> > > > > _critical_ conditions, i.e. when such a condition is met we
> conclude
> > > the
> > > > > node as totally broken, and there is no sense to keep it alive
> > > regardless
> > > > > the data it contains. If we want to give it a chance, then the
> > > condition
> > > > > (long fsync etc.) should not considered as critical at all.
> > > > >
> > > > > сб, 8 сент. 2018 г. в 15:18, Yakov Zhdanov :
> > > > >
> > > > > > Agree with David. We need to have an opporunity set backups count
> > > > threshold
> > > > > > (at runtime also!) that will not allow any automatic stop if
> there
> > > > will be
> > > > > > a data loss. Andrey, what do you think?
> > > > > >
> > > > > 

Re: Critical worker threads liveness checking drawbacks

2018-09-11 Thread vgrigorev
Reliability of ignite is very important to me, so please consider following
idea:

- Important threads as WAL writer (as a sample of any critical thread)
must not do any blocking action, by this way:
   - WAL thread  must be management thread for all WAL operations
   - Child, worker thread of WAL writer must do separate operations which
implements concrete WAL writings
   - Operations are separate units of work, countable by it's heartbeat for
sample and has characteristics
   and ids. 
   - Operations written in queue and has state.
   - If hung occur in a concrete operation, this operation may be cancelled,
(all child operations in a cluster too) and all others operations continue
to work, with failed operation go to recovery state or report user about
fail
   - If WAL child thread do infinite blocking operation, it's need to kill
this working thread and start new with same queue of operations of WAL type

So, we become able :
- always know what concrete operation  are in hung, (not that whole main WAL
thread hung) so can better decide want to do.
- WAL thread operations newer irresponsive, at minimum it reports that it
long doing some operation and just can insert next operation queue or
propose fail
- report size of queue and else full detail information about what happening
and allow to decide precisely - fail concrete user operations, clean
resources, spawn new working thread or else, and continue to work without
painful node or cluster restart
- minimal cleanless possible (just some operations)
- balance operations with queues, also implementing backpressure, so make
sure that optimal performance load is kept and cluster will not go to
degradation from some local oversaturations
- newer see that node hung, but just degrade and being in fully controlled
state 

- WAL thread operations check management functions can be encapsulated to
special class with that functionality and called from else main threads as
now.

Sorry for any inconvenience, I'm new to writing here



--
Sent from: http://apache-ignite-developers.2346864.n4.nabble.com/


Re: Critical worker threads liveness checking drawbacks

2018-09-10 Thread David Harvey
When I've done this before,I've needed to find the oldest  thread, and kill
the node running that.   From a language standpoint, Maxim's "without
progress" better than "heartbeat".   For example, what I'm most interested
in on a distributed system is which thread started the work it has not
completed the earliest, and when did that thread last make forward
process. You don't want to kill a node because a thread is waiting on a
lock held by a thread that went off-node and has not gotten a response.
If you don't understand the dependency relationships, you will make
incorrect recovery decisions.

On Mon, Sep 10, 2018 at 4:08 AM Maxim Muzafarov  wrote:

> I think we should find exact answers to these questions:
>  1. What `critical` issue exactly is?
>  2. How can we find critical issues?
>  3. How can we handle critical issues?
>
> First,
>  - Ignore uninterruptable actions (e.g. worker\service shutdown)
>  - Long I/O operations (should be a configurable timeout for each type of
> usage)
>  - Infinite loops
>  - Stalled\deadlocked threads (and\or too many parked threads, exclude I/O)
>
> Second,
>  - The working queue is without progress (e.g. disco, exchange queues)
>  - Work hasn't been completed since the last heartbeat (checking
> milestones)
>  - Too many system resources used by a thread for the long period of time
> (allocated memory, CPU)
>  - Timing fields associated with each thread status exceeded a maximum time
> limit.
>
> Third (not too many options here),
>  - `log everything` should be the default behaviour in all these cases,
> since it may be difficult to find the cause after the restart.
>  - Wait some interval of time and kill the hanging node (cluster should be
> configured stable enough)
>
> Questions,
>  - Not sure, but can workers miss their heartbeat deadlines if CPU loads up
> to 80%-90%? Bursts of momentary overloads can be
> expected behaviour as a normal part of system operations.
>  - Why do we decide that critical thread should monitor each other? For
> instance, if all the tasks were blocked and unable to run,
> node reset would never occur. As for me, a better solution is to use a
> separate monitor thread or pool (maybe both with software
> and hardware checks) that not only checks heartbeats but monitors the
> other system as well.
>
> On Mon, 10 Sep 2018 at 00:07 David Harvey  wrote:
>
> > It would be safer to restart the entire cluster than to remove the last
> > node for a cache that should be redundant.
> >
> > On Sun, Sep 9, 2018, 4:00 PM Andrey Gura  wrote:
> >
> > > Hi,
> > >
> > > I agree with Yakov that we can provide some option that manage worker
> > > liveness checker behavior in case of observing that some worker is
> > > blocked too long.
> > > At least it will  some workaround for cases when node fails is too
> > > annoying.
> > >
> > > Backups count threshold sounds good but I don't understand how it will
> > > help in case of cluster hanging.
> > >
> > > The simplest solution here is alert in cases of blocking of some
> > > critical worker (we can improve WorkersRegistry for this purpose and
> > > expose list of blocked workers) and optionally call system configured
> > > failure processor. BTW, failure processor can be extended in order to
> > > perform any checks (e.g. backup count) and decide whether it should
> > > stop node or not.
> > > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov 
> > wrote:
> > > >
> > > > David, Yakov, I understand your fears. But liveness checks deal with
> > > > _critical_ conditions, i.e. when such a condition is met we conclude
> > the
> > > > node as totally broken, and there is no sense to keep it alive
> > regardless
> > > > the data it contains. If we want to give it a chance, then the
> > condition
> > > > (long fsync etc.) should not considered as critical at all.
> > > >
> > > > сб, 8 сент. 2018 г. в 15:18, Yakov Zhdanov :
> > > >
> > > > > Agree with David. We need to have an opporunity set backups count
> > > threshold
> > > > > (at runtime also!) that will not allow any automatic stop if there
> > > will be
> > > > > a data loss. Andrey, what do you think?
> > > > >
> > > > > --Yakov
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > >   Andrey Kuznetsov.
> > >
> >
> --
> --
> Maxim Muzafarov
>


Re: Critical worker threads liveness checking drawbacks

2018-09-10 Thread Maxim Muzafarov
I think we should find exact answers to these questions:
 1. What `critical` issue exactly is?
 2. How can we find critical issues?
 3. How can we handle critical issues?

First,
 - Ignore uninterruptable actions (e.g. worker\service shutdown)
 - Long I/O operations (should be a configurable timeout for each type of
usage)
 - Infinite loops
 - Stalled\deadlocked threads (and\or too many parked threads, exclude I/O)

Second,
 - The working queue is without progress (e.g. disco, exchange queues)
 - Work hasn't been completed since the last heartbeat (checking milestones)
 - Too many system resources used by a thread for the long period of time
(allocated memory, CPU)
 - Timing fields associated with each thread status exceeded a maximum time
limit.

Third (not too many options here),
 - `log everything` should be the default behaviour in all these cases,
since it may be difficult to find the cause after the restart.
 - Wait some interval of time and kill the hanging node (cluster should be
configured stable enough)

Questions,
 - Not sure, but can workers miss their heartbeat deadlines if CPU loads up
to 80%-90%? Bursts of momentary overloads can be
expected behaviour as a normal part of system operations.
 - Why do we decide that critical thread should monitor each other? For
instance, if all the tasks were blocked and unable to run,
node reset would never occur. As for me, a better solution is to use a
separate monitor thread or pool (maybe both with software
and hardware checks) that not only checks heartbeats but monitors the
other system as well.

On Mon, 10 Sep 2018 at 00:07 David Harvey  wrote:

> It would be safer to restart the entire cluster than to remove the last
> node for a cache that should be redundant.
>
> On Sun, Sep 9, 2018, 4:00 PM Andrey Gura  wrote:
>
> > Hi,
> >
> > I agree with Yakov that we can provide some option that manage worker
> > liveness checker behavior in case of observing that some worker is
> > blocked too long.
> > At least it will  some workaround for cases when node fails is too
> > annoying.
> >
> > Backups count threshold sounds good but I don't understand how it will
> > help in case of cluster hanging.
> >
> > The simplest solution here is alert in cases of blocking of some
> > critical worker (we can improve WorkersRegistry for this purpose and
> > expose list of blocked workers) and optionally call system configured
> > failure processor. BTW, failure processor can be extended in order to
> > perform any checks (e.g. backup count) and decide whether it should
> > stop node or not.
> > On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov 
> wrote:
> > >
> > > David, Yakov, I understand your fears. But liveness checks deal with
> > > _critical_ conditions, i.e. when such a condition is met we conclude
> the
> > > node as totally broken, and there is no sense to keep it alive
> regardless
> > > the data it contains. If we want to give it a chance, then the
> condition
> > > (long fsync etc.) should not considered as critical at all.
> > >
> > > сб, 8 сент. 2018 г. в 15:18, Yakov Zhdanov :
> > >
> > > > Agree with David. We need to have an opporunity set backups count
> > threshold
> > > > (at runtime also!) that will not allow any automatic stop if there
> > will be
> > > > a data loss. Andrey, what do you think?
> > > >
> > > > --Yakov
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > >   Andrey Kuznetsov.
> >
>
-- 
--
Maxim Muzafarov


Re: Critical worker threads liveness checking drawbacks

2018-09-09 Thread David Harvey
It would be safer to restart the entire cluster than to remove the last
node for a cache that should be redundant.

On Sun, Sep 9, 2018, 4:00 PM Andrey Gura  wrote:

> Hi,
>
> I agree with Yakov that we can provide some option that manage worker
> liveness checker behavior in case of observing that some worker is
> blocked too long.
> At least it will  some workaround for cases when node fails is too
> annoying.
>
> Backups count threshold sounds good but I don't understand how it will
> help in case of cluster hanging.
>
> The simplest solution here is alert in cases of blocking of some
> critical worker (we can improve WorkersRegistry for this purpose and
> expose list of blocked workers) and optionally call system configured
> failure processor. BTW, failure processor can be extended in order to
> perform any checks (e.g. backup count) and decide whether it should
> stop node or not.
> On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov  wrote:
> >
> > David, Yakov, I understand your fears. But liveness checks deal with
> > _critical_ conditions, i.e. when such a condition is met we conclude the
> > node as totally broken, and there is no sense to keep it alive regardless
> > the data it contains. If we want to give it a chance, then the condition
> > (long fsync etc.) should not considered as critical at all.
> >
> > сб, 8 сент. 2018 г. в 15:18, Yakov Zhdanov :
> >
> > > Agree with David. We need to have an opporunity set backups count
> threshold
> > > (at runtime also!) that will not allow any automatic stop if there
> will be
> > > a data loss. Andrey, what do you think?
> > >
> > > --Yakov
> > >
> >
> >
> > --
> > Best regards,
> >   Andrey Kuznetsov.
>


Re: Critical worker threads liveness checking drawbacks

2018-09-09 Thread Andrey Gura
Hi,

I agree with Yakov that we can provide some option that manage worker
liveness checker behavior in case of observing that some worker is
blocked too long.
At least it will  some workaround for cases when node fails is too annoying.

Backups count threshold sounds good but I don't understand how it will
help in case of cluster hanging.

The simplest solution here is alert in cases of blocking of some
critical worker (we can improve WorkersRegistry for this purpose and
expose list of blocked workers) and optionally call system configured
failure processor. BTW, failure processor can be extended in order to
perform any checks (e.g. backup count) and decide whether it should
stop node or not.
On Sat, Sep 8, 2018 at 3:42 PM Andrey Kuznetsov  wrote:
>
> David, Yakov, I understand your fears. But liveness checks deal with
> _critical_ conditions, i.e. when such a condition is met we conclude the
> node as totally broken, and there is no sense to keep it alive regardless
> the data it contains. If we want to give it a chance, then the condition
> (long fsync etc.) should not considered as critical at all.
>
> сб, 8 сент. 2018 г. в 15:18, Yakov Zhdanov :
>
> > Agree with David. We need to have an opporunity set backups count threshold
> > (at runtime also!) that will not allow any automatic stop if there will be
> > a data loss. Andrey, what do you think?
> >
> > --Yakov
> >
>
>
> --
> Best regards,
>   Andrey Kuznetsov.


Re: Critical worker threads liveness checking drawbacks

2018-09-08 Thread Andrey Kuznetsov
David, Yakov, I understand your fears. But liveness checks deal with
_critical_ conditions, i.e. when such a condition is met we conclude the
node as totally broken, and there is no sense to keep it alive regardless
the data it contains. If we want to give it a chance, then the condition
(long fsync etc.) should not considered as critical at all.

сб, 8 сент. 2018 г. в 15:18, Yakov Zhdanov :

> Agree with David. We need to have an opporunity set backups count threshold
> (at runtime also!) that will not allow any automatic stop if there will be
> a data loss. Andrey, what do you think?
>
> --Yakov
>


-- 
Best regards,
  Andrey Kuznetsov.


Re: Critical worker threads liveness checking drawbacks

2018-09-08 Thread Yakov Zhdanov
Agree with David. We need to have an opporunity set backups count threshold
(at runtime also!) that will not allow any automatic stop if there will be
a data loss. Andrey, what do you think?

--Yakov


Re: Critical worker threads liveness checking drawbacks

2018-09-07 Thread David Harvey
There are at least two production cases that need to be distinguished:
The first is where a single node restart will repair the problem( and you
get the right node.  )
The other cases are those where stopping the node will invalidate it's
backups, leaving only one copy of the data, and the problem is not
resolved.  Lots of opportunities to destroy all copies.  Automated
decisions should take into account whether a node in question is the last
source of Truth.

Killing off a single bad actor using automation is safer than having humans
with the CEO screaming at them to try.
-DH


PS:  I'm just finalizing an extension which allows cache templates created
in spring to force primary and backups to different failure
domains(availability zones) ( no need for custom Java code), and have been
fretting over all the ways to lose data.

On Thu, Sep 6, 2018, 10:03 AM Andrey Kuznetsov  wrote:

> Igniters,
>
> Currently, we have a nearly completed implementation for system-critical
> threads liveness checking [1], in terms of IEP-14 [2] and IEP-5 [3]. In a
> nutshell, system-critical threads monitor each other and checks for two
> aspects:
> - whether a thread is alive;
> - whether a thread is active, i.e. it updates its heartbeat timestamp
> periodically.
> When either check fails, critical failure handler is called, this in fact
> means node stop.
>
> The implementation of activity checks has a flaw now: some blocking actions
> are parts of normal operation and should not lead to node stop, e.g.
> - WAL writer thread can call {{fsync()}};
> - any cache write that occurs in system striped executor can lead to
> {{fsync()}} call again.
> The former example can be fixed by disabling heartbeat checks temporarily
> for known long-running actions, but it won't work with for the latter one.
>
> I see a few options to address the issue:
> - Just log any long-running action instead of calling critical failure
> handler.
> - Introduce several severity levels for long-running actions handling. Each
> level will have its own failure handler. Depending on the level,
> long-running action can lead to node stop, error logging or no-op reaction.
>
> I encourage you to suggest other options. Any idea is appreciated.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-6587
> [2]
>
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
> [3]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=74683878
>
> --
> Best regards,
>   Andrey Kuznetsov.
>


Re: Critical worker threads liveness checking drawbacks

2018-09-07 Thread Yakov Zhdanov
Yes, and you should suggest solution, e.g. throttle rebalancing threads
more to produce less load.

What you suggesting kills the idea of this enhancement.

--Yakov

2018-09-07 19:03 GMT+03:00 Andrey Kuznetsov :

> Yakov,
>
> Thanks for reply. Indeed, initial design assumed node termination when
> hanging critical thread has been detected. But sometimes it looks
> inappropriate. Let, for example fsync in WAL writer thread takes too long,
> and we terminate the node. Upon rebalancing, this may lead to long fsyncs
> on other nodes due to increased per node load, hence we can terminate the
> next node as well. Eventually we can collapse the entire cluster. Is it a
> possible scenario?
>
> пт, 7 сент. 2018 г. в 18:44, Yakov Zhdanov :
>
> > Andrey,
> >
> > I don't understand your point. My opinion, the idea of these changes is
> to
> > make cluster more stable and responsive by eliminating hanged nodes. I
> > would not make too much difference between threads trapped in deadlock
> and
> > threads hanging on fsync calls for too long. Both situations lead to
> > increasing latency in cluster till its full unavailability.
> >
> > So, killing node hanging on fsync may be reasonable. Agree?
> >
> > You may implement the approach when you have warning messages in logs by
> > default, but termination option should also be available.
> >
> > Thanks!
> >
> > --Yakov
> >
> >
>


Re: Critical worker threads liveness checking drawbacks

2018-09-07 Thread Andrey Kuznetsov
Yakov,

Thanks for reply. Indeed, initial design assumed node termination when
hanging critical thread has been detected. But sometimes it looks
inappropriate. Let, for example fsync in WAL writer thread takes too long,
and we terminate the node. Upon rebalancing, this may lead to long fsyncs
on other nodes due to increased per node load, hence we can terminate the
next node as well. Eventually we can collapse the entire cluster. Is it a
possible scenario?

пт, 7 сент. 2018 г. в 18:44, Yakov Zhdanov :

> Andrey,
>
> I don't understand your point. My opinion, the idea of these changes is to
> make cluster more stable and responsive by eliminating hanged nodes. I
> would not make too much difference between threads trapped in deadlock and
> threads hanging on fsync calls for too long. Both situations lead to
> increasing latency in cluster till its full unavailability.
>
> So, killing node hanging on fsync may be reasonable. Agree?
>
> You may implement the approach when you have warning messages in logs by
> default, but termination option should also be available.
>
> Thanks!
>
> --Yakov
>
>


Re: Critical worker threads liveness checking drawbacks

2018-09-07 Thread Yakov Zhdanov
Andrey,

I don't understand your point. My opinion, the idea of these changes is to
make cluster more stable and responsive by eliminating hanged nodes. I
would not make too much difference between threads trapped in deadlock and
threads hanging on fsync calls for too long. Both situations lead to
increasing latency in cluster till its full unavailability.

So, killing node hanging on fsync may be reasonable. Agree?

You may implement the approach when you have warning messages in logs by
default, but termination option should also be available.

Thanks!

--Yakov

2018-09-06 17:02 GMT+03:00 Andrey Kuznetsov :

> Igniters,
>
> Currently, we have a nearly completed implementation for system-critical
> threads liveness checking [1], in terms of IEP-14 [2] and IEP-5 [3]. In a
> nutshell, system-critical threads monitor each other and checks for two
> aspects:
> - whether a thread is alive;
> - whether a thread is active, i.e. it updates its heartbeat timestamp
> periodically.
> When either check fails, critical failure handler is called, this in fact
> means node stop.
>
> The implementation of activity checks has a flaw now: some blocking actions
> are parts of normal operation and should not lead to node stop, e.g.
> - WAL writer thread can call {{fsync()}};
> - any cache write that occurs in system striped executor can lead to
> {{fsync()}} call again.
> The former example can be fixed by disabling heartbeat checks temporarily
> for known long-running actions, but it won't work with for the latter one.
>
> I see a few options to address the issue:
> - Just log any long-running action instead of calling critical failure
> handler.
> - Introduce several severity levels for long-running actions handling. Each
> level will have its own failure handler. Depending on the level,
> long-running action can lead to node stop, error logging or no-op reaction.
>
> I encourage you to suggest other options. Any idea is appreciated.
>
> [1] https://issues.apache.org/jira/browse/IGNITE-6587
> [2]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-
> 14+Ignite+failures+handling
> [3]
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=74683878
>
> --
> Best regards,
>   Andrey Kuznetsov.
>