Re: Critical worker threads liveness checking drawbacks

2018-12-19 Thread Dmitriy Pavlov
Hi, Sorry for being too formal here, but IGNITE-10003 is in progress. Also, I've tried to find anything related to it in the list. So according to the list, no one asked to include it. Sincerely, Dmitriy Pavlov Wed, Dec 19, 2018 at 13:24,

Re: Critical worker threads liveness checking drawbacks

2018-12-19 Thread Nikolay Izhikov
Hello, Alexey. No, we don't include this ticket in 2.7. Should we? On Wed, Dec 19, 2018 at 12:55, Alexey Goncharuk wrote: > Folks, why did not we include IGNITE-10003 to ignite-2.7 release scope? > This causes an Ignite node to be stopped by default when checkpoint read > lock acquire times out. I

Re: Critical worker threads liveness checking drawbacks

2018-12-19 Thread Alexey Goncharuk
Folks, why didn't we include IGNITE-10003 in the ignite-2.7 release scope? This causes an Ignite node to be stopped by default when checkpoint read lock acquisition times out. I expect a lot of Ignite 2.7 users will be affected by this mistake. We should at least update the documentation and make users

Re: Critical worker threads liveness checking drawbacks

2018-10-25 Thread Alexey Goncharuk
Andrey, I still see that checkpoint read lock acquisition raises a CRITICAL_ERROR, which by default will shut down the local node. As far as I remember, we decided that by default a thread timeout should not trigger node failure. Now, however, it does, because we ignore SYSTEM_WORKER_BLOCKED events in

Re: Critical worker threads liveness checking drawbacks

2018-10-11 Thread Andrey Kuznetsov
Igniters, I have now spotted blocking / long-running code arising from `GridDhtPartitionsExchangeFuture#init` calls in the partition-exchanger thread, see [1]. Ideally, all blocking operations along all possible code paths should be guarded implicitly from the critical failure detector to avoid the thread from

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Maxim Muzafarov
Andrey, Andrey > Thanks for being attentive! It's definitely a typo. Could you please create > an issue? I've created an issue [1] and prepared PR [2]. Please, review this change. [1] https://issues.apache.org/jira/browse/IGNITE-9723 [2] https://github.com/apache/ignite/pull/4862 On Fri, 28

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Yakov Zhdanov
Config option + mbean access. Does that make sense? Yakov On Fri, Sep 28, 2018, 17:17 Vladimir Ozerov wrote: > Then it should be config option. > > On Fri, Sep 28, 2018 at 13:15, Andrey Gura wrote: > > > Guys, > > > > why we need both config option and system property? I believe one way is > >

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Vladimir Ozerov
Then it should be a config option. On Fri, Sep 28, 2018 at 13:15, Andrey Gura wrote: > Guys, > > why we need both config option and system property? I believe one way is > enough. > On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov > wrote: > > > > Ticket created -

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Andrey Gura
Guys, why do we need both a config option and a system property? I believe one way is enough. On Fri, Sep 28, 2018 at 12:38 PM Nikolay Izhikov wrote: > > Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737 > > Fixed version is 2.7. > > On Fri, 28/09/2018 at 11:41 +0300, Alexey Goncharuk

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Nikolay Izhikov
Ticket created - https://issues.apache.org/jira/browse/IGNITE-9737 Fix version is 2.7. On Fri, 28/09/2018 at 11:41 +0300, Alexey Goncharuk wrote: > Nikolay, I agree, a user should be able to disable both thread liveness > check and checkpoint read lock timeout check from config and a system >

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Alexey Goncharuk
Nikolay, I agree, a user should be able to disable both the thread liveness check and the checkpoint read lock timeout check from config and a system property. On Fri, Sep 28, 2018 at 11:30, Nikolay Izhikov wrote: > Hello, Igniters. > > I found that this feature can't be disabled from config. > The only way

Re: Critical worker threads liveness checking drawbacks

2018-09-28 Thread Nikolay Izhikov
Hello, Igniters. I found that this feature can't be disabled from config. The only way to disable it is from a JMX bean. I think this is very dangerous: if we have some corner case or a bug in this watchdog, it can make Ignite unusable. I propose to implement the possibility to disable this feature both -
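
For illustration, a minimal sketch of how a check timeout could be resolved from both a config value and a system property, as proposed in this message. Everything here is hypothetical: the class, the property name `example.worker.blocked.timeout`, the "system property wins" precedence, and the "non-positive value disables the check" convention are assumptions for the example, not what the thread decided or what Ignite implements.

```java
/** Hypothetical settings resolver; not part of Ignite. */
public class LivenessCheckSettings {
    /** Hypothetical system property name, used only in this sketch. */
    static final String TIMEOUT_PROP = "example.worker.blocked.timeout";

    /**
     * Resolves the timeout in milliseconds.
     * Assumed precedence: system property, then config value, then default.
     */
    static long resolveTimeout(Long configuredMs, long defaultMs) {
        String prop = System.getProperty(TIMEOUT_PROP);

        if (prop != null) {
            try {
                return Long.parseLong(prop.trim());
            }
            catch (NumberFormatException ignored) {
                // Malformed property: fall back to the config value or default.
            }
        }

        return configuredMs != null ? configuredMs : defaultMs;
    }

    /** Assumed convention: a non-positive timeout switches the check off. */
    static boolean checkEnabled(long timeoutMs) {
        return timeoutMs > 0;
    }
}
```

With this shape, an operator could switch the check off at launch time (for example by passing `-Dexample.worker.blocked.timeout=0`) without editing the node configuration, which is the safety valve the message asks for.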

Re: Critical worker threads liveness checking drawbacks

2018-09-27 Thread Andrey Kuznetsov
Maxim, Thanks for being attentive! It's definitely a typo. Could you please create an issue? On Thu, Sep 27, 2018 at 16:00, Maxim Muzafarov wrote: > Folks, > > I've found in `GridCachePartitionExchangeManager:2684` [1] (master branch) > exchange future wrapped > with double `blockingSectionEnd`

Re: Critical worker threads liveness checking drawbacks

2018-09-27 Thread Maxim Muzafarov
Folks, I've found in `GridCachePartitionExchangeManager:2684` [1] (master branch) the exchange future wrapped in two `blockingSectionEnd` calls. Is that correct? I just want to understand this change and how I should use it in the future. Should I file a new issue to fix this? I think here
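
To make the begin/end pairing concrete, here is a minimal, self-contained sketch of a heartbeat tracker with blocking sections. It is not Ignite's worker code: the class `WorkerHeartbeat`, its method signatures, and the use of `Long.MAX_VALUE` as a "checking suspended" marker are assumptions made for this example, and timestamps are passed in explicitly so the behavior is deterministic.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical heartbeat tracker for a critical worker; not Ignite's implementation. */
public class WorkerHeartbeat {
    /** Last progress timestamp; Long.MAX_VALUE marks "inside a blocking section". */
    private final AtomicLong heartbeatTs = new AtomicLong();

    public WorkerHeartbeat(long nowMs) {
        heartbeatTs.set(nowMs);
    }

    /** Called by the worker whenever it makes progress. */
    public void updateHeartbeat(long nowMs) {
        heartbeatTs.set(nowMs);
    }

    /** Called right before a legitimately long wait: suspends liveness checking. */
    public void blockingSectionBegin() {
        heartbeatTs.set(Long.MAX_VALUE);
    }

    /** Called right after the wait returns: resumes liveness checking. */
    public void blockingSectionEnd(long nowMs) {
        heartbeatTs.set(nowMs);
    }

    /** Watchdog side: true if the worker made no progress for longer than the timeout. */
    public boolean isBlocked(long nowMs, long timeoutMs) {
        long ts = heartbeatTs.get();

        return ts != Long.MAX_VALUE && nowMs - ts > timeoutMs;
    }
}
```

In this sketch, a wait guarded by two `blockingSectionEnd` calls instead of a begin/end pair would never suspend checking, so a long wait on the exchange future could be reported as a blocked worker.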

Re: Critical worker threads liveness checking drawbacks

2018-09-26 Thread Vyacheslav Daradur
Andrey Gura, thank you for the answer! I agree that wrapping the 'init' method reduces the benefit of the watchdog service in the case of the PME worker, but in other cases we should wrap all possible long sections in GridDhtPartitionsExchangeFuture. For example 'onCacheChangeRequest' method or

Re: Critical worker threads liveness checking drawbacks

2018-09-26 Thread Andrey Gura
Vyacheslav, The exchange worker is strongly tied to GridDhtPartitionsExchangeFuture#init and that is OK. The exchange worker also shouldn't be blocked for a long time, but in reality it happens. It also means that your change doesn't make sense. What actually makes sense is the identification of places which

Re: Critical worker threads liveness checking drawbacks

2018-09-26 Thread Vyacheslav Daradur
Hi Igniters! Thank you for this important improvement! I've looked through the implementation and noticed that GridDhtPartitionsExchangeFuture#init has not been wrapped in a blocking section. This means it is easy to halt the node in case of long-running actions during PME, for example when we create a

Re: Critical worker threads liveness checking drawbacks

2018-09-24 Thread Andrey Kuznetsov
Denis, I've created the ticket [1] with a short description of the functionality. [1] https://issues.apache.org/jira/browse/IGNITE-9679 On Mon, Sep 24, 2018 at 17:46, Denis Magda wrote: > Andrey K. and G., > > Thanks, do we have a documentation ticket created? Prachi (copied) can help > with the

Re: Critical worker threads liveness checking drawbacks

2018-09-24 Thread Denis Magda
Andrey K. and G., Thanks, do we have a documentation ticket created? Prachi (copied) can help with the documentation. -- Denis On Mon, Sep 24, 2018 at 5:51 AM Andrey Gura wrote: > Andrey, > > finally your change is merged to master branch. Congratulations and > thank you very much! :) > > I

Re: Critical worker threads liveness checking drawbacks

2018-09-24 Thread Andrey Gura
Andrey, finally your change is merged to the master branch. Congratulations and thank you very much! :) I think that the next step is a feature that will allow signaling about blocked threads to monitoring tools via MXBean. I hope you will continue development of this feature and provide your vision

Re: Critical worker threads liveness checking drawbacks

2018-09-11 Thread Andrey Kuznetsov
David, Maxim! Thanks a lot for your ideas. Unfortunately, I can't adopt all of them right now: the scope is much broader than the scope of the change I implement. I have talked to a group of Ignite committers, and we agreed to complete the change as follows. - Blocking instructions in

Re: Critical worker threads liveness checking drawbacks

2018-09-11 Thread vgrigorev
Reliability of Ignite is very important to me, so please consider the following idea: - Important threads such as the WAL writer (as an example of any critical thread) must not do any blocking action, in this way: - The WAL thread must be the management thread for all WAL operations - Child, worker thread of WAL

Re: Critical worker threads liveness checking drawbacks

2018-09-10 Thread David Harvey
When I've done this before, I've needed to find the oldest thread, and kill the node running that. From a language standpoint, Maxim's "without progress" is better than "heartbeat". For example, what I'm most interested in on a distributed system is which thread started the work it has not

Re: Critical worker threads liveness checking drawbacks

2018-09-10 Thread Maxim Muzafarov
I think we should find exact answers to these questions: 1. What exactly is a `critical` issue? 2. How can we find critical issues? 3. How can we handle critical issues? First, - Ignore uninterruptible actions (e.g. worker/service shutdown) - Long I/O operations (should be a configurable

Re: Critical worker threads liveness checking drawbacks

2018-09-09 Thread David Harvey
It would be safer to restart the entire cluster than to remove the last node for a cache that should be redundant. On Sun, Sep 9, 2018, 4:00 PM Andrey Gura wrote: > Hi, > > I agree with Yakov that we can provide some option that manage worker > liveness checker behavior in case of observing

Re: Critical worker threads liveness checking drawbacks

2018-09-09 Thread Andrey Gura
Hi, I agree with Yakov that we can provide some option that manages the worker liveness checker behavior when some worker is observed to be blocked for too long. At least it will be a workaround for cases when node failure is too annoying. A backups count threshold sounds good but I don't understand

Re: Critical worker threads liveness checking drawbacks

2018-09-08 Thread Andrey Kuznetsov
David, Yakov, I understand your fears. But liveness checks deal with _critical_ conditions, i.e. when such a condition is met we consider the node totally broken, and there is no sense in keeping it alive regardless of the data it contains. If we want to give it a chance, then the condition (long

Re: Critical worker threads liveness checking drawbacks

2018-09-08 Thread Yakov Zhdanov
Agree with David. We need to have an opportunity to set a backups count threshold (at runtime also!) that will not allow any automatic stop if there will be data loss. Andrey, what do you think? --Yakov

Re: Critical worker threads liveness checking drawbacks

2018-09-07 Thread David Harvey
There are at least two production cases that need to be distinguished: The first is where a single node restart will repair the problem (and you get the right node). The other cases are those where stopping the node will invalidate its backups, leaving only one copy of the data, and the problem

Re: Critical worker threads liveness checking drawbacks

2018-09-07 Thread Yakov Zhdanov
Yes, and you should suggest a solution, e.g. throttle rebalancing threads more to produce less load. What you are suggesting kills the idea of this enhancement. --Yakov 2018-09-07 19:03 GMT+03:00 Andrey Kuznetsov : > Yakov, > > Thanks for reply. Indeed, initial design assumed node termination when >

Re: Critical worker threads liveness checking drawbacks

2018-09-07 Thread Andrey Kuznetsov
Yakov, Thanks for the reply. Indeed, the initial design assumed node termination when a hanging critical thread has been detected. But sometimes this looks inappropriate. Suppose, for example, that fsync in the WAL writer thread takes too long, and we terminate the node. Upon rebalancing, this may lead to long fsyncs on

Re: Critical worker threads liveness checking drawbacks

2018-09-07 Thread Yakov Zhdanov
Andrey, I don't understand your point. In my opinion, the idea of these changes is to make the cluster more stable and responsive by eliminating hung nodes. I would not make too much difference between threads trapped in deadlock and threads hanging on fsync calls for too long. Both situations lead to