[jira] [Assigned] (IGNITE-6587) Ignite watchdog service
[ https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Kuznetsov reassigned IGNITE-6587:

    Assignee: Andrey Kuznetsov  (was: Andrey Gura)

> Ignite watchdog service
> -----------------------
>
>                 Key: IGNITE-6587
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6587
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 2.2
>            Reporter: Alexey Goncharuk
>            Assignee: Andrey Kuznetsov
>            Priority: Major
>              Labels: IEP-5
>             Fix For: 2.6
>
>         Attachments: watchdog.sh
>
>
> As described in [1], each Ignite node has a number of system-critical
> threads. We should implement a periodic check that calls the failure handler
> when one of the following conditions is detected:
> # A critical thread is no longer alive.
> # A critical thread 'hangs' for a long time, e.g. while executing a task
> extracted from the task queue.
> The actual list of system-critical threads can be found at [1].
> [1]
> https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
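For readers following the thread, the two conditions in the description above (thread no longer alive, thread making no progress) can be illustrated with a minimal Java sketch. This is not the implementation from the ticket or any Ignite API: the class `CriticalThreadWatchdog`, its methods, and the heartbeat scheme are all hypothetical names chosen for illustration, and the failure handler is modelled as a plain `Consumer<String>`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Hypothetical sketch of a periodic critical-thread check; not Ignite's API. */
final class CriticalThreadWatchdog {
    /** A monitored thread plus the time it last reported progress. */
    private static final class Entry {
        final Thread thread;
        volatile long lastHeartbeatNanos;

        Entry(Thread thread, long nowNanos) {
            this.thread = thread;
            this.lastHeartbeatNanos = nowNanos;
        }
    }

    private final Map<String, Entry> entries = new ConcurrentHashMap<>();
    private final long maxSilenceNanos;
    private final Consumer<String> failureHandler;

    CriticalThreadWatchdog(long maxSilenceNanos, Consumer<String> failureHandler) {
        this.maxSilenceNanos = maxSilenceNanos;
        this.failureHandler = failureHandler;
    }

    /** Starts monitoring a thread, keyed by its name. */
    void register(Thread thread) {
        entries.put(thread.getName(), new Entry(thread, System.nanoTime()));
    }

    /** Called by a monitored thread each time it finishes a unit of work. */
    void heartbeat(String threadName) {
        Entry e = entries.get(threadName);
        if (e != null)
            e.lastHeartbeatNanos = System.nanoTime();
    }

    /** One pass of the periodic check, using the current time. */
    List<String> checkOnce() {
        return checkOnce(System.nanoTime());
    }

    /** One pass of the check against an explicit clock value (useful in tests). */
    List<String> checkOnce(long nowNanos) {
        List<String> failures = new ArrayList<>();
        for (Map.Entry<String, Entry> e : entries.entrySet()) {
            Entry entry = e.getValue();
            if (!entry.thread.isAlive())
                failures.add(e.getKey() + ": thread is not alive");          // condition 1
            else if (nowNanos - entry.lastHeartbeatNanos > maxSilenceNanos)
                failures.add(e.getKey() + ": no progress within the limit"); // condition 2
        }
        failures.forEach(failureHandler); // hand each detected failure to the handler
        return failures;
    }
}
```

In a real node, `checkOnce()` would presumably be driven on a fixed period, for example from a `ScheduledExecutorService`, and each worker would call `heartbeat` after every task it takes from its queue. Note that `Thread.isAlive()` is also false for a thread that has not been started yet, so registration should happen after `start()`.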
[jira] [Assigned] (IGNITE-6587) Ignite watchdog service
[ https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Gura reassigned IGNITE-6587:
-----------------------------------

    Assignee: Andrey Gura

> Ignite watchdog service
> -----------------------
>
>                 Key: IGNITE-6587
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6587
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 2.2
>            Reporter: Alexey Goncharuk
>            Assignee: Andrey Gura
>            Priority: Major
>              Labels: IEP-5
>             Fix For: 2.6
>
>         Attachments: watchdog.sh
>
>
> We need to come up with a 'watchdog service' that monitors the local health
> of an Ignite node and kills the process under certain critical conditions.
> For example, if one of the mission-critical Ignite threads dies, the Ignite
> node must be stopped.
> At first glance, the list of critical threads is:
> disco-event-worker
> tcp-disco-sock-reader
> tcp-disco-srvr
> tcp-disco-msg-worker
> tcp-comm-worker
> grid-nio-worker-tcp-comm
> exchange-worker
> sys-stripe
> grid-timeout-worker
> db-checkpoint-thread
> wal-file-archiver
> ttl-cleanup-worker
> nio-acceptor
> The mechanism should support pluggable components so that the self-check can
> be extended via plugins.
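The last requirement above, a self-check that plugins can extend, could look roughly like the following Java sketch. It is only one possible shape, under the assumption that each pluggable check reports either a problem description or nothing; `SelfCheck`, `SelfCheckRegistry`, and `threadAlive` are hypothetical names and do not come from the ticket or from Ignite.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.CopyOnWriteArrayList;

/** Hypothetical plugin contract: a problem description, or empty when healthy. */
interface SelfCheck {
    Optional<String> run();
}

/** Hypothetical registry that the watchdog would poll; plugins add checks to it. */
final class SelfCheckRegistry {
    private final List<SelfCheck> checks = new CopyOnWriteArrayList<>();

    /** Registers an additional check, e.g. one contributed by a plugin. */
    void add(SelfCheck check) {
        checks.add(check);
    }

    /** Runs every registered check in registration order and collects the problems. */
    List<String> runAll() {
        List<String> problems = new ArrayList<>();
        for (SelfCheck check : checks) {
            try {
                check.run().ifPresent(problems::add);
            } catch (RuntimeException e) {
                // A failing plugin must not take the other checks down with it.
                problems.add("self-check threw: " + e);
            }
        }
        return problems;
    }

    /** Built-in style check: reports a problem when the given thread is not alive. */
    static SelfCheck threadAlive(Thread thread) {
        return () -> thread.isAlive()
            ? Optional.<String>empty()
            : Optional.of(thread.getName() + " is not alive");
    }
}
```

A caller would register one `threadAlive` check per critical thread from the list above, let plugins add their own `SelfCheck` instances, and stop the node if `runAll()` returns a non-empty list. Whether a throwing check should itself count as a critical condition is a design decision this sketch leaves open; here it is simply reported as a problem.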
[jira] [Assigned] (IGNITE-6587) Ignite watchdog service
[ https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Pavlov reassigned IGNITE-6587:
--------------------------------------

    Assignee:     (was: Dmitriy Pavlov)

> Ignite watchdog service
> -----------------------
>
>                 Key: IGNITE-6587
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6587
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 2.2
>            Reporter: Alexey Goncharuk
>              Labels: IEP-5
>             Fix For: 2.4
>
>         Attachments: watchdog.sh
>
>
> We need to come up with a 'watchdog service' that monitors the local health
> of an Ignite node and kills the process under certain critical conditions.
> For example, if one of the mission-critical Ignite threads dies, the Ignite
> node must be stopped.
> At first glance, the list of critical threads is:
> disco-event-worker
> tcp-disco-sock-reader
> tcp-disco-srvr
> tcp-disco-msg-worker
> tcp-comm-worker
> grid-nio-worker-tcp-comm
> exchange-worker
> sys-stripe
> grid-timeout-worker
> db-checkpoint-thread
> wal-file-archiver
> ttl-cleanup-worker
> nio-acceptor
> The mechanism should support pluggable components so that the self-check can
> be extended via plugins.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
[jira] [Assigned] (IGNITE-6587) Ignite watchdog service
[ https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Pavlov reassigned IGNITE-6587:
--------------------------------------

    Assignee: Dmitriy Pavlov

> Ignite watchdog service
> -----------------------
>
>                 Key: IGNITE-6587
>                 URL: https://issues.apache.org/jira/browse/IGNITE-6587
>             Project: Ignite
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 2.2
>            Reporter: Alexey Goncharuk
>            Assignee: Dmitriy Pavlov
>             Fix For: 2.4
>
>
> We need to come up with a 'watchdog service' that monitors the local health
> of an Ignite node and kills the process under certain critical conditions.
> For example, if one of the mission-critical Ignite threads dies, the Ignite
> node must be stopped.
> At first glance, the list of critical threads is:
> All TCP discovery threads
> All communication NIO threads (acceptor and workers)
> Exchange worker
> Striped pool threads
> Timeout Worker
> Checkpointer
> WAL archiver
> The mechanism should support pluggable components so that the self-check can
> be extended via plugins.