[jira] [Assigned] (IGNITE-6587) Ignite watchdog service

2018-05-03 Thread Andrey Kuznetsov (JIRA)

 [ https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Kuznetsov reassigned IGNITE-6587:


Assignee: Andrey Kuznetsov (was: Andrey Gura)

> Ignite watchdog service
> ---
>
> Key: IGNITE-6587
> URL: https://issues.apache.org/jira/browse/IGNITE-6587
> Project: Ignite
> Issue Type: Improvement
> Components: general
> Affects Versions: 2.2
> Reporter: Alexey Goncharuk
> Assignee: Andrey Kuznetsov
> Priority: Major
> Labels: IEP-5
> Fix For: 2.6
>
> Attachments: watchdog.sh
>
>
> As described in [1], each Ignite node has a number of system-critical 
> threads. We should implement a periodic check that calls the failure handler 
> when one of the following conditions is detected:
> # A critical thread is no longer alive.
> # A critical thread 'hangs' for a long time, e.g. while executing a task 
> taken from the task queue.
> The current list of system-critical threads can be found at [1].
> [1] https://cwiki.apache.org/confluence/display/IGNITE/IEP-14+Ignite+failures+handling
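
The two conditions in the description (a critical thread that is no longer alive, and one that hangs) could be checked roughly as in the sketch below. This is only an illustration under stated assumptions: the class `CriticalThreadWatchdog`, its methods, the heartbeat approach, and the `BiConsumer` standing in for the failure handler are all hypothetical and are not taken from the Ignite code base. Timestamps are passed in explicitly so the logic can be exercised without waiting.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

/** Hypothetical watchdog sketch; none of these names come from the Ignite code base. */
class CriticalThreadWatchdog {
    /** Timestamp (ms) of the last heartbeat reported by each registered thread. */
    private final Map<Thread, Long> lastHeartbeat = new ConcurrentHashMap<>();

    /** A thread that stays silent longer than this is treated as hung. */
    private final long hangTimeoutMs;

    /** Stand-in for the failure handler: receives the thread and a reason. */
    private final BiConsumer<Thread, String> failureHandler;

    CriticalThreadWatchdog(long hangTimeoutMs, BiConsumer<Thread, String> failureHandler) {
        this.hangTimeoutMs = hangTimeoutMs;
        this.failureHandler = failureHandler;
    }

    /** Registers a thread that has already been started. */
    void register(Thread t, long nowMs) {
        lastHeartbeat.put(t, nowMs);
    }

    /** Called by a critical thread itself, for example before taking the next task. */
    void heartbeat(Thread t, long nowMs) {
        lastHeartbeat.replace(t, nowMs);
    }

    /** Runs one pass of the check and returns the number of failures reported. */
    int check(long nowMs) {
        int failures = 0;

        for (Map.Entry<Thread, Long> e : lastHeartbeat.entrySet()) {
            Thread t = e.getKey();

            if (!t.isAlive()) {
                // Condition 1: the thread has terminated.
                failureHandler.accept(t, "DEAD");
                failures++;
            }
            else if (nowMs - e.getValue() > hangTimeoutMs) {
                // Condition 2: no heartbeat within the timeout.
                failureHandler.accept(t, "HUNG");
                failures++;
            }
        }

        return failures;
    }
}
```

In a running node, `check(System.currentTimeMillis())` would presumably be invoked on a fixed schedule, for instance from a `ScheduledExecutorService`; the timeout value and the reaction to a failure are left open here.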



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (IGNITE-6587) Ignite watchdog service

2018-04-19 Thread Andrey Gura (JIRA)

 [ https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrey Gura reassigned IGNITE-6587:
---

Assignee: Andrey Gura

> Ignite watchdog service
> ---
>
> Key: IGNITE-6587
> URL: https://issues.apache.org/jira/browse/IGNITE-6587
> Project: Ignite
> Issue Type: Improvement
> Components: general
> Affects Versions: 2.2
> Reporter: Alexey Goncharuk
> Assignee: Andrey Gura
> Priority: Major
> Labels: IEP-5
> Fix For: 2.6
>
> Attachments: watchdog.sh
>
>
> We need to come up with a 'watchdog service' that monitors the local health 
> of an Ignite node and kills the process under certain critical conditions.
> For example, if one of the mission-critical Ignite threads dies, the Ignite 
> node must be stopped.
> At first glance, the list of critical threads is:
> disco-event-worker
> tcp-disco-sock-reader
> tcp-disco-srvr
> tcp-disco-msg-worker
> tcp-comm-worker
> grid-nio-worker-tcp-comm
> exchange-worker
> sys-stripe
> grid-timeout-worker
> db-checkpoint-thread
> wal-file-archiver
> ttl-cleanup-worker
> nio-acceptor
> The mechanism should support pluggable components so that the self-check can 
> be extended via plugins.
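
One way to read the requirement for a pluggable self-check is a small extension interface plus a registry that runs every registered check. The sketch below is an assumption-laden illustration, not a design from the ticket: `HealthCheck`, `HealthCheckRegistry`, and their methods are hypothetical names, and how plugins would actually be discovered and registered is not specified in the issue.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.BooleanSupplier;

/** Hypothetical extension point: one self-check contributed by a plugin. */
interface HealthCheck {
    /** Name used when reporting a failed check. */
    String name();

    /** Returns true when the checked component looks healthy. */
    boolean healthy();

    /** Convenience factory that wraps a probe in a named check. */
    static HealthCheck of(String name, BooleanSupplier probe) {
        return new HealthCheck() {
            @Override public String name() { return name; }
            @Override public boolean healthy() { return probe.getAsBoolean(); }
        };
    }
}

/** Hypothetical registry that the watchdog would consult on every pass. */
class HealthCheckRegistry {
    /** Registered checks; copy-on-write so plugins can register concurrently. */
    private final List<HealthCheck> checks = new CopyOnWriteArrayList<>();

    void register(HealthCheck check) {
        checks.add(check);
    }

    /** Runs all checks and returns the names of those that failed. */
    List<String> failedChecks() {
        List<String> failed = new ArrayList<>();

        for (HealthCheck c : checks) {
            boolean ok;

            try {
                ok = c.healthy();
            }
            catch (RuntimeException e) {
                // A check that throws is treated as failed rather than aborting the pass.
                ok = false;
            }

            if (!ok)
                failed.add(c.name());
        }

        return failed;
    }
}
```

Treating an exception inside a check as a failure is a design choice made for this sketch, so that one misbehaving plugin cannot hide the results of the others.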





[jira] [Assigned] (IGNITE-6587) Ignite watchdog service

2017-12-05 Thread Dmitriy Pavlov (JIRA)

 [ https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Pavlov reassigned IGNITE-6587:
--

Assignee: (was: Dmitriy Pavlov)

> Ignite watchdog service
> ---
>
> Key: IGNITE-6587
> URL: https://issues.apache.org/jira/browse/IGNITE-6587
> Project: Ignite
> Issue Type: Improvement
> Components: general
> Affects Versions: 2.2
> Reporter: Alexey Goncharuk
> Labels: IEP-5
> Fix For: 2.4
>
> Attachments: watchdog.sh
>
>
> We need to come up with a 'watchdog service' that monitors the local health 
> of an Ignite node and kills the process under certain critical conditions.
> For example, if one of the mission-critical Ignite threads dies, the Ignite 
> node must be stopped.
> At first glance, the list of critical threads is:
> disco-event-worker
> tcp-disco-sock-reader
> tcp-disco-srvr
> tcp-disco-msg-worker
> tcp-comm-worker
> grid-nio-worker-tcp-comm
> exchange-worker
> sys-stripe
> grid-timeout-worker
> db-checkpoint-thread
> wal-file-archiver
> ttl-cleanup-worker
> nio-acceptor
> The mechanism should support pluggable components so that the self-check can 
> be extended via plugins.
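
Given a list of critical thread names like the one quoted above, a simple presence check could compare those names against the live threads of the JVM. The sketch below assumes that real thread names start with the listed strings (possibly followed by an index or other suffix), which is an assumption and not something the ticket states; the class `CriticalThreadNames` and its methods are hypothetical. `Thread.getAllStackTraces()` is the standard JDK call that returns the live threads.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

/** Hypothetical helper; the prefix matching rule is an assumption of this sketch. */
class CriticalThreadNames {
    /** Returns the names of all live threads in this JVM. */
    static List<String> liveThreadNames() {
        List<String> names = new ArrayList<>();

        for (Thread t : Thread.getAllStackTraces().keySet())
            names.add(t.getName());

        return names;
    }

    /** Returns every required prefix that no live thread name starts with. */
    static List<String> missingPrefixes(Collection<String> requiredPrefixes, Collection<String> liveNames) {
        List<String> missing = new ArrayList<>();

        for (String prefix : requiredPrefixes) {
            boolean found = false;

            for (String name : liveNames) {
                if (name.startsWith(prefix)) {
                    found = true;
                    break;
                }
            }

            if (!found)
                missing.add(prefix);
        }

        return missing;
    }
}
```

A call such as `CriticalThreadNames.missingPrefixes(required, CriticalThreadNames.liveThreadNames())` would then list the prefixes with no matching live thread. A real check would also need to know which threads are expected in a given node configuration, which this sketch does not model.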



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (IGNITE-6587) Ignite watchdog service

2017-10-18 Thread Dmitriy Pavlov (JIRA)

 [ https://issues.apache.org/jira/browse/IGNITE-6587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dmitriy Pavlov reassigned IGNITE-6587:
--

Assignee: Dmitriy Pavlov

> Ignite watchdog service
> ---
>
> Key: IGNITE-6587
> URL: https://issues.apache.org/jira/browse/IGNITE-6587
> Project: Ignite
> Issue Type: Improvement
> Components: general
> Affects Versions: 2.2
> Reporter: Alexey Goncharuk
> Assignee: Dmitriy Pavlov
> Fix For: 2.4
>
>
> We need to come up with a 'watchdog service' that monitors the local health 
> of an Ignite node and kills the process under certain critical conditions.
> For example, if one of the mission-critical Ignite threads dies, the Ignite 
> node must be stopped.
> At first glance, the list of critical threads is:
> All TCP discovery threads
> All communication NIO threads (acceptor and workers)
> Exchange worker
> Striped pool threads
> Timeout Worker
> Checkpointer 
> WAL archiver
> The mechanism should support pluggable components so that the self-check can 
> be extended via plugins.


