[jira] [Comment Edited] (CASSANDRA-12461) Add hooks to StorageService shutdown

2016-09-29 Thread Jeremiah Jordan (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15532856#comment-15532856
 ] 

Jeremiah Jordan edited comment on CASSANDRA-12461 at 9/29/16 2:01 PM:
--

If we put the hooks in 3.0 we can use them to fix CASSANDRA-12011, if we want 
to fix that there.


was (Author: jjordan):
If we put the hooks in 3.0 we can use them to fix CASSANDRA-12011. 

> Add hooks to StorageService shutdown
> 
>
> Key: CASSANDRA-12461
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12461
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Anthony Cozzie
>Assignee: Anthony Cozzie
> Fix For: 3.x
>
> Attachments: 
> 0001-CASSANDRA-12461-add-C-support-for-shutdown-runnables.patch
>
>
> The JVM will usually run shutdown hooks in parallel.  This can lead to 
> synchronization problems between Cassandra, services that depend on it, and 
> services it depends on.  This patch adds some simple support for shutdown 
> hooks to StorageService.
> This should nearly solve CASSANDRA-12011



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-12461) Add hooks to StorageService shutdown

2016-09-29 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15531974#comment-15531974
 ] 

Alex Petrov edited comment on CASSANDRA-12461 at 9/29/16 7:41 AM:
--

+1 on changes. 

The initial patch was marked as a "bug", I've re-labeled it as an 
"improvement". But since we have fixed several issues with drain process 
([CASSANDRA-12509], for example), might be good to have it in 3.0? Or should I 
remove hooks, only leave bugfix for 3.0 patch (will be a trivial change, too), 
just don't want to create confusion.. 

Backport to 3.0 was trivial, but we can just skip committing if anything. CI 
triggered for both branches (although your latest branch only squashed, without 
any changes on top).

|[12461-trunk|https://github.com/ifesdjeen/cassandra/tree/12461-trunk]|[dtest|http://cassci.datastax.com/job/ifesdjeen-12461-trunk-dtest/]|[utest|http://cassci.datastax.com/job/ifesdjeen-12461-trunk-testall/]|
|[12461-3.0|https://github.com/ifesdjeen/cassandra/tree/12461-3.0]|[dtest|http://cassci.datastax.com/job/ifesdjeen-12461-3.0-dtest/]|[utest|http://cassci.datastax.com/job/ifesdjeen-12461-3.0-testall/]|


was (Author: ifesdjeen):
+1 on changes. 

The initial patch was marked as a "bug", I've re-labeled it as an 
"improvement". But since we have fixed several issues with drain process 
([CASSANDRA-12509], for example), might be good to have it in 3.0? Or should I 
remove hooks, only leave bugfix for 3.0 patch (will be a trivial change, too), 
just don't want to create confusion.. 

Backport to 3.0 was trivial, but we can just skip committing if anything. CI 
triggered for both branches (although your latest branch only squashed, without 
any changes on top).

|[12461-trunk|https://github.com/ifesdjeen/cassandra/tree/12461-trunk]|[dtest|http://cassci.datastax.com/job/ifesdjeen-12461-trunk-dtest/]|[utest|http://cassci.datastax.com/job/ifesdjeen-12461-trunk-testall/]|
|[12461-trunk|https://github.com/ifesdjeen/cassandra/tree/12461-trunk]|[3.0|http://cassci.datastax.com/job/ifesdjeen-12461-trunk-3.0/]|[utest|http://cassci.datastax.com/job/ifesdjeen-12461-trunk-testall/]||[12461-3.0|https://github.com/ifesdjeen/cassandra/tree/12461-3.0]|[dtest|http://cassci.datastax.com/job/ifesdjeen-12461-3.0-dtest/]|[utest|http://cassci.datastax.com/job/ifesdjeen-12461-3.0-testall/]|

> Add hooks to StorageService shutdown
> 
>
> Key: CASSANDRA-12461
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12461
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Anthony Cozzie
>Assignee: Anthony Cozzie
> Fix For: 3.x
>
> Attachments: 
> 0001-CASSANDRA-12461-add-C-support-for-shutdown-runnables.patch
>
>
> The JVM will usually run shutdown hooks in parallel.  This can lead to 
> synchronization problems between Cassandra, services that depend on it, and 
> services it depends on.  This patch adds some simple support for shutdown 
> hooks to StorageService.
> This should nearly solve CASSANDRA-12011



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-12461) Add hooks to StorageService shutdown

2016-09-29 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15531998#comment-15531998
 ] 

Stefania edited comment on CASSANDRA-12461 at 9/29/16 7:38 AM:
---

+1 for committing the full patch to 3.0 since it fixes a number issues, 
including things like CASSANDRA-12397.

I'll do another quick round tomorrow, and if CI results are clean, we can 
commit.


was (Author: stefania):
+1 for committing the full patch to 3.0 since it fixes a number issues, 
including making problems like CASSANDRA-12397 much harder to reproduce.

I'll do another quick round tomorrow, and if CI results are clean, we can 
commit.

> Add hooks to StorageService shutdown
> 
>
> Key: CASSANDRA-12461
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12461
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Anthony Cozzie
>Assignee: Anthony Cozzie
> Fix For: 3.x
>
> Attachments: 
> 0001-CASSANDRA-12461-add-C-support-for-shutdown-runnables.patch
>
>
> The JVM will usually run shutdown hooks in parallel.  This can lead to 
> synchronization problems between Cassandra, services that depend on it, and 
> services it depends on.  This patch adds some simple support for shutdown 
> hooks to StorageService.
> This should nearly solve CASSANDRA-12011



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-12461) Add hooks to StorageService shutdown

2016-09-28 Thread Stefania (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15531686#comment-15531686
 ] 

Stefania edited comment on CASSANDRA-12461 at 9/29/16 3:46 AM:
---

It LGTM, with these two changes:

* I ill advised you yesterday regarding replacing the boolean with the 
operating mode, because we are not updating it atomically and we are not 
guarding against unwanted transitions, due to some bug somewhere, e.g. DRAINED 
-> DECOMMISSIONED. I did not want to handle all this and check all the callers 
of {{setMode}} and so I've reintroduced the boolean, I merely renamed it to 
{{isShutdown}}.

* I think we could have races if we check that a node is drained before 
starting a service with two JMX calls. I moved the check in the SS methods and 
made them synchronized.

|[patch|https://github.com/stef1927/cassandra/commits/12461]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12461-testall/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12461-dtest/]|

Let me know if you are OK with these two changes, if so, should we also 
back-port to 3.0?


was (Author: stefania):
It LGTM, with these two changes:

* I ill advised you yesterday regarding replacing the boolean with the 
operating mode, because we are not updating it atomically and we are not 
guarding against unwanted transitions, due to some bug somewhere, e.g. DRAINED 
-> DECOMMISSIONED or DRAINED -> NORMAL. I did not want to handle all this and 
check all the callers of {{setMode}} and so I've reintroduced the boolean, I 
merely renamed it to {{isShutdown}}.

* I think we could have races if we check that a node is drained before 
starting a service with two JMX calls. I move the check in the SS methods and 
made them synchronized.

|[patch|https://github.com/stef1927/cassandra/commits/12461]|[testall|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12461-testall/]|[dtest|http://cassci.datastax.com/view/Dev/view/stef1927/job/stef1927-12461-dtest/]|

Let me know if you are OK with these two changes, if so, should we also 
back-port to 3.0?

> Add hooks to StorageService shutdown
> 
>
> Key: CASSANDRA-12461
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12461
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Anthony Cozzie
>Assignee: Anthony Cozzie
> Fix For: 3.x
>
> Attachments: 
> 0001-CASSANDRA-12461-add-C-support-for-shutdown-runnables.patch
>
>
> The JVM will usually run shutdown hooks in parallel.  This can lead to 
> synchronization problems between Cassandra, services that depend on it, and 
> services it depends on.  This patch adds some simple support for shutdown 
> hooks to StorageService.
> This should nearly solve CASSANDRA-12011



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (CASSANDRA-12461) Add hooks to StorageService shutdown

2016-09-22 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513021#comment-15513021
 ] 

Alex Petrov edited comment on CASSANDRA-12461 at 9/22/16 12:37 PM:
---

I've discovered several more problems while working on this patch, in the last 
version (from [here|https://github.com/acoz/cassandra/commits/12461]):

  * node drain code was duplicated (with minor differences, which I indicate 
below), as I mentioned
  * it is possible to re-start services after drain which won't run regular 
shutdown path on jvm exit
  * if the node was drained, under Windows the timer resolution (added in 
[CASSANDRA-9634]) was not reset, since the node was considered "already 
drained" (although this existed before already)
  * same was happening with post-shutdown hooks in patch, since 
[here|https://github.com/acoz/cassandra/blob/f15cd6d2ea95540bfacd7285dc75d9d95999e5a2/src/java/org/apache/cassandra/service/StorageService.java#L575-L576]
 we'll return from runnable, since those services were shut down during 
{{drain}} 
[here|https://github.com/acoz/cassandra/blob/f15cd6d2ea95540bfacd7285dc75d9d95999e5a2/src/java/org/apache/cassandra/service/StorageService.java#L586-L589].
 So they wouldn't run at all if {{nodetool drain}} was called.
  * because logging system is shutdown in the post-shutdown hook, we depend on 
the order, although we have to guarantee that logging is available for all 
hooks and avoid any races or having to register hooks at the particular stage.

This is one of the reasons I was suggesting single drain process. 

I also suggest disallowing re-enabling auto-compaction, binary, gossip, handoff 
and thrift to ensure that we do not need to re-stop them in the final shutdown 
hook. Operator can not bring the node into "working" state after drain without 
restart anyways (one of the reasons is the fact that commit log is shut down by 
that time), and it was most likely never intended to do so.

I've made a comparison table to make it easier to see what {{drain()}} method 
was doing compared to {{drainOnShutdown}} runnable:

|| nodetool drain || shutdown drain hook |
| disables autocompaction   |   
  |
| shuts down compaction manager | |
| recycles commitlog segment recycling |
 |
| shuts down batchlog and hints earlier |   
  |
|   | flushes only tables with 
durable_writes |
|   | clears set timer resolution for 
windows |

I've combined the two processes, made clearer distinctions to allow running 
things in {{drainOnShutdown}}. Since we can run all the items from the 
{{nodetool drain}} part of the list during the normal node shutdown, the code 
got a bit simpler, too (the only difference is now logging). If this 
granularity is not enough, we have two more options:
  * run post-shutdown hooks directly before the JVM shutdown
  * have 3 stages: pre-, post- drain and pre-jvm shutdown instead

Although I prefer the current way.

Preliminary version of the update (also, CI pending): 
|[12461-trunk-v2|https://github.com/ifesdjeen/cassandra/tree/12461-trunk-v2]|[dtest|http://cassci.datastax.com/job/ifesdjeen-12461-trunk-v2-dtest/]|[utest|http://cassci.datastax.com/job/ifesdjeen-12461-trunk-v2-testall/]|

(I've discussed the change "in theory" with [~slebresne], although it's still 
worth for someone to take a deeper look at it, I'll ask around)


was (Author: ifesdjeen):
I've discovered several more problems while working on this patch:

  * node drain code was duplicated (with minor differences, which I indicate 
below)
  * if the node was drained, under Windows the timer resolution (added in 
[CASSANDRA-9634]) was not reset, since the node was considered "already 
drained".
  * same was happening with post-shutdown hooks, since 
[here|https://github.com/acoz/cassandra/blob/f15cd6d2ea95540bfacd7285dc75d9d95999e5a2/src/java/org/apache/cassandra/service/StorageService.java#L575-L576]
 we'll return from runnable, since those services were shut down during 
{{drain}} 
[here|https://github.com/acoz/cassandra/blob/f15cd6d2ea95540bfacd7285dc75d9d95999e5a2/src/java/org/apache/cassandra/service/StorageService.java#L586-L589].
 

This is one of the reasons I was advocating for a single consistent drain 
process. 
I also suggest disallowing re-enabling auto-compaction, binary, gossip, handoff 
and thrift to ensure that we do not need to re-stop them in the final shutdown 
hook. Operator can not bring the node into "working" state after drain without 
restart anyways (one of the reasons is the fact that commit log is shut down by 
that time), and it was most likely never intended to do so.

I've made a comparison table to make it easier to see what {{drain()}} method 
was doing compared to 

[jira] [Comment Edited] (CASSANDRA-12461) Add hooks to StorageService shutdown

2016-09-22 Thread Alex Petrov (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15513021#comment-15513021
 ] 

Alex Petrov edited comment on CASSANDRA-12461 at 9/22/16 11:28 AM:
---

I've discovered several more problems while working on this patch:

  * node drain code was duplicated (with minor differences, which I indicate 
below)
  * if the node was drained, under Windows the timer resolution (added in 
[CASSANDRA-9634]) was not reset, since the node was considered "already 
drained".
  * same was happening with post-shutdown hooks, since 
[here|https://github.com/acoz/cassandra/blob/f15cd6d2ea95540bfacd7285dc75d9d95999e5a2/src/java/org/apache/cassandra/service/StorageService.java#L575-L576]
 we'll return from runnable, since those services were shut down during 
{{drain}} 
[here|https://github.com/acoz/cassandra/blob/f15cd6d2ea95540bfacd7285dc75d9d95999e5a2/src/java/org/apache/cassandra/service/StorageService.java#L586-L589].
 

This is one of the reasons I was advocating for a single consistent drain 
process. 
I also suggest disallowing re-enabling auto-compaction, binary, gossip, handoff 
and thrift to ensure that we do not need to re-stop them in the final shutdown 
hook. Operator can not bring the node into "working" state after drain without 
restart anyways (one of the reasons is the fact that commit log is shut down by 
that time), and it was most likely never intended to do so.

I've made a comparison table to make it easier to see what {{drain()}} method 
was doing compared to {{drainOnShutdown}} runnable:

|| nodetool drain || shutdown drain hook |
| disables autocompaction   |   
  |
| shuts down compaction manager | |
| recycles commitlog segment recycling |
 |
| shuts down batchlog and hints earlier |   
  |
|   | flushes only tables with 
durable_writes |
|   | clears set timer resolution for 
windows |

I've combined the two processes, made clearer distinctions to allow running 
things in {{drainOnShutdown}}. Since we can run all the items from the 
{{nodetool drain}} part of the list during the normal node shutdown, the code 
got a bit simpler, too (the only difference is now logging). 

Preliminary version of the update (also, CI pending): 
|[12461-trunk-v2|https://github.com/ifesdjeen/cassandra/tree/12461-trunk-v2]|[dtest|http://cassci.datastax.com/job/ifesdjeen-12461-trunk-v2-dtest/]|[utest|http://cassci.datastax.com/job/ifesdjeen-12461-trunk-v2-testall/]|

(I've discussed the change "in theory" with [~slebresne], although it's still 
worth for someone to take a deeper look at it, I'll ask around)


was (Author: ifesdjeen):
I've discovered several more problems while working on this patch:

  * node drain code was duplicated (with minor differences, which I indicate 
below)
  * if the node was drained, under Windows the timer resolution (added in 
[CASSANDRA-9634]) was not reset, since the node was considered "already 
drained".
  * same was happening with post-shutdown hooks, since 
[here|https://github.com/acoz/cassandra/blob/f15cd6d2ea95540bfacd7285dc75d9d95999e5a2/src/java/org/apache/cassandra/service/StorageService.java#L575-L576]
 we'll return from runnable, since those services were shut down during 
{{drain}} 
[here|https://github.com/acoz/cassandra/blob/f15cd6d2ea95540bfacd7285dc75d9d95999e5a2/src/java/org/apache/cassandra/service/StorageService.java#L586-L589].
 

This is one of the reasons I was advocating for a single consistent drain 
process. 
I also suggest disallowing re-enabling auto-compaction, binary, gossip, handoff 
and thrift to ensure that we do not need to re-stop them in the final shutdown 
hook. Operator can not bring the node into "working" state after drain without 
restart anyways (one of the reasons is the fact that commit log is shut down by 
that time), and it was most likely never intended to do so. However, it might 
be useful for operator to run compactions on the drained node, so we only shut 
down compaction manager later. 

I've made a comparison table to make it easier to see what {{drain()}} method 
was doing compared to {{drainOnShutdown}} runnable:

|| nodetool drain || shutdown drain hook |
| disables autocompaction   |   
  |
| shuts down compaction manager | |
| recycles commitlog segment recycling |
 |
| shuts down batchlog and hints earlier |   
  |
|   | flushes only tables with 
durable_writes |
|   | clears set timer resolution for 
windows |

I've combined the two processes, made clearer distinctions to allow running 
things in 

[jira] [Comment Edited] (CASSANDRA-12461) Add hooks to StorageService shutdown

2016-08-29 Thread Anthony Cozzie (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-12461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15446337#comment-15446337
 ] 

Anthony Cozzie edited comment on CASSANDRA-12461 at 8/29/16 4:24 PM:
-

Ah, good catch.  I think this is a simple fix (logback's context aware doesn't 
happen in the constructor for some reason), at least it seems to make the error 
go away for me.  Updated my branches appropriately.

Also, the problem with removing the shutdown hook is that then nothing that we 
register here will run.  


was (Author: acoz):
Ah, good catch.  I think this is a simple fix (logback's context aware doesn't 
happen in the constructor for some reason).

Also, the problem with removing the shutdown hook is that then nothing that we 
register here will run.  

> Add hooks to StorageService shutdown
> 
>
> Key: CASSANDRA-12461
> URL: https://issues.apache.org/jira/browse/CASSANDRA-12461
> Project: Cassandra
>  Issue Type: Bug
>Reporter: Anthony Cozzie
>Assignee: Anthony Cozzie
> Attachments: 
> 0001-CASSANDRA-12461-add-C-support-for-shutdown-runnables.patch
>
>
> The JVM will usually run shutdown hooks in parallel.  This can lead to 
> synchronization problems between Cassandra, services that depend on it, and 
> services it depends on.  This patch adds some simple support for shutdown 
> hooks to StorageService.
> This should nearly solve CASSANDRA-12011



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)