[jira] [Commented] (CASSANDRA-5780) nodetool status and ring report incorrect/stale information after decommission

2015-09-28 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933745#comment-14933745
 ] 

John Sumsion commented on CASSANDRA-5780:
-

The only thing I wouldn't want to have happen is to accidentally issue some 
kind of truncate that, through a race condition, inadvertently gets replicated 
to the entire cluster.  I don't know the cassandra codebase well enough to tell 
whether that risk exists when calling {{ColumnFamilyStore.truncateBlocking()}}. 
From what I can tell it's likely safe, because once you get down to 
StorageService there is no cross-cluster effect of actions taken at that level.

Can anyone reply who knows better what cross-cluster effects 
{{truncateBlocking()}} might have?

The reason I don't have that concern with the 'system' keyspace is that it is 
never replicated.

Actually, looking into {{ColumnFamilyStore.truncateBlocking()}} makes me think 
that my proposed changes will blow up half-way through, because a side effect 
of truncating a table is writing back a "truncated at" record to the 
'system.local' table (which we just truncated).  I guess I need to run ccm with 
a locally built cassandra and try decommissioning to see what happens (not sure 
how to do that).
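
To pin down what I mean, here is the ordering I have in mind as a sketch (the 
helper and its call site are hypothetical, and whether truncating 
'system.local' last is actually safe is exactly what I need to verify in ccm):

{noformat}
import org.apache.cassandra.db.ColumnFamilyStore;
import org.apache.cassandra.db.Keyspace;

// Hypothetical cleanup step at the end of decommission (sketch only).
// 'system.local' goes last, because truncateBlocking() writes a
// "truncated at" record back into system.local as a side effect, which
// would partially undo an earlier truncation of that table.
private static void truncateSystemTablesAfterDecommission()
{
    Keyspace system = Keyspace.open("system");
    for (ColumnFamilyStore cfs : system.getColumnFamilyStores())
    {
        if (!cfs.name.equals("local"))
            cfs.truncateBlocking();
    }
    // Open question: does truncating system.local itself then re-write
    // a "truncated at" record into the table it just emptied?
    system.getColumnFamilyStore("local").truncateBlocking();
}
{noformat}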

> nodetool status and ring report incorrect/stale information after decommission
> --
>
> Key: CASSANDRA-5780
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5780
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Peter Haggerty
>Priority: Trivial
>  Labels: lhf, ponies, qa-resolved
> Fix For: 2.1.x
>
>
> Cassandra 1.2.6 ring of 12 instances, each with 256 tokens.
> Decommission 3 of the 12 nodes, one after another, resulting in a 9-instance ring.
> The 9 instances of cassandra that are in the ring all correctly report 
> nodetool status information for the ring and have the same data.
> After the first node is decommissioned:
> "nodetool status" on "decommissioned-1st" reports 11 nodes
> After the second node is decommissioned:
> "nodetool status" on "decommissioned-1st" reports 11 nodes
> "nodetool status" on "decommissioned-2nd" reports 10 nodes
> After the third node is decommissioned:
> "nodetool status" on "decommissioned-1st" reports 11 nodes
> "nodetool status" on "decommissioned-2nd" reports 10 nodes
> "nodetool status" on "decommissioned-3rd" reports 9 nodes
> The storage load information is similarly stale on the various decommissioned 
> nodes. The nodetool status and ring commands continue to return information 
> as if the nodes were still part of a cluster; they appear to return the last 
> information they saw.
> In contrast, the nodetool info command fails with an exception, which isn't 
> ideal but at least indicates that there was a failure rather than returning 
> stale information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4967) config options have different bounds when set via different methods

2015-09-25 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908373#comment-14908373
 ] 

John Sumsion commented on CASSANDRA-4967:
-

I got part way through applying the validation checks before I left.  Hopefully 
I can wrap it up.  I didn't get any feedback on the approach, so I'm just 
continuing.

This branch is rebased on top of the latest trunk as of now:
- https://github.com/jdsumsion/cassandra/tree/4967-config-validation
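
For reviewers, the shape of the change is roughly the following; a minimal 
sketch, not the branch verbatim, and the helper class name is mine:

{noformat}
import org.apache.cassandra.exceptions.ConfigurationException;

public final class BoundsCheck
{
    // One shared check, called from both the yaml-load path and the JMX
    // setters, so the two code paths can no longer enforce different bounds.
    public static void check(String name, double value, double min, double max)
        throws ConfigurationException
    {
        if (value < min || value > max)
            throw new ConfigurationException(name + " must be between " + min + " and " + max);
    }
}
{noformat}

e.g. {{setPhiConvictThreshold()}} would call 
{{BoundsCheck.check("phi_convict_threshold", value, 5, 16)}} before assigning, 
which is the same check {{loadYaml()}} applies at startup.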

> config options have different bounds when set via different methods
> ---
>
> Key: CASSANDRA-4967
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4967
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 1.2.0 beta 2
>Reporter: Robert Coli
>Priority: Minor
>  Labels: lhf
>
> (similar to some of the work done in 
> https://issues.apache.org/jira/browse/CASSANDRA-4479)
> If one sets a value in cassandra.yaml, that value might be subject to bounds 
> checking there. However, if one sets that same value via JMX, it doesn't get 
> set via a bounds-checking code path.
> ./src/java/org/apache/cassandra/config/DatabaseDescriptor.java (JMX set)
> {noformat}
> public static void setPhiConvictThreshold(double phiConvictThreshold)
> {
>     conf.phi_convict_threshold = phiConvictThreshold;
> }
> {noformat}
> Versus:
> ./src/java/org/apache/cassandra/config/DatabaseDescriptor.java (cassandra.yaml)
> {noformat}
> static void loadYaml()
> ...
>     /* phi convict threshold for FailureDetector */
>     if (conf.phi_convict_threshold < 5 || conf.phi_convict_threshold > 16)
>     {
>         throw new ConfigurationException("phi_convict_threshold must be between 5 and 16");
>     }
> {noformat}
> This seems to create a confusing situation where the range of potential 
> values for a given configuration option is different when set by different 
> methods.
> It's difficult to imagine a circumstance where you want bounds checking to 
> keep your node from starting if you set that value in cassandra.yaml, but 
> also want to allow circumvention of that bounds checking if you set it via 
> JMX.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-4967) config options have different bounds when set via different methods

2015-09-24 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905876#comment-14905876
 ] 

John Sumsion commented on CASSANDRA-4967:
-

I am partway through revamping the validation / defaults logic for config.  See 
this branch on github:
- https://github.com/jdsumsion/cassandra/tree/4967-config-validation

If I'm going the wrong direction, please let me know soon, as I want to wrap 
this up by the end of the summit.

> config options have different bounds when set via different methods
> ---
>
> Key: CASSANDRA-4967
> URL: https://issues.apache.org/jira/browse/CASSANDRA-4967
> Project: Cassandra
>  Issue Type: Improvement
>  Components: Core
>Affects Versions: 1.2.0 beta 2
>Reporter: Robert Coli
>Priority: Minor
>  Labels: lhf
>
> (similar to some of the work done in 
> https://issues.apache.org/jira/browse/CASSANDRA-4479)
> If one sets a value in cassandra.yaml, that value might be subject to bounds 
> checking there. However, if one sets that same value via JMX, it doesn't get 
> set via a bounds-checking code path.
> ./src/java/org/apache/cassandra/config/DatabaseDescriptor.java (JMX set)
> {noformat}
> public static void setPhiConvictThreshold(double phiConvictThreshold)
> {
>     conf.phi_convict_threshold = phiConvictThreshold;
> }
> {noformat}
> Versus:
> ./src/java/org/apache/cassandra/config/DatabaseDescriptor.java (cassandra.yaml)
> {noformat}
> static void loadYaml()
> ...
>     /* phi convict threshold for FailureDetector */
>     if (conf.phi_convict_threshold < 5 || conf.phi_convict_threshold > 16)
>     {
>         throw new ConfigurationException("phi_convict_threshold must be between 5 and 16");
>     }
> {noformat}
> This seems to create a confusing situation where the range of potential 
> values for a given configuration option is different when set by different 
> methods.
> It's difficult to imagine a circumstance where you want bounds checking to 
> keep your node from starting if you set that value in cassandra.yaml, but 
> also want to allow circumvention of that bounds checking if you set it via 
> JMX.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-5780) nodetool status and ring report incorrect/stale information after decommission

2015-09-23 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905133#comment-14905133
 ] 

John Sumsion commented on CASSANDRA-5780:
-

I'm working on this on trunk.  To ease backporting, the patch will not be 
JDK 1.8-specific, since this issue is open for 1.2, 2.x, and trunk.

> nodetool status and ring report incorrect/stale information after decommission
> --
>
> Key: CASSANDRA-5780
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5780
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Peter Haggerty
>Priority: Trivial
>  Labels: lhf, ponies, qa-resolved
> Fix For: 2.1.x
>
>
> Cassandra 1.2.6 ring of 12 instances, each with 256 tokens.
> Decommission 3 of the 12 nodes, one after another, resulting in a 9-instance ring.
> The 9 instances of cassandra that are in the ring all correctly report 
> nodetool status information for the ring and have the same data.
> After the first node is decommissioned:
> "nodetool status" on "decommissioned-1st" reports 11 nodes
> After the second node is decommissioned:
> "nodetool status" on "decommissioned-1st" reports 11 nodes
> "nodetool status" on "decommissioned-2nd" reports 10 nodes
> After the third node is decommissioned:
> "nodetool status" on "decommissioned-1st" reports 11 nodes
> "nodetool status" on "decommissioned-2nd" reports 10 nodes
> "nodetool status" on "decommissioned-3rd" reports 9 nodes
> The storage load information is similarly stale on the various decommissioned 
> nodes. The nodetool status and ring commands continue to return information 
> as if the nodes were still part of a cluster; they appear to return the last 
> information they saw.
> In contrast, the nodetool info command fails with an exception, which isn't 
> ideal but at least indicates that there was a failure rather than returning 
> stale information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-5780) nodetool status and ring report incorrect/stale information after decommission

2015-09-23 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905182#comment-14905182
 ] 

John Sumsion commented on CASSANDRA-5780:
-

Here is a branch on trunk:
- https://github.com/jdsumsion/cassandra/tree/5780-decomission-truncate-system

> nodetool status and ring report incorrect/stale information after decommission
> --
>
> Key: CASSANDRA-5780
> URL: https://issues.apache.org/jira/browse/CASSANDRA-5780
> Project: Cassandra
>  Issue Type: Bug
>  Components: Tools
>Reporter: Peter Haggerty
>Priority: Trivial
>  Labels: lhf, ponies, qa-resolved
> Fix For: 2.1.x
>
>
> Cassandra 1.2.6 ring of 12 instances, each with 256 tokens.
> Decommission 3 of the 12 nodes, one after another, resulting in a 9-instance ring.
> The 9 instances of cassandra that are in the ring all correctly report 
> nodetool status information for the ring and have the same data.
> After the first node is decommissioned:
> "nodetool status" on "decommissioned-1st" reports 11 nodes
> After the second node is decommissioned:
> "nodetool status" on "decommissioned-1st" reports 11 nodes
> "nodetool status" on "decommissioned-2nd" reports 10 nodes
> After the third node is decommissioned:
> "nodetool status" on "decommissioned-1st" reports 11 nodes
> "nodetool status" on "decommissioned-2nd" reports 10 nodes
> "nodetool status" on "decommissioned-3rd" reports 9 nodes
> The storage load information is similarly stale on the various decommissioned 
> nodes. The nodetool status and ring commands continue to return information 
> as if the nodes were still part of a cluster; they appear to return the last 
> information they saw.
> In contrast, the nodetool info command fails with an exception, which isn't 
> ideal but at least indicates that there was a failure rather than returning 
> stale information.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8169) Background bitrot detector to avoid client exposure

2014-10-28 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187597#comment-14187597
 ] 

John Sumsion commented on CASSANDRA-8169:
-

I don't care that much about marking-things-unrepaired-in-the-bitrot-case, 
because I believe it's easier to just replace a bitrot-susceptible node than it 
is to repair around the bitrot.

My main motivation in submitting this ticket is to make sure there is as 
lightweight a mechanism as possible (read-only, low-throughput) for 
periodically verifying that ALL data can be read, and failing the node as early 
as possible to stay ahead of the replacement curve.

The 'scrub' tool is not a good fit because it rewrites all the data.  The 
'repair' tool isn't either, because the move toward incremental-ness (awesome, 
btw) means it does not aggressively read all the data.  If a 'validate' tool 
existed, and if it triggered the 'disk_failure_policy' properly on all cases of 
corrupt data files (sstables, etc.), then that is exactly what I want.  The 
likelihood of cascading bitrot across boxes is not something I thought needed 
any attention.
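
To make that concrete, the whole tool could be about this small; a sketch 
against roughly 2.1-era internals (the class and entry point are illustrative, 
and the scanner types have moved around between versions), using the 
CASSANDRA-7507 hook so 'disk_failure_policy' applies:

{noformat}
import org.apache.cassandra.db.ColumnFamilyStore;
import org.apache.cassandra.db.columniterator.OnDiskAtomIterator;
import org.apache.cassandra.io.FSReadError;
import org.apache.cassandra.io.sstable.CorruptSSTableException;
import org.apache.cassandra.io.sstable.ISSTableScanner;
import org.apache.cassandra.io.sstable.SSTableReader;
import org.apache.cassandra.utils.JVMStabilityInspector;

public final class SSTableValidator
{
    // Read-only pass over every sstable of a table: nothing is rewritten,
    // and every partition and cell is forced through the read (and
    // checksum) path.  Corruption is handed to JVMStabilityInspector,
    // which is what applies the configured disk_failure_policy.
    public static void validate(ColumnFamilyStore cfs)
    {
        for (SSTableReader sstable : cfs.getSSTables())
        {
            ISSTableScanner scanner = sstable.getScanner();
            try
            {
                while (scanner.hasNext())
                {
                    OnDiskAtomIterator partition = scanner.next();
                    while (partition.hasNext())
                        partition.next();   // force every cell to be read
                }
            }
            catch (CorruptSSTableException | FSReadError e)
            {
                JVMStabilityInspector.inspectThrowable(e);
                throw e;
            }
            finally
            {
                scanner.close();
            }
        }
    }
}
{noformat}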

 Background bitrot detector to avoid client exposure
 ---

 Key: CASSANDRA-8169
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8169
 Project: Cassandra
  Issue Type: New Feature
Reporter: John Sumsion

 With a lot of static data sitting in SSTables, and with only a relatively 
 small add/edit rate, incremental repair sounds very good.  However, there is 
 one significant cost to switching away from full repair.
 If/when bitrot corrupts an SSTable, there is nothing standing between a user 
 query and a corruption/failure-response event except for the other replicas.  
 This combined with a rolling restart or upgrade can make a token range 
 non-writable via quorum CL.
 While you could argue that full repairs should be scheduled on a longer-term 
 regular basis, I don't really want all the repair overhead; I just want 
 something that can run ahead of user queries whose only responsibility is to 
 detect bitrot, so that I can replace nodes in an aggressive way instead of 
 having it be a failure-response situation.
 This bitrot detector need not incur the full cross-cluster cost of repair, 
 and so would be less of a burden to run periodically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-8169) Background bitrot detector to avoid client exposure

2014-10-23 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181852#comment-14181852
 ] 

John Sumsion commented on CASSANDRA-8169:
-

Yes, that would probably be sufficient, as long as it doesn't do any writes: 
just reads and a report of status.

If there is sstable checksum corruption, I would expect the corrupt sstable 
policy to be triggered, killing the node or ignoring, based on the config.

Does that mean we just kill this issue in favor of CASSANDRA-5791?  I think 
so.  Unless anyone sees value in keeping this issue alive (vs CASSANDRA-5791), 
I'll close it in a couple of days to give time for feedback.

 Background bitrot detector to avoid client exposure
 ---

 Key: CASSANDRA-8169
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8169
 Project: Cassandra
  Issue Type: New Feature
Reporter: John Sumsion

 With a lot of static data sitting in SSTables, and with only a relatively 
 small add/edit rate, incremental repair sounds very good.  However, there is 
 one significant cost to switching away from full repair.
 If/when bitrot corrupts an SSTable, there is nothing standing between a user 
 query and a corruption/failure-response event except for the other replicas.  
 This combined with a rolling restart or upgrade can make a token range 
 non-writable via quorum CL.
 While you could argue that full repairs should be scheduled on a longer-term 
 regular basis, I don't really want all the repair overhead; I just want 
 something that can run ahead of user queries whose only responsibility is to 
 detect bitrot, so that I can replace nodes in an aggressive way instead of 
 having it be a failure-response situation.
 This bitrot detector need not incur the full cross-cluster cost of repair, 
 and so would be less of a burden to run periodically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-5791) A nodetool command to validate all sstables in a node

2014-10-23 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181858#comment-14181858
 ] 

John Sumsion commented on CASSANDRA-5791:
-

If there is bitrot that causes a checksum failure, I assume the command 
proposed in this issue would cause the configured disk_failure_policy to take 
effect.  Is that true?

 A nodetool command to validate all sstables in a node
 -

 Key: CASSANDRA-5791
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5791
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
Reporter: sankalp kohli
Priority: Minor

 Currently there is no nodetool command to validate all sstables on disk. The 
 only way to do this is to run a repair and see if it succeeds. But we cannot 
 repair the system keyspace. 
 Also, we can run upgradesstables, but that rewrites all the sstables. 
 This command should check the hash of all sstables and return whether all 
 data is readable or not. This should NOT care about consistency. 
 The compressed sstables do not have a hash, so it is not clear how it will 
 work there.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-8169) Background bitrot detector to avoid client exposure

2014-10-22 Thread John Sumsion (JIRA)
John Sumsion created CASSANDRA-8169:
---

 Summary: Background bitrot detector to avoid client exposure
 Key: CASSANDRA-8169
 URL: https://issues.apache.org/jira/browse/CASSANDRA-8169
 Project: Cassandra
  Issue Type: New Feature
Reporter: John Sumsion


With a lot of static data sitting in SSTables, and with only a relatively small 
add/edit rate, incremental repair sounds very good.  However, there is one 
significant cost to switching away from full repair.

If/when bitrot corrupts an SSTable, there is nothing standing between a user 
query and a corruption/failure-response event except for the other replicas.  
This combined with a rolling restart or upgrade can make a token range 
non-writable via quorum CL.

While you could argue that full repairs should be scheduled on a longer-term 
regular basis, I don't really want all the repair overhead; I just want 
something that can run ahead of user queries whose only responsibility is to 
detect bitrot, so that I can replace nodes in an aggressive way instead of 
having it be a failure-response situation.

This bitrot detector need not incur the full cross-cluster cost of repair, and 
so would be less of a burden to run periodically.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error

2014-10-13 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170210#comment-14170210
 ] 

John Sumsion commented on CASSANDRA-7927:
-

LGTM

 Kill daemon on any disk error
 -

 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
Assignee: John Sumsion
  Labels: bootcamp, lhf
 Fix For: 2.1.1

 Attachments: 7927-v1-die.patch


 We got a disk read error on 1.2.13 that didn't trigger the disk failure 
 policy, and I'm trying to hunt down why, but in doing so, I saw that there is 
 no disk_failure_policy option for just killing the daemon.
 If we ever get a corrupt sstable, we want to replace the node anyway, because 
 some aws instance store disks just go bad.
 I want to use the JVMStabilityInspector from CASSANDRA-7507 to do the kill, 
 so that the mechanism stays standard; I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error

2014-10-10 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166813#comment-14166813
 ] 

John Sumsion commented on CASSANDRA-7927:
-

Looks great!  The compromise of checking the policy twice looks like it keeps 
the code less scattered.

Thanks for doing the cleanup!

 Kill daemon on any disk error
 -

 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
Assignee: John Sumsion
  Labels: bootcamp, lhf
 Fix For: 2.1.1

 Attachments: 7927-v1-die.patch


 We got a disk read error on 1.2.13 that didn't trigger the disk failure 
 policy, and I'm trying to hunt down why, but in doing so, I saw that there is 
 no disk_failure_policy option for just killing the daemon.
 If we ever get a corrupt sstable, we want to replace the node anyway, because 
 some aws instance store disks just go bad.
 I want to use the JVMStabilityInspector from CASSANDRA-7507 to do the kill, 
 so that the mechanism stays standard; I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7507) OOM creates unreliable state - die instantly better

2014-09-13 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132821#comment-14132821
 ] 

John Sumsion commented on CASSANDRA-7507:
-

In OOM situations, I've also had loggers fail to work -- it might be worth 
adding a {{t.printStackTrace(System.err)}} as a failsafe before we exit.

In the most extreme cases, I've also had loggers block entirely, so the 
System.exit() was never called; but that was on older JVMs and really sick app 
servers, so you may not hit that kind of thing.
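
Something like this as the very first thing in the handler (a sketch of the 
failsafe idea, not the 7507 patch itself; Java 7 syntax):

{noformat}
Thread.setDefaultUncaughtExceptionHandler(new Thread.UncaughtExceptionHandler()
{
    public void uncaughtException(Thread thread, Throwable t)
    {
        // Loggers can be broken or blocked under OOM, so write straight
        // to stderr before attempting anything fancier.
        t.printStackTrace(System.err);
        if (t instanceof OutOfMemoryError)
            Runtime.getRuntime().halt(100);  // skip shutdown hooks entirely
    }
});
{noformat}

{{halt()}} rather than {{System.exit()}} is deliberate: it skips the shutdown 
hooks, and under OOM the clean shutdown path can't be trusted anyway.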

 OOM creates unreliable state - die instantly better
 ---

 Key: CASSANDRA-7507
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7507
 Project: Cassandra
  Issue Type: New Feature
Reporter: Karl Mueller
Assignee: Joshua McKenzie
Priority: Minor
 Fix For: 2.1.1

 Attachments: 7507_v1.txt, 7507_v2.txt, 7507_v3_build.txt, 
 7507_v3_java.txt, exceptionHandlingResults.txt, findSwallowedExceptions.py


 I had a cassandra node run OOM. My heap had enough headroom, there was just 
 something which either was a bug or some unfortunate amount of short-term 
 memory utilization. This resulted in the following error:
  WARN [StorageServiceShutdownHook] 2014-06-30 09:38:38,251 StorageProxy.java 
 (line 1713) Some hints were not written before shutdown.  This is not 
 supposed to happen.  You should (a) run repair, and (b) file a bug report
 There are no other messages of relevance besides the OOM error about 90 
 minutes earlier.
 My (limited) understanding of the JVM and Cassandra says that when it goes 
 OOM, it will attempt to signal cassandra to shut down cleanly. The problem, 
 in my view, is that with an OOM situation, nothing is guaranteed anymore. I 
 believe it's impossible to reliably cleanly shut down at this point, and 
 therefore it's wrong to even try. 
 Yes, ideally things could be written out, flushed to disk, memory messages 
 written, other nodes notified, etc. but why is there any reason to believe 
 any of those steps could happen? Would happen? Couldn't bad data be written 
 at this point to disk rather than good data? Some network messages delivered, 
 but not others?
 I think Cassandra should have the option (and possibly default) to kill 
 itself immediately upon the OOM condition happening in a hard way, and not 
 rely on the java-based clean shutdown process. Cassandra already handles 
 recovery from unclean shutdown, and it's not a big deal. My node, for 
 example, kept in a sort-of alive state for 90 minutes where who knows what it 
 was doing or not doing.
 I don't know enough about the JVM and options for it to know the best exact 
 implementation of die instantly on OOM, but it should be something that's 
 possible either with some flags or a C library (which doesn't rely on java 
 memory to do something which it may not be able to get!)
 Short version: a kill -9 of all C* processes in that instance without needing 
 more java memory, when OOM is raised



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-7927) Kill daemon on any disk error

2014-09-13 Thread John Sumsion (JIRA)
John Sumsion created CASSANDRA-7927:
---

 Summary: Kill daemon on any disk error
 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
 Fix For: 2.1.1


We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, 
and I'm trying to hunt down why, but in doing so, I saw that there is no 
disk_failure_policy option for just killing the daemon.

If we ever get a corrupt sstable, we want to replace the node anyway, because 
some aws instance store disks just go bad.

I want to use the JVMStabilityInspector from CASSANDRA-7507 to do the kill, so 
that the mechanism stays standard; I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error

2014-09-13 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132841#comment-14132841
 ] 

John Sumsion commented on CASSANDRA-7927:
-

Options for the policy enum:
- kill
- die
- poison_pill
- cassandracide

I guess I like 'die' the best.
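
Whichever name wins, it's just one more value on the existing enum in 
{{Config.java}}; a sketch, with the pre-existing values and their meanings 
abbreviated:

{noformat}
// in org.apache.cassandra.config.Config
public static enum DiskFailurePolicy
{
    best_effort,  // stop using the failed disk, answer from what remains
    stop,         // shut down gossip and Thrift, leaving the node unusable
    ignore,       // log the error and carry on
    die           // new: kill the daemon outright on any disk error
}
{noformat}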

 Kill daemon on any disk error
 -

 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
 Fix For: 2.1.1


 We got a disk read error on 1.2.13 that didn't trigger the disk failure 
 policy, and I'm trying to hunt down why, but in doing so, I saw that there is 
 no disk_failure_policy option for just killing the daemon.
 If we ever get a corrupt sstable, we want to replace the node anyway, because 
 some aws instance store disks just go bad.
 I want to use the JVMStabilityInspector from CASSANDRA-7507 to do the kill, 
 so that the mechanism stays standard; I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CASSANDRA-7927) Kill daemon on any disk error

2014-09-13 Thread John Sumsion (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sumsion updated CASSANDRA-7927:

Attachment: 7927-v1-die.patch

The patch has unit tests for the commitlog part, but I couldn't find any good 
way to unit test FileUtil without plowing a lot of ground, so I kept the 
FileUtil changes DRYish.

 Kill daemon on any disk error
 -

 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
 Fix For: 2.1.1

 Attachments: 7927-v1-die.patch


 We got a disk read error on 1.2.13 that didn't trigger the disk failure 
 policy, and I'm trying to hunt down why, but in doing so, I saw that there is 
 no disk_failure_policy option for just killing the daemon.
 If we ever get a corrupt sstable, we want to replace the node anyway, because 
 some aws instance store disks just go bad.
 I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that 
 remains standard, so I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error

2014-09-13 Thread John Sumsion (JIRA)

[ 
https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132919#comment-14132919
 ] 

John Sumsion commented on CASSANDRA-7927:
-

NOTE: patch is based on CASSANDRA-7507, and adds tests for that patch also

 Kill daemon on any disk error
 -

 Key: CASSANDRA-7927
 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927
 Project: Cassandra
  Issue Type: New Feature
  Components: Core
 Environment: aws, stock cassandra or dse
Reporter: John Sumsion
  Labels: lhf
 Fix For: 2.1.1

 Attachments: 7927-v1-die.patch


 We got a disk read error on 1.2.13 that didn't trigger the disk failure 
 policy, and I'm trying to hunt down why, but in doing so, I saw that there is 
 no disk_failure_policy option for just killing the daemon.
 If we ever get a corrupt sstable, we want to replace the node anyway, because 
 some aws instance store disks just go bad.
 I want to use the JVMStabilityInspector from CASSANDRA-7507 to do the kill, 
 so that the mechanism stays standard; I will base my patch on CASSANDRA-7507.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CASSANDRA-5895) JDK7 u40 stack size fixes

2013-08-15 Thread John Sumsion (JIRA)
John Sumsion created CASSANDRA-5895:
---

 Summary: JDK7 u40 stack size fixes
 Key: CASSANDRA-5895
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5895
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra, as launched by the default script by JDK 
1.7.0_40 using CCM
Reporter: John Sumsion


I use Arch Linux and the latest OpenJDK, and run my cassandra via @pcmanus's 
ccm.

When I tried the cassandra-2.0.0 branch via:

{noformat}
$ ccm create dev -v git:cassandra-2.0.0
{noformat}

Then when I say:

{noformat}
$ ccm populate -n3
$ ccm node1 start -v
{noformat}

I get the following error:

{noformat}
xss =  -ea -javaagent:/home/blah/.ccm/repository/1.2.2/lib/jamm-0.2.5.jar 
-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1985M -Xmx1985M 
-Xmn496M -XX:+HeapDumpOnOutOfMemoryError -Xss180k

The stack size specified is too small, Specify at least 228k
Error starting node node1
Standard error output is:
Error: Could not create the Java Virtual Machine.
Error: A fatal exception has occurred. Program will exit.
{noformat}

I tracked it down to conf/cassandra-env.sh: changing -Xss180k to -Xss228k let 
the node start.
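
i.e., in conf/cassandra-env.sh (a sketch of the change; the surrounding 
context differs between branches):

{noformat}
# before:
JVM_OPTS="$JVM_OPTS -Xss180k"
# after (JDK7 u40's HotSpot refuses to start with anything below 228k):
JVM_OPTS="$JVM_OPTS -Xss228k"
{noformat}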

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (CASSANDRA-5895) JDK7 u40 stack size fixes

2013-08-15 Thread John Sumsion (JIRA)

 [ 
https://issues.apache.org/jira/browse/CASSANDRA-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

John Sumsion updated CASSANDRA-5895:


Attachment: 0002-Ignoring-files-produced-by-ant-build.patch
0001-Increasing-stack-size-to-avoid-error-on-OpenJDK-7-u4.patch

Here are two patches:
1) fix for this issue under JDK 1.7.0_40
2) fix for untracked files in git after running ant build

You can also fetch these from https://github.com/jdsumsion/cassandra (branch: 
jdk7-updates)

 JDK7 u40 stack size fixes
 -

 Key: CASSANDRA-5895
 URL: https://issues.apache.org/jira/browse/CASSANDRA-5895
 Project: Cassandra
  Issue Type: Bug
  Components: Core
 Environment: cassandra, as launched by the default script by JDK 
 1.7.0_40 using CCM
Reporter: John Sumsion
 Attachments: 
 0001-Increasing-stack-size-to-avoid-error-on-OpenJDK-7-u4.patch, 
 0002-Ignoring-files-produced-by-ant-build.patch


 I use Arch Linux and the latest OpenJDK, and run my cassandra via @pcmanus's 
 ccm.
 When I tried the cassandra-2.0.0 branch via:
 {noformat}
 $ ccm create dev -v git:cassandra-2.0.0
 {noformat}
 Then when I say:
 {noformat}
 $ ccm populate -n3
 $ ccm node1 start -v
 {noformat}
 I get the following error:
 {noformat}
 xss =  -ea -javaagent:/home/blah/.ccm/repository/1.2.2/lib/jamm-0.2.5.jar 
 -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1985M -Xmx1985M 
 -Xmn496M -XX:+HeapDumpOnOutOfMemoryError -Xss180k
 The stack size specified is too small, Specify at least 228k
 Error starting node node1
 Standard error output is:
 Error: Could not create the Java Virtual Machine.
 Error: A fatal exception has occurred. Program will exit.
 {noformat}
 I tracked it down to conf/cassandra-env.sh: changing -Xss180k to -Xss228k 
 let the node start.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira