[jira] [Commented] (CASSANDRA-5780) nodetool status and ring report incorrect/stale information after decommission
[ https://issues.apache.org/jira/browse/CASSANDRA-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14933745#comment-14933745 ] John Sumsion commented on CASSANDRA-5780: - The only thing I wouldn't want to have happen is to accidentally issue some kind of truncate that in a race condition inadvertently gets replicated to the entire cluster. I don't know the cassandra codebase well enough to understand whether that risk exists when calling {{ColumnFamilyStore.truncateBlocking()}}. From what I can tell, I think it's likely pretty safe, because once you get down to StorageService there is no cross-cluster effect of actions taken at that level. Can anyone who knows better reply about what cross-cluster effects {{truncateBlocking()}} might have? The reason I don't have that concern with the 'system' keyspace is that it is never replicated. Actually, looking into {{ColumnFamilyStore.truncateBlocking()}} makes me think that my proposed changes will blow up half-way through, because a side effect of truncating a table is writing back a "truncated at" record to the 'system.local' table (which we just truncated). I guess I need to run ccm with a locally built cassandra and try decommissioning to see what happens (not sure how to do that). > nodetool status and ring report incorrect/stale information after decommission > -- > > Key: CASSANDRA-5780 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5780 > Project: Cassandra > Issue Type: Bug > Components: Tools > Reporter: Peter Haggerty > Priority: Trivial > Labels: lhf, ponies, qa-resolved > Fix For: 2.1.x > > > Cassandra 1.2.6 ring of 12 instances, each with 256 tokens. > Decommission 3 of the 12 nodes, one after another, resulting in a 9-instance ring. > The 9 instances of cassandra that are in the ring all correctly report > nodetool status information for the ring and have the same data. 
> After the first node is decommissioned: > "nodetool status" on "decommissioned-1st" reports 11 nodes > After the second node is decommissioned: > "nodetool status" on "decommissioned-1st" reports 11 nodes > "nodetool status" on "decommissioned-2nd" reports 10 nodes > After the third node is decommissioned: > "nodetool status" on "decommissioned-1st" reports 11 nodes > "nodetool status" on "decommissioned-2nd" reports 10 nodes > "nodetool status" on "decommissioned-3rd" reports 9 nodes > The storage load information is similarly stale on the various decommissioned > nodes. The nodetool status and ring commands continue to return information > as if they were part of a cluster and they appear to return the last > information that they saw. > In contrast the nodetool info command fails with an exception, which isn't > ideal but at least indicates that there was a failure rather than returning > stale information. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
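The chicken-and-egg hazard described in the comment above can be sketched as follows (all names here are illustrative stand-ins for discussion, not Cassandra's actual truncation code): truncating 'system.local' triggers a write of a "truncated at" record back into the very table being truncated.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: models the suspected failure mode in
// truncateBlocking(), where truncating a table records a "truncated at"
// position by writing into system.local -- even when system.local is the
// table being truncated.
public class TruncateHazard
{
    public static final List<String> operations = new ArrayList<>();

    public static void truncateBlocking(String table)
    {
        operations.add("discard sstables: " + table);
        // Side effect: the truncation position is recorded...
        recordTruncatedAt(table);
    }

    private static void recordTruncatedAt(String table)
    {
        // ...by writing back into system.local, which may have just
        // been truncated by the caller.
        operations.add("write truncated-at record for " + table + " into system.local");
    }
}
```

Running `TruncateHazard.truncateBlocking("system.local")` makes the circular write visible in the operation log, which is the "blow up half-way through" risk the comment raises.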
[jira] [Commented] (CASSANDRA-4967) config options have different bounds when set via different methods
[ https://issues.apache.org/jira/browse/CASSANDRA-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14908373#comment-14908373 ] John Sumsion commented on CASSANDRA-4967: - I got part way through applying the validation checks before I left. Hopefully I can wrap it up. I didn't get any feedback on the approach, so I'm just continuing. This branch is rebased on top of the latest trunk as of now: - https://github.com/jdsumsion/cassandra/tree/4967-config-validation > config options have different bounds when set via different methods > --- > > Key: CASSANDRA-4967 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4967 > Project: Cassandra > Issue Type: Improvement > Components: Core > Affects Versions: 1.2.0 beta 2 > Reporter: Robert Coli > Priority: Minor > Labels: lhf > > (similar to some of the work done in > https://issues.apache.org/jira/browse/CASSANDRA-4479 > ) > If one sets a value in cassandra.yaml, that value might be subject to bounds > checking there. However if one sets that same value via JMX, it doesn't get > set via a bounds-checking code path. > "./src/java/org/apache/cassandra/config/DatabaseDescriptor.java" (JMX set) > {noformat} > public static void setPhiConvictThreshold(double phiConvictThreshold) > { > conf.phi_convict_threshold = phiConvictThreshold; > } > {noformat} > Versus... > ./src/java/org/apache/cassandra/config/DatabaseDescriptor.java > (cassandra.yaml) > {noformat} > static void loadYaml() > ... > /* phi convict threshold for FailureDetector */ > if (conf.phi_convict_threshold < 5 || conf.phi_convict_threshold > 16) > { > throw new ConfigurationException("phi_convict_threshold must > be between 5 and 16"); > } > {noformat} > This seems to create a confusing situation where the range of potential > values for a given configuration option is different when set by different > methods. 
> It's difficult to imagine a circumstance where you want bounds checking to > keep your node from starting if you set that value in cassandra.yaml, but > also want to allow circumvention of that bounds checking if you set via JMX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
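A minimal sketch of the shared-validation approach the branch is pursuing (class and field names here are assumptions for illustration, not the actual DatabaseDescriptor code): route both the YAML load path and the JMX setter through one bounds check, so the range is identical no matter how the value is set.

```java
// Sketch: a single bounds check shared by the YAML loader and the JMX
// setter, so phi_convict_threshold obeys the same 5..16 range however
// it is set.
public class ConfigValidation
{
    private static double phiConvictThreshold = 8.0; // stand-in for conf.phi_convict_threshold

    // Called from both the loadYaml() path and the JMX code path.
    public static void validatePhiConvictThreshold(double value)
    {
        if (value < 5 || value > 16)
            throw new IllegalArgumentException("phi_convict_threshold must be between 5 and 16");
    }

    public static void setPhiConvictThreshold(double value)
    {
        validatePhiConvictThreshold(value); // JMX now hits the same check as YAML
        phiConvictThreshold = value;
    }

    public static double getPhiConvictThreshold()
    {
        return phiConvictThreshold;
    }
}
```

With this shape, an out-of-range JMX set fails loudly instead of silently circumventing the startup-time check.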
[jira] [Commented] (CASSANDRA-4967) config options have different bounds when set via different methods
[ https://issues.apache.org/jira/browse/CASSANDRA-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905876#comment-14905876 ] John Sumsion commented on CASSANDRA-4967: - I am part-way through revamping the validation / defaults logic for config. See this branch on github: - https://github.com/jdsumsion/cassandra/tree/4967-config-validation If I'm going the wrong direction, please let me know soon, as I want to wrap this up by the end of the summit. > config options have different bounds when set via different methods > --- > > Key: CASSANDRA-4967 > URL: https://issues.apache.org/jira/browse/CASSANDRA-4967 > Project: Cassandra > Issue Type: Improvement > Components: Core > Affects Versions: 1.2.0 beta 2 > Reporter: Robert Coli > Priority: Minor > Labels: lhf > > (similar to some of the work done in > https://issues.apache.org/jira/browse/CASSANDRA-4479 > ) > If one sets a value in cassandra.yaml, that value might be subject to bounds > checking there. However if one sets that same value via JMX, it doesn't get > set via a bounds-checking code path. > "./src/java/org/apache/cassandra/config/DatabaseDescriptor.java" (JMX set) > {noformat} > public static void setPhiConvictThreshold(double phiConvictThreshold) > { > conf.phi_convict_threshold = phiConvictThreshold; > } > {noformat} > Versus... > ./src/java/org/apache/cassandra/config/DatabaseDescriptor.java > (cassandra.yaml) > {noformat} > static void loadYaml() > ... > /* phi convict threshold for FailureDetector */ > if (conf.phi_convict_threshold < 5 || conf.phi_convict_threshold > 16) > { > throw new ConfigurationException("phi_convict_threshold must > be between 5 and 16"); > } > {noformat} > This seems to create a confusing situation where the range of potential > values for a given configuration option is different when set by different > methods. 
> It's difficult to imagine a circumstance where you want bounds checking to > keep your node from starting if you set that value in cassandra.yaml, but > also want to allow circumvention of that bounds checking if you set via JMX. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-5780) nodetool status and ring report incorrect/stale information after decommission
[ https://issues.apache.org/jira/browse/CASSANDRA-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905133#comment-14905133 ] John Sumsion commented on CASSANDRA-5780: - I'm working on this on trunk; the patch will not be JDK 1.8-specific, to ease backporting, since this issue is open for 1.2, 2.x, and trunk. > nodetool status and ring report incorrect/stale information after decommission > -- > > Key: CASSANDRA-5780 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5780 > Project: Cassandra > Issue Type: Bug > Components: Tools > Reporter: Peter Haggerty > Priority: Trivial > Labels: lhf, ponies, qa-resolved > Fix For: 2.1.x > > > Cassandra 1.2.6 ring of 12 instances, each with 256 tokens. > Decommission 3 of the 12 nodes, one after another, resulting in a 9-instance ring. > The 9 instances of cassandra that are in the ring all correctly report > nodetool status information for the ring and have the same data. > After the first node is decommissioned: > "nodetool status" on "decommissioned-1st" reports 11 nodes > After the second node is decommissioned: > "nodetool status" on "decommissioned-1st" reports 11 nodes > "nodetool status" on "decommissioned-2nd" reports 10 nodes > After the third node is decommissioned: > "nodetool status" on "decommissioned-1st" reports 11 nodes > "nodetool status" on "decommissioned-2nd" reports 10 nodes > "nodetool status" on "decommissioned-3rd" reports 9 nodes > The storage load information is similarly stale on the various decommissioned > nodes. The nodetool status and ring commands continue to return information > as if they were part of a cluster and they appear to return the last > information that they saw. > In contrast the nodetool info command fails with an exception, which isn't > ideal but at least indicates that there was a failure rather than returning > stale information. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-5780) nodetool status and ring report incorrect/stale information after decommission
[ https://issues.apache.org/jira/browse/CASSANDRA-5780?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14905182#comment-14905182 ] John Sumsion commented on CASSANDRA-5780: - Here is a branch on trunk: - https://github.com/jdsumsion/cassandra/tree/5780-decomission-truncate-system > nodetool status and ring report incorrect/stale information after decommission > -- > > Key: CASSANDRA-5780 > URL: https://issues.apache.org/jira/browse/CASSANDRA-5780 > Project: Cassandra > Issue Type: Bug > Components: Tools > Reporter: Peter Haggerty > Priority: Trivial > Labels: lhf, ponies, qa-resolved > Fix For: 2.1.x > > > Cassandra 1.2.6 ring of 12 instances, each with 256 tokens. > Decommission 3 of the 12 nodes, one after another, resulting in a 9-instance ring. > The 9 instances of cassandra that are in the ring all correctly report > nodetool status information for the ring and have the same data. > After the first node is decommissioned: > "nodetool status" on "decommissioned-1st" reports 11 nodes > After the second node is decommissioned: > "nodetool status" on "decommissioned-1st" reports 11 nodes > "nodetool status" on "decommissioned-2nd" reports 10 nodes > After the third node is decommissioned: > "nodetool status" on "decommissioned-1st" reports 11 nodes > "nodetool status" on "decommissioned-2nd" reports 10 nodes > "nodetool status" on "decommissioned-3rd" reports 9 nodes > The storage load information is similarly stale on the various decommissioned > nodes. The nodetool status and ring commands continue to return information > as if they were part of a cluster and they appear to return the last > information that they saw. > In contrast the nodetool info command fails with an exception, which isn't > ideal but at least indicates that there was a failure rather than returning > stale information. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-8169) Background bitrot detector to avoid client exposure
[ https://issues.apache.org/jira/browse/CASSANDRA-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14187597#comment-14187597 ] John Sumsion commented on CASSANDRA-8169: - I don't care that much about marking-things-unrepaired-in-the-bitrot-case because I believe it's easier to just replace a bitrot-susceptible node than it is to repair around the bitrot. My main motivation in submitting this ticket is to make sure there is as lightweight a mechanism as possible (read-only, low-throughput) for periodically verifying that ALL data can be read, and failing the node as early as possible to stay ahead of the replacement curve. The 'scrub' tool is not good because it rewrites all the data. The 'repair' tool is not good because the move toward incremental-ness (awesome, btw) means it does not aggressively read all the data. If a 'validate' tool existed, and if it triggered the 'disk_failure_policy' properly on all cases of corrupt data files (sstables, etc), then that is what I want. The likelihood of cascading bitrot across boxes is not something I thought needed any attention. Background bitrot detector to avoid client exposure --- Key: CASSANDRA-8169 URL: https://issues.apache.org/jira/browse/CASSANDRA-8169 Project: Cassandra Issue Type: New Feature Reporter: John Sumsion With a lot of static data sitting in SSTables, and with only a relatively small add/edit rate, incremental repair sounds very good. However, there is one significant cost to switching away from full repair. If/when bitrot corrupts an SSTable, there is nothing standing between a user query and a corruption/failure-response event except for the other replicas. This combined with a rolling restart or upgrade can make a token range non-writable via quorum CL. 
While you could argue that full repairs should be scheduled on a longer-term regular basis, I don't really care about all the repair overhead, I just want something that can run ahead of user queries whose only responsibility is to detect bitrot, so that I can replace nodes in an aggressive way instead of having it be a failure-response situation. This bitrot detector need not incur the full cross-cluster cost of repair, and so would be less of a burden to run periodically. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
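The read-only validation being asked for could look roughly like this (a sketch using CRC32 over opaque chunks; Cassandra's actual sstable checksum layout is not modeled here): recompute each chunk's checksum and compare it to the value recorded at write time, with no rewriting and no cross-cluster traffic.

```java
import java.util.zip.CRC32;

// Sketch of a read-only bitrot check: recompute a chunk's checksum and
// compare it to the checksum stored at write time, without rewriting
// anything (unlike scrub) and without cross-cluster cost (unlike repair).
public class BitrotCheck
{
    public static long checksum(byte[] chunk)
    {
        CRC32 crc = new CRC32();
        crc.update(chunk, 0, chunk.length);
        return crc.getValue();
    }

    // True if the chunk still matches its stored checksum; a false result
    // is where disk_failure_policy would be triggered for the node.
    public static boolean isIntact(byte[] chunk, long storedChecksum)
    {
        return checksum(chunk) == storedChecksum;
    }
}
```

A background scanner would walk every data file at low throughput calling something like isIntact per chunk, failing the node early on the first mismatch.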
[jira] [Commented] (CASSANDRA-8169) Background bitrot detector to avoid client exposure
[ https://issues.apache.org/jira/browse/CASSANDRA-8169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181852#comment-14181852 ] John Sumsion commented on CASSANDRA-8169: - Yes, that would probably be sufficient. As long as it doesn't do any writes, just reads and a report of status. If there is sstable checksum corruption, I would expect the corrupt sstable policy to be triggered, killing the node or ignoring, based on the config. Does that mean that we just kill this issue in favor of CASSANDRA-5791? I think so. Unless anyone sees any value in keeping this issue alive (vs CASSANDRA-5791), I'll close this issue in a couple days to give time for feedback. Background bitrot detector to avoid client exposure --- Key: CASSANDRA-8169 URL: https://issues.apache.org/jira/browse/CASSANDRA-8169 Project: Cassandra Issue Type: New Feature Reporter: John Sumsion With a lot of static data sitting in SSTables, and with only a relatively small add/edit rate, incremental repair sounds very good. However, there is one significant cost to switching away from full repair. If/when bitrot corrupts an SSTable, there is nothing standing between a user query and a corruption/failure-response event except for the other replicas. This combined with a rolling restart or upgrade can make a token range non-writable via quorum CL. While you could argue that full repairs should be scheduled on a longer-term regular basis, I don't really care about all the repair overhead, I just want something that can run ahead of user queries whose only responsibility is to detect bitrot, so that I can replace nodes in an aggressive way instead of having it be a failure-response situation. This bitrot detector need not incur the full cross-cluster cost of repair, and so would be less of a burden to run periodically. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-5791) A nodetool command to validate all sstables in a node
[ https://issues.apache.org/jira/browse/CASSANDRA-5791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14181858#comment-14181858 ] John Sumsion commented on CASSANDRA-5791: - If there is bitrot that causes a checksum failure, I assume that this issue would cause the configured disk_failure_policy to take effect, is that true? A nodetool command to validate all sstables in a node - Key: CASSANDRA-5791 URL: https://issues.apache.org/jira/browse/CASSANDRA-5791 Project: Cassandra Issue Type: New Feature Components: Core Reporter: sankalp kohli Priority: Minor Currently there is no nodetool command to validate all sstables on disk. The only way to do this is to run a repair and see if it succeeds. But we cannot repair the system keyspace. Also we can run upgrade sstables but that rewrites all the sstables. This command should check the hash of all sstables and return whether all data is readable or not. This should NOT care about consistency. The compressed sstables do not have a hash, so not sure how it will work there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-8169) Background bitrot detector to avoid client exposure
John Sumsion created CASSANDRA-8169: --- Summary: Background bitrot detector to avoid client exposure Key: CASSANDRA-8169 URL: https://issues.apache.org/jira/browse/CASSANDRA-8169 Project: Cassandra Issue Type: New Feature Reporter: John Sumsion With a lot of static data sitting in SSTables, and with only a relatively small add/edit rate, incremental repair sounds very good. However, there is one significant cost to switching away from full repair. If/when bitrot corrupts an SSTable, there is nothing standing between a user query and a corruption/failure-response event except for the other replicas. This combined with a rolling restart or upgrade can make a token range non-writable via quorum CL. While you could argue that full repairs should be scheduled on a longer-term regular basis, I don't really care about all the repair overhead, I just want something that can run ahead of user queries whose only responsibility is to detect bitrot, so that I can replace nodes in an aggressive way instead of having it be a failure-response situation. This bitrot detector need not incur the full cross-cluster cost of repair, and so would be less of a burden to run periodically. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error
[ https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14170210#comment-14170210 ] John Sumsion commented on CASSANDRA-7927: - LGTM Kill daemon on any disk error - Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Assignee: John Sumsion Labels: bootcamp, lhf Fix For: 2.1.1 Attachments: 7927-v1-die.patch We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error
[ https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14166813#comment-14166813 ] John Sumsion commented on CASSANDRA-7927: - Looks great! The compromise on checking policy twice looks like it keeps the code less scattered. Thanks for doing the cleanup! Kill daemon on any disk error - Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Assignee: John Sumsion Labels: bootcamp, lhf Fix For: 2.1.1 Attachments: 7927-v1-die.patch We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7507) OOM creates unreliable state - die instantly better
[ https://issues.apache.org/jira/browse/CASSANDRA-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132821#comment-14132821 ] John Sumsion commented on CASSANDRA-7507: - In OOM situations, I've also had loggers fail to work -- it might be worth adding a t.printStackTrace(System.err) as a failsafe before we exit. In the most extreme cases, I've also had loggers totally block, and the System.exit() was never called, but that was on older jvms, and on really sick app servers, so you may not hit that kind of thing. OOM creates unreliable state - die instantly better --- Key: CASSANDRA-7507 URL: https://issues.apache.org/jira/browse/CASSANDRA-7507 Project: Cassandra Issue Type: New Feature Reporter: Karl Mueller Assignee: Joshua McKenzie Priority: Minor Fix For: 2.1.1 Attachments: 7507_v1.txt, 7507_v2.txt, 7507_v3_build.txt, 7507_v3_java.txt, exceptionHandlingResults.txt, findSwallowedExceptions.py I had a cassandra node run OOM. My heap had enough headroom, there was just something which either was a bug or some unfortunate amount of short-term memory utilization. This resulted in the following error: WARN [StorageServiceShutdownHook] 2014-06-30 09:38:38,251 StorageProxy.java (line 1713) Some hints were not written before shutdown. This is not supposed to happen. You should (a) run repair, and (b) file a bug report There are no other messages of relevance besides the OOM error about 90 minutes earlier. My (limited) understanding of the JVM and Cassandra says that when it goes OOM, it will attempt to signal cassandra to shut down cleanly. The problem, in my view, is that with an OOM situation, nothing is guaranteed anymore. I believe it's impossible to reliably cleanly shut down at this point, and therefore it's wrong to even try. Yes, ideally things could be written out, flushed to disk, memory messages written, other nodes notified, etc. but why is there any reason to believe any of those steps could happen? Would happen? 
Couldn't bad data be written at this point to disk rather than good data? Some network messages delivered, but not others? I think Cassandra should have the option to (and possibly default) to kill itself immediately upon the OOM condition happening in a hard way, and not rely on the java-based clean shutdown process. Cassandra already handles recovery from unclean shutdown, and it's not a big deal. My node, for example, kept in a sort-of alive state for 90 minutes where who knows what it was doing or not doing. I don't know enough about the JVM and options for it to know the best exact implementation of die instantly on OOM, but it should be something that's possible either with some flags or a C library (which doesn't rely on java memory to do something which it may not be able to get!) Short version: a kill -9 of all C* processes in that instance without needing more java memory, when OOM is raised -- This message was sent by Atlassian JIRA (v6.3.4#6332)
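The printStackTrace-then-halt idea from the comment above can be sketched like this (a hedged illustration, not the actual CASSANDRA-7507 patch): Runtime.halt() terminates the JVM without running shutdown hooks, avoiding the unreliable clean-shutdown path entirely.

```java
// Sketch: on an uncaught OutOfMemoryError, write the stack trace straight
// to stderr (loggers may themselves need memory and fail), then halt the
// JVM without running shutdown hooks.
public class OomFailsafe
{
    public static void install()
    {
        Thread.setDefaultUncaughtExceptionHandler((thread, t) -> {
            t.printStackTrace(System.err); // failsafe if logging is broken
            if (t instanceof OutOfMemoryError)
                Runtime.getRuntime().halt(2); // unlike System.exit(), skips shutdown hooks
        });
    }
}
```

halt() still runs Java code inside a starved heap, so it is not as strong as the kill -9-from-outside approach the reporter asks for, but it removes the clean-shutdown machinery from the path.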
[jira] [Created] (CASSANDRA-7927) Kill daemon on any disk error
John Sumsion created CASSANDRA-7927: --- Summary: Kill daemon on any disk error Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Fix For: 2.1.1 We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error
[ https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132841#comment-14132841 ] John Sumsion commented on CASSANDRA-7927: - Options for the policy enum: - kill - die - poison_pill - cassandracide I guess I like 'die' the best. Kill daemon on any disk error - Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Fix For: 2.1.1 We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
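A sketch of how the favored 'die' value might slot into the policy enum (the other values listed here are assumptions for illustration, not necessarily the exact set in the source):

```java
// Sketch: disk_failure_policy options with the proposed 'die' value added.
public enum DiskFailurePolicy
{
    best_effort, // deactivate the bad disk and keep serving from the rest
    stop,        // stop gossip and client transports, leave the JVM up
    ignore,      // log the error and carry on
    die          // proposed: kill the daemon outright on any disk error
}
```

Since cassandra.yaml values map onto enum constants by name, a lowercase 'die' keeps the config spelling consistent with the existing policies.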
[jira] [Updated] (CASSANDRA-7927) Kill daemon on any disk error
[ https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Sumsion updated CASSANDRA-7927: Attachment: 7927-v1-die.patch The patch has unit tests for the commitlog part, but I couldn't find any good way to unit test FileUtil without plowing a lot of ground; I kept the changes in FileUtil DRYish. Kill daemon on any disk error - Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Fix For: 2.1.1 Attachments: 7927-v1-die.patch We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (CASSANDRA-7927) Kill daemon on any disk error
[ https://issues.apache.org/jira/browse/CASSANDRA-7927?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14132919#comment-14132919 ] John Sumsion commented on CASSANDRA-7927: - NOTE: patch is based on CASSANDRA-7507, and adds tests for that patch also Kill daemon on any disk error - Key: CASSANDRA-7927 URL: https://issues.apache.org/jira/browse/CASSANDRA-7927 Project: Cassandra Issue Type: New Feature Components: Core Environment: aws, stock cassandra or dse Reporter: John Sumsion Labels: lhf Fix For: 2.1.1 Attachments: 7927-v1-die.patch We got a disk read error on 1.2.13 that didn't trigger the disk failure policy, and I'm trying to hunt down why, but in doing so, I saw that there is no disk_failure_policy option for just killing the daemon. If we ever get a corrupt sstable, we want to replace the node anyway, because some aws instance store disks just go bad. I want to use the JVMStabilityInspector from CASSANDRA-7507 to kill so that remains standard, so I will base my patch on CASSANDRA-7507. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (CASSANDRA-5895) JDK7 u40 stack size fixes
John Sumsion created CASSANDRA-5895: --- Summary: JDK7 u40 stack size fixes Key: CASSANDRA-5895 URL: https://issues.apache.org/jira/browse/CASSANDRA-5895 Project: Cassandra Issue Type: Bug Components: Core Environment: cassandra, as launched by the default script by JDK 1.7.0_40 using CCM Reporter: John Sumsion I use Archlinux, and the latest OpenJDK, and run my cassandra via @pcmanus's ccm. When I tried the cassandra-2.0.0 branch via: {noformat} $ ccm create dev -v git:cassandra-2.0.0 {noformat} Then when I say: {noformat} $ ccm populate -n3 $ ccm node1 start -v {noformat} I get the following error: {noformat} xss = -ea -javaagent:/home/blah/.ccm/repository/1.2.2/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1985M -Xmx1985M -Xmn496M -XX:+HeapDumpOnOutOfMemoryError -Xss180k The stack size specified is too small, Specify at least 228k Error starting node node1 Standard error output is: Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. {noformat} I tracked it down to conf/cassandra-env.sh, and changed -Xss180k to -Xss228k, and the node started. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (CASSANDRA-5895) JDK7 u40 stack size fixes
[ https://issues.apache.org/jira/browse/CASSANDRA-5895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] John Sumsion updated CASSANDRA-5895: Attachment: 0002-Ignoring-files-produced-by-ant-build.patch 0001-Increasing-stack-size-to-avoid-error-on-OpenJDK-7-u4.patch Here are two patches: 1) fix for this issue under JDK 1.7.0_40 2) fix for untracked files in git after running ant build You can also fetch these from https://github.com/jdsumsion/cassandra (branch: jdk7-updates) JDK7 u40 stack size fixes - Key: CASSANDRA-5895 URL: https://issues.apache.org/jira/browse/CASSANDRA-5895 Project: Cassandra Issue Type: Bug Components: Core Environment: cassandra, as launched by the default script by JDK 1.7.0_40 using CCM Reporter: John Sumsion Attachments: 0001-Increasing-stack-size-to-avoid-error-on-OpenJDK-7-u4.patch, 0002-Ignoring-files-produced-by-ant-build.patch I use Archlinux, and the latest OpenJDK, and run my cassandra via @pcmanus's ccm. When I tried the cassandra-2.0.0 branch via: {noformat} $ ccm create dev -v git:cassandra-2.0.0 {noformat} Then when I say: {noformat} $ ccm populate -n3 $ ccm node1 start -v {noformat} I get the following error: {noformat} xss = -ea -javaagent:/home/blah/.ccm/repository/1.2.2/lib/jamm-0.2.5.jar -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms1985M -Xmx1985M -Xmn496M -XX:+HeapDumpOnOutOfMemoryError -Xss180k The stack size specified is too small, Specify at least 228k Error starting node node1 Standard error output is: Error: Could not create the Java Virtual Machine. Error: A fatal exception has occurred. Program will exit. {noformat} I tracked it down to conf/cassandra-env.sh, and changed the -Xss180k = -Xss228k and the node started. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira