Re: Question about EC2 and SSDs
Theory aside, I switched from a RAID of ephemerals for data (and the root volume for the write log) to a single EBS-based SSD without any noticeable impact on performance.

will

On Thu, Sep 4, 2014 at 9:35 PM, Steve Robenalt sroben...@highwire.org wrote:

Yes, I am aware there are no heads on an SSD. I have also seen plenty of examples where compatibility issues force awkward engineering tradeoffs even as technology advances, so I am jaded enough to be wary of making assumptions, which is why I asked the question. Steve

On Sep 4, 2014 5:50 PM, Robert Coli rc...@eventbrite.com wrote:

On Thu, Sep 4, 2014 at 5:44 PM, Steve Robenalt sroben...@highwire.org wrote:

Thanks Robert! I am assuming that you meant that it's possible with a single SSD, right?

Yes; no matter how many SSDs you have, you are unlikely to be able to convince one of them to physically seek a drive head across its platter, because they don't have heads or platters. =Rob
Re: VPC AWS
I don't think traffic will flow between classic EC2 and VPC directly. There is some kind of gateway/bridge instance that sits between them, acting as a NAT. I would think that would cause new challenges for: -transitions -clients. Sorry this response isn't heavy on content! I'm curious how this thread goes...

Will

On Thursday, June 5, 2014, Alain RODRIGUEZ arodr...@gmail.com wrote:

Hi guys, We are going to move from a cluster made of simple Amazon EC2 servers to a VPC cluster. We are using Cassandra 1.2.11 and I have some questions regarding this switch and the Cassandra configuration inside a VPC. I have actually found no documentation on this topic, but I am quite sure some people are already using VPC. If you can point me to any documentation regarding VPC / Cassandra, it would be very nice of you. We have only one DC for now, but we need to remain multi-DC compatible, since we will add a DC very soon. Also, I would like to know if I should keep using EC2MultiRegionSnitch or change the snitch to something else. What about broadcast/listen IPs, seeds...? We currently use public IPs for the broadcast address and for seeds, and private ones for the listen address. Machines inside the VPC will only have private IPs, AFAIK. Should I keep using a broadcast address? Are there any other implications when switching to a VPC? Sorry if the topic was already discussed; I was unable to find any useful information...

-- Will Oberman Civic Science, Inc. 6101 Penn Avenue, Fifth Floor Pittsburgh, PA 15206 (M) 412-480-7835 (E) ober...@civicscience.com
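For what it's worth, a minimal sketch of the cassandra.yaml settings that seem relevant for a single-region VPC where nodes only have private IPs. The snitch choice and all IPs here are my assumptions, not a tested recipe; once a second region/DC is added you'd have to revisit the snitch (EC2MultiRegionSnitch needs broadcast addresses reachable across regions):

===
# cassandra.yaml sketch -- single-region VPC, private IPs only (all values hypothetical)
listen_address: 10.0.1.15          # this node's private IP
# broadcast_address left unset; it defaults to listen_address
rpc_address: 10.0.1.15
endpoint_snitch: Ec2Snitch         # assumption: fine until a second region/DC exists
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.1.15,10.0.2.15"   # private IPs of the seed nodes
===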
alternative vnode upgrade strategy?
I'm concerned about the bad reports of using shuffle to do a vnode upgrade (and I did a smoke test trying shuffle on a test cluster, and had out-of-disk-space issues). I then started to plan out the dual-DC upgrade path, but I wonder if this option is easier:

Starting point: N node cluster, no vnodes.
1.) Upgrade all N nodes to vnodes in place
2.) Boot N new nodes and let them bootstrap
3.) Decommission the N old nodes

And/Or

1.) Upgrade all N nodes to vnodes in place
Start loop
2.) Boot a new node and let it bootstrap
3.) Decommission an old node
End loop

The only problem I can think of is that, unlike dual DC, there is no easy rollback strategy. Though, at this point, vnodes have been in the wild for quite some time, and I kind of want/need to move over to vnodes, so it's vnodes or bust, so to speak :-) Am I missing a terrible gotcha (other than the rollback issue)? I did a smoke test of the 2nd scenario (one node in/out at a time) and didn't see any issues (rough command sketch below).

will
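For concreteness, a sketch of the second (one-in/one-out) loop as I understand it. The num_tokens value is just the common default, and the idea that the in-place conversion merely splits each node's existing range is my reading of the 1.2 docs, not something verified in this thread:

===
# 1.) in-place vnode conversion, on each existing node:
#     set "num_tokens: 256" in cassandra.yaml (and drop initial_token), then restart
# 2.) boot a fresh node with num_tokens: 256 and let it bootstrap
# 3.) on the old node being retired:
nodetool decommission
# wait for streaming to finish before starting the next loop iteration
===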
Re: NTS, vnodes and 0% chance of data loss
After sleeping on this, I'm sure my original conclusions are wrong. In all of the referenced cases/threads, I internalized rack awareness and hotspots to mean something different and wrong. A hotspot didn't mean multiple replicas in the same rack (as I had been thinking); it meant the replica-placement process might load some nodes disproportionately due to the random association of vnodes to {dc, rack}. To not lead people astray: I think everything in my email below is correct until Which means a rack failure (3 nodes) has a non-zero chance of data failure (right?). And again, my flaw was thinking that when Cassandra selected replicas for token X in a vnode world, it might pick vnodes that happened to be on the same rack due to random placement of the tokens. That is wrong (looking at the source for NTS), as NTS does skip over the same rack (though it will allow multiple replicas in the same rack if you run out of racks... if someone did DC:4 with 3 racks, they'll always get one rack with two copies of the data, for example).

will

On Tue, May 13, 2014 at 1:41 PM, William Oberman ober...@civicscience.com wrote:

I found this: http://mail-archives.apache.org/mod_mbox/cassandra-user/201404.mbox/%3ccaeduwd1erq-1m-kfj6ubzsbeser8dwh+g-kgdpstnbgqsqc...@mail.gmail.com%3E I read the three referenced cases. In addition, case 4123 references: http://www.mail-archive.com/dev@cassandra.apache.org/msg03844.html And even though I *think* I understand all of the issues now, I still want to double check...

Assumptions:
-A cluster using NTS with options [DC:3]
-Physical layout = in the DC, 3 nodes/rack for a total of 9 nodes

No vnodes: I could do token selection using ideas from case 3810 such that each rack has one replica. At this point, my 0% chance of data loss scenarios are:
1.) Failure of two nodes at random
2.) Failure of 2 racks (6 nodes!)

Vnodes: my 0% chance of data loss scenarios are:
1.) Failure of two nodes at random

Which means a rack failure (3 nodes) has a non-zero chance of data failure (right?). To get specific, I'm in AWS, so racks ~= availability zones. In the years I've been in AWS, I've seen several occasions of single-zone downtime, and one case of single-zone catastrophic loss. E.g. for AWS I feel like you *have* to plan for a single zone failure, and in terms of safety first you *should* plan for two zone failures. Mitigating this data-loss risk seems rough for vnodes, again if I'm understanding everything correctly:
-To ensure 0% data loss for one zone = I need RF=4
-To ensure 0% data loss for two zones = I need RF=7

I'd really like to use vnodes, but RF=7 is crazy. To reiterate what I think is the core idea of this message:
1.) for vnodes, 0% data loss = RF=(# of allowed simultaneous failures)+1
2.) racks don't change the above equation at all

will
Re: clearing tombstones?
Not an expert, just a user of cassandra. For me, before was a CF with a set of files (I forget the official naming scheme, so I'll make up my own):

A0 A1 ... AN

During:

A0 A1 ... AN B0

Where B0 is the union of the Ai. Due to tombstones, mutations, etc., B0 is at most the combined size of the Ai, so during the compaction disk usage is at most 2x, and probably close to 2x (unless you are nearly all tombstones, like me).

After:

B0

Since cassandra can clean up the Ai. Not sure exactly when that cleanup happens. Not sure what state you are in above; sounds like between during and after.

Will

On Thursday, May 8, 2014, Ruchir Jha ruchir@gmail.com wrote:

I tried to do this, however the doubling in disk space is not temporary as you state in your note. What am I missing?

On Fri, Apr 11, 2014 at 10:44 AM, William Oberman ober...@civicscience.com wrote:

So, if I was impatient and just wanted to make this happen now, I could: 1.) Change GCGraceSeconds of the CF to 0 2.) run nodetool compact (*) 3.) Change GCGraceSeconds of the CF back to 10 days. Since I have ~900M tombstones, even if I miss a few due to impatience, I don't care *that* much, as I could re-run my clean-up tool against the now much smaller CF. (*) A long long time ago I seem to recall reading advice about don't ever run nodetool compact, but I can't remember why. Is there any bad long-term consequence? Short term there are several: -a heavy operation -temporary 2x disk space -one big SSTable afterwards. But moving forward, everything is ok, right? CommitLog/MemTable -> SSTables, minor compactions that merge SSTables, etc... The only flaw I can think of is that it will take forever until the SSTable minor compactions build up enough to consider including the big SSTable in a compaction, making it likely I'll have to self-manage compactions.

On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy mark.re...@boxever.com wrote:

Correct, a tombstone will only be removed after the gc_grace period has elapsed. The default value is set to 10 days, which allows a great deal of time for consistency to be achieved prior to deletion. If you are operationally confident that you can achieve consistency via anti-entropy repairs within a shorter period, you can always reduce that 10 day interval. Mark

On Fri, Apr 11, 2014 at 3:16 PM, William Oberman ober...@civicscience.com wrote:

I'm seeing a lot of articles about a dependency between removing tombstones and GCGraceSeconds, which might be my problem (I just checked, and this CF has GCGraceSeconds of 10 days).

On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli tbarbu...@gmail.com wrote:

compaction should take care of it; for me it never worked, so I run nodetool compact on every node; that does it.

2014-04-11 16:05 GMT+02:00 William Oberman ober...@civicscience.com:

I'm wondering what will clear tombstoned rows? nodetool cleanup, nodetool repair, or time (as in just wait)? I had a CF that was more or less storing session information. After some time, we decided that one piece of this information was pointless to track (and was 90%+ of the columns, and in 99% of those cases was ALL columns for a row). I wrote a process to remove all of those columns (which, again, in a vast majority of cases had the effect of removing the whole row). This CF had ~1 billion rows, so I expect to be left with ~100m rows. After I did this mass delete, everything was the same size on disk (which I expected, knowing how tombstoning works). It wasn't 100% clear to me what to poke to cause compactions to clear the tombstones. First I tried nodetool cleanup on a candidate node. But afterwards the disk usage was the same. Then I tried nodetool repair on that same node. But again, disk usage is still the same. The CF has no snapshots. So, am I misunderstanding something? Is there another operation to try? Do I have to just wait? I've only done cleanup/repair on one node. Do I have to run one or the other over all nodes to clear tombstones?

-- Will Oberman Civic Science, Inc. 6101 Penn Avenue, Fifth Floor Pittsburgh, PA 15206 (M) 412-480-7835 (E) ober...@civicscience.com
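A quick way to see which of the before/during/after states you're in is to look at the SSTable set directly. A sketch, with made-up keyspace/CF names and the default data directory:

===
# count SSTables and sizes for the CF (keyspace/CF names are hypothetical)
ls -lh /var/lib/cassandra/data/my_keyspace/sessions/*-Data.db
# or via nodetool (1.2 prints stats for all CFs; grep for yours)
nodetool cfstats | grep -A 3 'Column Family: sessions'
===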
Re: clearing tombstones?
I'm still somewhat in the middle of the process, but it's far enough along to report back.
1.) I changed GCGraceSeconds of the CF to 0 using cassandra-cli
2.) I ran nodetool compact on a single node of the nine (I'll call it 1). It took 5-7 hours, and reduced the CF from ~450GB to ~75GB (*)
3.) I ran nodetool compact on nodes 2, 3, ... while watching write/read latency averages in OpsCenter. I got all of the way to 9 without any ill effect
4.) 2-9 all completed with similar results (*)

(*) So, I left out one detail that changed the math (I said above I expected to clear down to at most 50GB). I found a small bug in my delete code mid-last week. Basically, it deleted all of the rows I wanted, but due to a race condition there was a chance I'd delete rows in the middle of doing new inserts. Luckily, even in this case it wasn't the end of the world, but I stopped the cleanup anyway and added a time check (as all of the rows I wanted to delete were older than 30 days). I *thought* I'd restarted the cleanup threads on a smaller dataset due to all of the deletes, but instead I saw millions and millions of empty rows (the tombstones). Thus the start of this clear the tombstones subtask of the original goal, and the reason I didn't see a 90%+ reduction in size. In any case, now I'm running the cleanup process again, which will be followed by ANOTHER round of compactions, and then I'll finally turn GCGraceSeconds back up. On the read/write production side, you'd never know anything happened. Good job on the distributed system! :-)

Thanks again,

will

On Fri, Apr 11, 2014 at 1:02 PM, Mark Reddy mark.re...@boxever.com wrote:

Thats great Will, if you could update the thread with the actions you decide to take and the results, that would be great. Mark

On Fri, Apr 11, 2014 at 5:53 PM, William Oberman ober...@civicscience.com wrote:

I've learned a *lot* from this thread. My thanks to all of the contributors! Paulo: Good luck with LCS. I wish I could help there, but all of my CF's are SizeTiered (mostly as I'm on the same schema/same settings since 0.7...)

will

On Fri, Apr 11, 2014 at 12:14 PM, Mina Naguib mina.nag...@adgear.com wrote:

Levelled Compaction is a wholly different beast when it comes to tombstones. The tombstones are inserted, like any other write really, at the lower levels in the leveldb hierarchy. They are only removed after they have had the chance to naturally migrate upwards in the leveldb hierarchy to the highest level in your data store. How long that takes depends on: 1. The amount of data in your store and the number of levels your LCS strategy has 2. The amount of new writes entering the bottom funnel of your leveldb, forcing upwards compaction and combining. To give you an idea, I had a similar scenario and ran a (slow, throttled) delete job on my cluster around December-January. Here's a graph of the disk space usage on one node. Notice the still-declining usage long after the cleanup job has finished (sometime in January). I tend to think of tombstones in LCS as little bombs that get to explode much later in time: http://mina.naguib.ca/images/tombstones-cassandra-LCS.jpg

On 2014-04-11, at 11:20 AM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote:

I have a similar problem here. I deleted about 30% of a very large CF using LCS (about 80GB per node), but still my data hasn't shrunk, even though I used 1 day for gc_grace_seconds. Would nodetool scrub help? Does nodetool scrub force a minor compaction?

Cheers, Paulo

On Fri, Apr 11, 2014 at 12:12 PM, Mark Reddy mark.re...@boxever.com wrote:

Yes, running nodetool compact (major compaction) creates one large SSTable. This will mess up the heuristics of the SizeTiered strategy (is this the compaction strategy you are using?), leading to multiple 'small' SSTables alongside the single large SSTable, which results in increased read latency. You will incur the operational overhead of having to manage compactions if you wish to compact these smaller SSTables. For all these reasons it is generally advised to stay away from running compactions manually. Assuming that this is a production environment and you want to keep everything running as smoothly as possible, I would reduce the gc_grace on the CF, allow automatic minor compactions to kick in, and then increase the gc_grace once again after the tombstones have been removed.

On Fri, Apr 11, 2014 at 3:44 PM, William Oberman ober...@civicscience.com wrote:

So, if I was impatient and just wanted to make this happen now, I could: 1.) Change GCGraceSeconds of the CF to 0 2.) run nodetool compact (*) 3.) Change GCGraceSeconds of the CF back to 10 days. Since I have ~900M tombstones, even if I miss a few due to impatience, I don't care *that* much, as I could re-run my clean-up tool against the now much smaller CF. (*) A long long time ago I seem to recall reading advice about don't ever run...
bloom filter + suddenly smaller CF
I had a thread on this forum about clearing junk from a CF. In my case, it's ~90% of ~1 billion rows. One side effect I had hoped for was a reduction in the size of the bloom filter. But, according to nodetool cfstats, it's still fairly large (~1.5GB of RAM). Do bloom filters ever resize themselves when the CF suddenly gets smaller? My next test will be restarting one of the instances, though I'll have to wait on that operation so I thought I'd ask in the meantime. will
Re: bloom filter + suddenly smaller CF
I didn't cross-link my thread, but the basic idea is I've done:
1.) A process that deleted ~900M of ~1G rows from a CF
2.) Set GCGraceSeconds to 0 on the CF
3.) Run nodetool compact on all N nodes

And I checked: all N nodes have bloom filters using 1.5 +/- 0.2 GB of RAM (I didn't explicitly write down the before numbers, but they seem about the same). So, compaction didn't change the BF's (unless cassandra needs a 2nd compaction to see all of the data cleared by the 1st compaction).

will

On Mon, Apr 14, 2014 at 9:52 AM, Michal Michalski michal.michal...@boxever.com wrote:

Bloom filters are built on creation / rebuild of SSTables. If you removed the data, but the old SSTables weren't compacted or you didn't rebuild them manually, the bloom filters will stay the same size. M. Kind regards, Michał Michalski, michal.michal...@boxever.com

On 14 April 2014 14:44, William Oberman ober...@civicscience.com wrote:

I had a thread on this forum about clearing junk from a CF. In my case, it's ~90% of ~1 billion rows. One side effect I had hoped for was a reduction in the size of the bloom filter. But, according to nodetool cfstats, it's still fairly large (~1.5GB of RAM). Do bloom filters ever resize themselves when the CF suddenly gets smaller? My next test will be restarting one of the instances, though I'll have to wait on that operation, so I thought I'd ask in the meantime.

will
Re: bloom filter + suddenly smaller CF
Ah, so I could change the chance value to poke it. Good to know!

On Mon, Apr 14, 2014 at 10:12 AM, Michal Michalski michal.michal...@boxever.com wrote:

Sorry, I misread the question - I thought you'd also changed the FP chance value, not only removed the data. Kind regards, Michał Michalski, michal.michal...@boxever.com

On 14 April 2014 15:07, Michal Michalski michal.michal...@boxever.com wrote:

Did you set the Bloom Filter's FP chance before or after step 3) above? If you did it before, C* should build the Bloom Filters properly. If not - that's the reason. Kind regards, Michał Michalski, michal.michal...@boxever.com

On 14 April 2014 15:04, William Oberman ober...@civicscience.com wrote:

I didn't cross-link my thread, but the basic idea is I've done: 1.) A process that deleted ~900M of ~1G rows from a CF 2.) Set GCGraceSeconds to 0 on the CF 3.) Run nodetool compact on all N nodes. And I checked: all N nodes have bloom filters using 1.5 +/- 0.2 GB of RAM (I didn't explicitly write down the before numbers, but they seem about the same). So, compaction didn't change the BF's (unless cassandra needs a 2nd compaction to see all of the data cleared by the 1st compaction).

will

On Mon, Apr 14, 2014 at 9:52 AM, Michal Michalski michal.michal...@boxever.com wrote:

Bloom filters are built on creation / rebuild of SSTables. If you removed the data, but the old SSTables weren't compacted or you didn't rebuild them manually, the bloom filters will stay the same size. M. Kind regards, Michał Michalski, michal.michal...@boxever.com

On 14 April 2014 14:44, William Oberman ober...@civicscience.com wrote:

I had a thread on this forum about clearing junk from a CF. In my case, it's ~90% of ~1 billion rows. One side effect I had hoped for was a reduction in the size of the bloom filter. But, according to nodetool cfstats, it's still fairly large (~1.5GB of RAM). Do bloom filters ever resize themselves when the CF suddenly gets smaller? My next test will be restarting one of the instances, though I'll have to wait on that operation, so I thought I'd ask in the meantime.

will
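If poking the FP chance is the route, a minimal sketch of what that might look like (keyspace/CF names and the 0.01 value are hypothetical). I'm using scrub here because it rewrites every SSTable regardless of version, whereas upgradesstables may skip files already at the current format; that choice is my assumption, not something confirmed in this thread:

===
# set a new FP chance on the CF via cassandra-cli
echo "use my_keyspace; update column family stats with bloom_filter_fp_chance = 0.01;" | cassandra-cli -h localhost
# rewrite the CF's SSTables, which rebuilds their bloom filters at the new size
nodetool scrub my_keyspace stats
===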
clearing tombstones?
I'm wondering what will clear tombstoned rows? nodetool cleanup, nodetool repair, or time (as in just wait)? I had a CF that was more or less storing session information. After some time, we decided that one piece of this information was pointless to track (and was 90%+ of the columns, and in 99% of those cases was ALL columns for a row). I wrote a process to remove all of those columns (which again in a vast majority of cases had the effect of removing the whole row). This CF had ~1 billion rows, so I expect to be left with ~100m rows. After I did this mass delete, everything was the same size on disk (which I expected, knowing how tombstoning works). It wasn't 100% clear to me what to poke to cause compactions to clear the tombstones. First I tried nodetool cleanup on a candidate node. But, afterwards the disk usage was the same. Then I tried nodetool repair on that same node. But again, disk usage is still the same. The CF has no snapshots. So, am I misunderstanding something? Is there another operation to try? Do I have to just wait? I've only done cleanup/repair on one node. Do I have to run one or the other over all nodes to clear tombstones? Cassandra 1.2.15 if it matters, Thanks! will
Re: clearing tombstones?
I'm seeing a lot of articles about a dependency between removing tombstones and GCGraceSeconds, which might be my problem (I just checked, and this CF has GCGraceSeconds of 10 days).

On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli tbarbu...@gmail.com wrote:

compaction should take care of it; for me it never worked, so I run nodetool compact on every node; that does it.

2014-04-11 16:05 GMT+02:00 William Oberman ober...@civicscience.com:

I'm wondering what will clear tombstoned rows? nodetool cleanup, nodetool repair, or time (as in just wait)? I had a CF that was more or less storing session information. After some time, we decided that one piece of this information was pointless to track (and was 90%+ of the columns, and in 99% of those cases was ALL columns for a row). I wrote a process to remove all of those columns (which, again, in a vast majority of cases had the effect of removing the whole row). This CF had ~1 billion rows, so I expect to be left with ~100m rows. After I did this mass delete, everything was the same size on disk (which I expected, knowing how tombstoning works). It wasn't 100% clear to me what to poke to cause compactions to clear the tombstones. First I tried nodetool cleanup on a candidate node. But afterwards the disk usage was the same. Then I tried nodetool repair on that same node. But again, disk usage is still the same. The CF has no snapshots. So, am I misunderstanding something? Is there another operation to try? Do I have to just wait? I've only done cleanup/repair on one node. Do I have to run one or the other over all nodes to clear tombstones? Cassandra 1.2.15 if it matters, Thanks! will
Re: clearing tombstones?
So, if I was impatient and just wanted to make this happen now, I could (commands sketched below, after the quoted thread):
1.) Change GCGraceSeconds of the CF to 0
2.) run nodetool compact (*)
3.) Change GCGraceSeconds of the CF back to 10 days

Since I have ~900M tombstones, even if I miss a few due to impatience, I don't care *that* much, as I could re-run my clean-up tool against the now much smaller CF.

(*) A long long time ago I seem to recall reading advice about don't ever run nodetool compact, but I can't remember why. Is there any bad long-term consequence? Short term there are several:
-a heavy operation
-temporary 2x disk space
-one big SSTable afterwards
But moving forward, everything is ok, right? CommitLog/MemTable -> SSTables, minor compactions that merge SSTables, etc... The only flaw I can think of is that it will take forever until the SSTable minor compactions build up enough to consider including the big SSTable in a compaction, making it likely I'll have to self-manage compactions.

On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy mark.re...@boxever.com wrote:

Correct, a tombstone will only be removed after the gc_grace period has elapsed. The default value is set to 10 days, which allows a great deal of time for consistency to be achieved prior to deletion. If you are operationally confident that you can achieve consistency via anti-entropy repairs within a shorter period, you can always reduce that 10 day interval. Mark

On Fri, Apr 11, 2014 at 3:16 PM, William Oberman ober...@civicscience.com wrote:

I'm seeing a lot of articles about a dependency between removing tombstones and GCGraceSeconds, which might be my problem (I just checked, and this CF has GCGraceSeconds of 10 days).

On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli tbarbu...@gmail.com wrote:

compaction should take care of it; for me it never worked, so I run nodetool compact on every node; that does it.

2014-04-11 16:05 GMT+02:00 William Oberman ober...@civicscience.com:

I'm wondering what will clear tombstoned rows? nodetool cleanup, nodetool repair, or time (as in just wait)? I had a CF that was more or less storing session information. After some time, we decided that one piece of this information was pointless to track (and was 90%+ of the columns, and in 99% of those cases was ALL columns for a row). I wrote a process to remove all of those columns (which, again, in a vast majority of cases had the effect of removing the whole row). This CF had ~1 billion rows, so I expect to be left with ~100m rows. After I did this mass delete, everything was the same size on disk (which I expected, knowing how tombstoning works). It wasn't 100% clear to me what to poke to cause compactions to clear the tombstones. First I tried nodetool cleanup on a candidate node. But afterwards the disk usage was the same. Then I tried nodetool repair on that same node. But again, disk usage is still the same. The CF has no snapshots. So, am I misunderstanding something? Is there another operation to try? Do I have to just wait? I've only done cleanup/repair on one node. Do I have to run one or the other over all nodes to clear tombstones? Cassandra 1.2.15 if it matters, Thanks! will
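Spelled out as commands, the 3-step plan above looks roughly like this. Keyspace/CF names are hypothetical, gc_grace is the cassandra-cli attribute name, and 864000 seconds = the 10-day default:

===
# 1.) drop gc_grace to 0 on the CF
echo "use my_keyspace; update column family sessions with gc_grace = 0;" | cassandra-cli -h localhost
# 2.) major compaction (heavy; needs temporary ~2x disk, leaves one big SSTable)
nodetool compact my_keyspace sessions
# 3.) restore gc_grace to 10 days
echo "use my_keyspace; update column family sessions with gc_grace = 864000;" | cassandra-cli -h localhost
===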
Re: clearing tombstones?
Answered my own question. There's a good writeup here of the pros/cons of compact: http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_about_config_compact_c.html And I was thinking of bad information that used to float around this forum about major compactions (with respect to the impact on minor compactions). I'm hesitant to write the offending sentence again :-)

On Fri, Apr 11, 2014 at 10:44 AM, William Oberman ober...@civicscience.com wrote:

So, if I was impatient and just wanted to make this happen now, I could: 1.) Change GCGraceSeconds of the CF to 0 2.) run nodetool compact (*) 3.) Change GCGraceSeconds of the CF back to 10 days. Since I have ~900M tombstones, even if I miss a few due to impatience, I don't care *that* much, as I could re-run my clean-up tool against the now much smaller CF. (*) A long long time ago I seem to recall reading advice about don't ever run nodetool compact, but I can't remember why. Is there any bad long-term consequence? Short term there are several: -a heavy operation -temporary 2x disk space -one big SSTable afterwards. But moving forward, everything is ok, right? CommitLog/MemTable -> SSTables, minor compactions that merge SSTables, etc... The only flaw I can think of is that it will take forever until the SSTable minor compactions build up enough to consider including the big SSTable in a compaction, making it likely I'll have to self-manage compactions.

On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy mark.re...@boxever.com wrote:

Correct, a tombstone will only be removed after the gc_grace period has elapsed. The default value is set to 10 days, which allows a great deal of time for consistency to be achieved prior to deletion. If you are operationally confident that you can achieve consistency via anti-entropy repairs within a shorter period, you can always reduce that 10 day interval. Mark

On Fri, Apr 11, 2014 at 3:16 PM, William Oberman ober...@civicscience.com wrote:

I'm seeing a lot of articles about a dependency between removing tombstones and GCGraceSeconds, which might be my problem (I just checked, and this CF has GCGraceSeconds of 10 days).

On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli tbarbu...@gmail.com wrote:

compaction should take care of it; for me it never worked, so I run nodetool compact on every node; that does it.

2014-04-11 16:05 GMT+02:00 William Oberman ober...@civicscience.com:

I'm wondering what will clear tombstoned rows? nodetool cleanup, nodetool repair, or time (as in just wait)? I had a CF that was more or less storing session information. After some time, we decided that one piece of this information was pointless to track (and was 90%+ of the columns, and in 99% of those cases was ALL columns for a row). I wrote a process to remove all of those columns (which, again, in a vast majority of cases had the effect of removing the whole row). This CF had ~1 billion rows, so I expect to be left with ~100m rows. After I did this mass delete, everything was the same size on disk (which I expected, knowing how tombstoning works). It wasn't 100% clear to me what to poke to cause compactions to clear the tombstones. First I tried nodetool cleanup on a candidate node. But afterwards the disk usage was the same. Then I tried nodetool repair on that same node. But again, disk usage is still the same. The CF has no snapshots. So, am I misunderstanding something? Is there another operation to try? Do I have to just wait? I've only done cleanup/repair on one node. Do I have to run one or the other over all nodes to clear tombstones?
Cassandra 1.2.15 if it matters, Thanks! will
Re: clearing tombstones?
Yes, I'm using SizeTiered. I totally understand the mess up the heuristics issue. But I don't understand "You will incur the operational overhead of having to manage compactions if you wish to compact these smaller SSTables." My understanding is the small tables will still compact; the problem is that until I have 3 other (by default) tables of the same size as the big table, the big table won't be compacted. In my case this might not be terrible though, right? To get into the trees: I have 9 nodes with RF=3, and this CF is ~500GB/node. I deleted like 90-95% of the data, so I expect the data to be 25-50GB after the tombstones are cleared, but call it 50GB. That means I won't compact this 50GB file until I gather another 150GB (50+50+50+50 -> 200). But that's not *horrible*. Now, if I had only deleted 10% of the data, waiting to compact 450GB until I had another 1.3TB would be rough... I think your advice is great for people looking for normal answers in the forum, but I don't think my use case is very normal :-)

will

On Fri, Apr 11, 2014 at 11:12 AM, Mark Reddy mark.re...@boxever.com wrote:

Yes, running nodetool compact (major compaction) creates one large SSTable. This will mess up the heuristics of the SizeTiered strategy (is this the compaction strategy you are using?), leading to multiple 'small' SSTables alongside the single large SSTable, which results in increased read latency. You will incur the operational overhead of having to manage compactions if you wish to compact these smaller SSTables. For all these reasons it is generally advised to stay away from running compactions manually. Assuming that this is a production environment and you want to keep everything running as smoothly as possible, I would reduce the gc_grace on the CF, allow automatic minor compactions to kick in, and then increase the gc_grace once again after the tombstones have been removed.

On Fri, Apr 11, 2014 at 3:44 PM, William Oberman ober...@civicscience.com wrote:

So, if I was impatient and just wanted to make this happen now, I could: 1.) Change GCGraceSeconds of the CF to 0 2.) run nodetool compact (*) 3.) Change GCGraceSeconds of the CF back to 10 days. Since I have ~900M tombstones, even if I miss a few due to impatience, I don't care *that* much, as I could re-run my clean-up tool against the now much smaller CF. (*) A long long time ago I seem to recall reading advice about don't ever run nodetool compact, but I can't remember why. Is there any bad long-term consequence? Short term there are several: -a heavy operation -temporary 2x disk space -one big SSTable afterwards. But moving forward, everything is ok, right? CommitLog/MemTable -> SSTables, minor compactions that merge SSTables, etc... The only flaw I can think of is that it will take forever until the SSTable minor compactions build up enough to consider including the big SSTable in a compaction, making it likely I'll have to self-manage compactions.

On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy mark.re...@boxever.com wrote:

Correct, a tombstone will only be removed after the gc_grace period has elapsed. The default value is set to 10 days, which allows a great deal of time for consistency to be achieved prior to deletion. If you are operationally confident that you can achieve consistency via anti-entropy repairs within a shorter period, you can always reduce that 10 day interval.

Mark

On Fri, Apr 11, 2014 at 3:16 PM, William Oberman ober...@civicscience.com wrote:

I'm seeing a lot of articles about a dependency between removing tombstones and GCGraceSeconds, which might be my problem (I just checked, and this CF has GCGraceSeconds of 10 days).

On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli tbarbu...@gmail.com wrote:

compaction should take care of it; for me it never worked, so I run nodetool compact on every node; that does it.

2014-04-11 16:05 GMT+02:00 William Oberman ober...@civicscience.com:

I'm wondering what will clear tombstoned rows? nodetool cleanup, nodetool repair, or time (as in just wait)? I had a CF that was more or less storing session information. After some time, we decided that one piece of this information was pointless to track (and was 90%+ of the columns, and in 99% of those cases was ALL columns for a row). I wrote a process to remove all of those columns (which, again, in a vast majority of cases had the effect of removing the whole row). This CF had ~1 billion rows, so I expect to be left with ~100m rows. After I did this mass delete, everything was the same size on disk (which I expected, knowing how tombstoning works). It wasn't 100% clear to me what to poke to cause compactions to clear the tombstones. First I tried nodetool cleanup on a candidate node. But afterwards the disk usage was the same. Then I tried nodetool repair on that same node. But again, disk usage is still the same. The CF has no snapshots. So, am I misunderstanding...
Re: using hadoop + cassandra for CF mutations (delete)
I use PHP, and phpCassa to talk to cassandra from within my app. I'm using the below script's structure as a way to run a local mutation on each of my nodes:

===
<?php
require_once('PATH/TO/phpcassa-1.0.a.6/lib/autoload.php');

use phpcassa\ColumnFamily;
use phpcassa\Connection\ConnectionPool;
use phpcassa\SystemManager;

try {
    // I'm sure there is a cleaner way of doing this, but for me I can't connect
    // as localhost; I need to use the AWS private IP
    $ip = exec("/sbin/ifconfig | grep 'inet addr:' | grep -v '127.0.0.1' | cut -d: -f2 | awk '{ print $1}'");
    $localCassandra = "$ip:9160";
    $keyspace = "";  // set your keyspace
    $cf = "";        // set your column family

    // find the token range this node owns
    $systemManager = new SystemManager($localCassandra);
    $ring = $systemManager->describe_ring($keyspace);
    $startToken = null;
    $endToken = null;
    foreach ($ring as $ringDetails) {
        // There is an endpoint per RF, with the 1st == owner
        foreach ($ringDetails->endpoints as $endpoint) {
            if ($endpoint == $ip) {
                $startToken = $ringDetails->start_token;
                $endToken = $ringDetails->end_token;
            }
            break;  // only check the first endpoint (the owner)
        }
        if ($startToken != null && $endToken != null) {
            break;
        }
    }
    if ($startToken == null || $endToken == null) {
        fwrite(STDERR, "My ip[$ip] not in ring = " . print_r($ring, true));
        exit(1);
    }

    $pool = new ConnectionPool($keyspace, array($localCassandra));
    $column_family = new ColumnFamily($pool, $cf);
    // I patched my local phpCassa to support null == iterate over all rows.
    // If my pull request is accepted it will be in the git repo. Otherwise put
    // a large int as arg 3, something hopefully larger than all rows on a node...
    foreach ($column_family->get_range_by_token($startToken, $endToken, null) as $key => $columns) {
        foreach ($columns as $cn => $cv) {
            // I track information here
        }
        // and do an optional mutation here based on the row
    }
} catch (Exception $e) {
    fwrite(STDERR, $e);
}
===

On Fri, Apr 4, 2014 at 1:40 PM, William Oberman ober...@civicscience.com wrote:

Looking at the code, cassandra.input.split.size == the Pig URL split_size, right? But in cassandra 1.2.15 I'm wondering if there is a bug that would make the hadoop conf setting cassandra.input.split.size not be used unless you manually set the URI to splitSize=0 (because the abstract class defaults the splitSize to 64k instead of 0)? Long story short, though, I've messed with that setting in the direction you suggested (decreasing), and I'm confident hadoop/pig was picking it up (I eventually decreased it too far, which caused a server-side error of too much memory used). I'm stuck between a rock and a hard place on the mappers. At 20 tasks, based on the delete rate before the time-out failures happen, it was going to take 1-2 days to run the deletes (I was seeing ~10k deletes/sec across all 20 task threads). But this is going to be my plan at this point: fewer tasks at once, even if it takes a week (of hopefully unsupervised time). Thanks for the feedback!

On Fri, Apr 4, 2014 at 12:57 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote:

You said you have tried the Pig URL split_size, but have you actually tried decreasing the value of the cassandra.input.split.size hadoop property? The default is 65536, so you may want to decrease that to see if the number of mappers increases. But at some point, even if you lower that value, it will stop decreasing the number of mappers; I don't know exactly why, probably because it hits the minimum number of rows per token. Another suggestion is to decrease the number of simultaneous mappers of your job, so it doesn't hit cassandra too hard, and you'll get fewer TimedOutExceptions, but your job will take longer to complete.
On Fri, Apr 4, 2014 at 1:24 PM, William Oberman ober...@civicscience.com wrote:

Hi, I have some history with cassandra + hadoop:
1.) Single DC + integrated hadoop = was ok until I needed steady performance (the single DC was used in a production environment)
2.) Two DC's + integrated hadoop on 1 of 2 DCs = was ok until my data grew, and in AWS compute is expensive compared to data storage... e.g. running a 24x7 DC was a lot more expensive than the following solution...
3.) Single DC + a constant ETL to S3 = is still ok; I can spawn an arbitrarily large EMR cluster, and 24x7 data storage + transient EMR is cost effective.

But one of my CF's has had a change of usage pattern, making a large %, but not all, of the data fairly pointless to store. I thought I'd write a Pig UDF that could peek at a row of data and delete it if it fails my criteria. And it works in terms of logic, but not in terms of practical execution. The CF in question has O(billion) keys, and afterwards it will have ~10% of that at most. I basically keep losing the jobs due to too many task failures, all rooted in: Caused...
Re: in AWS is it worth trying to talk to a server in the same zone as your client?
Same region, cross-zone transfer is $0.01 / GB (see http://aws.amazon.com/ec2/pricing/, Data Transfer section).

On Wed, Feb 12, 2014 at 3:04 PM, Russell Bradberry rbradbe...@gmail.com wrote:

Cross-zone data transfer does not cost any extra money. LOCAL_QUORUM = QUORUM if all 6 servers are located in the same logical datacenter. Ensure your clients are connecting to either the local IP or the AWS hostname that is a CNAME to the local IP from within AWS. If you connect to the public IP you will get charged for outbound data transfer.

On February 12, 2014 at 2:58:07 PM, Yogi Nerella (ynerella...@gmail.com) wrote:

Also, maybe you need to set the read consistency to local_quorum; otherwise the servers will still try to read the data from all other data centers. I can understand the latency, but I can't understand how it would save money. The amount of data transferred from the AWS server to the client should be the same no matter where the client is connected?

On Wed, Feb 12, 2014 at 10:33 AM, Andrey Ilinykh ailin...@gmail.com wrote:

Yes, sure. Taking data from the same zone will reduce latency and save you some money.

On Wed, Feb 12, 2014 at 10:13 AM, Brian Tarbox tar...@cabotresearch.com wrote:

We're running a C* cluster with 6 servers spread across the four us-east-1 zones. We also spread our clients (hundreds of them) across the four zones. Currently we give our clients a connection string listing all six servers and let C* do its thing. This is all working just fine... and we're paying a fair bit in AWS transfer costs. There is a suspicion that this transfer cost is driven by us passing data around between our C* servers and clients. Would there be any value in trying to get a client to talk to one of the C* servers in its own zone? I understand (at least partially!) about coordinator nodes and replication, and know that no matter which server is the coordinator for an operation, replication may cause bits to get transferred to/from servers in other zones. Having said that... is there a chance that trying to encourage a client to initially contact a server in its own zone would help? Thank you, Brian Tarbox
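One hedged sketch of zone-pinning on the client side, assuming the clients can reach the EC2 instance metadata service (the per-zone host IPs in the case statement are made up; you'd maintain that mapping yourself):

===
# discover this client's availability zone
ZONE=$(curl -s http://169.254.169.254/latest/meta-data/placement/availability-zone)
# pick a same-zone Cassandra host from a list you maintain per zone
case "$ZONE" in
  us-east-1a) CASS_HOST=10.0.1.15 ;;
  us-east-1b) CASS_HOST=10.0.2.15 ;;
  us-east-1c) CASS_HOST=10.0.3.15 ;;
  us-east-1d) CASS_HOST=10.0.4.15 ;;
esac
===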
dependencies for cassandra's pig integration?
I'm using AWS's EMR (hadoop as a service), and one step copies some data from EMR -> my cassandra cluster. I used to patch EMR with pig 0.11, but now AWS officially supports 0.11, so I thought I'd give it a try. I was having issues; the AWS forum thread on it is here: https://forums.aws.amazon.com/thread.jspa?threadID=131212 But the crux seems to be: I need to have pig.jar on the hadoop classpath, and AWS doesn't install it there (/home/hadoop/lib is where all hadoop libs live, and amazon installs pig into /home/hadoop/lib/pig). Does this make sense, or not?

will
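In case it helps anyone else, this is the kind of bootstrap action I have in mind to work around it. The symlink approach is purely an assumption built on the layout described above (pig under /home/hadoop/lib/pig, hadoop loading /home/hadoop/lib); it is not anything AWS documents:

===
#!/bin/bash
# EMR bootstrap action sketch: expose pig's jar on the hadoop classpath
ln -sf /home/hadoop/lib/pig/pig-*.jar /home/hadoop/lib/
===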
cqlsh + existing cf's + query
I've been running cassandra a while, and have used the PHP API and cassandra-cli, but never gave cqlsh a shot. I'm not quite getting it. My most simple CF is a dumping ground for testing things, created as: create column family stats; I was putting random stats I was computing in it. All keys, column names, and column values are really ascii. Using cqlsh (1.1.12, default shell, so CQL 2.0 I believe), I did:

USE my_keyspace_name;
ASSUME stats(KEY) VALUES are text, NAMES are text, VALUES are text;
SELECT * from stats LIMIT 10;

Looks great. But then I try to do:

SELECT * from stats WHERE KEY='KNOWN_KEY_NAME';

I get: Bad Request: cannot parse 'KNOWN_KEY_NAME' as hex bytes

If I use the hex value for KNOWN_KEY_NAME, it works. E.g. SELECT * from stats WHERE KEY='various_0-9a-f_chars'; I can't find a TO_HEX(string) built-in function for CQL. Am I doing something wrong?

DESCRIBE TABLE stats;

CREATE TABLE stats (
  KEY blob PRIMARY KEY
) WITH
  comment='' AND
  comparator=blob AND
  read_repair_chance=1.00 AND
  gc_grace_seconds=864000 AND
  default_validation=blob AND
  min_compaction_threshold=4 AND
  max_compaction_threshold=32 AND
  replicate_on_write='false' AND
  compaction_strategy_class='SizeTieredCompactionStrategy';
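Lacking a TO_HEX() in CQL 2, one workaround is to hex-encode the key outside of cqlsh (KNOWN_KEY_NAME is just the placeholder from above):

===
# hex-encode an ascii key for use in a CQL 2 blob comparison
echo -n 'KNOWN_KEY_NAME' | xxd -p
# paste the output into: SELECT * FROM stats WHERE KEY = '<hex output>';
===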
1.1.9 -> 1.1.11 rpm upgrade issue
I get this:

Running rpm_check_debug
ERROR with rpm_check_debug vs depsolve:
apache-cassandra11 conflicts with apache-cassandra11-1.1.11-1.noarch

I'm using CentOS. Problem with my OS, or problem with the package? (And how can it conflict with itself??)

will
Re: normal thread counts?
I've done some more digging, and I have more data but no answers (not knowing the cassandra internals). Based on Aaron's comment about gossipinfo/thread dump:
-All IPs that gossip knows about have 2 threads in my thread dump (that seems ok/fine)
-I have an additional set of IPs in my thread dump in the WRITE- state that 1.) used to be part of my cluster, but are not currently, and 2.) had tokens that are NOT part of the cluster anymore

Cassandra is attempting to communicate with these bad IPs once a minute. The log for that attempt is at the bottom of this email. Does this sound familiar to anyone else?

Log snippet:
/var/log/cassandra/system.log: INFO [GossipStage:1] 2013-05-01 11:05:11,865 Gossiper.java (line 831) InetAddress /10.114.67.189 is now dead.
/var/log/cassandra/system.log: INFO [GossipStage:1] 2013-05-01 11:05:11,866 StorageService.java (line 1303) Removing token 0 for /10.114.67.189

On Tue, Apr 30, 2013 at 5:34 PM, aaron morton aa...@thelastpickle.com wrote:

The issue below could result in abandoned threads under high contention, so we'll get that fixed. But we are not sure how/why it would be called so many times. If you could provide a full list of threads and the output from nodetool gossipinfo, that would help. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 1/05/2013, at 8:34 AM, aaron morton aa...@thelastpickle.com wrote:

Many many many of the threads are trying to talk to IPs that aren't in the cluster (I assume they are the IP's of dead hosts). Are these IP's from before the upgrade? Are they IP's you expect to see? Cross-reference them with the output from nodetool gossipinfo to see why the node thinks they should be used. Could you provide a list of the thread names? One way to remove those IPs may be a rolling restart with -Dcassandra.load_ring_state=false in the JVM opts at the bottom of cassandra-env.sh. The OutboundTcpConnection threads are created in pairs by the OutboundTcpConnectionPool, which is created here: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502 The threads are created in the OutboundTcpConnectionPool constructor; checking to see if this could be the source of the leak. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 1/05/2013, at 2:18 AM, William Oberman ober...@civicscience.com wrote:

I use phpcassa. I did a thread dump. 99% of the threads look very similar (I'm using 1.1.9 in terms of matching source lines). The thread names are all like this: WRITE-/10.x.y.z. There are a LOT of duplicates (in terms of the same IP). Many many many of the threads are trying to talk to IPs that aren't in the cluster (I assume they are the IP's of dead hosts). The stack trace is basically the same for them all, attached at the bottom. There are a lot of things I could talk about in terms of my situation, but what I think might be pertinent to this thread: I hit a tipping point recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge (rolling, one at a time). 7 of the 9 upgraded fine and work great. 2 of the 9 keep struggling. I've replaced them many times now, each time using this process: http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node And even this morning the only two nodes with a high number of threads are those two (yet again). And at some point they'll OOM. Seems like there is something about my cluster (caused by the recent upgrade?) that causes a thread leak on OutboundTcpConnection. But I don't know how to escape from the trap. Any ideas?

stackTrace = [
  { className = sun.misc.Unsafe; fileName = Unsafe.java; lineNumber = -2; methodName = park; nativeMethod = true; },
  { className = java.util.concurrent.locks.LockSupport; fileName = LockSupport.java; lineNumber = 158; methodName = park; nativeMethod = false; },
  { className = java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject; fileName = AbstractQueuedSynchronizer.java; lineNumber = 1987; methodName = await; nativeMethod = false; },
  { className = java.util.concurrent.LinkedBlockingQueue; fileName = LinkedBlockingQueue.java; lineNumber = 399; methodName = take; nativeMethod = false; },
  { className = org.apache.cassandra.net.OutboundTcpConnection; fileName = OutboundTcpConnection.java; lineNumber = 104; methodName = run; nativeMethod = false; }
];

--

On Mon, Apr 29, 2013 at 4:31 PM, aaron morton aa...@thelastpickle.com wrote:

I used JMX to check the current number of threads in a production cassandra machine, and it was ~27,000. That does not sound too good...
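For anyone landing here later, Aaron's rolling-restart suggestion boils down to something like this on each node, one node at a time (and removing the flag again after the restart):

===
# append to conf/cassandra-env.sh before restarting the node, so it
# rebuilds its view of the ring instead of loading the saved (stale) one
JVM_OPTS="$JVM_OPTS -Dcassandra.load_ring_state=false"
===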
Re: normal thread counts?
That has GOT to be it. 1.1.10 upgrade it is...

On Wed, May 1, 2013 at 5:09 PM, Janne Jalkanen janne.jalka...@ecyrd.com wrote:

This sounds very much like https://issues.apache.org/jira/browse/CASSANDRA-5175, which was fixed in 1.1.10. /Janne

On Apr 30, 2013, at 23:34, aaron morton aa...@thelastpickle.com wrote:

Many many many of the threads are trying to talk to IPs that aren't in the cluster (I assume they are the IP's of dead hosts). Are these IP's from before the upgrade? Are they IP's you expect to see? Cross-reference them with the output from nodetool gossipinfo to see why the node thinks they should be used. Could you provide a list of the thread names? One way to remove those IPs may be a rolling restart with -Dcassandra.load_ring_state=false in the JVM opts at the bottom of cassandra-env.sh. The OutboundTcpConnection threads are created in pairs by the OutboundTcpConnectionPool, which is created here: https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502 The threads are created in the OutboundTcpConnectionPool constructor; checking to see if this could be the source of the leak. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 1/05/2013, at 2:18 AM, William Oberman ober...@civicscience.com wrote:

I use phpcassa. I did a thread dump. 99% of the threads look very similar (I'm using 1.1.9 in terms of matching source lines). The thread names are all like this: WRITE-/10.x.y.z. There are a LOT of duplicates (in terms of the same IP). Many many many of the threads are trying to talk to IPs that aren't in the cluster (I assume they are the IP's of dead hosts). The stack trace is basically the same for them all, attached at the bottom. There are a lot of things I could talk about in terms of my situation, but what I think might be pertinent to this thread: I hit a tipping point recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge (rolling, one at a time). 7 of the 9 upgraded fine and work great. 2 of the 9 keep struggling. I've replaced them many times now, each time using this process: http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node And even this morning the only two nodes with a high number of threads are those two (yet again). And at some point they'll OOM. Seems like there is something about my cluster (caused by the recent upgrade?) that causes a thread leak on OutboundTcpConnection. But I don't know how to escape from the trap. Any ideas?

stackTrace = [
  { className = sun.misc.Unsafe; fileName = Unsafe.java; lineNumber = -2; methodName = park; nativeMethod = true; },
  { className = java.util.concurrent.locks.LockSupport; fileName = LockSupport.java; lineNumber = 158; methodName = park; nativeMethod = false; },
  { className = java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject; fileName = AbstractQueuedSynchronizer.java; lineNumber = 1987; methodName = await; nativeMethod = false; },
  { className = java.util.concurrent.LinkedBlockingQueue; fileName = LinkedBlockingQueue.java; lineNumber = 399; methodName = take; nativeMethod = false; },
  { className = org.apache.cassandra.net.OutboundTcpConnection; fileName = OutboundTcpConnection.java; lineNumber = 104; methodName = run; nativeMethod = false; }
];

--

On Mon, Apr 29, 2013 at 4:31 PM, aaron morton aa...@thelastpickle.com wrote:

I used JMX to check the current number of threads in a production cassandra machine, and it was ~27,000. That does not sound too good.
My first guess would be lots of client connections. What client are you using, and does it do connection pooling? See the comments in cassandra.yaml around rpc_server_type; the default, sync, uses one thread per connection, so you may be better off with HSHA. But if your app is leaking connections, you should probably deal with that first. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 30/04/2013, at 3:07 AM, William Oberman ober...@civicscience.com wrote:

Hi, I'm having some issues. I keep getting: ERROR [GossipStage:1] 2013-04-28 07:48:48,876 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[GossipStage:1,5,main] java.lang.OutOfMemoryError: unable to create new native thread -- after a day or two of runtime. I've checked, and my system settings seem acceptable: memlock=unlimited nofiles=10 nproc=122944. I've messed with heap sizes from 6-12GB (15GB physical, m1.xlarge in AWS), and I keep OOM'ing with the above error. I've found some (what seem to me) obscure references to the stack size...
Re: normal thread counts?
I use phpcassa. I did a thread dump. 99% of the threads look very similar (I'm using 1.1.9 in terms of matching source lines). The thread names are all like this: WRITE-/10.x.y.z. There are a LOT of duplicates (in terms of the same IP). Many many many of the threads are trying to talk to IPs that aren't in the cluster (I assume they are the IP's of dead hosts). The stack trace is basically the same for them all, attached at the bottom. There are a lot of things I could talk about in terms of my situation, but what I think might be pertinent to this thread: I hit a tipping point recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge (rolling, one at a time). 7 of the 9 upgraded fine and work great. 2 of the 9 keep struggling. I've replaced them many times now, each time using this process: http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node And even this morning the only two nodes with a high number of threads are those two (yet again). And at some point they'll OOM. Seems like there is something about my cluster (caused by the recent upgrade?) that causes a thread leak on OutboundTcpConnection. But I don't know how to escape from the trap. Any ideas?

stackTrace = [
  { className = sun.misc.Unsafe; fileName = Unsafe.java; lineNumber = -2; methodName = park; nativeMethod = true; },
  { className = java.util.concurrent.locks.LockSupport; fileName = LockSupport.java; lineNumber = 158; methodName = park; nativeMethod = false; },
  { className = java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject; fileName = AbstractQueuedSynchronizer.java; lineNumber = 1987; methodName = await; nativeMethod = false; },
  { className = java.util.concurrent.LinkedBlockingQueue; fileName = LinkedBlockingQueue.java; lineNumber = 399; methodName = take; nativeMethod = false; },
  { className = org.apache.cassandra.net.OutboundTcpConnection; fileName = OutboundTcpConnection.java; lineNumber = 104; methodName = run; nativeMethod = false; }
];

--

On Mon, Apr 29, 2013 at 4:31 PM, aaron morton aa...@thelastpickle.com wrote:

I used JMX to check the current number of threads in a production cassandra machine, and it was ~27,000. That does not sound too good. My first guess would be lots of client connections. What client are you using, and does it do connection pooling? See the comments in cassandra.yaml around rpc_server_type; the default, sync, uses one thread per connection, so you may be better off with HSHA. But if your app is leaking connections, you should probably deal with that first. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 30/04/2013, at 3:07 AM, William Oberman ober...@civicscience.com wrote:

Hi, I'm having some issues. I keep getting: ERROR [GossipStage:1] 2013-04-28 07:48:48,876 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[GossipStage:1,5,main] java.lang.OutOfMemoryError: unable to create new native thread -- after a day or two of runtime. I've checked, and my system settings seem acceptable: memlock=unlimited nofiles=10 nproc=122944. I've messed with heap sizes from 6-12GB (15GB physical, m1.xlarge in AWS), and I keep OOM'ing with the above error. I've found some (what seem to me) obscure references to the stack size interacting with the # of threads. If I'm understanding it correctly, to reason about Java mem usage I have to think of OS + Heap as being locked down, and the stack space gets the leftovers of physical memory, with each thread getting a stack.
For me, the system ulimit setting on stack is 10240k (no idea if java sees or respects this setting). My -Xss for cassandra is the default (I hope, don't remember messing with it) of 180k. I used JMX to check current number of threads in a production cassandra machine, and it was ~27,000. Is that a normal thread count? Could my OOM be related to stack + number of threads, or am I overlooking something more simple? will
normal thread counts?
Hi, I'm having some issues. I keep getting:

ERROR [GossipStage:1] 2013-04-28 07:48:48,876 AbstractCassandraDaemon.java (line 135) Exception in thread Thread[GossipStage:1,5,main] java.lang.OutOfMemoryError: unable to create new native thread

-- after a day or two of runtime. I've checked, and my system settings seem acceptable: memlock=unlimited nofiles=10 nproc=122944. I've messed with heap sizes from 6-12GB (15GB physical, m1.xlarge in AWS), and I keep OOM'ing with the above error. I've found some (what seem to me) obscure references to the stack size interacting with the # of threads. If I'm understanding it correctly, to reason about Java memory usage I have to think of OS + Heap as being locked down, and the stack space gets the leftovers of physical memory, with each thread getting a stack. For me, the system ulimit setting on stack is 10240k (no idea if java sees or respects this setting). My -Xss for cassandra is the default (I hope, I don't remember messing with it) of 180k. I used JMX to check the current number of threads in a production cassandra machine, and it was ~27,000. Is that a normal thread count? Could my OOM be related to stack size + number of threads, or am I overlooking something more simple?

will
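For scale: 27,000 threads at the default 180k -Xss is on the order of 4.6GB of stack alone, before the heap. And for anyone comparing numbers, a couple of shell-side ways to read the same thread count without JMX (assuming a single Cassandra JVM on the box):

===
# light-weight process (thread) count for the Cassandra JVM
ps -o nlwp= -p $(pgrep -f CassandraDaemon)
# or read it from /proc
grep Threads /proc/$(pgrep -f CassandraDaemon)/status
===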
Re: StatusLogger format?
99% sure it's in bytes. On Mon, Apr 15, 2013 at 11:25 AM, William Oberman ober...@civicscience.com wrote: Mainly the: ColumnFamily Memtable ops,data section. Is data in bytes/kb/mb/etc? Example line: StatusLogger.java (line 116) civicscience.sessions 4963,1799916 Thanks!
Re: how to stop out of control compactions?
Thanks Gregg and Aaron. Missed that setting! On Tuesday, April 2, 2013, aaron morton wrote: Set the min and max compaction thresholds for a given column family +1 for setting the max_compaction_threshold (as well as the min) on a CF when you are getting behind. It can limit the size of the compactions and give things a chance to complete in a reasonable time. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 2/04/2013, at 3:42 AM, Gregg Ulrich gulr...@netflix.com wrote: You may want to set compaction threshold and not throughput. If you set the min threshold to something very large (10), compactions will not start until cassandra finds this many files to compact (which it should not). In the past I have used this to stop compactions on a node, and then run an offline major compaction to get through the compaction, then set the min threshold back. Not everyone likes major compactions though. setcompactionthreshold keyspace cfname minthreshold maxthreshold - Set the min and max compaction thresholds for a given column family On Mon, Apr 1, 2013 at 12:38 PM, William Oberman ober...@civicscience.com wrote: I'll skip the prelude, but I worked myself into a bit of a jam. I'm recovering now, but I want to double check if I'm thinking about things correctly. Basically, I was in a state where a majority of my servers wanted to do compactions, and rather large ones. This was impacting my site performance. I tried nodetool stop COMPACTION. I tried setcompactionthroughput=1. I tried restarting servers, but they'd restart the compactions pretty much immediately on boot. Then I realized that: nodetool stop COMPACTION only stopped running compactions, and then the compactions would re-enqueue themselves rather quickly. So, right now I have: 1.) scripts running on N-1 servers looping on nodetool stop COMPACTION in a tight loop 2.) On the Nth server I've disabled gossip/thrift and turned up setcompactionthroughput to 999 3.) When the Nth server completes, I pick from the remaining N-1 (well, I'm still running the first compaction, which is going to take 12 more hours, but that is the plan at least). Does this make sense? Other than the fact there were probably warning signs that would have prevented me from getting into this state in the first place? :-) will -- Will Oberman Civic Science, Inc. 6101 Penn Avenue, Fifth Floor Pittsburgh, PA 15206 (M) 412-480-7835 (E) ober...@civicscience.com
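To make Gregg's recipe concrete, the sequence maps onto nodetool roughly like this (host, keyspace and column family names are placeholders; 4 and 32 are the usual min/max defaults to restore afterwards):

nodetool -h localhost setcompactionthreshold MyKeyspace MyCF 100000 100001
(run the offline major compaction, then put the thresholds back)
nodetool -h localhost setcompactionthreshold MyKeyspace MyCF 4 32

The one subtlety is that the setting is per column family, so pausing a whole node this way means issuing it for every CF that is falling behind.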
Re: how to stop out of control compactions?
Edward, you make a good point, and I do think I am getting closer to having to increase my cluster size (I'm around ~300GB/node now). In my case, I think it was neither. I had one node OOM after working on a large compaction but it continued to run in a zombie like state (constantly GC'ing), which I didn't have an alert on. Then I had the bad luck of a close token also starting a large compaction. I have RF=3 with some of my R/W patterns at quorum, causing that segment of my cluster to get slow (e.g. a % of my traffic started to slow). I was running 1.1.2 (I haven't had to poke anything for quite some time, obviously), so I upgraded before moving on (as I saw a lot of bug fixes to compaction issues in release notes). But the upgrade caused even more nodes to start compactions. Which led to my original email... I had a cluster where 80% of my nodes were compacting, and I really needed to boost production traffic and couldn't seem to tamp cassandra down temporarily. Thanks for the advice everyone! will On Tue, Apr 2, 2013 at 10:20 AM, Edward Capriolo edlinuxg...@gmail.com wrote: Settings do not make compactions go away. If your compactions are out of control it usually means one of these things: 1) you have a corrupt table that the compaction never finishes on, sstable count keeps growing 2) you do not have enough hardware to handle your write load On Tue, Apr 2, 2013 at 7:50 AM, William Oberman ober...@civicscience.com wrote: Thanks Gregg and Aaron. Missed that setting! On Tuesday, April 2, 2013, aaron morton wrote: Set the min and max compaction thresholds for a given column family +1 for setting the max_compaction_threshold (as well as the min) on a CF when you are getting behind. It can limit the size of the compactions and give things a chance to complete in a reasonable time. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 2/04/2013, at 3:42 AM, Gregg Ulrich gulr...@netflix.com wrote: You may want to set compaction threshold and not throughput. If you set the min threshold to something very large (10), compactions will not start until cassandra finds this many files to compact (which it should not). In the past I have used this to stop compactions on a node, and then run an offline major compaction to get through the compaction, then set the min threshold back. Not everyone likes major compactions though. setcompactionthreshold keyspace cfname minthreshold maxthreshold - Set the min and max compaction thresholds for a given column family On Mon, Apr 1, 2013 at 12:38 PM, William Oberman ober...@civicscience.com wrote: I'll skip the prelude, but I worked myself into a bit of a jam. I'm recovering now, but I want to double check if I'm thinking about things correctly. Basically, I was in a state where a majority of my servers wanted to do compactions, and rather large ones. This was impacting my site performance. I tried nodetool stop COMPACTION. I tried setcompactionthroughput=1. I tried restarting servers, but they'd restart the compactions pretty much immediately on boot. Then I realized that: nodetool stop COMPACTION only stopped running compactions, and then the compactions would re-enqueue themselves rather quickly. So, right now I have: 1.) scripts running on N-1 servers looping on nodetool stop COMPACTION in a tight loop 2.) On the Nth server I've disabled gossip/thrift and turned up setcompactionthroughput to 999 3.)
When the Nth server completes, I pick from the remaining N-1 (well, I'm still running the first compaction, which is going to take 12 more hours, but that is the plan at least). Does this make sense? Other than the fact there were probably warning signs that would have prevented me from getting into this state in the first place? :-) will -- Will Oberman Civic Science, Inc. 6101 Penn Avenue, Fifth Floor Pittsburgh, PA 15206 (M) 412-480-7835 (E) ober...@civicscience.com
how to stop out of control compactions?
I'll skip the prelude, but I worked myself into a bit of a jam. I'm recovering now, but I want to double check if I'm thinking about things correctly. Basically, I was in a state where a majority of my servers wanted to do compactions, and rather large ones. This was impacting my site performance. I tried nodetool stop COMPACTION. I tried setcompactionthroughput=1. I tried restarting servers, but they'd restart the compactions pretty much immediately on boot. Then I realized that: nodetool stop COMPACTION only stopped running compactions, and then the compactions would re-enqueue themselves rather quickly. So, right now I have: 1.) scripts running on N-1 servers looping on nodetool stop COMPACTION in a tight loop 2.) On the Nth server I've disabled gossip/thrift and turned up setcompactionthroughput to 999 3.) When the Nth server completes, I pick from the remaining N-1 (well, I'm still running the first compaction, which is going to take 12 more hours, but that is the plan at least). Does this make sense? Other than the fact there were probably warning signs that would have prevented me from getting into this state in the first place? :-) will
odd timestamps
I happened to notice some bizarre timestamps coming out of the cassandra-cli. Example: [default@XXX] get CF['e2b753aa33b13e74e5e803d787b06000']; = (column=c35ef420-c37a-11e0-ac88-09b2f4397c6a, value=XXX, timestamp=2013042719) = (column=c3845ea0-c37a-11e0-8f6f-09b2f4397c6a, value=XXX, timestamp=2013287771) = (column=c3993840-c37a-11e0-a069-09b2f4397c6a, value=XXX, timestamp=2013423245) = (column=c39a9040-c37a-11e0-8617-09b2f4397c6a, value=XXX, timestamp=2013431971) Returned 4 results. I'm used to timestamps being microseconds since 1970, like this (same CF): [default@XXX] get CF['3a4599767e16e94465e8491139154871']; = (column=6c8e3160-4678-11e2-b69a-3534db9e8cfa, value=XXX, timestamp=1355549380581630) = (column=6c91f3a0-4678-11e2-bc00-3534db9e8cfa, value=XXX, timestamp=1355549380606285) = (column=6c980f00-4678-11e2-9963-3534db9e8cfa, value=XXX, timestamp=1355549380646378) = (column=6c994100-4678-11e2-955f-3534db9e8cfa, value=XXX, timestamp=1355549380654057) = (column=6c9e7950-4678-11e2-9189-3534db9e8cfa, value=XXX, timestamp=1355549380688268) Returned 5 results. Was there a version of cassandra that wrote bad timestamps at the server level? I started around 0.8.4, and I'm running 1.1.2 right now. Could it have been a client with a bug? I've only been using phpcassa and the CLI to read/write data (no CQL). Is there another theory? I don't _think_ these odd timestamps will cause me problems. I just don't like unexpected results from my data stores... :-) will
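For reference, the convention almost all clients follow is microseconds since the Unix epoch: a normal late-2012 timestamp is around 1.36e15, while the odd values above are around 2.0e9, roughly six orders of magnitude too small, which points at a client writing the wrong unit rather than a server bug (the first odd value, 2013042719, even reads like a concatenated date, 2013-04-27 19, though that is speculation). A hedged sketch of the usual client-side computation plus a crude unit classifier:

public class TimestampCheck {
    // The usual client convention: microseconds since the Unix epoch.
    static long nowMicros() {
        return System.currentTimeMillis() * 1000L;
    }

    // Rough guess at what unit an epoch-based timestamp was written in.
    static String unitOf(long ts) {
        if (ts > 1000000000000000L) return "looks like microseconds";
        if (ts > 1000000000000L) return "looks like milliseconds";
        if (ts > 1000000000L) return "looks like seconds (or something else entirely)";
        return "too small to be an epoch-based time";
    }

    public static void main(String[] args) {
        System.out.println(unitOf(nowMicros()));        // looks like microseconds
        System.out.println(unitOf(1355549380581630L));  // normal column above
        System.out.println(unitOf(2013042719L));        // odd column above
    }
}

The odd values don't cleanly decode as seconds either (2013431971 seconds after 1970 would be the year 2033), so a buggy client mangling the value is the most likely theory.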
Re: sstable2json had random behavior
No, unfortunately I do have the other files, and I had it fail once and then succeed every time after. I'm tracking the external information of sstable2json more carefully now (exit status, stdout, stderr), so hopefully if it happens again I can be more help. will On Tue, Jan 22, 2013 at 3:38 PM, aaron morton aa...@thelastpickle.com wrote: William, If the solution from Binh works for you, can you please submit a ticket to https://issues.apache.org/jira/browse/CASSANDRA The error message could be better if that is the case. Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 22/01/2013, at 9:16 AM, Binh Nguyen binhn...@gmail.com wrote: Hi William, I also saw this one before, but in my case it always happened when I had only the Data and Index files. The problem goes away when I have all the other files (Compression, Filter...) On Mon, Jan 21, 2013 at 11:36 AM, William Oberman ober...@civicscience.com wrote: I'm running 1.1.6 from the datastax repo. I ran sstable2json and got the following error: Exception in thread "main" java.io.IOError: java.io.IOException: dataSize of 7020023552240793698 starting at 993981393 would be larger than file /var/lib/cassandra/data/X-Data.db length 7502161255 at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:156) at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:86) at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:70) at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:187) at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:151) at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:143) at org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:309) at org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:340) at org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:353) at org.apache.cassandra.tools.SSTableExport.main(SSTableExport.java:418) Caused by: java.io.IOException: dataSize of 7020023552240793698 starting at 993981393 would be larger than file /var/lib/cassandra/data/X-Data.db length 7502161255 at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:115) ... 9 more I ran it again, and didn't. This makes me worried :-) Does anyone else ever see this class of error, and does it ever disappear for them?
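A minimal sketch of the kind of wrapper described above, capturing sstable2json's exit status, stdout and stderr so a flaky run leaves evidence behind (the sstable and output paths are placeholders):

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class SSTable2JsonWrapper {
    public static void main(String[] args) throws Exception {
        ProcessBuilder pb = new ProcessBuilder("sstable2json", "/var/lib/cassandra/data/X-Data.db");
        Process p = pb.start();
        // Drain stdout into the JSON dump and stderr into a log on separate
        // threads, so neither pipe fills up and blocks the child process.
        Thread out = drain(p.getInputStream(), new FileOutputStream("/tmp/X-Data.json"));
        Thread err = drain(p.getErrorStream(), new FileOutputStream("/tmp/X-Data.err"));
        int status = p.waitFor();
        out.join();
        err.join();
        System.out.println("sstable2json exited with status " + status);
        if (status != 0) System.out.println("check /tmp/X-Data.err before trusting the dump");
    }

    static Thread drain(final InputStream in, final OutputStream sink) {
        Thread t = new Thread(new Runnable() {
            public void run() {
                try {
                    byte[] buf = new byte[8192];
                    for (int n; (n = in.read(buf)) != -1; ) sink.write(buf, 0, n);
                    sink.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });
        t.start();
        return t;
    }
}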
sstable2json had random behavior
I'm running 1.1.6 from the datastax repo. I ran sstable2json and got the following error: Exception in thread "main" java.io.IOError: java.io.IOException: dataSize of 7020023552240793698 starting at 993981393 would be larger than file /var/lib/cassandra/data/X-Data.db length 7502161255 at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:156) at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:86) at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:70) at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:187) at org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:151) at org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:143) at org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:309) at org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:340) at org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:353) at org.apache.cassandra.tools.SSTableExport.main(SSTableExport.java:418) Caused by: java.io.IOException: dataSize of 7020023552240793698 starting at 993981393 would be larger than file /var/lib/cassandra/data/X-Data.db length 7502161255 at org.apache.cassandra.io.sstable.SSTableIdentityIterator.<init>(SSTableIdentityIterator.java:115) ... 9 more I ran it again, and didn't. This makes me worried :-) Does anyone else ever see this class of error, and does it ever disappear for them?
Re: Cassandra at Amazon AWS
I have a peer EBS disk for each ephemeral disk. Then I do nodetool snapshot -> rsync from ephemeral to EBS -> take snapshot of EBS. Syncing nodetool snapshot directly to S3 would involve fewer steps and be cheaper (EBS costs more than S3), but I do post processing on the snapshot for EMR, and it seemed pointless to push/pull from S3 when I could just map EBS disks around. I snapshot the EBS disk as a failsafe, and snapshots are cheap (they cost the same as S3). I've seen/read about how other people just watch the data directories for new SSTables and trigger a copy to S3 (there are open source projects that do that for you, I believe). And I think a lot of people rely on the replication factor + multiple zones. will On Thu, Jan 17, 2013 at 8:44 AM, Adam Venturella aventure...@gmail.com wrote: Jared, how do you guys handle data backups for your ephemeral based cluster? I'm trying to move to ephemeral drives myself, and that was my last sticking point; asking how others in the community deal with backup in case the VM explodes. On Wed, Jan 16, 2013 at 1:21 PM, Jared Biel jared.b...@bolderthinking.com wrote: We're currently using Cassandra on EC2 at very low scale (a 2 node cluster on m1.large instances in two regions.) I don't believe that EBS is recommended for performance reasons. Also, it's proven to be very unreliable in the past (most of the big/notable AWS outages were due to EBS issues.) We've moved 99% of our instances off of EBS. As others have said, if you require more space in the future it's easy to add more nodes to the cluster. I've found this page (http://www.ec2instances.info/) very useful in determining the amount of space each instance type has. Note that by default only one ephemeral drive is attached and you must specify all ephemeral drives that you want to use at launch time. Also, you can create a RAID 0 of all local disks to provide maximum speed and space. On 16 January 2013 20:42, Marcelo Elias Del Valle mvall...@gmail.com wrote: Hello, I am currently using hadoop + cassandra at Amazon AWS. Cassandra runs on EC2 and my hadoop process runs at EMR. For cassandra storage, I am using local EC2 EBS disks. My system is running fine for my tests, but to me it's not a good setup for production. I need my system to perform well, especially for writes on cassandra, but the amount of data could grow really big, taking several TB of total storage. My first guess was using S3 as storage and I saw this can be done by using the Cloudian package, but I wouldn't like to become dependent on a pre-packaged solution and I found it's kind of expensive for more than 100TB: http://www.cloudian.com/pricing.html I saw some discussion on the internet about using EBS or ephemeral disks for storage at Amazon too. My question is: does someone on this list have the same problem as me? What are you using as the solution for Cassandra's storage when running it at Amazon AWS? Any thoughts would be highly appreciated. Best regards, -- Marcelo Elias Del Valle http://mvalle.com - @mvallebr
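Spelled out, the flow Will describes is roughly three steps per node (keyspace, mount points and volume id are placeholders, and ec2-create-snapshot is from the old EC2 API tools; adjust to whatever tooling you use):

nodetool -h localhost snapshot civicscience
rsync -a /mnt/ephemeral/cassandra/data/civicscience/snapshots/ /mnt/ebs/backups/
ec2-create-snapshot vol-xxxxxxxx

nodetool snapshot only hard-links the SSTables, so the rsync step is where the real I/O happens.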
Re: AWS EMR - Cassandra
DataStax recommended (I forget the reference) using the ephemeral disks in RAID0, which is what I've been running for well over a year now in production. In terms of how I'm doing Cassandra/AWS/Hadoop, I started by doing the split data center thing (one DC for low latency queries, one DC for hadoop). But, that's a lot of system management. And compute is the most expensive part of AWS, and you need a LOT of compute to run this setup. I tried doing Cassandra EC2 cluster -> snapshot -> clone cluster with hadoop overlay -> ETL to S3 using hadoop -> EMR for real work. But that's kind of a pain too (and the ETL to S3 wasn't very fast). Now I'm going after the SSTables directly(*), which sounds like how Netflix does it. You can do incremental updates, if you're careful. (*) Cassandra EC2 -> backup to local EBS -> remap EBS to another box -> sstable2json over new sstables -> S3 (splitting into ~100MB parts), then use EMR to consume the JSON part files. will On Wed, Jan 16, 2013 at 3:30 PM, Marcelo Elias Del Valle mvall...@gmail.com wrote: William, I just saw your message today. I am using Cassandra + Amazon EMR (hadoop 1.0.3) but I am not using PIG as you are. I set my configuration vars in Java, as I have a custom jar file and I am using ColumnFamilyInputFormat. However, if I understood your problem well, the only thing you have to do is to set environment vars when running cluster tasks, right? Take a look at this link: http://sujee.net/tech/articles/hadoop/amazon-emr-beyond-basics/ As it shows, you can run EMR setting some command line arguments that specify a script to be executed before the job starts, on each machine in the cluster. This way, you would be able to correctly set the vars you need. Out of curiosity, could you share what you are using for cassandra storage? I am currently using EC2 local disks, but I am looking for an alternative. Best regards, Marcelo. 2013/1/4 William Oberman ober...@civicscience.com So I've made it work, but I don't get it yet. I have no idea why my DIY server works when I set the environment variables on the machine that kicks off pig (master), and in EMR it doesn't. I recompiled ConfigHelper and CassandraStorage with tons of debugging, and in EMR I can see the hadoop Configuration object get the proper values on the master node, and I can see it does NOT propagate to the task threads. The other part that was driving me nuts could be made more user friendly. The issue is this: I started to try to set cassandra.thrift.address, cassandra.thrift.port, cassandra.partitioner.class in mapred-site.xml, and it didn't work. After even more painful debugging, I noticed that the only time Cassandra sets the input/output versions of those settings (and these input/output specific versions are the only versions really used!) is when Cassandra maps the system environment variables. So, having cassandra.thrift.address in mapred-site.xml does NOTHING, as I needed to have cassandra.output.thrift.address set. It would be much nicer if the get{Input/Output}XYZ checked for the existence of getXYZ if get{Input/Output}XYZ is empty/null. E.g. in getOutputThriftAddress(), if that setting is null, it would have been nice if that method returned getThriftAddress(). My problem went away when I put the full cross product in the XML. E.g. cassandra.input.thrift.address and cassandra.output.thrift.address (and port, and partitioner). I still want to know why the old easy way (of setting the 3 system variables on the pig starter box, and having the config flow into the task trackers) doesn't work!
will On Fri, Jan 4, 2013 at 9:04 AM, William Oberman ober...@civicscience.com wrote: On all tasktrackers, I see: java.io.IOException: PIG_OUTPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS environment variable not set at org.apache.cassandra.hadoop.pig.CassandraStorage.setStoreLocation(CassandraStorage.java:821) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setLocation(PigOutputFormat.java:170) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.setUpContext(PigOutputCommitter.java:112) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:86) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.<init>(PigOutputCommitter.java:67) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:279) at org.apache.hadoop.mapred.Task.initialize(Task.java:515) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:358) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396
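To make the "full cross product" concrete, the working mapred-site.xml ends up with entries like these (the address is a placeholder; the property names mirror the input/output pairs named above):

<property><name>cassandra.input.thrift.address</name><value>10.0.0.1</value></property>
<property><name>cassandra.output.thrift.address</name><value>10.0.0.1</value></property>
<property><name>cassandra.input.thrift.port</name><value>9160</value></property>
<property><name>cassandra.output.thrift.port</name><value>9160</value></property>
<property><name>cassandra.input.partitioner.class</name><value>org.apache.cassandra.dht.RandomPartitioner</value></property>
<property><name>cassandra.output.partitioner.class</name><value>org.apache.cassandra.dht.RandomPartitioner</value></property>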
Re: AWS EMR - Cassandra
On all tasktrackers, I see: java.io.IOException: PIG_OUTPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS environment variable not set at org.apache.cassandra.hadoop.pig.CassandraStorage.setStoreLocation(CassandraStorage.java:821) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setLocation(PigOutputFormat.java:170) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.setUpContext(PigOutputCommitter.java:112) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:86) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.<init>(PigOutputCommitter.java:67) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:279) at org.apache.hadoop.mapred.Task.initialize(Task.java:515) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:358) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132) at org.apache.hadoop.mapred.Child.main(Child.java:249) On Thu, Jan 3, 2013 at 10:45 PM, aaron morton aa...@thelastpickle.com wrote: Instead, I get an error from CassandraStorage that the initial address isn't set (on the slave, the master is ok). Can you post the full error? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/01/2013, at 11:15 AM, William Oberman ober...@civicscience.com wrote: Anyone ever try to read or write directly between EMR and Cassandra? I'm running various Cassandra resources in EC2, so the physical connection part is pretty easy using security groups. But, I'm having some configuration issues. I have managed to get Cassandra + Hadoop working in the past using a DIY hadoop cluster, and looking at the configurations in the two environments (EMR vs DIY), I'm not sure what's different that is causing my failures... I should probably note I'm using the Pig integration of Cassandra. Versions: Hadoop 1.0.3, Pig 0.10, Cassandra 1.1.7. I'm 99% sure I have classpaths working (because I didn't at first, and now EMR can find and instantiate CassandraStorage on master and slaves). What isn't working are the system variables. In my DIY cluster, all I needed to do was: --- export PIG_INITIAL_ADDRESS=XXX export PIG_RPC_PORT=9160 export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner -- And the task trackers somehow magically picked up the values (I never questioned how/why). But, in EMR, they do not. Instead, I get an error from CassandraStorage that the initial address isn't set (on the slave, the master is ok). My DIY cluster used CDH3, which was hadoop 0.20.something. So, maybe the problem is a different version of hadoop? Looking at the CassandraStorage class, I realize I have no idea how it used to work, since it only seems to look at System variables. Those variables are set on the Job.getConfiguration object. I don't know how that part of hadoop works though... do variables that get set on Job on the master get propagated to the task threads? I do know that on my DIY cluster, I do NOT set those system variables on the slaves... Thanks! will
Re: AWS EMR - Cassandra
So I've made it work, but I don't get it yet. I have no idea why my DIY server works when I set the environment variables on the machine that kicks off pig (master), and in EMR it doesn't. I recompiled ConfigHelper and CassandraStorage with tons of debugging, and in EMR I can see the hadoop Configuration object get the proper values on the master node, and I can see it does NOT propagate to the task threads. The other part that was driving me nuts could be made more user friendly. The issue is this: I started to try to set cassandra.thrift.address, cassandra.thrift.port, cassandra.partitioner.class in mapred-site.xml, and it didn't work. After even more painful debugging, I noticed that the only time Cassandra sets the input/output versions of those settings (and these input/output specific versions are the only versions really used!) is when Cassandra maps the system environment variables. So, having cassandra.thrift.address in mapred-site.xml does NOTHING, as I needed to have cassandra.output.thrift.address set. It would be much nicer if the get{Input/Output}XYZ checked for the existence of getXYZ if get{Input/Output}XYZ is empty/null. E.g. in getOutputThriftAddress(), if that setting is null, it would have been nice if that method returned getThriftAddress(). My problem went away when I put the full cross product in the XML. E.g. cassandra.input.thrift.address and cassandra.output.thrift.address (and port, and partitioner). I still want to know why the old easy way (of setting the 3 system variables on the pig starter box, and having the config flow into the task trackers) doesn't work! will On Fri, Jan 4, 2013 at 9:04 AM, William Oberman ober...@civicscience.com wrote: On all tasktrackers, I see: java.io.IOException: PIG_OUTPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS environment variable not set at org.apache.cassandra.hadoop.pig.CassandraStorage.setStoreLocation(CassandraStorage.java:821) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setLocation(PigOutputFormat.java:170) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.setUpContext(PigOutputCommitter.java:112) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:86) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.<init>(PigOutputCommitter.java:67) at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:279) at org.apache.hadoop.mapred.Task.initialize(Task.java:515) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:358) at org.apache.hadoop.mapred.Child$4.run(Child.java:255) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132) at org.apache.hadoop.mapred.Child.main(Child.java:249) On Thu, Jan 3, 2013 at 10:45 PM, aaron morton aa...@thelastpickle.com wrote: Instead, I get an error from CassandraStorage that the initial address isn't set (on the slave, the master is ok). Can you post the full error? Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 4/01/2013, at 11:15 AM, William Oberman ober...@civicscience.com wrote: Anyone ever try to read or write directly between EMR and Cassandra? I'm running various Cassandra resources in EC2, so the physical connection part is pretty easy using security groups.
But, I'm having some configuration issues. I have managed to get Cassandra + Hadoop working in the past using a DIY hadoop cluster, and looking at the configurations in the two environments (EMR vs DIY), I'm not sure what's different that is causing my failures... I should probably note I'm using the Pig integration of Cassandra. Versions: Hadoop 1.0.3, Pig 0.10, Cassandra 1.1.7. I'm 99% sure I have classpaths working (because I didn't at first, and now EMR can find and instantiate CassandraStorage on master and slaves). What isn't working are the system variables. In my DIY cluster, all I needed to do was: --- export PIG_INITIAL_ADDRESS=XXX export PIG_RPC_PORT=9160 export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner -- And the task trackers somehow magically picked up the values (I never questioned how/why). But, in EMR, they do not. Instead, I get an error from CassandraStorage that the initial address isn't set (on the slave, the master is ok). My DIY cluster used CDH3, which was hadoop 0.20.something. So, maybe the problem is a different version of hadoop? Looking at the CassandraStorage class, I realize I have no idea how it used to work, since it only seems to look
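Marcelo's approach of setting everything in Java (quoted earlier in this thread) sidesteps the propagation problem, since ConfigHelper writes the cassandra.input.*/cassandra.output.* properties straight into the job Configuration, which does ship to the task trackers. A hedged sketch for a custom-jar job using ColumnFamilyInputFormat (address, keyspace and CF are placeholders):

import java.nio.ByteBuffer;
import org.apache.cassandra.hadoop.ColumnFamilyInputFormat;
import org.apache.cassandra.hadoop.ConfigHelper;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class EmrJobSetup {
    public static Job configure() throws Exception {
        Job job = new Job(new Configuration(), "cassandra-emr");
        Configuration conf = job.getConfiguration();
        // These write the input/output-specific properties directly, so
        // nothing depends on environment variables existing on the slaves.
        ConfigHelper.setInputInitialAddress(conf, "10.0.0.1");
        ConfigHelper.setInputRpcPort(conf, "9160");
        ConfigHelper.setInputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        ConfigHelper.setInputColumnFamily(conf, "KEYSPACE", "COLUMN_FAMILY");
        ConfigHelper.setInputSlicePredicate(conf, new SlicePredicate().setSlice_range(
                new SliceRange(ByteBuffer.wrap(new byte[0]), ByteBuffer.wrap(new byte[0]), false, 1024)));
        ConfigHelper.setOutputInitialAddress(conf, "10.0.0.1");
        ConfigHelper.setOutputRpcPort(conf, "9160");
        ConfigHelper.setOutputPartitioner(conf, "org.apache.cassandra.dht.RandomPartitioner");
        job.setInputFormatClass(ColumnFamilyInputFormat.class);
        return job;
    }
}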
Re: remove DC
My situation is that DC2 was not written to, thanks! On Mon, Nov 12, 2012 at 7:48 PM, Jeremiah Jordan jeremiah.jor...@gmail.com wrote: If you have any data that you wrote to DC2 since the last time you ran repair, you should probably run repair to make sure that data made it over to DC1. If you never wrote data directly to DC2, then you are correct, you don't need to run repair. You should just need to update the schema, and then decommission the node. -Jeremiah On Nov 12, 2012, at 2:25 PM, William Oberman ober...@civicscience.com wrote: There is a great guide here on how to add resources: http://www.datastax.com/docs/1.1/operations/cluster_management#adding-capacity What about deleting resources? I'm thinking of removing a data center. Clearly I'd need to change the strategy options, which are currently something like this: {DC1:3,DC2:1} to: {DC1:3}. But, after that change, I'm wondering if anything else needs to happen? All of the data in DC1 is already in the correct spots, so I don't think I have to run repair or cleanup... will
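For completeness, the schema change itself should be a one-liner in cassandra-cli; on the 1.1-era CLI the strategy options take the brace form (the keyspace name is borrowed from elsewhere in these threads, and the exact syntax varies by CLI version), followed by nodetool decommission on the DC2 node per Jeremiah:

update keyspace civicscience with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy' and strategy_options = {DC1 : 3};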
remove DC
There is a great guide here on how to add resources: http://www.datastax.com/docs/1.1/operations/cluster_management#adding-capacity What about deleting resources? I'm thinking of removing a data center. Clearly I'd need to change the strategy options, which are currently something like this: {DC1:3,DC2:1} to: {DC1:3}. But, after that change, I'm wondering if anything else needs to happen? All of the data in DC1 is already in the correct spots, so I don't think I have to run repair or cleanup... will
Re: hadoop consistency level
A recent thread made it sound like Brisk was no longer a datastax supported thing (it's DataStax Enterprise, or DSE, now): http://www.mail-archive.com/user@cassandra.apache.org/msg24921.html In particular this response: http://www.mail-archive.com/user@cassandra.apache.org/msg25061.html On Thu, Oct 18, 2012 at 2:49 PM, Jean-Nicolas Boulay Desjardins jnbdzjn...@gmail.com wrote: Why don't you look into Brisk: http://www.datastax.com/docs/0.8/brisk/about_brisk On Thu, Oct 18, 2012 at 2:46 PM, Andrey Ilinykh ailin...@gmail.com wrote: Hello, everybody! I'm thinking about running hadoop jobs on top of the cassandra cluster. My understanding is - hadoop jobs read data from local nodes only. Does that mean the consistency level is always ONE? Thank you, Andrey
cassandra + pig
I'm wondering how many people are using cassandra + pig out there? I recently went through the effort of validating things at a much higher level than I previously did(*), and found a few issues: https://issues.apache.org/jira/browse/CASSANDRA-4748 https://issues.apache.org/jira/browse/CASSANDRA-4749 https://issues.apache.org/jira/browse/CASSANDRA-4789 In general, it seems like the widerow implementation still has rough edges. I'm concerned I'm not understanding why other people aren't using the feature, and thus finding these problems. Is everyone else just setting a high static limit? E.g. LOAD 'cassandra://KEYSPACE/CF?limit=X', where X is at least the max number of columns in any key? Is everyone else using data models that result in keys with # columns always less than 1024? Do newer versions of hadoop consume the cassandra API in a way that works around these issues? I'm using CDH3 == hadoop 0.20.2, pig 0.8.1. (*) I took a random subsample of 50,000 keys of my production data (approx 1M total key/value pairs, some keys having only a single value and some having 1000's). I then wrote both a pig script and a simple procedural version of the pig script. Then I compared the results. Obviously I started with differences, though after locally patching my code to fix the above 3 bugs (though, really only two issues), I now (finally) get the same results.
Re: cassandra + pig
If you don't mind me asking, how are you handling the fact that pre-widerow you are only getting a static number of columns per key (default 1024)? Or am I not understanding the limit concept? On Thu, Oct 11, 2012 at 11:25 AM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: The Dachis Group (where I just came from, now at DataStax) uses pig with cassandra for a lot of things. However, we weren't using the widerow implementation yet, since wide row support is new to 1.1.x and we were on 0.7, then 0.8, then 1.0.x. I think since it's new to 1.1's hadoop support, it sounds like there are some rough edges like you say. But reproducible issues filed as tickets are much appreciated, and they will get addressed. On Oct 11, 2012, at 10:43 AM, William Oberman ober...@civicscience.com wrote: I'm wondering how many people are using cassandra + pig out there? I recently went through the effort of validating things at a much higher level than I previously did(*), and found a few issues: https://issues.apache.org/jira/browse/CASSANDRA-4748 https://issues.apache.org/jira/browse/CASSANDRA-4749 https://issues.apache.org/jira/browse/CASSANDRA-4789 In general, it seems like the widerow implementation still has rough edges. I'm concerned I'm not understanding why other people aren't using the feature, and thus finding these problems. Is everyone else just setting a high static limit? E.g. LOAD 'cassandra://KEYSPACE/CF?limit=X', where X is at least the max number of columns in any key? Is everyone else using data models that result in keys with # columns always less than 1024? Do newer versions of hadoop consume the cassandra API in a way that works around these issues? I'm using CDH3 == hadoop 0.20.2, pig 0.8.1. (*) I took a random subsample of 50,000 keys of my production data (approx 1M total key/value pairs, some keys having only a single value and some having 1000's). I then wrote both a pig script and a simple procedural version of the pig script. Then I compared the results. Obviously I started with differences, though after locally patching my code to fix the above 3 bugs (though, really only two issues), I now (finally) get the same results.
Re: cassandra + pig
Thanks Jeremy! Maybe figuring out how to do paging in pig would have been easier, but I found the widerow setting first, which led me where I am today. I don't mind helping to blaze trails, or contribute back when doing so, but I usually try to follow rather than lead when it comes to tools/software I choose to use. I didn't realize how close to the edge I was getting in this case :-) On Thu, Oct 11, 2012 at 1:03 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: For our use case, we had a lot of narrow column families, and for the couple of column families that had wide rows, we did our own paging through them. I don't recall if we did paging in pig or mapreduce, but you should be able to do that in both since pig allows you to specify the slice start. On Oct 11, 2012, at 11:28 AM, William Oberman ober...@civicscience.com wrote: If you don't mind me asking, how are you handling the fact that pre-widerow you are only getting a static number of columns per key (default 1024)? Or am I not understanding the limit concept? On Thu, Oct 11, 2012 at 11:25 AM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: The Dachis Group (where I just came from, now at DataStax) uses pig with cassandra for a lot of things. However, we weren't using the widerow implementation yet, since wide row support is new to 1.1.x and we were on 0.7, then 0.8, then 1.0.x. I think since it's new to 1.1's hadoop support, it sounds like there are some rough edges like you say. But reproducible issues filed as tickets are much appreciated, and they will get addressed. On Oct 11, 2012, at 10:43 AM, William Oberman ober...@civicscience.com wrote: I'm wondering how many people are using cassandra + pig out there? I recently went through the effort of validating things at a much higher level than I previously did(*), and found a few issues: https://issues.apache.org/jira/browse/CASSANDRA-4748 https://issues.apache.org/jira/browse/CASSANDRA-4749 https://issues.apache.org/jira/browse/CASSANDRA-4789 In general, it seems like the widerow implementation still has rough edges. I'm concerned I'm not understanding why other people aren't using the feature, and thus finding these problems. Is everyone else just setting a high static limit? E.g. LOAD 'cassandra://KEYSPACE/CF?limit=X', where X is at least the max number of columns in any key? Is everyone else using data models that result in keys with # columns always less than 1024? Do newer versions of hadoop consume the cassandra API in a way that works around these issues? I'm using CDH3 == hadoop 0.20.2, pig 0.8.1. (*) I took a random subsample of 50,000 keys of my production data (approx 1M total key/value pairs, some keys having only a single value and some having 1000's). I then wrote both a pig script and a simple procedural version of the pig script. Then I compared the results. Obviously I started with differences, though after locally patching my code to fix the above 3 bugs (though, really only two issues), I now (finally) get the same results.
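A sketch of the manual paging Jeremy describes, against the Thrift API of that era: slice 1024 columns at a time and restart each slice at the last column seen. Host, keyspace, CF and key are placeholders, and note that hand-rolled paging has to skip the overlapping first column on every page after the first:

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class WideRowPager {
    public static void main(String[] args) throws Exception {
        TFramedTransport transport = new TFramedTransport(new TSocket("10.0.0.1", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace("KEYSPACE");

        ColumnParent parent = new ColumnParent("COLUMN_FAMILY");
        ByteBuffer key = ByteBuffer.wrap("SOME_KEY".getBytes("UTF-8"));
        ByteBuffer start = ByteBuffer.wrap(new byte[0]); // empty start = beginning of row
        boolean first = true;
        long total = 0;
        while (true) {
            SlicePredicate predicate = new SlicePredicate().setSlice_range(
                    new SliceRange(start, ByteBuffer.wrap(new byte[0]), false, 1024));
            List<ColumnOrSuperColumn> page =
                    client.get_slice(key, parent, predicate, ConsistencyLevel.ONE);
            // Skip the first column of every page after the first: it is the
            // same column the previous page ended on.
            for (int i = first ? 0 : 1; i < page.size(); i++) {
                total++; // process page.get(i).getColumn() here
            }
            if (page.size() < 1024) break; // a short page means the row is done
            start = ByteBuffer.wrap(page.get(page.size() - 1).getColumn().getName());
            first = false;
        }
        System.out.println("columns seen: " + total);
        transport.close();
    }
}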
Re: pig and widerows
The next painful lesson for me was figuring out how to get logging working for a distributed hadoop process. In my test environment, I have a single node that runs the name/secondaryname/data/job trackers (call it central), and I have two cassandra nodes running tasktrackers. But, I also have cassandra libraries on the central box, and invoke my pig script from there. I had been patching and recompiling cassandra (1.1.5 with my logging, and the system env fix) on that central box, and SOME of the logging was appearing in the pig output. But, eventually I decided to move that recompiled code to the tasktracker boxes, and then I found even more of the logging I had added in: /var/log/hadoop/userlogs/JOB_ID on each of the tasktrackers. Based on this new logging, I found out that the widerows setting wasn't propagating from the central box to the tasktrackers. I added: export PIG_WIDEROW_INPUT=true to hadoop-env.sh on each of the tasktrackers and it finally worked! So, long story short, to actually get all columns for a key I had to: 1.) patch 1.1.5 to honor the PIG_WIDEROW_INPUT=true system setting 2.) add the system setting to ALL nodes in the hadoop cluster I'm going to try to undo all of my other hacks to get logging/printing working to confirm whether those were actually the only two changes I had to make. will On Thu, Sep 27, 2012 at 1:43 PM, William Oberman ober...@civicscience.com wrote: Ok, this is painful. The first problem I found is that in stock 1.1.5 there is no way to set widerows to true! The new widerows URI parsing is NOT in 1.1.5. And for extra fun, getting the value from the system property is BROKEN (at least in my centos linux environment). Here are the key lines of code (in CassandraStorage); note the different ways of getting the property! getenv in the test, and getProperty in the set:

widerows = DEFAULT_WIDEROW_INPUT;
if (System.getenv("PIG_WIDEROW_INPUT") != null)
    widerows = Boolean.valueOf(System.getProperty("PIG_WIDEROW_INPUT"));

I added this logging:

logger.warn("widerows = " + widerows + " getenv=" + System.getenv("PIG_WIDEROW_INPUT") + " getProp=" + System.getProperty("PIG_WIDEROW_INPUT"));

And I saw: org.apache.cassandra.hadoop.pig.CassandraStorage - widerows = false getenv=true getProp=null So for me getProperty != getenv :-( For people trying to figure out how to debug cassandra + hadoop + pig, for me the key to getting debugging and logging working was to focus on /etc/hadoop/conf (not /etc/pig/conf as I expected). Also, if you want to compile your own cassandra (to add logging messages), make sure it appears first on the pig classpath (use pig -secretDebugCmd to see the fully qualified command line). The next thing I'm trying to figure out is why when widerows == true I'm STILL not seeing more than 1024 columns :-( will On Wed, Sep 26, 2012 at 3:42 PM, William Oberman ober...@civicscience.com wrote: Hi, I'm trying to figure out what's going on with my cassandra/hadoop/pig system. I created a mini copy of my main cassandra data by randomly subsampling to get ~50,000 keys. I was then writing pig scripts but also the equivalent operation using simple single threaded code to double check pig. Of course my very first test failed. After doing a pig DUMP on the raw data, what appears to be happening is I'm only getting the first 1024 columns of a key. After some googling, this seems to be known behavior unless you add ?widerows=true to the pig load URI.
I tried this, but it didn't seem to fix anything :-( Here's the start of my pig script: foo = LOAD 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)}); I'm using cassandra 1.1.5 from datastax rpms. I'm using hadoop (0.20.2+923.418-1) and pig (0.8.1+28.39-1) from cloudera rpms. What am I doing wrong? Or, how can I enable debugging/logging to next figure out what is going on? I haven't had to debug hadoop+pig+cassandra much, other than doing DUMP/ILLUSTRATE from pig. will
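The patch Will alludes to is presumably just making the read side of that broken test/read pair match the test side, using getenv in both places:

if (System.getenv("PIG_WIDEROW_INPUT") != null)
    widerows = Boolean.valueOf(System.getenv("PIG_WIDEROW_INPUT"));

rather than testing getenv and then reading getProperty, which are two unrelated namespaces (environment variables vs. JVM -D system properties).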
Re: pig and widerows
I don't want to switch my cassandra to HEAD, but looking at the newest code for CassandraStorage, I'm concerned the Uri parsing for widerows isn't going to work. setLocation first calls setLocationFromUri (which sets widerows to the Uri value), but then sets widerows to a static value (which is defined as false), and then it sets widerows to the system setting if it exists. That doesn't seem right... ? But setLocationFromUri also gets called from setStoreLocation, and I don't really know the difference between setLocation and setStoreLocation in terms of what is going on in the integration between cassandra/pig/hadoop. will On Thu, Sep 27, 2012 at 3:26 PM, William Oberman ober...@civicscience.com wrote: The next painful lesson for me was figuring out how to get logging working for a distributed hadoop process. In my test environment, I have a single node that runs the name/secondaryname/data/job trackers (call it central), and I have two cassandra nodes running tasktrackers. But, I also have cassandra libraries on the central box, and invoke my pig script from there. I had been patching and recompiling cassandra (1.1.5 with my logging, and the system env fix) on that central box, and SOME of the logging was appearing in the pig output. But, eventually I decided to move that recompiled code to the tasktracker boxes, and then I found even more of the logging I had added in: /var/log/hadoop/userlogs/JOB_ID on each of the tasktrackers. Based on this new logging, I found out that the widerows setting wasn't propagating from the central box to the tasktrackers. I added: export PIG_WIDEROW_INPUT=true to hadoop-env.sh on each of the tasktrackers and it finally worked! So, long story short, to actually get all columns for a key I had to: 1.) patch 1.1.5 to honor the PIG_WIDEROW_INPUT=true system setting 2.) add the system setting to ALL nodes in the hadoop cluster I'm going to try to undo all of my other hacks to get logging/printing working to confirm whether those were actually the only two changes I had to make. will On Thu, Sep 27, 2012 at 1:43 PM, William Oberman ober...@civicscience.com wrote: Ok, this is painful. The first problem I found is that in stock 1.1.5 there is no way to set widerows to true! The new widerows URI parsing is NOT in 1.1.5. And for extra fun, getting the value from the system property is BROKEN (at least in my centos linux environment). Here are the key lines of code (in CassandraStorage); note the different ways of getting the property! getenv in the test, and getProperty in the set:

widerows = DEFAULT_WIDEROW_INPUT;
if (System.getenv("PIG_WIDEROW_INPUT") != null)
    widerows = Boolean.valueOf(System.getProperty("PIG_WIDEROW_INPUT"));

I added this logging:

logger.warn("widerows = " + widerows + " getenv=" + System.getenv("PIG_WIDEROW_INPUT") + " getProp=" + System.getProperty("PIG_WIDEROW_INPUT"));

And I saw: org.apache.cassandra.hadoop.pig.CassandraStorage - widerows = false getenv=true getProp=null So for me getProperty != getenv :-( For people trying to figure out how to debug cassandra + hadoop + pig, for me the key to getting debugging and logging working was to focus on /etc/hadoop/conf (not /etc/pig/conf as I expected). Also, if you want to compile your own cassandra (to add logging messages), make sure it appears first on the pig classpath (use pig -secretDebugCmd to see the fully qualified command line).
The next thing I'm trying to figure out is why when widerows == true I'm STILL not seeing more than 1024 columns :-( will On Wed, Sep 26, 2012 at 3:42 PM, William Oberman ober...@civicscience.com wrote: Hi, I'm trying to figure out what's going on with my cassandra/hadoop/pig system. I created a mini copy of my main cassandra data by randomly subsampling to get ~50,000 keys. I was then writing pig scripts but also the equivalent operation using simple single threaded code to double check pig. Of course my very first test failed. After doing a pig DUMP on the raw data, what appears to be happening is I'm only getting the first 1024 columns of a key. After some googling, this seems to be known behavior unless you add ?widerows=true to the pig load URI. I tried this, but it didn't seem to fix anything :-( Here's the start of my pig script: foo = LOAD 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)}); I'm using cassandra 1.1.5 from datastax rpms. I'm using hadoop (0.20.2+923.418-1) and pig (0.8.1+28.39-1) from cloudera rpms. What am I doing wrong? Or, how can I enable debugging/logging to next figure out what is going on? I haven't had to debug hadoop+pig+cassandra much, other than doing DUMP/ILLUSTRATE from pig. will
pig and widerows
Hi, I'm trying to figure out what's going on with my cassandra/hadoop/pig system. I created a mini copy of my main cassandra data by randomly subsampling to get ~50,000 keys. I was then writing pig scripts but also the equivalent operation using simple single threaded code to double check pig. Of course my very first test failed. After doing a pig DUMP on the raw data, what appears to be happening is I'm only getting the first 1024 columns of a key. After some googling, this seems to be known behavior unless you add ?widerows=true to the pig load URI. I tried this, but it didn't seem to fix anything :-( Here's the start of my pig script: foo = LOAD 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)}); I'm using cassandra 1.1.5 from datastax rpms. I'm using hadoop (0.20.2+923.418-1) and pig (0.8.1+28.39-1) from cloudera rpms. What am I doing wrong? Or, how can I enable debugging/logging to next figure out what is going on? I haven't had to debug hadoop+pig+cassandra much, other than doing DUMP/ILLUSTRATE from pig. will
new nodetool ring output and unbalanced ring?
Hi, I recently upgraded from 0.8.x to 1.1.x (through 1.0 briefly) and nodetool ring seems to have changed from owns to effectively owns. Effectively owns seems to account for replication factor (RF). I'm ok with all of this, yet I still can't figure out what's up with my cluster. I have a NetworkTopologyStrategy with two data centers (DCs) with RF/number of nodes in DC combinations of:

DC Name, RF, # in DC
analytics, 1, 2
us-east, 3, 4

So I'd expect 50% on each analytics node, and 75% for each us-east node. Instead, I have two nodes in us-east with 50/100??? (the other two are 75/75 as expected). Here is the output of nodetool (all nodes report the same thing):

Address  DC         Rack  Status  State   Load       Effective-Ownership  Token
                                                                          127605887595351923798765477786913079296
x.x.x.x  us-east    1c    Up      Normal  94.57 GB   75.00%               0
x.x.x.x  analytics  1c    Up      Normal  60.64 GB   50.00%               1
x.x.x.x  us-east    1c    Up      Normal  131.76 GB  75.00%               42535295865117307932921825928971026432
x.x.x.x  us-east    1c    Up      Normal  43.45 GB   50.00%               85070591730234615865843651857942052864
x.x.x.x  analytics  1d    Up      Normal  60.88 GB   50.00%               85070591730234615865843651857942052865
x.x.x.x  us-east    1d    Up      Normal  98.56 GB   100.00%              127605887595351923798765477786913079296

If I use cassandra-cli to do show keyspaces; I get (and again, all nodes report the same thing): Keyspace: civicscience: Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy Durable Writes: true Options: [analytics:1, us-east:3] I removed the output about all of my column families (CFs), hopefully that doesn't matter. Did I compute the tokens wrong? Is there a combination of nodetool commands I can run to migrate the data around to rebalance to 75/75/75/75? I routinely run repair already. And as the release notes required, I ran upgradesstables during the upgrade process. Before the upgrade, I was getting analytics = 0%, and us-east = 25% on each node, which I expected for owns. will
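The token math here checks out: on the 2^127 RandomPartitioner ring, four evenly spaced nodes sit at i * 2^127 / 4, and the analytics nodes are offset by 1 to avoid collisions, which is exactly what the ring output shows. A quick check:

import java.math.BigInteger;

public class TokenCheck {
    public static void main(String[] args) {
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        int nodesPerDc = 4;
        for (int i = 0; i < nodesPerDc; i++) {
            BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(nodesPerDc));
            // analytics nodes would be token.add(BigInteger.ONE) where present
            System.out.println("us-east[" + i + "] = " + token);
        }
    }
}

i = 1 prints 42535295865117307932921825928971026432, matching the ring above, so the 50/100 split is not a token problem; the rack placement explanation in the reply below is the actual cause.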
Re: new nodetool ring output and unbalanced ring?
Didn't notice the racks! Of course, if I change a 1c to a 1d, what would I have to do to make sure data shuffles around correctly? Repair everywhere? will On Thu, Sep 6, 2012 at 2:09 PM, Tyler Hobbs ty...@datastax.com wrote: The main issue is that one of your us-east nodes is in rack 1d, while the rest are in rack 1c. With NTS and multiple racks, Cassandra will try to use one node from each rack as a replica for a range until it either meets the RF for the DC, or runs out of racks, in which case it just picks nodes sequentially going clockwise around the ring (starting from the range being considered, not the last node that was chosen as a replica). To fix this, you'll either need to make the 1d node a 1c node, or make 42535295865117307932921825928971026432 a 1d node so that you're alternating racks within that DC. On Thu, Sep 6, 2012 at 12:54 PM, William Oberman ober...@civicscience.com wrote: Hi, I recently upgraded from 0.8.x to 1.1.x (through 1.0 briefly) and nodetool ring seems to have changed from owns to effectively owns. Effectively owns seems to account for replication factor (RF). I'm ok with all of this, yet I still can't figure out what's up with my cluster. I have a NetworkTopologyStrategy with two data centers (DCs) with RF/number of nodes in DC combinations of:

DC Name, RF, # in DC
analytics, 1, 2
us-east, 3, 4

So I'd expect 50% on each analytics node, and 75% for each us-east node. Instead, I have two nodes in us-east with 50/100??? (the other two are 75/75 as expected). Here is the output of nodetool (all nodes report the same thing):

Address  DC         Rack  Status  State   Load       Effective-Ownership  Token
                                                                          127605887595351923798765477786913079296
x.x.x.x  us-east    1c    Up      Normal  94.57 GB   75.00%               0
x.x.x.x  analytics  1c    Up      Normal  60.64 GB   50.00%               1
x.x.x.x  us-east    1c    Up      Normal  131.76 GB  75.00%               42535295865117307932921825928971026432
x.x.x.x  us-east    1c    Up      Normal  43.45 GB   50.00%               85070591730234615865843651857942052864
x.x.x.x  analytics  1d    Up      Normal  60.88 GB   50.00%               85070591730234615865843651857942052865
x.x.x.x  us-east    1d    Up      Normal  98.56 GB   100.00%              127605887595351923798765477786913079296

If I use cassandra-cli to do show keyspaces; I get (and again, all nodes report the same thing): Keyspace: civicscience: Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy Durable Writes: true Options: [analytics:1, us-east:3] I removed the output about all of my column families (CFs), hopefully that doesn't matter. Did I compute the tokens wrong? Is there a combination of nodetool commands I can run to migrate the data around to rebalance to 75/75/75/75? I routinely run repair already. And as the release notes required, I ran upgradesstables during the upgrade process. Before the upgrade, I was getting analytics = 0%, and us-east = 25% on each node, which I expected for owns. will -- Tyler Hobbs DataStax http://datastax.com/ -- Will Oberman Civic Science, Inc. 3030 Penn Avenue, First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: Professional Support
I also have used datastax with great success (same disclaimer). A specific example: -I set up a one-on-one call to talk through an issue, in my case a server reconfiguration. It took 2 days to find a time to meet, though that was my fault, as I believe they could have worked me in within a day. I wanted to split an existing cluster into 'oltp' and 'analytics', similar to what brisk does now out of the box. -During the call they walked me through all of the steps I'd have to do, answered any questions I had, and filled in the blanks for some of the reasoning behind their recommendations. -After the call I received constant support through the reconfiguration. For example: I found out that Ec2Snitch doesn't play nicely with PropertyFileSnitch in a rolling restart (all of the Ec2Snitch based servers stopped working immediately as soon as a PropertyFileSnitch server joined the ring; this is in 0.8.4), and they wrote a custom patch for me that made it work within a day. -In particular, Ben and Jackson helped me, so if either of you read the user list, thanks again! will On Tue, Sep 6, 2011 at 1:25 PM, Jim Ancona j...@anconafamily.com wrote: We use Datastax (http://www.datastax.com) and we have been very happy with the support we've received. We haven't tried any of the other providers on that page, so I can't comment on them. Jim (Disclaimer: no connection with Datastax other than as a satisfied customer.) On Tue, Sep 6, 2011 at 1:15 PM, China Stoffen chinastof...@yahoo.com wrote: There is a link to a page which lists a few professional support providers on the Cassandra homepage. I have contacted a few of them; a couple are just out of providing support and the others didn't reply. So, do you know about any professional support providers for Cassandra solutions, and how much do they charge per year?
Re: cassandra 0.8.4 + pig (using cloudera rpms)
Yes, my cluster is working. I didn't realize it at the time, but the StorageService link I listed is already in 0.8.4, so yes, the only file I had to patch was VersionedValue. Not sure what was going on with the pig jars, but after more configuration changes than I can count, I'm pretty sure removing pig.jar in favor of the cloudera pig jar was the magic bullet (for the ClassNotFound I was getting, in this case TException). One final note: in production I had to patch all of my cassandra servers (OLTP and analytics)(*) with the VersionedValue file for it to work (though, I did forget one setting, so now I'm still not 100% sure I had to patch all of them, but it's working now). will (*) OLTP = vanilla cassandra, analytics = cassandra + tasktracker. I'm not on brisk yet, so I've been rolling my own. On Mon, Sep 5, 2011 at 1:41 AM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: Thanks William - so you were able to get everything running correctly, right? FWIW, we're in the process of upgrading to 0.8.4 and found that all we needed was that first link you mentioned - the VersionedValue modification. It's running fine on our staging cluster and we're in the process of moving to production. We're currently using pig from cdh3u0. All we did was replace the 0.8.4 jars after installing the debian packages for 0.8.4. Not sure if that helps anyone, but thought I would share what we've seen. Btw, this shouldn't be a problem once 0.8.5 comes out. On Sep 4, 2011, at 11:03 AM, William Oberman wrote: I've had some troubles, so I thought I'd pass on my various bug fixes: -Cass 0.8.4 has troubles with pig/hadoop (you get NPE's when trying to connect to cassandra in the pig logs). You need this patch: http://svn.apache.org/viewvc?revision=1158940&view=revision And maybe this: http://svn.apache.org/viewvc?revision=1155157&view=revision -I had installed from riptano rpms. I downloaded the src, applied the patch, and did ant jar. I then replaced the rpm-installed cassandra jar with this new one (ugly, but I wanted to continue to run from the package). -I think I was able to just replace the apache-cassandra-0.8.4.jar on just my jobtracker + tasktracker nodes (I need to retest from scratch to be sure, I've done a _lot_ of configuring and reconfiguring) -Then I started getting ClassNotFound exceptions during map/reduce tasks. Still not sure why this fix works, but the problem seems to be that the cloudera pig version 0.20.2+923.97-1 has two jars that match pig*.jar (which is what cassandra contrib/pig/bin/pig_cassandra uses to set up the classpath). I had to rename /usr/lib/pig/pig.jar for things to work (leaving pig-0.8.1-cdh3u1-core.jar as the only match). My pig script is still running, but it's the first time it didn't immediately crash. will
cassandra 0.8.4 + pig (using cloudera rpms)
I've had some troubles, so I thought I'd pass on my various bug fixes: -Cass 0.8.4 has troubles with pig/hadoop (you get NPE's when trying to connect to cassandra in the pig logs). You need this patch: http://svn.apache.org/viewvc?revision=1158940&view=revision And maybe this: http://svn.apache.org/viewvc?revision=1155157&view=revision -I had installed from riptano rpms. I downloaded the src, applied the patch, and did ant jar. I then replaced the rpm-installed cassandra jar with this new one (ugly, but I wanted to continue to run from the package). -I think I was able to just replace the apache-cassandra-0.8.4.jar on just my jobtracker + tasktracker nodes (I need to retest from scratch to be sure, I've done a _lot_ of configuring and reconfiguring) -Then I started getting ClassNotFound exceptions during map/reduce tasks. Still not sure why this fix works, but the problem seems to be cloudera pig version 0.20.2+923.97-1 has two jars that match pig*.jar (which is what cassandra contrib/pig/bin/pig_cassandra uses to setup the classpath). I had to rename /usr/lib/pig/pig.jar for things to work (leaving pig-0.8.1-cdh3u1-core.jar as the only match). My pig script is still running, but it's the first time it didn't immediately crash. will
Re: how to migrate?
create keyspace civicscience with replication_factor=3 and strategy_options = [{us-east:3}] and placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy'; FYI the replication_factor property with the NTS is incorrect, the next(?) revision of 0.8 will raise an error on restart. I'm not sure what you're saying. Should it have been: create keyspace civicscience with strategy_options = [{us-east:3}] and placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy'; ? I'm wondering if I write my own snitch that extends Ec2Snitch with overrides as follows: getDC = if(AZ == c || d) return us-east (to keep current nodes the same) else return us-east-hadoop; getRack = return super(); (returning a,b,c,d seems ok) prob easier to use the PropertyFileSnitch, see the yaml file and the conf/cassandra-topology.properties. You can then manually put the nodes into the DC and Rack you want. I'll read up on them next then. The Ec2Snitch seemed tempting to use, given I was in Ec2 ;-) -Can I (how do I safely) change the keyspace strategy_options from [{us-east:3}] to [{us-east:2, us-east-hadoop:1}] This seems like the riskiest/most complicated step of everything I've proposed... http://wiki.apache.org/cassandra/Operations#Replication The wiki has changed since I last read it (I guess that happens) ;-) I think I understand how to make changes and migrate data around now.
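For reference, the strategy_options change being discussed would itself be a one-liner in cassandra-cli. A sketch, assuming 0.8-era syntax (the same [{dc:count}] form as the create keyspace statements above), so treat it as illustrative rather than gospel:

update keyspace civicscience with strategy_options = [{us-east:2,us-east-hadoop:1}];

Per the Operations wiki page linked above, a placement change like this then needs nodetool repair run on each node so the replicas actually move to match the new strategy.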
how to migrate?
I was hoping to transition my simple cassandra cluster (where each node is a cassandra + hadoop tasktracker) to a cluster with two virtual datacenters (vanilla cassandra vs. cassandra + hadoop tasktracker), based on this: http://wiki.apache.org/cassandra/HadoopSupport#ClusterConfig The problem I'm having is my hadoop jobs are getting heavy enough it's affecting my user facing performance on my cluster. Right now I'm in AWS, and I have 4 nodes in us-east split over two availability zones (us-east-1c that I'll call c and us-east-1d that I'll call d), setup with this keyspace: create keyspace civicscience with replication_factor=3 and strategy_options = [{us-east:3}] and placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy'; And I'm using the Ec2Snitch. I'm wondering if I write my own snitch that extends Ec2Snitch with overrides as follows: getDC = if(AZ == c || d) return us-east (to keep current nodes the same) else return us-east-hadoop; getRack = return super(); (returning a,b,c,d seems ok) Then, if I boot N new nodes into us-east-1[a,b] they will be hadoop nodes because of the snitch. I'll obviously have to change my home brew cassandra + hadoop instances to selectively run task trackers or not (a/b = yes, and c/d = no). But: -Is the overall RF=3 still ok? -What is the recommended split between normal and hadoop in terms of strategy_options (assuming RF=3)? 2/1? -Can I (how do I safely) change the keyspace strategy_options from [{us-east:3}] to [{us-east:2, us-east-hadoop:1}] This seems like the riskiest/most complicated step of everything I've proposed... -After I change the options, what (if anything) would I have to do to migrate data around? One final question: should I add new nodes as Brisk instances instead of my home brew cassandra + hadoop nodes? I've obviously already put in the pain/effort of learning how to run hadoop + cassandra... Thanks for any help/advice! will
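To make the snitch idea above concrete, here is a minimal sketch of what such a subclass might look like. It assumes the 0.8-era Ec2Snitch, where getDatacenter() returns the region (us-east) and getRack() returns the AZ suffix (1a, 1b, ...); the class name is made up and the code is untested:

import java.io.IOException;
import java.net.InetAddress;
import org.apache.cassandra.config.ConfigurationException;
import org.apache.cassandra.locator.Ec2Snitch;

public class HadoopAwareEc2Snitch extends Ec2Snitch
{
    public HadoopAwareEc2Snitch() throws IOException, ConfigurationException
    {
        super();
    }

    @Override
    public String getDatacenter(InetAddress endpoint)
    {
        String rack = getRack(endpoint); // AZ suffix, e.g. "1c"
        if (rack.endsWith("c") || rack.endsWith("d"))
            return super.getDatacenter(endpoint); // current nodes stay in "us-east"
        return super.getDatacenter(endpoint) + "-hadoop"; // a/b nodes become "us-east-hadoop"
    }

    // getRack is inherited unchanged: a,b,c,d as racks seems ok, as noted above
}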
Re: Survey: Cassandra/JVM Resident Set Size increase
I finally upgraded from 0.7.4 to 0.8.0 (using riptano packages) 2 days ago. Before, my resident memory (for the java process) would slowly grow without bound and the OS would kill the process. But, over the last 2 days, I _think_ it's been stable. I'll let you know in a week :-) My other stats: AWS large (64 bit, 7.5GB, 4 compute units, no swap by default and I didn't enable it manually) Centos 5.6 Sun 1.6.0_24-b07 2 column families 4 machine cluster with RF=3 Mostly balanced write/read load (usually more writes) Not quite big data volumes, large 10^6 or small 10^7 ops/day No deletes or mutations, I only add or read Everything else is stock, I haven't tuned anything as performance was ok. No JVM options other than what was in the package. No JNA. Not sure about the GC patterns. will On Tue, Jul 12, 2011 at 9:28 AM, Chris Burroughs chris.burrou...@gmail.com wrote: ### Preamble There have been several reports on the mailing list of the JVM running Cassandra using too much memory. That is, the resident set size is (max java heap size + mmaped segments) and continues to grow until the process swaps, kernel oom killer comes along, or performance just degrades too far due to the lack of space for the page cache. It has been unclear from these reports if there is a pattern. My hope here is that by comparing JVM versions, OS versions, JVM configuration etc., we will find something. Thank you everyone for your time. Some example reports: - http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html - http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Very-high-memory-utilization-not-caused-by-mmap-on-sstables-td5840777.html - https://issues.apache.org/jira/browse/CASSANDRA-2868 - http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/OOM-or-what-settings-to-use-on-AWS-large-td6504060.html - http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-memory-problem-td6545642.html For reference theories include (in no particular order): - memory fragmentation - JVM bug - OS/glibc bug - direct memory - swap induced fragmentation - some other bad interaction of cassandra/jdk/jvm/os/nio-insanity. ### Survey 1. Do you think you are experiencing this problem? 2. Why? (This is a good time to share a graph like http://www.twitpic.com/5fdabn or http://img24.imageshack.us/img24/1754/cassandrarss.png) 3. Are you using mmap? (If yes be sure to have read http://wiki.apache.org/cassandra/FAQ#mmap , and explain how you have used pmap [or another tool] to rule out mmap and top deceiving you.) 4. Are you using JNA? Was mlockall successful (it's in the logs on startup)? 5. Is swap enabled? Are you swapping? 6. What version of Apache Cassandra are you using? 7. What is the earliest version of Apache Cassandra you recall seeing this problem with? 8. Have you tried the patch from CASSANDRA-2654 ? 9. What jvm and version are you using? 10. What OS and version are you using? 11. What are your jvm flags? 12. Have you tried limiting direct memory (-XX:MaxDirectMemorySize)? 13. Can you characterise how much GC your cluster is doing? 14. Approximately how many read/writes per unit time is your cluster doing (per node or the whole cluster)? 15. How are your column families configured (key cache size, row cache size, etc.)?
Re: What does a write lock ?
Disregard most of my post (already). I forgot that reads aren't isolated. That means A and B are states cassandra will *eventually* be in, but at any point in time a read might see a partial B (where some columns are still A, and others are B). Though, I'm sure someone else will confirm if I'm wrong yet again. For me, if I need two pieces of data to be consistently related to each other and stored in cassandra, I encode them (usually JSON) and store them in one column. will On Fri, Jul 8, 2011 at 8:30 AM, William Oberman ober...@civicscience.com wrote: Questions like this seem to come up a lot: http://stackoverflow.com/questions/6033888/cassandra-atomicity-isolation-of-column-updates-on-a-single-row-on-on-single-no http://stackoverflow.com/questions/2055037/cassandra-atomic-reads-writes-within-a-single-columnfamily http://www.mail-archive.com/user@cassandra.apache.org/msg14701.html Let's say you read state A (from one key in one CF), you change the data to A' in your client, and you write A'. Are you worried that someone else might have changed A to B during this process (making the new state a race between A' and B)? It doesn't sound to me like you are... It sounds to me like you're worried about a set of columns for the key being in a consistent state before, during, and after a process. And A -> A' and A -> B will each be atomic for the key (based on my understanding). But, if A' and B are changes to a different set of columns, I believe that would interleave, which itself could be inconsistent from your application's point of view. will On Thu, Jul 7, 2011 at 11:41 PM, Jeffrey Kesselman jef...@gmail.com wrote: Really, as i lay in the bath thinking about it, I concluded what I am looking for is a very limited form of Consistency. It's consistency over a single row on a single node just for the period of update. On Thu, Jul 7, 2011 at 10:34 PM, Jeffrey Kesselman jef...@gmail.com wrote: It's not really isolation, btw, because we aren't talking about anyone seeing an update mid-update. Rather, we are talking about when updates are allowed to occur. Atomicity means that all the updates happen together or they don't happen at all. Isolation means that no results of the update are visible until the entire update operation is complete. This really lies somewhere in the middle of the two concepts. It's part of the results of the combined effects of ACID On Thu, Jul 7, 2011 at 10:27 PM, Jonathan Ellis jbel...@gmail.com wrote: Sounds to me like you're confusing atomicity with isolation. On Thu, Jul 7, 2011 at 2:54 PM, Jeffrey Kesselman jef...@gmail.com wrote: Yup, I'm even more confused. Let's talk about the model, not the implementation. AIUI updates to a row are atomic across all columns in that row at once, true? If true then the next question is, does the validation happen inside or outside of that guarantee, and is the row guaranteed not to change between validation and update? If that is *not* the case then it makes a whole class of solutions to synchronization problems fail and puts my larger project in serious question. On Thu, Jul 7, 2011 at 3:43 PM, Yang tedd...@gmail.com wrote: no, the memtable is a ConcurrentSkipListMap, insertion can happen in parallel On Jul 7, 2011 9:24 AM, Jeffrey Kesselman jef...@gmail.com wrote: This has me more confused. Does this mean that ALL rows on a given node are only updated sequentially, never in parallel? On Thu, Jul 7, 2011 at 3:21 PM, Yang tedd...@gmail.com wrote: just to add onto what jonathan said the columns are immutable.
if u overwrite/reconcile a new obj is created and shoved into the memtable. there is a shared lock for all writes though, which guards against an exclusive lock on memtable switching/flushing On Jul 7, 2011 7:51 AM, A J s5a...@gmail.com wrote: Does a write lock: 1. Just the columns in question for the specific row in question ? 2. The full row in question ? 3. The full CF ? I doubt read does any locks. Thanks. -- It's always darkest just before you are eaten by a grue. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
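To illustrate the encode-related-data-into-one-column approach mentioned at the top of this thread, here is a rough sketch in Java using Jackson (which happens to ship in Cassandra's lib directory). The field names are made up, and the actual insert call depends on your client library, so only the encoding step is shown:

import java.util.HashMap;
import java.util.Map;
import org.codehaus.jackson.map.ObjectMapper;

public class OneColumnState
{
    public static byte[] encode() throws Exception
    {
        // Two logically related fields that must always be read together
        Map<String, Object> state = new HashMap<String, Object>();
        state.put("status", "active"); // hypothetical field
        state.put("balance", 100);     // hypothetical field
        // One JSON blob -> one column value -> one write, so a reader
        // never sees "status" from update A next to "balance" from update B
        return new ObjectMapper().writeValueAsString(state).getBytes("UTF-8");
    }
}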
Re: What does a write lock ?
I think you need to look into Zookeeper, or other distributed coordinator, as you have little/no guarantees from cassandra between 1-3 (in terms of the guarantees you want and need). And my terminology in my post is different than yours. My client == your server. Specifically, I was thinking in terms of: user -> cassandra client code (that runs on a server) -> cassandra server code (e.g. cassandra itself) that runs either on the same or different server On Fri, Jul 8, 2011 at 10:22 AM, Jeffrey Kesselman jef...@gmail.com wrote: Not quite, it's more limited and specific. The order of operations is all within the Cassandra node server and looks like this... We have one row, A. That's the only row being operated on. Client -> submits A' Server does the following: (1) Validate function reads current A (2) Validate function validates A' vs. A (3) If validation succeeds, allows update to A'. My fear/concern is that after 1 and before 3, a second update to A'' comes in and changes the current value of A, therefore invalidating my validation check, see? If Cassandra does not guard against this then one possible solution would be to make my own key-to-mutex map in memory, lock the mutex for A's key as a precursor to (1) and release it in a post-update function. But I am always very nervous about inserting locking into a process that wasn't designed with it already in mind... On Fri, Jul 8, 2011 at 8:30 AM, William Oberman ober...@civicscience.com wrote: Questions like this seem to come up a lot: http://stackoverflow.com/questions/6033888/cassandra-atomicity-isolation-of-column-updates-on-a-single-row-on-on-single-no http://stackoverflow.com/questions/2055037/cassandra-atomic-reads-writes-within-a-single-columnfamily http://www.mail-archive.com/user@cassandra.apache.org/msg14701.html Let's say you read state A (from one key in one CF), you change the data to A' in your client, and you write A'. Are you worried that someone else might have changed A to B during this process (making the new state a race between A' and B)? It doesn't sound to me like you are... It sounds to me like you're worried about a set of columns for the key being in a consistent state before, during, and after a process. And A -> A' and A -> B will each be atomic for the key (based on my understanding). But, if A' and B are changes to a different set of columns, I believe that would interleave, which itself could be inconsistent from your application's point of view. will On Thu, Jul 7, 2011 at 11:41 PM, Jeffrey Kesselman jef...@gmail.com wrote: Really, as i lay in the bath thinking about it, I concluded what I am looking for is a very limited form of Consistency. It's consistency over a single row on a single node just for the period of update. On Thu, Jul 7, 2011 at 10:34 PM, Jeffrey Kesselman jef...@gmail.com wrote: It's not really isolation, btw, because we aren't talking about anyone seeing an update mid-update. Rather, we are talking about when updates are allowed to occur. Atomicity means that all the updates happen together or they don't happen at all. Isolation means that no results of the update are visible until the entire update operation is complete. This really lies somewhere in the middle of the two concepts. It's part of the results of the combined effects of ACID On Thu, Jul 7, 2011 at 10:27 PM, Jonathan Ellis jbel...@gmail.com wrote: Sounds to me like you're confusing atomicity with isolation.
On Thu, Jul 7, 2011 at 2:54 PM, Jeffrey Kesselman jef...@gmail.com wrote: Yup, I'm even more confused. Let's talk about the model, not the implementation. AIUI updates to a row are atomic across all columns in that row at once, true? If true then the next question is, does the validation happen inside or outside of that guarantee, and is the row guaranteed not to change between validation and update? If that is *not* the case then it makes a whole class of solutions to synchronization problems fail and puts my larger project in serious question. On Thu, Jul 7, 2011 at 3:43 PM, Yang tedd...@gmail.com wrote: no, the memtable is a ConcurrentSkipListMap, insertion can happen in parallel On Jul 7, 2011 9:24 AM, Jeffrey Kesselman jef...@gmail.com wrote: This has me more confused. Does this mean that ALL rows on a given node are only updated sequentially, never in parallel? On Thu, Jul 7, 2011 at 3:21 PM, Yang tedd...@gmail.com wrote: just to add onto what jonathan said the columns are immutable. if u overwrite/reconcile a new obj is created and shoved into the memtable. there is a shared lock for all writes though, which guards against an exclusive lock on memtable switching/flushing On Jul 7, 2011 7:51 AM, A J s5a...@gmail.com wrote: Does a write lock: 1. Just the columns in question for the specific row in question ? 2. The full row in question
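As an aside, it's worth writing out what the "key-to-mutex map in memory" idea from this thread looks like in Java, because the code makes the limitation obvious: the lock only serializes validate-then-update within a single JVM, which is exactly why an external coordinator like Zookeeper comes up. Illustrative sketch only, not tested:

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.locks.ReentrantLock;

public class KeyLocks
{
    private final ConcurrentMap<String, ReentrantLock> locks =
        new ConcurrentHashMap<String, ReentrantLock>();

    public void withLock(String key, Runnable validateThenUpdate)
    {
        ReentrantLock lock = new ReentrantLock();
        ReentrantLock existing = locks.putIfAbsent(key, lock);
        if (existing != null)
            lock = existing;
        lock.lock();
        try
        {
            // read A, validate A' vs. A, then write A' -- serialized,
            // but only against other threads in THIS process
            validateThenUpdate.run();
        }
        finally
        {
            lock.unlock();
        }
    }
}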
Re: What does a write lock ?
Also, one point of early confusion for me is there is a slightly different definition of atomicity depending on whether you're talking software vs. database, and I'm a software guy. From wikipedia: Software = Atomicity is a guarantee of isolation from concurrent processes. Additionally, atomic operations commonly have a succeed-or-fail definition — they either successfully change the state of the system, or have no visible effect. Database = In an atomic transaction, a series of database operations either all occur, or nothing occurs. I believe that cassandra is using the database definition. will On Fri, Jul 8, 2011 at 10:35 AM, William Oberman ober...@civicscience.com wrote: I think you need to look into Zookeeper, or other distributed coordinator, as you have little/no guarantees from cassandra between 1-3 (in terms of the guarantees you want and need). And my terminology in my post is different than yours. My client == your server. Specifically, I was thinking in terms of: user -> cassandra client code (that runs on a server) -> cassandra server code (e.g. cassandra itself) that runs either on the same or different server On Fri, Jul 8, 2011 at 10:22 AM, Jeffrey Kesselman jef...@gmail.com wrote: Not quite, it's more limited and specific. The order of operations is all within the Cassandra node server and looks like this... We have one row, A. That's the only row being operated on. Client -> submits A' Server does the following: (1) Validate function reads current A (2) Validate function validates A' vs. A (3) If validation succeeds, allows update to A'. My fear/concern is that after 1 and before 3, a second update to A'' comes in and changes the current value of A, therefore invalidating my validation check, see? If Cassandra does not guard against this then one possible solution would be to make my own key-to-mutex map in memory, lock the mutex for A's key as a precursor to (1) and release it in a post-update function. But I am always very nervous about inserting locking into a process that wasn't designed with it already in mind... On Fri, Jul 8, 2011 at 8:30 AM, William Oberman ober...@civicscience.com wrote: Questions like this seem to come up a lot: http://stackoverflow.com/questions/6033888/cassandra-atomicity-isolation-of-column-updates-on-a-single-row-on-on-single-no http://stackoverflow.com/questions/2055037/cassandra-atomic-reads-writes-within-a-single-columnfamily http://www.mail-archive.com/user@cassandra.apache.org/msg14701.html Let's say you read state A (from one key in one CF), you change the data to A' in your client, and you write A'. Are you worried that someone else might have changed A to B during this process (making the new state a race between A' and B)? It doesn't sound to me like you are... It sounds to me like you're worried about a set of columns for the key being in a consistent state before, during, and after a process. And A -> A' and A -> B will each be atomic for the key (based on my understanding). But, if A' and B are changes to a different set of columns, I believe that would interleave, which itself could be inconsistent from your application's point of view. will On Thu, Jul 7, 2011 at 11:41 PM, Jeffrey Kesselman jef...@gmail.com wrote: Really, as i lay in the bath thinking about it, I concluded what I am looking for is a very limited form of Consistency. It's consistency over a single row on a single node just for the period of update.
On Thu, Jul 7, 2011 at 10:34 PM, Jeffrey Kesselman jef...@gmail.com wrote: It's not really isolation, btw, because we aren't talking about anyone seeing an update mid-update. Rather, we are talking about when updates are allowed to occur. Atomicity means that all the updates happen together or they don't happen at all. Isolation means that no results of the update are visible until the entire update operation is complete. This really lies somewhere in the middle of the two concepts. It's part of the results of the combined effects of ACID On Thu, Jul 7, 2011 at 10:27 PM, Jonathan Ellis jbel...@gmail.com wrote: Sounds to me like you're confusing atomicity with isolation. On Thu, Jul 7, 2011 at 2:54 PM, Jeffrey Kesselman jef...@gmail.com wrote: Yup, I'm even more confused. Let's talk about the model, not the implementation. AIUI updates to a row are atomic across all columns in that row at once, true? If true then the next question is, does the validation happen inside or outside of that guarantee, and is the row guaranteed not to change between validation and update? If that is *not* the case then it makes a whole class of solutions to synchronization problems fail and puts my larger project in serious question. On Thu, Jul 7, 2011 at 3:43 PM, Yang tedd...@gmail.com wrote: no, the memtable is a ConcurrentSkipListMap, insertion can happen in parallel On Jul 7, 2011 9:24 AM, Jeffrey Kesselman
Re: What does a write lock ?
I use a language-specific wrapper around thrift as my client, but yes, I guess I fundamentally mean thrift == client, and the cassandra server == server. will On Fri, Jul 8, 2011 at 11:08 AM, Jeffrey Kesselman jef...@gmail.com wrote: I am confused by what you mean by Cassandra client code. Is this part of the Cassandra server? My architecture is my user talks thrift to Cassandra.
Re: What does a write lock ?
I haven't ever written my own org.apache.cassandra.db.marshal.AbstractType (which is I think what you're talking about), so I have no idea. Looking up the JavaDoc for that class, validate says validate that the byte array is a valid sequence for the type we are supposed to be comparing, which sounds like a local operation to me (e.g. it shouldn't fetch remote data, it's just saying yep, this is a valid member of type T). will On Fri, Jul 8, 2011 at 11:17 AM, Jeffrey Kesselman jef...@gmail.com wrote: Alright, So are you saying the column validator, as specified by conf/storage-conf.xml is checked in the client interface library and not on the server side? That seems odd to me on a number of levels, not the least being I can't see how thrift could autogenerate that for different languages or how those other languages would use a Java class. On Fri, Jul 8, 2011 at 11:13 AM, William Oberman ober...@civicscience.com wrote: I use a language-specific wrapper around thrift as my client, but yes, I guess I fundamentally mean thrift == client, and the cassandra server == server. will On Fri, Jul 8, 2011 at 11:08 AM, Jeffrey Kesselman jef...@gmail.com wrote: I am confused by what you mean by Cassandra client code. Is this part of the Cassandra server? My architecture is my user talks thrift to Cassandra. -- It's always darkest just before you are eaten by a grue.
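For the record, the 0.7/0.8-era AbstractType contract is roughly validate(ByteBuffer) throwing MarshalException, and the point above is that a conforming implementation only inspects the bytes it is handed. A made-up example of such a check (mirroring the shape of the method, not a complete AbstractType subclass):

import java.nio.ByteBuffer;
import org.apache.cassandra.db.marshal.MarshalException;

public class FixedWidthCheck
{
    // Reject malformed bytes, but never consult other rows, other
    // columns, or other nodes -- validation is purely local
    public void validate(ByteBuffer bytes) throws MarshalException
    {
        if (bytes.remaining() != 4)
            throw new MarshalException("expected exactly 4 bytes");
    }
}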
Re: Cassandra memory problem
I think I had (and have) a similar problem: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/OOM-or-what-settings-to-use-on-AWS-large-td6504060.html My memory usage grew slowly until I ran out of mem and the OS killed my process (due to no swap). I'm still on 0.7.4, but I'm rolling out 0.8.1 next week, which I was hoping would fix the problem. I'm using Centos with Sun 1.6.0_24-b07 will On Thu, Jul 7, 2011 at 7:41 AM, Daniel Doubleday daniel.double...@gmx.net wrote: Hm - had to dig deeper and it totally looks like a native mem leak to me: We are still growing with res += 100MB a day. Cassandra is 8G now. I checked the cassandra process with pmap -x Here's the human readable (aggregated) output: Format is thingy: RSS in KB Summary: Total SST: 1961616 Anon RSS: 6499640 Total RSS: 8478376 Here's a little more detail: SSTables (data and index files) ** Attic: 0 PrivateChatNotification: 38108 Schema: 0 PrivateChat: 161048 UserData: 116788 HintsColumnFamily: 0 Rooms: 100548 Tracker: 476 Migrations: 0 ObjectRepository: 793680 BlobStore: 350924 Activities: 400044 LocationInfo: 0 Libraries ** javajar: 2292 nativelib: 13028 Other ** 28201: 32 jna979649866618987247.tmp: 92 locale-archive: 1492 [stack]: 132 java: 44 ffi8TsQPY(deleted): 8 And ** [anon]: 6499640 Maybe the output of pmap is totally misleading but my interpretation is that only 2GB of RSS is attributed to paged in sstables. I have one large anon block which looks like this: Address Kbytes RSS Dirty Mode Mapping 00073f60 0 3093248 3093248 rwx-- [ anon ] This is the native heap that's been allocated on startup and mlocked So there's still 3.5GB of anon memory. We haven't deployed https://issues.apache.org/jira/browse/CASSANDRA-2654 yet and this might be part of it but I don't think that's the main problem. As I said mem goes up by 100MB each day pretty linearly. Would be great if anyone could verify this by running pmap or talk me off the roof by explaining that nothing's the way it seems. All this might be heavily OS specific so maybe that's only on Debian? Thanks a lot Daniel On Jul 4, 2011, at 2:42 PM, Jonathan Ellis wrote: mmap'd data will be attributed to res, but the OS can page it out instead of killing the process. On Mon, Jul 4, 2011 at 5:52 AM, Daniel Doubleday daniel.double...@gmx.net wrote: Hi all, we have a mem problem with cassandra. res goes up without bounds (well until the os kills the process because we don't have swap) I found a thread that's about the same problem but on OpenJDK: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Very-high-memory-utilization-not-caused-by-mmap-on-sstables-td5840777.html We are on Debian with Sun JDK. Resident mem is 7.4G while heap is restricted to 3G. Anyone else seeing this with Sun JDK? Cheers, Daniel :/home/dd# java -version java version 1.6.0_24 Java(TM) SE Runtime Environment (build 1.6.0_24-b07) Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode) :/home/dd# ps aux |grep java cass 28201 9.5 46.8 372659544 7707172 ? SLl May24 5656:21 /usr/bin/java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms3000M -Xmx3000M -Xmn400M ... PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 28201 cass 20 0 355g 7.4g 1.4g S 8 46.9 5656:25 java -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
cassandra/hadoop/pig
I have a few cassandra/hadoop/pig questions. I currently have things set up in a test environment, and for the most part everything works. But, before I start to roll things out to production, I wanted to check on/confirm some things. When I originally set things up, I used: http://wiki.apache.org/cassandra/HadoopSupport http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html One difference I noticed between the two guides, which I ignored at the time, was how datanodes are treated. The wiki said At least one node in your cluster will also need to be a datanode. That's because Hadoop uses HDFS to store information like jar dependencies for your job, static data (like stop words for a word count), and things like that - it's the distributed cache. It's a very small amount of data but the Hadoop cluster needs it to run properly. But, the hadoop guide (if you follow it blindly like I did), creates a datanode on all TaskTracker nodes. I _think_ that is controlled by the conf/slaves file, but I haven't proved that yet. Is there any good reason to run datanodes on only the JobTracker vs. on all nodes? If I should only run it on the JobTracker, how do I properly stop the datanodes from starting automatically (when both start-dfs and start-mapred seem to draw from the same slaves file)? I noticed a second issue/oddness with datanodes, in that the HDFS data isn't always small. The other day I ran out of disk running my pig script. I checked, and by default, hadoop creates HDFS in /tmp, and I'm using EC2 (and /tmp is on the boot device) which is only 10G by default. Do other people put HDFS on a different disk? If yes, I'll really want to only run one datanode, as I don't want to re-template all of my cassandra nodes to have HDFS disks vs. one new JobTracker node. In terms of hardware, I am running small instances (32bit, 2GB) in the test cluster, while my production cluster is larges (64bit, 7 or 8GB). I was going to check the performance impact there, but even on smalls in test I was able to run hadoop jobs while serving web requests. I am wondering if smalls are causing the high HDFS usage though (I think data might spill more, if I'm understanding things correctly). If these are more hadoop than cassandra questions, let me know and I'll move my questions around. I did want to mention that these are small details compared to the amount of complicated things that worked like a charm during my configuration and testing of the combination of cassandra/hadoop/pig. It was impressive :-) Thanks! will
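On the /tmp point above: the usual fix is to point Hadoop's storage at a disk with real capacity via hdfs-site.xml (or by moving hadoop.tmp.dir in core-site.xml, which both defaults derive from). A sketch using 0.20-era property names, with a made-up mount point:

<!-- hdfs-site.xml: keep HDFS blocks off the small root/boot volume -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/hdfs/data</value> <!-- hypothetical mount with room for spills -->
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/mnt/hdfs/name</value>
</property>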
Re: cassandra/hadoop/pig
That makes sense. The problem is I jumped directly to using pig, which is abstracting some of the data flow from me. I guess I'll have to figure out what it's doing under the covers, to know how to optimize/fix bottlenecks. But for now, I'm taking this information to mean I should run datanodes with HDFS on larger non-root disks on all tasktracker nodes to ensure my pig scripts work, until I'm willing to either write the M/R code myself, or figure out how to optimize pig and/or the pig script. will On Wed, Jul 6, 2011 at 3:29 PM, Edward Capriolo edlinuxg...@gmail.com wrote: On Wed, Jul 6, 2011 at 2:48 PM, William Oberman ober...@civicscience.com wrote: I have a few cassandra/hadoop/pig questions. I currently have things set up in a test environment, and for the most part everything works. But, before I start to roll things out to production, I wanted to check on/confirm some things. When I originally set things up, I used: http://wiki.apache.org/cassandra/HadoopSupport http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html One difference I noticed between the two guides, which I ignored at the time, was how datanodes are treated. The wiki said At least one node in your cluster will also need to be a datanode. That's because Hadoop uses HDFS to store information like jar dependencies for your job, static data (like stop words for a word count), and things like that - it's the distributed cache. It's a very small amount of data but the Hadoop cluster needs it to run properly. But, the hadoop guide (if you follow it blindly like I did), creates a datanode on all TaskTracker nodes. I _think_ that is controlled by the conf/slaves file, but I haven't proved that yet. Is there any good reason to run datanodes on only the JobTracker vs. on all nodes? If I should only run it on the JobTracker, how do I properly stop the datanodes from starting automatically (when both start-dfs and start-mapred seem to draw from the same slaves file)? I noticed a second issue/oddness with datanodes, in that the HDFS data isn't always small. The other day I ran out of disk running my pig script. I checked, and by default, hadoop creates HDFS in /tmp, and I'm using EC2 (and /tmp is on the boot device) which is only 10G by default. Do other people put HDFS on a different disk? If yes, I'll really want to only run one datanode, as I don't want to re-template all of my cassandra nodes to have HDFS disks vs. one new JobTracker node. In terms of hardware, I am running small instances (32bit, 2GB) in the test cluster, while my production cluster is larges (64bit, 7 or 8GB). I was going to check the performance impact there, but even on smalls in test I was able to run hadoop jobs while serving web requests. I am wondering if smalls are causing the high HDFS usage though (I think data might spill more, if I'm understanding things correctly). If these are more hadoop than cassandra questions, let me know and I'll move my questions around. I did want to mention that these are small details compared to the amount of complicated things that worked like a charm during my configuration and testing of the combination of cassandra/hadoop/pig. It was impressive :-) Thanks! will The logic that only one datanode is needed is not an absolute truth. If your jobs use ColumnFamilyInputFormat to read and write to ColumnFamilyOutputFormat then technically you only need one DataNode to hold the distributed cache.
However, if you have a large amount of intermediate results or even a multiphase job that has to persist data between phases (this is very very common) then that single DataNode is a bottleneck. Most hadoop clusters run a DataNode and TaskTracker on each slave. Most situations would use datanodes very heavily, for example suppose you have 4 map/reduce jobs to run on the same Cassandra data. Ingesting the data from Cassandra at the beginning of each job would be wasteful. It might be better to take the data into HDFS during the first job and then save it. Your subsequent jobs could use that instead of re-acquiring it from Cassandra.
Re: Strong Consistency with ONE read/writes
Was just going off of: Send the value to the primary replica and send placeholder values to the other replicas. Sounded like you wanted to write the value to one, and write the placeholder to N-1 to me. But, C* will propagate the value to N-1 eventually anyways, 'cause that's just what it does anyways :-) will On Sun, Jul 3, 2011 at 7:47 PM, AJ a...@dude.podzone.net wrote: On 7/3/2011 3:49 PM, Will Oberman wrote: Why not send the value itself instead of a placeholder? Now it takes 2x writes on a random node to do a single update (write placeholder, write update) and N*x writes from the client (write value, write placeholder to N-1). Where N is replication factor. Seems like extra network and IO instead of less... To send the value to each node is 1.) unnecessary, 2.) will only cause a large burst of network traffic. Think about if it's a large data value, such as a document. Just let C* do its thing. The extra messages are tiny and don't significantly increase latency since they are all sent asynchronously. Of course, I still think this sounds like reimplementing Cassandra internals in a Cassandra client (just guessing, I'm not a cassandra dev) I don't see how. Maybe you should take a peek at the source. On Jul 3, 2011, at 5:20 PM, AJ a...@dude.podzone.net wrote: Yang, How would you deal with the problem when the 1st node responds success but then crashes before completely forwarding any replicas? Then, after switching to the next primary, a read would return stale data. Here's a quick-n-dirty way: Send the value to the primary replica and send placeholder values to the other replicas. The placeholder value is something like, PENDING_UPDATE. The placeholder values are sent with timestamps 1 less than the timestamp for the actual value that went to the primary. Later, when the changes propagate, the actual values will overwrite the placeholders. In event of a crash before the placeholder gets overwritten, the next read value will tell the client so. The client will report to the user that the key/column is unavailable. The downside is you've overwritten your data and maybe would like to know what the old data was! But, maybe there's another way using other columns or with MVCC. The client would want a success from the primary and the secondary replicas to be certain of future read consistency in case the primary goes down immediately as I said above. The ability to set an update_pending flag on any column value would probably make this work. But, I'll think more on this later. aj On 7/2/2011 10:55 AM, Yang wrote: there is a JIRA completed in 0.7.x that Prefers a certain node in snitch, so this does roughly what you want MOST of the time but the problem is that it does not GUARANTEE that the same node will always be read. I recently read into the HBase vs Cassandra comparison thread that started after Facebook dropped Cassandra for their messaging system, and understood some of the differences. what you want is essentially what HBase does. the fundamental difference there is really due to the gossip protocol: it's a probabilistic, or eventually consistent failure detector while HBase/Google Bigtable use Zookeeper/Chubby to provide a strong failure detector (a distributed lock).
so in HBase, if a tablet server goes down, it really goes down, it cannot re-grab the tablet from the new tablet server without going through a start up protocol (notifying the master, which would notify the clients etc), in other words it is guaranteed that one tablet is served by only one tablet server at any given time. in comparison the above JIRA only TRIES to serve that key from one particular replica. HBase can have that guarantee because the group membership is maintained by the strong failure detector. just for hacking curiosity, a strong failure detector + Cassandra replicas is not impossible (actually seems not difficult), although the performance is not clear. what would such a strong failure detector bring to Cassandra besides this ONE-ONE strong consistency ? that is an interesting question I think. considering that HBase has been deployed on big clusters, it is probably OK with the performance of the strong Zookeeper failure detector. then a further question was: why did Dynamo originally choose to use the probabilistic failure detector? yes Dynamo's main theme is eventually consistent, so the Phi-detector is **enough**, but if a strong detector buys us more with little cost, wouldn't that be great? On Fri, Jul 1, 2011 at 6:53 PM, AJ a...@dude.podzone.net wrote: Is this possible? All reads and writes for a given key will always go to the same node from a client. It seems the only thing needed is to allow the clients to compute which node is the closest replica for the given key using the same algorithm C* uses. When the first replica receives the write request, it will write to
Re: Strong Consistency with ONE read/writes
I'm using cassandra as a tool, like a black box with a certain contract to the world. Without modifying the core, C* will send the updates to all replicas, so your plan would cause the extra write (for the placeholder). I wasn't assuming a modification to how C* fundamentally works. Sounds like you are hacking (or at least looking) at the source, so all the power to you if/when you try these kinds of changes. will On Sun, Jul 3, 2011 at 8:45 PM, AJ a...@dude.podzone.net wrote: On 7/3/2011 6:32 PM, William Oberman wrote: Was just going off of: Send the value to the primary replica and send placeholder values to the other replicas. Sounded like you wanted to write the value to one, and write the placeholder to N-1 to me. Yes, that is what I was suggesting. The point of the placeholders is to handle the crash case that I talked about... like a WAL does. But, C* will propagate the value to N-1 eventually anyways, 'cause that's just what it does anyways :-) will On Sun, Jul 3, 2011 at 7:47 PM, AJ a...@dude.podzone.net wrote: On 7/3/2011 3:49 PM, Will Oberman wrote: Why not send the value itself instead of a placeholder? Now it takes 2x writes on a random node to do a single update (write placeholder, write update) and N*x writes from the client (write value, write placeholder to N-1). Where N is replication factor. Seems like extra network and IO instead of less... To send the value to each node is 1.) unnecessary, 2.) will only cause a large burst of network traffic. Think about if it's a large data value, such as a document. Just let C* do its thing. The extra messages are tiny and don't significantly increase latency since they are all sent asynchronously. Of course, I still think this sounds like reimplementing Cassandra internals in a Cassandra client (just guessing, I'm not a cassandra dev) I don't see how. Maybe you should take a peek at the source. On Jul 3, 2011, at 5:20 PM, AJ a...@dude.podzone.net wrote: Yang, How would you deal with the problem when the 1st node responds success but then crashes before completely forwarding any replicas? Then, after switching to the next primary, a read would return stale data. Here's a quick-n-dirty way: Send the value to the primary replica and send placeholder values to the other replicas. The placeholder value is something like, PENDING_UPDATE. The placeholder values are sent with timestamps 1 less than the timestamp for the actual value that went to the primary. Later, when the changes propagate, the actual values will overwrite the placeholders. In event of a crash before the placeholder gets overwritten, the next read value will tell the client so. The client will report to the user that the key/column is unavailable. The downside is you've overwritten your data and maybe would like to know what the old data was! But, maybe there's another way using other columns or with MVCC. The client would want a success from the primary and the secondary replicas to be certain of future read consistency in case the primary goes down immediately as I said above. The ability to set an update_pending flag on any column value would probably make this work. But, I'll think more on this later. aj On 7/2/2011 10:55 AM, Yang wrote: there is a JIRA completed in 0.7.x that Prefers a certain node in snitch, so this does roughly what you want MOST of the time but the problem is that it does not GUARANTEE that the same node will always be read.
I recently read into the HBase vs Cassandra comparison thread that started after Facebook dropped Cassandra for their messaging system, and understood some of the differences. what you want is essentially what HBase does. the fundamental difference there is really due to the gossip protocol: it's a probabilistic, or eventually consistent failure detector while HBase/Google Bigtable use Zookeeper/Chubby to provide a strong failure detector (a distributed lock). so in HBase, if a tablet server goes down, it really goes down, it cannot re-grab the tablet from the new tablet server without going through a start up protocol (notifying the master, which would notify the clients etc), in other words it is guaranteed that one tablet is served by only one tablet server at any given time. in comparison the above JIRA only TRIES to serve that key from one particular replica. HBase can have that guarantee because the group membership is maintained by the strong failure detector. just for hacking curiosity, a strong failure detector + Cassandra replicas is not impossible (actually seems not difficult), although the performance is not clear. what would such a strong failure detector bring to Cassandra besides this ONE-ONE strong consistency ? that is an interesting question I think. considering that HBase has been deployed on big clusters, it is probably OK with the performance of the strong Zookeeper failure
Re: Strong Consistency with ONE read/writes
Ok, I see the "you happen to choose the 'right' node" idea, but it sounds like you want to solve C* problems in the client, and they already wrote that complicated code to make clients simple. You're talking about reimplementing key-node mappings, network topology (with failures), etc... Plus, if they change something about replication and you get too tricky, your code breaks. Or, if they optimize something, you might not benefit. On Jul 1, 2011, at 10:33 PM, AJ a...@dude.podzone.net wrote: I'm saying I will make my clients forward the C* requests to the first replica instead of forwarding to a random node. -- Sent from my Android phone with K-9 Mail. Please excuse my brevity. Will Oberman ober...@civicscience.com wrote: Sent from my iPhone On Jul 1, 2011, at 9:53 PM, AJ a...@dude.podzone.net wrote: Is this possible? All reads and writes for a given key will always go to the same node from a client. I don't think that's true. Given a key K, the client will write to N nodes (N=replication factor). And at consistency level ONE the client will return after 1 ack (from the N writes).
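To put a finer point on "reimplementing key-node mappings": with RandomPartitioner, finding the "right" node means recomputing the key's token (an MD5-derived BigInteger, if memory serves) and then walking your own cached copy of the ring plus the replication strategy. A rough sketch of just the first piece:

import java.math.BigInteger;
import java.security.MessageDigest;

public class ClientSideRouting
{
    // RandomPartitioner-style token of a key (MD5 hash, absolute value).
    // A client doing its own routing has to get this, the ring layout,
    // and replica placement all exactly right -- and keep them right
    // across Cassandra upgrades, which is the objection above.
    public static BigInteger token(byte[] key) throws Exception
    {
        byte[] digest = MessageDigest.getInstance("MD5").digest(key);
        return new BigInteger(digest).abs();
    }
}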
Re: hadoop results
I think I'll do the former, thanks! On Wed, Jun 29, 2011 at 11:16 PM, aaron morton aa...@thelastpickle.com wrote: How about get_slice() with reversed == true and count = 1 to get the highest time UUID ? Or you can also store a column with a magic name that has the value of the timeuuid that is the current metric to use. Cheers - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com On 30 Jun 2011, at 06:35, William Oberman wrote: I'll start with my question: given a CF with comparator TimeUUIDType, what is the most efficient way to get the greatest column's value? Context: I've been running cassandra for a couple of months now, so obviously it's time to start layering more on top :-) In my test environment, I managed to get pig/hadoop running, and developed a few scripts to collect metrics I've been missing since I switched from MySQL to cassandra (including the ever useful select count(*) from table equivalent). I was hoping to dump the results of this processing back into cassandra for use in other tools/processes. My initial thought was: new CF called stats with comparator TimeUUIDType. The basic idea being I'd store: stat_name -> time stat was computed (as UUID) -> value That way I can also see a historical perspective of any given stat for auditing (and for cumulative stats to see trends). The stat_name itself is a URI that is composed of what and any constraints on the what (including an optional time range, if the stat supports it). E.g. ClassOfSomething/ID/MetricName/OptionalTimeRange (or something, still deciding on the format of the URI). But, right now, the only way I know to get the current stat value would be to iterate over all columns (the TimeUUIDs) and then return the last one. Thanks for any tips, will -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
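Spelling out Aaron's get_slice() suggestion in raw Thrift terms (0.7/0.8-era API; the stats CF name comes from this thread, and this is a sketch rather than tested code; wrapper libraries expose the same reversed-slice idea):

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;

public class LatestStat
{
    public static ColumnOrSuperColumn latest(Cassandra.Client client, String statName) throws Exception
    {
        // Empty start/finish plus reversed=true and count=1 asks the
        // server for just the greatest TimeUUID column of the row
        SliceRange range = new SliceRange(ByteBuffer.allocate(0), ByteBuffer.allocate(0), true, 1);
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(range);
        List<ColumnOrSuperColumn> cols = client.get_slice(
                ByteBuffer.wrap(statName.getBytes("UTF-8")),
                new ColumnParent("stats"),
                predicate,
                ConsistencyLevel.QUORUM);
        return cols.isEmpty() ? null : cols.get(0);
    }
}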
hadoop results
I'll start with my question: given a CF with comparator TimeUUIDType, what is the most efficient way to get the greatest column's value? Context: I've been running cassandra for a couple of months now, so obviously it's time to start layering more on top :-) In my test environment, I managed to get pig/hadoop running, and developed a few scripts to collect metrics I've been missing since I switched from MySQL to cassandra (including the ever useful select count(*) from table equivalent). I was hoping to dump the results of this processing back into cassandra for use in other tools/processes. My initial thought was: new CF called stats with comparator TimeUUIDType. The basic idea being I'd store: stat_name -> time stat was computed (as UUID) -> value That way I can also see a historical perspective of any given stat for auditing (and for cumulative stats to see trends). The stat_name itself is a URI that is composed of what and any constraints on the what (including an optional time range, if the stat supports it). E.g. ClassOfSomething/ID/MetricName/OptionalTimeRange (or something, still deciding on the format of the URI). But, right now, the only way I know to get the current stat value would be to iterate over all columns (the TimeUUIDs) and then return the last one. Thanks for any tips, will
Re: Backup/Restore: Coordinating Cassandra Nodetool Snapshots with Amazon EBS Snapshots?
I've been doing EBS snapshots for mysql for some time now, and was using a similar pattern as Josep (XFS with freeze, snap, unfreeze), with the extra complication that I was actually using 8 EBS's in RAID-0 (and the extra extra complication that I had to lock the MyISAM tables... glad to be moving away from that). For cassandra I switched to ephemeral disks, as per recommendations from this forum. One note on EBS snapshots though: the last time I checked (which was some time ago) I noticed degraded IO performance on the box during the snapshotting process even though the take snapshot command returns almost immediately. My theory back then was that amazon does the delta/compress/store outside of the VM, but it obviously has an effect on resources on the box the VM runs on. I was doing this on a mysql slave that no one talked to, so I didn't care/bother looking into it further. will On Thu, Jun 23, 2011 at 10:30 AM, Peter Schuller peter.schul...@infidyne.com wrote: EBS volume atomicity is good. We've had tons of experience since EBS came out almost 4 years ago, to back all kinds of things, including large DBs. And thanks a lot for coming forward with production experience. That is always useful with these things. -- / Peter Schuller -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
OOM (or, what settings to use on AWS large?)
I woke up this morning to all 4 of 4 of my cassandra instances reporting they were down in my cluster. I quickly started them all, and everything seems fine. I'm doing a postmortem now, but it appears they all OOM'd at roughly the same time, which was not reported in any cassandra log, but I discovered something in /var/log/kern that showed java died of oom(*). In amazon, I'm using large instances for cassandra, and they have no swap (as recommended), so I have ~8GB of ram. Should I use a different max mem setting? I'm using a stock rpm from riptano/datastax. If I run ps -aux I get: /usr/bin/java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms3843M -Xmx3843M -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss128k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true -Djava.rmi.server.hostname=X.X.X.X -Dcom.sun.management.jmxremote.port=8080 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dmx4jaddress=0.0.0.0 -Dmx4jport=8081 -Dlog4j.configuration=log4j-server.properties -Dlog4j.defaultInitOverride=true -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -cp :/etc/cassandra/conf:/usr/share/cassandra/lib/antlr-3.1.3.jar:/usr/share/cassandra/lib/apache-cassandra-0.7.4.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-collections-3.2.1.jar:/usr/share/cassandra/lib/commons-lang-2.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.1.jar:/usr/share/cassandra/lib/guava-r05.jar:/usr/share/cassandra/lib/high-scale-lib.jar:/usr/share/cassandra/lib/jackson-core-asl-1.4.0.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.4.0.jar:/usr/share/cassandra/lib/jetty-6.1.21.jar:/usr/share/cassandra/lib/jetty-util-6.1.21.jar:/usr/share/cassandra/lib/jline-0.9.94.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/jug-2.0.0.jar:/usr/share/cassandra/lib/libthrift-0.5.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/mx4j-tools.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.6.1.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.6.1.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar org.apache.cassandra.thrift.CassandraDaemon (*) Also, why would they all OOM so close to each other? Bad luck? Or once the first node went down, is there an increased chance of the rest? I'm still on 0.7.4, when I released cassandra to production that was the latest release. In addition to (or instead of?) fixing memory settings, I'm guessing I should upgrade. will
Re: OOM (or, what settings to use on AWS large?)
Well, I managed to run 50 days before an OOM, so any changes I make will take a while to test ;-) I've seen the GCInspector log lines appear periodically in my logs, but I didn't see a correlation with the crash. I'll read the instructions on how to properly do a rolling upgrade today, practice on test, and try that on production first. will On Wed, Jun 22, 2011 at 8:41 AM, Sasha Dolgy sdo...@gmail.com wrote: We had a similar problem a last month and found that the OS eventually in the end killed the Cassandra process on each of our nodes ... I've upgraded to 0.8.0 from 0.7.6-2 and have not had the problem since, but i do see consumption levels rising consistently from one day to the next on each node .. On Wed, Jun 1, 2011 at 2:30 PM, Sasha Dolgy sdo...@gmail.com wrote: is there a specific string I should be looking for in the logs that isn't super obvious to me at the moment... On Tue, May 31, 2011 at 8:21 PM, Jonathan Ellis jbel...@gmail.com wrote: The place to start is with the statistics Cassandra logs after each GC. look for GCInspector I found this in the logs on all my servers but never did much after that On Wed, Jun 22, 2011 at 2:33 PM, William Oberman ober...@civicscience.com wrote: I woke up this morning to all 4 of 4 of my cassandra instances reporting they were down in my cluster. I quickly started them all, and everything seems fine. I'm doing a postmortem now, but it appears they all OOM'd at roughly the same time, which was not reported in any cassandra log, but I discovered something in /var/log/kern that showed java died of oom(*). In amazon, I'm using large instances for cassandra, and they have no swap (as recommended), so I have ~8GB of ram. Should I use a different max mem setting? I'm using a stock rpm from riptano/datastax. 
If I run ps -aux I get: /usr/bin/java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 -Xms3843M -Xmx3843M -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss128k -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly -Djava.net.preferIPv4Stack=true -Djava.rmi.server.hostname=X.X.X.X -Dcom.sun.management.jmxremote.port=8080 -Dcom.sun.management.jmxremote.ssl=false -Dcom.sun.management.jmxremote.authenticate=false -Dmx4jaddress=0.0.0.0 -Dmx4jport=8081 -Dlog4j.configuration=log4j-server.properties -Dlog4j.defaultInitOverride=true -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -cp :/etc/cassandra/conf:/usr/share/cassandra/lib/antlr-3.1.3.jar:/usr/share/cassandra/lib/apache-cassandra-0.7.4.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-collections-3.2.1.jar:/usr/share/cassandra/lib/commons-lang-2.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.1.jar:/usr/share/cassandra/lib/guava-r05.jar:/usr/share/cassandra/lib/high-scale-lib.jar:/usr/share/cassandra/lib/jackson-core-asl-1.4.0.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.4.0.jar:/usr/share/cassandra/lib/jetty-6.1.21.jar:/usr/share/cassandra/lib/jetty-util-6.1.21.jar:/usr/share/cassandra/lib/jline-0.9.94.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/jug-2.0.0.jar:/usr/share/cassandra/lib/libthrift-0.5.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/mx4j-tools.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.6.1.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.6.1.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar org.apache.cassandra.thrift.CassandraDaemon (*) Also, why would they all OOM so close to each other? Bad luck? Or once the first node went down, is there an increased chance of the rest? I'm still on 0.7.4, when I released cassandra to production that was the latest release. In addition to (or instead of?) fixing memory settings, I'm guessing I should upgrade. will -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: OOM (or, what settings to use on AWS large?)
I was wondering/I figured that /var/log/kern indicated the OS was killing java (versus an internal OOM). The nodetool repair is interesting. My application never deletes, so I didn't bother running it. But, if that helps prevent OOMs as well, I'll add it to the crontab (plan A is still upgrading to 0.8.0). will On Wed, Jun 22, 2011 at 8:53 AM, Sasha Dolgy sdo...@gmail.com wrote: Yes ... this is because it was the OS that killed the process, and wasn't related to Cassandra crashing. Reviewing our monitoring, we saw that memory utilization was pegged at 100% for days and days before it was finally killed because 'apt' was fighting for resource. At least, that's as far as I got in my investigation before giving up, moving to 0.8.0 and implementing 24hr nodetool repair on each node via cronjob. so far ... no problems. On Wed, Jun 22, 2011 at 2:49 PM, William Oberman ober...@civicscience.com wrote: Well, I managed to run 50 days before an OOM, so any changes I make will take a while to test ;-) I've seen the GCInspector log lines appear periodically in my logs, but I didn't see a correlation with the crash. I'll read the instructions on how to properly do a rolling upgrade today, practice on test, and try that on production first. will -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: OOM (or, what settings to use on AWS large?)
The command line is posted (in the ps output); I assume those are the defaults (I didn't touch anything). The machines basically just run cassandra (and standard CentOS 5 background stuff). will On Wed, Jun 22, 2011 at 9:49 AM, Jake Luciani jak...@gmail.com wrote: Are you running with the default heap settings? what else is running on the boxes? On Wed, Jun 22, 2011 at 9:06 AM, William Oberman ober...@civicscience.com wrote: I was wondering about that; I figured /var/log/kern indicated the OS was killing java (versus an internal OOM). The nodetool repair is interesting. My application never deletes, so I didn't bother running it. But, if that helps prevent OOMs as well, I'll add it to the crontab (plan A is still upgrading to 0.8.0). will On Wed, Jun 22, 2011 at 8:53 AM, Sasha Dolgy sdo...@gmail.com wrote: Yes ... this is because it was the OS that killed the process, and wasn't related to Cassandra crashing. Reviewing our monitoring, we saw that memory utilization was pegged at 100% for days and days before it was finally killed because 'apt' was fighting for resources. At least, that's as far as I got in my investigation before giving up, moving to 0.8.0 and implementing 24hr nodetool repair on each node via cronjob ... so far, no problems. On Wed, Jun 22, 2011 at 2:49 PM, William Oberman ober...@civicscience.com wrote: Well, I managed to run 50 days before an OOM, so any changes I make will take a while to test ;-) I've seen the GCInspector log lines appear periodically in my logs, but I didn't see a correlation with the crash. I'll read the instructions on how to properly do a rolling upgrade today, practice on test, and try that on production first. will -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- http://twitter.com/tjake -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
rpm from 0.7.x -> 0.8?
I'm running 0.7.4 from rpm (riptano). If I do a yum upgrade, it's trying to do 0.7.6. To get 0.8.x I have to do a yum install of apache-cassandra08, but that is going to install two copies. Is there a semi-official way of properly upgrading to 0.8 via rpm? -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: rpm from 0.7.x -> 0.8?
I just did a remove then install, and it seems to work. For those of you out there with JMX issues, the default port moved from 8080 to 7199 (which includes the internal default used by nodetool). I was confused why nodetool ring would fail on some boxes and not others; I had to add -p depending on the version of nodetool. will On Wed, Jun 22, 2011 at 10:15 AM, William Oberman ober...@civicscience.com wrote: I'm running 0.7.4 from rpm (riptano). If I do a yum upgrade, it's trying to do 0.7.6. To get 0.8.x I have to do a yum install of apache-cassandra08, but that is going to install two copies. Is there a semi-official way of properly upgrading to 0.8 via rpm? -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
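For the archives, the remove-then-install dance plus the port gotcha look roughly like this. The 0.7 package name is a guess (the thread only names the 0.8 one), so treat this as a sketch rather than the official procedure:

yum remove apache-cassandra         # whatever the riptano 0.7.x package is actually called
yum install apache-cassandra08      # the 0.8 package named above
nodetool -h localhost -p 7199 ring  # 0.8 default JMX port
nodetool -h localhost -p 8080 ring  # pre-0.8 default, for nodes not yet upgraded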
Re: rpm from 0.7.x -> 0.8?
I have a question about auto_bootstrap. When I originally brought up the cluster, I did: -seed with auto_boot = false -1,2,3 with auto_boot = true Now that I'm doing a rolling upgrade, do I set them all to auto_boot = true? Or does the seed stay false? Or should I mark them all false? I have manually set tokens on all of them. The doc confused me: Set to 'true' to make new [non-seed] nodes automatically migrate the right data to themselves. (If no InitialToken (http://wiki.apache.org/cassandra/InitialToken) is specified, they will pick one such that they will get half the range of the most-loaded node.) If a node starts up without bootstrapping, it will mark itself bootstrapped so that you can't subsequently accidentally bootstrap a node with data on it. (You can reset this by wiping your data and commitlog directories.) Default is: 'false', so that new clusters don't bootstrap immediately. You should turn this on when you start adding new nodes to a cluster that already has data on it. I'm not adding new nodes, but the cluster does have data on it... will On Wed, Jun 22, 2011 at 11:39 AM, William Oberman ober...@civicscience.com wrote: I just did a remove then install, and it seems to work. For those of you out there with JMX issues, the default port moved from 8080 to 7199 (which includes the internal default used by nodetool). I was confused why nodetool ring would fail on some boxes and not others; I had to add -p depending on the version of nodetool. will On Wed, Jun 22, 2011 at 10:15 AM, William Oberman ober...@civicscience.com wrote: I'm running 0.7.4 from rpm (riptano). If I do a yum upgrade, it's trying to do 0.7.6. To get 0.8.x I have to do a yum install of apache-cassandra08, but that is going to install two copies. Is there a semi-official way of properly upgrading to 0.8 via rpm? -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
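A quick way to check what each node is actually set to (the conf path assumes the rpm layout); as Jonathan's reply below notes, the value is only consulted on a node's very first start:

grep auto_bootstrap /etc/cassandra/conf/cassandra.yaml
# e.g. auto_bootstrap: false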
Re: rpm from 0.7.x -> 0.8?
Thanks Jonathan. I'm sure it's been true for everyone else as well, but the rolling upgrade seems to have worked like a charm for me (other than the initial confusion over the JMX port # changing). One minor thing that is probably particular to my case: when I removed the old package, it unlinked my symlink /var/lib/cassandra/data (rather than edit the cassandra config, I symlinked my amazon disk to where cassandra expected it). At first I thought I had lost all of my data, but after restoring the link, everything was happy. will On Wed, Jun 22, 2011 at 12:34 PM, Jonathan Ellis jbel...@gmail.com wrote: Doesn't matter. auto_bootstrap only applies to first start ever. On Wed, Jun 22, 2011 at 10:48 AM, William Oberman ober...@civicscience.com wrote: I have a question about auto_bootstrap. When I originally brought up the cluster, I did: -seed with auto_boot = false -1,2,3 with auto_boot = true Now that I'm doing a rolling upgrade, do I set them all to auto_boot = true? Or does the seed stay false? Or should I mark them all false? I have manually set tokens on all of them. The doc confused me: Set to 'true' to make new [non-seed] nodes automatically migrate the right data to themselves. (If no InitialToken is specified, they will pick one such that they will get half the range of the most-loaded node.) If a node starts up without bootstrapping, it will mark itself bootstrapped so that you can't subsequently accidentally bootstrap a node with data on it. (You can reset this by wiping your data and commitlog directories.) Default is: 'false', so that new clusters don't bootstrap immediately. You should turn this on when you start adding new nodes to a cluster that already has data on it. I'm not adding new nodes, but the cluster does have data on it... will On Wed, Jun 22, 2011 at 11:39 AM, William Oberman ober...@civicscience.com wrote: I just did a remove then install, and it seems to work. For those of you out there with JMX issues, the default port moved from 8080 to 7199 (which includes the internal default used by nodetool). I was confused why nodetool ring would fail on some boxes and not others; I had to add -p depending on the version of nodetool. will On Wed, Jun 22, 2011 at 10:15 AM, William Oberman ober...@civicscience.com wrote: I'm running 0.7.4 from rpm (riptano). If I do a yum upgrade, it's trying to do 0.7.6. To get 0.8.x I have to do a yum install of apache-cassandra08, but that is going to install two copies. Is there a semi-official way of properly upgrading to 0.8 via rpm? -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
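The symlink repair mentioned above is one line; the source path here is made up (the original only says "my amazon disk"):

ln -s /mnt/amazon-disk/cassandra/data /var/lib/cassandra/data   # restore the link before starting cassandra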
Re: Docs: Token Selection
I haven't done it yet, but when I researched how to make geo-diverse/failover DCs, I figured I'd have to do something like RF=6, strategy = {DC1=3, DC2=3}, and LOCAL_QUORUM for reads/writes. This gives you an ack after 2 local nodes do the read/write, but the data eventually gets distributed to the other DC for a full failover. No yin-yang, but I believe it accomplishes the same goal? will On Fri, Jun 17, 2011 at 2:20 AM, Jonathan Ellis jbel...@gmail.com wrote: Replication location is determined by the row key, not the location of the client that inserted it. (Otherwise, without knowing what DC a row was inserted in, you couldn't look it up to read it!) On Fri, Jun 17, 2011 at 12:20 AM, AJ a...@dude.podzone.net wrote: On 6/16/2011 9:45 PM, aaron morton wrote: But, I'm thinking about using OldNetworkTopStrat. NetworkTopologyStrategy is where it's at. Oh yeah? It didn't look like it would serve my requirements. I want 2 full production geo-diverse data centers with each serving as a failover for the other. Random Partitioner. Each dc holds 2 replicas from the local clients and 1 replica goes to the other dc. It doesn't look like I can do a yin-yang setup like that with NTS. Am I wrong? A - Aaron Morton Freelance Cassandra Developer @aaronmorton http://www.thelastpickle.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
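The RF=6, {DC1:3, DC2:3} layout sketched above would be created roughly like this with the 0.8-era cassandra-cli; the keyspace and DC names are placeholders, connection flags and the strategy_options syntax vary a bit by CLI version:

cassandra-cli -h localhost <<'EOF'
create keyspace geo
  with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = [{DC1:3, DC2:3}];
EOF

With that in place, LOCAL_QUORUM reads/writes ack after 2 of the 3 replicas in the client's own DC, as described.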
prep for cassandra storage from pig
I think I'm stuck on typing issues trying to store data in cassandra. To verify, cassandra wants (key, {tuples}) My pig script is fairly brief: raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)}); --columns == timeUUID -> JSON rows = FOREACH raw GENERATE key, FLATTEN(columns); alias_target_day = FOREACH rows { --I wrote a specialized parser that does exactly what I need observation_map = com.civicscience.pig.ParseObservation($2); GENERATE $0 as alias, observation_map#'_fqt' as target, observation_map#'_day' as day; }; grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day); X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1, COUNT($1)) as day_count; This gets me: (targetA, (day1, count)) (targetA, (day2, count)) (targetB, (day1, count)) But, cassandra wants the 2nd item to be a bag. So, I tried: X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1, COUNT($1))) as day_count; But this results in: (targetA, {((day1, count))}) (targetA, {((day2, count))}) (targetB, {((day1, count))}) It's hard to see, but the 2nd item now has a nested tuple as the first value, which is still bad. How do I get (key, {tuple})??? I wasn't sure where to post this (pig or cassandra), so I'm posting to the pig list too. will
Re: prep for cassandra storage from pig
My problem is the column names are dynamic (a date), and pygmalion seems to want the column names to be fixed at compile time (the script). On Wed, Jun 15, 2011 at 3:04 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: Hi Will, That's partly why I like to use FromCassandraBag and ToCassandraBag from pygmalion - it does the work for you to get it back into a form that cassandra understands. Others may know better how to massage the data into that form using just pig, but if all else fails, you could write a udf to do that. Jeremy On Jun 15, 2011, at 1:17 PM, William Oberman wrote: I think I'm stuck on typing issues trying to store data in cassandra. To verify, cassandra wants (key, {tuples}) My pig script is fairly brief: raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)}); --columns == timeUUID -> JSON rows = FOREACH raw GENERATE key, FLATTEN(columns); alias_target_day = FOREACH rows { --I wrote a specialized parser that does exactly what I need observation_map = com.civicscience.pig.ParseObservation($2); GENERATE $0 as alias, observation_map#'_fqt' as target, observation_map#'_day' as day; }; grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day); X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1, COUNT($1)) as day_count; This gets me: (targetA, (day1, count)) (targetA, (day2, count)) (targetB, (day1, count)) But, cassandra wants the 2nd item to be a bag. So, I tried: X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1, COUNT($1))) as day_count; But this results in: (targetA, {((day1, count))}) (targetA, {((day2, count))}) (targetB, {((day1, count))}) It's hard to see, but the 2nd item now has a nested tuple as the first value, which is still bad. How do I get (key, {tuple})??? I wasn't sure where to post this (pig or cassandra), so I'm posting to the pig list too. will -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: prep for cassandra storage from pig
I'll do a reply all, to keep this more consistent (sorry!). Rather than staying stuck, I wrote a custom function: TupleToBagOfTuple. I'm curious if I could have avoided it with proper pig scripting though. On Wed, Jun 15, 2011 at 3:08 PM, William Oberman ober...@civicscience.com wrote: My problem is the column names are dynamic (a date), and pygmalion seems to want the column names to be fixed at compile time (the script). On Wed, Jun 15, 2011 at 3:04 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: Hi Will, That's partly why I like to use FromCassandraBag and ToCassandraBag from pygmalion - it does the work for you to get it back into a form that cassandra understands. Others may know better how to massage the data into that form using just pig, but if all else fails, you could write a udf to do that. Jeremy On Jun 15, 2011, at 1:17 PM, William Oberman wrote: I think I'm stuck on typing issues trying to store data in cassandra. To verify, cassandra wants (key, {tuples}) My pig script is fairly brief: raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name, value)}); --columns == timeUUID -> JSON rows = FOREACH raw GENERATE key, FLATTEN(columns); alias_target_day = FOREACH rows { --I wrote a specialized parser that does exactly what I need observation_map = com.civicscience.pig.ParseObservation($2); GENERATE $0 as alias, observation_map#'_fqt' as target, observation_map#'_day' as day; }; grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day); X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1, COUNT($1)) as day_count; This gets me: (targetA, (day1, count)) (targetA, (day2, count)) (targetB, (day1, count)) But, cassandra wants the 2nd item to be a bag. So, I tried: X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1, COUNT($1))) as day_count; But this results in: (targetA, {((day1, count))}) (targetA, {((day2, count))}) (targetB, {((day1, count))}) It's hard to see, but the 2nd item now has a nested tuple as the first value, which is still bad. How do I get (key, {tuple})??? I wasn't sure where to post this (pig or cassandra), so I'm posting to the pig list too. will -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
hadoop/pig notes
I decided to try out hadoop/pig + cassandra. I had my ups and downs getting the script I wanted to run working. I'm sure everyone who tries will have their own experiences/problems, but mine were: -Everything I need to know was in http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html and http://wiki.apache.org/cassandra/HadoopSupport -Java is really picky about hostnames. I'm in EC2, and rather than rely on DNS, I basically have all of my machines share an /etc/hosts file. But, the command line hostname wasn't returning the same thing as in /etc/hosts, which caused all kinds of weird hadoop issues at first. (I had hostname as foo and /etc/hosts had foo.prod). -I forgot I had iptables on. It's always easier to start with no firewalls (this is true when configuring anything, of course). -Use the same version of everything everywhere. For hadoop/pig, I was having issues until I used the combination of hadoop-0.20.2 + pig-0.8.1. -For hadoop's mapred-site.xml you HAVE to supply a port (hostname:port); there isn't a standard one, and it seems arbitrary. I used 8021, based on notes in a hadoop ticket somewhere (I think there was an attempt to standardize on it). It took me a while to figure out the syntax of Pig Latin, but I finally managed to get a script that does a count of all columns in a column family: rows = LOAD 'cassandra://keyspace/columnfamily' USING CassandraStorage(); filter_rows = FILTER rows BY $1 is not null; counts = FOREACH filter_rows GENERATE COUNT($1); counts_in_bag = GROUP counts ALL; sum_of_bag = FOREACH counts_in_bag GENERATE SUM($1); dump sum_of_bag; I'm trying to see the impact of running hadoop on the same servers as cassandra now. And yes, I've seen the note in the wiki about the clever partitioning of cassandra nodes to allow for web latency nodes + hadoop processing nodes :-)
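The hostname mismatch takes seconds to check; the names here are just the ones from the anecdote above:

hostname             # returned "foo"
grep foo /etc/hosts  # listed "foo.prod" -- java/hadoop want these two to agree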
Re: best way to backup
Thanks, I think I'm getting some of the file layout/data structures now, so that helps with the backup strategy. I might still start simple, as it's usually harder to screw up simple, but at least I'll know where I can go with something more clever. will On Sat, Apr 30, 2011 at 9:15 AM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote: The files inside the keyspace folders are the SSTables. -- From: aaron morton [mailto:aa...@thelastpickle.com] Sent: Friday, April 29, 2011 4:49 PM To: user@cassandra.apache.org Subject: Re: best way to backup William, Some info on the sstables from me: http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/ If you want to know more check out the BigTable and original Facebook papers, linked from the wiki: http://wiki.apache.org/cassandra/ArchitectureOverview Aaron On 29 Apr 2011, at 23:43, William Oberman wrote: Dumb question, but referenced twice now: which files are the SSTables and why is backing them up incrementally a win? Or should I not bother to understand internals, and instead just roll with the "back up my keyspace(s) and system in a compressed tar" strategy, as while it may be excessive, it's guaranteed to work and work easily (which I like, a great deal). will On Fri, Apr 29, 2011 at 4:58 AM, Daniel Doubleday daniel.double...@gmx.net wrote: What we are about to set up is a time machine like backup. This is more like an add on to the s3 backup. Our boxes have an additional larger drive for local backup. We create a new backup snapshot every x hours which hardlinks the files in the previous snapshot (a bit like cassandra's incremental_backups thing) and then we sync that snapshot dir with the cassandra data dir. We can do archiving / backup to an external system from there without impacting the main data raid. But the main reason to do this is to have an 'omg we screwed up big time and deleted / corrupted data' recovery. On Apr 28, 2011, at 9:53 PM, William Oberman wrote: Even with N-nodes for redundancy, I still want to have backups. I'm an amazon person, so naturally I'm thinking S3. Reading over the docs, and messing with nodetool, it looks like each new snapshot contains the previous snapshot as a subset (and I've read how cassandra uses hard links to avoid excessive disk use). When does that pattern break down? I'm basically debating if I can do a rsync like backup, or if I should do a compressed tar backup. And I obviously want multiple points in time. S3 does allow file versioning, if a file or file name is changed/reused over time (only matters in the rsync case). My only concerns with compressed tars are I'll have to have free space to create the archive and I get no delta space savings on the backup (the former is solved by not allowing the disk space to get so low and/or adding more nodes to bring down the space, the latter is solved by S3 being really cheap anyways). -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: best way to backup
Dumb question, but referenced twice now: which files are the SSTables and why is backing them up incrementally a win? Or should I not bother to understand internals, and instead just roll with the "back up my keyspace(s) and system in a compressed tar" strategy, as while it may be excessive, it's guaranteed to work and work easily (which I like, a great deal). will On Fri, Apr 29, 2011 at 4:58 AM, Daniel Doubleday daniel.double...@gmx.net wrote: What we are about to set up is a time machine like backup. This is more like an add on to the s3 backup. Our boxes have an additional larger drive for local backup. We create a new backup snapshot every x hours which hardlinks the files in the previous snapshot (a bit like cassandra's incremental_backups thing) and then we sync that snapshot dir with the cassandra data dir. We can do archiving / backup to an external system from there without impacting the main data raid. But the main reason to do this is to have an 'omg we screwed up big time and deleted / corrupted data' recovery. On Apr 28, 2011, at 9:53 PM, William Oberman wrote: Even with N-nodes for redundancy, I still want to have backups. I'm an amazon person, so naturally I'm thinking S3. Reading over the docs, and messing with nodetool, it looks like each new snapshot contains the previous snapshot as a subset (and I've read how cassandra uses hard links to avoid excessive disk use). When does that pattern break down? I'm basically debating if I can do a rsync like backup, or if I should do a compressed tar backup. And I obviously want multiple points in time. S3 does allow file versioning, if a file or file name is changed/reused over time (only matters in the rsync case). My only concerns with compressed tars are I'll have to have free space to create the archive and I get no delta space savings on the backup (the former is solved by not allowing the disk space to get so low and/or adding more nodes to bring down the space, the latter is solved by S3 being really cheap anyways). -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
best way to backup
Even with N-nodes for redundancy, I still want to have backups. I'm an amazon person, so naturally I'm thinking S3. Reading over the docs, and messing with nodetool, it looks like each new snapshot contains the previous snapshot as a subset (and I've read how cassandra uses hard links to avoid excessive disk use). When does that pattern break down? I'm basically debating if I can do a rsync like backup, or if I should do a compressed tar backup. And I obviously want multiple points in time. S3 does allow file versioning, if a file or file name is changed/reused over time (only matters in the rsync case). My only concerns with compressed tars are I'll have to have free space to create the archive and I get no delta space savings on the backup (the former is solved by not allowing the disk space to get so low and/or adding more nodes to bring down the space, the latter is solved by S3 being really cheap anyways). -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: best way to backup
Interesting. Both use cases seem easy to code. Compress to S3 = cassandra snapshot, tar, s3 put. EBS = cassandra snapshot, rsync snapshot dir -> ebs, ebs snapshot. I think the former is cheaper in terms of costs, as my gut says keeping around an EBS drive is more money than the lack of deltas in S3. But, EBS would allow extremely fine grained snapshotting without storage penalties (EBS snaps are supposed to be compressed deltas behind the scenes). Of course, I don't know how cassandra likes frequent snapshotting, and given the amount of node redundancy/eventual consistency that seems pointless anyways. will On Thu, Apr 28, 2011 at 3:57 PM, Sasha Dolgy sdo...@gmail.com wrote: You could take a snapshot to an EBS volume. then, take a snapshot of that via AWS. of course, this is ok when they -aren't- having outages and issues ... On Apr 28, 2011 9:54 PM, William Oberman ober...@civicscience.com wrote: Even with N-nodes for redundancy, I still want to have backups. I'm an amazon person, so naturally I'm thinking S3. Reading over the docs, and messing with nodetool, it looks like each new snapshot contains the previous snapshot as a subset (and I've read how cassandra uses hard links to avoid excessive disk use). When does that pattern break down? I'm basically debating if I can do a rsync like backup, or if I should do a compressed tar backup. And I obviously want multiple points in time. S3 does allow file versioning, if a file or file name is changed/reused over time (only matters in the rsync case). My only concerns with compressed tars are I'll have to have free space to create the archive and I get no delta space savings on the backup (the former is solved by not allowing the disk space to get so low and/or adding more nodes to bring down the space, the latter is solved by S3 being really cheap anyways). -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
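A minimal sketch of the "cassandra snapshot, tar, s3 put" option, assuming s3cmd, the 0.7-era rpm data layout, and placeholder keyspace/bucket names; per the later message in this thread, remember to include the system keyspace:

nodetool -h localhost snapshot
tar czf /mnt/backup-$(date +%Y%m%d).tar.gz \
  /var/lib/cassandra/data/MyKeyspace/snapshots \
  /var/lib/cassandra/data/system/snapshots
s3cmd put /mnt/backup-*.tar.gz s3://my-backup-bucket/

Tarring the snapshots/ subdirectories (rather than the live data dirs) avoids racing against compaction; it's also the "guaranteed to work and work easily" flavor discussed above.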
Re: best way to backup
My newbie mistake (always good to test things): my script wasn't storing/restoring system, only my keyspace. So, if you want to be able to restore from backup, make sure you save the keyspace and system! will On Thu, Apr 28, 2011 at 4:35 PM, Jeremy Hanna jeremy.hanna1...@gmail.com wrote: one thing we're looking at doing is watching the cassandra data directory and backing up the sstables to s3 when they are created. Some guys at simplegeo started tablesnap that does this: https://github.com/simplegeo/tablesnap What it does is for every sstable that is pushed to s3, it also copies a json file with the current files in the directory, so you can know what to restore in that event (as far as I understand). On Apr 28, 2011, at 2:53 PM, William Oberman wrote: Even with N-nodes for redundancy, I still want to have backups. I'm an amazon person, so naturally I'm thinking S3. Reading over the docs, and messing with nodetool, it looks like each new snapshot contains the previous snapshot as a subset (and I've read how cassandra uses hard links to avoid excessive disk use). When does that pattern break down? I'm basically debating if I can do a rsync like backup, or if I should do a compressed tar backup. And I obviously want multiple points in time. S3 does allow file versioning, if a file or file name is changed/reused over time (only matters in the rsync case). My only concerns with compressed tars are I'll have to have free space to create the archive and I get no delta space savings on the backup (the former is solved by not allowing the disk space to get so low and/or adding more nodes to bring down the space, the latter is solved by S3 being really cheap anyways). -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: advice for EC2 deployment
While I haven't configured it for multi-region yet, Sasha is exactly right about how amazon's DNS works (returning the private vs. public IP depending on whether the machine is local to the region or not). For extra fun, now that Route53 exists you can (somewhat trivially) map and dynamically maintain all EC2 instances to stable DNS names (but make sure to use CNAMEs to get the DNS magic!). E.g. cassandra1.somethinghardtoguess.ec2.yourdomain.com -> weird.ec2.public.dns.name. I'd drop in the somethinghardtoguess myself, given Route53 can expose your internal network topology if someone can guess the DNS name. will On Wed, Apr 27, 2011 at 8:15 AM, Sasha Dolgy sdo...@gmail.com wrote: Hi, If I understand you correctly, you are trying to get a private ip in us-east speaking to the private ip in us-west. to make your life easier, configure your nodes to use hostname of the server. if it's in a different region, it will use the public ip (ec2 dns will handle this for you) and if it's in the same region, it will use the private ip. this way you can stop worrying about if you are using the public or private ip to communicate with another node. let the aws dns do the work for you. just make sure you are using v0.8 with SSL turned on and have the appropriate security group definitions ... -sasha On Wed, Apr 27, 2011 at 1:55 PM, pankajsoni0126 pankajsoni0...@gmail.com wrote: I have been trying to deploy a Cassandra cluster across regions and for that I posted this: "IP address resolution in MultiDC setup". But when it comes to getting nodes talking to each other in different regions, say us-east and us-west, over private IP's of EC2 nodes, I am facing problems. I am assuming if Cassandra is built for multi-DC setup it should be easily deployed with node1's DC1's public IP listed as seed in all nodes in DC2 and to gain idea about network topology? I have hit a dud for deployment in such a scenario. Or is there any way possible to use private IP's for such a scenario in EC2, as public IPs are less secure and costly? -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
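Route53 record management is just an API call; a sketch with the modern aws CLI (which postdates this thread, so at the time you'd hit the REST API directly), where the zone id, names, and TTL are all placeholders:

aws route53 change-resource-record-sets --hosted-zone-id Z123EXAMPLE --change-batch '{
  "Changes": [{
    "Action": "UPSERT",
    "ResourceRecordSet": {
      "Name": "cassandra1.somethinghardtoguess.ec2.yourdomain.com",
      "Type": "CNAME",
      "TTL": 60,
      "ResourceRecords": [{"Value": "ec2-1-2-3-4.compute-1.amazonaws.com"}]
    }
  }]
}'

Using a CNAME to the instance's public DNS name is what preserves the private-inside/public-outside resolution trick described above.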
Re: advice for EC2 deployment
It's great advice, but I'm still torn. I've never done multi-region work before, and I'd prefer to wait for 0.8 with built-in inter-node security, but I'm otherwise ready to roll (and need to roll) cassandra out sooner than that. Given how well my system held up with a total single AZ failure, I'm really leaning toward starting by treating AZ's as DCs, and racks as... random? I don't think that part matters. My question for today is whether to just use the property file snitch, or to roll my own version of Ec2Snitch that does AZ as DC. I do increase my risk being single region to start, so I was going to figure out how to push snapshots to S3. One question on that note: is it better to try and snapshot all nodes at roughly the same point in time, or is it better to do rolling snapshots? will On Wed, Apr 27, 2011 at 7:13 AM, aaron morton aa...@thelastpickle.com wrote: Using the EC2Snitch you could have one AZ in us-east-1 and one AZ in us-west-1, treat each AZ as a single rack and each region as a DC. The network topology is rack aware so will prefer requests that go to the same rack (not much of an issue when you have only one rack). If possible I would use the same RF in each DC, if you want the failover to be as clean as possible (see earlier comments about the number of failed nodes in a dc). i.e. 3 replicas in each dc / region. Until you find a reason otherwise, use LOCAL_QUORUM; if that proves to be too slow or you get more experience and feel comfortable with the trade offs then change to a lower CL. Dropping the CL level for write bursts does not make the cluster run any faster, it lets the client think the cluster is running faster and can result in the client overloading (in a good "this is what it should do" way) the cluster. This can result in more eventual consistency work to be done later during maintenance or read requests. If that is a reasonable trade off, you can write at CL ONE and read at CL ALL to ensure you get consistent reads (quorum is not good enough in that case). Jump in and test it at Quorum, you may find the write performance is good enough. There are lots of dials to play with http://wiki.apache.org/cassandra/MemtableThresholds Hope that helps. Aaron On 27 Apr 2011, at 09:31, William Oberman wrote: I see what you're saying. I was able to control write latency on mysql using insert vs insert delayed (what I feel is MySQL's poor man's eventual consistency option) + the fact that replication was a background asynchronous process. In terms of read latency, I was able to do up to a few hundred well indexed mysql queries (across AZs) on a view while keeping the overall latency of the page around or less than a second. I basically am replacing two use cases, the cases with difficult to scale anticipated write volumes. The first case was previously using insert delayed (which I'm doing in cassandra as ONE) as I wasn't getting consistent write/read operations before anyways. The second case was using traditional insert (which I was going to replace with some QUORUM-like level, I was assuming LOCAL_QUORUM). But, the latter case uses a write through memory cache (memcache), so I don't know how often it really reads data from the persistent store. But I definitely need to make sure it is consistent. In any case, it sounds like I'd be best served treating AZs as DCs, but then I don't know what to make the racks? Or do racks not matter in a single AZ? That way I can get an ack from a LOCAL_QUORUM read/write before the (slightly) slower read/write to/from the other AZ (for redundancy). 
Then I'm only screwed if Amazon has a multi-AZ failure (so far, they've kept it to only one!) :-) will On Tue, Apr 26, 2011 at 5:01 PM, aaron morton aa...@thelastpickle.com wrote: One difference between Cassandra and MySQL replication may be when the network IO happens. Was the MySQL replication synchronous on transaction commit? I was only aware that it had async replication, which means the client is not exposed to the network latency. In cassandra the network latency is exposed to the client as it needs to wait for the CL number of nodes to respond. If you use the PropertyFileSnitch with the NetworkTopologyStrategy you can manually assign machines to racks / dc's based on IP. See the conf/cassandra-topology.property file. There is also an Ec2Snitch which (from the code) /** * A snitch that assumes an EC2 region is a DC and an EC2 availability_zone * is a rack. This information is available in the config for the node. Recent discussion on DC aware CL levels: http://www.mail-archive.com/user@cassandra.apache.org/msg11414.html Hope that helps. Aaron On 27 Apr 2011, at 01:18, William Oberman wrote: Thanks Aaron! Unless no one on this list uses EC2, there were a few minor troubles end of last week through the weekend which taught me a lot about obscure failure modes in various applications I use :-) My original post was trying to be more redundant than fast, which has been my overall goal from even before moving to Cassandra (my downtime from the EC2 madness was minimal, and due to only having one single point of failure == the amazon load balancer). My secondary goal was trying to make moving to a second region easier, but if that is causing problems I can drop the idea. I might be downplaying the cost of inter-AZ communication, but I've lived with that for quite some time, for example my current setup of MySQL in Master-Master replication is split over zones, and my webservers live in yet different zones. Maybe Cassandra is chattier than I'm used to? (again, I'm fairly new to cassandra) Based on that article, the discussion, and the recent EC2 issues, it sounds like it would be better to start with: -6 nodes split in two AZs 3/3 -Configure replication to do 2 in one AZ and one in the other (NetworkTopology treats AZs as racks, so does RF=3,us-east=3 make this happen naturally?) -What does LOCAL_QUORUM do in this case? Is there a rack quorum? Or do the natural latencies of AZs make LOCAL_QUORUM behave like a rack quorum? will On Tue, Apr 26, 2011 at 1:14 AM, aaron morton aa...@thelastpickle.com wrote: For background see this article: http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers And this recent discussion: http://www.mail-archive.com/user@cassandra.apache.org/msg12502.html Issues that may be a concern: - lots of cross AZ latency in us-east, e.g. LOCAL_QUORUM ops must wait cross AZ. Also consider it during maintenance tasks, how much of a pain is it going to be to have latency between every node. - IMHO not having sufficient (by that I mean 3) replicas in a cassandra DC to handle a single node failure when working at Quorum reduces the utility of the DC. e.g. with a local RF of 2 in the west, the quorum is 2, and if you lose one node from the replica set you will not be able to use local QUORUM for keys in that range. Or consider a failure mode where the west is disconnected from the east. Could you start simple with 3 replicas in one AZ in us-east and 3 replicas in an AZ+Region? Then work through some failure scenarios. Hope that helps. Aaron On 22 Apr 2011, at 03:28, William Oberman wrote: Hi, My service is not yet ready to be fully multi-DC, due to how some of my legacy MySQL stuff works. But, I wanted to get cassandra going ASAP and work towards multi-DC. I have two main cassandra use cases: one where I can handle eventual consistency (and all of the writes/reads are currently ONE), and one where I can't (writes/reads are currently QUORUM). My test cluster is currently 4 smalls all in us-east with RF=3 (more to prove I can do clustering, than to have an exact production replica). All of my unit tests, and load tests (again, not to prove true max load, but more to tease out concurrency issues) are passing now. 
Re: advice for EC2 deployment
Thanks Sasha. Fortunately/unfortunately I did realize the current default behavior of the Ec2Snitch, but my application isn't multi-region capable (yet), so I need to get intra-region redundancy. And having a SingleRegionEc2Snitch that did DC=ec2zone and RACK=??? would be much better for me (for now). On Wed, Apr 27, 2011 at 9:21 AM, Sasha Dolgy sdo...@gmail.com wrote: Hi William, The default behavior of Ec2Snitch is outlined below: http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/locator/Ec2Snitch.java // Split "us-east-1a" or "asia-1a" into "us-east"/"1a" and "asia"/"1a". String azone = new String(b, "UTF-8"); String[] splits = azone.split("-"); ec2zone = splits[splits.length - 1]; ec2region = splits.length < 3 ? splits[0] : splits[0] + "-" + splits[1]; logger.info("EC2Snitch using region: " + ec2region + ", zone: " + ec2zone + "."); ApplicationState.DC = ec2region ApplicationState.RACK = ec2zone We leverage cassandra instances in APAC, US, Europe ... so it's important for us to know that we have one data center in each 'region' and multiple racks per DC ... -sasha On Wed, Apr 27, 2011 at 3:06 PM, William Oberman ober...@civicscience.com wrote: It's great advice, but I'm still torn. I've never done multi-region work before, and I'd prefer to wait for 0.8 with built-in inter-node security, but I'm otherwise ready to roll (and need to roll) cassandra out sooner than that. Given how well my system held up with a total single AZ failure, I'm really leaning toward starting by treating AZ's as DCs, and racks as... random? I don't think that part matters. My question for today is whether to just use the property file snitch, or to roll my own version of Ec2Snitch that does AZ as DC. I do increase my risk being single region to start, so I was going to figure out how to push snapshots to S3. One question on that note: is it better to try and snapshot all nodes at roughly the same point in time, or is it better to do rolling snapshots? will -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
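The availability-zone string that code splits comes from the EC2 instance metadata service, so you can see exactly what the snitch will see from the instance itself:

curl http://169.254.169.254/latest/meta-data/placement/availability-zone
# e.g. "us-east-1a" -> region "us-east", rack "1a"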
Re: advice for EC2 deployment
I don't think of it as migrating an instance; it's more of a destroy/start with EC2. But, I still think it would be very useful to spin up a set of instances with known hostnames (cassandra1, 2, 3... N) and be able to quickly SSH to them by doing ssh ec2u...@cassandra1.random.ec2.mydomain.com. Also, it makes finding seeds a lot easier, as you don't have to manage IPs in the config file, just names (cassandra-seed1.random.ec2.mydomain.com). I should have mentioned it, but people that are already doing this trick (I'm not... yet) are actually doing: hostname.region.ec2.mydomain.com (as it's useful to know the region). I don't think anything cares about AZ, but you could embed that too if it matters. will On Wed, Apr 27, 2011 at 9:26 AM, Sasha Dolgy sdo...@gmail.com wrote: if you migrate the instance, does Route53 automatically re-map all the information to the new ec2 instance? another issue is that cassandra only maintains the IP of the other nodes, and not the hostname (assumed based on output of the nodetool ring) ... which means, if you migrate the instance and Route53 does do some auto-magic .. the private ip for the instance will have changed and you will need to migrate that node back into the ring, while moving the old referenced IP out ... we've had quite a lot of pain with this in the past. rule of thumb, if you want to upgrade / migrate an instance, you need to remove it from the ring, do your work, bootstrap it back to the ring .. i think this could be avoided if cassandra maintained hostname references and not just IP references for nodes. -sasha On Wed, Apr 27, 2011 at 2:56 PM, William Oberman ober...@civicscience.com wrote: While I haven't configured it for multi-region yet, Sasha is exactly right about how amazon's DNS works (returning the private vs. public IP depending on whether the machine is local to the region or not). For extra fun, now that Route53 exists you can (somewhat trivially) map and dynamically maintain all EC2 instances to stable DNS names (but make sure to use CNAMEs to get the DNS magic!). E.g. cassandra1.somethinghardtoguess.ec2.yourdomain.com -> weird.ec2.public.dns.name. I'd drop in the somethinghardtoguess myself, given Route53 can expose your internal network topology if someone can guess the DNS name. will -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: advice for EC2 deployment
Oh, and Route53 doesn't do anything automatically, but there is an API to manage the DNS. It's up to you to run a task on instance boot/terminate, or a cron job if you want to do this trick (for now, seems like a solid future feature of Route53). Though, I hear geographically aware Route53 is already in the works (to route EC2 traffic to the closest region). will On Wed, Apr 27, 2011 at 9:33 AM, William Oberman ober...@civicscience.com wrote: I don't think of it as migrating an instance; it's more of a destroy/start with EC2. But, I still think it would be very useful to spin up a set of instances with known hostnames (cassandra1, 2, 3... N) and be able to quickly SSH to them by doing ssh ec2u...@cassandra1.random.ec2.mydomain.com. Also, it makes finding seeds a lot easier, as you don't have to manage IPs in the config file, just names (cassandra-seed1.random.ec2.mydomain.com). I should have mentioned it, but people that are already doing this trick (I'm not... yet) are actually doing: hostname.region.ec2.mydomain.com (as it's useful to know the region). I don't think anything cares about AZ, but you could embed that too if it matters. will On Wed, Apr 27, 2011 at 9:26 AM, Sasha Dolgy sdo...@gmail.com wrote: if you migrate the instance, does Route53 automatically re-map all the information to the new ec2 instance? another issue is that cassandra only maintains the IP of the other nodes, and not the hostname (assumed based on output of the nodetool ring) ... which means, if you migrate the instance and Route53 does do some auto-magic .. the private ip for the instance will have changed and you will need to migrate that node back into the ring, while moving the old referenced IP out ... we've had quite a lot of pain with this in the past. rule of thumb, if you want to upgrade / migrate an instance, you need to remove it from the ring, do your work, bootstrap it back to the ring .. i think this could be avoided if cassandra maintained hostname references and not just IP references for nodes. -sasha On Wed, Apr 27, 2011 at 2:56 PM, William Oberman ober...@civicscience.com wrote: While I haven't configured it for multi-region yet, Sasha is exactly right about how amazon's DNS works (returning the private vs. public IP depending on whether the machine is local to the region or not). For extra fun, now that Route53 exists you can (somewhat trivially) map and dynamically maintain all EC2 instances to stable DNS names (but make sure to use CNAMEs to get the DNS magic!). E.g. cassandra1.somethinghardtoguess.ec2.yourdomain.com -> weird.ec2.public.dns.name. I'd drop in the somethinghardtoguess myself, given Route53 can expose your internal network topology if someone can guess the DNS name. will -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: advice for EC2 deployment
I think you're right about changing NetworkTopologyStrategy, but the timing isn't working in my favor at this point. I wonder how bad that will really be... On Wed, Apr 27, 2011 at 9:35 AM, Sasha Dolgy sdo...@gmail.com wrote: so can you not simply leverage a strategy that replicates data between racks and at some point in the future when you move to multi-dc upgrade the replication strategy to maintain the current replication and add in some replication between DC's ... ? I'll go re-read your posts to see if you've already tried this. I vaguely remember Ellis saying it's not a good idea to switch NetworkTopologyStrategy ... On Wed, Apr 27, 2011 at 3:29 PM, William Oberman ober...@civicscience.com wrote: Thanks Sasha. Fortunately/unfortunately I did realize the current default behavior of the Ec2Snitch, but my application isn't multi-region capable (yet), so I need to get intra-region redundancy. And having a SingleRegionEc2Snitch that did DC=ec2zone and RACK=??? would be much better for me (for now). -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: advice for EC2 deployment
Thanks Aaron! Unless no one on this list uses EC2, there were a few minor troubles end of last week through the weekend which taught me a lot about obscure failure modes in various applications I use :-) My original post was trying to be more redundant than fast, which has been my overall goal from even before moving to Cassandra (my downtime from the EC2 madness was minimal, and due to only having one single point of failure == the amazon load balancer). My secondary goal was trying to make moving to a second region easier, but if that is causing problems I can drop the idea. I might be downplaying the cost of inter-AZ communication, but I've lived with that for quite some time, for example my current setup of MySQL in Master-Master replication is split over zones, and my webservers live in yet different zones. Maybe Cassandra is chattier than I'm used to? (again, I'm fairly new to cassandra) Based on that article, the discussion, and the recent EC2 issues, it sounds like it would be better to start with: -6 nodes split in two AZs 3/3 -Configure replication to do 2 in one AZ and one in the other (NetworkTopology treats AZs as racks, so does RF=3,us-east=3 make this happen naturally?) -What does LOCAL_QUORUM do in this case? Is there a rack quorum? Or do the natural latencies of AZs make LOCAL_QUORUM behave like a rack quorum? will On Tue, Apr 26, 2011 at 1:14 AM, aaron morton aa...@thelastpickle.com wrote: For background see this article: http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers And this recent discussion: http://www.mail-archive.com/user@cassandra.apache.org/msg12502.html Issues that may be a concern: - lots of cross AZ latency in us-east, e.g. LOCAL_QUORUM ops must wait cross AZ. Also consider it during maintenance tasks, how much of a pain is it going to be to have latency between every node. - IMHO not having sufficient (by that I mean 3) replicas in a cassandra DC to handle a single node failure when working at Quorum reduces the utility of the DC. e.g. with a local RF of 2 in the west, the quorum is 2, and if you lose one node from the replica set you will not be able to use local QUORUM for keys in that range. Or consider a failure mode where the west is disconnected from the east. Could you start simple with 3 replicas in one AZ in us-east and 3 replicas in an AZ+Region? Then work through some failure scenarios. Hope that helps. Aaron On 22 Apr 2011, at 03:28, William Oberman wrote: Hi, My service is not yet ready to be fully multi-DC, due to how some of my legacy MySQL stuff works. But, I wanted to get cassandra going ASAP and work towards multi-DC. I have two main cassandra use cases: one where I can handle eventual consistency (and all of the writes/reads are currently ONE), and one where I can't (writes/reads are currently QUORUM). My test cluster is currently 4 smalls all in us-east with RF=3 (more to prove I can do clustering, than to have an exact production replica). All of my unit tests, and load tests (again, not to prove true max load, but more to tease out concurrency issues) are passing now. 
For production, I was thinking of doing: -4 cassandra larges in us-east (where I am now), one in each AZ -1 cassandra large in us-west (where I have nothing) For now, my data can fit into a single large's 2 disk ephemeral using RAID0, and I was then thinking of doing a RF=3 with us-east=2 and us-west=1. If I do eventual consistency at ONE, and consistency at LOCAL_QUORUM, I was hoping: -eventual consistency ops would be really fast -consistent ops would be pretty fast (what does LOCAL_QUORUM do in this case? return after 1 or 2 us-east nodes ack?) -us-west would contain a complete copy of my data, so it's a good eventually consistent close to real time backup (assuming it can keep up over long periods of time, but I think it should) -eventually, when I'm ready to roll out in us-west I'll be able to change the replication settings and that server in us-west could help seed new cassandra instances faster than the ones in us-east Or am I missing something really fundamental about how cassandra works, making this a terrible plan? I should have plenty of time to get my multi-DC working before the instance in us-west fills up (but even then, I should be able to add instances over there to stall that fairly trivially, right?). Thanks! will -- Will Oberman Civic Science, Inc. 3030 Penn Avenue., First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
practice failure recovery
In my test cluster I managed to jam up a cassandra server. I figure the easy failsafe solution is to just boot a replacement node, but I thought I'd take a minute to either figure out what I did, or try to figure out how to properly recover it before I lose my current state. The symptom = on startup I get an exception: ERROR 11:58:34,567 Exception encountered during startup. java.lang.IndexOutOfBoundsException: 6 at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:121) at org.apache.cassandra.db.marshal.TimeUUIDType.compareTimestampBytes(TimeUUIDType.java:56) at org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:45) at org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:29) at java.util.concurrent.ConcurrentSkipListMap$ComparableUsingComparator.compareTo(ConcurrentSkipListMap.java:606) at java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:685) at java.util.concurrent.ConcurrentSkipListMap.doPut(ConcurrentSkipListMap.java:864) at java.util.concurrent.ConcurrentSkipListMap.putIfAbsent(ConcurrentSkipListMap.java:1893) at org.apache.cassandra.db.ColumnFamily.addColumn(ColumnFamily.java:216) at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:130) at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120) at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:380) at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:253) at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:156) at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:173) at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314) at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79) Where things went wrong = I had been doing various testing and unit testing, as this is my proof of concept cluster. The unit tests in particular work by cloning a keyspace as keyspace_UUID (to get a blank slate). Because of various bugs in my code and configuration, this left a fair number of crud keyspaces by the time I got everything to pass. So, I wrote a script to drop all of the test keyspaces (the script had worked on a single node environment, which was my first step before the cluster). I think the CLI doesn't wait for schema propagation, so the script confused the node I was talking to, as after it ran the schema UUIDs of that node vs. the rest of the cluster didn't agree (describe cluster in the CLI). And, it wasn't fixing itself. nodetool loadbalance said it would do a decommission/bootstrap, which I thought might give the bad node a kick in the pants, so I tried it. Afterwards, I ran nodetool ring against all nodes and the problem node claimed all was UP, but every other node listed the problem node as ? and the rest as UP (sadly, I either didn't check or can't remember what nodetool ring said before loadbalance). So, I shut down the problem node. But, when I tried to restart it, I got the error you see above. Not sure what was the worst/dumbest thing I did, but it's definitely unhappy now!
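Since the exception above is thrown during commit log replay, one less-than-nuclear option (not tried in this thread, so treat it as an untested sketch with assumed paths) is to quarantine the commit log segments and accept losing any unflushed writes:

/etc/init.d/cassandra stop            # init script name assumed
mkdir -p /tmp/commitlog-quarantine
mv /var/lib/cassandra/commitlog/* /tmp/commitlog-quarantine/
/etc/init.d/cassandra start           # replay is skipped; anything not yet flushed to sstables is lost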
Re: advice for EC2 deployment
I see what you're saying. I was able to control write latency on mysql using insert vs insert delayed (what I feel is MySQL's poor man's eventual consistency option) + the fact that replication was a background asynchronous process. In terms of read latency, I was able to do up to a few hundred well indexed mysql queries (across AZs) on a view while keeping the overall latency of the page around or less than a second. I basically am replacing two use cases, the cases with difficult to scale anticipated write volumes. The first case was previously using insert delayed (which I'm doing in cassandra as ONE) as I wasn't getting consistent write/read operations before anyways. The second case was using traditional insert (which I was going to replace with some QUORUM-like level, I was assuming LOCAL_QUORUM). But, the latter case uses a write through memory cache (memcache), so I don't know how often it really reads data from the persistent store. But I definitely need to make sure it is consistent. In any case, it sounds like I'd be best served treating AZs as DCs, but then I don't know what to make the racks? Or do racks not matter in a single AZ? That way I can get an ack from a LOCAL_QUORUM read/write before the (slightly) slower read/write to/from the other AZ (for redundancy). Then I'm only screwed if Amazon has a multi-AZ failure (so far, they've kept it to only one!) :-) will On Tue, Apr 26, 2011 at 5:01 PM, aaron morton aa...@thelastpickle.com wrote: One difference between Cassandra and MySQL replication may be when the network IO happens. Was the MySQL replication synchronous on transaction commit? I was only aware that it had async replication, which means the client is not exposed to the network latency. In cassandra the network latency is exposed to the client as it needs to wait for the CL number of nodes to respond. If you use the PropertyFileSnitch with the NetworkTopologyStrategy you can manually assign machines to racks / dc's based on IP. See the conf/cassandra-topology.property file. There is also an Ec2Snitch which (from the code) /** * A snitch that assumes an EC2 region is a DC and an EC2 availability_zone * is a rack. This information is available in the config for the node. Recent discussion on DC aware CL levels: http://www.mail-archive.com/user@cassandra.apache.org/msg11414.html Hope that helps. Aaron On 27 Apr 2011, at 01:18, William Oberman wrote: Thanks Aaron! Unless no one on this list uses EC2, there were a few minor troubles end of last week through the weekend which taught me a lot about obscure failure modes in various applications I use :-) My original post was trying to be more redundant than fast, which has been my overall goal from even before moving to Cassandra (my downtime from the EC2 madness was minimal, and due to only having one single point of failure == the amazon load balancer). My secondary goal was trying to make moving to a second region easier, but if that is causing problems I can drop the idea. I might be downplaying the cost of inter-AZ communication, but I've lived with that for quite some time, for example my current setup of MySQL in Master-Master replication is split over zones, and my webservers live in yet different zones. Maybe Cassandra is chattier than I'm used to? 
(again, I'm fairly new to cassandra) Based on that article, the discussion, and the recent EC2 issues, it sounds like it would be better to start with: -6 nodes split in two AZs 3/3 -Configure replication to do 2 in one AZ and one in the other (NetworkTopology treats AZs as racks, so does RF=3,us-east=3 make this happen naturally?) -What does LOCAL_QUORUM do in this case? Is there a rack quorum? Or do the natural latencies of AZs make LOCAL_QUORUM behave like a rack quorum? will On Tue, Apr 26, 2011 at 1:14 AM, aaron morton aa...@thelastpickle.com wrote: For background see this article: http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers And this recent discussion: http://www.mail-archive.com/user@cassandra.apache.org/msg12502.html Issues that may be a concern: - lots of cross AZ latency in us-east, e.g. LOCAL_QUORUM ops must wait cross AZ. Also consider it during maintenance tasks, how much of a pain is it going to be to have latency between every node. - IMHO not having sufficient (by that I mean 3) replicas in a cassandra DC to handle a single node failure when working at Quorum reduces the utility of the DC. e.g. with a local RF of 2 in the west, the quorum is 2, and if you lose one node from the replica set you will not be able to use local QUORUM for keys in that range. Or consider a failure mode where the west is disconnected from the east. Could you
Re: practice failure recovery
Done and done. I'm really loving how easy the nuclear option has been (it was what I tested first). will On Tue, Apr 26, 2011 at 5:09 PM, aaron morton aa...@thelastpickle.com wrote: In 0.7.X the cli waits for the schema to agree before returning; you should see "Waiting for schema agreement... ... schemas agree across the cluster" Or, if things fail: "The schema has not settled in %d seconds; further migrations are ill-advised until it does.%nVersions are %s%n" WRT the error, my first guess is something in the schema has changed and it's upsetting the log replay. Given all the crazy, I'd go with the nuclear option. Aaron On 27 Apr 2011, at 07:11, William Oberman wrote: In my test cluster I managed to jam up a cassandra server. I figure the easy failsafe solution is to just boot a replacement node, but I thought I'd spend a minute either figuring out what I did, or figuring out how to properly recover it before I lose my current state. The symptom = on startup I get an exception:
ERROR 11:58:34,567 Exception encountered during startup.
java.lang.IndexOutOfBoundsException: 6
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:121)
at org.apache.cassandra.db.marshal.TimeUUIDType.compareTimestampBytes(TimeUUIDType.java:56)
at org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:45)
at org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:29)
at java.util.concurrent.ConcurrentSkipListMap$ComparableUsingComparator.compareTo(ConcurrentSkipListMap.java:606)
at java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:685)
at java.util.concurrent.ConcurrentSkipListMap.doPut(ConcurrentSkipListMap.java:864)
at java.util.concurrent.ConcurrentSkipListMap.putIfAbsent(ConcurrentSkipListMap.java:1893)
at org.apache.cassandra.db.ColumnFamily.addColumn(ColumnFamily.java:216)
at org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:130)
at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
at org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:380)
at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:253)
at org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:156)
at org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:173)
at org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314)
at org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79)
Where things went wrong = I had been doing various testing and unit testing, as this is my proof-of-concept cluster. The unit tests in particular work by cloning a keyspace as keyspace_UUID (to get a blank slate). Because of various bugs in my code and configuration, this left a fair number of crud keyspaces by the time I got everything to pass. So, I wrote a script to drop all of the test keyspaces (the script had worked in a single-node environment, which was my first step before the cluster). I think the CLI doesn't wait for schema propagation, so the script confused the node I was talking to: after it ran, the schema UUIDs of that node vs. the rest of the cluster didn't agree (describe cluster in the CLI). And it wasn't fixing itself. nodetool loadbalance said it would do a decommission/bootstrap, which I thought might give the bad node a kick in the pants, so I tried it.
Afterwards, I ran nodetool ring against all nodes: the problem node claimed everything was UP, but every other node listed the problem node as ? and the rest as UP (sadly, I either didn't check or can't remember what nodetool ring said before the loadbalance). So I shut down the problem node. But when I tried to restart it, I got the error you see above. Not sure which was the worst/dumbest thing I did, but it's definitely unhappy now! -- Will Oberman Civic Science, Inc. 3030 Penn Avenue, First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
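For what it's worth, the check that would have caught this mid-script is the one mentioned above ("describe cluster" in the CLI). Roughly, and with hypothetical IPs and version UUIDs (the output below is approximate, not copied from a real session), a disagreeing cluster looks something like:

    [default@unknown] describe cluster;
    Cluster Information:
       Snitch: org.apache.cassandra.locator.Ec2Snitch
       Partitioner: org.apache.cassandra.dht.RandomPartitioner
       Schema versions:
            75eece10-bf48-11e0-0000-4d205df954a7: [10.0.1.10, 10.0.1.11, 10.0.2.10]
            5a54e671-bf48-11e0-0000-7fd66bb03abf: [10.0.2.11]

More than one version UUID means the schema hasn't converged; a drop/create script that fires its next statement before this collapses back to a single version is exactly how a node ends up wedged like the one above.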
Re: Ec2Snitch + NetworkTopologyStrategy if only in one region?
Also, for the new users like me: don't assume DC1 is a keyword like I did. A working example of a keyspace for a single-DC EC2 deployment is: create keyspace test with replication_factor=3 and strategy_options = [{us-east:3}] and placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy'; I felt silly afterwards, but I couldn't find official docs on the structure of strategy_options anywhere. will On Wed, Apr 13, 2011 at 5:14 PM, William Oberman ober...@civicscience.com wrote: [the "One last coda" message and the rest of the thread, quoted in full below] -- Will Oberman Civic Science, Inc. 3030 Penn Avenue, First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
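Since the shape of strategy_options is the undocumented part, here is a rough programmatic equivalent of that CLI command against the 0.7-era Thrift API. This is a sketch, not gospel: it assumes an already-connected Cassandra.Client, and whether KsDef also demands a top-level replication_factor depends on the exact 0.7 IDL revision.

    import java.util.Collections;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.CfDef;
    import org.apache.cassandra.thrift.KsDef;

    public class CreateEc2Keyspace
    {
        // Sketch: the Thrift equivalent of the CLI "create keyspace test" above.
        static void create(Cassandra.Client client) throws Exception
        {
            // Under Ec2Snitch the datacenter name is the EC2 region, hence "us-east".
            Map<String, String> options = new HashMap<String, String>();
            options.put("us-east", "3");

            KsDef ks = new KsDef();
            ks.setName("test");
            ks.setStrategy_class("org.apache.cassandra.locator.NetworkTopologyStrategy");
            ks.setStrategy_options(options);
            ks.setCf_defs(Collections.<CfDef>emptyList());

            // Returns the new schema version UUID (as a string).
            System.out.println("schema version: " + client.system_add_keyspace(ks));
        }
    }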
system_* consistency level?
Hi, My unit tests started failing once I upgraded from a single-node cassandra cluster to a full N-node cluster (I'm starting with 4). I had a few bugs, mostly due to forgetting to read/write at a quorum level in places where I needed stronger consistency guarantees. But I kept getting random, intermittent failures (the worst kind). I'm 99% sure I see why, after some painful debugging, but I don't know what to do about it. The basic flaw in my understanding of cassandra seems to boil down to: I thought system mutations of keyspaces/column families were of a stronger consistency than ONE, but that appears to not be true. Is there any way for me to update a cluster at something more like QUORUM? The basic idea is: in my unit test setup() I clone my real keyspace as keyspace_UUID (with all of the exact same CFs) to get a fresh space to play in. In a single-node environment, no issues. But in a cluster, it seems that it takes a while for the system_add_keyspace call to propagate. No worries, I think; I just modify my setup() to do describe_keyspace(keyspace_UUID) in a while loop until the cluster is ready. My random failures drop considerably, but every once in a while I see a similar kind of failure. Then I find out that schema updates seem to propagate on a per-node basis. At least, that's what I have to assume, as I'm using phpcassa, which uses a connection pool, and I see in my logging that my setup() succeeds because one connection in the pool sees the new keyspace, but when my tests run I grab a connection from the pool that is missing it! Do I have a solution other than changing my setup yet again to loop over all cassandra servers doing a describe_keyspace()? -- Will Oberman Civic Science, Inc. 3030 Penn Avenue, First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: system_* consistency level?
That was the trick. Thanks! On Apr 20, 2011, at 6:05 PM, Jonathan Ellis jbel...@gmail.com wrote: See the comments for describe_schema_versions. On Wed, Apr 20, 2011 at 4:59 PM, William Oberman ober...@civicscience.com wrote: [original question quoted in full above] -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
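For readers who land here later, the comments Jonathan refers to amount to: poll describe_schema_versions until the reachable nodes report a single version. A sketch of that loop against the Thrift API (the handling of the UNREACHABLE bucket and the timeout value are my own choices, not prescribed anywhere):

    import java.util.HashSet;
    import java.util.List;
    import java.util.Map;
    import java.util.Set;
    import org.apache.cassandra.thrift.Cassandra;

    public class SchemaAgreement
    {
        // Sketch: block until all reachable nodes report one schema version.
        static void waitForAgreement(Cassandra.Client client, long timeoutMs) throws Exception
        {
            long deadline = System.currentTimeMillis() + timeoutMs;
            while (System.currentTimeMillis() < deadline)
            {
                // Maps schema version UUID -> list of endpoints reporting it.
                Map<String, List<String>> versions = client.describe_schema_versions();
                Set<String> live = new HashSet<String>(versions.keySet());
                live.remove("UNREACHABLE"); // bucket for nodes that didn't answer
                if (live.size() == 1)
                    return; // agreement among reachable nodes
                Thread.sleep(200);
            }
            throw new IllegalStateException("schema did not settle within " + timeoutMs + " ms");
        }
    }

Calling this after each system_* mutation (instead of polling describe_keyspace per node) sidesteps the connection-pool roulette described above.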
Re: Cassandra 0.7.4 and LOCAL_QUORUM Consistency level
I had a similar error today when I tried using LOCAL_QUORUM without having a properly configured NetworkTopologyStrategy. QUORUM worked fine, however. will On Tue, Apr 19, 2011 at 8:52 PM, Oleg Tsvinev oleg.tsvi...@gmail.com wrote: Earlier I've posted the same message to the hector-users list. Guys, I'm a bit puzzled today. I'm using the just-released Hector 0.7.0-29 (thank you, Nate!) and Cassandra 0.7.4, and getting the exception below, marked as (1) Exception. When I dig into the Cassandra source code below, marked as (2) Cassandra source, I see that there's no check for LOCAL_QUORUM. I also see that (3) cassandra.thrift defines LOCAL_QUORUM as enum value 3, and in the debugger I see that LOCAL_QUORUM is a valid enum value. My question is: how is it possible to use LOCAL_QUORUM if the Cassandra code throws an exception when it sees it? Is it a bad merge or something? I know it worked before, from looking at https://issues.apache.org/jira/browse/CASSANDRA-2254 :-\
(1) Exception:
2011-04-19 14:57:33,550 [pool-2-thread-49] ERROR org.apache.cassandra.thrift.Cassandra$Processor - Internal error processing batch_mutate
java.lang.UnsupportedOperationException: invalid consistency level: LOCAL_QUORUM
at org.apache.cassandra.service.WriteResponseHandler.determineBlockFor(WriteResponseHandler.java:99)
at org.apache.cassandra.service.WriteResponseHandler.<init>(WriteResponseHandler.java:48)
at org.apache.cassandra.service.WriteResponseHandler.create(WriteResponseHandler.java:61)
at org.apache.cassandra.locator.AbstractReplicationStrategy.getWriteResponseHandler(AbstractReplicationStrategy.java:127)
at org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:115)
at org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:415)
at org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:388)
at org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3036)
at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:206)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
at java.lang.Thread.run(Thread.java:662)
(2) Cassandra source (line 99 is the throw statement):
protected int determineBlockFor(String table)
{
    int blockFor = 0;
    switch (consistencyLevel)
    {
        case ONE: blockFor = 1; break;
        case ANY: blockFor = 1; break;
        case TWO: blockFor = 2; break;
        case THREE: blockFor = 3; break;
        case QUORUM: blockFor = (writeEndpoints.size() / 2) + 1; break;
        case ALL: blockFor = writeEndpoints.size(); break;
        default: throw new UnsupportedOperationException("invalid consistency level: " + consistencyLevel.toString());
    }
    // at most one node per range can bootstrap at a time, and these will be added to the write until
    // bootstrap finishes (at which point we no longer need to write to the old ones).
    assert 1 <= blockFor && blockFor <= 2 * Table.open(table).getReplicationStrategy().getReplicationFactor()
        : String.format("invalid response count %d for replication factor %d",
                        blockFor, Table.open(table).getReplicationStrategy().getReplicationFactor());
    return blockFor;
}
(3) cassandra.thrift:
enum ConsistencyLevel {
    ONE = 1,
    QUORUM = 2,
    LOCAL_QUORUM = 3,
    EACH_QUORUM = 4,
    ALL = 5,
    ANY = 6,
    TWO = 7,
    THREE = 8,
}
-- Will Oberman Civic Science, Inc. 3030 Penn Avenue, First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
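To spell out the resolution hinted at in Will's reply at the top of this message: it isn't a bad merge. In 0.7 the DC-aware levels are only serviced by the response handlers that NetworkTopologyStrategy supplies; the plain WriteResponseHandler in the stack trace serves every other strategy, and its switch (quoted above) has no LOCAL_QUORUM case, so it falls into the default throw. A paraphrase of the idea, not the literal 0.7.4 source:

    public class LocalQuorumNote
    {
        // Paraphrase of the dispatch, not the literal 0.7.4 source:
        // AbstractReplicationStrategy creates the plain WriteResponseHandler
        // (whose switch is quoted above), while NetworkTopologyStrategy
        // overrides that factory with a datacenter-aware handler. That
        // handler's blockFor counts only replicas in the coordinator's DC:
        static int localQuorumBlockFor(int replicasInLocalDc)
        {
            return (replicasInLocalDc / 2) + 1; // e.g. 3 local replicas -> wait for 2
        }

        public static void main(String[] args)
        {
            System.out.println(localQuorumBlockFor(3)); // prints 2
        }
    }

So configuring NTS doesn't just silence the exception; it swaps in the code path that knows what "local" means.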
Re: Cassandra 0.7.4 and LOCAL_QUORUM Consistency level
Good point, I should have read your message (and the code) more closely! Sent from my iPhone On Apr 19, 2011, at 9:16 PM, Oleg Tsvinev oleg.tsvi...@gmail.com wrote: I'm puzzled because the code does not even check for LOCAL_QUORUM before throwing the exception. Indeed, I did not configure NetworkTopologyStrategy. Are you saying that it works after configuring it? On Tue, Apr 19, 2011 at 6:04 PM, William Oberman ober...@civicscience.com wrote: I had a similar error today when I tried using LOCAL_QUORUM without having a properly configured NetworkTopologyStrategy. QUORUM worked fine, however. will On Tue, Apr 19, 2011 at 8:52 PM, Oleg Tsvinev oleg.tsvi...@gmail.com wrote: [original message, exception, source, and thrift enum quoted in full above] -- Will Oberman Civic Science, Inc. 3030 Penn Avenue, First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
Re: Ec2Snitch + NetworkTopologyStrategy if only in one region?
One last coda, for other noobs to cassandra like me. If you use NetworkTopologyStrategy with replication_factor > 1, make sure you have EC2 instances in multiple availability zones. I was doing baby steps, and tried doing a cluster in one AZ (before spreading to multiple AZs) and was getting the most baffling errors (cassandra_UnavailableException). I finally thought to check the cassandra server logs (after painstakingly debugging the client code, firewalls, etc. for connectivity problems), and it turns out my cassandra cluster was considering itself unavailable as it couldn't replicate as much as it wanted to. I kind of wish a different word than unavailable was chosen for this error condition :-) will On Tue, Apr 12, 2011 at 10:37 PM, aaron morton aa...@thelastpickle.com wrote: If you can use standard + encoded I would go with that. Aaron On 13 Apr 2011, at 07:07, William Oberman wrote: Excellent to know! (and yes, I figure I'll expand someday, so I'm glad I found this out before digging a hole). The other issue I've been pondering is a normal column family of encoded objects (in my case JSON) vs. a super column. Based on my use case, things I've read, etc., right now I'm coming down on normal + encoded. will On Tue, Apr 12, 2011 at 2:57 PM, Jonathan Ellis jbel...@gmail.com wrote: NTS is overkill in the sense that it doesn't really benefit you in a single DC, but if you think you may expand to another DC in the future it's much simpler if you were already using NTS than first migrating to NTS (changing strategy is painful). I can't think of any downsides to using NTS in a single-DC environment, so that's the safe option. On Tue, Apr 12, 2011 at 1:15 PM, William Oberman ober...@civicscience.com wrote: [original question quoted in full below] -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com -- Will Oberman Civic Science, Inc. 3030 Penn Avenue, First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
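One more note on the "unavailable" naming, since it bites people: the exception is thrown by the coordinator before any write is attempted, when it can already see that fewer live (or, as with the DC1-vs-us-east mix-up above, even defined) replicas exist than the consistency level requires. A sketch of what that looks like from the client side (Thrift API; building the mutation map itself is left to whatever your client normally does):

    import java.nio.ByteBuffer;
    import java.util.List;
    import java.util.Map;
    import org.apache.cassandra.thrift.Cassandra;
    import org.apache.cassandra.thrift.ConsistencyLevel;
    import org.apache.cassandra.thrift.Mutation;
    import org.apache.cassandra.thrift.UnavailableException;

    public class WriteWithDiagnostics
    {
        // Sketch: distinguish "cluster says no up front" from a timeout.
        static void writeAtQuorum(Cassandra.Client client,
                                  Map<ByteBuffer, Map<String, List<Mutation>>> mutations)
                throws Exception
        {
            try
            {
                client.batch_mutate(mutations, ConsistencyLevel.QUORUM);
            }
            catch (UnavailableException e)
            {
                // Fail-fast: live replicas < required count for this CL. Check that
                // RF <= number of live nodes, and that strategy_options DC names
                // match what the snitch reports (Ec2Snitch => region, e.g. us-east).
                throw e;
            }
        }
    }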
Ec2Snitch + NetworkTopologyStrategy if only in one region?
Hi, I'm getting closer to committing to cassandra, and now I'm into system/IT issues and questions. I'm in the amazon EC2 cloud. I previously used this forum to discover the best practice for disk layouts (large instance + the two ephemeral disks in RAID0 for data + root volume for everything else). Now I'm hoping to confirm bits and pieces of things I've read about snitch/replication strategies. I was thinking of using endpoint_snitch: org.apache.cassandra.locator.Ec2Snitch and placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy' (for people hitting this from the mailing list or google, I feel obligated to note that the former setting is in cassandra.yaml, and the latter is an option on a keyspace). But I'm only in one region. Is using the amazon snitch/network topology overkill given everything I have is in one DC (I believe region==DC and availability_zone==rack)? I'm using multiple availability zones for some level of redundancy; I'm just not yet at the point of using multiple regions. If someday I move to using multiple regions, would that change the answer? Thanks! -- Will Oberman Civic Science, Inc. 3030 Penn Avenue, First Floor Pittsburgh, PA 15201 (M) 412-480-7835 (E) ober...@civicscience.com
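And on the closing question (whether moving to multiple regions changes the answer): schema-wise it shouldn't, which is exactly why NTS is the safe starting point. With Ec2Snitch + NTS already in place, a second region is just another entry in strategy_options. A hedged example in the same CLI syntax as the keyspace above (us-west as a hypothetical second region; the exact option syntax varies a little across 0.7.x releases):

    create keyspace test
        with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
        and strategy_options = [{us-east:3, us-west:3}];

Note the caveat from elsewhere in these threads, though: cross-region clusters later grew a dedicated Ec2MultiRegionSnitch to deal with public vs. private IPs, so treat the yaml and networking side as the part that needs homework; the keyspace definition itself stays this simple.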