Re: Question about EC2 and SSDs

2014-09-05 Thread William Oberman
Theory aside, I switched from a RAID of ephemerals for data (with the root
volume for the write log) to a single EBS-based SSD, without any noticeable
impact on performance.

will

On Thu, Sep 4, 2014 at 9:35 PM, Steve Robenalt sroben...@highwire.org
wrote:

 Yes, I am aware there are no heads on an SSD. I have also seen plenty of
 examples where compatibility issues force awkward engineering tradeoffs,
 even as technology advances, so I am jaded enough to be wary of making
 assumptions, which is why I asked the question.

 Steve
 On Sep 4, 2014 5:50 PM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, Sep 4, 2014 at 5:44 PM, Steve Robenalt sroben...@highwire.org
 wrote:

 Thanks Robert! I am assuming that you meant that it's possible with a
 single SSD, right?


 Yes, no matter how many SSDs you have, you are unlikely to be able to
 convince one of them to physically seek a drive head across its platter,
 because they don't have heads or platters.

 =Rob




Re: VPC AWS

2014-06-05 Thread William Oberman
I don't think traffic will flow between classic EC2 and a VPC directly.
There is some kind of gateway/bridge instance that sits between them, acting
as a NAT.  I would think that would cause new challenges for:
-transitions
-clients

Sorry this response isn't heavy on content!  I'm curious how this thread
goes...

Will

On Thursday, June 5, 2014, Alain RODRIGUEZ arodr...@gmail.com wrote:

 Hi guys,

 We are going to move from a cluster made of simple Amazon EC2 servers to a
 VPC cluster. We are using Cassandra 1.2.11 and I have some questions
 regarding this switch and the Cassandra configuration inside a VPC.

 Actually I found no documentation on this topic, but I am quite sure that
 some people are already using VPC. If you can point me to any documentation
 regarding VPC / Cassandra, it would be very nice of you. We have only one
 DC for now, but we need to remain multi DC compatible, since we will add DC
 very soon.

 Also, I would like to know whether I should keep using EC2MultiRegionSnitch
 or change to a different snitch.

 What about broadcast/listen ip, seeds...?

 We currently use public IPs for the broadcast address and for the seeds, and
 private ones for the listen address. Machines inside the VPC will only have
 private IPs AFAIK. Should I keep using a broadcast address?

  Is there any other impact when switching to a VPC?

 Sorry if the topic was already discussed, I was unable to find any useful
 information...



-- 
Will Oberman
Civic Science, Inc.
6101 Penn Avenue, Fifth Floor
Pittsburgh, PA 15206
(M) 412-480-7835
(E) ober...@civicscience.com


alternative vnode upgrade strategy?

2014-05-28 Thread William Oberman
I'm concerned about the bad reports of using shuffle to do a vnode upgrade
(and I did a smoke test trying shuffle on a test cluster, and ran into
out-of-disk-space issues).

I then started to plan out the dual DC upgrade path, but I wonder if this
option is easier:

Starting point: N node cluster, no vnodes.

1.) Upgrade all N nodes to vnodes in place
2.) Boot N new nodes and let them bootstrap
3.) Decommission the N old nodes

And/Or

1.) Upgrade all N nodes to vnodes in place
Start loop
2.) Boot a new node and let it bootstrap
3.) Decommission an old node
End loop

The only problem I can think of is unlike dual DC, there is no easy
rollback strategy.  Though, at this point, vnodes have been in the wild for
quite some time, and I kind of want/need to move over to vnodes, so it's
vnodes or bust so to speak :-)

Am I missing a terrible gotcha (other than the rollback issue)?  I did a
smoke test of the 2nd scenario (one node in/out at a time) and didn't see
any issues.

will


Re: NTS, vnodes and 0% chance of data loss

2014-05-14 Thread William Oberman
After sleeping on this, I'm sure my original conclusions are wrong.  In all
of the referenced cases/threads, I internalized "rack awareness" and
"hotspots" to mean something different and wrong.  A hotspot didn't mean
multiple replicas in the same rack (as I had been thinking); it meant the
process of finding replica placement might hit the same vnode
disproportionately often due to the random association of vnodes -> {dc, rack}.

To not lead people astray, I think everything in my email below is correct
until: "Which means a rack failure (3 nodes) has a non-zero chance of data
failure (right?)".  And again, my flaw was thinking that when Cassandra
selected replicas for token X in a vnode world, it would possibly
pick vnodes that happened to be on the same rack due to random placement
of the tokens.  That is wrong (looking at the source for NTS): NTS does
skip over the same rack (though it will allow multiple replicas in the same
rack if you run out of racks... I guess if someone did DC:4 with 3 racks
they'll always get one rack with two copies of the data, for example).
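
(For anyone trying to internalize that, here's a rough Python sketch of the
placement idea.  It's only an illustration of the behavior, not the actual
NetworkTopologyStrategy code, and the ring layout is made up:)

# Walk the ring from the key's token, preferring nodes in racks that don't
# hold a replica yet; only reuse a rack once every rack has been used.
def pick_replicas(ring, key_token, rf):
    # ring: list of (token, node, rack) tuples sorted by token
    start = next((i for i, (t, _, _) in enumerate(ring) if t >= key_token), 0)
    replicas, seen_nodes, seen_racks, skipped = [], set(), set(), []
    for i in range(len(ring)):
        _, node, rack = ring[(start + i) % len(ring)]
        if node in seen_nodes:
            continue
        if rack in seen_racks:
            skipped.append(node)        # remember in case we run out of racks
            continue
        replicas.append(node); seen_nodes.add(node); seen_racks.add(rack)
        if len(replicas) == rf:
            return replicas
    for node in skipped:                # e.g. DC:4 with only 3 racks
        if node not in seen_nodes:
            replicas.append(node); seen_nodes.add(node)
        if len(replicas) == rf:
            break
    return replicas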

will

On Tue, May 13, 2014 at 1:41 PM, William Oberman
ober...@civicscience.comwrote:

 I found this:

 http://mail-archives.apache.org/mod_mbox/cassandra-user/201404.mbox/%3ccaeduwd1erq-1m-kfj6ubzsbeser8dwh+g-kgdpstnbgqsqc...@mail.gmail.com%3E

 I read the three referenced cases.  In addition, case 4123 references:
 http://www.mail-archive.com/dev@cassandra.apache.org/msg03844.html

 And even though I *think* I understand all of the issues now, I still want
 to double check...

 Assumptions:
 -A cluster using NTS with options [DC:3]
 -Physical layout = In DC, 3 nodes/rack for a total of 9 nodes

 No vnodes: I could do token selection using ideas from case 3810 such that
 each rack has one replica.  At this point, my 0% chance of data loss
 scenarios are:
 1.) Failure of two nodes at random
 2.) Failure of 2 racks (6 nodes!)

 Vnodes: my 0% chance of data loss scenarios are:
 1.) Failure of two nodes at random
 Which means a rack failure (3 nodes) has a non-zero chance of data failure
 (right?).

 To get specific, I'm in AWS, so racks ~= availability zones.  In the
 years I've been in AWS, I've seen several occasions of single zone
 downtimes, and one time of single zone catastrophic loss.  E.g. for AWS
 I feel like you *have* to plan for a single zone failure, and in terms of
 safety first you *should* plan for two zone failures.

 To mitigate this data loss risk seems rough for vnodes, again if I'm
 understanding everything correctly:
 -To ensure 0% data loss for one zone = I need RF=4
 -To ensure 0% data loss for two zones = I need RF=7

 I'd really like to use vnodes, but RF=7 is crazy.

 To reiterate what I think is the core idea of this message:
 1.) for vnodes 0% data loss = RF=(# of allowed failures at once)+1
 2.) racks don't change the above equation at all
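
(That's the arithmetic I was doing; a quick Python rendering of it below.  Per
the correction at the top of this reply, NTS's rack-aware placement means you
don't actually need these RF values, but this is where the RF=4 / RF=7
numbers came from, assuming 3 nodes per rack/zone:)

nodes_per_zone = 3
for zones_lost in (1, 2):
    failures = zones_lost * nodes_per_zone   # 3 or 6 nodes down at once
    rf_needed = failures + 1                 # at least one replica must survive
    print(zones_lost, "zone(s) lost ->", failures, "nodes ->", "RF =", rf_needed)
# 1 zone(s) lost -> 3 nodes -> RF = 4
# 2 zone(s) lost -> 6 nodes -> RF = 7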

 will



Re: clearing tombstones?

2014-05-11 Thread William Oberman
Not an expert, just a user of cassandra. For me, "before" was a CF with a
set of files (I forget the official naming system, so I'll make up my own):
A0
A1
...
AN

During:
A0
A1
...
AN
B0

Where B0 is the union of the Ai. Due to tombstones, mutations, etc., B0 is at
most the size of the Ai combined (so disk usage during is at most 2x, and
probably close to 2x, unless you are all tombstones like me).

After:
B0

since cassandra can clean up the Ai.  Not sure exactly when that happens.

Not sure what state you are in above.  Sounds like between "during" and
"after".
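
(If it helps, the same shape in a tiny Python sketch; the numbers are made up,
the ~2x peak is the only point:)

inputs = [120, 80, 40, 10]            # sizes (GB) of A0..AN before compaction
survivors_fraction = 0.1              # whatever isn't tombstones/overwrites

before = sum(inputs)                  # 250
b0 = before * survivors_fraction      # the merged SSTable B0: 25
during = before + b0                  # old files + new file coexist: 275
after = b0                            # once the Ai are cleaned up: 25
print(before, during, after)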

Will

On Thursday, May 8, 2014, Ruchir Jha ruchir@gmail.com wrote:

 I tried to do this, however the doubling in disk space is not temporary
 as you state in your note. What am I missing?


 On Fri, Apr 11, 2014 at 10:44 AM, William Oberman
 ober...@civicscience.com wrote:

 So, if I was impatient and just wanted to make this happen now, I could:

 1.) Change GCGraceSeconds of the CF to 0
 2.) run nodetool compact (*)
 3.) Change GCGraceSeconds of the CF back to 10 days

 Since I have ~900M tombstones, even if I miss a few due to impatience, I
 don't care *that* much as I could re-run my clean up tool against the now
 much smaller CF.

 (*) A long long time ago I seem to recall reading advice about don't ever
 run nodetool compact, but I can't remember why.  Is there any bad long
 term consequence?  Short term there are several:
 -a heavy operation
 -temporary 2x disk space
 -one big SSTable afterwards
 But moving forward, everything is ok right?  CommitLog/MemTable->SSTables,
 minor compactions that merge SSTables, etc...  The only flaw I can think of
 is it will take forever until the SSTable minor compactions build up enough
 to consider including the big SSTable in a compaction, making it likely
 I'll have to self manage compactions.



 On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy mark.re...@boxever.com wrote:

 Correct, a tombstone will only be removed after gc_grace period has
 elapsed. The default value is set to 10 days which allows a great deal of
 time for consistency to be achieved prior to deletion. If you are
 operationally confident that you can achieve consistency via anti-entropy
 repairs within a shorter period you can always reduce that 10 day interval.


 Mark


 On Fri, Apr 11, 2014 at 3:16 PM, William Oberman ober...@civicscience.com
  wrote:

 I'm seeing a lot of articles about a dependency between removing
 tombstones and GCGraceSeconds, which might be my problem (I just checked,
 and this CF has GCGraceSeconds of 10 days).


 On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli tbarbu...@gmail.com wrote:

 compaction should take care of it; for me it never worked so I run
 nodetool compaction on every node; that does it.


 2014-04-11 16:05 GMT+02:00 William Oberman ober...@civicscience.com:

 I'm wondering what will clear tombstoned rows?  nodetool cleanup, nodetool
 repair, or time (as in just wait)?

 I had a CF that was more or less storing session information.  After some
 time, we decided that one piece of this information was pointless to track
 (and was 90%+ of the columns, and in 99% of those cases was ALL columns for
 a row).   I wrote a process to remove all of those columns (which again in
 a vast majority of cases had the effect of removing the whole row).

 This CF had ~1 billion rows, so I expect to be left with ~100m rows.
  After I did this mass delete, everything was the same size on disk (which
 I expected, knowing how tombstoning works).  It wasn't 100% clear to me
 what to poke to cause compactions to clear the tombstones.  First I tried
 nodetool cleanup on a candidate node.  But, afterwards the disk usage was
 the same.  Then I tried nodetool repair on that same node.  But again, disk
 usage is still the same.  The CF has no snapshots.

 So, am I misunderstanding something?  Is there another operation to try?
  Do I have to just wait?  I've only done cleanup/re




-- 
Will Oberman
Civic Science, Inc.
6101 Penn Avenue, Fifth Floor
Pittsburgh, PA 15206
(M) 412-480-7835
(E) ober...@civicscience.com


Re: clearing tombstones?

2014-04-14 Thread William Oberman
I'm still somewhat in the middle of the process, but it's far enough along
to report back.

1.) I changed GCGraceSeconds of the CF to 0 using cassandra-cli
2.) I ran nodetool compact on a single node of the nine (I'll call it node
1).  It took 5-7 hours, and reduced the CF from ~450GB to ~75GB (*).
3.) I ran nodetool compact on nodes 2, 3, ... while watching write/read
latency averages in OpsCenter.  I got all of the way to 9 without any ill
effect.
4.) Nodes 2-9 all completed with similar results.

(*) So, I left out one detail that changed the math (I said above I
expected to clear down to at most 50GB).  I found a small bug in my delete
code mid-last week.  Basically, it deleted all of the rows I wanted, but
due to a race condition, there was a chance I'd delete rows in the middle
of doing new inserts.  Luckily, even in this case, it wasn't the end of the
world, but I stopped the cleanup anyways and added a time check (as all of
the rows I wanted to delete were older than 30 days).  I *thought* I'd
restarted the cleanup threads on a smaller dataset due to all of the
deletes, but instead I saw millions & millions of empty rows (the
tombstones).  Thus the start of this "clear the tombstones" subtask to the
original goal, and the reason I didn't see a 90%+ reduction in size.

In any case, now I'm running the cleanup process again, which will be
followed by ANOTHER round of compactions, and then I'll finally turn
GCGraceSeconds back on.

On the read/write production side, you'd never know anything happened.
 Good job on the distributed system! :-)

Thanks again,

will


On Fri, Apr 11, 2014 at 1:02 PM, Mark Reddy mark.re...@boxever.com wrote:

 Thats great Will, if you could update the thread with the actions you
 decide to take and the results that would be great.


 Mark


 On Fri, Apr 11, 2014 at 5:53 PM, William Oberman ober...@civicscience.com
  wrote:

 I've learned a *lot* from this thread.  My thanks to all of the
 contributors!

 Paulo: Good luck with LCS.  I wish I could help there, but all of my CF's
 are SizeTiered (mostly as I'm on the same schema/same settings since 0.7...)

 will



 On Fri, Apr 11, 2014 at 12:14 PM, Mina Naguib mina.nag...@adgear.com wrote:


 Levelled Compaction is a wholly different beast when it comes to
 tombstones.

 The tombstones are inserted, like any other write really, at the lower
 levels in the leveldb hierarchy.

 They are only removed after they have had the chance to naturally
 migrate upwards in the leveldb hierarchy to the highest level in your data
 store.  How long that takes depends on:
  1. The amount of data in your store and the number of levels your LCS
 strategy has
 2. The amount of new writes entering the bottom funnel of your leveldb,
 forcing upwards compaction and combining

 To give you an idea, I had a similar scenario and ran a (slow,
 throttled) delete job on my cluster around December-January.  Here's a
 graph of the disk space usage on one node.  Notice the still-declining
 usage long after the cleanup job has finished (sometime in January).  I
 tend to think of tombstones in LCS as little bombs that get to explode much
 later in time:

 http://mina.naguib.ca/images/tombstones-cassandra-LCS.jpg



 On 2014-04-11, at 11:20 AM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 I have a similar problem here, I deleted about 30% of a very large CF
 using LCS (about 80GB per node), but still my data hasn't shrunk, even though
 I used 1 day for gc_grace_seconds. Would nodetool scrub help? Does nodetool
 scrub force a minor compaction?

 Cheers,

 Paulo


 On Fri, Apr 11, 2014 at 12:12 PM, Mark Reddy mark.re...@boxever.com wrote:

 Yes, running nodetool compact (major compaction) creates one large
 SSTable. This will mess up the heuristics of the SizeTiered strategy (is
 this the compaction strategy you are using?) leading to multiple 'small'
 SSTables alongside the single large SSTable, which results in increased
 read latency. You will incur the operational overhead of having to manage
 compactions if you wish to compact these smaller SSTables. For all these
 reasons it is generally advised to stay away from running compactions
 manually.

 Assuming that this is a production environment and you want to keep
 everything running as smoothly as possible I would reduce the gc_grace on
 the CF, allow automatic minor compactions to kick in and then increase the
 gc_grace once again after the tombstones have been removed.


 On Fri, Apr 11, 2014 at 3:44 PM, William Oberman 
 ober...@civicscience.com wrote:

 So, if I was impatient and just wanted to make this happen now, I
 could:

 1.) Change GCGraceSeconds of the CF to 0
 2.) run nodetool compact (*)
 3.) Change GCGraceSeconds of the CF back to 10 days

 Since I have ~900M tombstones, even if I miss a few due to impatience,
 I don't care *that* much as I could re-run my clean up tool against the 
 now
 much smaller CF.

 (*) A long long time ago I seem to recall reading advice about don't
 ever run

bloom filter + suddenly smaller CF

2014-04-14 Thread William Oberman
I had a thread on this forum about clearing junk from a CF.  In my case,
it's ~90% of ~1 billion rows.

One side effect I had hoped for was a reduction in the size of the bloom
filter.  But, according to nodetool cfstats, it's still fairly large
(~1.5GB of RAM).

Do bloom filters ever resize themselves when the CF suddenly gets smaller?

My next test will be restarting one of the instances, though I'll have to
wait on that operation so I thought I'd ask in the meantime.

will


Re: bloom filter + suddenly smaller CF

2014-04-14 Thread William Oberman
I didn't cross link my thread, but the basic idea is I've done:

1.) Process that deleted ~900M of ~1G rows from a CF
2.) Set GCGraceSeconds to 0 on CF
3.) Run nodetool compact on all N nodes

And I checked: all N nodes have bloom filters using 1.5 +/- 0.2 GB of
RAM (I didn't explicitly write down the "before" numbers, but they seem about
the same).  So, compaction didn't change the BFs (unless cassandra needs
a 2nd compaction to see all of the data cleared by the 1st compaction).
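
(For reference, standard bloom filter sizing math - not Cassandra's exact
code - has the filter size proportional to the key count, so it should only
shrink ~10x once the SSTables are actually rebuilt over the surviving ~100M
keys.  A quick Python sketch, assuming a 1% false-positive chance:)

import math

def bloom_bytes(num_keys, fp_chance):
    bits = -num_keys * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8

for keys in (1_000_000_000, 100_000_000):
    print(keys, "keys ->", round(bloom_bytes(keys, 0.01) / 2**30, 2), "GiB")
# 1000000000 keys -> 1.12 GiB
# 100000000 keys -> 0.11 GiB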

will


On Mon, Apr 14, 2014 at 9:52 AM, Michal Michalski 
michal.michal...@boxever.com wrote:

 Bloom filters are built on creation / rebuild of SSTable. If you removed
 the data, but the old SSTables weren't compacted or you didn't rebuild them
 manually, bloom filters will stay the same size.

 M.

 Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com


 On 14 April 2014 14:44, William Oberman ober...@civicscience.com wrote:

 I had a thread on this forum about clearing junk from a CF.  In my case,
 it's ~90% of ~1 billion rows.

 One side effect I had hoped for was a reduction in the size of the bloom
 filter.  But, according to nodetool cfstats, it's still fairly large
 (~1.5GB of RAM).

 Do bloom filters ever resize themselves when the CF suddenly gets
 smaller?

 My next test will be restarting one of the instances, though I'll have to
 wait on that operation so I thought I'd ask in the meantime.

 will





Re: bloom filter + suddenly smaller CF

2014-04-14 Thread William Oberman
Ah, so I could change the chance value to poke it.  Good to know!


On Mon, Apr 14, 2014 at 10:12 AM, Michal Michalski 
michal.michal...@boxever.com wrote:

 Sorry, I misread the question - I thought you'd also changed the FP chance
 value, not only removed the data.

 Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com


 On 14 April 2014 15:07, Michal Michalski michal.michal...@boxever.com wrote:

 Did you set Bloom Filter's FP chance before or after the step 3) above?
 If you did it before, C* should build Bloom Filters properly. If not -
 that's the reason.

 Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com


 On 14 April 2014 15:04, William Oberman ober...@civicscience.com wrote:

 I didn't cross link my thread, but the basic idea is I've done:

 1.) Process that deleted ~900M of ~1G rows from a CF
 2.) Set GCGraceSeconds to 0 on CF
 3.) Run nodetool compact on all N nodes

 And I checked, and all N nodes have bloom filters using 1.5 +/- .2 GB of
 RAM (I didn't explicitly write down the before numbers, but they seem about
 the same) .  So, compaction didn't change the BF's (unless cassandra needs
 a 2nd compaction to see all of the data cleared by the 1st compaction).

 will


 On Mon, Apr 14, 2014 at 9:52 AM, Michal Michalski 
 michal.michal...@boxever.com wrote:

 Bloom filters are built on creation / rebuild of SSTable. If you
 removed the data, but the old SSTables weren't compacted or you didn't
 rebuild them manually, bloom filters will stay the same size.

 M.

 Kind regards,
 Michał Michalski,
 michal.michal...@boxever.com


 On 14 April 2014 14:44, William Oberman ober...@civicscience.com wrote:

 I had a thread on this forum about clearing junk from a CF.  In my
 case, it's ~90% of ~1 billion rows.

 One side effect I had hoped for was a reduction in the size of the
 bloom filter.  But, according to nodetool cfstats, it's still fairly large
 (~1.5GB of RAM).

 Do bloom filters ever resize themselves when the CF suddenly gets
 smaller?

 My next test will be restarting one of the instances, though I'll have
 to wait on that operation so I thought I'd ask in the meantime.

 will










clearing tombstones?

2014-04-11 Thread William Oberman
I'm wondering what will clear tombstoned rows?  nodetool cleanup, nodetool
repair, or time (as in just wait)?

I had a CF that was more or less storing session information.  After some
time, we decided that one piece of this information was pointless to track
(and was 90%+ of the columns, and in 99% of those cases was ALL columns for
a row).   I wrote a process to remove all of those columns (which again in
a vast majority of cases had the effect of removing the whole row).

This CF had ~1 billion rows, so I expect to be left with ~100m rows.  After
I did this mass delete, everything was the same size on disk (which I
expected, knowing how tombstoning works).  It wasn't 100% clear to me what
to poke to cause compactions to clear the tombstones.  First I tried
nodetool cleanup on a candidate node.  But, afterwards the disk usage was
the same.  Then I tried nodetool repair on that same node.  But again, disk
usage is still the same.  The CF has no snapshots.

So, am I misunderstanding something?  Is there another operation to try?
 Do I have to just wait?  I've only done cleanup/repair on one node.  Do
I have to run one or the other over all nodes to clear tombstones?

Cassandra 1.2.15 if it matters,

Thanks!

will


Re: clearing tombstones?

2014-04-11 Thread William Oberman
I'm seeing a lot of articles about a dependency between removing tombstones
and GCGraceSeconds, which might be my problem (I just checked, and this CF
has GCGraceSeconds of 10 days).


On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli tbarbu...@gmail.com wrote:

 compaction should take care of it; for me it never worked so I run
 nodetool compaction on every node; that does it.


 2014-04-11 16:05 GMT+02:00 William Oberman ober...@civicscience.com:

 I'm wondering what will clear tombstoned rows?  nodetool cleanup, nodetool
 repair, or time (as in just wait)?

 I had a CF that was more or less storing session information.  After some
 time, we decided that one piece of this information was pointless to track
 (and was 90%+ of the columns, and in 99% of those cases was ALL columns for
 a row).   I wrote a process to remove all of those columns (which again in
 a vast majority of cases had the effect of removing the whole row).

 This CF had ~1 billion rows, so I expect to be left with ~100m rows.
  After I did this mass delete, everything was the same size on disk (which
 I expected, knowing how tombstoning works).  It wasn't 100% clear to me
 what to poke to cause compactions to clear the tombstones.  First I tried
 nodetool cleanup on a candidate node.  But, afterwards the disk usage was
 the same.  Then I tried nodetool repair on that same node.  But again, disk
 usage is still the same.  The CF has no snapshots.

 So, am I misunderstanding something?  Is there another operation to try?
  Do I have to just wait?  I've only done cleanup/repair on one node.  Do
 I have to run one or the other over all nodes to clear tombstones?

 Cassandra 1.2.15 if it matters,

 Thanks!

 will





Re: clearing tombstones?

2014-04-11 Thread William Oberman
So, if I was impatient and just wanted to make this happen now, I could:

1.) Change GCGraceSeconds of the CF to 0
2.) run nodetool compact (*)
3.) Change GCGraceSeconds of the CF back to 10 days

Since I have ~900M tombstones, even if I miss a few due to impatience, I
don't care *that* much as I could re-run my clean up tool against the now
much smaller CF.

(*) A long long time ago I seem to recall reading advice about "don't ever
run nodetool compact", but I can't remember why.  Is there any bad long-term
consequence?  Short term there are several:
-a heavy operation
-temporary 2x disk space
-one big SSTable afterwards
But moving forward, everything is ok right?  CommitLog/MemTable->SSTables,
minor compactions that merge SSTables, etc...  The only flaw I can think of
is it will take forever until the SSTable minor compactions build up enough
to consider including the big SSTable in a compaction, making it likely
I'll have to self-manage compactions.
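
(For the record, a minimal sketch of scripting those 3 steps; this is Python
with pycassa plus a shelled-out nodetool, assuming pycassa's
SystemManager.alter_column_family accepts gc_grace_seconds as a keyword.  The
keyspace/CF names are placeholders, and the compact step runs per node:)

import subprocess
from pycassa.system_manager import SystemManager

KEYSPACE, CF = 'my_keyspace', 'my_cf'    # placeholders
TEN_DAYS = 10 * 24 * 60 * 60

sys_mgr = SystemManager('localhost:9160')
sys_mgr.alter_column_family(KEYSPACE, CF, gc_grace_seconds=0)           # step 1
subprocess.check_call(['nodetool', 'compact', KEYSPACE, CF])            # step 2
sys_mgr.alter_column_family(KEYSPACE, CF, gc_grace_seconds=TEN_DAYS)    # step 3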



On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy mark.re...@boxever.com wrote:

 Correct, a tombstone will only be removed after gc_grace period has
 elapsed. The default value is set to 10 days which allows a great deal of
 time for consistency to be achieved prior to deletion. If you are
 operationally confident that you can achieve consistency via anti-entropy
 repairs within a shorter period you can always reduce that 10 day interval.


 Mark


 On Fri, Apr 11, 2014 at 3:16 PM, William Oberman ober...@civicscience.com
  wrote:

 I'm seeing a lot of articles about a dependency between removing
 tombstones and GCGraceSeconds, which might be my problem (I just checked,
 and this CF has GCGraceSeconds of 10 days).


 On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli 
  tbarbu...@gmail.com wrote:

 compaction should take care of it; for me it never worked so I run
 nodetool compaction on every node; that does it.


 2014-04-11 16:05 GMT+02:00 William Oberman ober...@civicscience.com:

 I'm wondering what will clear tombstoned rows?  nodetool cleanup,
 nodetool repair, or time (as in just wait)?

 I had a CF that was more or less storing session information.  After
 some time, we decided that one piece of this information was pointless to
 track (and was 90%+ of the columns, and in 99% of those cases was ALL
 columns for a row).   I wrote a process to remove all of those columns
 (which again in a vast majority of cases had the effect of removing the
 whole row).

 This CF had ~1 billion rows, so I expect to be left with ~100m rows.
  After I did this mass delete, everything was the same size on disk (which
 I expected, knowing how tombstoning works).  It wasn't 100% clear to me
 what to poke to cause compactions to clear the tombstones.  First I tried
 nodetool cleanup on a candidate node.  But, afterwards the disk usage was
 the same.  Then I tried nodetool repair on that same node.  But again, disk
 usage is still the same.  The CF has no snapshots.

 So, am I misunderstanding something?  Is there another operation to
 try?  Do I have to just wait?  I've only done cleanup/repair on one node.
  Do I have to run one or the other over all nodes to clear tombstones?

 Cassandra 1.2.15 if it matters,

 Thanks!

 will









Re: clearing tombstones?

2014-04-11 Thread William Oberman
Answered my own question.  Good writeup here of the pros/cons of compact:
http://www.datastax.com/documentation/cassandra/1.2/cassandra/operations/ops_about_config_compact_c.html

And I was thinking of bad information that used to float in this forum
about major compactions (with respect to the impact to minor compactions).
 I'm hesitant to write the offending sentence again :-)


On Fri, Apr 11, 2014 at 10:44 AM, William Oberman
ober...@civicscience.com wrote:

 So, if I was impatient and just wanted to make this happen now, I could:

 1.) Change GCGraceSeconds of the CF to 0
 2.) run nodetool compact (*)
 3.) Change GCGraceSeconds of the CF back to 10 days

 Since I have ~900M tombstones, even if I miss a few due to impatience, I
 don't care *that* much as I could re-run my clean up tool against the now
 much smaller CF.

 (*) A long long time ago I seem to recall reading advice about don't ever
 run nodetool compact, but I can't remember why.  Is there any bad long
 term consequence?  Short term there are several:
 -a heavy operation
 -temporary 2x disk space
 -one big SSTable afterwards
 But moving forward, everything is ok right?  CommitLog/MemTable->SSTables,
 minor compactions that merge SSTables, etc...  The only flaw I can think of
 is it will take forever until the SSTable minor compactions build up enough
 to consider including the big SSTable in a compaction, making it likely
 I'll have to self manage compactions.



 On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy mark.re...@boxever.com wrote:

 Correct, a tombstone will only be removed after gc_grace period has
 elapsed. The default value is set to 10 days which allows a great deal of
 time for consistency to be achieved prior to deletion. If you are
 operationally confident that you can achieve consistency via anti-entropy
 repairs within a shorter period you can always reduce that 10 day interval.


 Mark


 On Fri, Apr 11, 2014 at 3:16 PM, William Oberman 
 ober...@civicscience.com wrote:

 I'm seeing a lot of articles about a dependency between removing
 tombstones and GCGraceSeconds, which might be my problem (I just checked,
 and this CF has GCGraceSeconds of 10 days).


 On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli 
  tbarbu...@gmail.com wrote:

 compaction should take care of it; for me it never worked so I run
 nodetool compaction on every node; that does it.


 2014-04-11 16:05 GMT+02:00 William Oberman ober...@civicscience.com:

 I'm wondering what will clear tombstoned rows?  nodetool cleanup,
 nodetool repair, or time (as in just wait)?

 I had a CF that was more or less storing session information.  After
 some time, we decided that one piece of this information was pointless to
 track (and was 90%+ of the columns, and in 99% of those cases was ALL
 columns for a row).   I wrote a process to remove all of those columns
 (which again in a vast majority of cases had the effect of removing the
 whole row).

 This CF had ~1 billion rows, so I expect to be left with ~100m rows.
  After I did this mass delete, everything was the same size on disk (which
 I expected, knowing how tombstoning works).  It wasn't 100% clear to me
 what to poke to cause compactions to clear the tombstones.  First I tried
 nodetool cleanup on a candidate node.  But, afterwards the disk usage was
 the same.  Then I tried nodetool repair on that same node.  But again, 
 disk
 usage is still the same.  The CF has no snapshots.

 So, am I misunderstanding something?  Is there another operation to
 try?  Do I have to just wait?  I've only done cleanup/repair on one 
 node.
  Do I have to run one or the other over all nodes to clear tombstones?

 Cassandra 1.2.15 if it matters,

 Thanks!

 will










Re: clearing tombstones?

2014-04-11 Thread William Oberman
Yes, I'm using SizeTiered.

I totally understand the "mess up the heuristics" issue.  But, I don't
understand "You will incur the operational overhead of having to manage
compactions if you wish to compact these smaller SSTables".  My
understanding is the small tables will still compact.  The problem is that
until I have 3 other (by default) tables of the same size as the big
table, it won't be compacted.

In my case, this might not be terrible though, right?  To get into the
trees, I have 9 nodes with RF=3 and this CF is ~500GB/node.  I deleted like
90-95% of the data, so I expect the data to be 25-50GB after the tombstones
are cleared, but call it 50GB.  That means I won't compact this 50GB file
until I gather another 150GB (50+50+50+50 -> 200GB).  But, that's not
*horrible*.  Now, if I only deleted 10% of the data, waiting to compact
450GB until I had another 1.3TB would be rough...
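
(The same SizeTiered arithmetic in a couple of lines of Python, using the
default min_compaction_threshold of 4 and the ~50GB guess above:)

min_threshold = 4            # SizeTiered default
big_sstable_gb = 50          # post-compaction size guessed above

extra_needed = (min_threshold - 1) * big_sstable_gb
print(extra_needed, "GB of similarly sized SSTables before the big one is eligible")
# 150 GB (i.e. 50+50+50+50 -> a 200GB compaction)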

I think your advice is great for people looking for normal answers in the
forum, but I don't think my use case is very normal :-)

will

On Fri, Apr 11, 2014 at 11:12 AM, Mark Reddy mark.re...@boxever.com wrote:

 Yes, running nodetool compact (major compaction) creates one large
 SSTable. This will mess up the heuristics of the SizeTiered strategy (is
 this the compaction strategy you are using?) leading to multiple 'small'
 SSTables alongside the single large SSTable, which results in increased
 read latency. You will incur the operational overhead of having to manage
 compactions if you wish to compact these smaller SSTables. For all these
 reasons it is generally advised to stay away from running compactions
 manually.

 Assuming that this is a production environment and you want to keep
 everything running as smoothly as possible I would reduce the gc_grace on
 the CF, allow automatic minor compactions to kick in and then increase the
 gc_grace once again after the tombstones have been removed.


 On Fri, Apr 11, 2014 at 3:44 PM, William Oberman ober...@civicscience.com
  wrote:

 So, if I was impatient and just wanted to make this happen now, I could:

 1.) Change GCGraceSeconds of the CF to 0
 2.) run nodetool compact (*)
 3.) Change GCGraceSeconds of the CF back to 10 days

 Since I have ~900M tombstones, even if I miss a few due to impatience, I
 don't care *that* much as I could re-run my clean up tool against the now
 much smaller CF.

 (*) A long long time ago I seem to recall reading advice about don't
 ever run nodetool compact, but I can't remember why.  Is there any bad
 long term consequence?  Short term there are several:
 -a heavy operation
 -temporary 2x disk space
 -one big SSTable afterwards
 But moving forward, everything is ok right?
  CommitLog/MemTable->SSTables, minor compactions that merge SSTables,
 etc...  The only flaw I can think of is it will take forever until the
 SSTable minor compactions build up enough to consider including the big
 SSTable in a compaction, making it likely I'll have to self manage
 compactions.



 On Fri, Apr 11, 2014 at 10:31 AM, Mark Reddy mark.re...@boxever.com wrote:

 Correct, a tombstone will only be removed after gc_grace period has
 elapsed. The default value is set to 10 days which allows a great deal of
 time for consistency to be achieved prior to deletion. If you are
 operationally confident that you can achieve consistency via anti-entropy
 repairs within a shorter period you can always reduce that 10 day interval.


 Mark


 On Fri, Apr 11, 2014 at 3:16 PM, William Oberman 
 ober...@civicscience.com wrote:

 I'm seeing a lot of articles about a dependency between removing
 tombstones and GCGraceSeconds, which might be my problem (I just checked,
 and this CF has GCGraceSeconds of 10 days).


 On Fri, Apr 11, 2014 at 10:10 AM, tommaso barbugli tbarbu...@gmail.com
  wrote:

 compaction should take care of it; for me it never worked so I run
 nodetool compaction on every node; that does it.


 2014-04-11 16:05 GMT+02:00 William Oberman ober...@civicscience.com:

 I'm wondering what will clear tombstoned rows?  nodetool cleanup,
 nodetool repair, or time (as in just wait)?

 I had a CF that was more or less storing session information.  After
 some time, we decided that one piece of this information was pointless to
 track (and was 90%+ of the columns, and in 99% of those cases was ALL
 columns for a row).   I wrote a process to remove all of those columns
 (which again in a vast majority of cases had the effect of removing the
 whole row).

 This CF had ~1 billion rows, so I expect to be left with ~100m rows.
  After I did this mass delete, everything was the same size on disk 
 (which
 I expected, knowing how tombstoning works).  It wasn't 100% clear to me
 what to poke to cause compactions to clear the tombstones.  First I tried
 nodetool cleanup on a candidate node.  But, afterwards the disk usage was
 the same.  Then I tried nodetool repair on that same node.  But again, 
 disk
 usage is still the same.  The CF has no snapshots.

 So, am I misunderstanding

Re: using hadoop + cassandra for CF mutations (delete)

2014-04-08 Thread William Oberman
I use PHP, and phpCassa to talk to cassandra from within my app.  I'm using
the below script's structure as a way to run a local mutation on each of my
nodes:

===
<?php
require_once('PATH/TO/phpcassa-1.0.a.6/lib/autoload.php');

use phpcassa\ColumnFamily;
use phpcassa\Connection\ConnectionPool;
use phpcassa\SystemManager;

try {
    // I'm sure there is a cleaner way of doing this, but for me I can't
    // connect as localhost, I need to use the AWS private IP
    $ip = exec("/sbin/ifconfig | grep 'inet addr:' | grep -v '127.0.0.1' | cut -d: -f2 | awk '{ print \$1}'");
    $localCassandra = "$ip:9160";
    $keyspace = "";  // fill in your keyspace
    $cf = "";        // fill in your column family

    $systemManager = new SystemManager($localCassandra);
    $ring = $systemManager->describe_ring($keyspace);
    $startToken = null;
    $endToken = null;
    foreach ($ring as $ringDetails) {
        // There is an endpoint per RF, with the 1st == owner
        foreach ($ringDetails->endpoints as $endpoint) {
            if ($endpoint == $ip) {
                $startToken = $ringDetails->start_token;
                $endToken = $ringDetails->end_token;
            }
            break;
        }
        if ($startToken != null && $endToken != null) {
            break;
        }
    }

    if ($startToken == null || $endToken == null) {
        fwrite(STDERR, "My ip[$ip] not in ring = " . print_r($ring, true));
        exit(1);
    }

    $pool = new ConnectionPool($keyspace, array($localCassandra));
    $column_family = new ColumnFamily($pool, $cf);
    // I patched my local phpCassa to support null == iterate over all rows.
    // If my pull request is accepted it will be in the git repo, otherwise put
    // a large int as arg 3, something hopefully larger than all rows on a node...
    foreach ($column_family->get_range_by_token($startToken, $endToken, null) as $key => $columns) {
        foreach ($columns as $cn => $cv) {
            // I track information here
        }
        // and do an optional mutation here based on the row
    }
} catch (Exception $e) {
    fwrite(STDERR, $e);
}



On Fri, Apr 4, 2014 at 1:40 PM, William Oberman ober...@civicscience.com wrote:

 Looking at the code, cassandra.input.split.size == the Pig URL split_size,
 right?  But in cassandra 1.2.15 I'm wondering if there is a bug that would
 make the hadoop conf setting cassandra.input.split.size not be used
 unless you manually set the URI to splitSize=0 (because the abstract class
 defaults the splitSize to 64k instead of 0)?  Long story short though,
 I've messed with that setting in the direction you suggested (decreasing),
 and I'm confident hadoop/pig was picking it up (I eventually decreased it
 too far, which caused a server-side error of too much memory used).

 I'm stuck between a rock & a hard place on the mappers.  At 20 tasks, based
 on the delete rate before timeout failures happen, it was going to take
 1-2 days to run the deletes (I was seeing ~10k deletes/sec across all 20
 task threads).  But this is going to be my plan at this point: fewer
 tasks at once, even if it takes a week (of hopefully unsupervised time).
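
(Rough arithmetic behind the 1-2 day estimate, assuming ~900M rows to delete
as in the related tombstone thread:)

rows_to_delete = 900_000_000
deletes_per_sec = 10_000              # aggregate across the 20 mappers
hours = rows_to_delete / deletes_per_sec / 3600
print(round(hours), "hours, or about", round(hours / 24, 1), "days")   # 25 hours, ~1.0 days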

 Thanks for the feedback!






 On Fri, Apr 4, 2014 at 12:57 PM, Paulo Ricardo Motta Gomes 
 paulo.mo...@chaordicsystems.com wrote:

 You said you have tried the Pig URL split_size, but have you actually
 tried decreasing the value of cassandra.input.split.size hadoop property?
 The default is 65536, so you may want to decrease that to see if the number
 of mappers increase. But at some point, even if you lower that value it
 will stop decreasing the number of mappers but I don't know exactly why,
 probably because it hits the minimum number of rows per token.

 Another suggestion is to decrease the number of simultaneous mappers of
 your job, so it doesn't hit cassandra too hard, and you'll get less
 TimedOutExceptions, but your job will take longer to complete.

 On Fri, Apr 4, 2014 at 1:24 PM, William Oberman ober...@civicscience.com
  wrote:

 Hi,

 I have some history with cassandra + hadoop:
 1.) Single DC + integrated hadoop = Was ok until I needed steady
 performance (the single DC was used in a production environment)
 2.) Two DC's + integrated hadoop on 1 of 2 DCs = Was ok until my data
 grew and in AWS compute is expensive compared to data storage... e.g.
 running a 24x7 DC was a lot more expensive than the following solution...
 3.) Single DC + a constant ETL to S3 = Is still ok, I can spawn an
 arbitrarily large EMR cluster.  And 24x7 data storage + transient EMR is
 cost effective.

 But, one of my CF's has had a change of usage pattern, making a large %,
 but not all, of the data fairly pointless to store.  I thought I'd write a
 Pig UDF that could peek at a row of data and delete it if it fails my
 criteria.  And it works in terms of logic, but not in terms of practical
 execution.  The CF in question has O(billion) keys, and afterwards it will
 have ~10% of that at most.

 I basically keep losing the jobs due to too many task failures, all
 rooted in:
 Caused

Re: in AWS is it worth trying to talk to a server in the same zone as your client?

2014-02-12 Thread William Oberman
Same region, cross zone transfer is $0.01 / GB (see
http://aws.amazon.com/ec2/pricing/, Data Transfer section).
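
(To put that rate in context, a rough sketch of the kind of savings being
discussed; the traffic volume here is entirely made up:)

cross_az_gb_per_day = 500         # hypothetical client<->coordinator traffic
price_per_gb = 0.01               # same-region, cross-zone rate quoted above

print(round(cross_az_gb_per_day * 30 * price_per_gb), "USD/month")   # 150 USD/month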


 On Wed, Feb 12, 2014 at 3:04 PM, Russell Bradberry rbradbe...@gmail.com wrote:

 Cross zone data transfer does not cost any extra money.

 LOCAL_QUORUM = QUORUM if all 6 servers are located in the same logical
 datacenter.

 Ensure your clients are connecting to either the local IP or the AWS
 hostname that is a CNAME to the local ip from within AWS.  If you connect
 to the public IP you will get charged for outbound data transfer.



 On February 12, 2014 at 2:58:07 PM, Yogi Nerella (ynerella...@gmail.com)
 wrote:

 Also, maybe you need to set the read consistency to LOCAL_QUORUM,
 otherwise the servers will still try to read the data from the other data
 centers.

 I can understand the latency, but I can't understand how it would save
 money.  The amount of data transferred from the AWS server to the client
 should be the same no matter where the client is connected?



 On Wed, Feb 12, 2014 at 10:33 AM, Andrey Ilinykh ailin...@gmail.com wrote:

 yes, sure. Taking data from the same zone will reduce latency and save
 you some money.


 On Wed, Feb 12, 2014 at 10:13 AM, Brian Tarbox 
  tar...@cabotresearch.com wrote:

 We're running a C* cluster with 6 servers spread across the four
 us-east1 zones.

 We also spread our clients (hundreds of them) across the four zones.

 Currently we give our clients a connection string listing all six
 servers and let C* do its thing.

 This is all working just fine...and we're paying a fair bit in AWS
 transfer costs.  There is a suspicion that this transfer cost is driven by
 us passing data around between our C* servers and clients.

 Would there be any value to trying to get a client to talk to one of the
 C* servers in its own zone?

 I understand (at least partially!) about coordinator nodes and
 replication and know that no matter which server is the coordinator for an
 operation replication may cause bits to get transferred to/from servers in
 other zones.  Having said that...is there a chance that trying to encourage
 a client to initially contact a server in its own zone would help?

 Thank you,

 Brian Tarbox






dependencies for cassandra's pig integration?

2013-07-31 Thread William Oberman
I'm using AWS's EMR (hadoop as a service), and one step copies some data
from EMR -> my cassandra cluster.  I used to patch EMR with pig 0.11, but
now AWS officially supports 0.11, so I thought I'd give it a try.  I was
having issues.  The AWS forum thread on it is here:
https://forums.aws.amazon.com/thread.jspa?threadID=131212

But the crux seems to be: I need to have pig.jar on the hadoop classpath,
and AWS doesn't install it there (/home/hadoop/lib is where all hadoop
libs live, and Amazon installs pig into /home/hadoop/lib/pig).  Does this
make sense, or not?

will


cqlsh + existing cf's + query

2013-07-03 Thread William Oberman
I've been running cassandra a while, and have used the PHP api and
cassandra-cli, but never gave cqlsh a shot.

I'm not quite getting it.  My most simple CF is a dumping ground for
testing things, created as:
create column family stats;
I was putting random stats I was computing in it.  All keys, column names &
column values are really ascii.
Using cqlsh (1.1.12, default shell so CQL 2.0 I believe), I did:
USE my_keyspace_name;
ASSUME stats(KEY) VALUES are text, NAMES are text, VALUES are text;
SELECT * from stats LIMIT 10;

Looks great.  But, then I try to do:
SELECT * from stats WHERE KEY='KNOWN_KEY_NAME';

I get:
Bad Request: cannot parse 'KNOWN_KEY_NAME' as hex bytes

If I do the hex value for KNOWN_KEY_NAME, it works.  E.g.
SELECT * from stats WHERE KEY='various_0-9a-f_chars';

I can't find a TO_HEX(string) built-in function for CQL.  Am I doing
something wrong?

DESCRIBE TABLE stats;

CREATE TABLE stats (
  KEY blob PRIMARY KEY
) WITH
  comment='' AND
  comparator=blob AND
  read_repair_chance=1.00 AND
  gc_grace_seconds=864000 AND
  default_validation=blob AND
  min_compaction_threshold=4 AND
  max_compaction_threshold=32 AND
  replicate_on_write='false' AND
  compaction_strategy_class='SizeTieredCompactionStrategy';
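
(Since the key column is a blob, one way to get the hex literal cqlsh wants
for an ascii key is to hex-encode it yourself; a Python one-liner sketch:)

key = "KNOWN_KEY_NAME"
print("SELECT * FROM stats WHERE KEY = '%s';" % key.encode("ascii").hex())
# SELECT * FROM stats WHERE KEY = '4b4e4f574e5f4b45595f4e414d45';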


1.1.9 - 1.1.11 rpm upgrade issue

2013-05-03 Thread William Oberman
I get this:
Running rpm_check_debug
ERROR with rpm_check_debug vs depsolve:
apache-cassandra11 conflicts with apache-cassandra11-1.1.11-1.noarch

I'm using CentOS.  Problem with my OS, or problem with the package?  (And
how can it conflict with itself??)

will


Re: normal thread counts?

2013-05-01 Thread William Oberman
I've done some more digging, and I have more data but no answers (not
knowing the cassandra internals).

Based on Aaron's comment about gossipinfo/thread dump:

-All IPs that gossip knows about have 2 threads in my thread dump (that
seems ok/fine)

-I have an additional set of IPs in my thread dump (in WRITE-* threads) that:
  1.) Used to be part of my cluster, but are not currently
  2.) Had tokens that are NOT part of the cluster anymore

Cassandra is attempting to communicate with these bad IPs once a minute.
 The log for that attempt is at the bottom of this email.  Does this sound
familiar to anyone else?

Log snippet:

/var/log/cassandra/system.log: INFO [GossipStage:1] 2013-05-01 11:05:11,865
Gossiper.java (line 831) InetAddress /10.114.67.189 is now dead.
/var/log/cassandra/system.log: INFO [GossipStage:1] 2013-05-01 11:05:11,866
StorageService.java (line 1303) Removing token 0 for /10.114.67.189



On Tue, Apr 30, 2013 at 5:34 PM, aaron morton aa...@thelastpickle.com wrote:

 The issue below could result in abandoned threads under high contention,
 so we'll get that fixed.

 But we are not sure how/why it would be called so many times. If you could
 provide a full list of threads and the output from nodetool gossipinfo that
 would help.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 1/05/2013, at 8:34 AM, aaron morton aa...@thelastpickle.com wrote:

  Many many many of the threads are trying to talk to IPs that aren't in
 the cluster (I assume they are the IP's of dead hosts).

 Are these IP's from before the upgrade ? Are they IP's you expect to see ?

 Cross reference them with the output from nodetool gossipinfo to see why
 the node thinks they should be used.
 Could you provide a list of the thread names ?

 One way to remove those IPs may be to do a rolling restart with
 -Dcassandra.load_ring_state=false in the JVM opts at the bottom of
 cassandra-env.sh

 The OutboundTcpConnection threads are created in pairs by the
 OutboundTcpConnectionPool, which is created here
 https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502
  The
 threads are created in the OutboundTcpConnectionPool constructor checking
 to see if this could be the source of the leak.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 1/05/2013, at 2:18 AM, William Oberman ober...@civicscience.com
 wrote:

 I use phpcassa.

 I did a thread dump.  99% of the threads look very similar (I'm using
 1.1.9 in terms of matching source lines).  The thread names are all like
 this: WRITE-/10.x.y.z.  There are a LOT of duplicates (in terms of the
 same IP).  Many many many of the threads are trying to talk to IPs that
 aren't in the cluster (I assume they are the IP's of dead hosts).  The
 stack trace is basically the same for them all, attached at the bottom.

 There is a lot of things I could talk about in terms of my situation, but
 what I think might be pertinent to this thread: I hit a tipping point
 recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge
 (rolling, one at a time).  7 of the 9 upgraded fine and work great.  2 of
 the 9 keep struggling.  I've replaced them many times now, each time using
 this process:
 http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
 And even this morning the only two nodes with a high number of threads are
 those two (yet again).  And at some point they'll OOM.

 Seems like there is something about my cluster (caused by the recent
 upgrade?) that causes a thread leak on OutboundTcpConnection   But I don't
 know how to escape from the trap.  Any ideas?


 
   stackTrace = [ {
 className = sun.misc.Unsafe;
 fileName = Unsafe.java;
 lineNumber = -2;
  methodName = park;
 nativeMethod = true;
}, {
 className = java.util.concurrent.locks.LockSupport;
 fileName = LockSupport.java;
 lineNumber = 158;
 methodName = park;
 nativeMethod = false;
}, {
 className =
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
 fileName = AbstractQueuedSynchronizer.java;
 lineNumber = 1987;
 methodName = await;
 nativeMethod = false;
}, {
 className = java.util.concurrent.LinkedBlockingQueue;
 fileName = LinkedBlockingQueue.java;
 lineNumber = 399;
 methodName = take;
 nativeMethod = false;
}, {
 className = org.apache.cassandra.net.OutboundTcpConnection;
 fileName = OutboundTcpConnection.java;
 lineNumber = 104;
 methodName = run;
 nativeMethod = false;
} ];
 --




 On Mon, Apr 29, 2013 at 4:31 PM, aaron morton aa...@thelastpickle.com wrote:

  I used JMX to check current number of threads in a production cassandra
 machine, and it was ~27,000.

 That does not sound too good

Re: normal thread counts?

2013-05-01 Thread William Oberman
That has GOT to be it.  1.1.10 upgrade it is...


On Wed, May 1, 2013 at 5:09 PM, Janne Jalkanen janne.jalka...@ecyrd.com wrote:


 This sounds very much like
 https://issues.apache.org/jira/browse/CASSANDRA-5175, which was fixed in
 1.1.10.

 /Janne

 On Apr 30, 2013, at 23:34 , aaron morton aa...@thelastpickle.com wrote:

  Many many many of the threads are trying to talk to IPs that aren't in
 the cluster (I assume they are the IP's of dead hosts).

 Are these IP's from before the upgrade ? Are they IP's you expect to see ?

 Cross reference them with the output from nodetool gossipinfo to see why
 the node thinks they should be used.
 Could you provide a list of the thread names ?

 One way to remove those IPs may be to do a rolling restart with
 -Dcassandra.load_ring_state=false in the JVM opts at the bottom of
 cassandra-env.sh

 The OutboundTcpConnection threads are created in pairs by the
 OutboundTcpConnectionPool, which is created here
 https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/net/MessagingService.java#L502
  The
 threads are created in the OutboundTcpConnectionPool constructor checking
 to see if this could be the source of the leak.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 1/05/2013, at 2:18 AM, William Oberman ober...@civicscience.com
 wrote:

 I use phpcassa.

 I did a thread dump.  99% of the threads look very similar (I'm using
 1.1.9 in terms of matching source lines).  The thread names are all like
 this: WRITE-/10.x.y.z.  There are a LOT of duplicates (in terms of the
 same IP).  Many many many of the threads are trying to talk to IPs that
 aren't in the cluster (I assume they are the IP's of dead hosts).  The
 stack trace is basically the same for them all, attached at the bottom.

 There is a lot of things I could talk about in terms of my situation, but
 what I think might be pertinent to this thread: I hit a tipping point
 recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge
 (rolling, one at a time).  7 of the 9 upgraded fine and work great.  2 of
 the 9 keep struggling.  I've replaced them many times now, each time using
 this process:
 http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
 And even this morning the only two nodes with a high number of threads are
 those two (yet again).  And at some point they'll OOM.

 Seems like there is something about my cluster (caused by the recent
 upgrade?) that causes a thread leak on OutboundTcpConnection   But I don't
 know how to escape from the trap.  Any ideas?


 
   stackTrace = [ {
 className = sun.misc.Unsafe;
 fileName = Unsafe.java;
 lineNumber = -2;
  methodName = park;
 nativeMethod = true;
}, {
 className = java.util.concurrent.locks.LockSupport;
 fileName = LockSupport.java;
 lineNumber = 158;
 methodName = park;
 nativeMethod = false;
}, {
 className =
 java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
 fileName = AbstractQueuedSynchronizer.java;
 lineNumber = 1987;
 methodName = await;
 nativeMethod = false;
}, {
 className = java.util.concurrent.LinkedBlockingQueue;
 fileName = LinkedBlockingQueue.java;
 lineNumber = 399;
 methodName = take;
 nativeMethod = false;
}, {
 className = org.apache.cassandra.net.OutboundTcpConnection;
 fileName = OutboundTcpConnection.java;
 lineNumber = 104;
 methodName = run;
 nativeMethod = false;
} ];
 --




 On Mon, Apr 29, 2013 at 4:31 PM, aaron morton aa...@thelastpickle.com wrote:

  I used JMX to check current number of threads in a production cassandra
 machine, and it was ~27,000.

 That does not sound too good.

 My first guess would be lots of client connections. What client are you
 using, does it do connection pooling ?
 See the comments in cassandra.yaml around rpc_server_type; the default
 (sync) uses one thread per connection, and you may be better off with HSHA. But
 if your app is leaking connections you should probably deal with that first.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 30/04/2013, at 3:07 AM, William Oberman ober...@civicscience.com
 wrote:

 Hi,

 I'm having some issues.  I keep getting:
 
 ERROR [GossipStage:1] 2013-04-28 07:48:48,876
 AbstractCassandraDaemon.java (line 135) Exception in thread
 Thread[GossipStage:1,5,main]
 java.lang.OutOfMemoryError: unable to create new native thread
 --
 after a day or two of runtime.  I've checked and my system settings seem
 acceptable:
 memlock=unlimited
 nofiles=10
 nproc=122944

 I've messed with heap sizes from 6-12GB (15 physical, m1.xlarge in AWS),
 and I keep OOM'ing with the above error.

 I've found some (what seem to me) to be obscure references to the stack
 size

Re: normal thread counts?

2013-04-30 Thread William Oberman
I use phpcassa.

I did a thread dump.  99% of the threads look very similar (I'm using 1.1.9
in terms of matching source lines).  The thread names are all like this:
WRITE-/10.x.y.z.  There are a LOT of duplicates (in terms of the same
IP).  Many many many of the threads are trying to talk to IPs that aren't
in the cluster (I assume they are the IP's of dead hosts).  The stack trace
is basically the same for them all, attached at the bottom.

There is a lot of things I could talk about in terms of my situation, but
what I think might be pertinent to this thread: I hit a tipping point
recently and upgraded a 9 node cluster from AWS m1.large to m1.xlarge
(rolling, one at a time).  7 of the 9 upgraded fine and work great.  2 of
the 9 keep struggling.  I've replaced them many times now, each time using
this process:
http://www.datastax.com/docs/1.1/cluster_management#replacing-a-dead-node
And even this morning the only two nodes with a high number of threads are
those two (yet again).  And at some point they'll OOM.

Seems like there is something about my cluster (caused by the recent
upgrade?) that causes a thread leak on OutboundTcpConnection.  But I don't
know how to escape from the trap.  Any ideas?



  stackTrace = [ {
className = sun.misc.Unsafe;
fileName = Unsafe.java;
lineNumber = -2;
methodName = park;
nativeMethod = true;
   }, {
className = java.util.concurrent.locks.LockSupport;
fileName = LockSupport.java;
lineNumber = 158;
methodName = park;
nativeMethod = false;
   }, {
className =
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject;
fileName = AbstractQueuedSynchronizer.java;
lineNumber = 1987;
methodName = await;
nativeMethod = false;
   }, {
className = java.util.concurrent.LinkedBlockingQueue;
fileName = LinkedBlockingQueue.java;
lineNumber = 399;
methodName = take;
nativeMethod = false;
   }, {
className = org.apache.cassandra.net.OutboundTcpConnection;
fileName = OutboundTcpConnection.java;
lineNumber = 104;
methodName = run;
nativeMethod = false;
   } ];
--




On Mon, Apr 29, 2013 at 4:31 PM, aaron morton aa...@thelastpickle.com wrote:

  I used JMX to check current number of threads in a production cassandra
 machine, and it was ~27,000.

 That does not sound too good.

 My first guess would be lots of client connections. What client are you
 using, does it do connection pooling ?
 See the comments in cassandra.yaml around rpc_server_type; the default
 (sync) uses one thread per connection, and you may be better off with HSHA. But
 if your app is leaking connections you should probably deal with that first.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 30/04/2013, at 3:07 AM, William Oberman ober...@civicscience.com
 wrote:

 Hi,

 I'm having some issues.  I keep getting:
 
 ERROR [GossipStage:1] 2013-04-28 07:48:48,876 AbstractCassandraDaemon.java
 (line 135) Exception in thread Thread[GossipStage:1,5,main]
 java.lang.OutOfMemoryError: unable to create new native thread
 --
 after a day or two of runtime.  I've checked and my system settings seem
 acceptable:
 memlock=unlimited
 nofiles=10
 nproc=122944

 I've messed with heap sizes from 6-12GB (15 physical, m1.xlarge in AWS),
 and I keep OOM'ing with the above error.

 I've found some (what seem to me) to be obscure references to the stack
 size interacting with # of threads.  If I'm understanding it correctly, to
 reason about Java mem usage I have to think of OS + Heap as being locked
 down, and the stack gets the leftovers of physical memory and each thread
 gets a stack.

 For me, the system ulimit setting on stack is 10240k (no idea if java sees
 or respects this setting).  My -Xss for cassandra is the default (I hope,
 don't remember messing with it) of 180k.  I used JMX to check current
 number of threads in a production cassandra machine, and it was ~27,000.
  Is that a normal thread count?  Could my OOM be related to stack + number
 of threads, or am I overlooking something more simple?

 will





normal thread counts?

2013-04-29 Thread William Oberman
Hi,

I'm having some issues.  I keep getting:

ERROR [GossipStage:1] 2013-04-28 07:48:48,876 AbstractCassandraDaemon.java
(line 135) Exception in thread Thread[GossipStage:1,5,main]
java.lang.OutOfMemoryError: unable to create new native thread
--
after a day or two of runtime.  I've checked and my system settings seem
acceptable:
memlock=unlimited
nofiles=10
nproc=122944

I've messed with heap sizes from 6-12GB (15 physical, m1.xlarge in AWS),
and I keep OOM'ing with the above error.

I've found some (what seem to me) to be obscure references to the stack
size interacting with # of threads.  If I'm understanding it correctly, to
reason about Java mem usage I have to think of OS + Heap as being locked
down, and the stack gets the leftovers of physical memory and each thread
gets a stack.

For me, the system ulimit setting on stack is 10240k (no idea if java sees
or respects this setting).  My -Xss for cassandra is the default (I hope,
don't remember messing with it) of 180k.  I used JMX to check current
number of threads in a production cassandra machine, and it was ~27,000.
 Is that a normal thread count?  Could my OOM be related to stack + number
of threads, or am I overlooking something more simple?
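
(Back-of-the-envelope check on those numbers -- a minimal Java sketch using the
same JMX thread count, with the 180k stack size plugged in as an assumption:)

import java.lang.management.ManagementFactory;

public class StackFootprint {
    public static void main(String[] args) {
        int threads = ManagementFactory.getThreadMXBean().getThreadCount();
        long stackKb = 180;  // assumed -Xss per thread, from cassandra-env.sh
        long totalMb = threads * stackKb / 1024;
        // e.g. 27,000 threads * 180k is roughly 4,700 MB of native memory for
        // stacks, all outside the Java heap -- which would explain hitting
        // "unable to create new native thread" long before the heap fills up
        System.out.println(threads + " threads * " + stackKb + "k stacks = ~"
                + totalMb + " MB outside the heap");
    }
}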

will


Re: StatusLogger format?

2013-04-15 Thread William Oberman
99% sure it's in bytes.


On Mon, Apr 15, 2013 at 11:25 AM, William Oberman
ober...@civicscience.comwrote:

 Mainly the:
 ColumnFamily                Memtable ops,data
 section.

 Is data in bytes/kb/mb/etc?

 Example line:
 StatusLogger.java (line 116) civicscience.sessions          4963,1799916

 Thanks!





Re: how to stop out of control compactions?

2013-04-02 Thread William Oberman
Thanks Gregg and Aaron. Missed that setting!

On Tuesday, April 2, 2013, aaron morton wrote:

 Set the min and max
 compaction thresholds for a given column family

 +1 for setting the max_compaction_threshold (as well as the min) on a
 CF when you are getting behind. It can limit the size of the compactions
 and give things a chance to complete in a reasonable time.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 2/04/2013, at 3:42 AM, Gregg Ulrich gulr...@netflix.com wrote:

 You may want to set compaction threshold and not throughput.  If you set
 the min threshold to something very large (10), compactions will not
 start until cassandra finds this many files to compact (which it should
 not).

 In the past I have used this to stop compactions on a node, and then run
 an offline major compaction to get through the compaction, then set the min
 threshold back.  Not everyone likes major compactions though.



   setcompactionthreshold <keyspace> <cfname> <minthreshold> <maxthreshold>
         - Set the min and max compaction thresholds for a given column family



 On Mon, Apr 1, 2013 at 12:38 PM, William Oberman ober...@civicscience.com wrote:

 I'll skip the prelude, but I worked myself into a bit of a jam.  I'm
 recovering now, but I want to double check if I'm thinking about things
 correctly.

 Basically, I was in a state where a majority of my servers wanted to do
 compactions, and rather large ones.  This was impacting my site
 performance.  I tried nodetool stop COMPACTION.  I tried
 setcompactionthroughput=1.  I tried restarting servers, but they'd restart
 the compactions pretty much immediately on boot.

 Then I realized that:
 nodetool stop COMPACTION
 only stopped running compactions, and then the compactions would
 re-enqueue themselves rather quickly.

 So, right now I have:
 1.) scripts running on N-1 servers looping on nodetool stop COMPACTION
 in a tight loop
 2.) On the Nth server I've disabled gossip/thrift and turned up
 setcompactionthroughput to 999
 3.) When the Nth server completes, I pick from the remaining N-1 (well,
 I'm still running the first compaction, which is going to take 12 more
 hours, but that is the plan at least).

 Does this make sense?  Other than the fact there was probably warning
 signs that would have prevented me from getting into this state in the
 first place? :-)

 will





-- 
Will Oberman
Civic Science, Inc.
6101 Penn Avenue, Fifth Floor
Pittsburgh, PA 15206
(M) 412-480-7835
(E) ober...@civicscience.com


Re: how to stop out of control compactions?

2013-04-02 Thread William Oberman
Edward, you make a good point, and I do think I am getting closer to having
to increase my cluster size (I'm around ~300GB/node now).

In my case, I think it was neither.  I had one node OOM after working on a
large compaction but it continued to run in a zombie like state (constantly
GC'ing), which I didn't have an alert on.  Then I had the bad luck of a
close token also starting a large compaction.  I have RF=3 with some of
my R/W patterns at quorum, causing that segment of my cluster to get slow
(e.g. a % of of my traffic started to slow).  I was running 1.1.2 (I
haven't had to poke anything for quite some time, obviously), so I upgraded
before moving on (as I saw a lot of bug fixes to compaction issues in
release notes).  But the upgrade caused even more nodes to start
compactions.  Which lead to my original email... I had a cluster where 80%
of my nodes were compacting, and I really needed to boost production
traffic and couldn't seem to tamp cassandra down temporarily.

Thanks for the advice everyone!

will


On Tue, Apr 2, 2013 at 10:20 AM, Edward Capriolo edlinuxg...@gmail.comwrote:

 Settings do not make compactions go away. If your compactions are out of
 control it usually means one of these things,
 1)  you have a corrupt table that the compaction never finishes on,
 sstables count keep growing
 2) you do not have enough hardware to handle your write load


 On Tue, Apr 2, 2013 at 7:50 AM, William Oberman 
 ober...@civicscience.comwrote:

 Thanks Gregg and Aaron. Missed that setting!

 On Tuesday, April 2, 2013, aaron morton wrote:

 Set the min and max
 compaction thresholds for a given column family

 +1 for setting the max_compaction_threshold (as well as the min) on a
 CF when you are getting behind. It can limit the size of the compactions
 and give things a chance to complete in a reasonable time.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Consultant
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 2/04/2013, at 3:42 AM, Gregg Ulrich gulr...@netflix.com wrote:

 You may want to set compaction threshold and not throughput.  If you set
 the min threshold to something very large (10), compactions will not
 start until cassandra finds this many files to compact (which it should
 not).

 In the past I have used this to stop compactions on a node, and then run
 an offline major compaction to get through the compaction, then set the min
 threshold back.  Not everyone likes major compactions though.



   setcompactionthreshold <keyspace> <cfname> <minthreshold> <maxthreshold>
         - Set the min and max compaction thresholds for a given column family



 On Mon, Apr 1, 2013 at 12:38 PM, William Oberman 
 ober...@civicscience.com wrote:

 I'll skip the prelude, but I worked myself into a bit of a jam.  I'm
 recovering now, but I want to double check if I'm thinking about things
 correctly.

 Basically, I was in a state where a majority of my servers wanted to do
 compactions, and rather large ones.  This was impacting my site
 performance.  I tried nodetool stop COMPACTION.  I tried
 setcompactionthroughput=1.  I tried restarting servers, but they'd restart
 the compactions pretty much immediately on boot.

 Then I realized that:
 nodetool stop COMPACTION
 only stopped running compactions, and then the compactions would
 re-enqueue themselves rather quickly.

 So, right now I have:
 1.) scripts running on N-1 servers looping on nodetool stop
 COMPACTION in a tight loop
 2.) On the Nth server I've disabled gossip/thrift and turned up
 setcompactionthroughput to 999
 3.) When the Nth server completes, I pick from the remaining N-1 (well,
 I'm still running the first compaction, which is going to take 12 more
 hours, but that is the plan at least).

 Does this make sense?  Other than the fact there was probably warning
 signs that would have prevented me from getting into this state in the
 first place? :-)

 will





 --
 Will Oberman
 Civic Science, Inc.
 6101 Penn Avenue, Fifth Floor
 Pittsburgh, PA 15206
 (M) 412-480-7835
 (E) ober...@civicscience.com





how to stop out of control compactions?

2013-04-01 Thread William Oberman
I'll skip the prelude, but I worked myself into a bit of a jam.  I'm
recovering now, but I want to double check if I'm thinking about things
correctly.

Basically, I was in a state where a majority of my servers wanted to do
compactions, and rather large ones.  This was impacting my site
performance.  I tried nodetool stop COMPACTION.  I tried
setcompactionthroughput=1.  I tried restarting servers, but they'd restart
the compactions pretty much immediately on boot.

Then I realized that:
nodetool stop COMPACTION
only stopped running compactions, and then the compactions would re-enqueue
themselves rather quickly.

So, right now I have:
1.) scripts running on N-1 servers looping on nodetool stop COMPACTION in
a tight loop
2.) On the Nth server I've disabled gossip/thrift and turned up
setcompactionthroughput to 999
3.) When the Nth server completes, I pick from the remaining N-1 (well, I'm
still running the first compaction, which is going to take 12 more hours,
but that is the plan at least).

Does this make sense?  Other than the fact there was probably warning signs
that would have prevented me from getting into this state in the first
place? :-)

will


odd timestamps

2013-03-05 Thread William Oberman
I happened to notice some bizarre timestamps coming out of the
cassandra-cli.  Example:
[default@XXX] get CF['e2b753aa33b13e74e5e803d787b06000'];
= (column=c35ef420-c37a-11e0-ac88-09b2f4397c6a, value=XXX,
timestamp=2013042719)
= (column=c3845ea0-c37a-11e0-8f6f-09b2f4397c6a, value=XXX,
timestamp=2013287771)
= (column=c3993840-c37a-11e0-a069-09b2f4397c6a, value=XXX
timestamp=2013423245)
= (column=c39a9040-c37a-11e0-8617-09b2f4397c6a, value=XXX,
timestamp=2013431971)
Returned 4 results.

I'm used to timestamps being micro from 1970, like this (same CF):
[default@XXX] get CF['3a4599767e16e94465e8491139154871'];
= (column=6c8e3160-4678-11e2-b69a-3534db9e8cfa, value=XXX,
timestamp=1355549380581630)
= (column=6c91f3a0-4678-11e2-bc00-3534db9e8cfa, value=XXX,
timestamp=1355549380606285)
= (column=6c980f00-4678-11e2-9963-3534db9e8cfa, value=XXX,
timestamp=1355549380646378)
= (column=6c994100-4678-11e2-955f-3534db9e8cfa, value=XXX,
timestamp=1355549380654057)
= (column=6c9e7950-4678-11e2-9189-3534db9e8cfa, value=XXX
timestamp=1355549380688268)
Returned 5 results.

Was there a version of cassandra that wrote bad timestamps at the server
level?  I started around 0.8.4, and I'm running 1.1.2 right now.

Could it have been a client with a bug?  I've only been using phpcassa and
the CLI to read/write data (no CQL).

Is there another theory?  I don't _think_ these odd timestamps will cause
me problems.  I just don't like unexpected results from my data stores...
:-)
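
(For reference, the usual client convention is microseconds since the Unix
epoch; a minimal Java sketch of generating one, plus a crude check that flags
short values like the ones above:)

public class Timestamps {
    public static void main(String[] args) {
        // conventional Cassandra column timestamp: microseconds since epoch
        long nowMicros = System.currentTimeMillis() * 1000L;
        System.out.println("now in micros: " + nowMicros);

        // anything written after ~2001 in micros-since-epoch is >= 10^15,
        // so a 10-digit value like 2013042719 was clearly produced some
        // other way (a different unit, or a formatted date)
        long[] samples = { 1355549380581630L, 2013042719L };
        for (long ts : samples) {
            System.out.println(ts + " looks like micros-since-epoch: "
                    + (ts >= 1000000000000000L));
        }
    }
}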

will


Re: sstable2json had random behavior

2013-01-22 Thread William Oberman
No, I have the other files, unfortunately, and I had it fail once and then
succeed every time after.

I'm tracking the external information of sstable2json more carefully now
(exit status, stdout, stderr), so hopefully if it happens again I can be
more help.
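
(In case it helps anyone else, a minimal sketch of that kind of wrapper in
Java -- the paths are placeholders:)

import java.io.File;

public class SSTable2JsonWrapper {
    public static void main(String[] args) throws Exception {
        // placeholder path -- point this at the real X-Data.db file
        ProcessBuilder pb = new ProcessBuilder(
                "sstable2json", "/var/lib/cassandra/data/X-Data.db");
        // keep stdout (the JSON) and stderr in separate files so a failure
        // leaves something concrete to attach to a JIRA ticket
        pb.redirectOutput(new File("/tmp/sstable2json.json"));
        pb.redirectError(new File("/tmp/sstable2json.err"));
        int exit = pb.start().waitFor();
        System.out.println("sstable2json exit status: " + exit);
    }
}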

will


On Tue, Jan 22, 2013 at 3:38 PM, aaron morton aa...@thelastpickle.comwrote:

 William,
 If the solution from Binh works for you can you please submit a ticket to
 https://issues.apache.org/jira/browse/CASSANDRA

 The error message could be better if that is the case.

 Cheers

-
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 22/01/2013, at 9:16 AM, Binh Nguyen binhn...@gmail.com wrote:

 Hi William,

 I also saw this one before but it always happened in my case when I have
 only Data and Index files. The problem goes away when I have all another
 files (Compression, Filter...)


 On Mon, Jan 21, 2013 at 11:36 AM, William Oberman 
 ober...@civicscience.com wrote:

 I'm running 1.1.6 from the datastax repo.

 I ran sstable2json and got the following error:
 Exception in thread main java.io.IOError: java.io.IOException: dataSize
 of 7020023552240793698 starting at 993981393 would be larger than file
 /var/lib/cassandra/data/X-Data.db length 7502161255
 at
 org.apache.cassandra.io.sstable.SSTableIdentityIterator.init(SSTableIdentityIterator.java:156)
 at
 org.apache.cassandra.io.sstable.SSTableIdentityIterator.init(SSTableIdentityIterator.java:86)
 at
 org.apache.cassandra.io.sstable.SSTableIdentityIterator.init(SSTableIdentityIterator.java:70)
 at
 org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:187)
 at
 org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:151)
 at
 org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:143)
 at
 org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:309)
 at
 org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:340)
 at
 org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:353)
 at
 org.apache.cassandra.tools.SSTableExport.main(SSTableExport.java:418)
 Caused by: java.io.IOException: dataSize of 7020023552240793698 starting
 at 993981393 would be larger than file
 /var/lib/cassandra/data/X-Data.db length 7502161255
 at
 org.apache.cassandra.io.sstable.SSTableIdentityIterator.init(SSTableIdentityIterator.java:115)
 ... 9 more


 I ran it again, and didn't.  This makes me worried :-)  Does anyone else
 ever see this class of error, and does it ever disappear for them?






sstable2json had random behavior

2013-01-21 Thread William Oberman
I'm running 1.1.6 from the datastax repo.

I ran sstable2json and got the following error:
Exception in thread main java.io.IOError: java.io.IOException: dataSize
of 7020023552240793698 starting at 993981393 would be larger than file
/var/lib/cassandra/data/X-Data.db length 7502161255
at
org.apache.cassandra.io.sstable.SSTableIdentityIterator.init(SSTableIdentityIterator.java:156)
at
org.apache.cassandra.io.sstable.SSTableIdentityIterator.init(SSTableIdentityIterator.java:86)
at
org.apache.cassandra.io.sstable.SSTableIdentityIterator.init(SSTableIdentityIterator.java:70)
at
org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:187)
at
org.apache.cassandra.io.sstable.SSTableScanner$KeyScanningIterator.next(SSTableScanner.java:151)
at
org.apache.cassandra.io.sstable.SSTableScanner.next(SSTableScanner.java:143)
at
org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:309)
at
org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:340)
at
org.apache.cassandra.tools.SSTableExport.export(SSTableExport.java:353)
at
org.apache.cassandra.tools.SSTableExport.main(SSTableExport.java:418)
Caused by: java.io.IOException: dataSize of 7020023552240793698 starting at
993981393 would be larger than file /var/lib/cassandra/data/X-Data.db
length 7502161255
at
org.apache.cassandra.io.sstable.SSTableIdentityIterator.init(SSTableIdentityIterator.java:115)
... 9 more


I ran it again, and didn't.  This makes me worried :-)  Does anyone else
ever see this class of error, and does it ever disappear for them?


Re: Cassandra at Amazon AWS

2013-01-17 Thread William Oberman
I have an EBS disk paired with the ephemeral disk.  Then I do nodetool
snapshot -> rsync from ephemeral to EBS -> take a snapshot of the EBS volume.
Syncing the nodetool snapshot directly to S3 would involve fewer steps and be
cheaper (EBS costs more than S3), but I do post-processing on the snapshot for
EMR, and it seemed pointless to push/pull from S3 when I could just map EBS
disks around.  I snapshot the EBS disk as a failsafe, and snapshots are
cheap (they cost the same as S3).

I've seen/read about how other people just watch the data directories for
new SStables and trigger copy to S3 (there are open source projects that do
that for you I believe).

And I think lot of people rely on the replication factor + multiple zones.

will


On Thu, Jan 17, 2013 at 8:44 AM, Adam Venturella aventure...@gmail.comwrote:

 Jared, how do you guys handle data backups for your ephemeral based
 cluster?

 I'm trying to move to ephemeral drives myself, and that was my last
 sticking point; asking how others in the community deal with backup in case
 the VM explodes.



 On Wed, Jan 16, 2013 at 1:21 PM, Jared Biel jared.b...@bolderthinking.com
  wrote:

 We're currently using Cassandra on EC2 at very low scale (a 2 node
 cluster on m1.large instances in two regions.) I don't believe that
 EBS is recommended for performance reasons. Also, it's proven to be
 very unreliable in the past (most of the big/notable AWS outages were
 due to EBS issues.) We've moved 99% of our instances off of EBS.

 As other have said, if you require more space in the future it's easy
 to add more nodes to the cluster. I've found this page
 (http://www.ec2instances.info/) very useful in determining the amount
 of space each instance type has. Note that by default only one
 ephemeral drive is attached and you must specify all ephemeral drives
 that you want to use at launch time. Also, you can create a RAID 0 of
 all local disks to provide maximum speed and space.


 On 16 January 2013 20:42, Marcelo Elias Del Valle mvall...@gmail.com
 wrote:
  Hello,
 
 I am currently using hadoop + cassandra at amazon AWS. Cassandra
 runs on
  EC2 and my hadoop process runs at EMR. For cassandra storage, I am using
  local EC2 EBS disks.
 My system is running fine for my tests, but to me it's not a good
 setup
  for production. I need my system to perform well for specially for
 writes on
  cassandra, but the amount of data could grow really big, taking several
 Tb
  of total storage.
  My first guess was using S3 as a storage and I saw this can be done
 by
  using Cloudian package, but I wouldn't like to become dependent on a
  pre-package solution and I found it's kind of expensive for more than
 100Tb:
  http://www.cloudian.com/pricing.html
  I saw some discussion at internet about using EBS or ephemeral
 disks for
  storage at Amazon too.
 
  My question is: does someone on this list have the same problem as
 me?
  What are you using as solution to Cassandra's storage when running it at
  Amazon AWS?
 
  Any thoughts would be highly appreciatted.
 
  Best regards,
  --
  Marcelo Elias Del Valle
  http://mvalle.com - @mvallebr





Re: AWS EMR - Cassandra

2013-01-16 Thread William Oberman
DataStax recommended (forget the reference) to use the ephemeral disks in
RAID0, which is what I've been running for well over a year now in
production.

In terms of how I'm doing Cassandra/AWS/Hadoop, I started by doing the
split data center thing (one DC for low latency queries, one DC for
hadoop).  But, that's a lot of system management.  And compute is the most
expensive part of AWS, and you need a LOT of compute to run this setup.  I
tried doing Cassandra EC2 cluster -> snapshot -> clone cluster with hadoop
overlay -> ETL to S3 using hadoop -> EMR for real work.  But that's kind of
a pain too (and the ETL to S3 wasn't very fast).

Now I'm going after the SStables directly(*), which sounds like how Netflix
does it.  You can do incremental updates, if you're careful.

(*) Cassandra EC2 -> backup to local EBS -> remap EBS to another box ->
sstable2json over new sstables -> S3 (splitting into ~100MB parts), then
use EMR to consume the JSON part files.

will


On Wed, Jan 16, 2013 at 3:30 PM, Marcelo Elias Del Valle mvall...@gmail.com
 wrote:

 William,

 I just saw your message today. I am using Cassandra + Amazon EMR
 (hadoop 1.0.3) but I am not using PIG as you are. I set my configuration
 vars in Java, as I have a custom jar file and I am using
 ColumnFamilyInputFormat.
 However, if I understood well your problem, the only thing you have to
 do is to set environment vars when running cluster tasks, right? Take a
 look a this link:
 http://sujee.net/tech/articles/hadoop/amazon-emr-beyond-basics/
 As it shows, you can run EMR setting some command line arguments that
 specify a script to be executed before the job starts, in each machine in
 the cluster. This way, you would be able to correctly set the vars you need.
  Out of curiosity, could you share what are you using for cassandra
 storage? I am currently using EC2 local disks, but I am looking for an
 alternative.

 Best regards,
 Marcelo.


 2013/1/4 William Oberman ober...@civicscience.com

 So I've made it work, but I don't get it yet.

 I have no idea why my DIY server works when I set the environment
 variables on the machine that kicks off pig (master), and in EMR it
 doesn't.  I recompiled ConfigHelper and CassandraStorage with tons of
 debugging, and in EMR I can see the hadoop Configuration object get the
 proper values on the master node, and I can see it does NOT propagate to
 the task threads.

 The other part that was driving me nuts could be made more user friendly.
  The issue is this: I started to try to set
 cassandra.thrift.address, cassandra.thrift.port,
 cassandra.partitioner.class in mapred-site.xml, and it didn't work.  After
 even more painful debugging, I noticed that the only time Cassandra sets
 the input/output versions of those settings (and these input/output
 specific versions are the only versions really used!) is when Cassandra
 maps the system environment variables.  So, having cassandra.thrift.address
 in mapred-site.xml does NOTHING, as I needed to
 have cassandra.output.thrift.address set.  It would be much nicer if the
 get{Input/Output}XYZ checked for the existence of getXYZ
 if get{Input/Output}XYZ is empty/null.  E.g. in getOutputThriftAddress(),
 if that setting is null, it would have been nice if that method returned
 getThriftAddress().  My problem went away when I put the full cross product
 in the XML. E.g. cassandra.input.thrift.address
 and cassandra.output.thrift.address (and port, and partitioner).

 I still want to know why the old easy way (of setting the 3 system
 variables on the pig starter box, and having the config flow into the task
 trackers) doesn't work!

 will


 On Fri, Jan 4, 2013 at 9:04 AM, William Oberman ober...@civicscience.com
  wrote:

 On all tasktrackers, I see:
 java.io.IOException: PIG_OUTPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS
 environment variable not set
 at
 org.apache.cassandra.hadoop.pig.CassandraStorage.setStoreLocation(CassandraStorage.java:821)
 at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setLocation(PigOutputFormat.java:170)
 at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.setUpContext(PigOutputCommitter.java:112)
 at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:86)
 at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.init(PigOutputCommitter.java:67)
 at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:279)
 at org.apache.hadoop.mapred.Task.initialize(Task.java:515)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:358)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396

Re: AWS EMR - Cassandra

2013-01-04 Thread William Oberman
On all tasktrackers, I see:
java.io.IOException: PIG_OUTPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS
environment variable not set
at
org.apache.cassandra.hadoop.pig.CassandraStorage.setStoreLocation(CassandraStorage.java:821)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setLocation(PigOutputFormat.java:170)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.setUpContext(PigOutputCommitter.java:112)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:86)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.init(PigOutputCommitter.java:67)
at
org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:279)
at org.apache.hadoop.mapred.Task.initialize(Task.java:515)
at org.apache.hadoop.mapred.MapTask.run(MapTask.java:358)
at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
at org.apache.hadoop.mapred.Child.main(Child.java:249)


On Thu, Jan 3, 2013 at 10:45 PM, aaron morton aa...@thelastpickle.comwrote:

 Instead, I get an error from CassandraStorage that the initial address
 isn't set (on the slave, the master is ok).

 Can you post the full error ?

 Cheers
-
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 4/01/2013, at 11:15 AM, William Oberman ober...@civicscience.com
 wrote:

 Anyone ever try to read or write directly between EMR and Cassandra?

 I'm running various Cassandra resources in Ec2, so the physical
 connection part is pretty easy using security groups.  But, I'm having
 some configuration issues.  I have managed to get Cassandra + Hadoop
 working in the past using a DIY hadoop cluster, and looking at the
 configurations in the two environments (EMR vs DIY), I'm not sure what's
 different that is causing my failures...  I should probably note I'm using
 the Pig integration of Cassandra.

 Versions: Hadoop 1.0.3, Pig 0.10, Cassandra 1.1.7.

 I'm 99% sure I have classpaths working (because I didn't at first, and now
 EMR can find and instantiate CassandraStorage on master and slaves).  What
 isn't working are the system variables.  In my DIY cluster, all I needed to
 do was:
 ---
 export PIG_INITIAL_ADDRESS=XXX
 export PIG_RPC_PORT=9160
 export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
 --
 And the task trackers somehow magically picked up the values (I never
 questioned how/why).  But, in EMR, they do not.  Instead, I get an error
 from CassandraStorage that the initial address isn't set (on the slave, the
 master is ok).

 My DIY cluster used CDH3, which was hadoop 0.20.something.  So, maybe the
 problem is a different version of hadoop?

 Looking at the CassandraStorage class, I realize I have no idea how it
 used to work, since it only seems to look at System variables.  Those
 variables are set on the Job.getConfiguration object.  I don't know how
 that part of hadoop works though... do variables that get set on Job on the
 master get propagated to the task threads?  I do know that on my DIY
 cluster, I do NOT set those system variables on the slaves...

 Thanks!

 will





Re: AWS EMR - Cassandra

2013-01-04 Thread William Oberman
So I've made it work, but I don't get it yet.

I have no idea why my DIY server works when I set the environment variables
on the machine that kicks off pig (master), and in EMR it doesn't.  I
recompiled ConfigHelper and CassandraStorage with tons of debugging, and in
EMR I can see the hadoop Configuration object get the proper values on the
master node, and I can see it does NOT propagate to the task threads.

The other part that was driving me nuts could be made more user friendly.
 The issue is this: I started to try to set
cassandra.thrift.address, cassandra.thrift.port,
cassandra.partitioner.class in mapred-site.xml, and it didn't work.  After
even more painful debugging, I noticed that the only time Cassandra sets
the input/output versions of those settings (and these input/output
specific versions are the only versions really used!) is when Cassandra
maps the system environment variables.  So, having cassandra.thrift.address
in mapred-site.xml does NOTHING, as I needed to
have cassandra.output.thrift.address set.  It would be much nicer if the
get{Input/Output}XYZ checked for the existence of getXYZ
if get{Input/Output}XYZ is empty/null.  E.g. in getOutputThriftAddress(),
if that setting is null, it would have been nice if that method returned
getThriftAddress().  My problem went away when I put the full cross product
in the XML. E.g. cassandra.input.thrift.address
and cassandra.output.thrift.address (and port, and partitioner).
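
For anyone hitting the same thing, the full cross product looks roughly like
this when set programmatically (a sketch as Hadoop Configuration calls rather
than mapred-site.xml; the two address names are the ones above, and I'm
assuming the port/partitioner names follow the same input/output pattern --
double-check ConfigHelper for your version):

import org.apache.hadoop.conf.Configuration;

public class CassandraJobConf {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // the input/output-specific names are the only ones the Cassandra
        // hadoop code actually reads, so set both sides explicitly
        conf.set("cassandra.input.thrift.address", "10.0.0.1");   // placeholder
        conf.set("cassandra.output.thrift.address", "10.0.0.1");  // placeholder
        conf.set("cassandra.input.thrift.port", "9160");
        conf.set("cassandra.output.thrift.port", "9160");
        conf.set("cassandra.input.partitioner.class",
                 "org.apache.cassandra.dht.RandomPartitioner");
        conf.set("cassandra.output.partitioner.class",
                 "org.apache.cassandra.dht.RandomPartitioner");
        return conf;
    }
}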

I still want to know why the old easy way (of setting the 3 system
variables on the pig starter box, and having the config flow into the task
trackers) doesn't work!

will


On Fri, Jan 4, 2013 at 9:04 AM, William Oberman ober...@civicscience.comwrote:

 On all tasktrackers, I see:
 java.io.IOException: PIG_OUTPUT_INITIAL_ADDRESS or PIG_INITIAL_ADDRESS
 environment variable not set
 at
 org.apache.cassandra.hadoop.pig.CassandraStorage.setStoreLocation(CassandraStorage.java:821)
 at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.setLocation(PigOutputFormat.java:170)
 at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.setUpContext(PigOutputCommitter.java:112)
 at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.getCommitters(PigOutputCommitter.java:86)
 at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputCommitter.init(PigOutputCommitter.java:67)
 at
 org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigOutputFormat.getOutputCommitter(PigOutputFormat.java:279)
 at org.apache.hadoop.mapred.Task.initialize(Task.java:515)
  at org.apache.hadoop.mapred.MapTask.run(MapTask.java:358)
 at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
 at java.security.AccessController.doPrivileged(Native Method)
 at javax.security.auth.Subject.doAs(Subject.java:396)
 at
 org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1132)
 at org.apache.hadoop.mapred.Child.main(Child.java:249)


 On Thu, Jan 3, 2013 at 10:45 PM, aaron morton aa...@thelastpickle.comwrote:

 Instead, I get an error from CassandraStorage that the initial address
 isn't set (on the slave, the master is ok).

 Can you post the full error ?

 Cheers
-
 Aaron Morton
 Freelance Cassandra Developer
 New Zealand

 @aaronmorton
 http://www.thelastpickle.com

 On 4/01/2013, at 11:15 AM, William Oberman ober...@civicscience.com
 wrote:

 Anyone ever try to read or write directly between EMR and Cassandra?

 I'm running various Cassandra resources in Ec2, so the physical
 connection part is pretty easy using security groups.  But, I'm having
 some configuration issues.  I have managed to get Cassandra + Hadoop
 working in the past using a DIY hadoop cluster, and looking at the
 configurations in the two environments (EMR vs DIY), I'm not sure what's
 different that is causing my failures...  I should probably note I'm using
 the Pig integration of Cassandra.

 Versions: Hadoop 1.0.3, Pig 0.10, Cassandra 1.1.7.

 I'm 99% sure I have classpaths working (because I didn't at first, and
 now EMR can find and instantiate CassandraStorage on master and slaves).
  What isn't working are the system variables.  In my DIY cluster, all I
 needed to do was:
 ---
 export PIG_INITIAL_ADDRESS=XXX
 export PIG_RPC_PORT=9160
 export PIG_PARTITIONER=org.apache.cassandra.dht.RandomPartitioner
 --
 And the task trackers somehow magically picked up the values (I never
 questioned how/why).  But, in EMR, they do not.  Instead, I get an error
 from CassandraStorage that the initial address isn't set (on the slave, the
 master is ok).

 My DIY cluster used CDH3, which was hadoop 0.20.something.  So, maybe the
 problem is a different version of hadoop?

 Looking at the CassandraStorage class, I realize I have no idea how it
 used to work, since it only seems to look

Re: remove DC

2012-11-13 Thread William Oberman
My situation is that DC2 was not written to, thanks!


On Mon, Nov 12, 2012 at 7:48 PM, Jeremiah Jordan
jeremiah.jor...@gmail.comwrote:

 If you have any data that you wrote to DC2 since the last time you ran
 repair, you should probably run repair to make sure that data made it over
 to DC1.  If you never wrote data directly to DC2, then you are correct: you
 don't need to run repair.

 You should just need to update the schema, and then decommission the node.

 -Jeremiah

 On Nov 12, 2012, at 2:25 PM, William Oberman ober...@civicscience.com
 wrote:

 There is a great guide here on how to add resources:

 http://www.datastax.com/docs/1.1/operations/cluster_management#adding-capacity

 What about deleting resources?  I'm thinking of removing a data center.
 Clearly I'd need to change strategy options, which is currently something
 like this:
 {DC1:3,DC2:1}
 to:
 {DC1:3})

 But, after that change, I'm wondering if anything else needs to happen?
  All of the data in DC1 is already in the correct spots, so I don't think I
 have to run repair or cleanup...

 will





remove DC

2012-11-12 Thread William Oberman
There is a great guide here on how to add resources:
http://www.datastax.com/docs/1.1/operations/cluster_management#adding-capacity

What about deleting resources?  I'm thinking of removing a data center.
Clearly I'd need to change strategy options, which is currently something
like this:
{DC1:3,DC2:1}
to:
{DC1:3})

But, after that change, I'm wondering if anything else needs to happen?
 All of the data in DC1 is already in the correct spots, so I don't think I
have to run repair or cleanup...

will


Re: hadoop consistency level

2012-10-18 Thread William Oberman
A recent thread made it sound like Brisk was no longer a datastax supported
thing (it's DataStax Enterprise, or DSE, now):
http://www.mail-archive.com/user@cassandra.apache.org/msg24921.html

In particular this response:
http://www.mail-archive.com/user@cassandra.apache.org/msg25061.html

On Thu, Oct 18, 2012 at 2:49 PM, Jean-Nicolas Boulay Desjardins 
jnbdzjn...@gmail.com wrote:

 Why don't you look into Brisk:
 http://www.datastax.com/docs/0.8/brisk/about_brisk


 On Thu, Oct 18, 2012 at 2:46 PM, Andrey Ilinykh ailin...@gmail.comwrote:

 Hello, everybody!
 I'm thinking about running hadoop jobs on top of the cassandra
 cluster. My understanding is that hadoop jobs read data from local nodes
 only. Does it mean the consistency level is always ONE?

 Thank you,
   Andrey





cassandra + pig

2012-10-11 Thread William Oberman
I'm wondering how many people are using cassandra + pig out there?  I
recently went through the effort of validating things at a much higher
level than I previously did(*), and found a few issues:
https://issues.apache.org/jira/browse/CASSANDRA-4748
https://issues.apache.org/jira/browse/CASSANDRA-4749
https://issues.apache.org/jira/browse/CASSANDRA-4789

In general, it seems like the widerow implementation still has rough edges.
 I'm concerned I'm not understanding why other people aren't using the
feature, and thus finding these problems.  Is everyone else just setting a
high static limit?  E.g.  LOAD 'cassandra://KEYSPACE/CF?limit=X' where X >=
the max number of columns for any key?  Is everyone else using data models
that result in keys with # columns always less than 1024?  Do newer versions
of hadoop consume the cassandra API in a way that works around these issues?  I'm
using CDH3 == hadoop 0.20.2, pig 0.8.1.

(*) I took a random subsample of 50,000 keys of my production data (approx
1M total key/value pairs, some keys having only a single value and some
having 1000's).  I then wrote both a pig script and simple procedural
version of the pig script.  Then I compared the results.  Obviously I
started with differences, though after locally patching my code to fix the
above 3 bugs (though, really only two issues), I now (finally) get the same
results.


Re: cassandra + pig

2012-10-11 Thread William Oberman
If you don't mind me asking, how are you handling the fact that pre-widerow
you are only getting a static number of columns per key (default 1024)?  Or
am I not understanding the limit concept?

On Thu, Oct 11, 2012 at 11:25 AM, Jeremy Hanna
jeremy.hanna1...@gmail.comwrote:

 The Dachis Group (where I just came from, now at DataStax) uses pig with
 cassandra for a lot of things.  However, we weren't using the widerow
 implementation yet since wide row support is new to 1.1.x and we were on
 0.7, then 0.8, then 1.0.x.

 I think since it's new to 1.1's hadoop support, it sounds like there are
 some rough edges like you say.  But issues that are reproducible on tickets
 for any problems are much appreciated and they will get addressed.

 On Oct 11, 2012, at 10:43 AM, William Oberman ober...@civicscience.com
 wrote:

  I'm wondering how many people are using cassandra + pig out there?  I
 recently went through the effort of validating things at a much higher
 level than I previously did(*), and found a few issues:
  https://issues.apache.org/jira/browse/CASSANDRA-4748
  https://issues.apache.org/jira/browse/CASSANDRA-4749
  https://issues.apache.org/jira/browse/CASSANDRA-4789
 
  In general, it seems like the widerow implementation still has rough
 edges.  I'm concerned I'm not understanding why other people aren't using
 the feature, and thus finding these problems.  Is everyone else just
 setting a high static limit?  E.g.  LOAD 'cassandra://KEYSPACE/CF?limit=X
 where X = the max size of any key?  Is everyone else using data models
 that result in keys with # columns always less than 1024?  Do newer version
 of hadoop consume the cassandra API in a way that work around these issues?
  I'm using CDH3 == hadoop 0.20.2, pig 0.8.1.
 
  (*) I took a random subsample of 50,000 keys of my production data
 (approx 1M total key/value pairs, some keys having only a single value and
 some having 1000's).  I then wrote both a pig script and simple procedural
 version of the pig script.  Then I compared the results.  Obviously I
 started with differences, though after locally patching my code to fix the
 above 3 bugs (though, really only two issues), I now (finally) get the same
 results.




Re: cassandra + pig

2012-10-11 Thread William Oberman
Thanks Jeremy!  Maybe figuring out how to do paging in pig would have been
easier, but I found the widerow setting first which led me where I am
today.  I don't mind helping to blaze trails, or contribute back when doing
so, but I usually try to follow rather than lead when it comes to
tools/software I choose to use.  I didn't realize how close to the edge I was
getting in this case :-)

On Thu, Oct 11, 2012 at 1:03 PM, Jeremy Hanna jeremy.hanna1...@gmail.comwrote:

 For our use case, we had a lot of narrow column families and the couple of
 column families that had wide rows, we did our own paging through them.  I
 don't recall if we did paging in pig or mapreduce but you should be able to
 do that in both since pig allows you to specify the slice start.

 On Oct 11, 2012, at 11:28 AM, William Oberman ober...@civicscience.com
 wrote:

  If you don't mind me asking, how are you handling the fact that
 pre-widerow you are only getting a static number of columns per key
 (default 1024)?  Or am I not understanding the limit concept?
 
  On Thu, Oct 11, 2012 at 11:25 AM, Jeremy Hanna 
 jeremy.hanna1...@gmail.com wrote:
  The Dachis Group (where I just came from, now at DataStax) uses pig with
 cassandra for a lot of things.  However, we weren't using the widerow
 implementation yet since wide row support is new to 1.1.x and we were on
 0.7, then 0.8, then 1.0.x.
 
  I think since it's new to 1.1's hadoop support, it sounds like there are
 some rough edges like you say.  But issues that are reproducible on tickets
 for any problems are much appreciated and they will get addressed.
 
  On Oct 11, 2012, at 10:43 AM, William Oberman ober...@civicscience.com
 wrote:
 
   I'm wondering how many people are using cassandra + pig out there?  I
 recently went through the effort of validating things at a much higher
 level than I previously did(*), and found a few issues:
   https://issues.apache.org/jira/browse/CASSANDRA-4748
   https://issues.apache.org/jira/browse/CASSANDRA-4749
   https://issues.apache.org/jira/browse/CASSANDRA-4789
  
   In general, it seems like the widerow implementation still has rough
 edges.  I'm concerned I'm not understanding why other people aren't using
 the feature, and thus finding these problems.  Is everyone else just
 setting a high static limit?  E.g.  LOAD 'cassandra://KEYSPACE/CF?limit=X
 where X = the max size of any key?  Is everyone else using data models
 that result in keys with # columns always less than 1024?  Do newer version
 of hadoop consume the cassandra API in a way that work around these issues?
  I'm using CDH3 == hadoop 0.20.2, pig 0.8.1.
  
   (*) I took a random subsample of 50,000 keys of my production data
 (approx 1M total key/value pairs, some keys having only a single value and
 some having 1000's).  I then wrote both a pig script and simple procedural
 version of the pig script.  Then I compared the results.  Obviously I
 started with differences, though after locally patching my code to fix the
 above 3 bugs (though, really only two issues), I now (finally) get the same
 results.
 
 
 




Re: pig and widerows

2012-09-27 Thread William Oberman
The next painful lesson for me was figuring out how to get logging working
for a distributed hadoop process.   In my test environment, I have a single
node that runs name/secondaryname/data/job trackers (call it central),
and I have two cassandra nodes running tasktrackers.  But, I also have
cassandra libraries on the central box, and invoke my pig script from
there.   I had been patching and recompiling cassandra (1.1.5 with my
logging, and the system env fix) on that central box, and SOME of the
logging was appearing in the pig output.  But, eventually I decided to move
that recompiled code to the tasktracker boxes, and then I found even more
of the logging I had added in:
/var/log/hadoop/userlogs/JOB_ID
on each of the tasktrackers.

Based on this new logging, I found out that the widerows setting wasn't
propagating from the central box to the tasktrackers.  I added:
export PIG_WIDEROW_INPUT=true
To hadoop-env.sh on each of the tasktrackers and it finally worked!

So, long story short, to actually get all columns for a key I had to:
1.) patch 1.1.5 to honor the PIG_WIDEROW_INPUT=true system setting
2.) add the system setting to ALL nodes in the hadoop cluster

I'm going to try to undo all of my other hacks to get logging/printing
working to confirm if those were actually the only two changes I had to
make.
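
(For the record, the patch in step 1 is tiny: the code quoted below tests
System.getenv but then reads System.getProperty, so a minimal sketch of the
fix is just to read the environment variable in both places, e.g.:)

// same shape as the quoted CassandraStorage check, but using getenv for
// both the test and the assignment
static boolean readWiderowsSetting(boolean defaultValue) {
    String fromEnv = System.getenv("PIG_WIDEROW_INPUT");
    return (fromEnv == null) ? defaultValue : Boolean.valueOf(fromEnv);
}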

will

On Thu, Sep 27, 2012 at 1:43 PM, William Oberman
ober...@civicscience.comwrote:

 Ok, this is painful.  The first problem I found is in stock 1.1.5 there is
 no way to set widerows to true!  The new widerows URI parsing is NOT in
 1.1.5.  And for extra fun, getting the value from the system property is
 BROKEN (at least in my centos linux environment).

 Here are the key lines of code (in CassandraStorage), note the different
 ways of getting the property!  getenv in the test, and getProperty in the
 set:
 widerows = DEFAULT_WIDEROW_INPUT;
 if (System.getenv(PIG_WIDEROW_INPUT) != null)
 widerows =
 Boolean.valueOf(System.getProperty(PIG_WIDEROW_INPUT));

 I added this logging:
 logger.warn("widerows = " + widerows + " getenv=" +
 System.getenv(PIG_WIDEROW_INPUT) +
 " getProp=" + System.getProperty(PIG_WIDEROW_INPUT));

 And I saw:
 org.apache.cassandra.hadoop.pig.CassandraStorage - widerows = false
 getenv=true getProp=null
 So for me getProperty != getenv :-(

 For people trying to figure out how to debug cassandra + hadoop + pig, for
 me the key to get debugging and logging working was to focus on
 /etc/hadoop/conf (not /etc/pig/conf as I expected).

 Also, if you want to compile your own cassandra (to add logging messages),
 make sure it appears first on the pig classpath (use pig -secretDebugCmd
 to see the fully qualified command line).

 The next thing I'm trying to figure out is why when widerows == true I'm
 STILL not seeing more than 1024 columns :-(

 will

 On Wed, Sep 26, 2012 at 3:42 PM, William Oberman ober...@civicscience.com
  wrote:

 Hi,

 I'm trying to figure out what's going on with my cassandra/hadoop/pig
 system.  I created a mini copy of my main cassandra data by randomly
 subsampling to get ~50,000 keys.  I was then writing pig scripts but also
 the equivalent operation using simple single threaded code to double check
 pig.

 Of course my very first test failed.  After doing a pig DUMP on the raw
 data, what appears to be happening is I'm only getting the first 1024
 columns of a key.  After some googling, this seems to be known behavior
 unless you add ?widerows=true to the pig load URI. I tried this, but
 it didn't seem to fix anything :-(   Here's the start of my pig script:
 foo = LOAD 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' USING
 CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name,
 value)});

 I'm using cassandra 1.1.5 from datastax rpms.  I'm using hadoop
 (0.20.2+923.418-1) and pig (0.8.1+28.39-1) from cloudera rpms.

 What am I doing wrong?  Or, how I can enable debugging/logging to next
 figure out what is going on?  I haven't had to debug hadoop+pig+cassandra
 much, other than doing DUMP/ILLUSTRATE from pig.

 will






Re: pig and widerows

2012-09-27 Thread William Oberman
I don't want to switch my cassandra to HEAD, but looking at the newest code
for CassandraStorage, I'm concerned the Uri parsing for widerows isn't
going to work.  setLocation first calls setLocationFromUri (which sets
widerows to the Uri value), but then sets widerows to a static value (which
is defined as false), and then it sets widerows to the system setting if it
exists.  That doesn't seem right...  ?

But setLocationFromUri also gets called from setStoreLocation, and I don't
really know the difference between setLocation and setStoreLocation in
terms of what is going on in terms of the integration between
cassandra/pig/hadoop.

will

On Thu, Sep 27, 2012 at 3:26 PM, William Oberman
ober...@civicscience.comwrote:

 The next painful lesson for me was figuring out how to get logging working
 for a distributed hadoop process.   In my test environment, I have a single
 node that runs name/secondaryname/data/job trackers (call it central),
 and I have two cassandra nodes running tasktrackers.  But, I also have
 cassandra libraries on the central box, and invoke my pig script from
 there.   I had been patching and recompiling cassandra (1.1.5 with my
 logging, and the system env fix) on that central box, and SOME of the
 logging was appearing in the pig output.  But, eventually I decided to move
 that recompiled code to the tasktracker boxes, and then I found even more
 of the logging I had added in:
 /var/log/hadoop/userlogs/JOB_ID
 on each of the tasktrackers.

 Based on this new logging, I found out that the widerows setting wasn't
 propagating from the central box to the tasktrackers.  I added:
 export PIG_WIDEROW_INPUT=true
 To hadoop-env.sh on each of the tasktrackers and it finally worked!

 So, long story short, to actually get all columns for a key I had to:
 1.) patch 1.1.5 to honor the PIG_WIDEROW_INPUT=true system setting
 2.) add the system setting to ALL nodes in the hadoop cluster

 I'm going to try to undo all of my other hacks to get logging/printing
 working to confirm if those were actually the only two changes I had to
 make.

 will


 On Thu, Sep 27, 2012 at 1:43 PM, William Oberman ober...@civicscience.com
  wrote:

 Ok, this is painful.  The first problem I found is in stock 1.1.5 there
 is no way to set widerows to true!  The new widerows URI parsing is NOT in
 1.1.5.  And for extra fun, getting the value from the system property is
 BROKEN (at least in my centos linux environment).

 Here are the key lines of code (in CassandraStorage), note the different
 ways of getting the property!  getenv in the test, and getProperty in the
 set:
 widerows = DEFAULT_WIDEROW_INPUT;
 if (System.getenv(PIG_WIDEROW_INPUT) != null)
 widerows =
 Boolean.valueOf(System.getProperty(PIG_WIDEROW_INPUT));

 I added this logging:
 logger.warn("widerows = " + widerows + " getenv=" +
 System.getenv(PIG_WIDEROW_INPUT) +
 " getProp=" + System.getProperty(PIG_WIDEROW_INPUT));

 And I saw:
 org.apache.cassandra.hadoop.pig.CassandraStorage - widerows = false
 getenv=true getProp=null
 So for me getProperty != getenv :-(

 For people trying to figure out how to debug cassandra + hadoop + pig,
 for me the key to get debugging and logging working was to focus on
 /etc/hadoop/conf (not /etc/pig/conf as I expected).

 Also, if you want to compile your own cassandra (to add logging
 messages), make sure it appears first on the pig classpath (use pig
 -secretDebugCmd to see the fully qualified command line).

 The next thing I'm trying to figure out is why when widerows == true I'm
 STILL not seeing more than 1024 columns :-(

 will

 On Wed, Sep 26, 2012 at 3:42 PM, William Oberman 
 ober...@civicscience.com wrote:

 Hi,

 I'm trying to figure out what's going on with my cassandra/hadoop/pig
 system.  I created a mini copy of my main cassandra data by randomly
 subsampling to get ~50,000 keys.  I was then writing pig scripts but also
 the equivalent operation using simple single threaded code to double check
 pig.

 Of course my very first test failed.  After doing a pig DUMP on the raw
 data, what appears to be happening is I'm only getting the first 1024
 columns of a key.  After some googling, this seems to be known behavior
 unless you add ?widerows=true to the pig load URI. I tried this, but
 it didn't seem to fix anything :-(   Here's the start of my pig script:
 foo = LOAD 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' USING
 CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name,
 value)});

 I'm using cassandra 1.1.5 from datastax rpms.  I'm using hadoop
 (0.20.2+923.418-1) and pig (0.8.1+28.39-1) from cloudera rpms.

 What am I doing wrong?  Or, how I can enable debugging/logging to next
 figure out what is going on?  I haven't had to debug hadoop+pig+cassandra
 much, other than doing DUMP/ILLUSTRATE from pig.

 will






pig and widerows

2012-09-26 Thread William Oberman
Hi,

I'm trying to figure out what's going on with my cassandra/hadoop/pig
system.  I created a mini copy of my main cassandra data by randomly
subsampling to get ~50,000 keys.  I was then writing pig scripts but also
the equivalent operation using simple single threaded code to double check
pig.

Of course my very first test failed.  After doing a pig DUMP on the raw
data, what appears to be happening is I'm only getting the first 1024
columns of a key.  After some googling, this seems to be known behavior
unless you add ?widerows=true to the pig load URI. I tried this, but
it didn't seem to fix anything :-(   Here's the start of my pig script:
foo = LOAD 'cassandra://KEYSPACE/COLUMN_FAMILY?widerows=true' USING
CassandraStorage() AS (key:chararray, columns:bag {column:tuple (name,
value)});

I'm using cassandra 1.1.5 from datastax rpms.  I'm using hadoop
(0.20.2+923.418-1) and pig (0.8.1+28.39-1) from cloudera rpms.

What am I doing wrong?  Or, how I can enable debugging/logging to next
figure out what is going on?  I haven't had to debug hadoop+pig+cassandra
much, other than doing DUMP/ILLUSTRATE from pig.

will


new nodetool ring output and unbalanced ring?

2012-09-06 Thread William Oberman
Hi,

I recently upgraded from 0.8.x to 1.1.x (through 1.0 briefly) and nodetool
-ring seems to have changed from owns to effectively owns.
 Effectively owns seems to account for replication factor (RF).  I'm ok
with all of this, yet I still can't figure out what's up with my cluster.
 I have a NetworkTopologyStrategy with two data centers (DCs) with
RF/number nodes in DC combinations of:
DC Name, RF, # in DC
analytics, 1, 2
us-east, 3, 4
So I'd expect 50% on each analytics node, and 75% for each us-east node.
 Instead, I have two nodes in us-east with 50/100??? (the other two are
75/75 as expected).

Here is the output of nodetool (all nodes report the same thing):
Address   DC         Rack  Status  State   Load        Effective-Ownership  Token
                                                                             127605887595351923798765477786913079296
x.x.x.x   us-east    1c    Up      Normal  94.57 GB    75.00%               0
x.x.x.x   analytics  1c    Up      Normal  60.64 GB    50.00%               1
x.x.x.x   us-east    1c    Up      Normal  131.76 GB   75.00%               42535295865117307932921825928971026432
x.x.x.x   us-east    1c    Up      Normal  43.45 GB    50.00%               85070591730234615865843651857942052864
x.x.x.x   analytics  1d    Up      Normal  60.88 GB    50.00%               85070591730234615865843651857942052865
x.x.x.x   us-east    1d    Up      Normal  98.56 GB    100.00%              127605887595351923798765477786913079296

If I use cassandra-cli to do show keyspaces; I get (and again, all nodes
report the same thing):
Keyspace: civicscience:
  Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
  Durable Writes: true
Options: [analytics:1, us-east:3]
I removed the output about all of my column families (CFs), hopefully that
doesn't matter.

Did I compute the tokens wrong?  Is there a combination of nodetool
commands I can run to migrate the data around to rebalance to 75/75/75/75?
 I routinely run repair already.  And as the release notes required, I ran
upgradesstables during the upgrade process.

Before the upgrade, I was getting analytics = 0%, and us-east = 25% on each
node, which I expected for owns.
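
(For reference, the arithmetic behind the tokens above: evenly spaced
RandomPartitioner tokens are i * 2^127 / N per DC, with the second DC's tokens
offset by +1 so they don't collide. A minimal Java sketch that reproduces the
us-east tokens:)

import java.math.BigInteger;

public class Tokens {
    public static void main(String[] args) {
        int nodesInDc = 4;  // e.g. us-east above; analytics would use 2
        BigInteger ringSize = BigInteger.valueOf(2).pow(127);
        for (int i = 0; i < nodesInDc; i++) {
            BigInteger token = ringSize.multiply(BigInteger.valueOf(i))
                                       .divide(BigInteger.valueOf(nodesInDc));
            // a second DC computes its own evenly spaced tokens from its own
            // node count, then adds 1 to each to avoid collisions
            System.out.println("node " + i + ": " + token);
        }
    }
}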

will


Re: new nodetool ring output and unbalanced ring?

2012-09-06 Thread William Oberman
Didn't notice the racks!  Of course

If I change a 1c to a 1d, what would I have to do to make sure data
shuffles around correctly?  Repair everywhere?

will

On Thu, Sep 6, 2012 at 2:09 PM, Tyler Hobbs ty...@datastax.com wrote:

 The main issue is that one of your us-east nodes is in rack 1d, while the
 rest are in rack 1c.  With NTS and multiple racks, Cassandra will try to
 use one node from each rack as a replica for a range until it either meets
 the RF for the DC, or runs out of racks, in which case it just picks nodes
 sequentially going clockwise around the ring (starting from the range being
 considered, not the last node that was chosen as a replica).

 To fix this, you'll either need to make the 1d node a 1c node, or make
 42535295865117307932921825928971026432 a 1d node so that you're alternating
 racks within that DC.
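
(A rough sketch of the selection rule described above -- illustrative only,
not the actual NetworkTopologyStrategy code:)

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Node {
    final String name;
    final String rack;
    Node(String name, String rack) { this.name = name; this.rack = rack; }
}

public class NtsSketch {
    // walk the ring clockwise starting at the range being considered,
    // taking one node per distinct rack; if the racks run out before RF is
    // met, fall back to sequential picks from the start of the same walk
    static List<Node> pickReplicas(List<Node> ringClockwiseFromRange, int rf) {
        List<Node> replicas = new ArrayList<Node>();
        Set<String> racksUsed = new HashSet<String>();
        for (Node n : ringClockwiseFromRange) {
            if (replicas.size() == rf) return replicas;
            if (racksUsed.add(n.rack)) replicas.add(n);
        }
        for (Node n : ringClockwiseFromRange) {
            if (replicas.size() == rf) break;
            if (!replicas.contains(n)) replicas.add(n);
        }
        return replicas;
    }
}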


 On Thu, Sep 6, 2012 at 12:54 PM, William Oberman ober...@civicscience.com
  wrote:

 Hi,

 I recently upgraded from 0.8.x to 1.1.x (through 1.0 briefly) and
 nodetool -ring seems to have changed from owns to effectively owns.
  Effectively owns seems to account for replication factor (RF).  I'm ok
 with all of this, yet I still can't figure out what's up with my cluster.
  I have a NetworkTopologyStrategy with two data centers (DCs) with
 RF/number nodes in DC combinations of:
 DC Name, RF, # in DC
 analytics, 1, 2
 us-east, 3, 4
 So I'd expect 50% on each analytics node, and 75% for each us-east node.
  Instead, I have two nodes in us-east with 50/100??? (the other two are
 75/75 as expected).

 Here is the output of nodetool (all nodes report the same thing):
 Address DC  RackStatus State   Load
  Effective-Ownership Token

  127605887595351923798765477786913079296
 x.x.x.x   us-east 1c  Up Normal  94.57 GB75.00%
0
 x.x.x.x   analytics   1c  Up Normal  60.64 GB50.00%
1
 x.x.x.x   us-east 1c  Up Normal  131.76 GB   75.00%
42535295865117307932921825928971026432
 x.x.x.xus-east 1c  Up Normal  43.45 GB50.00%
  85070591730234615865843651857942052864
 x.x.x.xanalytics   1d  Up Normal  60.88 GB50.00%
  85070591730234615865843651857942052865
 x.x.x.x   us-east 1d  Up Normal  98.56 GB100.00%
 127605887595351923798765477786913079296

 If I use cassandra-cli to do show keyspaces; I get (and again, all
 nodes report the same thing):
 Keyspace: civicscience:
   Replication Strategy:
 org.apache.cassandra.locator.NetworkTopologyStrategy
   Durable Writes: true
 Options: [analytics:1, us-east:3]
 I removed the output about all of my column families (CFs), hopefully
 that doesn't matter.

 Did I compute the tokens wrong?  Is there a combination of nodetool
 commands I can run to migrate the data around to rebalance to 75/75/75/75?
  I routinely run repair already.  And as the release notes required, I ran
 upgradesstables during the upgrade process.

 Before the upgrade, I was getting analytics = 0%, and us-east = 25% on
 each node, which I expected for owns.

 will




 --
 Tyler Hobbs
 DataStax http://datastax.com/




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: Professional Support

2011-09-06 Thread William Oberman
I also have used datastax with great success (same disclaimer).

A specific example:
-I set up a one-on-one call to talk through an issue, in my case a server
reconfiguration.  It took 2 days to find a time to meet, though that was my
fault as I believe they could have worked me in within a day.  I wanted to
split an existing cluster into 'oltp' and 'analytics', similar to what brisk
does now out of the box.
-During the call they walked me through all of the steps I'd have to do,
answered any questions I had, and filled in the blanks for some of the
reasoning behind their recommendations.
-After the call I received constant support through the reconfiguration.
 For example: I found out that Ec2Snitch doesn't play nicely with
PropertyFileSnitch in a rolling restart (all of the Ec2Snitch based servers
stopped working immediately as soon as a PropFileSnitch server joined the
ring, this is in 0.8.4), and they wrote a custom patch for me that made it
work within a day.
-In particular, Ben and Jackson helped me, so if either of you read the user
list, thanks again!

will

On Tue, Sep 6, 2011 at 1:25 PM, Jim Ancona j...@anconafamily.com wrote:

 We use Datastax (http://www.datastax.com) and we have been very happy
 with the support we've received.

 We haven't tried any of the other providers on that page, so I can't
 comment on them.

 Jim
 (Disclaimer: no connection with Datastax other than as a satisfied
 customer.)

 On Tue, Sep 6, 2011 at 1:15 PM, China Stoffen chinastof...@yahoo.com
 wrote:
   There is a link to a page which lists a few professional support providers on
   the Cassandra homepage. I have contacted a few of them; a couple are no longer
   providing support and others didn't reply. So, do you know of any
   professional support providers for Cassandra, and how much do they
   charge per year?
 



Re: cassandra 0.8.4 + pig (using cloudera rpms)

2011-09-05 Thread William Oberman
Yes, my cluster is working.

I didn't realize it at the time, but the StorageService link I listed is
already in 0.8.4, so yes the only file I had to patch was VersionedValue.
 Not sure what was going on with the pig jars, but after more configuration
changes than I can count, I'm pretty sure removing pig.jar in favor of the
cloudera pig jar was the magic bullet (for the ClassNotFound I was getting,
in this case TException).

One final note: in production I had to patch all of my cassandra servers
(OLTP and analytics)* with the VersionedValue file for it to work (though, I
did forget one setting, so now I'm still not 100% sure I had to patch all of
them, but it's working now).

will

OLTP = vanilla cassandra
analytics = cassandra + tasktracker.  I'm not on brisk yet, so I've been
rolling my own.


On Mon, Sep 5, 2011 at 1:41 AM, Jeremy Hanna jeremy.hanna1...@gmail.comwrote:

 Thanks William - so you were able to get everything running correctly,
 right?

 FWIW, we're in the process of upgrading to 0.8.4 and found that all we
 needed was that first link you mentioned - the VersionedValue modification.
  It's running fine on our staging cluster and we're in the process of moving
 to production.  We're currently using pig from cdhu0.  All we did was
 replace the 0.8.4 jars after installing the debian packages for 0.8.4.

 Not sure if that helps anyone, but thought I would share what we've seen.

 Btw, this shouldn't be a problem once 0.8.5 comes out.

 On Sep 4, 2011, at 11:03 AM, William Oberman wrote:

  I've had some troubles, so I thought I'd pass on my various bug fixes:
 
  -Cass 0.8.4 has troubles with pig/hadoop (you get NPE's when trying to
 connect to cassandra in the pig logs).  You need this patch:
  http://svn.apache.org/viewvc?revision=1158940&view=revision
  And maybe this:
  http://svn.apache.org/viewvc?revision=1155157&view=revision
 
  -I had installed from riptano rpms.  I downloaded the src, applied the
 patch, and did ant jar.  I then replaced the rpm installed cassandra jar
 with this new one (ugly, but I wanted to continue to run from the package).
 
  -I think I was able to just replace the apache-cassandra-0.8.4.jar on
 just my jobtracker + tasktracker nodes (I need to retest from scratch to be
 sure, I've done a _lot_ of configuring and reconfiguring)
 
  -Then I started getting ClassNotFound exceptions during map/reduce tasks.
  Still not sure why this fix works, but the problem seems to be cloudera pig
 version 0.20.2+923.97-1 has two jars that match pig*.jar (which is what
 cassandra contrib/pig/bin/pig_cassandra uses to setup the classpath).  I had
 to rename /usr/lib/pig/pig.jar for things to work (leaving
 pig-0.8.1-cdh3u1-core.jar as the only match).
 
  My pig script is still running, but it's the first time it didn't
 immediately crash.
 
  will
 
 




cassandra 0.8.4 + pig (using cloudera rpms)

2011-09-04 Thread William Oberman
I've had some troubles, so I thought I'd pass on my various bug fixes:

-Cass 0.8.4 has troubles with pig/hadoop (you get NPE's when trying to connect 
to cassandra in the pig logs).  You need this patch:
http://svn.apache.org/viewvc?revision=1158940&view=revision
And maybe this:
http://svn.apache.org/viewvc?revision=1155157&view=revision

-I had installed from riptano rpms.  I downloaded the src, applied the patch, 
and did ant jar.  I then replaced the rpm installed cassandra jar with this 
new one (ugly, but I wanted to continue to run from the package).

-I think I was able to just replace the apache-cassandra-0.8.4.jar on just my 
jobtracker + tasktracker nodes (I need to retest from scratch to be sure, I've 
done a _lot_ of configuring and reconfiguring)

-Then I started getting ClassNotFound exceptions during map/reduce tasks.  
Still not sure why this fix works, but the problem seems to be cloudera pig 
version 0.20.2+923.97-1 has two jars that match pig*.jar (which is what 
cassandra contrib/pig/bin/pig_cassandra uses to setup the classpath).  I had to 
rename /usr/lib/pig/pig.jar for things to work (leaving 
pig-0.8.1-cdh3u1-core.jar as the only match).

My pig script is still running, but it's the first time it didn't immediately 
crash.

will




Re: how to migrate?

2011-08-25 Thread William Oberman


  create keyspace civicscience with replication_factor=3 and
 strategy_options = [{us-east:3}] and
 placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy';

 FYI the replication_factor property with the NTS is incorrect, the next(?)
 revision of 0.8 will raise an error on restart.


I'm not sure what you're saying.  Should it have been:
create keyspace civicscience with strategy_options = [{us-east:3}] and
placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy';
?


 I'm wondering if I write my own snitch that extends Ec2Snitch with
 overrides as follows:
  getDC = if(AZ == c || d) return us-east (to keep current nodes the
  same) else return us-east-hadoop;
 getRack = return super(); (returning a,b,c,d seems ok)

 prob easier to use the PropertyFileSnitch, see the yaml file and the
 conf/cassandra-topology.properties . You can then manually put the nodes
 into the DC and Rack you want.


I'll read up on them next then.  The Ec2Snitch seemed tempting to use, given
I was in Ec2 ;-)
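
In case anyone searches the archives for this later, here's roughly what I had
in mind for the custom snitch.  This is a completely untested sketch against
the 0.8-era Ec2Snitch (method and constructor signatures from memory), so
treat it as pseudocode rather than a working class:

import java.io.IOException;
import java.net.InetAddress;
import org.apache.cassandra.config.ConfigurationException;
import org.apache.cassandra.locator.Ec2Snitch;

// Untested sketch: keep the existing 1c/1d nodes in "us-east" and report any
// other availability zone (e.g. 1a/1b) as a separate "us-east-hadoop" DC.
public class HadoopAwareEc2Snitch extends Ec2Snitch
{
    public HadoopAwareEc2Snitch() throws IOException, ConfigurationException
    {
        super();
    }

    @Override
    public String getDatacenter(InetAddress endpoint)
    {
        String rack = getRack(endpoint); // Ec2Snitch reports the AZ suffix (1a, 1b, 1c, 1d) as the rack
        if ("1c".equals(rack) || "1d".equals(rack))
            return "us-east";            // existing OLTP nodes stay where they are
        return "us-east-hadoop";         // anything booted into another AZ becomes the analytics DC
    }
}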

  -Can I (how do I safely) change the keyspace strategy_options
 from [{us-east:3}] to [{us-east:2, us-east-hadoop:1}]   This seems like
 the riskiest/most complicated step of everything I've proposed...

 http://wiki.apache.org/cassandra/Operations#Replication


The wiki has changed since I last read it (I guess that happens) ;-)  I
think I understand how to make changes and migrate data around now.
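
For the archives, my understanding of the actual change (please correct me if
I'm off) is: update the keyspace via cassandra-cli, then repair and clean up on
every node, something like:

update keyspace civicscience
  with placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options=[{us-east:2, us-east-hadoop:1}];

followed by a rolling "nodetool repair" on every node (so the us-east-hadoop
replicas actually get streamed their data) and "nodetool cleanup" on the
us-east nodes afterwards to drop the ranges they no longer hold.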


how to migrate?

2011-08-24 Thread William Oberman
I was hoping to transition my simple cassandra cluster (where each node is a 
cassandra + hadoop tasktracker) to a cluster with two virtual datacenters 
(vanilla cassandra vs. cassandra + hadoop tasktracker), based on this:
http://wiki.apache.org/cassandra/HadoopSupport#ClusterConfig
The problem I'm having is that my hadoop jobs are getting heavy enough that they're
affecting my user-facing performance on my cluster.

Right now I'm in AWS, and I have 4 nodes in us-east split over two availability
zones (us-east-1c that I'll call "c" and us-east-1d that I'll call "d"),
set up with this keyspace:
create keyspace civicscience with replication_factor=3 and strategy_options = 
[{us-east:3}] and 
placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy';
And I'm using the Ec2Snitch.

I'm wondering if I write my own snitch that extends Ec2Snitch with overrides as 
follows:
getDC = if(AZ == c || d) return us-east (to keep current nodes the same)
else return us-east-hadoop;
getRack = return super(); (returning a,b,c,d seems ok)

Then, if I boot N new nodes into us-east-1[a,b] they will be hadoop nodes 
because of the snitch.  I'll obviously have to change my home brew cassandra + 
hadoop instances to selectively run task trackers or not (a/b = yes, and c/d = 
no).

But:
-Is the overall RF=3 still ok?
-What is the recommended split between normal and hadoop in terms of 
strategy_options (assuming RF=3)?  2/1?  
-Can I (how do I safely) change the keyspace strategy_options from 
[{us-east:3}] to [{us-east:2, us-east-hadoop:1}]?   This seems like the
riskiest/most complicated step of everything I've proposed...
-After I change the options, what (if anything) would I have to do to migrate 
data around?  

One final question: should I add new nodes as Brisk instances instead of my 
home brew cassandra + hadoop nodes?  I've obviously already put in the 
pain/effort of learning how to run hadoop + cassandra...

Thanks for any help/advice!

will



Re: Survey: Cassandra/JVM Resident Set Size increase

2011-07-14 Thread William Oberman
I finally upgraded from 0.7.4 to 0.8.0 (using riptano packages) 2 days ago.
Before, my resident memory (for the java process) would slowly grow without
bound and the OS would kill the process.  But, over the last 2 days, I
_think_ it's been stable.  I'll let you know in a week :-)

My other stats:
AWS large (64 bit, 7.5GB, 4 compute units, no swap by default and I didn't
enable it manually)
Centos 5.6
Sun  1.6.0_24-b07
2 column families
4 machine cluster with RF=3
Mostly balanced write/read load (usually more writes)
Not quite big data volumes, large 10^6 or small 10^7 ops/day
No deletes or mutations, I only add or read

Everything else is stock, I haven't tuned anything as performance was ok.
No JVM options other than what was in the package.  No JNA.  Not sure the GC
patterns.

will

On Tue, Jul 12, 2011 at 9:28 AM, Chris Burroughs
chris.burrou...@gmail.comwrote:

 ### Preamble

 There have been several reports on the mailing list of the JVM running
 Cassandra using too much memory.  That is, the resident set size is greater
 than (max java heap size + mmaped segments) and continues to grow until the
 process swaps, kernel oom killer comes along, or performance just
 degrades too far due to the lack of space for the page cache.  It has
 been unclear from these reports if there is a pattern.  My hope here is
 that by comparing JVM versions, OS versions, JVM configuration etc., we
 will find something.  Thank you everyone for your time.


 Some example reports:
  - http://www.mail-archive.com/user@cassandra.apache.org/msg09279.html
  -

 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Very-high-memory-utilization-not-caused-by-mmap-on-sstables-td5840777.html
  - https://issues.apache.org/jira/browse/CASSANDRA-2868
  -

 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/OOM-or-what-settings-to-use-on-AWS-large-td6504060.html
  -

 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-memory-problem-td6545642.html

 For reference theories include (in no particular order):
  - memory fragmentation
  - JVM bug
  - OS/glibc bug
  - direct memory
  - swap induced fragmentation
  - some other bad interaction of cassandra/jdk/jvm/os/nio-insanity.

 ### Survey

 1. Do you think you are experiencing this problem?

 2.  Why? (This is a good time to share a graph like
 http://www.twitpic.com/5fdabn or
 http://img24.imageshack.us/img24/1754/cassandrarss.png)

 2. Are you using mmap? (If yes be sure to have read
 http://wiki.apache.org/cassandra/FAQ#mmap , and explain how you have
  used pmap [or another tool] to rule out mmap and top deceiving you.)

  3. Are you using JNA?  Was mlockall successful (it's in the logs on
 startup)?

 4. Is swap enabled? Are you swapping?

 5. What version of Apache Cassandra are you using?

 6. What is the earliest version of Apache Cassandra you recall seeing
 this problem with?

 7. Have you tried the patch from CASSANDRA-2654 ?

 8. What jvm and version are you using?

 9. What OS and version are you using?

 10. What are your jvm flags?

 11. Have you tried limiting direct memory (-XX:MaxDirectMemorySize)

 12. Can you characterise how much GC your cluster is doing?

 13. Approximately how many read/writes per unit time is your cluster
 doing (per node or the whole cluster)?

  14.  How are your column families configured (key cache size, row cache
 size, etc.)?




Re: What does a write lock ?

2011-07-08 Thread William Oberman
Disregard most of my post (already).  I forgot that reads aren't isolated.
 That means A and B are states cassandra will *eventually* be in, but at any
point in time a read might see a partial B (where some columns are still
A, and others are B).  Though, I'm sure someone else will confirm if I'm
wrong yet again.

For me, if I need two pieces of data to be consistently related to each
other and stored in cassandra, I encode them (usually JSON) and store them
in one column.
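
A trivial sketch of what I mean (the write call at the end is pseudocode for
whatever client wrapper you use, not a real API; the Jackson mapper is the one
that already ships on the Cassandra classpath):

import java.util.HashMap;
import java.util.Map;
import org.codehaus.jackson.map.ObjectMapper;

public class JsonColumnSketch
{
    public static void main(String[] args) throws Exception
    {
        // Two fields that must always be read/written together...
        Map<String, Object> related = new HashMap<String, Object>();
        related.put("score", 42);
        related.put("scoredAt", "2011-07-08T12:00:00Z");

        // ...become one opaque column value, so a single column write (and read)
        // always sees both of them at the same point in time.
        String json = new ObjectMapper().writeValueAsString(related);
        System.out.println(json);

        // pseudocode: client.insert(rowKey, "state", json)  -- use your client of choice
    }
}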

will

On Fri, Jul 8, 2011 at 8:30 AM, William Oberman ober...@civicscience.comwrote:

 Questions like this seem to come up a lot:

 http://stackoverflow.com/questions/6033888/cassandra-atomicity-isolation-of-column-updates-on-a-single-row-on-on-single-no

 http://stackoverflow.com/questions/2055037/cassandra-atomic-reads-writes-within-a-single-columnfamily
 http://www.mail-archive.com/user@cassandra.apache.org/msg14701.html

 Lets say you read state A (from one key in one CF), you change the data to
 A' in your client, and you write A'.  Are you worried that someone else
 might have changed A to B during this process (making the new state a race
 between A' and B)?  It doesn't sound to me like you are...  It sounds to me
 like you're worried about a set of columns for the key being in a consistent
  state before, during, and after a process.  And A -> A' and A -> B will each
 be atomic for the key (based on my understanding).  But, if A' and B are
 changes to a different set of columns, I believe that would interleave,
 which itself could be inconsistent from your application's point of view.


 will

 On Thu, Jul 7, 2011 at 11:41 PM, Jeffrey Kesselman jef...@gmail.comwrote:

  Really, as I lay in the bath thinking about it, I concluded what I am
 looking for is a very limited form of Consistency.

 Its consistency over a single row on a single node just for the period of
 update.


 On Thu, Jul 7, 2011 at 10:34 PM, Jeffrey Kesselman jef...@gmail.comwrote:

 Its not really isolation, btw, because we
 arent talking about anyone seeing an update mid-update.Rather, we
 are talking about when updates are allowed to occur.

 Atomicity means that all the updates happen together or they don't happen
 at all.
 Isolation means that no results of the update are visible until the
 entire update operation is complete.

 This really lies somewhere in the middle of the two concepts.   Its part
 of the results of the combined effects of ACID


 On Thu, Jul 7, 2011 at 10:27 PM, Jonathan Ellis jbel...@gmail.comwrote:

 Sounds to me like you're confusing atomicity with isolation.

 On Thu, Jul 7, 2011 at 2:54 PM, Jeffrey Kesselman jef...@gmail.com
 wrote:
  Yup, im even more confused.Lets talk about the model, not the
  implementation.
  AIUI updates to a row are atomic across all columns in that row at
 once,
  true?
  If true then the next question is, does the validation happen inside
 or
  outside of that guarantee, and is the row guaranteed not to change
 between
  validation and update?
  If that is *not* the case then it makes a whole class of solutions to
  synchronization problems fail and puts my larger project
  in serious question.
 
  On Thu, Jul 7, 2011 at 3:43 PM, Yang tedd...@gmail.com wrote:
 
  no , the memtable is a concurrentskiplistmap
 
  insertion can happen in parallel
 
  On Jul 7, 2011 9:24 AM, Jeffrey Kesselman jef...@gmail.com
 wrote:
   This has me more confused.
  
   Does this mean that ALL rows on a given node are only updated
   sequentially,
   never in parallel?
  
   On Thu, Jul 7, 2011 at 3:21 PM, Yang tedd...@gmail.com
 wrote:
  
   just to add onto what jonathan said
  
   the columns are immutable . if u overwrite/ reconcile a new obj is
   created and shoved into the memtable
  
   there is a shared lock for all writes though which guard against
 an
   exclusive lock on memtable switching/flushing
   On Jul 7, 2011 7:51 AM, A J s5a...@gmail.com wrote:
Does a write lock:
1. Just the columns in question for the specific row in question
 ?
2. The full row in question ?
3. The full CF ?
   
I doubt read does any locks.
   
Thanks.
  
  
  
  
   --
   It's always darkest just before you are eaten by a grue.
 
 
 
  --
  It's always darkest just before you are eaten by a grue.
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com




 --
 It's always darkest just before you are eaten by a grue.




 --
 It's always darkest just before you are eaten by a grue.







Re: What does a write lock ?

2011-07-08 Thread William Oberman
I think you need to look into Zookeeper, or other distributed coordinator,
as you have little/no guarantees from cassandra between 1-3 (in terms of the
guarantees you want and need).

And my terminology in my post is different than yours.  My client == your
server.  Specifically, I was thinking in terms of:
user -> cassandra client code (that runs on a server) -> cassandra server
code (e.g. cassandra itself) that runs either on the same or different
server


On Fri, Jul 8, 2011 at 10:22 AM, Jeffrey Kesselman jef...@gmail.com wrote:

 Not quite, its more limited and specific

 The order of operations is all within the Cassandra node server and looks
 like this this...

 We have one row, A.  Thats the only row being operated on.

 Client - submits A'
 Server does the following:
 (1) Validate function reads current A
 (2) Validate function validates A' vs. A
 (3) If validation succeeds, allows update to A'.

 My fear/concern is that after 1 and before 3, a second update to A'' comes
  in and changes the current value of A, therefore invalidating my
 validation check, see?

 If Cassandra does not guard against this then one possible
 solution would be to make my own key-to-mutex map in memory, lock the mutex
 for A's key as a precursor to (1) and release it in a post-update function.
  But I am always very nervous about inserting locking into a process that
 wasn't designed with it already in mind...


 On Fri, Jul 8, 2011 at 8:30 AM, William Oberman 
 ober...@civicscience.comwrote:

 Questions like this seem to come up a lot:

 http://stackoverflow.com/questions/6033888/cassandra-atomicity-isolation-of-column-updates-on-a-single-row-on-on-single-no

 http://stackoverflow.com/questions/2055037/cassandra-atomic-reads-writes-within-a-single-columnfamily
 http://www.mail-archive.com/user@cassandra.apache.org/msg14701.html

 Lets say you read state A (from one key in one CF), you change the data to
 A' in your client, and you write A'.  Are you worried that someone else
 might have changed A to B during this process (making the new state a race
 between A' and B)?  It doesn't sound to me like you are...  It sounds to me
 like you're worried about a set of columns for the key being in a consistent
 state before, during, and after a process.  And A - A' and A - B will each
 be atomic for the key (based on my understanding).  But, if A' and B are
 changes to a different set of columns, I believe that would interleave,
 which itself could be inconsistent from your application's point of view.


 will

 On Thu, Jul 7, 2011 at 11:41 PM, Jeffrey Kesselman jef...@gmail.comwrote:

 Really, as i lay in the bath thinking nabout it, I concluded what I am
 looking for is a very limited form of Consistency.

 Its consistency over a single row on a single node just for the period of
 update.


 On Thu, Jul 7, 2011 at 10:34 PM, Jeffrey Kesselman jef...@gmail.comwrote:

 Its not really isolation, btw, because we
 arent talking about anyone seeing an update mid-update.Rather, we
 are talking about when updates are allowed to occur.

 Atomicity means that all the updates happen together or they don't
 happen at all.
 Isolation means that no results of the update are visible until the
 entire update operation is complete.

 This really lies somewhere in the middle of the two concepts.   Its part
 of the results of the combined effects of ACID


 On Thu, Jul 7, 2011 at 10:27 PM, Jonathan Ellis jbel...@gmail.comwrote:

 Sounds to me like you're confusing atomicity with isolation.

 On Thu, Jul 7, 2011 at 2:54 PM, Jeffrey Kesselman jef...@gmail.com
 wrote:
  Yup, im even more confused.Lets talk about the model, not the
  implementation.
  AIUI updates to a row are atomic across all columns in that row at
 once,
  true?
  If true then the next question is, does the validation happen inside
 or
  outside of that guarantee, and is the row guaranteed not to change
 between
  validation and update?
  If that is *not* the case then it makes a whole class of solutions to
  synchronization problems fail and puts my larger project
  in serious question.
 
  On Thu, Jul 7, 2011 at 3:43 PM, Yang tedd...@gmail.com wrote:
 
  no , the memtable is a concurrentskiplistmap
 
  insertion can happen in parallel
 
  On Jul 7, 2011 9:24 AM, Jeffrey Kesselman jef...@gmail.com
 wrote:
   This has me more confused.
  
   Does this mean that ALL rows on a given node are only updated
   sequentially,
   never in parallel?
  
   On Thu, Jul 7, 2011 at 3:21 PM, Yang tedd...@gmail.com
 wrote:
  
   just to add onto what jonathan said
  
   the columns are immutable . if u overwrite/ reconcile a new obj
 is
   created and shoved into the memtable
  
   there is a shared lock for all writes though which guard against
 an
   exclusive lock on memtable switching/flushing
   On Jul 7, 2011 7:51 AM, A J s5a...@gmail.com wrote:
Does a write lock:
1. Just the columns in question for the specific row in
 question ?
2. The full row in question

Re: What does a write lock ?

2011-07-08 Thread William Oberman
Also, one point of early confusion for me is there is a slightly different
definition of atomicity depending on whether you're talking software vs.
database, and I'm a software guy.  From wikipedia:

Software = Atomicity is a guarantee of isolation from concurrent processes.
Additionally, atomic operations commonly have a succeed-or-fail definition —
they either successfully change the state of the system, or have no visible
effect.

Database = In an atomic transaction, a series of database operations either
all occur, or nothing occurs.

I believe that cassandra is using the database definition.

will


On Fri, Jul 8, 2011 at 10:35 AM, William Oberman
ober...@civicscience.comwrote:

 I think you need to look into Zookeeper, or other distributed coordinator,
 as you have little/no guarantees from cassandra between 1-3 (in terms of the
 guarantees you want and need).

 And my terminology in my post is different than yours.  My client == your
 server.  Specifically, I was thinking in terms of:
 user - cassandra client code (that runs on a server) - cassandra server
 code (e.g. cassandra itself) that runs either on the same or different
 server



 On Fri, Jul 8, 2011 at 10:22 AM, Jeffrey Kesselman jef...@gmail.comwrote:

 Not quite, its more limited and specific

 The order of operations is all within the Cassandra node server and looks
 like this this...

 We have one row, A.  Thats the only row being operated on.

 Client - submits A'
 Server does the following:
 (1) Validate function reads current A
 (2) Validate function validates A' vs. A
 (3) If validation succeeds, allows update to A'.

 My fear/concern is that after 1 and before 3, a second update to A'' comes
 in and changes the current value of A, therefor invalidating my
 validation check, see?

 If Cassandra does not guard against this then one possible
 solution would be to make my own key-to-mutex map in memory, lock the mutex
 for A's key as a precursor to (1) and release it in a post-update function.
  But I am always very nervous about inserting locking into a process that
 wasn't designed with it already in mind...


 On Fri, Jul 8, 2011 at 8:30 AM, William Oberman ober...@civicscience.com
  wrote:

 Questions like this seem to come up a lot:

 http://stackoverflow.com/questions/6033888/cassandra-atomicity-isolation-of-column-updates-on-a-single-row-on-on-single-no

 http://stackoverflow.com/questions/2055037/cassandra-atomic-reads-writes-within-a-single-columnfamily
 http://www.mail-archive.com/user@cassandra.apache.org/msg14701.html

 Lets say you read state A (from one key in one CF), you change the data
 to A' in your client, and you write A'.  Are you worried that someone else
 might have changed A to B during this process (making the new state a race
 between A' and B)?  It doesn't sound to me like you are...  It sounds to me
 like you're worried about a set of columns for the key being in a consistent
 state before, during, and after a process.  And A - A' and A - B will each
 be atomic for the key (based on my understanding).  But, if A' and B are
 changes to a different set of columns, I believe that would interleave,
 which itself could be inconsistent from your application's point of view.


 will

 On Thu, Jul 7, 2011 at 11:41 PM, Jeffrey Kesselman jef...@gmail.comwrote:

 Really, as i lay in the bath thinking nabout it, I concluded what I am
 looking for is a very limited form of Consistency.

 Its consistency over a single row on a single node just for the period
 of update.


 On Thu, Jul 7, 2011 at 10:34 PM, Jeffrey Kesselman jef...@gmail.comwrote:

 Its not really isolation, btw, because we
 arent talking about anyone seeing an update mid-update.Rather, we
 are talking about when updates are allowed to occur.

 Atomicity means that all the updates happen together or they don't
 happen at all.
 Isolation means that no results of the update are visible until the
 entire update operation is complete.

 This really lies somewhere in the middle of the two concepts.   Its
 part of the results of the combined effects of ACID


 On Thu, Jul 7, 2011 at 10:27 PM, Jonathan Ellis jbel...@gmail.comwrote:

 Sounds to me like you're confusing atomicity with isolation.

 On Thu, Jul 7, 2011 at 2:54 PM, Jeffrey Kesselman jef...@gmail.com
 wrote:
  Yup, im even more confused.Lets talk about the model, not the
  implementation.
  AIUI updates to a row are atomic across all columns in that row at
 once,
  true?
  If true then the next question is, does the validation happen inside
 or
  outside of that guarantee, and is the row guaranteed not to change
 between
  validation and update?
  If that is *not* the case then it makes a whole class
 of solutions to
  synchronization problems fail and puts my larger project
  in serious question.
 
  On Thu, Jul 7, 2011 at 3:43 PM, Yang tedd...@gmail.com wrote:
 
  no , the memtable is a concurrentskiplistmap
 
  insertion can happen in parallel
 
  On Jul 7, 2011 9:24 AM, Jeffrey Kesselman

Re: What does a write lock ?

2011-07-08 Thread William Oberman
I use a language specific wrapper around thrift as my client, but yes, I
guess I fundamentally mean thrift == client, and the cassandra server ==
server.

will

On Fri, Jul 8, 2011 at 11:08 AM, Jeffrey Kesselman jef...@gmail.com wrote:

 I am confused by what you mean by Cassandra client code.  Is this part of
  the Cassandra server?

 My architecture is my user talks thrift to Cassandra.





Re: What does a write lock ?

2011-07-08 Thread William Oberman
I haven't ever written my own org.apache.cassandra.db.marshal.AbstractType
(which is, I think, what you're talking about), so I have no idea.

Looking up the JavaDoc for that class, validate says validate that the byte
array is a valid sequence for the type we are supposed to be comparing,
which sounds like a local operation to me (e.g. it shouldn't fetch remote
data, it's just saying yep, this is a valid member of type T).

will

On Fri, Jul 8, 2011 at 11:17 AM, Jeffrey Kesselman jef...@gmail.com wrote:

 Alright,

 So are you saying the column validator, as specified
 by conf/storage-conf.xml is checked in the client interface library and not
 on the server side?  That seems odd to me on a number of levels, not the
 least being I cant see how thrift could autogenerate that
 for different languages or how those other languages would use a Java class.
 On Fri, Jul 8, 2011 at 11:13 AM, William Oberman ober...@civicscience.com
  wrote:

 I use a language specific wrapper around thrift as my client, but yes, I
 guess I fundamentally mean thrift == client, and the cassandra server ==
 server.

 will


 On Fri, Jul 8, 2011 at 11:08 AM, Jeffrey Kesselman jef...@gmail.comwrote:

 I am confused by what you mean by Cassandra client code.  Is this part
 of the Cassnadra server?

 My architecture is my user talks thrift to Cassandra.






 --
 It's always darkest just before you are eaten by a grue.



Re: Cassandra memory problem

2011-07-07 Thread William Oberman
I think I had (and have) a similar problem:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/OOM-or-what-settings-to-use-on-AWS-large-td6504060.html
My memory usage grew slowly until I ran out of mem and the OS killed my
process (due to no swap).

I'm still on 0.7.4, but I'm rolling out 0.8.1 next week, which I was hoping
would fix the problem.  I'm using Centos with Sun 1.6.0_24-b07

will

On Thu, Jul 7, 2011 at 7:41 AM, Daniel Doubleday
daniel.double...@gmx.netwrote:

 Hm - had to digg deeper and it totally looks like a native mem leak to me:

  We are still growing with res += 100MB a day. Cassandra is > 8G now

 I checked the cassandra process with pmap -x

 Here's the human readable (aggregated) output:

 Format is thingy: RSS in KB

 Summary:

 Total SST: 1961616
 Anon RSS: 6499640

 Total RSS: 8478376

 Here's a little more detail:

 SSTables (data and index files)
 **
 Attic: 0
 PrivateChatNotification: 38108
 Schema: 0
 PrivateChat: 161048
 UserData: 116788
 HintsColumnFamily: 0
 Rooms: 100548
 Tracker: 476
 Migrations: 0
 ObjectRepository: 793680
 BlobStore: 350924
 Activities: 400044
 LocationInfo: 0

 Libraries
 **
 javajar: 2292
 nativelib: 13028

 Other
 **
 28201: 32
 jna979649866618987247.tmp: 92
 locale-archive: 1492
 [stack]: 132
 java: 44
 ffi8TsQPY(deleted): 8

 And
 **
 [anon]: 6499640


 Maybe the output of pmap is totally misleading but my interpretation is
 that only 2GB of RSS is attributed to paged in sstables.
 I have one large anon block which looks like this:

 Address   Kbytes RSS   Dirty Mode   Mapping
 00073f60   0 3093248 3093248 rwx--[ anon ]

 This is the native heap thats been allocated on startup and mlocked

 So theres still 3.5GB of anon memory.

 We haven't deployed https://issues.apache.org/jira/browse/CASSANDRA-2654 yet
 and this might be part of it but I don't think thats the main problem.
 As I said mem goes up by 100MB each day pretty linearly.

 Would be great if anyone could verify this by running pmap or talk my off
 the roof by explaining that nothing's the way it seems.

 All this might be heavily OS specific so maybe that's only on Debian?

 Thanks a lot
 Daniel

 On Jul 4, 2011, at 2:42 PM, Jonathan Ellis wrote:

 mmap'd data will be attributed to res, but the OS can page it out
 instead of killing the process.

 On Mon, Jul 4, 2011 at 5:52 AM, Daniel Doubleday
 daniel.double...@gmx.net wrote:

 Hi all,

 we have a mem problem with cassandra. res goes up without bounds (well
 until

 the os kills the process because we dont have swap)

 I found a thread that's about the same problem but on OpenJDK:


 http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Very-high-memory-utilization-not-caused-by-mmap-on-sstables-td5840777.html

 We are on Debian with Sun JDK.

 Resident mem is 7.4G while heap is restricted to 3G.

 Anyone else is seeing this with Sun JDK?

 Cheers,

 Daniel

 :/home/dd# java -version

 java version 1.6.0_24

 Java(TM) SE Runtime Environment (build 1.6.0_24-b07)

 Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)

 :/home/dd# ps aux |grep java

 cass 28201  9.5 46.8 372659544 7707172 ?   SLl  May24 5656:21

 /usr/bin/java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42

 -Xms3000M -Xmx3000M -Xmn400M ...

    PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND



 28201 cass  20   0  355g 7.4g 1.4g S8 46.9   5656:25 java







 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com





cassandra/hadoop/pig

2011-07-06 Thread William Oberman
I have a few cassandra/hadoop/pig questions.  I currently have things set up
in a test environment, and for the most part everything works.  But, before
I start to roll things out to production, I wanted to check on/confirm some
things.

When I originally set things up, I used:
http://wiki.apache.org/cassandra/HadoopSupport
http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html

One difference I noticed between the two guides, which I ignored at the
time, was how datanodes are treated.  The wiki said "At least one node in
your cluster will also need to be a datanode. That's because Hadoop uses
HDFS to store information like jar dependencies for your job, static data
(like stop words for a word count), and things like that - it's the
distributed cache. It's a very small amount of data but the Hadoop cluster
needs it to run properly."  But, the hadoop guide (if you follow it blindly
like I did), creates a datanode on all TaskTracker nodes.  I _think_ that is
controlled by the conf/slaves file, but I haven't proved that yet.  Is there
any good reason to run datanodes on only the JobTracker vs. on all nodes?
If I should only run it on the JobTracker, how do I properly stop the
datanodes from starting automatically (when both start-dfs and start-mapred
seem to draw from the same slaves file)?

I noticed a second issue/oddness with datanodes, in that the HDFS data isn't
always small.  The other day I ran out of disk running my pig script.  I
checked, and by default, hadoop creates HDFS in /tmp, and I'm using EC2 (and
/tmp is on the boot device) which is only 10G by default.  Do other people
put HDFS on a different disk?  If yes, I'll really want to only run one
datanode, as I don't want to re-template all of my cassandra nodes to have
HDFS disks vs. one new JobTracker node.

In terms of hardware, I am running small instances (32bit, 2GB) in the test
cluster, while my production cluster is larges (64bit, 7 or 8GB).  I was
going to check the performance impact there, but even on smalls in test I
was able to run hadoop jobs while serving web requests.  I am wondering if
smalls are causing the high HDFS usage though (I think data might spill
more, if I'm understanding things correctly).

If these are more hadoop than cassandra questions, let me know and I'll move
my questions around.

I did want to mention that these are small details compared to the amount of
complicated things that worked like a charm during my configuration and
testing of the combination of cassandra/hadoop/pig.  It was impressive :-)

Thanks!

will


Re: cassandra/hadoop/pig

2011-07-06 Thread William Oberman
That makes sense.  The problem is I jumped directly to using pig, which is
abstracting some of the data flow from me.  I guess I'll have to figure out
what it's doing under the covers, to know how to optimize/fix bottlenecks.

But for now, I'm taking this information to mean I should run datanodes
with HDFS on larger non-root disks on all tasktracker nodes to ensure my pig
scripts work, until I'm willing to either write the M/R code myself, or
figure out how to optimize pig and/or the pig script.
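
For anyone who hits the same /tmp problem: the relevant knobs (as far as I can
tell, so double-check against your Hadoop version) are hadoop.tmp.dir in
core-site.xml and dfs.data.dir in hdfs-site.xml; the paths below are just
examples pointed at a bigger ephemeral disk:

<!-- hdfs-site.xml (sketch): keep HDFS block storage off the small root volume -->
<property>
  <name>dfs.data.dir</name>
  <value>/mnt/hadoop/dfs/data</value>
</property>

<!-- core-site.xml (sketch): map/reduce spill data also defaults under hadoop.tmp.dir -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/mnt/hadoop/tmp</value>
</property>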

will

On Wed, Jul 6, 2011 at 3:29 PM, Edward Capriolo edlinuxg...@gmail.comwrote:



 On Wed, Jul 6, 2011 at 2:48 PM, William Oberman 
 ober...@civicscience.comwrote:

 I have a few cassandra/hadoop/pig questions.  I currently have things set
 up in a test environment, and for the most part everything works.  But,
 before I start to roll things out to production, I wanted to check
 on/confirm some things.

 When I originally set things up, I used:
 http://wiki.apache.org/cassandra/HadoopSupport
 http://hadoop.apache.org/common/docs/r0.20.203.0/cluster_setup.html

 One difference I noticed between the two guides, which I ignored at the
 time, was how datanodes are treated.  The wiki said At least one node in
 your cluster will also need to be a datanode. That's because Hadoop uses
 HDFS to store information like jar dependencies for your job, static data
 (like stop words for a word count), and things like that - it's the
 distributed cache. It's a very small amount of data but the Hadoop cluster
 needs it to run properly.  But, the hadoop guide (if you follow it blindly
 like I did), creates a datanode on all TaskTracker nodes.  I _think_ that is
 controlled by the conf/slaves file, but I haven't proved that yet.  Is there
 any good reason to run datanodes on only the JobTracker vs. on all nodes?
 If I should only run it on the JobTracker, how do I properly stop the
 datanodes from starting automatically (when both start-dfs and start-mapred
 seem to draw from the same slaves file)?

 I noticed a second issue/oddness with datanodes, in that the HDFS data
 isn't always small.  The other day I ran out of disk running my pig script.
 I checked, and by default, hadoop creates HDFS in /tmp, and I'm using EC2
 (and /tmp is on the boot device) which is only 10G by default.  Do other
 people put HDFS on a different disk?  If yes, I'll really want to only run
 one datanode, as I don't want to re-template all of my cassandra nodes to
 have HDFS disks vs. one new JobTracker node.

 In terms of hardware, I am running small instances (32bit, 2GB) in the
 test cluster, while my production cluster is larges (64bit, 7 or 8GB).  I
 was going to check the performance impact there, but even on smalls in test
 I was able to run hadoop jobs while serving web requests.  I am wondering if
 smalls are causing the high HDFS usage though (I think data might spill
 more, if I'm understanding things correctly).

 If these are more hadoop then cassandra questions, let me know and I'll
 move my questions around.

 I did want to mention that these are small details compared to the amount
 of complicated things that worked like a charm during my configuration and
 testing of the combination of cassandra/hadoop/pig.  It was impressive :-)

 Thanks!

 will


 The logic that only one datanode is needed is not an absolute truth. If
 your jobs use ColumnFamilyInputFormat to read and write to
 ColumnFamilyOutputFormat then technically you only need one DataNode to hold
 the distributed cache. However, if you have a large amount of intermediate
 results or even a multiphase job that has to persist data between phases
 (this is very very common) then that single DataNode is a bottleneck. Most
 hadoop clusters run a DataNode and TaskTracker on each slave.

 Most situations would use datanodes very heavily, for example suppose you
 have 4 map/reduce jobs to run on the same Cassandra data. Ingesting the data
  from Cassandra at the beginning of each job would be wasteful. It
 might be better to take the data into HDFS during the first job and then
 save it. Your subsequent jobs could use that instead of re-acquiring it from
 Cassandra.



Re: Strong Consistency with ONE read/writes

2011-07-03 Thread William Oberman
Was just going off of: "Send the value to the primary replica and send
placeholder values to the other replicas."  Sounded like you wanted to write
the value to one, and write the placeholder to N-1 to me.  But, C* will
propagate the value to N-1 eventually anyways, 'cause that's just what it
does anyways :-)

will

On Sun, Jul 3, 2011 at 7:47 PM, AJ a...@dude.podzone.net wrote:

 **
 On 7/3/2011 3:49 PM, Will Oberman wrote:

 Why not send the value itself instead of a placeholder?  Now it takes 2x
 writes on a random node to do a single update (write placeholder, write
 update) and N*x writes from the client (write value, write placeholder to
 N-1). Where N is replication factor.  Seems like extra network and IO
 instead of less...


 To send the value to each node is 1.) unnecessary, 2.) will only cause a
 large burst of network traffic.  Think about if it's a large data value,
 such as a document.  Just let C* do it's thing.  The extra messages are tiny
 and doesn't significantly increase latency since they are all sent
 asynchronously.



 Of course, I still think this sounds like reimplementing Cassandra
 internals in a Cassandra client (just guessing, I'm not a cassandra dev)


 I don't see how.  Maybe you should take a peek at the source.



 On Jul 3, 2011, at 5:20 PM, AJ a...@dude.podzone.net wrote:

   Yang,

 How would you deal with the problem when the 1st node responds success but
 then crashes before completely forwarding any replicas?  Then, after
 switching to the next primary, a read would return stale data.

 Here's a quick-n-dirty way:  Send the value to the primary replica and send
 placeholder values to the other replicas.  The placeholder value is
 something like, PENDING_UPDATE.  The placeholder values are sent with
 timestamps 1 less than the timestamp for the actual value that went to the
 primary.  Later, when the changes propagate, the actual values will
 overwrite the placeholders.  In event of a crash before the placeholder gets
 overwritten, the next read value will tell the client so.  The client will
 report to the user that the key/column is unavailable.  The downside is
 you've overwritten your data and maybe would like to know what the old data
 was!  But, maybe there's another way using other columns or with MVCC.  The
 client would want a success from the primary and the secondary replicas to
 be certain of future read consistency in case the primary goes down
 immediately as I said above.  The ability to set an update_pending flag on
 any column value would probably make this work.  But, I'll think more on
 this later.

 aj

 On 7/2/2011 10:55 AM, Yang wrote:

  there is a JIRA completed in 0.7.x that "Prefers a certain node in snitch",
 so this does roughly what you want MOST of the time


  but the problem is that it does not GUARANTEE that the same node will
 always be read.  I recently read into the HBase vs Cassandra comparison
 thread that started after Facebook dropped Cassandra for their messaging
 system, and understood some of the differences. what you want is essentially
 what HBase does. the fundamental difference there is really due to the
 gossip protocol: it's a probablistic, or eventually consistent failure
 detector  while HBase/Google Bigtable use Zookeeper/Chubby to provide a
 strong failure detector (a distributed lock).  so in HBase, if a tablet
 server goes down, it really goes down, it can not re-grab the tablet from
 the new tablet server without going through a start up protocol (notifying
 the master, which would notify the clients etc),  in other words it is
 guaranteed that one tablet is served by only one tablet server at any given
 time.  in comparison the above JIRA only TRYIES to serve that key from one
 particular replica. HBase can have that guarantee because the group
 membership is maintained by the strong failure detector.

  just for hacking curiosity, a strong failure detector + Cassandra
 replicas is not impossible (actually seems not difficult), although the
 performance is not clear. what would such a strong failure detector bring to
 Cassandra besides this ONE-ONE strong consistency ? that is an interesting
 question I think.

  considering that HBase has been deployed on big clusters, it is probably
 OK with the performance of the strong  Zookeeper failure detector. then a
 further question was: why did Dynamo originally choose to use the
 probablistic failure detector? yes Dynamo's main theme is eventually
 consistent, so the Phi-detector is **enough**, but if a strong detector
 buys us more with little cost, wouldn't that  be great?



 On Fri, Jul 1, 2011 at 6:53 PM, AJ a...@dude.podzone.net wrote:

 Is this possible?

 All reads and writes for a given key will always go to the same node from
 a client.  It seems the only thing needed is to allow the clients to compute
  which node is the closest replica for the given key using the same algorithm
 C* uses.  When the first replica receives the write request, it will write
 to 

Re: Strong Consistency with ONE read/writes

2011-07-03 Thread William Oberman
I'm using cassandra as a tool, like a black box with a certain contract to
the world.  Without modifying the core, C* will send the updates to all
replicas, so your plan would cause the extra write (for the placeholder).  I
wasn't assuming a modification to how C* fundamentally works.

Sounds like you are hacking (or at least looking) at the source, so all the
power to you if/when you try these kind of changes.

will

On Sun, Jul 3, 2011 at 8:45 PM, AJ a...@dude.podzone.net wrote:

 **
 On 7/3/2011 6:32 PM, William Oberman wrote:

 Was just going off of:  Send the value to the primary replica and send
 placeholder values to the other replicas.  Sounded like you wanted to write
 the value to one, and write the placeholder to N-1 to me.


 Yes, that is what I was suggesting.  The point of the placeholders is to
 handle the crash case that I talked about... like a WAL does.


 But, C* will propagate the value to N-1 eventually anyways, 'cause that's
 just what it does anyways :-)

  will

 On Sun, Jul 3, 2011 at 7:47 PM, AJ a...@dude.podzone.net wrote:

  On 7/3/2011 3:49 PM, Will Oberman wrote:

 Why not send the value itself instead of a placeholder?  Now it takes 2x
 writes on a random node to do a single update (write placeholder, write
 update) and N*x writes from the client (write value, write placeholder to
 N-1). Where N is replication factor.  Seems like extra network and IO
 instead of less...


  To send the value to each node is 1.) unnecessary, 2.) will only cause a
 large burst of network traffic.  Think about if it's a large data value,
 such as a document.  Just let C* do it's thing.  The extra messages are tiny
 and doesn't significantly increase latency since they are all sent
 asynchronously.



 Of course, I still think this sounds like reimplementing Cassandra
 internals in a Cassandra client (just guessing, I'm not a cassandra dev)



  I don't see how.  Maybe you should take a peek at the source.



 On Jul 3, 2011, at 5:20 PM, AJ a...@dude.podzone.net wrote:

   Yang,

 How would you deal with the problem when the 1st node responds success but
 then crashes before completely forwarding any replicas?  Then, after
 switching to the next primary, a read would return stale data.

 Here's a quick-n-dirty way:  Send the value to the primary replica and
 send placeholder values to the other replicas.  The placeholder value is
 something like, PENDING_UPDATE.  The placeholder values are sent with
 timestamps 1 less than the timestamp for the actual value that went to the
 primary.  Later, when the changes propagate, the actual values will
 overwrite the placeholders.  In event of a crash before the placeholder gets
 overwritten, the next read value will tell the client so.  The client will
 report to the user that the key/column is unavailable.  The downside is
 you've overwritten your data and maybe would like to know what the old data
 was!  But, maybe there's another way using other columns or with MVCC.  The
 client would want a success from the primary and the secondary replicas to
 be certain of future read consistency in case the primary goes down
 immediately as I said above.  The ability to set an update_pending flag on
 any column value would probably make this work.  But, I'll think more on
 this later.

 aj

 On 7/2/2011 10:55 AM, Yang wrote:

 there is a JIRA completed in 0.7.x that Prefers a certain node in
 snitch, so this does roughly what you want MOST of the time


  but the problem is that it does not GUARANTEE that the same node will
 always be read.  I recently read into the HBase vs Cassandra comparison
 thread that started after Facebook dropped Cassandra for their messaging
 system, and understood some of the differences. what you want is essentially
 what HBase does. the fundamental difference there is really due to the
 gossip protocol: it's a probablistic, or eventually consistent failure
 detector  while HBase/Google Bigtable use Zookeeper/Chubby to provide a
 strong failure detector (a distributed lock).  so in HBase, if a tablet
 server goes down, it really goes down, it can not re-grab the tablet from
 the new tablet server without going through a start up protocol (notifying
 the master, which would notify the clients etc),  in other words it is
 guaranteed that one tablet is served by only one tablet server at any given
 time.  in comparison the above JIRA only TRYIES to serve that key from one
 particular replica. HBase can have that guarantee because the group
 membership is maintained by the strong failure detector.

  just for hacking curiosity, a strong failure detector + Cassandra
 replicas is not impossible (actually seems not difficult), although the
 performance is not clear. what would such a strong failure detector bring to
 Cassandra besides this ONE-ONE strong consistency ? that is an interesting
 question I think.

  considering that HBase has been deployed on big clusters, it is probably
 OK with the performance of the strong  Zookeeper failure

Re: Strong Consistency with ONE read/writes

2011-07-02 Thread William Oberman
Ok, I see the "you happen to choose the 'right' node" idea, but it sounds
like you want to solve C* problems in the client, and they already wrote
that complicated code to make clients simple.   You're talking about
reimplementing key-node mappings, network topology (with failures), etc...
 Plus, if they change something about replication and you get too tricky,
your code breaks.  Or, if they optimize something, you might not benefit.

On Jul 1, 2011, at 10:33 PM, AJ a...@dude.podzone.net wrote:

I'm saying I will make my clients forward the C* requests to the first
replica instead of forwarding to a random node.
-- 
Sent from my Android phone with K-9 Mail. Please excuse my brevity.

Will Oberman ober...@civicscience.com wrote:



 Sent from my iPhone

 On Jul 1, 2011, at 9:53 PM, AJ a...@dude.podzone.net wrote:

  Is this possible?
 
  All reads and writes for a given key will always go to the same node
  from a client.

 I don't think that's true. Given a key K, the client will write to N
 nodes (N=replication factor). And at consistency level ONE the client
 will return after 1 ack (from the N writes).




Re: hadoop results

2011-06-30 Thread William Oberman
I think I'll do the former, thanks!
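
For the archives, roughly what that looks like against the raw Thrift API (an
untested sketch using the 0.7/0.8-era generated classes; substitute your own
keyspace, column family, and stat key):

import java.nio.ByteBuffer;
import java.util.List;
import org.apache.cassandra.thrift.*;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;

public class LatestStatValue
{
    public static void main(String[] args) throws Exception
    {
        TFramedTransport transport = new TFramedTransport(new TSocket("localhost", 9160));
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();
        client.set_keyspace("civicscience");

        // Slice the row in reverse comparator order and keep a single column:
        // with a TimeUUIDType comparator that's the greatest (most recent) TimeUUID.
        SliceRange range = new SliceRange(ByteBuffer.wrap(new byte[0]),  // start: unbounded
                                          ByteBuffer.wrap(new byte[0]),  // finish: unbounded
                                          true,                          // reversed
                                          1);                            // count
        SlicePredicate predicate = new SlicePredicate().setSlice_range(range);

        List<ColumnOrSuperColumn> result = client.get_slice(
                ByteBuffer.wrap("ClassOfSomething/ID/MetricName".getBytes("UTF-8")),
                new ColumnParent("stats"), predicate, ConsistencyLevel.ONE);

        if (!result.isEmpty())
        {
            Column latest = result.get(0).getColumn();
            System.out.println(new String(latest.getValue(), "UTF-8"));
        }
        transport.close();
    }
}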

On Wed, Jun 29, 2011 at 11:16 PM, aaron morton aa...@thelastpickle.comwrote:

 How about  get_slice() with reversed == true and count = 1 to get the
 highest time UUID ?

 Or you can also store a column with a magic name that have the value of the
 timeuuid that is the current metric to use.

 Cheers

 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 30 Jun 2011, at 06:35, William Oberman wrote:

  I'll start with my question: given a CF with comparator TimeUUIDType,
 what is the most efficient way to get the greatest column's value?
 
  Context: I've been running cassandra for a couple of months now, so
 obviously it's time to start layering more on top :-)  In my test
 environment, I managed to get pig/hadoop running, and developed a few
 scripts to collect metrics I've been missing since I switched from MySQL to
 cassandra (including the ever useful select count(*) from table
 equivalent).
 
  I was hoping to dump the results of this processing back into cassandra
 for use in other tools/processes.  My initial thought was: new CF called
 stats with comparator TimeUUIDType.  The basic idea being I'd store:
  stat_name - time stat was computed (as UUID) - value
  That way I can also see a historical perspective of any given stat for
 auditing (and for cumulative stats to see trends).  The stat_name itself is
 a URI that is composed of what and any constraints on the what
 (including an optional time range, if the stat supports it).  E.g.
 ClassOfSomething/ID/MetricName/OptionalTimeRange (or something, still
 deciding on the format of the URI).  But, right now, the only way I know to
 get the current stat value would be to iterate over all columns (the
 TimeUUIDs) and then return the last one.
 
  Thanks for any tips,
 
  will




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


hadoop results

2011-06-29 Thread William Oberman
I'll start with my question: given a CF with comparator TimeUUIDType, what
is the most efficient way to get the greatest column's value?

Context: I've been running cassandra for a couple of months now, so
obviously it's time to start layering more on top :-)  In my test
environment, I managed to get pig/hadoop running, and developed a few
scripts to collect metrics I've been missing since I switched from MySQL to
cassandra (including the ever useful select count(*) from table
equivalent).

I was hoping to dump the results of this processing back into cassandra for
use in other tools/processes.  My initial thought was: new CF called stats
with comparator TimeUUIDType.  The basic idea being I'd store:
stat_name - time stat was computed (as UUID) - value
That way I can also see a historical perspective of any given stat for
auditing (and for cumulative stats to see trends).  The stat_name itself is
a URI that is composed of what and any constraints on the what
(including an optional time range, if the stat supports it).  E.g.
ClassOfSomething/ID/MetricName/OptionalTimeRange (or something, still
deciding on the format of the URI).  But, right now, the only way I know to
get the current stat value would be to iterate over all columns (the
TimeUUIDs) and then return the last one.

Thanks for any tips,

will


Re: Backup/Restore: Coordinating Cassandra Nodetool Snapshots with Amazon EBS Snapshots?

2011-06-23 Thread William Oberman
I've been doing EBS snapshots for mysql for some time now, and was using a
similar pattern as Josep (XFS with freeze, snap, unfreeze), with the extra
complication that I was actually using 8 EBS's in RAID-0 (and the extra
extra complication that I had to lock the MyISAM tables... glad to be moving
away from that).  For cassandra I switched to ephemeral disks, as per
recommendations from this forum.
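For the curious, the freeze/snap/unfreeze step looked roughly like this per
volume (a sketch only, assuming boto and an XFS mount; the mount point and
volume id are placeholders, and with a RAID-0 setup every member volume gets
its own create_snapshot call inside the freeze):

import subprocess
import boto

MOUNT = '/mnt/mysql'          # placeholder XFS mount point
VOLUME_ID = 'vol-xxxxxxxx'    # placeholder EBS volume id

conn = boto.connect_ec2()
subprocess.check_call(['xfs_freeze', '-f', MOUNT])   # quiesce the filesystem
try:
    # returns almost immediately; the delta/compress/store happens behind the scenes
    conn.create_snapshot(VOLUME_ID, description='nightly backup')
finally:
    subprocess.check_call(['xfs_freeze', '-u', MOUNT])  # always unfreeze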

One note on EBS snapshots though: the last time I checked (which was some
time ago) I noticed degraded IO performance on the box during the
snapshotting process even though the take snapshot command returns almost
immediately.  My theory back then was that amazon does the
delta/compress/store outside of the VM, but it obviously has an effect on
resources on the box the VM runs on.  I was doing this on a mysql slave that
no one talked to, so I didn't care/bother looking into it further.

will

On Thu, Jun 23, 2011 at 10:30 AM, Peter Schuller 
peter.schul...@infidyne.com wrote:

  EBS volume atomicity is good. We've had tons of experience since EBS
 came
  out almost 4 years ago,  to back all kinds of things, including large
 DBs.

 And thanks a lot for coming forward with production experience. That
 is always useful with these things.

 --
 / Peter Schuller




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


OOM (or, what settings to use on AWS large?)

2011-06-22 Thread William Oberman
I woke up this morning to all 4 of 4 of my cassandra instances reporting
they were down in my cluster.  I quickly started them all, and everything
seems fine.  I'm doing a postmortem now, but it appears they all OOM'd at
roughly the same time, which was not reported in any cassandra log, but I
discovered something in /var/log/kern that showed java died of oom(*).  In
amazon, I'm using large instances for cassandra, and they have no swap (as
recommended), so I have ~8GB of ram.  Should I use a different max mem
setting?  I'm using a stock rpm from riptano/datastax.  If I run ps -aux I
get:

/usr/bin/java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
-Xms3843M -Xmx3843M -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss128k
-XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
-Djava.net.preferIPv4Stack=true -Djava.rmi.server.hostname=X.X.X.X
-Dcom.sun.management.jmxremote.port=8080
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.authenticate=false -Dmx4jaddress=0.0.0.0
-Dmx4jport=8081 -Dlog4j.configuration=log4j-server.properties
-Dlog4j.defaultInitOverride=true
-Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -cp
:/etc/cassandra/conf:/usr/share/cassandra/lib/antlr-3.1.3.jar:/usr/share/cassandra/lib/apache-cassandra-0.7.4.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-collections-3.2.1.jar:/usr/share/cassandra/lib/commons-lang-2.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.1.jar:/usr/share/cassandra/lib/guava-r05.jar:/usr/share/cassandra/lib/high-scale-lib.jar:/usr/share/cassandra/lib/jackson-core-asl-1.4.0.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.4.0.jar:/usr/share/cassandra/lib/jetty-6.1.21.jar:/usr/share/cassandra/lib/jetty-util-6.1.21.jar:/usr/share/cassandra/lib/jline-0.9.94.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/jug-2.0.0.jar:/usr/share/cassandra/lib/libthrift-0.5.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/mx4j-tools.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.6.1.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.6.1.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar
org.apache.cassandra.thrift.CassandraDaemon

(*) Also, why would they all OOM so close to each other?  Bad luck?  Or once
the first node went down, is there an increased chance of the rest?

I'm still on 0.7.4, when I released cassandra to production that was the
latest release.  In addition to (or instead of?) fixing memory settings, I'm
guessing I should upgrade.

will


Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread William Oberman
Well, I managed to run 50 days before an OOM, so any changes I make will
take a while to test ;-)  I've seen the GCInspector log lines appear
periodically in my logs, but I didn't see a correlation with the crash.

I'll read the instructions on how to properly do a rolling upgrade today,
practice on test, and try that on production first.

will

On Wed, Jun 22, 2011 at 8:41 AM, Sasha Dolgy sdo...@gmail.com wrote:

 We had a similar problem last month and found that the OS eventually
 killed the Cassandra process on each of our nodes ... I've upgraded to
 0.8.0 from 0.7.6-2 and have not had the problem since, but I do see
 consumption levels rising consistently from one day to the next on each
 node ..

 On Wed, Jun 1, 2011 at 2:30 PM, Sasha Dolgy sdo...@gmail.com wrote:
  is there a specific string I should be looking for in the logs that
  isn't super obvious to me at the moment...
 
  On Tue, May 31, 2011 at 8:21 PM, Jonathan Ellis jbel...@gmail.com
 wrote:
  The place to start is with the statistics Cassandra logs after each GC.

 look for GCInspector

 I found this in the logs on all my servers but never did much after
 that

 On Wed, Jun 22, 2011 at 2:33 PM, William Oberman
 ober...@civicscience.com wrote:
  I woke up this morning to all 4 of 4 of my cassandra instances reporting
  they were down in my cluster.  I quickly started them all, and everything
  seems fine.  I'm doing a postmortem now, but it appears they all OOM'd at
  roughly the same time, which was not reported in any cassandra log, but I
  discovered something in /var/log/kern that showed java died of oom(*).
  In
  amazon, I'm using large instances for cassandra, and they have no swap
 (as
  recommended), so I have ~8GB of ram.  Should I use a different max mem
  setting?  I'm using a stock rpm from riptano/datastax.  If I run ps
 -aux I
  get:
  /usr/bin/java -ea -XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42
  -Xms3843M -Xmx3843M -Xmn200M -XX:+HeapDumpOnOutOfMemoryError -Xss128k
  -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled
  -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1
  -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseCMSInitiatingOccupancyOnly
  -Djava.net.preferIPv4Stack=true -Djava.rmi.server.hostname=X.X.X.X
  -Dcom.sun.management.jmxremote.port=8080
  -Dcom.sun.management.jmxremote.ssl=false
  -Dcom.sun.management.jmxremote.authenticate=false -Dmx4jaddress=0.0.0.0
  -Dmx4jport=8081 -Dlog4j.configuration=log4j-server.properties
  -Dlog4j.defaultInitOverride=true
  -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid -cp
 
 :/etc/cassandra/conf:/usr/share/cassandra/lib/antlr-3.1.3.jar:/usr/share/cassandra/lib/apache-cassandra-0.7.4.jar:/usr/share/cassandra/lib/avro-1.4.0-fixes.jar:/usr/share/cassandra/lib/avro-1.4.0-sources-fixes.jar:/usr/share/cassandra/lib/commons-cli-1.1.jar:/usr/share/cassandra/lib/commons-codec-1.2.jar:/usr/share/cassandra/lib/commons-collections-3.2.1.jar:/usr/share/cassandra/lib/commons-lang-2.4.jar:/usr/share/cassandra/lib/concurrentlinkedhashmap-lru-1.1.jar:/usr/share/cassandra/lib/guava-r05.jar:/usr/share/cassandra/lib/high-scale-lib.jar:/usr/share/cassandra/lib/jackson-core-asl-1.4.0.jar:/usr/share/cassandra/lib/jackson-mapper-asl-1.4.0.jar:/usr/share/cassandra/lib/jetty-6.1.21.jar:/usr/share/cassandra/lib/jetty-util-6.1.21.jar:/usr/share/cassandra/lib/jline-0.9.94.jar:/usr/share/cassandra/lib/json-simple-1.1.jar:/usr/share/cassandra/lib/jug-2.0.0.jar:/usr/share/cassandra/lib/libthrift-0.5.jar:/usr/share/cassandra/lib/log4j-1.2.16.jar:/usr/share/cassandra/lib/mx4j-tools.jar:/usr/share/cassandra/lib/servlet-api-2.5-20081211.jar:/usr/share/cassandra/lib/slf4j-api-1.6.1.jar:/usr/share/cassandra/lib/slf4j-log4j12-1.6.1.jar:/usr/share/cassandra/lib/snakeyaml-1.6.jar
  org.apache.cassandra.thrift.CassandraDaemon
  (*) Also, why would they all OOM so close to each other?  Bad luck?  Or
 once
  the first node went down, is there an increased chance of the rest?
  I'm still on 0.7.4, when I released cassandra to production that was the
  latest release.  In addition to (or instead of?) fixing memory settings,
 I'm
  guessing I should upgrade.
  will




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread William Oberman
I was wondering/I figured that /var/log/kern indicated the OS was killing
java (versus an internal OOM).

The nodetool repair is interesting.  My application never deletes, so I
didn't bother running it.  But, if that helps prevent OOMs as well, I'll add
it to the crontab

(plan A is still upgrading to 0.8.0).

will

On Wed, Jun 22, 2011 at 8:53 AM, Sasha Dolgy sdo...@gmail.com wrote:

 Yes ... this is because it was the OS that killed the process, and
 wasn't related to Cassandra crashing.  Reviewing our monitoring, we
 saw that memory utilization was pegged at 100% for days and days
 before it was finally killed because 'apt' was fighting for resource.
 At least, that's as far as I got in my investigation before giving up,
 moving to 0.8.0 and implementing 24hr nodetool repair on each node via
 cronjob. So far ... no problems.

 On Wed, Jun 22, 2011 at 2:49 PM, William Oberman
 ober...@civicscience.com wrote:
  Well, I managed to run 50 days before an OOM, so any changes I make will
  take a while to test ;-)  I've seen the GCInspector log lines appear
  periodically in my logs, but I didn't see a correlation with the crash.
  I'll read the instructions on how to properly do a rolling upgrade today,
  practice on test, and try that on production first.
  will




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: OOM (or, what settings to use on AWS large?)

2011-06-22 Thread William Oberman
The CLI is posted, I assume that's the defaults (I didn't touch anything).
The machines basically just run cassandra (and standard Centos5 background
stuff).

will

On Wed, Jun 22, 2011 at 9:49 AM, Jake Luciani jak...@gmail.com wrote:

 Are you running with the default heap settings? what else is running on the
 boxes?



 On Wed, Jun 22, 2011 at 9:06 AM, William Oberman ober...@civicscience.com
  wrote:

 I was wondering/I figured that /var/log/kern indicated the OS was killing
 java (versus an internal OOM).

 The nodetool repair is interesting.  My application never deletes, so I
 didn't bother running it.  But, if that helps prevent OOMs as well, I'll add
 it to the crontab

 (plan A is still upgrading to 0.8.0).

 will


 On Wed, Jun 22, 2011 at 8:53 AM, Sasha Dolgy sdo...@gmail.com wrote:

 Yes ... this is because it was the OS that killed the process, and
 wasn't related to Cassandra crashing.  Reviewing our monitoring, we
 saw that memory utilization was pegged at 100% for days and days
 before it was finally killed because 'apt' was fighting for resource.
 At least, that's as far as I got in my investigation before giving up,
 moving to 0.8.0 and implementing 24hr nodetool repair on each node via
 cronjob. So far ... no problems.

 On Wed, Jun 22, 2011 at 2:49 PM, William Oberman
 ober...@civicscience.com wrote:
  Well, I managed to run 50 days before an OOM, so any changes I make
 will
  take a while to test ;-)  I've seen the GCInspector log lines appear
  periodically in my logs, but I didn't see a correlation with the crash.
  I'll read the instructions on how to properly do a rolling upgrade
 today,
  practice on test, and try that on production first.
  will




 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com




 --
 http://twitter.com/tjake




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


rpm from 0.7.x - 0.8?

2011-06-22 Thread William Oberman
I'm running 0.7.4 from rpm (riptano).  If I do a yum upgrade, it's trying to
do 0.7.6.  To get 0.8.x I have to do install apache-cassandra08.  But that
is going to install two copies.

Is there a semi-official way of properly upgrading to 0.8 via rpm?

-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: rpm from 0.7.x - 0.8?

2011-06-22 Thread William Oberman
I just did a remove then install, and it seems to work.

For those of you out there with JMX issues, the default port moved from 8080
to 7199 (which includes the internal default to nodetool).  I was confused
why nodetool ring would fail on some boxes and not others.  I had to add -p
depending on the version of nodetool

will

On Wed, Jun 22, 2011 at 10:15 AM, William Oberman
ober...@civicscience.comwrote:

 I'm running 0.7.4 from rpm (riptano).  If I do a yum upgrade, it's trying
 to do 0.7.6.  To get 0.8.x I have to do install apache-cassandra08.  But
 that is going to install two copies.

 Is there a semi-official way of properly upgrading to 0.8 via rpm?

 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: rpm from 0.7.x - 0.8?

2011-06-22 Thread William Oberman
I have a question about auto_bootstrap.  When I originally brought up the
cluster, I did:
-seed with auto_boot = false
-1,2,3 with auto_boot = true

Now that I'm doing a rolling upgrade, do I set them all to auto_boot =
true?  Or does the seed stay false?  Or should I mark them all false?  I
have manually set tokens on all of them.

The doc confused me:
Set to 'true' to make new [non-seed] nodes automatically migrate the right
data to themselves. (If no
InitialToken (http://wiki.apache.org/cassandra/InitialToken) is specified,
they will pick one such that they will get half the range of
the most-loaded node.) If a node starts up without bootstrapping, it will
mark itself bootstrapped so that you can't subsequently accidently bootstrap
a node with data on it. (You can reset this by wiping your data and
commitlog directories.)
Default is: 'false', so that new clusters don't bootstrap immediately. You
should turn this on when you start adding new nodes to a cluster that
already has data on it.

I'm not adding new nodes, but the cluster does have data on it...

will

On Wed, Jun 22, 2011 at 11:39 AM, William Oberman
ober...@civicscience.comwrote:

 I just did a remove then install, and it seems to work.

 For those of you out there with JMX issues, the default port moved from
 8080 to 7199 (which includes the internal default to nodetool).  I was
 confused why nodetool ring would fail on some boxes and not others.  I had
 to add -p depending on the version of nodetool

 will


 On Wed, Jun 22, 2011 at 10:15 AM, William Oberman 
 ober...@civicscience.com wrote:

 I'm running 0.7.4 from rpm (riptano).  If I do a yum upgrade, it's trying
 to do 0.7.6.  To get 0.8.x I have to do install apache-cassandra08.  But
 that is going to install two copies.

 Is there a semi-official way of properly upgrading to 0.8 via rpm?

 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com




 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: rpm from 0.7.x - 0.8?

2011-06-22 Thread William Oberman
Thanks Jonathan.  I'm sure it's been true for everyone else as well, but the
rolling upgrade seems to have worked like a charm for me (other than the
initial confusion from the JMX port # changing).

One minor thing that probably particular to my case: when I removed the old
package, it unlinked my symlink /var/lib/cassandra/data (rather than edit
the cassandra config, I symlinked my amazon disk to where cassandra expected
it).  At first I thought I had lost all of my data, but after restoring the
link, everything was happy.

will

On Wed, Jun 22, 2011 at 12:34 PM, Jonathan Ellis jbel...@gmail.com wrote:

 Doesn't matter.  auto_bootstrap only applies to first start ever.

 On Wed, Jun 22, 2011 at 10:48 AM, William Oberman
 ober...@civicscience.com wrote:
  I have a question about auto_bootstrap.  When I originally brought up the
  cluser, I did:
  -seed with auto_boot = false
  -1,2,3 with auto_boot = true
 
  Now that I'm doing a rolling upgrade, do I set them all to auto_boot =
  true?  Or does the seed stay false?  Or should I mark them all false?  I
  have manually set tokens on all of the.
 
  The doc confused me:
  Set to 'true' to make new [non-seed] nodes automatically migrate the
 right
  data to themselves. (If no InitialToken is specified, they will pick one
  such that they will get half the range of the most-loaded node.) If a
 node
  starts up without bootstrapping, it will mark itself bootstrapped so that
  you can't subsequently accidently bootstrap a node with data on it. (You
 can
  reset this by wiping your data and commitlog directories.)
  Default is: 'false', so that new clusters don't bootstrap immediately.
 You
  should turn this on when you start adding new nodes to a cluster that
  already has data on it.
 
  I'm not adding new nodes, but the cluster does have data on it...
 
  will
 
  On Wed, Jun 22, 2011 at 11:39 AM, William Oberman 
 ober...@civicscience.com
  wrote:
 
  I just did a remove then install, and it seems to work.
 
  For those of you out there with JMX issues, the default port moved from
  8080 to 7199 (which includes the internal default to nodetool).  I was
  confused why nodetool ring would fail on some boxes and not others.  I
 had
  to add -p depending on the version of nodetool
 
  will
 
  On Wed, Jun 22, 2011 at 10:15 AM, William Oberman
  ober...@civicscience.com wrote:
 
  I'm running 0.7.4 from rpm (riptano).  If I do a yum upgrade, it's
 trying
  to do 0.7.6.  To get 0.8.x I have to do install apache-cassandra08.
 But
  that is going to install two copies.
 
  Is there a semi-official way of properly upgrading to 0.8 via rpm?
 
  --
  Will Oberman
  Civic Science, Inc.
  3030 Penn Avenue., First Floor
  Pittsburgh, PA 15201
  (M) 412-480-7835
  (E) ober...@civicscience.com
 
 
 
  --
  Will Oberman
  Civic Science, Inc.
  3030 Penn Avenue., First Floor
  Pittsburgh, PA 15201
  (M) 412-480-7835
  (E) ober...@civicscience.com
 
 
 
  --
  Will Oberman
  Civic Science, Inc.
  3030 Penn Avenue., First Floor
  Pittsburgh, PA 15201
  (M) 412-480-7835
  (E) ober...@civicscience.com
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: Docs: Token Selection

2011-06-17 Thread William Oberman
I haven't done it yet, but when I researched how to make
geo-diverse/failover DCs, I figured I'd have to do something like RF=6,
strategy = {DC1=3, DC2=3}, and LOCAL_QUORUM for reads/writes.  This gives
you an ack after 2 local nodes do the read/write, but the data eventually
gets distributed to the other DC for a full failover.  No ying-yang, but I
believe it accomplishes the same goal?
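For the archives, a sketch of what that looks like from the client side
(assuming pycassa; the keyspace, DC names, and CF below are placeholders):

# Keyspace with 3 replicas in each of two data centers, plus a CF handle
# that reads/writes at LOCAL_QUORUM so only local replicas have to ack.
import pycassa
from pycassa.system_manager import SystemManager, NETWORK_TOPOLOGY_STRATEGY

sm = SystemManager('localhost:9160')
sm.create_keyspace('MyKeyspace',
                   replication_strategy=NETWORK_TOPOLOGY_STRATEGY,
                   strategy_options={'DC1': '3', 'DC2': '3'})
sm.create_column_family('MyKeyspace', 'MyCF')

pool = pycassa.ConnectionPool('MyKeyspace', ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, 'MyCF',
                          read_consistency_level=pycassa.ConsistencyLevel.LOCAL_QUORUM,
                          write_consistency_level=pycassa.ConsistencyLevel.LOCAL_QUORUM)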

will

On Fri, Jun 17, 2011 at 2:20 AM, Jonathan Ellis jbel...@gmail.com wrote:

 Replication location is determined by the row key, not the location of
 the client that inserted it.  (Otherwise, without knowing what DC a
 row was inserted in, you couldn't look it up to read it!)

 On Fri, Jun 17, 2011 at 12:20 AM, AJ a...@dude.podzone.net wrote:
  On 6/16/2011 9:45 PM, aaron morton wrote:
 
  But, I'm thinking about using OldNetworkTopStrat.
 
  NetworkTopologyStrategy is where it's at.
 
  Oh yeah?  It didn't look like it would serve my requirements.  I want 2
 full
  production geo-diverse data centers with each serving as a failover for
 the
  other.  Random Partitioner.  Each dc holds 2 replicas from the local
 clients
  and 1 replica goes to the other dc.  It doesn't look like I can do a
  ying-yang setup like that with NTS.  Am I wrong?
 
  A
  -
  Aaron Morton
  Freelance Cassandra Developer
  @aaronmorton
  http://www.thelastpickle.com
 
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


prep for cassandra storage from pig

2011-06-15 Thread William Oberman
I think I'm stuck on typing issues trying to store data in cassandra.  To
verify, cassandra wants (key, {tuples})

My pig script is fairly brief:
raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS
(key:chararray, columns:bag {column:tuple (name, value)});
--columns == timeUUID -> JSON
rows = FOREACH raw GENERATE key, FLATTEN(columns);
alias_target_day = FOREACH rows {
--I wrote a specialized parser that does exactly what I need
observation_map = com.civicscience.pig.ParseObservation($2);
GENERATE $0 as alias, observation_map#'_fqt' as target,
observation_map#'_day' as day;
};
grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1,
COUNT($1)) as day_count;

This gets me:
(targetA, (day1, count))
(targetA, (day2, count))
(targetB, (day1, count))


But, cassandra wants the 2nd item to be a bag.  So, I tried:
X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1,
COUNT($1))) as day_count;

But this results in:
(targetA, {((day1, count))})
(targetA, {((day2, count))})
(targetB, {((day1, count))})
It's hard to see, but the 2nd item now has a nested tuple as the first
value, which is still bad.

How do I get (key, {tuple})???  I wasn't sure where to post this (pig or
cassandra), so I'm posting to the pig list too.

will


Re: prep for cassandra storage from pig

2011-06-15 Thread William Oberman
My problem is the column names are dynamic (a date), and pygmalion seems to
want the column names to be fixed at compile time (the script).

On Wed, Jun 15, 2011 at 3:04 PM, Jeremy Hanna jeremy.hanna1...@gmail.comwrote:

 Hi Will,

 That's partly why I like to use FromCassandraBag and ToCassandraBag from
 pygmalion - it does the work for you to get it back into a form that
 cassandra understands.

 Others may know better how to massage the data into that form using just
 pig, but if all else fails, you could write a udf to do that.

 Jeremy

 On Jun 15, 2011, at 1:17 PM, William Oberman wrote:

  I think I'm stuck on typing issues trying to store data in cassandra.  To
 verify, cassandra wants (key, {tuples})
 
  My pig script is fairly brief:
  raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS
 (key:chararray, columns:bag {column:tuple (name, value)});
  --colums == timeUUID - JSON
  rows = FOREACH raw GENERATE key, FLATTEN(columns);
  alias_target_day = FOREACH rows {
  --I wrote a specialized parser that does exactly what I need
  observation_map = com.civicscience.pig.ParseObservation($2);
  GENERATE $0 as alias, observation_map#'_fqt' as target,
 observation_map#'_day' as day;
  };
  grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
  X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1,
 COUNT($1)) as day_count;
 
  This gets me:
  (targetA, (day1, count))
  (targetA, (day2, count))
  (targetB, (day1, count))
  
 
  But, cassandra wants the 2nd item to be a bag.  So, I tried:
  X = FOREACH grouping GENERATE group.$0 as target, TOBAG(TOTUPLE(group.$1,
 COUNT($1))) as day_count;
 
  But this results in:
  (targetA, {((day1, count))})
  (targetA, {((day2, count))})
  (targetB, {((day1, count))})
  It's hard to see, but the 2nd item now has a nested tuple as the first
 value, which is still bad.
 
  How to I get (key, {tuple})???  I wasn't sure where to post this (pig or
 cassandra), so I'm posting to the pig list too.
 
  will




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: prep for cassandra storage from pig

2011-06-15 Thread William Oberman
I'll do a reply all, to keep this more consistent (sorry!).

Rather than staying stuck, I wrote a custom function: TupleToBagOfTuple. I'm
curious if I could have avoided it with proper pig scripting though.

On Wed, Jun 15, 2011 at 3:08 PM, William Oberman
ober...@civicscience.comwrote:

 My problem is the column names are dynamic (a date), and pygmalion seems to
 want the column names to be fixed at compile time (the script).


 On Wed, Jun 15, 2011 at 3:04 PM, Jeremy Hanna 
 jeremy.hanna1...@gmail.comwrote:

 Hi Will,

 That's partly why I like to use FromCassandraBag and ToCassandraBag from
 pygmalion - it does the work for you to get it back into a form that
 cassandra understands.

 Others may know better how to massage the data into that form using just
 pig, but if all else fails, you could write a udf to do that.

 Jeremy

 On Jun 15, 2011, at 1:17 PM, William Oberman wrote:

  I think I'm stuck on typing issues trying to store data in cassandra.
  To verify, cassandra wants (key, {tuples})
 
  My pig script is fairly brief:
  raw = LOAD 'cassandra://test_in/test_cf' USING CassandraStorage() AS
 (key:chararray, columns:bag {column:tuple (name, value)});
  --colums == timeUUID - JSON
  rows = FOREACH raw GENERATE key, FLATTEN(columns);
  alias_target_day = FOREACH rows {
  --I wrote a specialized parser that does exactly what I need
  observation_map = com.civicscience.pig.ParseObservation($2);
  GENERATE $0 as alias, observation_map#'_fqt' as target,
 observation_map#'_day' as day;
  };
  grouping = GROUP alias_target_day BY ((chararray)target,(chararray)day);
  X = FOREACH grouping GENERATE group.$0 as target, TOTUPLE(group.$1,
 COUNT($1)) as day_count;
 
  This gets me:
  (targetA, (day1, count))
  (targetA, (day2, count))
  (targetB, (day1, count))
  
 
  But, cassandra wants the 2nd item to be a bag.  So, I tried:
  X = FOREACH grouping GENERATE group.$0 as target,
 TOBAG(TOTUPLE(group.$1, COUNT($1))) as day_count;
 
  But this results in:
  (targetA, {((day1, count))})
  (targetA, {((day2, count))})
  (targetB, {((day1, count))})
  It's hard to see, but the 2nd item now has a nested tuple as the first
 value, which is still bad.
 
  How to I get (key, {tuple})???  I wasn't sure where to post this (pig or
 cassandra), so I'm posting to the pig list too.
 
  will




 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


hadoop/pig notes

2011-06-08 Thread William Oberman
I decided to try out hadoop/pig + cassandra.  I had my ups and downs to get
the script I wanted to run to work.  I'm sure everyone who tries will have
their own experiences/problems, but mine were:

-Everything I need to know was in
http://hadoop.apache.org/common/docs/r0.20.2/cluster_setup.html and
http://wiki.apache.org/cassandra/HadoopSupport

-Java is really picky about hostnames.  I'm in EC2, and rather than rely on
DNS, I basically have all of my machines share an /etc/hosts file.  But, the
command line hostname wasn't returning the same thing as in /etc/hosts,
which caused all kinds of weird hadoop issues at first.  (I had hostname as
foo and /etc/hosts had foo.prod).

-I forgot I had iptables on.  It's always easier to not have firewalls to
start (this is true when configuring anything of course)

-Use the same version of everything everywhere.  And for hadoop/pig, I was
having issues until I used the combination of hadoop-0.20.2 + pig-0.8.1.

-For hadoop's mapred-site.xml you HAVE to supply a port (hostname:port), and
there isn't a standard, and it seems arbitrary.  I used 8021, based on notes
in a case somewhere from hadoop (I think trying to standardize).

It took me awhile to figure the syntax of Pig Latin out, but I finally
managed to get a script that does a count of all columns in a column family:
rows = LOAD 'cassandra://keyspace/columnfamily' USING CassandraStorage();
filter_rows = FILTER rows BY $1 is not null;
counts = FOREACH filter_rows GENERATE COUNT($1);
counts_in_bag = GROUP counts ALL;
sum_of_bag = FOREACH counts_in_bag  GENERATE SUM($1);
dump sum_of_bag;

I'm trying to see the impact of running hadoop on the same servers as
cassandra now.  And yes, I've seen the note in the wiki about the clever
partitioning of cassandra nodes to allow for web latency nodes + hadoop
processing nodes :-)


Re: best way to backup

2011-04-30 Thread William Oberman
Thanks, I think I'm getting some of the file layout/data structures now, so
that helps with the backup strategy.  I might still start simple, as it's
usually harder to screw up simple, but at least I'll know where I can go
with something more clever.

will

On Sat, Apr 30, 2011 at 9:15 AM, Jeremiah Jordan 
jeremiah.jor...@morningstar.com wrote:

  The files inside the keyspace folders are the SSTables.

  --
 *From:* aaron morton [mailto:aa...@thelastpickle.com]
 *Sent:* Friday, April 29, 2011 4:49 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: best way to backup

 William,
 Some info on the sstables from me
 http://thelastpickle.com/2011/04/28/Forces-of-Write-and-Read/

 If you
 want to know more check out the BigTable and original Facebook papers,
 linked from the wiki

 http://wiki.apache.org/cassandra/ArchitectureOverview

 Aaron

  On 29 Apr 2011, at 23:43, William Oberman wrote:

 Dumb question, but referenced twice now: which files are the SSTables and
 why is backing them up incrementally a win?

 Or should I not bother to understand internals, and instead just roll with
 the backup my keyspace(s) and system in a compressed tar strategy, as
 while it may be excessive, it's guaranteed to work and work easily (which I
 like, a great deal).

 will

 On Fri, Apr 29, 2011 at 4:58 AM, Daniel Doubleday 
 daniel.double...@gmx.net wrote:

 What we are about to set up is a time machine like backup. This is more
 like an add on to the s3 backup.

 Our boxes have an additional larger drive for local backup. We create a
 new backup snaphot every x hours which hardlinks the files in the previous
 snapshot (bit like cassandras incremental_backups thing) and than we sync
 that snapshot dir with the cassandra data dir. We can do archiving / backup
 to external system from there without impacting the main data raid.

 But the main reason to do this is to have an 'omg we screwed up big time
 and deleted / corrupted data' recovery.

  On Apr 28, 2011, at 9:53 PM, William Oberman wrote:

   Even with N-nodes for redundancy, I still want to have backups.  I'm an
 amazon person, so naturally I'm thinking S3.  Reading over the docs, and
 messing with nodeutil, it looks like each new snapshot contains the previous
 snapshot as a subset (and I've read how cassandra uses hard links to avoid
 excessive disk use).  When does that pattern break down?

 I'm basically debating if I can do a rsync like backup, or if I should
 do a compressed tar backup.  And I obviously want multiple points in time.
 S3 does allow file versioning, if a file or file name is changed/resused
 over time (only matters in the rsync case).  My only concerns with
 compressed tars is I'll have to have free space to create the archive and I
 get no delta space savings on the backup (the former is solved by not
 allowing the disk space to get so low and/or adding more nodes to bring down
 the space, the latter is solved by S3 being really cheap anyways).

 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com





 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com





-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: best way to backup

2011-04-29 Thread William Oberman
Dumb question, but referenced twice now: which files are the SSTables and
why is backing them up incrementally a win?

Or should I not bother to understand internals, and instead just roll with
the backup my keyspace(s) and system in a compressed tar strategy, as
while it may be excessive, it's guaranteed to work and work easily (which I
like, a great deal).

will

On Fri, Apr 29, 2011 at 4:58 AM, Daniel Doubleday
daniel.double...@gmx.netwrote:

 What we are about to set up is a time machine like backup. This is more
 like an add on to the s3 backup.

 Our boxes have an additional larger drive for local backup. We create a new
 backup snapshot every x hours which hardlinks the files in the previous
 snapshot (a bit like cassandra's incremental_backups thing) and then we sync
 that snapshot dir with the cassandra data dir. We can do archiving / backup
 to external system from there without impacting the main data raid.

 But the main reason to do this is to have an 'omg we screwed up big time
 and deleted / corrupted data' recovery.

 On Apr 28, 2011, at 9:53 PM, William Oberman wrote:

 Even with N-nodes for redundancy, I still want to have backups.  I'm an
 amazon person, so naturally I'm thinking S3.  Reading over the docs, and
 messing with nodeutil, it looks like each new snapshot contains the previous
 snapshot as a subset (and I've read how cassandra uses hard links to avoid
 excessive disk use).  When does that pattern break down?

 I'm basically debating if I can do a rsync like backup, or if I should do
 a compressed tar backup.  And I obviously want multiple points in time.  S3
 does allow file versioning, if a file or file name is changed/resused over
 time (only matters in the rsync case).  My only concerns with compressed
 tars is I'll have to have free space to create the archive and I get no
 delta space savings on the backup (the former is solved by not allowing
 the disk space to get so low and/or adding more nodes to bring down the
 space, the latter is solved by S3 being really cheap anyways).

 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com





-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


best way to backup

2011-04-28 Thread William Oberman
Even with N-nodes for redundancy, I still want to have backups.  I'm an
amazon person, so naturally I'm thinking S3.  Reading over the docs, and
messing with nodetool, it looks like each new snapshot contains the previous
snapshot as a subset (and I've read how cassandra uses hard links to avoid
excessive disk use).  When does that pattern break down?

I'm basically debating if I can do a rsync like backup, or if I should do
a compressed tar backup.  And I obviously want multiple points in time.  S3
does allow file versioning, if a file or file name is changed/reused over
time (only matters in the rsync case).  My only concerns with compressed
tars is I'll have to have free space to create the archive and I get no
delta space savings on the backup (the former is solved by not allowing
the disk space to get so low and/or adding more nodes to bring down the
space, the latter is solved by S3 being really cheap anyways).

-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: best way to backup

2011-04-28 Thread William Oberman
Interesting.  Both use cases seem easy to code.
Compress to S3 = cassandra snapshot, tar, s3 put
EBS = cassandra snapshot, rsync snapshot dir -> ebs, ebs snapshot
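The first recipe is short enough to sketch (assuming boto; the bucket name and
data dir below are placeholders):

# "cassandra snapshot, tar, s3 put" in one small script (sketch only).
import os
import subprocess
import time
import boto

DATA_DIR = '/var/lib/cassandra/data'   # includes the system keyspace
BUCKET = 'my-cassandra-backups'        # placeholder bucket name

subprocess.check_call(['nodetool', '-h', 'localhost', 'snapshot'])

archive = '/tmp/cassandra-%s.tar.gz' % time.strftime('%Y%m%d%H%M%S')
subprocess.check_call(['tar', 'czf', archive, DATA_DIR])

key = boto.connect_s3().get_bucket(BUCKET).new_key(os.path.basename(archive))
key.set_contents_from_filename(archive)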

I think the former is cheaper in terms of costs, as my gut says keeping
around an EBS drive is more money than the lack of deltas in S3.  But, EBS
would allow extremely fine grained snapshotting without storage penalties
(EBS snaps are supposed to be compressed deltas behind the scenes).  Of
course, I don't know how cassandra likes frequent snapshotting, and given
the amount of node redundancy/eventual consistency that seems pointless
anyways.

will

On Thu, Apr 28, 2011 at 3:57 PM, Sasha Dolgy sdo...@gmail.com wrote:

 You could take a snapshot to an EBS volume.  then, take a snapshot of that
 via AWS.  of course, this is ok when they -aren't- having outages and issues
 ...
 On Apr 28, 2011 9:54 PM, William Oberman ober...@civicscience.com
 wrote:
  Even with N-nodes for redundancy, I still want to have backups. I'm an
  amazon person, so naturally I'm thinking S3. Reading over the docs, and
  messing with nodeutil, it looks like each new snapshot contains the
 previous
  snapshot as a subset (and I've read how cassandra uses hard links to
 avoid
  excessive disk use). When does that pattern break down?
 
  I'm basically debating if I can do a rsync like backup, or if I should
 do
  a compressed tar backup. And I obviously want multiple points in time. S3
  does allow file versioning, if a file or file name is changed/resused
 over
  time (only matters in the rsync case). My only concerns with compressed
  tars is I'll have to have free space to create the archive and I get no
  delta space savings on the backup (the former is solved by not allowing
  the disk space to get so low and/or adding more nodes to bring down the
  space, the latter is solved by S3 being really cheap anyways).
 
  --
  Will Oberman
  Civic Science, Inc.
  3030 Penn Avenue., First Floor
  Pittsburgh, PA 15201
  (M) 412-480-7835
  (E) ober...@civicscience.com




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: best way to backup

2011-04-28 Thread William Oberman
My newbie mistake (always good to test things): my script wasn't
storing/restoring system, only my keyspace.  So, if you want to be able to
restore from backup, make sure you save the keyspace and system!

will

On Thu, Apr 28, 2011 at 4:35 PM, Jeremy Hanna jeremy.hanna1...@gmail.comwrote:

 one thing we're looking at doing is watching the cassandra data directory
 and backing up the sstables to s3 when they are created.  Some guys at
 simplegeo started tablesnap that does this:
 https://github.com/simplegeo/tablesnap

 What it does is for every sstable that is pushed to s3, it also copies a
 json file with the current files in the directory, so you can know what to
 restore in that event (as far as I understand).
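 (A rough sketch of that per-SSTable idea, for illustration only; tablesnap
 itself is the real implementation, and the bucket and paths below are
 placeholders:)

import json
import os
import time
import boto

DATA_DIR = '/var/lib/cassandra/data/MyKeyspace'              # placeholder
bucket = boto.connect_s3().get_bucket('my-sstable-backups')  # placeholder

seen = set()
while True:
    current = set(os.listdir(DATA_DIR))
    for name in sorted(current - seen):
        # push the new file...
        bucket.new_key(name).set_contents_from_filename(os.path.join(DATA_DIR, name))
        # ...plus a listing of the directory, so a restore knows what goes together
        bucket.new_key(name + '-listdir.json').set_contents_from_string(
            json.dumps(sorted(current)))
    seen = current
    time.sleep(60)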

 On Apr 28, 2011, at 2:53 PM, William Oberman wrote:

  Even with N-nodes for redundancy, I still want to have backups.  I'm an
 amazon person, so naturally I'm thinking S3.  Reading over the docs, and
 messing with nodeutil, it looks like each new snapshot contains the previous
 snapshot as a subset (and I've read how cassandra uses hard links to avoid
 excessive disk use).  When does that pattern break down?
 
  I'm basically debating if I can do a rsync like backup, or if I should
 do a compressed tar backup.  And I obviously want multiple points in time.
  S3 does allow file versioning, if a file or file name is changed/resused
 over time (only matters in the rsync case).  My only concerns with
 compressed tars is I'll have to have free space to create the archive and I
 get no delta space savings on the backup (the former is solved by not
 allowing the disk space to get so low and/or adding more nodes to bring down
 the space, the latter is solved by S3 being really cheap anyways).
 
  --
  Will Oberman
  Civic Science, Inc.
  3030 Penn Avenue., First Floor
  Pittsburgh, PA 15201
  (M) 412-480-7835
  (E) ober...@civicscience.com




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: advice for EC2 deployment

2011-04-27 Thread William Oberman
While I haven't configured it for multi-region yet, Sasha is exactly right
about how Amazon's DNS works (returning private vs. public IP depending on if
the machine is local to the region or not).  For extra fun, now that Route53
exists you can (somewhat trivially) map and dynamically maintain all EC2
instances to stable DNS names (but make sure to use CNAMEs to get the DNS
magic!).  E.g.
cassandra1.somethinghardtoguess.ec2.yourdomain.com -
weird.ec2.public.dns.name

I'd drop in the somethinghardtoguess myself given Route53 can expose your
internal network topology if someone can guess the DNS name.
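A sketch of the Route53 side (assuming boto's Route53 support; the hosted zone
id and names below are placeholders):

# Map a stable CNAME to an instance's public DNS name in Route53.
from boto.route53.connection import Route53Connection
from boto.route53.record import ResourceRecordSets

ZONE_ID = 'Z1234567890'  # placeholder hosted zone id

conn = Route53Connection()
changes = ResourceRecordSets(conn, ZONE_ID)
change = changes.add_change('CREATE',
                            'cassandra1.somethinghardtoguess.ec2.yourdomain.com',
                            'CNAME', ttl=60)
change.add_value('ec2-12-34-56-78.compute-1.amazonaws.com')  # placeholder public DNS
changes.commit()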

will

On Wed, Apr 27, 2011 at 8:15 AM, Sasha Dolgy sdo...@gmail.com wrote:

 Hi,

 If I understand you correctly, you are trying to get a private ip in
 us-east speaking to the private ip in us-west.  to make your life
 easier, configure your nodes to use hostname of the server.  if it's
 in a different region, it will use the public ip (ec2 dns will handle
 this for you) and if it's in the same region, it will use the private
 ip.  this way you can stop worrying about if you are using the public
 or private ip to communicate with another node.  let the aws dns do
 the work for you.

 just make sure you are using v0.8 with SSL turned on and have the
 appropriate security group definitions ...

 -sasha



 On Wed, Apr 27, 2011 at 1:55 PM, pankajsoni0126
 pankajsoni0...@gmail.com wrote:
  I have been trying to deploy Cassandra cluster across regions and for
 that I
  posted this IP address resolution in MultiDC setup.
 
  But when it is to get nodes talking to each other on different regions
 say,
  us-east and us-west over private IP's of EC2 nodes I am facing problems.
 
  I am assuming if Cassandra is built for multi-DC setup it should be
 easily
  deployed with node1's DC1's public IP listed as seed in all nodes in DC2
 and
  to gain idea about network topology? I have hit a dud for deployment in
 such
  scenario.
 
  Or is there any way possible to use Private IP's for such a scenario in
  EC2, as public IPs are less secure and costly?




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: advice for EC2 deployment

2011-04-27 Thread William Oberman
It's great advice, but I'm still torn.  I've never done multi-region work
before, and I'd prefer to wait for 0.8 with built-in inter-node security,
but I'm otherwise ready to roll (and need to roll) cassandra out sooner than
that.

Given how well my system held up with a total single AZ failure, I'm really
leaning on starting by treating AZ's as DCs, and racks as... random?  I
don't think that part matters.  My question for today is to just use the
property file snitch, or to roll my own version of Ec2Snitch that does AZ as
DC.

I do increase my risk being single region to start, so I was going to figure
out how to push snapshots to S3.  One question on that note: is it better to
try and snapshot all nodes at roughly the same point in time, or is it
better to do rolling snapshots?

will

On Wed, Apr 27, 2011 at 7:13 AM, aaron morton aa...@thelastpickle.comwrote:

 Using the EC2Snitch you could have one AZ in us-east-1 and one Az in
 us-west-1, treat each AZ as a single rack and each region as a DC. The
 network topology is rack aware so will prefer request that go to the same
 rack (not much of an issue when you have only one rack).

 If possible I would use the same RF in each DC, if you want the fail over
 to be as clean as possible (see earlier comments about when number failed
 nodes in a dc). i.e. 3 replicas in each dc / region.

 Until you find a reason otherwise use LOCAL_QUORUM, if that proves to be
 too slow or you get more experience and feel comfortable with the trade offs
 then change to a lower CL.

 Dropping the CL level for write bursts does not make the cluster run any
 faster, it lets the client think the cluster is running faster and can
 result in the client overloading (in a good this is what it should do way)
 the cluster. This can result in more eventual consistency work to be done
 later during maintenance or read requests. If that is a reasonable trade
 off, you can write at CL ONE and read at CL ALL to ensure you get consistent
 reads (quorum is not good enough in that case).

 Jump in and test it at Quorum, you may find the write performance is good
 enough. There are lots of dials to play with
 http://wiki.apache.org/cassandra/MemtableThresholds

 Hope that helps.
 Aaron


 On 27 Apr 2011, at 09:31, William Oberman wrote:

 I see what you're saying.  I was able to control write latency on mysql
 using insert vs insert delayed (what I feel is MySQL's poor man's eventual
 consistency option) + the fact that replication was a background
 asynchronous process.  In terms of read latency, I was able to do up to a
 few hundred well indexed mysql queries (across AZs) on a view while keeping
 the overall latency of the page around or less than a second.

 I basically am replacing two use cases, the cases with difficult to scale
 anticipated write volumes.  The first case was previously using insert
 delayed (which I'm doing in cassandra as ONE) as I wasn't getting consistent
 write/read operations before anyways.  The second case was using traditional
 insert (which I was going to replace with some QUORUM-like level, I was
 assuming LOCAL_QUORUM).  But, the latter case uses a write through memory
 cache (memcache), so I don't know how often it really reads data from the
 persistent store.  But I definitely need to make sure it is consistent.

 In any case, it sounds like I'd be best served treating AZs as DCs, but
 then I don't know what to make racks?  Or do racks not matter in a single
 AZ?  That way I can get an ack from a LOCAL_QUORUM read/write before the
 (slightly) slower read/write to/from the other AZ (for redundancy).  Then
 I'm only screwed if Amazon has a multi-AZ failure (so far, they've kept it
 to only one!) :-)

 will

 On Tue, Apr 26, 2011 at 5:01 PM, aaron morton aa...@thelastpickle.comwrote:

 One difference between Cassandra and MySQL replication may be when the
 network IO happens. Was the MySQL replication synchronous on transaction
 commit ?  I was only aware that it had async replication, which means the
 client is not exposed to the network latency. In cassandra the network
 latency is exposed to the client as it needs to wait for the CL number of
 nodes to respond.

 If you use the PropertyFilePartitioner with the NetworkTopology you can
 manually assign machines to racks / dc's based on IP.
 See conf/cassandra-topology.property file there is also an Ec2Snitch which
 (from the code)
 /**
  * A snitch that assumes an EC2 region is a DC and an EC2
 availability_zone
  *  is a rack. This information is available in the config for the node.

 Recent discussion on DC aware CL levels
 http://www.mail-archive.com/user@cassandra.apache.org/msg11414.html

 Hope that helps.
 Aaron


 On 27 Apr 2011, at 01:18, William Oberman wrote:

 Thanks Aaron!

 Unless no one on this list uses EC2, there were a few minor troubles end
 of last week through the weekend which taught me a lot about obscure failure
 modes

Re: advice for EC2 deployment

2011-04-27 Thread William Oberman
Thanks Sasha.  Fortunately/unfortunately I did realize the default & current
behavior of the Ec2Snitch, but my application isn't multi-region capable
(yet), so I need to get intra-region redundancy.  And having a
SingleRegionEc2Snitch that did DC=ec2zone and RACK=??? would be much better
for me (for now).

On Wed, Apr 27, 2011 at 9:21 AM, Sasha Dolgy sdo...@gmail.com wrote:

 Hi William,

 The default behavior of Ec2Snitch is outlined below:


 http://svn.apache.org/repos/asf/cassandra/trunk/src/java/org/apache/cassandra/locator/Ec2Snitch.java

// Split us-east-1a or asia-1a into us-east/1a and
 asia/1a.
String azone = new String(b ,UTF-8);
String[] splits = azone.split(-);
ec2zone = splits[splits.length - 1];
ec2region = splits.length  3 ? splits[0] : splits[0]+-+splits[1];
logger.info(EC2Snitch using region:  + ec2region + , zone:
  + ec2zone + .);

 ApplicationState.DC = ec2region
 ApplicationState.RACK = ec2zone

 We leverage cassandra instances in APAC, US  Europe ... so it's
 important for us to know that we have one data center in each 'region'
 and multiple racks per DC ...

 -sasha

 On Wed, Apr 27, 2011 at 3:06 PM, William Oberman
 ober...@civicscience.com wrote:
  It's great advice, but I'm still torn.  I've never done multi-region work
  before, and I'd prefer to wait for 0.8 with built-in inter-node security,
  but I'm otherwise ready to roll (and need to roll) cassandra out sooner
 than
  that.
 
  Given how well my system held up with a total single AZ failure, I'm
 really
  leaning on starting by treating AZ's as DCs, and racks as... random?  I
  don't think that part matters.  My question for today is to just use the
  property file snitch, or to roll my own version of Ec2Snith that does AZ
 as
  DC.
 
  I do increase my risk being single region to start, so I was going to
 figure
  out how to push snapshots to S3.  One question on that note: is it better
 to
  try and snapshot all nodes at roughly the same point in time, or is it
  better to do rolling snapshots?
 
  will




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: advice for EC2 deployment

2011-04-27 Thread William Oberman
I don't think of it as migrating an instance, it's more of a destroy/start
with EC2.  But, I still think it would be very useful to spin up a set of
instances with known hostnames (cassandra1, 2, 3... N) and be able to
quickly SSH to them by doing ssh ec2u...@cassandra1.random.ec2.mydomain.com
.

Also, it makes finding seeds a lot easier, as you don't have to manage IPs
in the config file, just names (cassandra-seed1.random.ec2.mydomain.com).

I should have mentioned it, but people that are already doing this trick
(I'm not... yet) are actually doing: hostname.region.ec2.mydomain.com (as
it's useful to know the region).  I don't think anything cares about AZ, but you
could embed that too if it matters.

will

On Wed, Apr 27, 2011 at 9:26 AM, Sasha Dolgy sdo...@gmail.com wrote:

 if you migrate the instance, does Route53 automatically re-map all the
 information to the new ec2 instance?  another issue is that cassandra
 only maintains the IP of the other nodes, and not the hostname
 (assumed based on output of the nodetool ring)  ...

 which means, if you migrate the instance and Route53 does do some
 auto-magic .. the private ip for the instance will have changed and
 you will need to migrate that node back into the ring, while moving
 the old referenced IP out ... we've had quite a lot of pain with this
 in the past.  rule of thumb, if you want to upgrade / migrate an
 instance, you need to remove it from the ring, do your work, bootstrap
 it back to the ring .. i think this could be avoided if cassandra
 maintained hostname references and not just IP references for nodes.

 -sasha

 On Wed, Apr 27, 2011 at 2:56 PM, William Oberman
 ober...@civicscience.com wrote:
  While I haven't configured it for multi-region yet, Sasha is exactly
 right
  now how amzon's DNS works (returning private vs. public IP depending on
 if
  the machine is local to the region or not).  For extra fun, now that
 Route53
  exists you can (somewhat trivially) map and dynamically maintain all EC2
  instances to stable DNS names (but make sure to use CNAMEs to get the DNS
  magic!).  E.g.
  cassandra1.somethinghardtoguess.ec2.yourdomain.com -
  weird.ec2.public.dns.name
 
  I'd drop in the somethinghardtoguess myself given Route53 can expose your
  internal network topology if someone can guess the DNS name.
 
  will




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: advice for EC2 deployment

2011-04-27 Thread William Oberman
Oh, and Route53 doesn't do anything automatically, but there is an API to
manage the DNS.  It's up to you to run a task on instance boot/terminate, or
a cron job if you want to do this trick (for now, seems like a solid future
feature of Route53).  Though, I hear geographically aware Route53 is already
in the works (to route EC2 traffic to the closest region).

will

On Wed, Apr 27, 2011 at 9:33 AM, William Oberman
ober...@civicscience.comwrote:

 I don't think of it as migrating an instance, it's more of a destroy/start
 with EC2.  But, I still think it would be very useful to spin up a set of
 instances with known hostnames (cassandra1, 2, 3... N) and be able to
 quickly SSH to them by doing ssh
 ec2u...@cassandra1.random.ec2.mydomain.com.

 Also, it makes finding seeds a lot easier, as you don't have to manage IPs
 in the config file, just names (cassandra-seed1.random.ec2.mydomain.com).

 I should have mentioned it, but people that are already doing this trick
 (I'm not... yet) are actually doing: hostname.region.ec2.mydomain.com (as
 it's useful to know the region).  I don't anything cares about AZ, but you
 could embed that too if it matters.

 will


 On Wed, Apr 27, 2011 at 9:26 AM, Sasha Dolgy sdo...@gmail.com wrote:

 if you migrate the instance, does Route53 automatically re-map all the
 information to the new ec2 instance?  another issue is that cassandra
 only maintains the IP of the other nodes, and not the hostname
 (assumed based on output of the nodetool ring)  ...

 which means, if you migrate the instance and Route53 does do some
 auto-magic .. the private ip for the instance will have changed and
 you will need to migrate that node back into the ring, while moving
 the old referenced IP out ... we've had quite a lot of pain with this
 in the past.  rule of thumb, if you want to upgrade / migrate an
 instance, you need to remove it from the ring, do your work, bootstrap
 it back to the ring .. i think this could be avoided if cassandra
 maintained hostname references and not just IP references for nodes.

 -sasha

 On Wed, Apr 27, 2011 at 2:56 PM, William Oberman
 ober...@civicscience.com wrote:
  While I haven't configured it for multi-region yet, Sasha is exactly
 right
  now how amzon's DNS works (returning private vs. public IP depending on
 if
  the machine is local to the region or not).  For extra fun, now that
 Route53
  exists you can (somewhat trivially) map and dynamically maintain all EC2
  instances to stable DNS names (but make sure to use CNAMEs to get the
 DNS
  magic!).  E.g.
  cassandra1.somethinghardtoguess.ec2.yourdomain.com -
  weird.ec2.public.dns.name
 
  I'd drop in the somethinghardtoguess myself given Route53 can expose
 your
  internal network topology if someone can guess the DNS name.
 
  will




 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: advice for EC2 deployment

2011-04-27 Thread William Oberman
I think you're right about changing NetworkTopologyStrategy, but the timing
isn't working in my favor at this point.  I wonder how bad that will really
be...

On Wed, Apr 27, 2011 at 9:35 AM, Sasha Dolgy sdo...@gmail.com wrote:

 so can you not simply leverage a strategy that replicates data between
 racks and at some point in the future when you move to multi-dc
 upgrade the replication strategy to maintain the current replication
 and add in some replication between DC's ... ?

 i'll go re-read your posts to see if you've already tried this.  I
 vaguely remember Ellis saying it's not a good idea to switch
 NetworkTopologyStrategy ...

 On Wed, Apr 27, 2011 at 3:29 PM, William Oberman
 ober...@civicscience.com wrote:
  Thanks Sasha.  Fortunately/unfortunately I did realize the default 
 current
  behavior of the Ec2Snitch, but my application isn't multi-region capable
  (yet), so I need to get intra-region redundancy.  And having a
  SingleRegionEc2Snitch that did DC=ec2zone and RACK=??? would be much
 better
  for me (for now).




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: advice for EC2 deployment

2011-04-26 Thread William Oberman
Thanks Aaron!

Unless no one on this list uses EC2, there were a few minor troubles end of
last week through the weekend which taught me a lot about obscure failure
modes in various applications I use :-)  My original post was trying to be
more redundant than fast, which has been by overall goal from even before
moving to Cassandra (my downtime from the EC2 madness was minimal, and due
to only having one single point of failure == the amazon load balancer).  My
secondary goal was  trying to make moving to a second region easier, but is
that is causing problems I can drop the idea.

I might be downplaying the cost of inter-AZ communication, but I've lived
with that for quite some time, for example my current setup of MySQL in
Master-Master replication is split over zones, and my webservers live in yet
different zones.  Maybe Cassandra is chattier than I'm used to?  (again,
I'm fairly new to cassandra)

Based on that article, the discussion, and the recent EC2 issues, it sounds
like it would be better to start with:
-6 nodes split in two AZs 3/3
-Configure replication to do 2 in one AZ and one in the other
(NetworkTopology treats AZs as racks, so does RF=3,us-east=3 make this
happen naturally?  See the sketch just after this list.)
-What does LOCAL_QUORUM do in this case?  Is there a rack quorum?  Or do
the natural latencies of AZs make LOCAL_QUORUM behave like a rack quorum?
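
The keyspace definition I have in mind for that is roughly the following (an
untested sketch, with myks as a placeholder name), hoping NTS's rack
awareness splits the 3 replicas across the two AZs on its own:

create keyspace myks with replication_factor=3 and strategy_options =
[{us-east:3}] and
placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy';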

will

On Tue, Apr 26, 2011 at 1:14 AM, aaron morton aa...@thelastpickle.com wrote:

 For background see this article:

 http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers


 And this recent discussion
 http://www.mail-archive.com/user@cassandra.apache.org/msg12502.html
 
 Issues that may be a concern:
 - lots of cross AZ latency in us-east, e.g. LOCAL_QUORUM ops must wait
 cross AZ . Also consider it during maintenance tasks, how much of a pain is
 it going to be to have latency between every node.
 - IMHO not having sufficient (by that I mean 3) replicas in a cassandra DC
 to handle a single node failure when working at Quorum reduces the utility
 of the DC. e.g. with a local RF of 2 in the west, the quorum is 2, and if
 you lose one node from the replica set you will not be able to use local
 QUORUM for keys in that range. Or consider a failure mode where the west is
 disconnected from the east.

 Could you start simple with 3 replicas in one AZ in us-east and 3 replicas
 in an AZ+Region ?  Then work through some failure scenarios.

 Hope that helps.
 Aaron


 On 22 Apr 2011, at 03:28, William Oberman wrote:

 Hi,

 My service is not yet ready to be fully multi-DC, due to how some of my
 legacy MySQL stuff works.  But, I wanted to get cassandra going ASAP and
 work towards multi-DC.  I have two main cassandra use cases: one where I can
 handle eventual consistency (and all of the writes/reads are currently ONE),
 and one where I can't (writes/reads are currently QUORUM).  My test cluster
 is currently 4 smalls all in us-east with RF=3 (more to prove I can do
 clustering, than to have an exact production replica).  All of my unit
 tests, and load tests (again, not to prove true max load, but more to
 tease out concurrency issues) are passing now.

 For production, I was thinking of doing:
 -4 cassandra larges in us-east (where I am now), once in each AZ
 -1 cassandra large in us-west (where I have nothing)
 For now, my data can fit into a single large's 2 disk ephemeral using
 RAID0, and I was then thinking of doing a RF=3 with us-east=2 and
 us-west=1.  If I do eventual consistency at ONE, and consistency at
 LOCAL_QUORUM, I was hoping:
 -eventual consistency ops would be really fast
 -consistent ops would be pretty fast (what does LOCAL_QUORUM do in this
 case?  return after 1 or 2 us-east nodes ack?)
 -us-west would contain a complete copy of my data, so it's a good
 eventually consistent close to real time backup  (assuming it can keep up
 over long periods of time, but I think it should)
 -eventually, when I'm ready to roll out in us-west I'll be able to change
 the replication settings and that server in us-west could help seed new
 cassandra instances faster than the ones in us-east

 Or am I missing something really fundamental about how cassandra works
 making this a terrible plan?  I should have plenty of time to get my
 multi-DC working before the instance in us-west fills up (but even then, I
 should be able to add instances over there to stall fairly trivially,
 right?).

 Thanks!

 will





-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


practice failure recovery

2011-04-26 Thread William Oberman
In my test cluster I managed to jam up a cassandra server.  I figure the easy
 failsafe solution is to just boot a replacement node, but I thought I'd
try a minute to either figure out what I did, or try to figure out how to
properly recover it before I lose my current state.

The symptom = on startup I get an exception:
ERROR 11:58:34,567 Exception encountered during startup.
java.lang.IndexOutOfBoundsException: 6
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:121)
at
org.apache.cassandra.db.marshal.TimeUUIDType.compareTimestampBytes(TimeUUIDType.java:56)
at
org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:45)
at
org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:29)
at
java.util.concurrent.ConcurrentSkipListMap$ComparableUsingComparator.compareTo(ConcurrentSkipListMap.java:606)
at
java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:685)
at
java.util.concurrent.ConcurrentSkipListMap.doPut(ConcurrentSkipListMap.java:864)
at
java.util.concurrent.ConcurrentSkipListMap.putIfAbsent(ConcurrentSkipListMap.java:1893)
at
org.apache.cassandra.db.ColumnFamily.addColumn(ColumnFamily.java:216)
at
org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:130)
at
org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
at
org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:380)
at
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:253)
at
org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:156)
at
org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:173)
at
org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314)
at
org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79)

Where things went wrong = I had been doing various testing and unit testing,
as this is my proof of concept cluster.  The unit tests in particular work
by cloning a keyspace as keyspace_UUID (to get a blank slate).  Because of
various bugs in my code and configuration, this left a fair amount of crud
keyspaces by the time I got everything to pass.  So, I wrote a script to
drop all of the test keyspaces (the script had worked on a single node
environment, which was my first step before the cluster).  I think the CLI
doesn't wait for schema propagation, so the script confused the node I was
talking to, as after it ran the schema UUIDs of that node vs. the rest of
the cluster didn't agree (describe cluster in the CLI).  And, it wasn't
fixing itself.  nodetool loadbalance said it would do a
decommission/bootstrap, which I thought might give the bad node a kick in
the pants, so I tried it.  Afterwards, I ran nodetool ring against all
nodes and the problem node claimed all was UP, but everything else listed
the problem node as ? and everything else as UP (sadly, I either didn't
check or can't remember what nodetool ring said before loadbalance).  So,
I shut down the problem node.  But, when I tried to restart it, I got the
error you see above.

Not sure what was the worst/dumbest thing I did, but it's definitely unhappy
now!


Re: advice for EC2 deployment

2011-04-26 Thread William Oberman
I see what you're saying.  I was able to control write latency on mysql
using insert vs insert delayed (what I feel is MySQL's poor man's eventual
consistency option) + the fact that replication was a background
asynchronous process.  In terms of read latency, I was able to do up to a
few hundred well indexed mysql queries (across AZs) on a view while keeping
the overall latency of the page around or less than a second.

I'm basically replacing two use cases, the ones whose anticipated write
volumes are difficult to scale.  The first case was previously using insert
delayed (which I'm doing in cassandra as ONE) as I wasn't getting consistent
write/read operations before anyways.  The second case was using traditional
insert (which I was going to replace with some QUORUM-like level, I was
assuming LOCAL_QUORUM).  But, the latter case uses a write through memory
cache (memcache), so I don't know how often it really reads data from the
persistent store.  But I definitely need to make sure it is consistent.

In any case, it sounds like I'd be best served treating AZs as DCs, but then
I don't know what to use as racks.  Or do racks not matter in a single AZ?
That way I can get an ack from a LOCAL_QUORUM read/write before the
(slightly) slower read/write to/from the other AZ (for redundancy).  Then
I'm only screwed if Amazon has a multi-AZ failure (so far, they've kept it
to only one!) :-)

will

On Tue, Apr 26, 2011 at 5:01 PM, aaron morton aa...@thelastpickle.com wrote:

 One difference between Cassandra and MySQL replication may be when the
 network IO happens. Was the MySQL replication synchronous on transaction
 commit ?  I was only aware that it had async replication, which means the
 client is not exposed to the network latency. In cassandra the network
 latency is exposed to the client as it needs to wait for the CL number of
 nodes to respond.

 If you use the PropertyFileSnitch with the NetworkTopology you can
 manually assign machines to racks / dc's based on IP.
 See the conf/cassandra-topology.properties file.  There is also an Ec2Snitch which
 (from the code)
 /**
  * A snitch that assumes an EC2 region is a DC and an EC2 availability_zone
  *  is a rack. This information is available in the config for the node.

 Recent discussion on DC aware CL levels
 http://www.mail-archive.com/user@cassandra.apache.org/msg11414.html

 Hope that helps.
 Aaron


 On 27 Apr 2011, at 01:18, William Oberman wrote:

 Thanks Aaron!

 Unless no one on this list uses EC2, there were a few minor troubles end of
 last week through the weekend which taught me a lot about obscure failure
 modes in various applications I use :-)  My original post was trying to be
 more redundant than fast, which has been my overall goal from even before
 moving to Cassandra (my downtime from the EC2 madness was minimal, and due
 to only having one single point of failure == the amazon load balancer).  My
 secondary goal was trying to make moving to a second region easier, but if
 that is causing problems I can drop the idea.

 I might be downplaying the cost of inter-AZ communication, but I've lived
 with that for quite some time, for example my current setup of MySQL in
 Master-Master replication is split over zones, and my webservers live in yet
 different zones.  Maybe Cassandra is chattier than I'm used to?  (again,
 I'm fairly new to cassandra)

 Based on that article, the discussion, and the recent EC2 issues, it sounds
 like it would be better to start with:
 -6 nodes split in two AZs 3/3
 -Configure replication to do 2 in one AZ and one in the other
 (NetworkTopology treats AZs as racks, so does RF=3,us-east=3 make this
 happen naturally?)
 -What does LOCAL_QUORUM do in this case?  Is there a rack quorum?  Or
 does the natural latencies of AZs make LOCAL_QUORUM behave like a rack
 quorum?

 will

 On Tue, Apr 26, 2011 at 1:14 AM, aaron morton aa...@thelastpickle.com wrote:

 For background see this article:

 http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers


 And this recent discussion
 http://www.mail-archive.com/user@cassandra.apache.org/msg12502.html
 
 Issues that may be a concern:
 - lots of cross AZ latency in us-east, e.g. LOCAL_QUORUM ops must wait
 cross AZ . Also consider it during maintenance tasks, how much of a pain is
 it going to be to have latency between every node.
 - IMHO not having sufficient (by that I mean 3) replicas in a cassandra DC
 to handle a single node failure when working at Quorum reduces the utility
 of the DC. e.g. with a local RF of 2 in the west, the quorum is 2, and if
 you lose one node from the replica set you will not be able to use local
 QUORUM for keys in that range. Or consider a failure mode where the west is
 disconnected from the east.

 Could you

Re: practice failure recovery

2011-04-26 Thread William Oberman
Done and done.  I'm really loving how easy the nuclear option has been (it
was what I tested first).

will

On Tue, Apr 26, 2011 at 5:09 PM, aaron morton aa...@thelastpickle.com wrote:

 In 0.7.X the cli waits for the schema to agree before returning, you should
 see...

 Waiting for schema agreement...
 ... schemas agree across the cluster

 Or if things fail
 The schema has not settled in %d seconds; further migrations are
 ill-advised until it does.%nVersions are %s%n

 WRT the error, first guess is something in the schema has changed and it's
 upsetting the log replay. Given all the crazy, I'd go with the nuclear
 option.

 Aaron

 On 27 Apr 2011, at 07:11, William Oberman wrote:

  In my test cluster I managed to jam up a cassandra server.  I figure the
 easy  failsafe solution is to just boot a replacement node, but I thought
 I'd try a minute to either figure out what I did, or try to figure out how
 to properly recover it before I lose my current state.
 
  The symptom = on startup I get an exception:
  ERROR 11:58:34,567 Exception encountered during startup.
  java.lang.IndexOutOfBoundsException: 6
  at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:121)
  at
 org.apache.cassandra.db.marshal.TimeUUIDType.compareTimestampBytes(TimeUUIDType.java:56)
  at
 org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:45)
  at
 org.apache.cassandra.db.marshal.TimeUUIDType.compare(TimeUUIDType.java:29)
  at
 java.util.concurrent.ConcurrentSkipListMap$ComparableUsingComparator.compareTo(ConcurrentSkipListMap.java:606)
  at
 java.util.concurrent.ConcurrentSkipListMap.findPredecessor(ConcurrentSkipListMap.java:685)
  at
 java.util.concurrent.ConcurrentSkipListMap.doPut(ConcurrentSkipListMap.java:864)
  at
 java.util.concurrent.ConcurrentSkipListMap.putIfAbsent(ConcurrentSkipListMap.java:1893)
  at
 org.apache.cassandra.db.ColumnFamily.addColumn(ColumnFamily.java:216)
  at
 org.apache.cassandra.db.ColumnFamilySerializer.deserializeColumns(ColumnFamilySerializer.java:130)
  at
 org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:120)
  at
 org.apache.cassandra.db.RowMutation$RowMutationSerializer.deserialize(RowMutation.java:380)
  at
 org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:253)
  at
 org.apache.cassandra.db.commitlog.CommitLog.recover(CommitLog.java:156)
  at
 org.apache.cassandra.service.AbstractCassandraDaemon.setup(AbstractCassandraDaemon.java:173)
  at
 org.apache.cassandra.service.AbstractCassandraDaemon.activate(AbstractCassandraDaemon.java:314)
  at
 org.apache.cassandra.thrift.CassandraDaemon.main(CassandraDaemon.java:79)
 
  Where things went wrong = I had been doing various testing and unit
 testing, as this is my proof of concept cluster.  The unit tests in
 particular work by cloning a keyspace as keyspace_UUID (to get a blank
 slate).  Because of various bugs in my code and configuration, this left a
 fair amount of crud keyspaces by the time I got everything to pass.  So, I
 wrote a script to drop all of the test keyspaces (the script had worked on a
 single node environment, which was my first step before the cluster).  I
 think the CLI doesn't wait for schema propagation, so the script confused
 the node I was talking to, as after it ran the schema UUIDs of that node vs.
 the rest of the cluster didn't agree (describe cluster in the CLI).  And,
 it wasn't fixing itself.  nodetool loadbalance said it would do a
 decommission/bootstrap, which I thought might give the bad node a kick in
 the pants, so I tried it.  Afterwards, I ran nodetool ring against all
 nodes and the problem node claimed all was UP, but everything else listed
 the problem node as ? and everything else as UP (sadly, I either didn't
 check or can't remember what nodetool ring said before loadbalance).  So,
 I shut down the problem node.  But, when I tried to restart it, I got the
 error you see above.
 
  Not sure what was the worst/dumbest thing I did, but it's definitely
 unhappy now!




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: Ec2Snitch + NetworkTopologyStrategy if only in one region?

2011-04-20 Thread William Oberman
Also for the new users like me, don't assume DC1 is a keyword like I did.  A
working example of a keyspace in EC2 is:

create keyspace test with replication_factor=3 and strategy_options =
[{us-east:3}] and
placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy';

That's for a single DC in an EC2 deployment.  I felt silly afterwards, but I couldn't
find official docs on the structure of strategy_options anywhere.
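
For the multi-region shape discussed earlier in the thread (2 replicas in
us-east, 1 in us-west), I believe the same syntax just takes more entries,
along these lines (an untested sketch):

create keyspace test with replication_factor=3 and strategy_options =
[{us-east:2, us-west:1}] and
placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy';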

will

On Wed, Apr 13, 2011 at 5:14 PM, William Oberman
ober...@civicscience.com wrote:

 One last coda, for other noobs to cassandra like me.  If you use
 NetworkTopologyStrategy with replication_factor > 1, make sure you have EC2
 instance in multiple availability zones.  I was doing baby steps, and tried
 doing a cluster in one AZ (before spreading to multiple AZs) and was getting
 the most baffling errors (cassandra_UnavailableException).  I finally
 thought to check the cassandra server logs (after debugging the client code,
 firewalls, etc... painstakingly for connectivity problems), and it ends up
 my cassandra cluster was considering itself unavailable as it couldn't
 replicate as much as it wanted to.  I kind of wish a different word than
 unavailable was chosen for this error condition :-)

 will


 On Tue, Apr 12, 2011 at 10:37 PM, aaron morton aa...@thelastpickle.com wrote:

 If you can use standard + encoded I would go with that.

 Aaron

 On 13 Apr 2011, at 07:07, William Oberman wrote:

 Excellent to know! (and yes, I figure I'll expand someday, so I'm glad I
 found this out before digging a hole).

 The other issue I've been pondering is a normal column family of encoded
 objects (in my case JSON) vs. a super column.  Based on my use case, things
 I've read, etc...  right now I'm coming down on normal + encoded.

 will

 On Tue, Apr 12, 2011 at 2:57 PM, Jonathan Ellis jbel...@gmail.com wrote:

 NTS is overkill in the sense that it doesn't really benefit you in a
 single DC, but if you think you may expand to another DC in the future
 it's much simpler if you were already using NTS, than first migrating
 to NTS (changing strategy is painful).

 I can't think of any downsides to using NTS in a single-DC
 environment, so that's the safe option.

 On Tue, Apr 12, 2011 at 1:15 PM, William Oberman
 ober...@civicscience.com wrote:
  Hi,
 
  I'm getting closer to committing to cassandra, and now I'm in system/IT
  issues and questions.  I'm in the amazon EC2 cloud.  I previously used
 this
  forum to discover the best practice for disk layouts (large instance +
 the
  two ephemeral disks in RAID0 for data + root volume for everything
 else).
  Now I'm hoping to confirm bits and pieces of things I've read about for
  snitch/replication strategies.  I was thinking of using
  endpoint_snitch: org.apache.cassandra.locator.Ec2Snitch
 
 placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy'
  (for people hitting this from the mailing list or google, I feel
 obligated
  to note that the former setting is in cassandra.yaml, and the latter is
 an
  option on a keyspace).
 
  But, I'm only in one region. Is using the amazon snitch/networktopology
  overkill given everything I have is in one DC (I believe region==DC and
  availability_zone==rack).  I'm using multiple availability zones for
 some
  level of redundancy, I'm just not yet to the point I'm using multiple
  regions.  If someday I move to using multiple regions, would that
 change the
  answer?
 
  Thanks!
 
  --
  Will Oberman
  Civic Science, Inc.
  3030 Penn Avenue., First Floor
  Pittsburgh, PA 15201
  (M) 412-480-7835
  (E) ober...@civicscience.com
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com




 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com





 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


system_* consistency level?

2011-04-20 Thread William Oberman
Hi,

My unit tests started failing once I upgraded from a single node cassandra
cluster to a full N node cluster (I'm starting with 4).  I had a few
various bugs, mostly due to forgetting to read/write at a quorum level in
places I needed stronger consistency guarantees.  But, I kept getting
random, intermittent failures (the worst kind).  I'm 99% sure I see why,
after some painful debugging, but I don't know what to do about it.  The
basic flaw in my understanding of cassandra seems to boil down to: I thought
system mutations of keyspaces/column families were of a stronger
consistency than ONE, but that appears to not be true.  Any way for me to
update a cluster at something more like QUORUM?

The basic idea is in my unit test.setup() I clone my real keyspace as
keyspace_UUID (with all of the exact same CFs) to get a fresh space to play
in.  In a single node environment, no issues.  But, in a cluster, it seems
that it takes a while for the system_add_keyspace call to propagate.  No
worries I think, I just modify my setup() to do
describe_keyspace(keyspace_UUID) in a while loop until the cluster is
ready.  My random failures drop considerably, but every once in a while I
see a similar kind of failure.  Then I find out that schema updates seem to
propagate on a per node basis.  At least, that's what I have to assume as
I'm using phpcassa which uses a connection pool, and I see in my logging
that my setup() succeeds because one connection in the pool sees the new
keyspace, but when my tests run I grab a connection from the pool that is
missing it!

Do I have a solution other than changing my setup yet again to loop over all
cassandra servers doing a describe_keyspace()?

-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: system_* consistency level?

2011-04-20 Thread William Oberman
That was the trick.  Thanks!
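
For anyone else hitting this, the check boils down to polling
describe_schema_versions() until the reachable nodes all report a single
schema version.  A rough raw-Thrift sketch (untested; the host/port, the
SchemaAgreementWaiter name and the UNREACHABLE key are mine, so double-check
them against your client library):

import java.util.List;
import java.util.Map;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TFramedTransport;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class SchemaAgreementWaiter
{
    // Poll one node until every reachable node reports the same schema version,
    // or give up after timeoutMs.
    public static void waitForAgreement(String host, int port, long timeoutMs) throws Exception
    {
        TTransport transport = new TFramedTransport(new TSocket(host, port));
        transport.open();
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        long deadline = System.currentTimeMillis() + timeoutMs;
        try
        {
            while (System.currentTimeMillis() < deadline)
            {
                // schema version string -> endpoints currently reporting that version
                Map<String, List<String>> versions = client.describe_schema_versions();
                versions.remove("UNREACHABLE"); // ignore nodes that did not answer
                if (versions.size() == 1)
                    return; // all reachable nodes agree
                Thread.sleep(500);
            }
            throw new RuntimeException("schema did not settle within " + timeoutMs + "ms");
        }
        finally
        {
            transport.close();
        }
    }
}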

On Apr 20, 2011, at 6:05 PM, Jonathan Ellis jbel...@gmail.com wrote:

 See the comments for describe_schema_versions.

 On Wed, Apr 20, 2011 at 4:59 PM, William Oberman
 ober...@civicscience.com wrote:
 Hi,

 My unit tests started failing once I upgraded from a single node cassandra
 cluster to a full N node cluster (I'm starting with 4).  I had a few
 various bugs, mostly due to forgetting to read/write at a quorum level in
 places I needed stronger consistency guarantees.  But, I kept getting
 random, intermittent failure (the worst kind).  I'm 99% sure I see why,
 after some painful debugging, but I don't know what to do about it.  The
 basic flaw in my understanding of cassandra seems to boil down to: I thought
 system mutations of keyspaces/column families were of a stronger
 consistency than ONE, but that appears to not be true.  Any way for me to
 update a cluster at something more like QUORUM?

 The basic idea is in my unit test.setup() I clone my real keyspace as
 keyspace_UUID (with all of the exact same CFs) to get a fresh space to play
 in.  In a single node environment, no issues.  But, in a cluster, it seems
 that it takes a while for the system_add_keyspace call to propagate.  No
 worries I think, I just modify my setup() to do
 describe_keyspace(keyspace_UUID) in a while loop until the cluster is
 ready.  My random failures drop considerably, but every once in a while I
 see a similar kind of failure.  Then I find out that schema updates seem to
 propagate on a per node basis.  At least, that's what I have to assume as
 I'm using phpcassa which uses a connection pool, and I see in my logging
 that my setup() succeeds because one connection in the pool sees the new
 keyspace, but when my tests run I grab a connection from the pool that is
 missing it!

 Do I have a solution other than changing my setup yet again to loop over all
 cassandra servers doing a describe_keyspace()?

 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com




 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com


Re: Cassandra 0.7.4 and LOCAL_QUORUM Consistency level

2011-04-19 Thread William Oberman
I had a similar error today when I tried using LOCAL_QUORUM without having a
properly configured NetworkTopologyStrategy.  QUORUM worked fine however.

will
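
For what it's worth, "properly configured" in my case meant the whole
DC-aware chain: a snitch that actually reports data centers (Ec2Snitch on
EC2, or PropertyFileSnitch with a topology file roughly like the sketch
below) plus a keyspace created with NetworkTopologyStrategy, as in the
create keyspace examples elsewhere in this thread.  The IPs, DC and rack
names here are made up:

# conf/cassandra-topology.properties (illustrative only)
# node-ip=datacenter:rack
10.0.1.10=DC1:RAC1
10.0.1.11=DC1:RAC2
10.0.2.10=DC2:RAC1
# fallback for nodes not listed above
default=DC1:RAC1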

On Tue, Apr 19, 2011 at 8:52 PM, Oleg Tsvinev oleg.tsvi...@gmail.com wrote:

 Earlier I've posted the same message to a hector-users list.

 Guys,

 I'm a bit puzzled today. I'm using just released Hector 0.7.0-29
 (thank you, Nate!) and Cassandra 0.7.4 and getting the exception
 below, marked as (1) Exception. When I dig to Cassandra source code
 below, marked as (2) Cassandra source, I see that there's no check for
 LOCAL_QUORUM. I also see that (3) cassandra.thrift defines
 LOCAL_QUORUM as enum value 3 and in debugger, I see that LOCAL_QUORUM
 is a valid enum value.

 My question is, how is it possible to use LOCAL_QUORUM if Cassandra
 code throws exception when it sees it? Is it a bad merge or something?
 I know it worked before, from looking at
 https://issues.apache.org/jira/browse/CASSANDRA-2254

 :-\

 (1) Exception:

 2011-04-19 14:57:33,550 [pool-2-thread-49] ERROR
 org.apache.cassandra.thrift.Cassandra$Processor - Internal error
 processing batch_mutate
 java.lang.UnsupportedOperationException: invalid consistency level:
 LOCAL_QUORUM
   at
 org.apache.cassandra.service.WriteResponseHandler.determineBlockFor(WriteResponseHandler.java:99)
   at
 org.apache.cassandra.service.WriteResponseHandler.init(WriteResponseHandler.java:48)
   at
 org.apache.cassandra.service.WriteResponseHandler.create(WriteResponseHandler.java:61)
   at
 org.apache.cassandra.locator.AbstractReplicationStrategy.getWriteResponseHandler(AbstractReplicationStrategy.java:127)
   at
 org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:115)
   at
 org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:415)
   at
 org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:388)
   at
 org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3036)
   at
 org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
   at
 org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:206)
   at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)

 (2) Cassandra source (line 99 is throw statement)

   protected int determineBlockFor(String table)
   {
   int blockFor = 0;
   switch (consistencyLevel)
   {
   case ONE:
   blockFor = 1;
   break;
   case ANY:
   blockFor = 1;
   break;
   case TWO:
   blockFor = 2;
   break;
   case THREE:
   blockFor = 3;
   break;
   case QUORUM:
   blockFor = (writeEndpoints.size() / 2) + 1;
   break;
   case ALL:
   blockFor = writeEndpoints.size();
   break;
   default:
    throw new UnsupportedOperationException("invalid consistency level: " + consistencyLevel.toString());
   }
   // at most one node per range can bootstrap at a time, and
 these will be added to the write until
   // bootstrap finishes (at which point we no longer need to
 write to the old ones).
    assert 1 <= blockFor && blockFor <= 2 * Table.open(table).getReplicationStrategy().getReplicationFactor()
        : String.format("invalid response count %d for replication factor %d",
                        blockFor, Table.open(table).getReplicationStrategy().getReplicationFactor());
   return blockFor;
   }

 (3) cassandra.thrift:

 enum ConsistencyLevel {
   ONE = 1,
   QUORUM = 2,
   LOCAL_QUORUM = 3,
   EACH_QUORUM = 4,
   ALL = 5,
   ANY = 6,
   TWO = 7,
   THREE = 8,
 }




-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Re: Cassandra 0.7.4 and LOCAL_QUORUM Consistency level

2011-04-19 Thread William Oberman
Good point, should have read your message (and the code) more closely!

Sent from my iPhone

On Apr 19, 2011, at 9:16 PM, Oleg Tsvinev oleg.tsvi...@gmail.com wrote:

 I'm puzzled because the code does not even check for LOCAL_QUORUM before
 throwing the exception.
 Indeed I did not configure NetworkTopologyStrategy. Are you saying
 that it works after configuring it?

 On Tue, Apr 19, 2011 at 6:04 PM, William Oberman
 ober...@civicscience.com wrote:
 I had a similar error today when I tried using LOCAL_QUORUM without having a
 properly configured NetworkTopologyStrategy.  QUORUM worked fine however.
 will

 On Tue, Apr 19, 2011 at 8:52 PM, Oleg Tsvinev oleg.tsvi...@gmail.com
 wrote:

 Earlier I've posted the same message to a hector-users list.

 Guys,

 I'm a bit puzzled today. I'm using just released Hector 0.7.0-29
 (thank you, Nate!) and Cassandra 0.7.4 and getting the exception
 below, marked as (1) Exception. When I dig to Cassandra source code
 below, marked as (2) Cassandra source, I see that there's no check for
 LOCAL_QUORUM. I also see that (3) cassandra.thrift defines
 LOCAL_QUORUM as enum value 3 and in debugger, I see that LOCAL_QUORUM
 is a valid enum value.

 My question is, how is it possible to use LOCAL_QUORUM if Cassandra
 code throws exception when it sees it? Is it a bad merge or something?
 I know it worked before, from looking at
 https://issues.apache.org/jira/browse/CASSANDRA-2254

 :-\

 (1) Exception:

 2011-04-19 14:57:33,550 [pool-2-thread-49] ERROR
 org.apache.cassandra.thrift.Cassandra$Processor - Internal error
 processing batch_mutate
 java.lang.UnsupportedOperationException: invalid consistency level:
 LOCAL_QUORUM
   at
 org.apache.cassandra.service.WriteResponseHandler.determineBlockFor(WriteResponseHandler.java:99)
   at
 org.apache.cassandra.service.WriteResponseHandler.init(WriteResponseHandler.java:48)
   at
 org.apache.cassandra.service.WriteResponseHandler.create(WriteResponseHandler.java:61)
   at
 org.apache.cassandra.locator.AbstractReplicationStrategy.getWriteResponseHandler(AbstractReplicationStrategy.java:127)
   at
 org.apache.cassandra.service.StorageProxy.mutate(StorageProxy.java:115)
   at
 org.apache.cassandra.thrift.CassandraServer.doInsert(CassandraServer.java:415)
   at
 org.apache.cassandra.thrift.CassandraServer.batch_mutate(CassandraServer.java:388)
   at
 org.apache.cassandra.thrift.Cassandra$Processor$batch_mutate.process(Cassandra.java:3036)
   at
 org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2555)
   at
 org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:206)
   at
 java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
   at
 java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
   at java.lang.Thread.run(Thread.java:662)

 (2) Cassandra source (line 99 is throw statement)

   protected int determineBlockFor(String table)
   {
   int blockFor = 0;
   switch (consistencyLevel)
   {
   case ONE:
   blockFor = 1;
   break;
   case ANY:
   blockFor = 1;
   break;
   case TWO:
   blockFor = 2;
   break;
   case THREE:
   blockFor = 3;
   break;
   case QUORUM:
   blockFor = (writeEndpoints.size() / 2) + 1;
   break;
   case ALL:
   blockFor = writeEndpoints.size();
   break;
   default:
    throw new UnsupportedOperationException("invalid consistency level: " + consistencyLevel.toString());
   }
   // at most one node per range can bootstrap at a time, and
 these will be added to the write until
   // bootstrap finishes (at which point we no longer need to
 write to the old ones).
    assert 1 <= blockFor && blockFor <= 2 * Table.open(table).getReplicationStrategy().getReplicationFactor()
        : String.format("invalid response count %d for replication factor %d",
                        blockFor, Table.open(table).getReplicationStrategy().getReplicationFactor());
   return blockFor;
   }

 (3) cassandra.thrift:

 enum ConsistencyLevel {
   ONE = 1,
   QUORUM = 2,
   LOCAL_QUORUM = 3,
   EACH_QUORUM = 4,
   ALL = 5,
   ANY = 6,
   TWO = 7,
   THREE = 8,
 }



 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com



Re: Ec2Snitch + NetworkTopologyStrategy if only in one region?

2011-04-13 Thread William Oberman
One last coda, for other noobs to cassandra like me.  If you use
NetworkTopologyStrategy with replication_factor > 1, make sure you have EC2
instance in multiple availability zones.  I was doing baby steps, and tried
doing a cluster in one AZ (before spreading to multiple AZs) and was getting
the most baffling errors (cassandra_UnavailableException).  I finally
thought to check the cassandra server logs (after debugging the client code,
firewalls, etc... painstakingly for connectivity problems), and it ends up
my cassandra cluster was considering itself unavailable as it couldn't
replicate as much as it wanted to.  I kind of wish a different word than
unavailable was chosen for this error condition :-)

will

On Tue, Apr 12, 2011 at 10:37 PM, aaron morton aa...@thelastpickle.com wrote:

 If you can use standard + encoded I would go with that.

 Aaron

 On 13 Apr 2011, at 07:07, William Oberman wrote:

 Excellent to know! (and yes, I figure I'll expand someday, so I'm glad I
 found this out before digging a hole).

 The other issue I've been pondering is a normal column family of encoded
 objects (in my case JSON) vs. a super column.  Based on my use case, things
 I've read, etc...  right now I'm coming down on normal + encoded.

 will

 On Tue, Apr 12, 2011 at 2:57 PM, Jonathan Ellis jbel...@gmail.com wrote:

 NTS is overkill in the sense that it doesn't really benefit you in a
 single DC, but if you think you may expand to another DC in the future
 it's much simpler if you were already using NTS, than first migrating
 to NTS (changing strategy is painful).

 I can't think of any downsides to using NTS in a single-DC
 environment, so that's the safe option.

 On Tue, Apr 12, 2011 at 1:15 PM, William Oberman
 ober...@civicscience.com wrote:
  Hi,
 
  I'm getting closer to committing to cassandra, and now I'm in system/IT
  issues and questions.  I'm in the amazon EC2 cloud.  I previously used
 this
  forum to discover the best practice for disk layouts (large instance +
 the
  two ephemeral disks in RAID0 for data + root volume for everything
 else).
  Now I'm hoping to confirm bits and pieces of things I've read about for
  snitch/replication strategies.  I was thinking of using
  endpoint_snitch: org.apache.cassandra.locator.Ec2Snitch
 
 placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy'
  (for people hitting this from the mailing list or google, I feel
 obligated
  to note that the former setting is in cassandra.yaml, and the latter is
 an
  option on a keyspace).
 
  But, I'm only in one region. Is using the amazon snitch/networktopology
  overkill given everything I have is in one DC (I believe region==DC and
  availability_zone==rack).  I'm using multiple availability zones for
 some
  level of redundancy, I'm just not yet to the point I'm using multiple
  regions.  If someday I move to using multiple regions, would that change
 the
  answer?
 
  Thanks!
 
  --
  Will Oberman
  Civic Science, Inc.
  3030 Penn Avenue., First Floor
  Pittsburgh, PA 15201
  (M) 412-480-7835
  (E) ober...@civicscience.com
 



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com




 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com





-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com


Ec2Snitch + NetworkTopologyStrategy if only in one region?

2011-04-12 Thread William Oberman
Hi,

I'm getting closer to committing to cassandra, and now I'm in system/IT
issues and questions.  I'm in the amazon EC2 cloud.  I previously used this
forum to discover the best practice for disk layouts (large instance + the
two ephemeral disks in RAID0 for data + root volume for everything else).
Now I'm hoping to confirm bits and pieces of things I've read about for
snitch/replication strategies.  I was thinking of using
endpoint_snitch: org.apache.cassandra.locator.Ec2Snitch
placement_strategy='org.apache.cassandra.locator.NetworkTopologyStrategy'
(for people hitting this from the mailing list or google, I feel obligated
to note that the former setting is in cassandra.yaml, and the latter is an
option on a keyspace).
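
(For completeness, the RAID0 piece above is just software RAID across the two
ephemeral disks.  A rough sketch, noting that device names vary by instance
type and AMI, so check yours first:)

mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/sdb /dev/sdc
mkfs.ext4 /dev/md0                      # or whichever filesystem you prefer
mkdir -p /var/lib/cassandra/data        # Cassandra's default data directory
mount /dev/md0 /var/lib/cassandra/data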

But, I'm only in one region. Is using the amazon snitch/networktopology
overkill given everything I have is in one DC (I believe region==DC and
availability_zone==rack).  I'm using multiple availability zones for some
level of redundancy, I'm just not yet to the point I'm using multiple
regions.  If someday I move to using multiple regions, would that change the
answer?

Thanks!

-- 
Will Oberman
Civic Science, Inc.
3030 Penn Avenue., First Floor
Pittsburgh, PA 15201
(M) 412-480-7835
(E) ober...@civicscience.com

