Re: Dazed and confused with Cassandra on EC2 ...

2010-10-09 Thread Peter Schuller
 The main reason to set Xms=Xmx is so mlockall can tag the entire heap
 as "don't swap this out" on startup.  Secondarily, whenever the heap
 resizes upwards the JVM does a stop-the-world GC, but no, not really a
 big deal when your uptime is in days or weeks.

I'm not sure where this is coming from, assuming you mean a *full* STW
GC. I cannot remember ever observing this in all my years of heap
growth ;), and I don't see why it would be the case.

Heap *shrinkage*, on the other hand, is another matter: for both CMS
and G1, shrinkage never happens except on a Full GC, unfortunately. I'm
hoping G1 will eventually improve here, since its compacting design
should fundamentally make it feasible to implement efficiently.

Disregarding the mlockall issue, which is pretty specific to Cassandra,
the typical reason, as I have understood it, for setting Xms=Xmx on a
machine dedicated to running some particular JVM is this: if you have to
reserve the Xmx amount of memory anyway, and perhaps even expect it to
be reached, it is generally more efficient to just let the JVM use it
all from the start - both because of the usual effect of GC being more
efficient with larger heap sizes, and to avoid bad GC policy decisions
causing excessive GC activity rather than just growing the heap, which
you're fine with anyway.
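
To make the "use it all from the start" point concrete, here is a minimal
sketch (plain Java, standard java.lang.management API; the class name is
just for illustration) that prints what the JVM committed at startup
versus its ceiling. With Xms=Xmx the committed figure already equals the
max; otherwise it starts near Xms and creeps upward under load:

    import java.lang.management.ManagementFactory;
    import java.lang.management.MemoryUsage;

    // Print initial, committed and maximum heap sizes for this JVM.
    public class HeapSettings {
        public static void main(String[] args) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            System.out.printf("init      = %d MB%n", heap.getInit() / (1024 * 1024));
            System.out.printf("committed = %d MB%n", heap.getCommitted() / (1024 * 1024));
            System.out.printf("max       = %d MB%n", heap.getMax() / (1024 * 1024));
        }
    }

Running it once with -Xms3G -Xmx3G and once with -Xms1G -Xmx3G shows the
difference immediately.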

-- 
/ Peter Schuller


Re: Dazed and confused with Cassandra on EC2 ...

2010-10-09 Thread Peter Schuller
 Heap *shrinkage*, on the other hand, is another matter: for both CMS
 and G1, shrinkage never happens except on a Full GC, unfortunately. I'm

And to be clear, the implication here is that shrinkage normally
doesn't happen. The implication is *not* that you see fallbacks to
full GC for the purpose of shrinkage.

-- 
/ Peter Schuller


Re: Dazed and confused with Cassandra on EC2 ...

2010-10-08 Thread Jonathan Ellis
On Fri, Oct 8, 2010 at 4:54 AM, Jedd Rashbrooke
jedd.rashbro...@imagini.net wrote:
 On 8 October 2010 02:05, Matthew Dennis mden...@riptano.com wrote:
 Also, in general, you probably want to set Xms = Xmx (regardless of the
 value you eventually decide on for that).

  Matthew - we'd just about reached that conclusion!  Is it as big an
  issue for clusters that are up for days at a time, with maybe Xmx:3G?
  I'd kind of assumed that it'd get to its max pretty quickly.  I haven't
  been watching that with jconsole, but might watch it on the next startup.

The main reason to set Xms=Xmx is so mlockall can tag the entire heap
as "don't swap this out" on startup.  Secondarily, whenever the heap
resizes upwards the JVM does a stop-the-world GC, but no, not really a
big deal when your uptime is in days or weeks.
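
For anyone curious what the mlockall part looks like in practice, here is
a rough sketch of the general mechanism using JNA direct mapping - an
illustration under the assumption that JNA is on the classpath, with the
Linux constant values, not Cassandra's actual code:

    import com.sun.jna.Native;

    // Sketch: ask the kernel to keep this process's pages resident in RAM.
    public class LockMemory {
        private static final int MCL_CURRENT = 1; // lock pages already mapped (Linux value)
        private static final int MCL_FUTURE  = 2; // lock pages mapped from now on (Linux value)

        static {
            Native.register("c"); // bind the native method below directly to libc
        }

        private static native int mlockall(int flags);

        public static void main(String[] args) {
            int rc = mlockall(MCL_CURRENT);
            System.out.println(rc == 0 ? "address space locked in RAM"
                                       : "mlockall failed (insufficient privileges?)");
        }
    }

With Xms=Xmx the whole heap is committed up front, which is why locking
only the current mappings at startup is enough.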

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Dazed and confused with Cassandra on EC2 ...

2010-10-07 Thread Matthew Dennis
Also, in general, you probably want to set Xms = Xmx (regardless of the
value you eventually decide on for that).

If you set them equal, the JVM will just go ahead and allocate that amount
on startup.  If they're different, then when you grow above Xms it has to
allocate more and move a bunch of stuff around, and it may have to do this
multiple times.  Note that it does this at the worst possible time (i.e.
under heavy load, which is likely what caused you to grow past Xms in the
first place).
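
As a toy illustration of that growth behaviour (a sketch, not a benchmark;
the class name and chunk sizes are arbitrary), the committed heap climbs in
steps when Xms < Xmx and sits at the maximum from the start when they are
equal:

    import java.util.ArrayList;
    import java.util.List;

    // Retain memory in 8 MB chunks and watch the committed heap size change.
    public class HeapGrowth {
        public static void main(String[] args) {
            List<byte[]> retained = new ArrayList<byte[]>();
            Runtime rt = Runtime.getRuntime();
            for (int i = 0; i < 40; i++) {
                retained.add(new byte[8 * 1024 * 1024]); // keep ~8 MB alive per iteration
                System.out.printf("retained %3d MB, committed %4d MB, max %4d MB%n",
                        (i + 1) * 8, rt.totalMemory() >> 20, rt.maxMemory() >> 20);
            }
        }
    }

Try it with -Xms64m -Xmx512m and then with -Xms512m -Xmx512m.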

On Thu, Oct 7, 2010 at 2:49 PM, Peter Schuller
peter.schul...@infidyne.comwrote:

   There are some words on the 'Net - the recent pages on
   Riptano's site, in fact - that strongly encourage scaling out
   (left and right) rather than beefing up the boxes - and certainly
   we're seeing far less bother from GC using a much smaller
   heap - previously we'd been going up to 16GB, or even
   higher.  That was based on my previous positive experiences
   of getting better performance from memory-hog apps (e.g.
   Java) by giving them more memory.  In any case, it seems
   that using large amounts of memory on EC2 is just asking
   for trouble.

 Keep in mind that while GC tends to be more efficient with larger heap
 sizes, that does not always translate into better overall performance
 when other things have to be considered. In particular, in the case of
 Cassandra, if you waste 10-15 gigs of RAM on the JVM heap for a
 Cassandra instance which could live with e.g. 1 GB, you're actively
 taking away those 10-15 gigs of RAM from the operating system to use
 for the buffer cache. Particularly if you're I/O bound on reads, this
 could have very detrimental effects (assuming the data set size and
 locality are such that 15 GB of extra buffer cache makes a difference;
 usually, but not always, this is the case).

 So with Cassandra, in the general case, you definitely want to keep
 your heap size reasonable in relation to the actual live set (the amount
 of actually reachable data), rather than just cranking it up as much
 as possible.

 (The main issue here is also keeping it high enough to not OOM, given
 that exact memory demands are hard to predict; it would be absolutely
 great if the JVM were better at maintaining a reasonable heap size to
 live set size ratio, so that much less tweaking of heap sizes was
 necessary, but this is not the case.)
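
 One rough way to get a feel for the live set is to force a full collection
 and see what is still in use afterwards - a hypothetical sketch, and
 System.gc() is only a hint, so treat the number as a ballpark figure when
 deciding how far above the live set to put Xmx:

     import java.lang.management.ManagementFactory;
     import java.lang.management.MemoryUsage;

     // Crude live-set estimate: request a GC, then read the remaining heap usage.
     public class LiveSetEstimate {
         public static void main(String[] args) {
             System.gc();
             MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
             System.out.printf("~live set: %d MB of %d MB max%n",
                     heap.getUsed() >> 20, heap.getMax() >> 20);
         }
     }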

 --
 / Peter Schuller



Re: Dazed and confused with Cassandra on EC2 ...

2010-10-04 Thread Jedd Rashbrooke
 Hi Peter,

 Thanks again for your time and thoughts on this problem.

 We think we've got a bit ahead of the problem by just
 scaling back (quite savagely) on the rate that we try to
 hit the cluster.  Previously, with a surplus of optimism,
 we were throwing very big Hadoop jobs at Cassandra,
 including what I understand to be a worst-case usage
 (random reads).

 Now we're throttling right back on the number of parallel
 jobs that we fire from Hadoop, and we're seeing better
 performance, in terms of the boxes generally staying up
 as far as nodetool and other interactive sessions are
 concerned.

 As discussed, we've adopted quite a number of different
 approaches with GC - at the moment we've returned to:

 JVM_OPTS= \
-ea \
-Xms2G \
-Xmx3G \
-XX:+UseParNewGC \
-XX:+UseConcMarkSweepGC \
-XX:+CMSParallelRemarkEnabled \
-XX:SurvivorRatio=8 \
-XX:MaxTenuringThreshold=1 \
-XX:+HeapDumpOnOutOfMemoryError \
-Dcom.sun.management.jmxremote.port=8080 \
-Dcom.sun.management.jmxremote.ssl=false \
-Dcom.sun.management.jmxremote.authenticate=false

 ... which is much closer to the default as shipped - notable
 change is the heap size, which out of the box comes as 1G.
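
 If you want to double-check which of those flags the running JVM actually
 picked up, the standard RuntimeMXBean exposes them (roughly what jconsole
 shows in its VM summary) - a minimal sketch, with an arbitrary class name:

     import java.lang.management.ManagementFactory;

     // Print the flags this JVM was started with.
     public class ShowJvmArgs {
         public static void main(String[] args) {
             for (String arg : ManagementFactory.getRuntimeMXBean().getInputArguments()) {
                 System.out.println(arg);
             }
         }
     }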

 There are some words on the 'Net - the recent pages on
 Riptano's site, in fact - that strongly encourage scaling out
 (left and right) rather than beefing up the boxes - and certainly
 we're seeing far less bother from GC using a much smaller
 heap - previously we'd been going up to 16GB, or even
 higher.  That was based on my previous positive experiences
 of getting better performance from memory-hog apps (e.g.
 Java) by giving them more memory.  In any case, it seems
 that using large amounts of memory on EC2 is just asking
 for trouble.

 And because it's Amazon, more, smaller machines generally
 work out to the same CPU grunt per dollar, of course ..
 although the management costs go up.

 To answer your last question there - we'd been using some
 pretty beefy EC2 boxes, but now I think we'll head back
 to the 2-core, 7GB medium-ish sized machines.

 All IO still runs like a dog no matter how much money you
 spend, sadly.

 cheers,
 Jedd.


Re: Dazed and confused with Cassandra on EC2 ...

2010-10-02 Thread Peter Schuller
(sorry for the delay in following up on this thread)

  Actually, there's a question - is it 'acceptable' do you think
  for GC to take out a small number of your nodes at a time,
  so long as the bulk of the nodes (or at least enough that RF is
  greater than the number of nodes gone on STW GC) are okay?  I
  suspect this is a question peculiar to Amazon EC2, as I've never
  seen a box rendered non-communicative by a single core flat-lining.

Well, first of all I still find it very strange that GC takes nodes
down at all, unless one is specifically putting sufficient CPU load on
the cluster such that e.g. concurrent GC causes a problem. But in
particular, if you're still seeing those crazy long GC pause times,
IMO something is severely wrong and I would not personally
recommend going to production with that unresolved, since whatever the
cause is, it may suddenly start having other effects. Severely long
ParNew pause times are really not expected; the only two major causes
I can think of, at least when running on real hardware, and barring
JVM bugs, are (1) swapping, and (2) possibly extreme performance
penalties associated with a very full old generation, in which case the
solution is a larger heap. I don't remember whether you indicated any
heap statistics, so I'm not sure whether (2) is a possibility. But I
would expect OutOfMemory errors long before a ParNew takes 300+
*seconds*, just out of JVM policies w.r.t. acceptable GC efficiency.

Bottom line: 300+ seconds for a ParNew collection is *way way way*
out there. 300 *milli*-seconds is more along the lines of what one
might expect (usually lower than that). Even if you can seemingly
lessen the impact by using the throughput collector, I wouldn't be
comfortable with shrugging off whatever is happening.
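
For reference, one way to keep an eye on how pause time accumulates is the
standard GarbageCollectorMXBean interface, polled in-process or over JMX; a
minimal sketch (the collector names it prints, e.g. ParNew and
ConcurrentMarkSweep, are the ones HotSpot reports for the settings used in
this thread):

    import java.lang.management.GarbageCollectorMXBean;
    import java.lang.management.ManagementFactory;

    // Periodically print cumulative GC counts and times per collector.
    public class GcWatcher {
        public static void main(String[] args) throws InterruptedException {
            while (true) {
                for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                    long count = gc.getCollectionCount();
                    long millis = gc.getCollectionTime();
                    double avg = count > 0 ? (double) millis / count : 0.0;
                    System.out.printf("%-22s count=%-6d total=%-8d ms, avg=%.1f ms%n",
                            gc.getName(), count, millis, avg);
                }
                Thread.sleep(10000); // sample every 10 seconds
            }
        }
    }

A ParNew average anywhere near whole seconds would stand out immediately.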

That said, in terms of the effects on the cluster: I have not had much
hands-on experience with this, but I believe you'd expect a definite
visible impact from the point of view of clients. Cassandra is not
optimized for instantly detecting slow nodes and transparently working
around them with zero impact on clients; I don't think it is recommended
to be running a cluster with nodes regularly bouncing in and out, for
whatever reason, if it can be avoided.

Not sure what to say, other than to strongly recommend getting to the
bottom of this problem, which seems non-specific to Cassandra, before
relying on the system in a production setting. The extremity of the
issues you're seeing is far beyond what I would ever expect, even
allowing for who knows what EC2 is doing or what other people are
running on the machine, except for the hypothesis that they
over-commit memory and the extreme latencies are due to swapping. But
if that is what is happening, that just tells me that EC2 is unusable
for this type of thing - and I still think it's far-fetched, since the
impact should be significant for a great number of their customers.

I forget, and I didn't find it by briefly sifting through the thread
history: were you running on small EC2 instances or larger ones?

-- 
/ Peter Schuller


Re: Dazed and confused with Cassandra on EC2 ...

2010-09-28 Thread Jedd Rashbrooke
 Peter - my apologies for the slow response - we had
 to divert down a 'Plan B' approach last week involving
 MySQL, memcache, redis and various other uglies.

On 20 September 2010 23:11, Peter Schuller peter.schul...@infidyne.com wrote:
 Are you running an old JVM by any chance? (Just grasping for straws.)

 JVM is Sun's 1.6 - I've been caught out once before with
 openjdk's performance challenges, so I'm particularly
 careful with this now.

 Hmm. I can see useless spinning decreasing efficiency, but the numbers
 from your log are really extreme. Do you have a URL / bug id or
 anything that one can read up on about this?

 We've rebuilt the Cassandra cluster this week, avoiding
 Hadoop entirely - partly to reduce the variables in play,
 and partly because it looks like we'll only need two 'feeder'
 nodes for our jobs with the size of Cassandra cluster that
 we're likely going to end up with (10-12 ish).  Any ratio
 higher than that seems to, on EC2 at least, cause too many
 fails on the Cassandra side.

 Actually, there's a question - is it 'acceptable' do you think
 for GC to take out a small number of your nodes at a time,
 so long as the bulk of the nodes (or at least enough that RF is
 greater than the number of nodes gone on STW GC) are okay?  I
 suspect this is a question peculiar to Amazon EC2, as I've never
 seen a box rendered non-communicative by a single core flat-lining.

 By the end of this week we hope to have a better idea (mind,
 I've thought that for the past 5 weeks of experimenting).  If I'm
 back to square one at that point I'll start pastebining some logs
 and configs.  Increasingly, I'm convinced that many of these
 problems would be solved if we hosted our own servers.

 cheers,
 Jedd.


Re: Dazed and confused with Cassandra on EC2 ...

2010-09-20 Thread Dave Gardner
As a follow up to this conversation; we are still having issues with our
Cassandra cluster on EC2.

It *looks* to be related to Garbage Collection; however we aren't sure what
the root cause of the problem is. Here is an extract from logs:

 INFO [GMFD:1] 2010-09-20 15:22:00,242 Gossiper.java (line 578) InetAddress
/10.102.57.197 is now UP
 INFO [HINTED-HANDOFF-POOL:1] 2010-09-20 15:22:00,242
HintedHandOffManager.java (line 165) Started hinted handoff for endPoint /
10.102.57.197
 INFO [HINTED-HANDOFF-POOL:1] 2010-09-20 15:22:00,247
HintedHandOffManager.java (line 222) Finished hinted handoff of 0 rows to
endpoint /10.102.57.197
 INFO [GC inspection] 2010-09-20 15:27:42,046 GCInspector.java (line 129) GC
for ParNew: 325411 ms, 84284896 reclaimed leaving 640770336 used; max is
25907560448
 INFO [WRITE-/10.102.7.187] 2010-09-20 15:27:42,052
OutboundTcpConnection.java (line 105) error writing to /10.102.7.187
 INFO [GC inspection] 2010-09-20 15:27:42,082 GCInspector.java (line 150)
Pool NameActive   Pending
 INFO [GC inspection] 2010-09-20 15:27:42,083 GCInspector.java (line 156)
STREAM-STAGE  0 0
 INFO [GC inspection] 2010-09-20 15:27:42,083 GCInspector.java (line 156)
RESPONSE-STAGE0 3
 INFO [GC inspection] 2010-09-20 15:27:42,083 GCInspector.java (line 156)
ROW-READ-STAGE625
 INFO [GC inspection] 2010-09-20 15:27:42,084 GCInspector.java (line 156)
LB-OPERATIONS 0 0
 INFO [GC inspection] 2010-09-20 15:27:42,084 GCInspector.java (line 156)
MISCELLANEOUS-POOL0 0
 INFO [GC inspection] 2010-09-20 15:27:42,084 GCInspector.java (line 156)
GMFD  1   129
 INFO [GC inspection] 2010-09-20 15:27:42,084 GCInspector.java (line 156)
CONSISTENCY-MANAGER   0 0
 INFO [GC inspection] 2010-09-20 15:27:42,085 GCInspector.java (line 156)
LB-TARGET 0 0
 INFO [GC inspection] 2010-09-20 15:27:42,085 GCInspector.java (line 156)
ROW-MUTATION-STAGE1 1
 INFO [GC inspection] 2010-09-20 15:27:42,085 GCInspector.java (line 156)
MESSAGE-STREAMING-POOL0 0
 INFO [GC inspection] 2010-09-20 15:27:42,086 GCInspector.java (line 156)
LOAD-BALANCER-STAGE   0 0
 INFO [GC inspection] 2010-09-20 15:27:42,086 GCInspector.java (line 156)
FLUSH-SORTER-POOL 0 0
 INFO [GC inspection] 2010-09-20 15:27:42,086 GCInspector.java (line 156)
MEMTABLE-POST-FLUSHER 0 0
 INFO [GC inspection] 2010-09-20 15:27:42,086 GCInspector.java (line 156)
AE-SERVICE-STAGE  0 0
 INFO [GC inspection] 2010-09-20 15:27:42,087 GCInspector.java (line 156)
FLUSH-WRITER-POOL 0 0
 INFO [GC inspection] 2010-09-20 15:27:42,087 GCInspector.java (line 156)
HINTED-HANDOFF-POOL   0 0
 INFO [GC inspection] 2010-09-20 15:27:42,087 GCInspector.java (line 161)
CompactionManager   n/a 0
 INFO [GMFD:1] 2010-09-20 15:27:42,088 Gossiper.java (line 592) Node /
10.102.7.187 has restarted, now UP again
 INFO [GMFD:1] 2010-09-20 15:27:42,089 StorageService.java (line 548) Node /
10.102.7.187 state jump to normal
 INFO [GMFD:1] 2010-09-20 15:27:42,089 StorageService.java (line 555) Will
not change my token ownership to /10.102.7.187
 INFO [HINTED-HANDOFF-POOL:1] 2010-09-20 15:27:42,089
HintedHandOffManager.java (line 165) Started hinted handoff for endPoint /
10.102.7.187
 INFO [HINTED-HANDOFF-POOL:1] 2010-09-20 15:27:42,112
HintedHandOffManager.java (line 222) Finished hinted handoff of 0 rows to
endpoint /10.102.7.187
 INFO [WRITE-/10.102.7.187] 2010-09-20 15:27:42,389
OutboundTcpConnection.java (line 105) error writing to /10.102.7.187
 INFO [WRITE-/10.102.57.197] 2010-09-20 15:34:15,911
OutboundTcpConnection.java (line 105) error writing to /10.102.57.197
 INFO [GC inspection] 2010-09-20 15:34:15,924 GCInspector.java (line 129) GC
for ParNew: 372272 ms, 82471240 reclaimed leaving 671616240 used; max is
25907560448
 INFO [GC inspection] 2010-09-20 15:34:15,925 GCInspector.java (line 150)
Pool NameActive   Pending
 INFO [GC inspection] 2010-09-20 15:34:15,925 GCInspector.java (line 156)
STREAM-STAGE  0 0
 INFO [GC inspection] 2010-09-20 15:34:15,926 GCInspector.java (line 156)
RESPONSE-STAGE021
 INFO [Timer-0] 2010-09-20 15:34:15,926 Gossiper.java (line 180) InetAddress
/10.102.7.187 is now dead.
 INFO [GC inspection] 2010-09-20 15:34:15,926 GCInspector.java (line 156)
ROW-READ-STAGE185
 INFO [GC inspection] 2010-09-20 15:34:15,926 GCInspector.java (line 156)
LB-OPERATIONS 0 0
 INFO [GC inspection] 2010-09-20 15:34:15,935 GCInspector.java (line 156)
MISCELLANEOUS-POOL0 0
 INFO [GC inspection] 2010-09-20 15:34:15,935 

Re: Dazed and confused with Cassandra on EC2 ...

2010-09-20 Thread Peter Schuller
 Can anyone help shed any light on why this might be happening? We've tried a
 variety of JVM settings to alleviate this; currently with no luck.

Extremely long ParNew (young generations) pause times are almost
always due to swapping. Are you swapping?

-- 
/ Peter Schuller


Re: Dazed and confused with Cassandra on EC2 ...

2010-09-20 Thread Dave Gardner
Nope - no swap enabled.


top - 16:53:14 up 12 days,  6:11,  3 users,  load average: 1.99, 2.63, 5.03
Tasks: 133 total,   1 running, 132 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,
0.0%st
Mem:  35840228k total, 33077580k used,  2762648k free,   263388k buffers
Swap:0k total,0k used,0k free, 29156108k cached

  PID USER  PR  NI  VIRT  RES  SHR S %CPU %MEMTIME+  SWAP
COMMAND

 3398 root  20   0  203g 1.2g 186m S0  3.5  5124095h 202g
jsvc

 4162 hadoop20   0 9095m  84m 9544 S0  0.2  5124095h 8.8g
java

32153 hadoop20   0 9057m 154m 9416 S0  0.4  5124095h 8.7g
java

18091 hadoop20   0 9257m 561m 9460 S0  1.6 17232821w 8.5g
java

 4267 hadoop20   0 2400m  85m 9404 S0  0.2  5124095h 2.3g
java

 4289 hadoop20   0 2337m  78m 9348 S0  0.2   0:01.44 2.2g
java






On 20 September 2010 17:48, Peter Schuller peter.schul...@infidyne.comwrote:

  Can anyone help shed any light on why this might be happening? We've
 tried a
  variety of JVM settings to alleviate this; currently with no luck.

 Extremely long ParNew (young generations) pause times are almost
 always due to swapping. Are you swapping?

 --
 / Peter Schuller



Re: Dazed and confused with Cassandra on EC2 ...

2010-09-20 Thread Dave Gardner
One other question for the list:

I gather GMFD is gossip stage - but what does this actually mean? Is it an
issue to have 203 pending operations?

Thanks

Dave

 INFO [GC inspection] 2010-09-20 16:56:12,792 GCInspector.java (line 129) GC
for ParNew: 127970 ms, 570382800 reclaimed leaving 460688576 used; max is
6576406528
 INFO [GC inspection] 2010-09-20 16:56:12,803 GCInspector.java (line 150)
Pool NameActive   Pending
 INFO [GC inspection] 2010-09-20 16:56:12,803 GCInspector.java (line 156)
STREAM-STAGE  0 0
 INFO [GC inspection] 2010-09-20 16:56:12,803 GCInspector.java (line 156)
RESPONSE-STAGE2 2
 INFO [GC inspection] 2010-09-20 16:56:12,804 GCInspector.java (line 156)
ROW-READ-STAGE   1212
 INFO [GC inspection] 2010-09-20 16:56:12,804 GCInspector.java (line 156)
LB-OPERATIONS 0 0
 INFO [GC inspection] 2010-09-20 16:56:12,804 GCInspector.java (line 156)
MISCELLANEOUS-POOL0 0
 INFO [GC inspection] 2010-09-20 16:56:12,804 GCInspector.java (line 156)
GMFD  1   203
 INFO [GC inspection] 2010-09-20 16:56:12,805 GCInspector.java (line 156)
CONSISTENCY-MANAGER   0 0
 INFO [GC inspection] 2010-09-20 16:56:12,805 GCInspector.java (line 156)
LB-TARGET 0 0
 INFO [GC inspection] 2010-09-20 16:56:12,805 GCInspector.java (line 156)
ROW-MUTATION-STAGE0 0
 INFO [GC inspection] 2010-09-20 16:56:12,806 GCInspector.java (line 156)
MESSAGE-STREAMING-POOL0 0
 INFO [GC inspection] 2010-09-20 16:56:12,806 GCInspector.java (line 156)
LOAD-BALANCER-STAGE   0 0
 INFO [GC inspection] 2010-09-20 16:56:12,806 GCInspector.java (line 156)
FLUSH-SORTER-POOL 0 0
 INFO [GC inspection] 2010-09-20 16:56:12,806 GCInspector.java (line 156)
MEMTABLE-POST-FLUSHER 0 0
 INFO [GC inspection] 2010-09-20 16:56:12,807 GCInspector.java (line 156)
AE-SERVICE-STAGE  0 0
 INFO [GC inspection] 2010-09-20 16:56:12,807 GCInspector.java (line 156)
FLUSH-WRITER-POOL 0 0
 INFO [GC inspection] 2010-09-20 16:56:12,807 GCInspector.java (line 156)
HINTED-HANDOFF-POOL   0 0
 INFO [GC inspection] 2010-09-20 16:56:12,808 GCInspector.java (line 161)
CompactionManager   n/a 1


Re: Dazed and confused with Cassandra on EC2 ...

2010-09-20 Thread Peter Schuller
 Nope - no swap enabled.

Something is seriously weird, unless the system clock is broken... Given:

INFO [GC inspection] 2010-09-20 15:27:42,046 GCInspector.java (line
129) GC for ParNew: 325411 ms, 84284896 reclaimed leaving 640770336
used; max is 25907560448
 INFO [GC inspection] 2010-09-20 15:34:15,924 GCInspector.java (line
129) GC for ParNew: 372272 ms, 82471240 reclaimed leaving 671616240
used; max is 25907560448

We have *extremely* slow ParNew collections on a heap that is not even
close to being full. I highly doubt fragmentation is causing this kind
of extremity.
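
If it helps, here is a small sketch for pulling the numbers out of
GCInspector lines like the two quoted above, so they can be grepped or
graphed over time (the class name and regex are mine, matching just the
format shown here):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Extract collector name, pause duration and heap usage from a GCInspector line.
    public class GcLineParser {
        private static final Pattern GC_LINE = Pattern.compile(
                "GC for (\\w+): (\\d+) ms, (\\d+) reclaimed leaving (\\d+) used; max is (\\d+)");

        public static void main(String[] args) {
            String line = "GC for ParNew: 325411 ms, 84284896 reclaimed leaving "
                    + "640770336 used; max is 25907560448";
            Matcher m = GC_LINE.matcher(line);
            if (m.find()) {
                System.out.printf("collector=%s pause=%s ms, used after=%s of %s bytes%n",
                        m.group(1), m.group(2), m.group(4), m.group(5));
            }
        }
    }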

I wonder if it is possible that EC2 instances are over-committed on
memory such that swapping is happening behind the scenes on the
host... but I have always assumed memory was not over-committed on
EC2.

Can you run with -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps?

-- 
/ Peter Schuller


Re: Dazed and confused with Cassandra on EC2 ...

2010-09-17 Thread Dave Viner
Hi Jedd,

I'm using Cassandra on EC2 as well - so I'm quite interested.

Just to clarify your post - it sounds like you have 4 questions/issue:

1. Writes have slowed down significantly.  What's the logical explanation?
And what are the logical solutions/options to address it?

2. You grew from 2 nodes to 4, but the original 2 nodes have 200GB and the 2
new ones have 40 GB.  What's the recommended practice for rebalancing (i.e.,
when should you do it), what's the actual procedure, and what's the expected
impact of it?

3. Cassandra nodes disappear.  (I'm not quite clear what this means.)

4. You took a machine offline without decommissioning it from the cluster.
 Now the machine is gone, but the other nodes (in Gossip logs) report that
they are still looking for it.  How do you stop nodes from looking for a
removed node?

I'm not trying to put words in your mouth - but I want to make sure that I
understand what you're asking about (because I have similar ec2-related
thoughts).  Let me know if this is an accurate summary.

Dave Viner


On Fri, Sep 17, 2010 at 7:41 AM, Jedd Rashbrooke 
jedd.rashbro...@imagini.net wrote:

  Howdi,

  I've just landed in an experiment to get Cassandra going, and
  fed by PHP via Thrift via Hadoop, all running on EC2.  I've been
  lurking a bit on the list for a couple of weeks, mostly reading any
  threads with the word 'performance' in them.  Few people have
  anything polite to say about EC2, but I want to just throw out
  some observations and get some feedback on whether what
  I'm seeing is even approaching any kind of normal.

  My background is mostly *nix and networking, with half-way
  decent understanding of DB's -- but Cassandra, Hadoop, Thrift
  and EC2 are all fairly new to me.

  We're using a four-node decently-specced (m2.2xlarge, if you're
  EC2-aware) cluster - 32GB, 4-core, if you're not :)  I'm using
  Ubuntu with the Deb packages for Cassandra and Hadoop, and
  some fairly conservative tweaks to things like JVM memory
  (bumping them up to 4GB, then 16GB).

  One of our insert jobs - a mapper only process - was running
  pretty fast a few days ago.  Somewhere around a million lines
  of input, split into a dozen files, inserting via a Hadoop job in
  about a half hour.  Happy times.  This was when the cluster
  was modestly sized - 20-50GB.  It's now about 200GB, and
  performance has dropped by an order of magnitude - perhaps
  5-6 hours to do the same amount of work, using the same
  codebase and the same input data.

  I've read that reads slow down as the DB grows, but had an
  expectation that writes would be consistently snappy.  How
  surprising is this performance drop given the DB growth?

  My 4-node cluster started off as a 2-node - and now nodetool
  ring suggests the two original nodes are 200GB each, and
  the newer two are 40GB.  Is this normal?  Would a rebalance
  likely improve performance substantially?  My feeling is that
  it would be expensive to perform.

  EC2 seems to get a bad rap, and we're feeling quite a bit of
  pain, which is sad given the (on paper) spec of the machines,
  and the cost - over US$3k/month for the cluster.  I've split
  Cassandra commitlog, Cassandra data, hadoop(hdfs) and
  tmp onto separate 'spindles' - observations so far suggest
  late '90's disk IO speed (15MB max sustained writes, one
  machine, one disk to another), and consistently inconsistent
  performance (identical machine next to it running the same
  task at the same time was getting 28MB) over several hours.

  Cassandra nodes seem to disappear too easily - even
  with just one core (out of four) maxed out with a jsvc task,
  and minimal disk or network activity, the machine feels very
  sluggish.  Tailing the Cassandra logs hints that it's doing
  hinted handoffs and occasionally compaction tasks.  I've
  never seen this kind of behaviour - and suspect this is
  more a feature of EC2.

  Gossip now seems to be pining for the loss of an older machine
  (that I stupidly took offline briefly - EC2 gave it a new IP address
  when it came back).  There's nothing in the storage-conf to
  refer to the old address, and all 4 Cassandra daemons have been
  re-started several times since, but gossip occasionally (a day
  later) says that it is looking for it - and, more worryingly, that
  it is 'now part of the cluster'.  I'm unsure if this is just an
  irritation or part of the underlying problem.

  What I'm going to do next is to try importing some data into
  a local machine - it's just time-consuming to pull in our S3
  data - and see if I can fake up to around the same capacity
  and watch for performance degradation.

  I'm also toying with the idea of going from 4 to 8 nodes,
  but I'm clueless on whether / how much this would help.

  As I say, though, I'm keen on anyone else's observations on
  my observations - I'm painfully aware that I'm juggling a lot
  of unknown factors at the moment.

  cheers,
  Jedd.



Re: Dazed and confused with Cassandra on EC2 ...

2010-09-17 Thread Robert Coli

 On 9/17/10 7:41 AM, Jedd Rashbrooke wrote:

  Happy times.  This was when the cluster
  was modestly sized - 20-50GB.  It's now about 200GB, and
  performance has dropped by an order of magnitude - perhaps
  5-6 hours to do the same amount of work, using the same
  codebase and the same input data.

You don't mention which version of the deb package you're using, but:

https://issues.apache.org/jira/browse/CASSANDRA-1214

is a performance hit which occurs as a result of the growth in data
size, and could be your issue.


Have you checked for swapping behavior like the above while your system 
is unhappy?


=Rob



Re: Dazed and confused with Cassandra on EC2 ...

2010-09-17 Thread Jedd Rashbrooke
 Hi Dave,

 Thank you for your response.

 I can clarify a couple of things here:

 2. You grew from 2 nodes to 4, but the original 2 nodes have 200GB and the 2
 new ones have 40 GB.  What's the recommended practice for rebalancing (i.e.,
 when should you do it), what's the actual procedure, and what's the expected
 impact of it?

 + is it likely to cause a problem in the short term if I don't (i.e.
 if I just wait until 'normal activity' somehow evens out the
 distribution of data)?

 3. Cassandra nodes disappear.  (I'm not quite clear what this means.)

 Nodetool reports the node as down.  I'm seeing lots of 'machine-x is DOWN'
 messages in the logs.  Flapping, actually.  I don't have any swap configured
 (which I've read somewhere might induce flapping).

 The machine also feels like it goes on a hiatus - separately, but typically
 observed at the same time.  A tail -f on the Cassandra logs stalls for several
 minutes, and pending ssh's to the box also stall until 'something' happens that
 releases the machine from its slumber.  Typically that something is a
 message in the logs that a compaction of a hinted handoff has completed.

 As I say, nmon/top show minimal network and disk activity, and just one
 of the four cores flatlining during this time.  The machine *should* be
 more responsive.

 Actually:   http://pastebin.com/AeM2VgL3

 All the machines referenced in there are ones that are in the cluster now.


 4. You took a machine offline without decommissioning it from the cluster.
  Now the machine is gone, but the other nodes (in Gossip logs) report that
 they are still looking for it.  How do you stop nodes from looking for a
 removed node?

 I was attempting to drain the thing first, but that was stalling, so I stopped
 Cassandra and then stopped the box.  The storage and config were on EBS
 (persistent disk) so they came back - it's just that the IP address of the
 machine changed.  I typically use my own assigned hostnames (cass-01,
 cass-02, etc., say), but for proper resolution I use the EC2 'internal
 hostnames'.  These were updated across all four Cassandra boxes, the other
 three instances of Cassandra were stopped, and then all four were brought
 back up.


 You say you have similar EC2-related thoughts .. have you done much on
 the EC2 hardware so far?  Are you seeing the same kind of thing?

 cheers,
 Jedd.


Re: Dazed and confused with Cassandra on EC2 ...

2010-09-17 Thread Jedd Rashbrooke
 Hi Rob,

 Thanks for your suggestions.  I should have been a bit more verbose
 in my platform description -- I'm using 64-bit instances, which (going
 by a Ben Black video I saw) should lead to a sensible default usage of
 mmap when left at auto.  Should I look at forcing this setting?

 You don't mention which version of the deb package you're using, but :

 I'm using 0.6.5 - the ones bundled by Eric Evans eev...@apache.org

 Because these are 32GB machines, I've not configured them with
 any swap at all.  I've rarely done this in the past - but I was aware
 of the swap-hell scenario with JVMs, and the rationale
 makes sense -- it's better to have the JVM crash 'cleanly' than
 to have it grind the machine to a halt and make it impossible
 to get onto the machine to kill the process.

 cheers,
 Jedd.