Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread David Boxenhorn
I think very high uptime, and very low data loss is achievable in
Cassandra, but, for new users there are TONS of gotchas. You really
have to know what you're doing, and I doubt that many people acquire
that knowledge without making a lot of mistakes.

I see above that most people are talking about configuration issues.
But, the first thing that you will probably do, before you have any
experience with Cassandra(!), is architect your system. Architecture
is not easily changed when you bump into a gotcha, and for some reason
you really have to search the literature well to find out about them.
So, my contributions:

The too many CFs problem. Cassandra doesn't do well with many column
families. If you come from a relational world, a real application can
easily have hundreds of tables. Even if you combine them into entities
(which is the Cassandra way), you can easily end up with dozens of
entities. The most natural thing for someone with a relational
background is to have one CF per entity, plus indexes according to your
needs. Don't do it. You need to store multiple entities in the same
CF. Group them together according to access patterns (i.e. when you
use X, you probably also need Y), and distinguish them by adding a
prefix to their keys (e.g. entityName@key).
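
To make that concrete, here is a rough cassandra-cli sketch (the CF and
key names are made up, and exact CLI attribute names and syntax vary a
little between 0.7 and 0.8):

    create column family Entities
        with comparator = UTF8Type
        and key_validation_class = UTF8Type
        and default_validation_class = UTF8Type;

    set Entities['user@42']['email'] = 'someone@example.com';
    set Entities['user@42']['name'] = 'Some One';
    set Entities['order@1001']['user_id'] = '42';
    set Entities['order@1001']['status'] = 'shipped';

Users and orders live in the same CF; the 'user@' / 'order@' key prefix is
the only thing distinguishing them, so entities that are read together stay
together and the total CF count stays small.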

Don't use supercolumns, use composite columns. Supercolumns are
disfavored by the Cassandra community and are slowly being orphaned.
For example, secondary indexes don't work on supercolumns. Nor does
CQL. Bugs crop up with supercolumns that don't happen with regular
columns because internally there's a huge separate code base for
supercolumns, and every new feature is designed and implemented for
regular columns and then retrofitted for supercolumns (or not).

There should really be a database of gotchas somewhere, and how they
were solved...

On Thu, Jun 23, 2011 at 6:57 AM, Les Hazlewood l...@katasoft.com wrote:
 Edward,
 Thank you so much for this reply - this is great stuff, and I really
 appreciate it.
 You'll be happy to know that I've already pre-ordered your book.  I'm
 looking forward to it! (When is the ship date?)
 Best regards,
 Les

 On Wed, Jun 22, 2011 at 7:03 PM, Edward Capriolo edlinuxg...@gmail.com
 wrote:


 On Wed, Jun 22, 2011 at 8:31 PM, Les Hazlewood l...@katasoft.com wrote:

 Hi Thoku,
 You were able to more concisely represent my intentions (and their
 reasoning) in this thread than I was able to do so myself.  Thanks!

 On Wed, Jun 22, 2011 at 5:14 PM, Thoku Hansen tho...@gmail.com wrote:

 I think that Les's question was reasonable. Why *not* ask the community
 for the 'gotchas'?
 Whether the info is already documented or not, it could be an
 opportunity to improve the documentation based on users' perception.
 The "you just have to learn" responses are fair also, but that reminds
 me of the days when running Oracle was a black art, and accumulated wisdom
 made DBAs irreplaceable.

 Yes, this was my initial concern.  I know that Cassandra is still young,
 and I expect this to be the norm for a while, but I was hoping to make that
 process a bit easier (for me and anyone else reading this thread in the
 future).

 Some recommendations *are* documented, but they are dispersed / stale /
 contradictory / or counter-intuitive.
 Others have not been documented in the wiki nor in DataStax's doco, and
 are instead learned anecdotally or The Hard Way.
 For example, whether documented or not, some of the 'gotchas' that I
 encountered when I first started working with Cassandra were:
 * Don't use OpenJDK. Prefer the Sun JDK. (Wiki says this, Jira says
 that).
 * It's not viable to run without JNA installed.
 * Disable swap memory.
 * Need to run nodetool repair on a regular basis.
 I'm looking forward to Edward Capriolo's Cassandra book which Les will
 probably find helpful.

 Thanks for linking to this.  I'm pre-ordering right away.
 And thanks for the pointers, they are exactly the kind of enumerated
 things I was looking to elicit.  These are the kinds of things that are hard
 to track down in a single place.  I think it'd be nice for the community to
 contribute this stuff to a single page ('best practices', 'checklist',
 whatever you want to call it).  It would certainly make things easier when
 getting started.
 Thanks again,
 Les

 Since I got a plug on the book I will chip in again to the thread :)

 Some things that were mentioned already:

 Absolutely install JNA (without it the snapshot command has to fork to
 hard link the sstables; I have seen clients back off from this). Also, the
 performance-focused Cassandra devs always try to squeeze out performance by
 utilizing more native features.

 OpenJDK vs Sun. I agree, almost always try to do what 'most others' do in
 production, this way you get surprised less.

 Other stuff:

 RAID. You might want to go RAID 1+0 if you are aiming for uptime. RAID 0
 has better performance, but if you lose a node your capacity is diminished,
 rebuilding and rejoining a node involves more 

Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Karl Hiramoto
On 06/23/11 09:43, David Boxenhorn wrote:
 I think very high uptime, and very low data loss is achievable in
 Cassandra, but, for new users there are TONS of gotchas. You really
 have to know what you're doing, and I doubt that many people acquire
 that knowledge without making a lot of mistakes.

 I see above that most people are talking about configuration issues.
 But, the first thing that you will probably do, before you have any
 experience with Cassandra(!), is architect your system. Architecture
 is not easily changed when you bump into a gotcha, and for some reason
 you really have to search the literature well to find out about them.
 So, my contributions:

 The too many CFs problem. Cassandra doesn't do well with many column
 families. If you come from a relational world, a real application can
 easily have hundreds of tables. Even if you combine them into entities
 (which is the Cassandra way), you can easily end up with dozens of
 entities. The most natural thing for someone with a relational
 background is to have one CF per entity, plus indexes according to your
 needs. Don't do it. You need to store multiple entities in the same
 CF. Group them together according to access patterns (i.e. when you
 use X,  you probably also need Y), and distinguish them by adding a
 prefix to their keys (e.g. entityName@key).

While avoiding too many CFs is a good idea, I would also advise
against very large CFs. Keeping CF size down helps speed up
repair and compaction.


--
Karl


Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Dominic Williams
Les,

Cassandra is a good system, but it has not reached version 1.0 yet, nor has
HBase etc. It is cutting edge technology and therefore in practice you are
unlikely to achieve five nines immediately - even if in theory with perfect
planning, perfect administration and so on, this should be achievable even
now.

The reasons you might choose Cassandra are:
1. A new, more flexible data model that may increase developer productivity and
lead to faster release cycles
2. Superior capability for *writing* large volumes of data, which is incredibly
useful in many applications
3. Horizontal scalability, where you can add nodes rather than buying bigger
machines
4. Data redundancy, which means you have a kind of live backup going on, a bit
like RAID - we use replication factor 3, for example (see the sketch just
after this list)
5. Because data is replicated across the cluster, the ability to perform
rolling restarts to administer and upgrade your nodes while the cluster
continues to run (yes, this is the feature that in theory allows for
continual operation, but in practice, until we reach 1.0, I don't think five
nines of uptime is always possible in every scenario yet, because of
deficiencies that may present themselves unexpectedly)
6. The benefit of building your new product on a platform designed to solve
many modern computing challenges, which will give you a better upgrade path -
for example, when you grow you won't have to change over from SQL to NoSQL
because you're already on it!
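
As a sketch of point 4 (illustrative only - the keyspace and data centre
names are made up, the strategy_options syntax differs slightly across CLI
versions, and the DC names have to match whatever your snitch reports, e.g.
conf/cassandra-topology.properties for PropertyFileSnitch):

    create keyspace MyApp
        with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
        and strategy_options = [{DC1:3, DC2:3}];

With replicas in more than one data centre, a single node - or, at suitable
consistency levels, a whole DC - can drop out without taking reads and
writes down with it.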

These are pretty compelling arguments, but you have to be realistic about
where Cassandra is right now. For what it's worth though, you might also
consider how easy it is to screw up databases running on commercial
production software that are handling very large amounts of data (just let
the volumes handling the commit log run short of disk space for example).
Setting up a Cassandra cluster is the simplest way to handle big data I've
seen and this reduction in complexity will also contribute to uptime.

Best, Dominic

On 22 June 2011 22:24, Les Hazlewood l...@katasoft.com wrote:

 I'm planning on using Cassandra as a product's core data store, and it is
 imperative that it never goes down or loses data, even in the event of a
 data center failure.  This uptime requirement (five nines: 99.999% uptime)
 w/ WAN capabilities is largely what led me to choose Cassandra over other
 NoSQL products, given its history and 'from the ground up' design for such
 operational benefits.

 However, in a recent thread, a user indicated that all 4 of his
 Cassandra instances were down because the OS killed the Java processes due
 to memory starvation, and all 4 instances went down within a relatively short
 period of each other.  Another user helped out and replied that
 running 0.8 and nodetool repair on each node regularly via a cron job (once
 a day?) seems to work for him.

 Naturally this was disconcerting to read, given our needs for a Highly
 Available product - we'd be royally screwed if this ever happened to us.
  But given Cassandra's history and it's current production use, I'm aware
 that this HA/uptime is being achieved today, and I believe it is certainly
 achievable.

 So, is there a collective set of guidelines or best practices to ensure
 this problem (or unavailability due to OOM) can be easily managed?

 Things like memory settings, initial GC recommendations, cron
 recommendations, ulimit settings, etc. that can be bundled up as a
 best-practices Production Kickstart?

 Could anyone share their nuggets of wisdom or point me to resources where
 this may already exist?

 Thanks!

 Best regards,

 Les



Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Chris Burroughs
On 06/22/2011 10:03 PM, Edward Capriolo wrote:
 I have not read the original thread concerning the problem you mentioned.
 One way to avoid OOM is large amounts of RAM :) On a more serious note most
 OOM's are caused by setting caches or memtables too large. If the OOM was
 caused by a software bug, the cassandra devs are on the ball and move fast.
 I still suggest not jumping into a release right away. 

For what it's worth, that particular thread was about the kernel oom
killer, which is a good example of the kind of gotcha that has caused
several people to chime in about the importance of monitoring both
Cassandra and the OS.


Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Chris Burroughs
On 06/22/2011 07:12 PM, Les Hazlewood wrote:
 Telling me to read the mailing lists and follow the issue tracker and use
 monitoring software is all great and fine - and I do all of these things
 today already - but this is a philosophical recommendation that does not
 actually address my question.  So I chalk this up as an error on my side in
 not being clear in my question - my apologies.  Let me reformulate it :)

For what it's worth, that was intended as a concrete suggestion.  We
adopted Cassandra a year ago, when (IMHO) it would have been a mistake to do
so without the willingness to develop sufficient in-house expertise to
internally patch/fork/debug if needed.  Things are more mature now, best
practices more widespread etc., but you should judge that yourself.

In the spirit of your re-formulated questions:
 - Read-before-write is a Cassandra anti-pattern; avoid it if at all
possible.
 - Those optional lines in the env script about GC logging?  Uncomment
them on at least some of your boxes (a rough example follows below).
 - Use MLOCKALL+mmap, or standard io, but never mmap without MLOCKALL.
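
For example, conf/cassandra-env.sh ships a commented-out block roughly like
the following (the exact flag list and log path vary by release and
packaging, so treat this as a sketch rather than a copy/paste):

    # GC logging flags for conf/cassandra-env.sh (shown uncommented here)
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCDateStamps"
    JVM_OPTS="$JVM_OPTS -XX:+PrintGCApplicationStoppedTime"
    JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"

On the mmap point: keeping jna.jar on the classpath lets Cassandra mlockall
its address space so the JVM heap does not get swapped out under mmap I/O
pressure; if you can't do that, set disk_access_mode: standard in
cassandra.yaml rather than running mmap without mlockall.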


Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Les Hazlewood
Great stuff Chris - thanks so much for the feedback!

Les


Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Les Hazlewood

 In the spirit of your re-formulated questions:
  - Read-before-write is a Cassandra anti-pattern, avoid it if at all
 possible.


 This leads me to believe that Cassandra may not be a good idea for a
 primary OLTP data store.  For example, "only create a user object if email
 foo is not already in use" or, more generally, "you can't create object X
 because one with the same constrained value already exists."

 Is that a fair assumption?


Actually, this may not be true, at least using Digg and Twitter as examples.
 I'd assume those apps are far more read-heavy than they are write-heavy,
but I wouldn't know for sure.


Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Chris Burroughs
On 06/23/2011 01:56 PM, Les Hazlewood wrote:
 Is there a roadmap or time to 1.0?  Even a ballpark time (e.g next year 3rd
 quarter, end of year, etc) would be great as it would help me understand
 where it may lie in relation to my production rollout.


The C* devs are rather strongly inclined against putting too much
meaning in version numbers.  The next major release might be called 1.0.
Or maybe it won't.  Either way it won't be different code or support
from something called 0.9 or 10.0.

September 8th is the feature freeze for the next major release.


Re: 99.999% uptime - Operations Best Practices?

2011-06-23 Thread Nate McCall
As an additional concrete detail to Edward's response, 'result
pinning' can provide some performance improvements depending on
topology and workload. See the conf file comments for details:
https://github.com/apache/cassandra/blob/cassandra-0.8.0/conf/cassandra.yaml#L308-315

I would also advise to take the time to experiment with consistency
levels (particularly in multi-DC setup) and their effect on response
times and weigh those against your consistency requirements.

For the record, any performance twiddling will only provide useful
results when comparable metrics are available for a similar workload
(Les, it appears you have a good grasp of this already - just wanted
to reiterate).


99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
I'm planning on using Cassandra as a product's core data store, and it is
imperative that it never goes down or loses data, even in the event of a
data center failure.  This uptime requirement (five nines: 99.999% uptime)
w/ WAN capabilities is largely what led me to choose Cassandra over other
NoSQL products, given its history and 'from the ground up' design for such
operational benefits.

However, in a recent thread, a user indicated that all 4 of his
Cassandra instances were down because the OS killed the Java processes due
to memory starvation, and all 4 instances went down within a relatively short
period of each other.  Another user helped out and replied that
running 0.8 and nodetool repair on each node regularly via a cron job (once
a day?) seems to work for him.

Naturally this was disconcerting to read, given our needs for a Highly
Available product - we'd be royally screwed if this ever happened to us.
 But given Cassandra's history and its current production use, I'm aware
that this HA/uptime is being achieved today, and I believe it is certainly
achievable.

So, is there a collective set of guidelines or best practices to ensure this
problem (or unavailability due to OOM) can be easily managed?

Things like memory settings, initial GC recommendations, cron
recommendations, ulimit settings, etc. that can be bundled up as a
best-practices Production Kickstart?

Could anyone share their nuggets of wisdom or point me to resources where
this may already exist?

Thanks!

Best regards,

Les


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Ryan King
On Wed, Jun 22, 2011 at 2:24 PM, Les Hazlewood l...@katasoft.com wrote:
 I'm planning on using Cassandra as a product's core data store, and it is
 imperative that it never goes down or loses data, even in the event of a
 data center failure.  This uptime requirement (five nines: 99.999% uptime)
 w/ WAN capabilities is largely what led me to choose Cassandra over other
 NoSQL products, given its history and 'from the ground up' design for such
 operational benefits.
 However, in a recent thread, a user indicated that all 4 of 4 of his
 Cassandra instances were down because the OS killed the Java processes due
 to memory starvation, and all 4 instances went down in a relatively short
 period of time of each other.  Another user helped out and replied that
 running 0.8 and nodetool repair on each node regularly via a cron job (once
 a day?) seems to work for him.
 Naturally this was disconcerting to read, given our needs for a Highly
 Available product - we'd be royally screwed if this ever happened to us.
  But given Cassandra's history and it's current production use, I'm aware
 that this HA/uptime is being achieved today, and I believe it is certainly
 achievable.
 So, is there a collective set of guidelines or best practices to ensure this
 problem (or unavailability due to OOM) can be easily managed?
 Things like memory settings, initial GC recommendations, cron
 recommendations, ulimit settings, etc. that can be bundled up as a
 best-practices Production Kickstart?

Unfortunately most of these are in the category of "it depends".

-ryan

 Could anyone share their nuggets of wisdom or point me to resources where
 this may already exist?
 Thanks!
 Best regards,
 Les



Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
Just to be clear:

I understand that resources like [1] and [2] exist, and I've read them.  I'm
just wondering if there are any 'gotchas' that might be missing from that
documentation that should be considered and if there are any recommendations
in addition to these documents.

Thanks,

Les

[1] http://www.datastax.com/docs/0.8/operations/index
[2] http://wiki.apache.org/cassandra/Operations


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
I understand that every environment is different and it always 'depends' :)
 But recommending settings and techniques based on an existing real
production environment (like the user's suggestion to run nodetool repair as
a regular cron job) is always a better starting point for a new Cassandra
evaluator than having to start from scratch.

Ryan, do you have any 'seed' settings that you guys use for nodes at
Twitter?

Are there any resources/write-ups beyond the two I've listed already that
address some of these 'gotchas'?  If those two links are in fact the ideal
starting point, that's fine - but it appears that this may not be the case,
based on the aforementioned user as well as the other user who helped him
and saw similar warning signs.

I'm hoping for someone to dispel these reports based on what people actually
do in production today.  Any info/settings/recommendations based on real
production environments would be appreciated!

Thanks again,

Les


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Sasha Dolgy
Implement monitoring and be proactive...that will stop you waking up to a
big surprise.  I'm sure there were symptoms leading up to all 4 nodes going
down.  Willing to wager that each node went down at different times and not
all went down at once...
On Jun 22, 2011 11:50 PM, Les Hazlewood l...@katasoft.com wrote:
 I understand that every environment is different and it always 'depends'
:)
 But recommending settings and techniques based on an existing real
 production environment (like the user's suggestion to run nodetool repair
as
 a regular cron job) is always a better starting point for a new Cassandra
 evaluator than having to start from scratch.

 Ryan, do you have any 'seed' settings that you guys use for nodes at
 Twitter?

 Are there any resources/write-ups beyond the two I've listed already that
 address some of these 'gotchas'? If those two links are in fact the ideal
 starting point, that's fine - but it appears that this may not be the case
 however based on the aforementioned user as well as the other who helped
him
 who saw similar warning signs.

 I'm hoping for someone to dispel these reports based on what people
actually
 do in production today. Any info/settings/recommendations based on real
 production environments would be appreciated!

 Thanks again,

 Les


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Will Oberman

Sadly, they all went down within minutes of each other.

Sent from my iPhone

On Jun 22, 2011, at 6:16 PM, Sasha Dolgy sdo...@gmail.com wrote:

Implement monitoring and be proactive...that will stop you waking up  
to a big surprise.  I'm sure there were symptoms leading up to all 4  
nodes going down.  Willing to wager that each node went down at  
different times and not all went down at once...


On Jun 22, 2011 11:50 PM, Les Hazlewood l...@katasoft.com wrote:
 I understand that every environment is different and it always  
'depends' :)

 But recommending settings and techniques based on an existing real
 production environment (like the user's suggestion to run nodetool  
repair as
 a regular cron job) is always a better starting point for a new  
Cassandra

 evaluator than having to start from scratch.

 Ryan, do you have any 'seed' settings that you guys use for nodes at
 Twitter?

 Are there any resources/write-ups beyond the two I've listed  
already that
 address some of these 'gotchas'? If those two links are in fact  
the ideal
 starting point, that's fine - but it appears that this may not be  
the case
 however based on the aforementioned user as well as the other who  
helped him

 who saw similar warning signs.

 I'm hoping for someone to dispel these reports based on what  
people actually
 do in production today. Any info/settings/recommendations based on  
real

 production environments would be appreciated!

 Thanks again,

 Les


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Chris Burroughs
On 06/22/2011 05:33 PM, Les Hazlewood wrote:
 Just to be clear:
 
 I understand that resources like [1] and [2] exist, and I've read them.  I'm
 just wondering if there are any 'gotchas' that might be missing from that
 documentation that should be considered and if there are any recommendations
 in addition to these documents.
 
 Thanks,
 
 Les
 
 [1] http://www.datastax.com/docs/0.8/operations/index
 [2] http://wiki.apache.org/cassandra/Operations
 

Well, if they knew some secret gotcha, the dutiful cassandra operators of
the world would update the wiki.

The closest thing to a 'gotcha' is that neither Cassandra nor any other
technology is going to get you those nines.  Humans will need to commit
to reading the mailing lists, following JIRA, and understanding what the
code is doing.  And humans will need to commit to combining that
understanding with monitoring and alerting to figure out all of the "it
depends" for your particular case.


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Edward Capriolo
Committing to that many 9s is going to be impossible since, as far as I
know, no internet service provider will SLA you more than 2 9s. You cannot
have more uptime than your ISP.

On Wednesday, June 22, 2011, Chris Burroughs chris.burrou...@gmail.com wrote:
 On 06/22/2011 05:33 PM, Les Hazlewood wrote:
 Just to be clear:

 I understand that resources like [1] and [2] exist, and I've read them.  I'm
 just wondering if there are any 'gotchas' that might be missing from that
 documentation that should be considered and if there are any recommendations
 in addition to these documents.

 Thanks,

 Les

 [1] http://www.datastax.com/docs/0.8/operations/index
 [2] http://wiki.apache.org/cassandra/Operations


 Well if they new some secret gotcha the dutiful cassandra operators of
 the world would update the wiki.

 The closest thing to a 'gotcha' is that neither Cassandra nor any other
 technology is going to get you those nines.  Humans will need to commit
 to reading the mailing lists, following JIRA, and understanding what the
 code is doing.  And humans will need to commit to combine that
 understanding with monitoring and alerting to figure out all of the it
 depends for your particular case.



Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Peter Lin
you have to use multiple data centers to really deliver 4 or 5 9's of service


On Wed, Jun 22, 2011 at 7:09 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
 Committing to that many 9s is going to be impossible since as far as I
 know no internet service provier will sla you more the 2 9s . You can
 not have more uptime then your isp.

 On Wednesday, June 22, 2011, Chris Burroughs chris.burrou...@gmail.com 
 wrote:
 On 06/22/2011 05:33 PM, Les Hazlewood wrote:
 Just to be clear:

 I understand that resources like [1] and [2] exist, and I've read them.  I'm
 just wondering if there are any 'gotchas' that might be missing from that
 documentation that should be considered and if there are any recommendations
 in addition to these documents.

 Thanks,

 Les

 [1] http://www.datastax.com/docs/0.8/operations/index
 [2] http://wiki.apache.org/cassandra/Operations


 Well if they new some secret gotcha the dutiful cassandra operators of
 the world would update the wiki.

 The closest thing to a 'gotcha' is that neither Cassandra nor any other
 technology is going to get you those nines.  Humans will need to commit
 to reading the mailing lists, following JIRA, and understanding what the
 code is doing.  And humans will need to commit to combine that
 understanding with monitoring and alerting to figure out all of the it
 depends for your particular case.




Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood

 
  [1] http://www.datastax.com/docs/0.8/operations/index
  [2] http://wiki.apache.org/cassandra/Operations
 

 Well if they new some secret gotcha the dutiful cassandra operators of
 the world would update the wiki.


As I am new to the Cassandra community, I don't know how 'dutifully' this is
maintained.  My questions were not unreasonable given the nature of
open-source documentation.  All I was looking for was what people thought
were best practices based on their own production experience.

Telling me to read the mailing lists and follow the issue tracker and use
monitoring software is all great and fine - and I do all of these things
today already - but this is a philosophical recommendation that does not
actually address my question.  So I chalk this up as an error on my side in
not being clear in my question - my apologies.  Let me reformulate it :)

Does anyone out there have any concrete recommended techniques or insights
in maintaining a HA Cassandra cluster that you've gained based on production
experience beyond what is described in the 2 links above?

Thanks,

Les


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
On Wed, Jun 22, 2011 at 4:11 PM, Peter Lin wool...@gmail.com wrote:

 you have to use multiple data centers to really deliver 4 or 5 9's of
 service


We do, hence my question, as well as my choice of Cassandra :)

Best,

Les


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread mcasandra
In my opinion 5 9s don't matter. What matters is the number of impacted
customers. You might be down during peak for 5 minutes, turning away
thousands of customers, while being down at night might turn away only a few.

There is no magic bullet. It's all about learning and improving. You will
not get HA right away, but over a period of time, as you learn and improve,
you will do better.



Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Peter Lin
So having multiple data centers is step 1 of 4/5 9's.

I've worked on some services that had 3-4 9's SLAs. Getting there is
really tough, as others have stated. You have to have auditing built into
your service, capacity metrics, capacity planning, some kind of
real-time monitoring, staff to respond to alerts, a plan for handling
system failures, training to handle outages, and a dozen other things.

Your best choice is to hire someone who has built a system that
supports 4-5 9's and patiently work to get there.


On Wed, Jun 22, 2011 at 7:16 PM, Les Hazlewood l...@katasoft.com wrote:
 On Wed, Jun 22, 2011 at 4:11 PM, Peter Lin wool...@gmail.com wrote:

 you have to use multiple data centers to really deliver 4 or 5 9's of
 service

 We do, hence my question, as well as my choice of Cassandra :)
 Best,
 Les


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
Forget the 5 9's - I apologize for even writing that.  It was my shorthand
way of saying 'this can never go down'.  I'm not asking for philosophical
advice - I've been doing large scale enterprise deployments for over 10
years.  I 'get' the 'it depends' and 'do your homework' philosophy.

All I'm asking for is concrete techniques that anyone might wish to share
that they've found valuable beyond what is currently written in the existing
operations documentation in [1] and [2].

If no one wants to share that, that's totally cool - no need to derail the
thread into a different discussion.

Thanks,

Les


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread mcasandra
Start with reading the comments in cassandra.yaml and
http://wiki.apache.org/cassandra/Operations

As far as I know there is no comprehensive list for performance tuning -
more specifically, no common settings applicable to everyone. For the most
part, issues revolve around compactions and GC tuning.
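
For what it's worth, most of the GC tuning lives in conf/cassandra-env.sh.
The stock file of this era ships roughly the following CMS settings (quoted
from memory, so check your own copy; the heap numbers are only examples),
which are a sane baseline before any twiddling:

    MAX_HEAP_SIZE="8G"     # example only; overrides the script's automatic heap sizing
    HEAP_NEWSIZE="800M"    # example; the file's comments suggest ~100MB per physical core
    JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC"
    JVM_OPTS="$JVM_OPTS -XX:+UseConcMarkSweepGC"
    JVM_OPTS="$JVM_OPTS -XX:+CMSParallelRemarkEnabled"
    JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=8"
    JVM_OPTS="$JVM_OPTS -XX:MaxTenuringThreshold=1"
    JVM_OPTS="$JVM_OPTS -XX:CMSInitiatingOccupancyFraction=75"
    JVM_OPTS="$JVM_OPTS -XX:+UseCMSInitiatingOccupancyOnly"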



Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
I have architected, built and been responsible for systems that support 4-5
9s for years.  This discussion is not about how to do that generally.  It
was intended to be about concrete techniques that have been found valuable
when deploying Cassandra in HA environments beyond what is documented in [1]
and [2].

Cheers,

Les


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
Yep, that was [2] on my existing list.  Thanks very much for actually
addressing my question - it is greatly appreciated!

If anyone else has examples they'd like to share (like their own cron
techniques, or JVM settings and why, etc), I'd love to hear them!

Best regards,

Les

On Wed, Jun 22, 2011 at 4:24 PM, mcasandra mohitanch...@gmail.com wrote:

 Start with reading comments on cassandra.yaml and
 http://wiki.apache.org/cassandra/Operations

 As far as I know there is no comprehensive list for performance tuning.
 More
 specifically common setting applicable to everyone. For most part issues
 revolve around compactions and GC tuning.



Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread mcasandra

Les Hazlewood wrote:
 
 I have architected, built and been responsible for systems that support
 4-5
 9s for years. 
 

So have most of us. But probably by now it should be clear that no
technology can provide concrete recommendations. They can only provide what
might be helpful, which varies from env to env. That's why I suggest looking
at the comments in cassandra.yaml and seeing which are applicable in your
scenario. I learn something new every time I read it.

BTW: Can you be clear as to what kind of recommendations you are referring
to? NetworkTopology, how many copies to store, uptime, load balancing,
request routing when one DC is down? If you ask specific questions you might
get a better response.



Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
On Wed, Jun 22, 2011 at 4:35 PM, mcasandra mohitanch...@gmail.com wrote:

 might be helpful which varies from env to env. That's why I suggest look at
 the comments in cassandra.yaml and see which are applicable in your
 scenario. I learn something new everytime I read it.


Yep, and this was awesome - thanks very much for the reply - very helpful.


 BTW: Can you be clear as to what kind of recommendations are you referring
 to? NetworkToplogy, how many copies to store, uptime, load balancing,
 request routing when on DC is down? If you ask specific questions you might
 get better response.


Yes, this was my fault in not being specific, but I intentionally left it
open to see if anyone wanted to bring up something specific to their
environment that they thought would be valuable (e.g. 'when our nodes got to
95% memory utilization, we found that GC behavior was doing X; setting the
JVM option foo helped us reduce problem Y').

I was mainly looking initially for what folks thought were satisfactory
initial JVM/GC and *nix OS settings for a production node (e.g. 8 cores w/
64 gig ram, or an EC2 'large' or 'XL' node).  E.g. what collector was used,
and why, whether folks have used the standard CMS collector or if they've
tried the G1 collector and what settings made them happy after testing...

Those kinds of things.  Call it a tiny 'case study' if you will.  Network
topology I thought I'd leave for a whole 'nuther discussion :)

As an aside, I definitely plan to publish our actual JVM and OS settings and
operational procedures once we find a happy medium based on our application
in the event that it might help someone else.

Thanks again!

Les


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread C. Scott Andreas
Hi Les,

I wanted to offer a couple thoughts on where to start and strategies for 
approaching development and deployment with reliability in mind.

One way that we've found to more productively think about the reliability of 
our data tier is to focus our thoughts away from a concept of uptime or x 
nines toward one of error rates. Ryan mentioned that it depends, and while 
brief, this is actually a very correct comment. Perhaps I can help elaborate.

Failures in systems distributed across multiple systems in multiple datacenters 
can rarely be described in terms of binary uptime guarantees (e.g., either 
everything is up or everything is down). Instead, certain nodes may be 
unavailable at certain times, but given appropriate read and write parameters 
(and their implicit tradeoffs), these service interruptions may remain 
transparent.

Cassandra provides a variety of tools to allow you to tune these, two of the 
most important of which are the consistency level for reads and writes and your 
replication factor. I'm sure you're  familiar with these, but mention them 
because thinking hard about the tradeoffs you're willing to make in terms of 
consistency and replication may heavily impact your operational experience if 
availability is of utmost importance.

Of course, the single-node operational story is very important as well. Ryan's 
it depends comment here takes on painful significance for myself, as we've 
found that the manner in which read and write loads vary, their duration, and 
intensity can have very different operational profiles and failure modes. If 
relaxed consistency is acceptable for your reads and writes, you'll likely find 
querying with CL.ONE to be more available than QUORUM or ALL, at the cost of 
reduced consistency. Similarly, if it is economical for you to provision extra 
nodes for a higher replication factor, you will increase your ability to 
continue reading and writing in the event of single- or multiple-node failures.

One of the prime challenges we've faced is reducing the frequency and intensity 
of full garbage collections in the JVM, which tend to result in single-node 
unavailability. Thanks to help from Jonathan Ellis and Peter Schuller (along 
with a fair amount of elbow grease ourselves), we've worked through several of 
these issues and have arrived at a steady state that leaves the ring happy even 
under load. We've not found GC tuning to bring night-and-day differences 
outside of resolving the STW collections, but the difference is noticeable.

Occasionally, these issues will result from Cassandra's behavior itself; 
documented APIs such as querying for the count of all columns associated with a 
key will materialize the row across all nodes being queried. Once when issuing 
a count query for a key that had around 300k columns at CL.QUORUM, we knocked 
three nodes out of our ring by triggering a stop-the-world collection that 
lasted about 30 seconds, so watch out for things like that.

Some of the other tuning knobs available to you involve tradeoffs such as when 
to flush memtables or to trigger compactions, both of which are somewhat 
intensive operations that can strain a cluster under heavy read or write load, 
but which are equally necessary for the cluster to remain in operation. If you 
find yourself pushing hard against these tradeoffs and attempting to navigate a 
path between icebergs, it's very likely that the best answer to the problem is 
more or more powerful hardware.

But a lot of this is tacit knowledge, which often comes through a bit of pain 
but is hopefully operationally transparent to your users.  Things that you 
discover once the system is live in operation and your monitoring is providing 
continuous feedback about the ring's health. This is where Sasha's point 
becomes so critical -- having advanced early-warning systems in place, watching 
monitoring and graphs closely even when everything's fine, and beginning to 
understand how it likes to operate and what it tends to do will give you a huge 
leg up on your reliability and allow you to react to issues in the ring before 
they present operational impact.

You mention that you've been building HA systems for a long time -- indeed, far 
longer than I have, so I'm sure that you're also aware that good, solid 
up/down binaries are hard to come by, that none of this is easy, and that 
while some pointers are available (the defaults are actually quite good), it's 
essentially impossible to offer the best production defaults because they 
vary wildly based on your hardware, ring configuration, and read/write load and 
query patterns.

To that end, you might find it more productive to begin with the defaults as 
you develop your system, and let the ring tell you how it's feeling as you 
begin load testing. Once you have stressed it to the point of failure, you'll 
see how it failed and either be able to isolate the cause and begin planning to 
handle that mode, or better yet, understand 

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Thoku Hansen
I think that Les's question was reasonable. Why *not* ask the community for the 
'gotchas'?

Whether the info is already documented or not, it could be an opportunity to 
improve the documentation based on users' perception.

The "you just have to learn" responses are fair also, but that reminds me of 
the days when running Oracle was a black art, and accumulated wisdom made DBAs 
irreplaceable.

Some recommendations *are* documented, but they are dispersed / stale / 
contradictory / or counter-intuitive.

Others have not been documented in the wiki nor in DataStax's doco, and are 
instead learned anecdotally or The Hard Way.

For example, whether documented or not, some of the 'gotchas' that I 
encountered when I first started working with Cassandra were:

* Don't use OpenJDK. Prefer the Sun JDK. (Wiki says this, Jira says that).
* It's not viable to run without JNA installed.
* Disable swap memory (a rough sketch of this and the JNA item follows below).
* Need to run nodetool repair on a regular basis.
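
For the swap and JNA items, something along these lines is typical (paths,
limits and usernames are only examples, and the exact JNA/mlockall log
message differs between versions):

    # turn swap off (and remove swap entries from /etc/fstab so it stays off)
    sudo swapoff -a
    # or at least strongly discourage swapping
    echo 'vm.swappiness = 0' | sudo tee -a /etc/sysctl.conf && sudo sysctl -p

    # let the cassandra user mlock its address space and open plenty of files,
    # e.g. in /etc/security/limits.conf:
    #   cassandra  -  memlock  unlimited
    #   cassandra  -  nofile   32768

    # JNA: drop jna.jar into Cassandra's lib/ directory, restart, and grep the
    # system log for the mlockall/JNA line to confirm it was picked up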

I'm looking forward to Edward Capriolo's Cassandra book which Les will probably 
find helpful.

On Jun 22, 2011, at 7:12 PM, Les Hazlewood wrote:

 
  [1] http://www.datastax.com/docs/0.8/operations/index
  [2] http://wiki.apache.org/cassandra/Operations
 
 
 Well if they new some secret gotcha the dutiful cassandra operators of
 the world would update the wiki.
 
 As I am new to the Cassandra community, I don't know how 'dutifully' this is 
 maintained.  My questions were not unreasonable question given the nature of 
 open-source documentation.  All I was looking for was what people thought 
 were best practices based on their own production experience.
 
 Telling me to read the mailing lists and follow the issue tracker and use 
 monitoring software is all great and fine - and I do all of these things 
 today already - but this is a philosophical recommendation that does not 
 actually address my question.  So I chalk this up as an error on my side in 
 not being clear in my question - my apologies.  Let me reformulate it :)
 
 Does anyone out there have any concrete recommended techniques or insights in 
 maintaining a HA Cassandra cluster that you've gained based on production 
 experience beyond what is described in the 2 links above?
 
 Thanks,
 
 Les



Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
Hi Scott,

First, let me say that this email was amazing - I'm always appreciative of
the time that anyone puts into mailing list replies, especially ones as
thorough, well-thought and articulated as this one.  I'm a firm believer
that these types of replies reflect a strong and durable open-source
community.  You, sir, are a bad ass :)  Thanks so much!

As for the '5 9s' comment, I apologize for even writing that - it threw
everyone off.  It was a shorthand way of saying that this data store is so
critical to the product that if it ever goes down entirely (as it did for
one user's 4 nodes, all at the same time), then we're screwed.  I was
hoping to trigger the 'hrm - what have we done ourselves to reach that
availability that wasn't easily captured in the documentation' train of
thought.  It proved to be a red herring however, so I apologize for even
bringing it up.

Thanks *very* much for the reply.  I'll be sure to follow up with the list
as I come across any particular issues and I'll also report my own findings
in the interest of (hopefully) being beneficial to anyone in the future.

Cheers,

Les

On Wed, Jun 22, 2011 at 4:58 PM, C. Scott Andreas
csco...@urbanairship.comwrote:

 Hi Les,

 I wanted to offer a couple thoughts on where to start and strategies for
 approaching development and deployment with reliability in mind.

 One way that we've found to more productively think about the reliability
 of our data tier is to focus our thoughts away from a concept of uptime or
 *x* nines toward one of error rates. Ryan mentioned that it depends,
 and while brief, this is actually a very correct comment. Perhaps I can help
 elaborate.

 Failures in systems distributed across multiple systems in multiple
 datacenters can rarely be described in terms of binary uptime guarantees
 (e.g., either everything is up or everything is down). Instead, certain
 nodes may be unavailable at certain times, but given appropriate read and
 write parameters (and their implicit tradeoffs), these service interruptions
 may remain transparent.

 Cassandra provides a variety of tools to allow you to tune these, two of
 the most important of which are the consistency level for reads and writes
 and your replication factor. I'm sure you're  familiar with these, but
 mention them because thinking hard about the tradeoffs you're willing to
 make in terms of consistency and replication may heavily impact your
 operational experience if availability is of utmost importance.

 Of course, the single-node operational story is very important as well.
 Ryan's it depends comment here takes on painful significance for myself,
 as we've found that the manner in which read and write loads vary, their
 duration, and intensity can have very different operational profiles and
 failure modes. If relaxed consistency is acceptable for your reads and
 writes, you'll likely find querying with CL.ONE to be more available than
 QUORUM or ALL, at the cost of reduced consistency. Similarly, if it is
 economical for you to provision extra nodes for a higher replication factor,
 you will increase your ability to continue reading and writing in the event
 of single- or multiple-node failures.

 One of the prime challenges we've faced is reducing the frequency and
 intensity of full garbage collections in the JVM, which tend to result in
 single-node unavailability. Thanks to help from Jonathan Ellis and Peter
 Schuller (along with a fair amount of elbow grease ourselves), we've worked
 through several of these issues and have arrived at a steady state that
 leaves the ring happy even under load. We've not found GC tuning to bring
 night-and-day differences outside of resolving the STW collections, but the
 difference is noticeable.

 Occasionally, these issues will result from Cassandra's behavior itself;
 documented APIs such as querying for the count of all columns associated
 with a key will materialize the row across all nodes being queried. Once
 when issuing a count query for a key that had around 300k columns at
 CL.QUORUM, we knocked three nodes out of our ring by triggering a
 stop-the-world collection that lasted about 30 seconds, so watch out for
 things like that.

 Some of the other tuning knobs available to you involve tradeoffs such as
 when to flush memtables or to trigger compactions, both of which are
 somewhat intensive operations that can strain a cluster under heavy read or
 write load, but which are equally necessary for the cluster to remain in
 operation. If you find yourself pushing hard against these tradeoffs and
 attempting to navigate a path between icebergs, it's very likely that the
 best answer to the problem is more or more powerful hardware.

 But a lot of this is tacit knowledge, which often comes through a bit of
 pain but is hopefully operationally transparent to your users.  Things that
 you discover once the system is live in operation and your monitoring is
 providing continuous feedback about the ring's health. 

Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
Hi Thoku,

You were able to more concisely represent my intentions (and their
reasoning) in this thread than I was able to do so myself.  Thanks!

On Wed, Jun 22, 2011 at 5:14 PM, Thoku Hansen tho...@gmail.com wrote:

 I think that Les's question was reasonable. Why *not* ask the community for
 the 'gotchas'?

 Whether the info is already documented or not, it could be an opportunity
 to improve the documentation based on users' perception.

 The you just have to learn responses are fair also, but that reminds me
 of the days when running Oracle was a black art, and accumulated wisdom made
 DBAs irreplaceable.


Yes, this was my initial concern.  I know that Cassandra is still young, and
I expect this to be the norm for a while, but I was hoping to make that
process a bit easier (for me and anyone else reading this thread in the
future).

Some recommendations *are* documented, but they are dispersed / stale /
 contradictory / or counter-intuitive.

 Others have not been documented in the wiki nor in DataStax's doco, and are
 instead learned anecdotally or The Hard Way.

 For example, whether documented or not, some of the 'gotchas' that I
 encountered when I first started working with Cassandra were:

 * Don't use OpenJDK. Prefer the Sun JDK. (Wiki says 
 thishttp://wiki.apache.org/cassandra/GettingStarted
 , Jira says that https://issues.apache.org/jira/browse/CASSANDRA-2441).
 * Its not viable to run without JNA installed.
 * Disable swap memory.
 * Need to run nodetool repair on a regular basis.

 I'm looking forward to Edward Capriolo's Cassandra 
 bookhttps://www.packtpub.com/cassandra-apache-high-performance-cookbook/book
  which
 Les will probably find helpful.


Thanks for linking to this.  I'm pre-ordering right away.

And thanks for the pointers, they are exactly the kind of enumerated things
I was looking to elicit.  These are the kinds of things that are hard to
track down in a single place.  I think it'd be nice for the community to
contribute this stuff to a single page ('best practices', 'checklist',
whatever you want to call it).  It would certainly make things easier when
getting started.

Thanks again,

Les


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Edward Capriolo
On Wed, Jun 22, 2011 at 8:31 PM, Les Hazlewood l...@katasoft.com wrote:

 Hi Thoku,

 You were able to more concisely represent my intentions (and their
 reasoning) in this thread than I was able to do so myself.  Thanks!

 On Wed, Jun 22, 2011 at 5:14 PM, Thoku Hansen tho...@gmail.com wrote:

 I think that Les's question was reasonable. Why *not* ask the community
 for the 'gotchas'?

 Whether the info is already documented or not, it could be an opportunity
 to improve the documentation based on users' perception.

 The you just have to learn responses are fair also, but that reminds me
 of the days when running Oracle was a black art, and accumulated wisdom made
 DBAs irreplaceable.


 Yes, this was my initial concern.  I know that Cassandra is still young,
 and I expect this to be the norm for a while, but I was hoping to make that
 process a bit easier (for me and anyone else reading this thread in the
 future).

 Some recommendations *are* documented, but they are dispersed / stale /
 contradictory / or counter-intuitive.

 Others have not been documented in the wiki nor in DataStax's doco, and
 are instead learned anecdotally or The Hard Way.

 For example, whether documented or not, some of the 'gotchas' that I
 encountered when I first started working with Cassandra were:

 * Don't use OpenJDK. Prefer the Sun JDK. (Wiki says 
 thishttp://wiki.apache.org/cassandra/GettingStarted
 , Jira says that https://issues.apache.org/jira/browse/CASSANDRA-2441).
 * Its not viable to run without JNA installed.
 * Disable swap memory.
 * Need to run nodetool repair on a regular basis.

 I'm looking forward to Edward Capriolo's Cassandra 
 bookhttps://www.packtpub.com/cassandra-apache-high-performance-cookbook/book
  which
 Les will probably find helpful.


 Thanks for linking to this.  I'm pre-ordering right away.

 And thanks for the pointers, they are exactly the kind of enumerated things
 I was looking to elicit.  These are the kinds of things that are hard to
 track down in a single place.  I think it'd be nice for the community to
 contribute this stuff to a single page ('best practices', 'checklist',
 whatever you want to call it).  It would certainly make things easier when
 getting started.

 Thanks again,

 Les


Since I got a plug on the book I will chip in again to the thread :)

Some things that were mentioned already:

Absolutely install JNA (without it the snapshot command has to fork to hard
link the sstables; I have seen clients back off from this). Also, the
performance-focused Cassandra devs always try to squeeze out performance by
utilizing more native features.

OpenJDK vs Sun. I agree, almost always try to do what 'most others' do in
production, this way you get surprised less.

Other stuff:

RAID. You might want to go RAID 1+0 if you are aiming for uptime. RAID 0 has
better performance, but if you lose a disk you lose the node, your capacity is
diminished, and rebuilding and rejoining the node involves more manpower, more
steps, and more chances for human error.

Collect statistics on the normal system items: CPU, disk (size and
utilization), memory. Then collect the JMX cassandra counters and understand
how they interact. For example, record ReadCount and WriteCount per column
family, then try to determine how this affects disk utilization. You can
use this for capacity planning. Then try using a key/row cache. Evaluate
again. Check the hit rate graph for your new cache. How did this affect your
disk? You want to head off anything that can be a performance killer, like
traffic patterns changing or data growing significantly.
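
Concretely, most of those counters are one nodetool call away (the hostname
below is a placeholder); graph them next to your OS metrics:

    # per-column-family read/write counts, latencies, cache hit rates, sstable counts
    nodetool -h cass01.example.com cfstats

    # thread pool backlog (pending reads/writes, flushes, compactions)
    nodetool -h cass01.example.com tpstats

    # load, uptime and heap usage for the node
    nodetool -h cass01.example.com info

    # and the usual OS-level view of the disks
    iostat -x 5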

Do not be short on hardware. I do not want to say overbuy, but if uptime is
important, have spare drives and servers and have room to grow.

Balance that ring :)

I have not read the original thread concerning the problem you mentioned.
One way to avoid OOM is large amounts of RAM :) On a more serious note, most
OOMs are caused by setting caches or memtables too large. If the OOM was
caused by a software bug, the cassandra devs are on the ball and move fast.
I still suggest not jumping into a release right away. I know it's hard to
live without counters or CQL since new things are super cool. But if you
want all those 9s you're going to have to stay disciplined. Unless a release
has a fix for a problem you think you have, stay a minor or revision back,
or at least wait some time before upgrading to it, and do some internal
confidence testing before pulling the trigger on an update.
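
If you do run into memory pressure, remember the caches are per column
family and easy to rein in from the CLI (the CF name and numbers below are
made up; the attribute names are the 0.7/0.8 ones):

    update column family Users with keys_cached=200000 and rows_cached=0;

The row cache holds entire rows, so a handful of wide rows can eat a
surprising amount of heap; it is safest to leave it off until you have
measured the hit rate you would actually get.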

Almost all use cases demand that repair be run regularly due to the nature of
distributed deletes.
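
A low-tech way to do that is a staggered cron entry per node (the user, path
and schedule below are just an example); the one hard rule is that every node
gets repaired more often than gc_grace_seconds (10 days by default), or
deleted data can come back to life:

    # /etc/cron.d/cassandra-repair on node 1 -- stagger the day/hour on other nodes
    0 3 * * 1  cassandra  /usr/bin/nodetool -h localhost repair >> /var/log/cassandra/repair.log 2>&1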

Other good tips: subscribe to all the mailing lists, and hang out in the IRC
channels cassandra, cassandra-dev, cassandra-ops. You get an osmosis
learning effect and you learn to fix or head off issues you never had.


Re: 99.999% uptime - Operations Best Practices?

2011-06-22 Thread Les Hazlewood
Edward,

Thank you so much for this reply - this is great stuff, and I really
appreciate it.

You'll be happy to know that I've already pre-ordered your book.  I'm
looking forward to it! (When is the ship date?)

Best regards,

Les

On Wed, Jun 22, 2011 at 7:03 PM, Edward Capriolo edlinuxg...@gmail.comwrote:



 On Wed, Jun 22, 2011 at 8:31 PM, Les Hazlewood l...@katasoft.com wrote:

 Hi Thoku,

 You were able to more concisely represent my intentions (and their
 reasoning) in this thread than I was able to do so myself.  Thanks!

 On Wed, Jun 22, 2011 at 5:14 PM, Thoku Hansen tho...@gmail.com wrote:

 I think that Les's question was reasonable. Why *not* ask the community
 for the 'gotchas'?

 Whether the info is already documented or not, it could be an opportunity
 to improve the documentation based on users' perception.

 The you just have to learn responses are fair also, but that reminds me
 of the days when running Oracle was a black art, and accumulated wisdom made
 DBAs irreplaceable.


 Yes, this was my initial concern.  I know that Cassandra is still young,
 and I expect this to be the norm for a while, but I was hoping to make that
 process a bit easier (for me and anyone else reading this thread in the
 future).

 Some recommendations *are* documented, but they are dispersed / stale /
 contradictory / or counter-intuitive.

 Others have not been documented in the wiki nor in DataStax's doco, and
 are instead learned anecdotally or The Hard Way.

 For example, whether documented or not, some of the 'gotchas' that I
 encountered when I first started working with Cassandra were:

 * Don't use OpenJDK. Prefer the Sun JDK. (Wiki says 
 thishttp://wiki.apache.org/cassandra/GettingStarted
 , Jira says that https://issues.apache.org/jira/browse/CASSANDRA-2441
 ).
 * Its not viable to run without JNA installed.
 * Disable swap memory.
 * Need to run nodetool repair on a regular basis.

 I'm looking forward to Edward Capriolo's Cassandra 
 bookhttps://www.packtpub.com/cassandra-apache-high-performance-cookbook/book
  which
 Les will probably find helpful.


 Thanks for linking to this.  I'm pre-ordering right away.

 And thanks for the pointers, they are exactly the kind of enumerated
 things I was looking to elicit.  These are the kinds of things that are hard
 to track down in a single place.  I think it'd be nice for the community to
 contribute this stuff to a single page ('best practices', 'checklist',
 whatever you want to call it).  It would certainly make things easier when
 getting started.

 Thanks again,

 Les


 Since I got a plug on the book I will chip in again to the thread :)

 Some things that were mentioned already:

 Install JNA absolutely (without JNA the snapshot command has to fork to
 hard link the sstables, I have seen clients backoff from this). Also the
 performance focused Cassandra devs always try to squeeze out performance by
 utilizing more native features.

 OpenJDK vs Sun. I agree, almost always try to do what 'most others' do in
 production, this way you get surprised less.

 Other stuff:

 RAID. You might want to go RAID 1+0 if you are aiming for uptime. RAID 0
 has better performance, but if you lose a node your capacity is diminished,
 rebuilding and rejoining a node involves more manpower more steps and more
 chances for human error.

 Collect statistics on the normal system items CPU, disk (size and
 utilization), memory. Then collect the JMX cassandra counters and understand
 how they interact. For example record ReadCount and WriteCount per column
 family, then use try to determine how this effects disk utilization. You can
 use this for capacity planning. Then try using a key/row cache. Evaluate
 again. Check the hit rate graph for your new cache. How did this effect your
 disk? You want to head off anything that can be a performance killer like
 traffic patterns changing or data growing significantly.

 Do not be short on hardware. I do not want to say overbuy but if uptime
 is important have spares drives and servers and have room to grow.

 Balance that ring :)

 I have not read the original thread concerning the problem you mentioned.
 One way to avoid OOM is large amounts of RAM :) On a more serious note most
 OOM's are caused by setting caches or memtables too large. If the OOM was
 caused by a software bug, the cassandra devs are on the ball and move fast.
 I still suggest not jumping into a release right away. I know its hard to
 live without counters or CQL since new things are super cool. But if you
 want all those 9s your going to have to stay disciplined. Unless a release
 has a fix for a problem you think you have, stay a minor or revision back,
 or at least wait some time before upgrading to it, and do some internal
 confidence testing before pulling the trigger on an update.

 Almost all usecases demand that repair be run regularly due to the nature
 of distributed deletes.

 Other good tips, subscribe to all the mailing lists, and hang out in the