Re: Can you query Cassandra while it's doing major compaction

2012-02-03 Thread Adrian Cockcroft
At Netflix we rotate the major compactions around the cluster rather than
running them all at once. We also either take that node out of client
traffic, so it doesn't get used as a coordinator, or use the Astyanax
client, which is latency- and token-aware, to steer traffic to the other
replicas.
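
Something along these lines is all the rotation amounts to (a rough sketch with
hypothetical host names; it assumes nodetool is on the PATH and can reach JMX on
each node):

import java.util.Arrays;
import java.util.List;

/** Runs a major compaction on one node at a time, never in parallel. */
public class RollingCompaction {
    public static void main(String[] args) throws Exception {
        // Hypothetical ring members; in practice pull these from your own config.
        List<String> nodes = Arrays.asList("cass1", "cass2", "cass3");
        for (String node : nodes) {
            // nodetool blocks until the compaction finishes, so this loop
            // naturally serializes the work around the ring.
            Process p = new ProcessBuilder("nodetool", "-h", node, "compact")
                    .inheritIO()
                    .start();
            if (p.waitFor() != 0) {
                throw new IllegalStateException("compaction failed on " + node);
            }
        }
    }
}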

We are running on EC2 with lots of CPU and RAM but only two internal
disk spindles, so if you have lots of IOPS available in your configuration,
you may find that compaction doesn't affect read times much.

Adrian

On Thu, Feb 2, 2012 at 11:44 PM, Peter Schuller
peter.schul...@infidyne.com wrote:
 If every node in the cluster is running major compaction, would it be able to
 answer any read request?  And is it wise to write anything to a cluster
 while it's doing major compaction?

 Compaction is something that is supposed to be continuously running in
 the background. As noted, it will have a performance impact in that it
 (1) generates I/O, (2) leads to cache eviction, and (if you're CPU
 bound rather than disk bound) (3) adds CPU load.

 But there is no intention that clients should have to care about
 compaction; it's to be viewed as a background operation continuously
 happening. A good rule of thumb is that an individual node should be
 able to handle your traffic while doing compaction; you don't want to
 be in a position where you're only just keeping up with the traffic
 and a node that starts compacting can no longer handle it.

 --
 / Peter Schuller (@scode, http://worldmodscode.wordpress.com)


Benchmarking Cassandra scalability to over 1M writes/s on AWS

2011-11-03 Thread Adrian Cockcroft
Hi folks,

we just posted a detailed Netflix technical blog entry on this
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

Hope you find it interesting/useful

Cheers Adrian


Re: Very slow writes in Cassandra

2011-10-31 Thread Adrian Cockcroft
You are using a replication factor of one and the Lustre clustered
filesystem over the network. That is not good practice.

Try RF=3 and local disks. Lustre duplicates much of the functionality
of Cassandra, so there is no point in using both. Make your Lustre server
nodes into Cassandra nodes instead.

Adrian

On Sun, Oct 30, 2011 at 6:33 PM, Evgeny erepe...@cmcrc.com wrote:
 Hello Cassandra users,

 I'm a newbie to NoSQL and to Cassandra in particular. At the moment I am doing some
 benchmarking with Cassandra and experiencing very slow write throughput.

 Cassandra is said to be able to perform hundreds of thousands of inserts per
 second, but I am not observing this:
 1) when I send 100 thousand inserts simultaneously via 8 CQL clients,
 throughput is ~14470 inserts per second;
 2) when I do the same via 8 Thrift clients, throughput is ~16300 inserts
 per second.

 I think Cassandra performance can be improved, but I don't know what to tune.
 Please take a look at the test conditions below and advise something.
 Thank you.

 Tests conditions:

   1. The Cassandra cluster is deployed on three machines; each machine has an
   8-core Intel(R) Xeon(R) E5420 CPU @ 2.50GHz, 16GB of RAM, and a
   1000Mb/s network.

   2. The data sample is

 set MM[utf8('1:exc_source_algo:2010010500.00:ENTER:0')]['order_id'] = '1.0';
 set MM[utf8('1:exc_source_algo:2010010500.00:ENTER:0')]['security'] = 'AA1';
 set MM[utf8('1:exc_source_algo:2010010500.00:ENTER:0')]['price'] = '47.1';
 set MM[utf8('1:exc_source_algo:2010010500.00:ENTER:0')]['volume'] = '300.0';
 set MM[utf8('1:exc_source_algo:2010010500.00:ENTER:0')]['se'] = '1';
 set MM[utf8('2:exc_source_algo:2010010500.00:ENTER:0')]['order_id'] = '2.0';
 set MM[utf8('2:exc_source_algo:2010010500.00:ENTER:0')]['security'] = 'AA1';
 set MM[utf8('2:exc_source_algo:2010010500.00:ENTER:0')]['price'] = '44.89';
 set MM[utf8('2:exc_source_algo:2010010500.00:ENTER:0')]['volume'] = '310.0';
 set MM[utf8('2:exc_source_algo:2010010500.00:ENTER:0')]['se'] = '1';
 set MM[utf8('3:exc_source_algo:2010010500.00:ENTER:0')]['order_id'] = '3.0';
 set MM[utf8('3:exc_source_algo:2010010500.00:ENTER:0')]['security'] = 'AA2';
 set MM[utf8('3:exc_source_algo:2010010500.00:ENTER:0')]['price'] = '0.35';

   3. Commit log is written on the local hard drive; the data is written on
   Lustre.

   4. Keyspace description:

 Keyspace: MD
   Replication Strategy: org.apache.cassandra.locator.NetworkTopologyStrategy
   Durable Writes: true
   Options: [datacenter1:1]
   Column Families:
     ColumnFamily: MM
       Key Validation Class: org.apache.cassandra.db.marshal.BytesType
       Default column value validator: org.apache.cassandra.db.marshal.BytesType
       Columns sorted by: org.apache.cassandra.db.marshal.BytesType
       Row cache size / save period in seconds: 0.0/0
       Key cache size / save period in seconds: 20.0/14400
       Memtable thresholds: 2.3247/1440/496 (millions of ops/minutes/MB)
       GC grace seconds: 864000
       Compaction min/max thresholds: 4/32
       Read repair chance: 1.0
       Replicate on write: true
       Built indexes: []

 Thanks in advance.

 Evgeny.




Re: selective replication

2011-09-14 Thread Adrian Cockcroft
This has been proposed a few times and there are some good use cases for
it, but there is no current mechanism for it; it has only been discussed
as a possible enhancement.

Adrian

On Wed, Sep 14, 2011 at 11:06 AM, Todd Burruss bburr...@expedia.com wrote:
 Has anyone done any work on what I'll call selective replication between
 DCs?  I want to use Cassandra to replicate data to another virtual DC (for
 analytical purposes), but only inserts, not deletes.
 Picture having two data centers, DC1 for OLTP of short lived data (say 90
 day window) and DC2 for OLAP (years of data).  DC2 would probably be a Brisk
 setup.
 In this scenario, clients would get/insert/delete from DC1 (the OLTP system)
 and DC1 would replicate inserts only to DC2 (the OLAP system) for analytics.
  I don't have any experience (yet) with multi-dc replication, but I don't
 think this is possible.
 Thoughts?


Re: Cassandra as in-memory cache

2011-09-11 Thread Adrian Cockcroft
You should be using the off-heap row cache option. That way you avoid GC
overhead, and the rows are stored in a compact serialized form, which means you
get more cache entries in RAM. The trade-off is slightly more CPU for
deserialization.
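
For example, switching an existing column family over is just a CF update (a
sketch against the 0.8/1.0-era Thrift CfDef; the cache size here is only
illustrative):

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;

public class EnableOffHeapRowCache {
    // Switch an existing column family to the serializing (off-heap) row cache.
    static void enable(Cassandra.Client client, CfDef existing) throws Exception {
        existing.setRow_cache_size(1000000);  // rows to cache; size to your working set
        existing.setRow_cache_provider("org.apache.cassandra.cache.SerializingCacheProvider");
        client.system_update_column_family(existing);
    }
}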

Adrian

On Sunday, September 11, 2011, aaron morton aa...@thelastpickle.com wrote:
 If the row cache is enabled, the read path will not use the SSTables.
Depending on the workload, I would then look at setting *low* memtable flush
settings to use as much memory as possible for the row cache.

 Then set the row cache save settings per CF to ensure the cache is warmed
when the node starts.

 The write path will still use the WAL, so you may want to disable the
commit log using the durable_writes setting on the keyspace.

 Hope that helps.

 -
 Aaron Morton
 Freelance Cassandra Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 10/09/2011, at 4:38 AM, kapil nayar wrote:

 Hi,

 Can we configure some column-families (or keyspaces) in Cassandra to
perform as a pure in-memory cache?

 The feature should let the memtables always be in-memory (never flushed
to the disk - sstables).
 The memtable flush threshold settings of time/ memory/ operations can be
set to a max value to achieve this.

 However, it seems uneven distribution of the keys across the nodes in the
cluster could lead to a Java out-of-memory error. In order to prevent
this error, can we overflow some entries to the disk?

 Thanks,
 Kapil




Re: Direct control over where data is stored?

2011-06-05 Thread Adrian Cockcroft
Sounds like Khanh thinks he can do joins... :-)

User-oriented data is easy: key by Facebook id and let Cassandra handle
placement. Set replication factor = 3 so you don't lose data and can do
consistent, but slower, read-after-write when you need it, using quorum.
If you are running on AWS you should distribute your replicas over
availability zones.

Then you can read A, read B, and join them in your app code. Single-digit
milliseconds for each read or write.
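
A rough sketch of what that app-side join looks like with the plain Thrift API
(the column family, key layout and consistency level are just examples):

import java.nio.ByteBuffer;
import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;

public class AppSideJoin {
    static SlicePredicate allColumns() {
        // Empty start/finish means "every column", capped at 100 per row here.
        SliceRange range = new SliceRange(
                ByteBuffer.allocate(0), ByteBuffer.allocate(0), false, 100);
        return new SlicePredicate().setSlice_range(range);
    }

    // Read user A and user B separately, then "join" them in application code.
    static void readAndJoin(Cassandra.Client client, ByteBuffer keyA, ByteBuffer keyB)
            throws Exception {
        ColumnParent users = new ColumnParent("Users");  // hypothetical CF keyed by user id
        List<ColumnOrSuperColumn> a =
                client.get_slice(keyA, users, allColumns(), ConsistencyLevel.QUORUM);
        List<ColumnOrSuperColumn> b =
                client.get_slice(keyB, users, allColumns(), ConsistencyLevel.QUORUM);
        // Combine the two result sets however the app needs; Cassandra never joins.
    }
}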

If you want to do bulk operations over many users, use Brisk with a Hadoop job.

HTH
Adrian

On Sat, Jun 4, 2011 at 9:32 PM, Maki Watanabe watanabe.m...@gmail.com wrote:
 You may be able to do it with the OrderPreservingPartitioner by working
 out the key-to-node mapping before storing data, or you may need your own
 custom Partitioner. Please note that you are responsible for distributing
 the load between nodes in this case.
 From an application design perspective, it is not clear to me why you
 need to store user A and his friends on the same box.

 maki


 2011/6/5 Khanh Nguyen nguyen.h.kh...@gmail.com:
 Hi everyone,

 Is it possible to have direct control over where objects are stored in
 Cassandra? For example, I have a Cassandra cluster of 4 machines and 4
 objects A, B, C, D; I want to store A at machine 1, B at machine 2, C
 at machine 3 and D at machine 4. My guess is that I need to intervene in
 the way Cassandra hashes an object into the keyspace? If so, how
 complicated would the task be?

 I'm new to the list and Cassandra. The reason I am asking is that my
 current project is related to social locality of data: if A and B are
 Facebook friends, I want to store their data as close as possible,
 preferably in the same machine in a cluster.

 Thank you.

 Regards,

 -k




 --
 w3m



Re: batch dump of data from cassandra?

2011-05-23 Thread Adrian Cockcroft
Hi Yang,

You could also use Hadoop (i.e. Brisk), and run a MapReduce job or
Hive query to extract and summarize/renormalize the data into whatever
format you like.

If you use sstable2json, you have to run it on every file on every node and
deduplicate/merge all the output across machines, which is what MapReduce
does anyway.

Our data flow is to take backups of a production cluster, restore a
backup to a different cluster running Hadoop, then run our point in
time data extraction for ETL processing by the BI team. The
backup/restore gives a frozen in time (consistent to a second or so)
cluster for extraction. Running live with Brisk means you are running
your extraction over a moving target.

Adrian

On Sun, May 22, 2011 at 11:14 PM, Yang tedd...@gmail.com wrote:
 Thanks Jonathan.

 On Sun, May 22, 2011 at 9:56 PM, Jonathan Ellis jbel...@gmail.com wrote:
 I'd modify SSTableExport.serializeRow (the sstable2json class) to
 output to whatever system you are targeting.

 On Sun, May 22, 2011 at 11:19 PM, Yang tedd...@gmail.com wrote:
 let's say periodically (daily) I need to dump out the contents of my
 Cassandra DB and do an import into Oracle, or some other custom data
 store. Is there a way to do it?

 I checked that you can do multi-get() but you probably can't pass the
 entire key domain into the API, cuz the entire db would be returned on
 a single thrift call, and probably overflow the
 API? plus multi-get underneath just sends out per-key lookups one by
 one, while I really do not care about which key corresponds to which
 result, a simple scraping of the underlying SSTable would
 be perfect, because I could utilize the file cache coherency as I read
 down the file.


 Thanks
 Yang




 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com




Re: batch dump of data from cassandra?

2011-05-23 Thread Adrian Cockcroft
Three ways to do this:

1. The client app does a get for every row key: lots of small network operations.

2. Brisk/Hive does a select(*), which is sent to each node to map, and then
the Hadoop network shuffle merges the results.

3. Write your own code to merge all the SSTables across the cluster.

So I think that Brisk is going to be easier to implement but also
closer in efficiency to the way you want to do it.

Adrian

On Monday, May 23, 2011, Yang tedd...@gmail.com wrote:
 thanks Sri

 I am trying to make sure that Brisk underneath does a simple scraping
 of the rows, instead of doing foreach key ( keys ) { lookup (key) }..
 after that, I can feel comfortable using Brisk for the import/export jobs

 yang

 On Mon, May 23, 2011 at 8:50 AM, SriSatish Ambati
 srisat...@datastax.com wrote:
 Adrian,
 +1
 Using Hive and Hadoop for the export/import of data from and to Cassandra is one
 of the original use cases we had in mind for Brisk. That also has the
 ability to parallelize the workload and finish rapidly.
 thanks,
 Sri
 On Sun, May 22, 2011 at 11:31 PM, Adrian Cockcroft
 adrian.cockcr...@gmail.com wrote:

 Hi Yang,

 You could also use Hadoop (i.e. Brisk), and run a MapReduce job or
 Hive query to extract and summarize/renormalize the data into whatever
 format you like.

 If you use sstable2json, you have to run on every file on every node,
 deduplicate/merge all the output across machines, which is what MR
 does anyway.

 Our data flow is to take backups of a production cluster, restore a
 backup to a different cluster running Hadoop, then run our point in
 time data extraction for ETL processing by the BI team. The
 backup/restore gives a frozen in time (consistent to a second or so)
 cluster for extraction. Running live with Brisk means you are running
 your extraction over a moving target.

 Adrian

 On Sun, May 22, 2011 at 11:14 PM, Yang tedd...@gmail.com wrote:
  Thanks Jonathan.
 
  On Sun, May 22, 2011 at 9:56 PM, Jonathan Ellis jbel...@gmail.com
  wrote:
  I'd modify SSTableExport.serializeRow (the sstable2json class) to
  output to whatever system you are targeting.
 
  On Sun, May 22, 2011 at 11:19 PM, Yang tedd...@gmail.com wrote:
  let's say periodically (daily) I need to dump out the contents of my
  Cassandra DB and do an import into Oracle, or some other custom data
  store. Is there a way to do it?
 
  I checked that you can do multi-get() but you probably can't pass the
  entire key domain into the API, cuz the entire db would be returned on
  a single thrift call, and probably overflow the
  API? plus multi-get underneath just sends out per-key lookups one by
  one, while I really do not care about which key corresponds to which
  result, a simple scraping of the underlying SSTable would
  be perfect, because I could utilize the file cache coherency as I read
  down the file.
 
 
  Thanks
  Yang
 
 
 
 
  --
  Jonathan Ellis
  Project Chair, Apache Cassandra
  co-founder of DataStax, the source for professional Cassandra support
  http://www.datastax.com
 
 



 --
 SriSatish Ambati
 Director of Engineering, DataStax
 @srisatish








Re: Ec2 Stress Results

2011-05-11 Thread Adrian Cockcroft
Hi Alex,

This has been a useful thread; we've been comparing your numbers with
our own tests.

Why did you choose four big instances rather than more smaller ones?

For $8/hr you get four m2.4xl instances with a total of 8 disks.
For $8.16/hr you could have twelve m1.xl instances with a total of 48 disks, 3x
the disk space, a bit less total RAM, and much more CPU.

When an instance fails, you have a 25% loss of capacity with 4 instances or an
8% loss of capacity with 12.

I don't think it makes sense (especially on EC2) to run fewer than 6
instances; we are mostly starting at 12-15.
We can also spread the instances over three EC2 availability zones,
with RF=3 and one copy of the data in each zone.

Cheers
Adrian


On Wed, May 11, 2011 at 5:25 PM, Alex Araujo
cassandra-us...@alex.otherinbox.com wrote:
 On 5/9/11 9:49 PM, Jonathan Ellis wrote:

 On Mon, May 9, 2011 at 5:58 PM, Alex Araujo cassandra-us...@alex.otherinbox.com wrote:
 How many replicas are you writing?

 Replication factor is 3.

 So you're actually spot on the predicted numbers: you're pushing
 20k*3=60k raw rows/s across your 4 machines.

 You might get another 10% or so from increasing memtable thresholds,
 but bottom line is you're right around what we'd expect to see.
 Furthermore, CPU is the primary bottleneck which is what you want to
 see on a pure write workload.

 That makes a lot more sense.  I upgraded the cluster to 4 m2.4xlarge
 instances (68GB of RAM/8 CPU cores) in preparation for application stress
 tests and the results were impressive @ 200 threads per client:

 | Server Nodes | Client Nodes | --keep-going | Columns | Client Threads | Total Threads | Rep Factor | Test Rate (writes/s) | Cluster Rate (writes/s) |
 |      4       |      3       |      N       |  1000   |      200       |      600      |     3      |        44644         |         133931          |

 The issue I'm seeing with app stress tests is that the rate will be
 comparable/acceptable at first (~100k w/s) and will degrade considerably
 (~48k w/s) until a flush and restart.  CPU usage will correspondingly be
 high at first (500-700%) and taper down to 50-200%.  My data model is pretty
 standard (This is pseudo-type information):

 Users<Column>
 UserId<32CharHash>: {
    email<String>: a...@b.com,
    first_name<String>: John,
    last_name<String>: Doe
 }

 UserGroups<SuperColumn>
 GroupId<UUID>: {
    UserId<32CharHash>: {
        date_joined<DateTime>: 2011-05-10 13:14.789,
        date_left<DateTime>: 2011-05-11 13:14.789,
        active<short>: 0|1
    }
 }

 UserGroupTimeline<Column>
 GroupId<UUID>: {
    date_joined<TimeUUID>: UserId<32CharHash>
 }

 UserGroupStatus<Column>
 CompositeId('GroupId<UUID>:UserId<32CharHash>'): {
    active<short>: 0|1
 }

 Every new User has a row in Users and a ColumnOrSuperColumn in the other 3
 CFs (total of 4 operations).  One notable difference is that the RAID0 on
 this instance type (surprisingly) only contains two ephemeral volumes and
 appears a bit more saturated in iostat, although not enough to clearly stand
 out as the bottleneck.  Is the bottleneck in this scenario likely memtable
 flush and/or commitlog rotation settings?

 RF = 2; ConsistencyLevel = One; -Xmx = 6GB; concurrent_writes: 64; all other
 settings are the defaults.  Thanks, Alex.



Re: best way to backup

2011-04-28 Thread Adrian Cockcroft
Netflix has also gone down this path: we run a regular full backup to
S3 of a compressed tar, and we have scripts that restore everything
into the right place on a different cluster (it needs the same node
count). We also pick up the SSTables as they are created and drop
them in S3.
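
A stripped-down sketch of the tar-and-upload step (hypothetical bucket and
paths; assumes the AWS SDK for Java and a snapshot already taken with
nodetool snapshot):

import java.io.File;

import com.amazonaws.auth.BasicAWSCredentials;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3Client;

public class SnapshotToS3 {
    public static void main(String[] args) throws Exception {
        // Hypothetical locations; adjust to your snapshot directory layout.
        String snapshotDir = "/var/lib/cassandra/data/MyKeyspace/snapshots/20110428";
        File archive = new File("/tmp/myks-20110428.tar.gz");

        // 1. Compress the snapshot directory.
        Process tar = new ProcessBuilder("tar", "czf", archive.getPath(),
                "-C", snapshotDir, ".").inheritIO().start();
        if (tar.waitFor() != 0) throw new IllegalStateException("tar failed");

        // 2. Push the archive to S3, one object per node per day.
        AmazonS3 s3 = new AmazonS3Client(new BasicAWSCredentials("ACCESS_KEY", "SECRET_KEY"));
        s3.putObject("my-backup-bucket", "cassandra/node1/myks-20110428.tar.gz", archive);
    }
}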

Whatever you do, make sure you have a regular process to restore the
data and verify that it contains what you think it should...

Adrian

On Thu, Apr 28, 2011 at 1:35 PM, Jeremy Hanna
jeremy.hanna1...@gmail.com wrote:
 One thing we're looking at doing is watching the Cassandra data directory and 
 backing up the SSTables to S3 when they are created.  Some guys at SimpleGeo 
 started tablesnap, which does this:
 https://github.com/simplegeo/tablesnap

 What it does is, for every SSTable that is pushed to S3, it also copies a JSON 
 file with the current files in the directory, so you can know what to restore 
 in that event (as far as I understand).

 On Apr 28, 2011, at 2:53 PM, William Oberman wrote:

 Even with N nodes for redundancy, I still want to have backups.  I'm an 
 Amazon person, so naturally I'm thinking S3.  Reading over the docs, and 
 messing with nodetool, it looks like each new snapshot contains the previous 
 snapshot as a subset (and I've read how Cassandra uses hard links to avoid 
 excessive disk use).  When does that pattern break down?

 I'm basically debating if I can do an rsync-like backup, or if I should do 
 a compressed tar backup.  And I obviously want multiple points in time.  S3 
 does allow file versioning, if a file or file name is changed/reused over 
 time (only matters in the rsync case).  My only concerns with compressed 
 tars are that I'll have to have free space to create the archive and that I get no 
 delta space savings on the backup (the former is solved by not allowing 
 the disk space to get so low and/or adding more nodes to bring down the 
 space used, the latter is solved by S3 being really cheap anyway).

 --
 Will Oberman
 Civic Science, Inc.
 3030 Penn Avenue., First Floor
 Pittsburgh, PA 15201
 (M) 412-480-7835
 (E) ober...@civicscience.com




Re: Multi-DC Deployment

2011-04-20 Thread Adrian Cockcroft
Queues replicate bad data just as well as anything else. The biggest
source of bad data is broken app code... You will still need to
implement a reconciliation/repair checker, as queues have their own
failure modes when they get backed up. We have also looked at using
queues to bounce data between Cassandra clusters for other reasons,
and they have their place. However, it is a lot more work to implement
than using existing, well-tested Cassandra functionality to do it for
us.

I think your code needs to retry a failed local-quorum read with a
read-one to get the behavior you are asking for.
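
Roughly, with the raw Thrift client (names are illustrative), the fallback looks
like this:

import java.nio.ByteBuffer;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnPath;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.TimedOutException;
import org.apache.cassandra.thrift.UnavailableException;

public class FallbackRead {
    // Try LOCAL_QUORUM first; if the local DC can't satisfy it, fall back to ONE.
    static ColumnOrSuperColumn read(Cassandra.Client client, ByteBuffer key, ColumnPath path)
            throws Exception {
        try {
            return client.get(key, path, ConsistencyLevel.LOCAL_QUORUM);
        } catch (UnavailableException e) {
            return client.get(key, path, ConsistencyLevel.ONE);
        } catch (TimedOutException e) {
            return client.get(key, path, ConsistencyLevel.ONE);
        }
    }
}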

Our approach to bad data and corruption issues is backups: wind back
to the last good snapshot. We have figured out incremental backups as
well as full ones. Our code has some local dependencies, but it could be the
basis for a generic solution.

Adrian

On Wed, Apr 20, 2011 at 6:08 PM, Terje Marthinussen
tmarthinus...@gmail.com wrote:
 Assuming that you generally put an API on top of this, delivering to two or
 more systems then boils down to a message queue issue or some similar
 mechanism which handles secure delivery of messages. Maybe not trivial, but
 there are many products that can help you with this, and it is a lot easier
 to implement than a fully distributed storage system.
 Yes, ideally Cassandra will not distribute corruption, but the reason you
 pay up to have 2 fully redundant setups in 2 different datacenters is
 because we do not live in an ideal world. Anyone having tested Cassandra
 since 0.7.0 with any real data will be able to testify how well it can mess
 things up.
 This is not specific to Cassandra; in fact, I would argue that this is in
 the blood of any distributed system. You want them to distribute, after all,
 and the tighter the coupling is between nodes, the better they distribute
 bad stuff as well as good stuff.
 There is a bigger risk for a complete failure with 2 tightly coupled
 redundant systems than with 2 almost completely isolated ones. The logic
 here is so simple it is really somewhat beyond discussion.
 There are a few other advantages of isolating the systems. Especially in
 terms of operation, 2 isolated systems would be much easier, as you could
 relatively risk-free try out a new Cassandra version in one datacenter, or upgrade one
 datacenter at a time if you needed major operational changes such as schema
 changes or other large changes to the data.
 I see the 2 copies in one datacenter + 1 (or maybe 2) in another as a low
 cost middle way between 2 full N+2 (RF=3) systems in both data centers.
 That is, in a traditional design where you need 1 node for normal service,
 you would have 1 extra replica for redundancy and one replica more (N+2
 redundancy) so you can do maintenance and still be redundant.
 If I have redundancy across datacenters, I would probably still want 2
 replicas to avoid network traffic between DCs in case of a node recovery,
 but N+2 may not be needed, as my risk policy may find it acceptable to run
 one datacenter without redundancy for a time-limited period for
 maintenance.
 That is, if my original requirement is 1 node, I could do with 3x the HW
 which is not all that much more than the 3x I need for one DC and a lot less
 than the 6x I need for 2 full N+2 systems.
 However, all of the above is really beyond the point of my original
 suggestion.
 Regardless of datacenters, redundancy and distribution of bad or good stuff,
 it would be good to have a way to return whatever data is there, but with a
 flag or similar stating that the consistency level was not met.
 Again, for a lot of services, it is fully acceptable, and a lot better, to
 return an almost complete (or maybe even complete, but not verified by
 quorum) result than no result at all.
 As far as I remember from the code, this just boils down to returning
 whatever you collected from the cluster and setting the proper flag or
 similar on the resultset rather than returning an error.
 Terje
 On Thu, Apr 21, 2011 at 5:01 AM, Adrian Cockcroft
 adrian.cockcr...@gmail.com wrote:

 Hi Terje,

 If you feed data to two rings, you will get inconsistency drift as an
 update to one succeeds and to the other fails from time to time. You
 would have to build your own read repair. This all starts to look like
 I don't trust Cassandra code to work, so I will write my own buggy
 one-off versions of Cassandra functionality. I lean towards using
 Cassandra features rather than rolling my own because there is a large
 community testing, fixing and extending Cassandra, and making sure
 that the algorithms are robust. Distributed systems are very hard to
 get right, I trust lots of users and eyeballs on the code more than
 even the best engineer working alone.

 Cassandra doesn't replicate sstable corruptions. It detects corrupt
 data and only replicates good data. Also data isn't replicated to
 three identical nodes in the way you imply, it's replicated around the
 ring. If you lose three nodes, you don't lose a whole node's

Re: Multi-DC Deployment

2011-04-19 Thread Adrian Cockcroft
If you want to use local quorum for a distributed setup, it doesn't
make sense to have less than RF=3 local and remote. Three copies at
both ends will give you high availability. Only one copy of the data
is sent over the wide area link (with recent versions).
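
The placement itself is just the keyspace's strategy options; here is a sketch
via the Thrift KsDef (keyspace and data center names are examples and must match
whatever your snitch reports):

import java.util.ArrayList;
import java.util.HashMap;
import java.util.Map;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.CfDef;
import org.apache.cassandra.thrift.KsDef;

public class TwoDcKeyspace {
    // Three replicas in each data center, so LOCAL_QUORUM works on both sides.
    static void create(Cassandra.Client client) throws Exception {
        KsDef ks = new KsDef();
        ks.setName("MyKeyspace");                       // hypothetical name
        ks.setStrategy_class("org.apache.cassandra.locator.NetworkTopologyStrategy");
        Map<String, String> replicas = new HashMap<String, String>();
        replicas.put("DC1", "3");
        replicas.put("DC2", "3");
        ks.setStrategy_options(replicas);
        ks.setCf_defs(new ArrayList<CfDef>());          // add column families separately
        client.system_add_keyspace(ks);
    }
}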

There is no need to use mirrored or RAID5 disk in each node in this
case, since you are using RAIN (N for nodes) to protect your data. So
the extra disk space to hold three copies at each end shouldn't be a
big deal. Netflix is using striped internal disks on EC2 nodes for
this.

Adrian

On Mon, Apr 18, 2011 at 11:16 PM, Terje Marthinussen
tmarthinus...@gmail.com wrote:
 Hum...
 Seems like it could be an idea, in a case like this, to have a mode where a result
 is always returned (if possible), but with a flag saying whether the consistency
 level was met, or to what level it was met (number of nodes answering, for
 instance)?
 Terje

 On Tue, Apr 19, 2011 at 1:13 AM, Jonathan Ellis jbel...@gmail.com wrote:

 They will time out until the failure detector realizes the DC1 nodes are
 down (~10 seconds). After that they will immediately return
 UnavailableException until DC1 comes back up.

 On Mon, Apr 18, 2011 at 10:43 AM, Baskar Duraikannu
 baskar.duraikannu...@gmail.com wrote:
  We are planning to deploy Cassandra across two data centers. Let us say
  that we went with three replicas, with 2 being in one data center and the
  last replica in the 2nd data center.

  What will happen to quorum reads and writes when DC1 goes down (2 of 3
  replicas are unreachable)?  Will they time out?
 
 
  Regards,
  Baskar



 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of DataStax, the source for professional Cassandra support
 http://www.datastax.com




Re: Pyramid Organization of Data

2011-04-14 Thread Adrian Cockcroft
We have similar requirements for wide area backup/archive at Netflix.

I think what you want is a replica with RF of at least 3 in NY for all the 
satellites, then each satellite could have a lower RF, but if you want safe 
local quorum I would use 3 everywhere.

Then NY is the sum of all the satellites, so that makes most use of the disk 
space.

For archival storage I suggest you use snapshots in NY and save compressed tar 
files of each keyspace in NY. We've been working on this to allow full and 
incremental backup and restore from our EC2 hosted Cassandra clusters to/from 
S3. Full backup/restore works fine, incremental and per-keyspace restore is 
being worked on.

Adrian

From: Patrick Julien pjul...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Thu, 14 Apr 2011 05:38:54 -0700
To: user@cassandra.apache.org
Subject: Re: Pyramid Organization of Data


Thanks,  I'm still working the problem so anything I find out I will post here.

Yes, you're right, that is the question I am asking.

No, adding more storage is not a solution since New York would have several 
hundred times more storage.

On Apr 14, 2011 6:38 AM, aaron morton aa...@thelastpickle.com wrote:
 I think your question is: NY is the archive; after a certain amount of time 
 we want to delete the row from the original DC but keep it in the archive in 
 NY.

 Once you delete a row, it's deleted as far as the client is concerned. 
 GCGraceSeconds is only concerned with when the tombstone marker can be 
 removed. If NY has a replica of a row from Tokyo and the row is deleted in 
 either DC, it will be deleted in the other DC as well.

 Some thoughts...
 1) Add more storage in the satellite DCs, then tilt your chair to celebrate a 
 job well done :)
 2) Run two clusters as you say.
 3) Just thinking out loud, and I know this does not work now. Would it be 
 possible to support per CF strategy options, so an archive CF only replicates 
 to NY? I can think of possible problems with repair and LOCAL_QUORUM; out of 
 interest, what else would it break?

 Hope that helps.
 Aaron



 On 14 Apr 2011, at 10:17, Patrick Julien wrote:

 We have been successful in implementing, at scale, the comments you
 posted here. I'm wondering what we can do about deleting data
 however.

 The way I see it, we have considerably more storage capacity in NY,
 but not in the other sites. Using this technique here, it occurs to
 me that we would replicate non-NY deleted rows back to NY. Is there a
 way to tell NY not to tombstone rows?

 The ideas I have so far:

 - Set GCGracePeriod to be much higher in NY than in the other sites.
 This way we can get to tombstone'd rows well beyond their disk life in
 other sites.
 - A variant on this solution is to set the TTL on rows in non NY sites
 and again, set the GCGracePeriod to be considerably higher in NY
 - break this up to multiple clusters and do one write from the client
  to its 'local' cluster and one write to the NY cluster.



  On Fri, Apr 8, 2011 at 7:15 PM, Jonathan Ellis jbel...@gmail.com wrote:
 No, I'm suggesting you have a Tokyo keyspace that gets replicated as
 {Tokyo: 2, NYC:1}, a London keyspace that gets replicated to {London:
 2, NYC: 1}, for example.

  On Fri, Apr 8, 2011 at 5:59 PM, Patrick Julien pjul...@gmail.com wrote:
 I'm familiar with this material. I hadn't thought of it from this
 angle but I believe what you're suggesting is that the different data
 centers would hold a different properties file for node discovery
 instead of using auto-discovery.

 So Tokyo, and others, would have a configuration that make it
 oblivious to the non New York data centers.
 New York would have a configuration that would give it knowledge of no
 other data center.

 Would that work? Wouldn't the NY data center wonder where these other
 writes are coming from?

  On Fri, Apr 8, 2011 at 6:38 PM, Jonathan Ellis jbel...@gmail.com wrote:
  On Fri, Apr 8, 2011 at 12:17 PM, Patrick Julien pjul...@gmail.com wrote:
 The problem is this: we would like the historical data from Tokyo to
 stay in Tokyo and only be replicated to New York. The one in London
 to be in London and only be replicated to New York and so on for all
 data centers.

 Is this currently possible with Cassandra? I believe we would need to
 run multiple clusters and migrate data manually from data centers to
 North America to achieve this. Also, any suggestions would also be
 welcomed.

 NetworkTopologyStrategy allows configuration replicas per-keyspace,
 per-datacenter:
 http://www.datastax.com/dev/blog/deploying-cassandra-across-multiple-data-centers

 --
 Jonathan Ellis
 Project Chair, Apache Cassandra
 co-founder of 

Re: Pyramid Organization of Data

2011-04-14 Thread Adrian Cockcroft
What you are asking for breaks the eventual consistency model, so you need to 
create a separate cluster in NYC that collects the same updates but has a much 
longer setting to time out the data for deletion, or doesn't get the deletes. 

One way is to have a trigger on writes on your pyramid nodes in NY that copies 
data over to the long-term analysis cluster. The two clusters won't be 
eventually consistent in the presence of failures, but with RF=3 you will get 
up to three triggers for each write, so you get three chances to get the copy 
done. 
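
A very rough sketch of that copy-on-write idea, using two independent Thrift
connections (the column family name is illustrative, and the real trigger
plumbing and failure handling are omitted):

import java.nio.ByteBuffer;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.Column;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;

public class DualWrite {
    // Write to the local (short-retention) cluster, then copy to the NY archive cluster.
    static void write(Cassandra.Client localCluster, Cassandra.Client archiveCluster,
                      ByteBuffer key, Column column) throws Exception {
        ColumnParent parent = new ColumnParent("Events");  // hypothetical CF
        // The local copy could also carry a TTL so it ages out of the satellite.
        localCluster.insert(key, parent, column, ConsistencyLevel.LOCAL_QUORUM);
        try {
            archiveCluster.insert(key, parent, column, ConsistencyLevel.ONE);
        } catch (Exception e) {
            // Best-effort here; a real implementation would queue the failed
            // write for retry/reconciliation.
        }
    }
}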

Adrian

On Apr 14, 2011, at 10:18 AM, Patrick Julien pjul...@gmail.com wrote:

 Thanks for your input Adrian, we've pretty much settled on this too.
 What I'm trying to figure out is how we do deletes.
 
 We want to do deletes in the satellites because:
 
 a) we'll run out of disk space very quickly with the amount of data we have
 b) we don't need more than 3 days worth of history in the satellites,
 we're currently planning for 7 days of capacity
 
 However, the deletes will get replicated back to NY.  In NY, we don't
 want that, we want to run hadoop/pig over all that data dating back to
 several months/years.  Even if we set the replication factor of the
 satellites to 1 and NY to 3, we'll run out of space very quickly in
 the satellites.
 
 
 On Thu, Apr 14, 2011 at 11:23 AM, Adrian Cockcroft
 acockcr...@netflix.com wrote:
 We have similar requirements for wide area backup/archive at Netflix.
 I think what you want is a replica with RF of at least 3 in NY for all the
 satellites, then each satellite could have a lower RF, but if you want safe
 local quorum I would use 3 everywhere.
 Then NY is the sum of all the satellites, so that makes most use of the disk
 space.
 For archival storage I suggest you use snapshots in NY and save compressed
 tar files of each keyspace in NY. We've been working on this to allow full
 and incremental backup and restore from our EC2 hosted Cassandra clusters
 to/from S3. Full backup/restore works fine, incremental and per-keyspace
 restore is being worked on.
 Adrian
 From: Patrick Julien pjul...@gmail.com
  Reply-To: user@cassandra.apache.org
 Date: Thu, 14 Apr 2011 05:38:54 -0700
  To: user@cassandra.apache.org
 Subject: Re: Pyramid Organization of Data
 
 Thanks,  I'm still working the problem so anything I find out I will post
 here.
 
 Yes, you're right, that is the question I am asking.
 
  No, adding more storage is not a solution since New York would have several
 hundred times more storage.
 
 On Apr 14, 2011 6:38 AM, aaron morton aa...@thelastpickle.com wrote:
 I think your question is NY is the archive, after a certain amount of
 time we want to delete the row from the original DC but keep it in the
 archive in NY.
 
 Once you delete a row, it's deleted as far as the client is concerned.
  GCGraceSeconds is only concerned with when the tombstone marker can be
 removed. If NY has a replica of a row from Tokyo and the row is deleted in
 either DC, it will be deleted in the other DC as well.
 
 Some thoughts...
  1) Add more storage in the satellite DCs, then tilt your chair to
 celebrate a job well done :)
 2) Run two clusters as you say.
 3) Just thinking out loud, and I know this does not work now. Would it be
 possible to support per CF strategy options, so an archive CF only
 replicates to NY ? Can think of possible problems with repair and
 LOCAL_QUORUM, out of interest what else would it break?
 
 Hope that helps.
 Aaron
 
 
 
 On 14 Apr 2011, at 10:17, Patrick Julien wrote:
 
 We have been successful in implementing, at scale, the comments you
 posted here. I'm wondering what we can do about deleting data
 however.
 
 The way I see it, we have considerably more storage capacity in NY,
 but not in the other sites. Using this technique here, it occurs to
 me that we would replicate non-NY deleted rows back to NY. Is there a
 way to tell NY not to tombstone rows?
 
 The ideas I have so far:
 
 - Set GCGracePeriod to be much higher in NY than in the other sites.
 This way we can get to tombstone'd rows well beyond their disk life in
 other sites.
 - A variant on this solution is to set the TTL on rows in non NY sites
 and again, set the GCGracePeriod to be considerably higher in NY
 - break this up to multiple clusters and do one write from the client
  to its 'local' cluster and one write to the NY cluster.
 
 
 
 On Fri, Apr 8, 2011 at 7:15 PM, Jonathan Ellis jbel...@gmail.com wrote:
 No, I'm suggesting you have a Tokyo keyspace that gets replicated as
 {Tokyo: 2, NYC:1}, a London keyspace that gets replicated to {London:
 2, NYC: 1}, for example.
 
 On Fri, Apr 8, 2011 at 5:59 PM, Patrick Julien pjul...@gmail.com
 wrote:
 I'm familiar with this material. I hadn't thought of it from this
 angle but I believe what you're suggesting is that the different data
 centers would hold a different properties file for node discovery
 instead of using auto-discovery.
 
 So

Re: Reclaim deleted rows space

2011-01-02 Thread Adrian Cockcroft
How many nodes do you have? You should be able to run a rolling compaction 
around the ring, one node at a time to minimize impact. If one node is too big 
an impact, maybe you should have a bigger cluster? If you are on EC2, try 
running more but smaller instances.

Adrian

From: shimi shim...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Sun, 2 Jan 2011 11:25:42 -0800
To: user@cassandra.apache.org
Subject: Reclaim deleted rows space

Let's assume I have:
* a single 100GB SSTable file
* min compaction threshold is set to 2

If I delete rows which are located in this file, is the only way to clean up the 
deleted rows to insert another 100GB of data or to trigger a painful 
major compaction?

Shimi


Re: WELCOME to user@cassandra.apache.org

2010-12-29 Thread Adrian Cockcroft
The book is available via safarionline.com - many 
educational institutions have group memberships and there is a free trial for 
individuals. It's run by the publishers and the authors get paid. There is a 
NoSQL training video there as well.

Adrian

On Dec 29, 2010, at 8:52 PM, asil klin asdk...@gmail.com wrote:

Can't I get it for free from anywhere? I am a student researching the Cassandra 
datastore and have no economically profitable reason to buy it.

If any one of you could pass me a copy, my email address is: 
asdk...@gmail.com

Thanks so much !

Asil

On Thu, Dec 30, 2010 at 10:10 AM, Germán Kondolf german.kond...@gmail.com wrote:
Hmm... what about just paying for it?

It costs less than $20 on Amazon for the Kindle version...
(http://www.amazon.com/Cassandra-Definitive-Guide-Eben-Hewitt/dp/1449390412).

// Germán Kondolf
http://twitter.com/germanklf
http://code.google.com/p/seide/
// @iPad

On 30/12/2010, at 01:26, asil klin asdk...@gmail.com wrote:

 Can anyone pass me a PDF copy of Cassandra: The Definitive Guide?


 Thanks.
 Asil



Re: Severe Reliability Problems - 0.7 RC2

2010-12-20 Thread Adrian Cockcroft
What filesystem are you using? You might try EXT3 or 4 vs. XFS as another area 
of diversity. It sounds as if the page cache or filesystem is messed up. Are 
there any clues in /var/log/messages? How much swap space do you have 
configured?

The kernel level debug stuff I know is all for Solaris unfortunately…

Adrian

From: Dan Hendry dan.hendry.j...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Mon, 20 Dec 2010 12:13:56 -0800
To: user@cassandra.apache.org
Subject: RE: Severe Reliability Problems - 0.7 RC2

Yes, I have tried that (although only twice). Same impact as a regular kill: 
nothing happens and I get no stacktrace output. It is however on my list of 
things to try again the next time a node dies. I am also not able to attach 
jstack to the process.

I have also tried disabling JNA (did not help) and I have now changed 
disk_access_mode from auto to mmap_index_only on two of the nodes.

Dan

From: Kani [mailto:javier.canil...@gmail.com]
Sent: December-20-10 14:14
To: user@cassandra.apache.org
Subject: Re: Severe Reliability Problems - 0.7 RC2

Have you tried sending a KILL -3 to the Cassandra process before you send KILL 
-9? That way you will see what the threads are doing (and maybe where they are 
blocking). The majority of the threads may point you to the right spot to look 
for the problem.

I'm not much of a Linux administrator, but when something goes weird in 
one of my own applications (Java running on a Linux box) I try that command to 
see what the application is doing or trying to do.

Kani
On Mon, Dec 20, 2010 at 3:48 PM, Dan Hendry dan.hendry.j...@gmail.com wrote:
I have been having severe and strange reliability problems within my Cassandra 
cluster. This weekend, all four of my nodes were down at once. Even now I am 
losing one every few hours. I have attached output from all the system 
monitoring commands I can think of.

What seems to happen is that the Java process locks up and sits at 100% 
system CPU usage (but no user CPU) (there are 8 cores, so 100% = 1/8 of total 
capacity). JMX freezes and the node effectively dies, but there is typically 
nothing unusual in the Cassandra logs. About the only thing which seems to be 
correlated is the flushing of memtables. One of the strangest stats I am 
getting when in this state is memory paging: 3727168.00 pages scanned/second 
(see sar -B output). Occasionally, if I leave the process alone (~1 h) it 
recovers (maybe 1 in 5 times), otherwise the only way to terminate the 
Cassandra process is with a kill -9. When this happens, Cassandra memory usage 
(as reported by JMX before it dies) is also reasonable (ex 6 GB out of 12 GB 
heap and 24 GB system).

This feels more like a system level problem than a Cassandra problem so I have 
tried diversifying my cluster, one node runs Ubuntu 10.10, the other three 
10.04. One runs OpenJDK (1.6.0_20), the rest run Sun JDK (1.6.0_22). Neither 
change seems to be correlated with the problem. These are pretty much stock Ubuntu 
installs so nothing special on that front.

Now this has been a relatively sudden development and I can potentially 
attribute it to a few things:
1. Upgrading to RC2
2. Ever-increasing amounts of data (there is less than 100 GB per node so this 
should not be the problem).
3. Migrating from a set of machines where data + commit log directories were on 
four small RAID 5 HDs to machines with two 500 GB drives: one for data and one 
for commitlog + OS. I have seen more IO wait on these new machines. But they 
have the same memory and system settings.

I am about at my wits' end on this one; any help would be appreciated.




Re: Cassandra Monitoring

2010-12-19 Thread Adrian Cockcroft
I'm currently working to configure AppDynamics to monitor Cassandra. It
does byte-code instrumentation, so there is an agent added to the
Cassandra JVM, which gives the ability to capture latency for requests and
see where the bottleneck is coming from. We have been using it on our
other Java apps. They have a free version to try it out. It doesn't track
thrift calls out of the box, but I'm encouraging AD to figure out a way to
do that, and working on a config for capturing the entry points in the
meantime.

The way the page cache works is that pages stay in memory linked to a
specific file. If you delete that file, the pages are all considered
invalid at that point, so get zero'ed out and go to the start of the free
list. So compaction creates a new file first (which is competing with
existing read traffic to try and keep its pages in memory) then removes
the old files that were being merged, so at that point there is a supply
of blank pages, but disk reads will be needed to warm up the cache again.
The use case that I'm working with is more like a persistent memcached
replacement, so we are trying to have more RAM than data on m2.4xl EC2
instances (~70GB) and keep all reads in memory all the time.

Adrian

On 12/19/10 5:36 AM, Peter Schuller peter.schul...@infidyne.com wrote:

 How / what are you monitoring? Best practices someone?

I recently set up monitoring using the cassandra-munin-plugins
(https://github.com/jamesgolick/cassandra-munin-plugins). However, due
to various little details, it wasn't too fun to integrate properly
with munin-node-configure and automated configuration management. Another
problem is the starting of a JVM for each use of jmxquery, which
can become an issue with many column families.

I like your web server idea. Something persistent that can sit there
and do the JMX acrobatics, and expose something more easily consumed
for stuff like munin/zabbix/etc. It would be pretty nice to have that
out of the box with Cassandra, though I expect that would be
considered bloat. :)
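
A bare-bones version of that persistent poller would just hold one JMX
connection open and read attributes on a timer (the MBean and attribute names
here are illustrative and vary by Cassandra version; 0.7 listens on port 8080
by default):

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class JmxPoller {
    public static void main(String[] args) throws Exception {
        // Keep one connection open instead of forking a JVM per check.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbeans = connector.getMBeanServerConnection();

        // Illustrative metric: pending compactions from the CompactionManager MBean.
        ObjectName compaction =
                new ObjectName("org.apache.cassandra.db:type=CompactionManager");
        while (true) {
            Object pending = mbeans.getAttribute(compaction, "PendingTasks");
            System.out.println("compaction PendingTasks=" + pending);
            Thread.sleep(60000);  // expose over HTTP for munin/zabbix instead of printing
        }
    }
}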

-- 
/ Peter Schuller