Re: Overwhelming a cluster with writes?

2010-04-06 Thread Ilya Maykov
I'm running the nodes with a JVM heap size of 6GB, and here are the
related options from my storage-conf.xml. As mentioned in the first
email, I left everything at the default value. I briefly googled
around for Cassandra performance tuning etc but haven't found a
definitive guide ... any help with tuning these parameters is greatly
appreciated!

  <DiskAccessMode>auto</DiskAccessMode>
  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
  <MemtableThroughputInMB>64</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
  <ConcurrentReads>8</ConcurrentReads>
  <ConcurrentWrites>64</ConcurrentWrites>
  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
  <GCGraceSeconds>864000</GCGraceSeconds>

-- Ilya

On Mon, Apr 5, 2010 at 11:26 PM, Boris Shulman shulm...@gmail.com wrote:
 You are running out of memory on your nodes. Before the final crash
 your nodes are probably slow  due to GC. What is your memtable size?
 What cache options did you configure?

 On Tue, Apr 6, 2010 at 7:31 AM, Ilya Maykov ivmay...@gmail.com wrote:
 Hi all,

 I've just started experimenting with Cassandra to get a feel for the
 system. I've set up a test cluster and to get a ballpark idea of its
 performance I wrote a simple tool to load some toy data into the
 system. Surprisingly, I am able to overwhelm my 4-node cluster with
 writes from a single client. I'm trying to figure out if this is a
 problem with my setup, if I'm hitting bugs in the Cassandra codebase,
 or if this is intended behavior. Sorry this email is kind of long,
 here is the TLDR version:

 While writing to Cassandra from a single node, I am able to get the
 cluster into a bad state, where nodes are randomly disconnecting from
 each other, write performance plummets, and sometimes nodes even
 crash. Further, the nodes do not recover as long as the writes
 continue (even at a much lower rate), and sometimes do not recover at
 all unless I restart them. I can get this to happen simply by throwing
 data at the cluster fast enough, and I'm wondering if this is a known
 issue or if I need to tweak my setup.

 Now, the details.

 First, a little bit about the setup:

 4-node cluster of identical machines, running cassandra-0.6.0-rc1 with
 the fixes for CASSANDRA-933, CASSANDRA-934, and CASSANDRA-936 patched
 in. Node specs:
 8-core Intel Xeon e5...@2.00ghz
 8GB RAM
 1Gbit ethernet
 Red Hat Linux 2.6.18
 JVM 1.6.0_19 64-bit
 1TB spinning disk houses both commitlog and data directories (which I
 know is not ideal).
 The client machine is on the same local network and has very similar specs.

 The cassandra nodes are started with the following JVM options:

 ./cassandra JVM_OPTS="-Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC -d64
 -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:+DisableExplicitGC"

 I'm using default settings for all of the tunable stuff at the bottom
 of storage-conf.xml. I also selected my initial tokens to evenly
 partition the key space when the cluster was bootstrapped. I am using
 the RandomPartitioner.
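
 (For reference, evenly spaced initial tokens for the RandomPartitioner are
 commonly computed as token_i = i * 2^127 / N for node i of N. The actual tokens
 used in this cluster aren't given in the thread; the sketch below just shows the
 arithmetic for the 4-node case.)

 import java.math.BigInteger;

 public class InitialTokens {
     public static void main(String[] args) {
         int nodes = 4;  // assumption: the 4-node cluster described above
         BigInteger ringSize = BigInteger.valueOf(2).pow(127);
         for (int i = 0; i < nodes; i++) {
             // token for node i: i * 2^127 / nodes
             System.out.println("node " + i + ": "
                 + ringSize.multiply(BigInteger.valueOf(i)).divide(BigInteger.valueOf(nodes)));
         }
     }
 }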

 Now, about the test. Basically I am trying to get an idea of just how
 fast I can make this thing go. I am writing ~250M data records into
 the cluster, replicated at 3x, using Ran Tavory's Hector client
 (Java), writing with ConsistencyLevel.ZERO and
 FailoverPolicy.FAIL_FAST. The client is using 32 threads with 8
 threads talking to each of the 4 nodes in the cluster. Records are
 identified by a numeric id, and I'm writing them in batches of up to
 10k records per row, with each record in its own column. The row key
 identifies the bucket into which records fall. So, records with ids 0
 - 9999 are written to row 0, 10000 - 19999 are written to row
 10000, etc. Each record is a JSON object with ~10-20 fields.

 Records: {  // Column Family
   0 : {  // row key for the start of the bucket. Buckets span a range
 of up to 10000 records
     1 : { /* some JSON */ },  // Column for record with id=1
     3 : { /* some more JSON */ },  // Column for record with id=3
    ...
     9999 : { /* ... */ }
   },
  10000 : {  // row key for the start of the next bucket
    10001 : ...
    10004 :
 }
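
 (As a minimal sketch of the bucketing arithmetic above -- illustrative only, not
 the actual loader code -- the row key is just the record id rounded down to the
 nearest multiple of the bucket size:)

 public final class BucketKeys {
     static final long BUCKET_SIZE = 10000L;

     // Row key of the bucket containing recordId: 0, 10000, 20000, ...
     static long rowKeyFor(long recordId) {
         return (recordId / BUCKET_SIZE) * BUCKET_SIZE;
     }

     public static void main(String[] args) {
         System.out.println(rowKeyFor(3));      // 0
         System.out.println(rowKeyFor(10004));  // 10000
         System.out.println(rowKeyFor(19999));  // 10000
     }
 }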

 I am reading the data out of a local, sorted file on the client, so I
 only write a row to Cassandra once all records for that row have been
 read, and each row is written to exactly once. I'm using a
 producer-consumer queue to pump data from the input reader thread to
 the output writer threads. I found that I have to throttle the reader
 thread heavily in order to get good behavior. So, if I make the reader
 sleep for 7 seconds every 1M records, everything is fine - the data
 loads in about an hour, half of which 

Re: Overwhelming a cluster with writes?

2010-04-06 Thread Ran Tavory
Do you see one of the disks used by cassandra filled up when a node crashes?

On Tue, Apr 6, 2010 at 9:39 AM, Ilya Maykov ivmay...@gmail.com wrote:

 I'm running the nodes with a JVM heap size of 6GB, and here are the
 related options from my storage-conf.xml. As mentioned in the first
 email, I left everything at the default value. I briefly googled
 around for Cassandra performance tuning etc but haven't found a
 definitive guide ... any help with tuning these parameters is greatly
 appreciated!

  <DiskAccessMode>auto</DiskAccessMode>
  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
  <MemtableThroughputInMB>64</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
  <ConcurrentReads>8</ConcurrentReads>
  <ConcurrentWrites>64</ConcurrentWrites>
  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
  <GCGraceSeconds>864000</GCGraceSeconds>

 -- Ilya

 On Mon, Apr 5, 2010 at 11:26 PM, Boris Shulman shulm...@gmail.com wrote:
  You are running out of memory on your nodes. Before the final crash
  your nodes are probably slow  due to GC. What is your memtable size?
  What cache options did you configure?
 
  On Tue, Apr 6, 2010 at 7:31 AM, Ilya Maykov ivmay...@gmail.com wrote:
  Hi all,
 
  I've just started experimenting with Cassandra to get a feel for the
  system. I've set up a test cluster and to get a ballpark idea of its
  performance I wrote a simple tool to load some toy data into the
  system. Surprisingly, I am able to overwhelm my 4-node cluster with
  writes from a single client. I'm trying to figure out if this is a
  problem with my setup, if I'm hitting bugs in the Cassandra codebase,
  or if this is intended behavior. Sorry this email is kind of long,
  here is the TLDR version:
 
  While writing to Cassandra from a single node, I am able to get the
  cluster into a bad state, where nodes are randomly disconnecting from
  each other, write performance plummets, and sometimes nodes even
  crash. Further, the nodes do not recover as long as the writes
  continue (even at a much lower rate), and sometimes do not recover at
  all unless I restart them. I can get this to happen simply by throwing
  data at the cluster fast enough, and I'm wondering if this is a known
  issue or if I need to tweak my setup.
 
  Now, the details.
 
  First, a little bit about the setup:
 
  4-node cluster of identical machines, running cassandra-0.6.0-rc1 with
  the fixes for CASSANDRA-933, CASSANDRA-934, and CASSANDRA-936 patched
  in. Node specs:
  8-core Intel Xeon e5...@2.00ghz
  8GB RAM
  1Gbit ethernet
  Red Hat Linux 2.6.18
  JVM 1.6.0_19 64-bit
  1TB spinning disk houses both commitlog and data directories (which I
  know is not ideal).
  The client machine is on the same local network and has very similar
 specs.
 
  The cassandra nodes are started with the following JVM options:
 
  ./cassandra JVM_OPTS="-Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC -d64
  -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:+DisableExplicitGC"
 
  I'm using default settings for all of the tunable stuff at the bottom
  of storage-conf.xml. I also selected my initial tokens to evenly
  partition the key space when the cluster was bootstrapped. I am using
  the RandomPartitioner.
 
  Now, about the test. Basically I am trying to get an idea of just how
  fast I can make this thing go. I am writing ~250M data records into
  the cluster, replicated at 3x, using Ran Tavory's Hector client
  (Java), writing with ConsistencyLevel.ZERO and
  FailoverPolicy.FAIL_FAST. The client is using 32 threads with 8
  threads talking to each of the 4 nodes in the cluster. Records are
  identified by a numeric id, and I'm writing them in batches of up to
  10k records per row, with each record in its own column. The row key
  identifies the bucket into which records fall. So, records with ids 0
  - 9999 are written to row 0, 10000 - 19999 are written to row
  10000, etc. Each record is a JSON object with ~10-20 fields.
 
  Records: {  // Column Family
    0 : {  // row key for the start of the bucket. Buckets span a range
  of up to 10000 records
      1 : { /* some JSON */ },  // Column for record with id=1
      3 : { /* some more JSON */ },  // Column for record with id=3
     ...
      9999 : { /* ... */ }
    },
   10000 : {  // row key for the start of the next bucket
 10001 : ...
 10004 :
  }
 
  I am reading the data out of a local, sorted file on the client, so I
  only write a row to Cassandra once all records for that row have been
  read, and each row is written to exactly once. I'm using a
  producer-consumer queue to pump data from the input reader thread to
  the output 

Re: Overwhelming a cluster with writes?

2010-04-06 Thread Ilya Maykov
No, the disks on all nodes have about 750GB free space. Also as
mentioned in my follow-up email, writing with ConsistencyLevel.ALL
makes the slowdowns / crashes go away.

-- Ilya

On Mon, Apr 5, 2010 at 11:46 PM, Ran Tavory ran...@gmail.com wrote:
 Do you see one of the disks used by cassandra filled up when a node crashes?

 On Tue, Apr 6, 2010 at 9:39 AM, Ilya Maykov ivmay...@gmail.com wrote:

 I'm running the nodes with a JVM heap size of 6GB, and here are the
 related options from my storage-conf.xml. As mentioned in the first
 email, I left everything at the default value. I briefly googled
 around for Cassandra performance tuning etc but haven't found a
 definitive guide ... any help with tuning these parameters is greatly
 appreciated!

  <DiskAccessMode>auto</DiskAccessMode>
  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
  <MemtableThroughputInMB>64</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
  <ConcurrentReads>8</ConcurrentReads>
  <ConcurrentWrites>64</ConcurrentWrites>
  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
  <GCGraceSeconds>864000</GCGraceSeconds>

 -- Ilya

 On Mon, Apr 5, 2010 at 11:26 PM, Boris Shulman shulm...@gmail.com wrote:
  You are running out of memory on your nodes. Before the final crash
  your nodes are probably slow  due to GC. What is your memtable size?
  What cache options did you configure?
 
  On Tue, Apr 6, 2010 at 7:31 AM, Ilya Maykov ivmay...@gmail.com wrote:
  Hi all,
 
  I've just started experimenting with Cassandra to get a feel for the
  system. I've set up a test cluster and to get a ballpark idea of its
  performance I wrote a simple tool to load some toy data into the
  system. Surprisingly, I am able to overwhelm my 4-node cluster with
  writes from a single client. I'm trying to figure out if this is a
  problem with my setup, if I'm hitting bugs in the Cassandra codebase,
  or if this is intended behavior. Sorry this email is kind of long,
  here is the TLDR version:
 
  While writing to Cassandra from a single node, I am able to get the
  cluster into a bad state, where nodes are randomly disconnecting from
  each other, write performance plummets, and sometimes nodes even
  crash. Further, the nodes do not recover as long as the writes
  continue (even at a much lower rate), and sometimes do not recover at
  all unless I restart them. I can get this to happen simply by throwing
  data at the cluster fast enough, and I'm wondering if this is a known
  issue or if I need to tweak my setup.
 
  Now, the details.
 
  First, a little bit about the setup:
 
  4-node cluster of identical machines, running cassandra-0.6.0-rc1 with
  the fixes for CASSANDRA-933, CASSANDRA-934, and CASSANDRA-936 patched
  in. Node specs:
  8-core Intel Xeon e5...@2.00ghz
  8GB RAM
  1Gbit ethernet
  Red Hat Linux 2.6.18
  JVM 1.6.0_19 64-bit
  1TB spinning disk houses both commitlog and data directories (which I
  know is not ideal).
  The client machine is on the same local network and has very similar
  specs.
 
  The cassandra nodes are started with the following JVM options:
 
  ./cassandra JVM_OPTS="-Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC -d64
  -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:+DisableExplicitGC"
 
  I'm using default settings for all of the tunable stuff at the bottom
  of storage-conf.xml. I also selected my initial tokens to evenly
  partition the key space when the cluster was bootstrapped. I am using
  the RandomPartitioner.
 
  Now, about the test. Basically I am trying to get an idea of just how
  fast I can make this thing go. I am writing ~250M data records into
  the cluster, replicated at 3x, using Ran Tavory's Hector client
  (Java), writing with ConsistencyLevel.ZERO and
  FailoverPolicy.FAIL_FAST. The client is using 32 threads with 8
  threads talking to each of the 4 nodes in the cluster. Records are
  identified by a numeric id, and I'm writing them in batches of up to
  10k records per row, with each record in its own column. The row key
  identifies the bucket into which records fall. So, records with ids 0
  - 9999 are written to row 0, 10000 - 19999 are written to row
  10000, etc. Each record is a JSON object with ~10-20 fields.
 
  Records: {  // Column Family
    0 : {  // row key for the start of the bucket. Buckets span a range
  of up to 10000 records
      1 : { /* some JSON */ },  // Column for record with id=1
      3 : { /* some more JSON */ },  // Column for record with id=3
     ...
      9999 : { /* ... */ }
    },
   10000 : {  // row key for the start of the next bucket
     10001 : ...
     10004 :
  }
 
  I am reading the data out of a 

Re: Overwhelming a cluster with writes?

2010-04-06 Thread Benjamin Black
You are blowing away the mostly saner JVM_OPTS running it that way.
Edit cassandra.in.sh (or wherever config is on your system) to
increase mx to 4G (not 6G, for now) and leave everything else
untouched and do not specify JVM_OPTS on the command line.  See if you
get the same behavior.


b

On Mon, Apr 5, 2010 at 11:48 PM, Ilya Maykov ivmay...@gmail.com wrote:
 No, the disks on all nodes have about 750GB free space. Also as
 mentioned in my follow-up email, writing with ConsistencyLevel.ALL
 makes the slowdowns / crashes go away.

 -- Ilya

 On Mon, Apr 5, 2010 at 11:46 PM, Ran Tavory ran...@gmail.com wrote:
 Do you see one of the disks used by cassandra filled up when a node crashes?

 On Tue, Apr 6, 2010 at 9:39 AM, Ilya Maykov ivmay...@gmail.com wrote:

 I'm running the nodes with a JVM heap size of 6GB, and here are the
 related options from my storage-conf.xml. As mentioned in the first
 email, I left everything at the default value. I briefly googled
 around for Cassandra performance tuning etc but haven't found a
 definitive guide ... any help with tuning these parameters is greatly
 appreciated!

  <DiskAccessMode>auto</DiskAccessMode>
  <RowWarningThresholdInMB>512</RowWarningThresholdInMB>
  <SlicedBufferSizeInKB>64</SlicedBufferSizeInKB>
  <FlushDataBufferSizeInMB>32</FlushDataBufferSizeInMB>
  <FlushIndexBufferSizeInMB>8</FlushIndexBufferSizeInMB>
  <ColumnIndexSizeInKB>64</ColumnIndexSizeInKB>
  <MemtableThroughputInMB>64</MemtableThroughputInMB>
  <BinaryMemtableThroughputInMB>256</BinaryMemtableThroughputInMB>
  <MemtableOperationsInMillions>0.3</MemtableOperationsInMillions>
  <MemtableFlushAfterMinutes>60</MemtableFlushAfterMinutes>
  <ConcurrentReads>8</ConcurrentReads>
  <ConcurrentWrites>64</ConcurrentWrites>
  <CommitLogSync>periodic</CommitLogSync>
  <CommitLogSyncPeriodInMS>10000</CommitLogSyncPeriodInMS>
  <GCGraceSeconds>864000</GCGraceSeconds>

 -- Ilya

 On Mon, Apr 5, 2010 at 11:26 PM, Boris Shulman shulm...@gmail.com wrote:
  You are running out of memory on your nodes. Before the final crash
  your nodes are probably slow  due to GC. What is your memtable size?
  What cache options did you configure?
 
  On Tue, Apr 6, 2010 at 7:31 AM, Ilya Maykov ivmay...@gmail.com wrote:
  Hi all,
 
  I've just started experimenting with Cassandra to get a feel for the
  system. I've set up a test cluster and to get a ballpark idea of its
  performance I wrote a simple tool to load some toy data into the
  system. Surprisingly, I am able to overwhelm my 4-node cluster with
  writes from a single client. I'm trying to figure out if this is a
  problem with my setup, if I'm hitting bugs in the Cassandra codebase,
  or if this is intended behavior. Sorry this email is kind of long,
  here is the TLDR version:
 
  While writing to Cassandra from a single node, I am able to get the
  cluster into a bad state, where nodes are randomly disconnecting from
  each other, write performance plummets, and sometimes nodes even
  crash. Further, the nodes do not recover as long as the writes
  continue (even at a much lower rate), and sometimes do not recover at
  all unless I restart them. I can get this to happen simply by throwing
  data at the cluster fast enough, and I'm wondering if this is a known
  issue or if I need to tweak my setup.
 
  Now, the details.
 
  First, a little bit about the setup:
 
  4-node cluster of identical machines, running cassandra-0.6.0-rc1 with
  the fixes for CASSANDRA-933, CASSANDRA-934, and CASSANDRA-936 patched
  in. Node specs:
  8-core Intel Xeon e5...@2.00ghz
  8GB RAM
  1Gbit ethernet
  Red Hat Linux 2.6.18
  JVM 1.6.0_19 64-bit
  1TB spinning disk houses both commitlog and data directories (which I
  know is not ideal).
  The client machine is on the same local network and has very similar
  specs.
 
  The cassandra nodes are started with the following JVM options:
 
  ./cassandra JVM_OPTS="-Xms6144m -Xmx6144m -XX:+UseConcMarkSweepGC -d64
  -XX:NewSize=1024m -XX:MaxNewSize=1024m -XX:+DisableExplicitGC"
 
  I'm using default settings for all of the tunable stuff at the bottom
  of storage-conf.xml. I also selected my initial tokens to evenly
  partition the key space when the cluster was bootstrapped. I am using
  the RandomPartitioner.
 
  Now, about the test. Basically I am trying to get an idea of just how
  fast I can make this thing go. I am writing ~250M data records into
  the cluster, replicated at 3x, using Ran Tavory's Hector client
  (Java), writing with ConsistencyLevel.ZERO and
  FailoverPolicy.FAIL_FAST. The client is using 32 threads with 8
  threads talking to each of the 4 nodes in the cluster. Records are
  identified by a numeric id, and I'm writing them in batches of up to
  10k records per row, with each record in its own column. The row key
  identifies the bucket into which records fall. So, records with ids 0
  - 9999 are written to row 0, 10000 - 19999 are written to row
  10000, etc. Each record is a JSON object with ~10-20 fields.
 
  Records: {  // Column Family
    0 : {  // row key for the 

Re: Memcached protocol?

2010-04-06 Thread Tatu Saloranta
On Mon, Apr 5, 2010 at 5:10 PM, Paul Prescod p...@ayogo.com wrote:
 On Mon, Apr 5, 2010 at 4:48 PM, Tatu Saloranta tsalora...@gmail.com wrote:
 ...

 I would think that there is also possibility of losing some
 increments, or perhaps getting duplicate increments?

 I believe that with vector clocks in Cassandra 0.7 you won't lose
 anything. The conflict resolver will do the summation for you
 properly.

 If I'm wrong, I'd love to hear more, though.

I think the key is that this is not automatic -- there is no general
mechanism for aggregating distinct modifications. Point being that you
could choose one amongst right answers, but not what to do with
concurrent modifications. So what is done instead is have
application-specific resolution strategy which makes use of semantics
of operations, to know how to combine such concurrent modifications
into correct answer. I don't know if this is trivial for case of
counter increments: especially since two concurrent increments give
same new value; yet correct combined result would be one higher (both
used base, added one).

That is to say, my understanding was that vector clocks would be
required but not sufficient for reconciliation of concurrent value
updates.

I may be off here; apologies if I have misunderstood some crucial piece.

-+ Tatu +-


Re: Overwhelming a cluster with writes?

2010-04-06 Thread Rob Coli

On 4/5/10 11:48 PM, Ilya Maykov wrote:

No, the disks on all nodes have about 750GB free space. Also as
mentioned in my follow-up email, writing with ConsistencyLevel.ALL
makes the slowdowns / crashes go away.


I am not sure if the above is consistent with the cause of #896, but the 
other symptoms ("I inserted a bunch of data really fast via Thrift and 
GC melted my machine!") sound like it.


https://issues.apache.org/jira/browse/CASSANDRA-896

=Rob


Re: Flush Commit Log

2010-04-06 Thread JKnight JKnight
Yes, no problem with my live Cassandra server.
Thanks,  Jonathan.

On Mon, Apr 5, 2010 at 11:19 PM, Jonathan Ellis jbel...@gmail.com wrote:

 On Mon, Apr 5, 2010 at 9:11 PM, JKnight JKnight beukni...@gmail.com
 wrote:
  Thanks Jonathan,
 
   When I run nodeprobe flush with the -host parameter pointing at the Cassandra
   server set up on my computer, my computer hangs because of Cassandra. (When I
   kill all the Java processes, the computer works well again.)

 Sounds like flush generates a lot of i/o.  Not surprising.

  Yesterday, when run nodeprobe flush on my live server, I didn't flush
 all
  keyspace so that commit log files weren't deleted. Today, after flush for
  all keyspace, commit log files were deleted

 So... no problem, right?

 -Jonathan




-- 
Best regards,
JKnight


How do vector clocks and conflicts work?

2010-04-06 Thread Paul Prescod
This may be the blind leading the blind...

On Mon, Apr 5, 2010 at 11:54 PM, Tatu Saloranta tsalora...@gmail.comwrote:
...


 I think the key is that this is not automatic -- there is no general
 mechanism for aggregating distinct modifications. Point being that you
 could choose one amongst right answers, but not what to do with
 concurrent modifications. So what is done instead is have
 application-specific resolution strategy which makes use of semantics
 of operations, to know how to combine such concurrent modifications
 into correct answer.


I agree with all of that.


 I don't know if this is trivial for case of
 counter increments: especially since two concurrent increments give
 same new value; yet correct combined result would be one higher (both
 used base, added one).


As long as the conflict resolver knows that two writers each tried to
increment, then it can increment twice. The conflict resolver must know
about the semantics of increment or decrement or string append or
binary patch or whatever other merge strategy you choose. You'll register
your strategy with Cassandra and it will apply it. Presumably it will also
maintain enough context about what you were trying to accomplish to allow
the merge strategy plugin to do it properly.
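
(As a purely hypothetical illustration of the kind of plugin being discussed --
none of these types exist in Cassandra -- a counter merge strategy would need the
common ancestor as well as the two concurrent versions:)

interface MergeStrategy<T> {
    // Combine two concurrent versions that diverged from a common ancestor.
    T merge(T base, T versionA, T versionB);
}

class CounterMergeStrategy implements MergeStrategy<Long> {
    public Long merge(Long base, Long versionA, Long versionB) {
        // Each writer's delta relative to the shared base is preserved, so two
        // concurrent "+1" writes on base=5 merge to 7, not 6.
        return base + (versionA - base) + (versionB - base);
    }
}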


 That is to say, my understanding was that vector clocks would be
 required but not sufficient for reconciliation of concurrent value
 updates.


I agree. They are necessary, but not sufficient. The other half is the
merge strategy plugin thing, which is analogous to custom comparators in
Cassandra today.

In CASSANDRA-580, Pedro Gomes talks about the plugins like this: I suppose
for the beginning of the discussion that some sort of interface will be
implemented to allow pluggable logic to be added to the server, personalized
scripts were an idea, I have heard. 

Kelvin Kakugawa replies that they'll just use Java class libraries as a first
pass.

 Paul Prescod


Re: Overwhelming a cluster with writes?

2010-04-06 Thread Ilya Maykov
That does sound similar. It's possible that the difference I'm seeing
between ConsistencyLevel.ZERO and ConsistencyLevel.ALL is simply due
to the fact that using ALL slows down the writers enough that the GC
can keep up. I could do a test with multiple clients writing at ALL in
parallel tomorrow. If there are still no problems writing at ALL even
with extra load from additional clients, that might point to problems
in how async writes are handled vs. sync writes.

I will also do some profiling of the server processes with both ZERO
and ALL writer behaviors and report back.

RE: JVM_OPTS, I will try running with the more sane options (but a
larger heap) as well.

-- Ilya

On Mon, Apr 5, 2010 at 11:59 PM, Rob Coli rc...@digg.com wrote:
 On 4/5/10 11:48 PM, Ilya Maykov wrote:

 No, the disks on all nodes have about 750GB free space. Also as
 mentioned in my follow-up email, writing with ConsistencyLevel.ALL
 makes the slowdowns / crashes go away.

 I am not sure if the above is consistent with the cause of #896, but the
 other symptoms ("I inserted a bunch of data really fast via Thrift and GC
 melted my machine!") sound like it.

 https://issues.apache.org/jira/browse/CASSANDRA-896

 =Rob



Re: Overwhelming a cluster with writes?

2010-04-06 Thread Ilya Maykov
Right, I meant 4GB heap vs. the standard 1GB. And all other options in
cassandra.in.sh at their defaults.

Sorry I am a bit new to JVM tuning, and very new to Cassandra :)

-- Ilya

On Tue, Apr 6, 2010 at 12:16 AM, Benjamin Black b...@b3k.us wrote:
 I am specifically suggesting you NOT use a heap that large with your
 8GB machines.  Please test with 4GB first.

 On Tue, Apr 6, 2010 at 12:13 AM, Ilya Maykov ivmay...@gmail.com wrote:
 That does sound similar. It's possible that the difference I'm seeing
 between ConsistencyLevel.ZERO and ConsistencyLevel.ALL is simply due
 to the fact that using ALL slows down the writers enough that the GC
 can keep up. I could do a test with multiple clients writing at ALL in
 parallel tomorrow. If there are still no problems writing at ALL even
 with extra load from additional clients, that might point to problems
 in how async writes are handled vs. sync writes.

 I will also do some profiling of the server processes with both ZERO
 and ALL writer behaviors and report back.

 RE: JVM_OPTS, I will try running with the more sane options (but a
 larger heap) as well.

 -- Ilya

 On Mon, Apr 5, 2010 at 11:59 PM, Rob Coli rc...@digg.com wrote:
 On 4/5/10 11:48 PM, Ilya Maykov wrote:

 No, the disks on all nodes have about 750GB free space. Also as
 mentioned in my follow-up email, writing with ConsistencyLevel.ALL
 makes the slowdowns / crashes go away.

  I am not sure if the above is consistent with the cause of #896, but the
  other symptoms ("I inserted a bunch of data really fast via Thrift and GC
  melted my machine!") sound like it.

 https://issues.apache.org/jira/browse/CASSANDRA-896

 =Rob





Re: Memcached protocol?

2010-04-06 Thread gabriele renzi
On Tue, Apr 6, 2010 at 2:10 AM, Paul Prescod p...@ayogo.com wrote:
 On Mon, Apr 5, 2010 at 4:48 PM, Tatu Saloranta tsalora...@gmail.com wrote:
 ...

 I would think that there is also possibility of losing some
 increments, or perhaps getting duplicate increments?

 I believe that with vector clocks in Cassandra 0.7 you won't lose
 anything. The conflict resolver will do the summation for you
 properly.

 If I'm wrong, I'd love to hear more, though.

I keep reading this in the list, but why would vector clocks allow
consistent counters in a conflicting update?
Say nodes A,B,C where A,B get concurrent updates, if we do
read-and-set this does not seem useful as we'd end up with a vector
A:x+1,B:x+1 but why would x+1 be the correct value compared to x+2 ?
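
(A toy illustration of that lost update, not Cassandra client code: both writers
read the same base and compute the same "new" value, so one increment vanishes.)

public class LostIncrement {
    public static void main(String[] args) {
        long stored = 5;                // value both clients read
        long fromClientA = stored + 1;  // A writes 6
        long fromClientB = stored + 1;  // B writes 6 concurrently
        // Whichever write wins conflict resolution, the stored value ends up 6,
        // although two increments were intended (7).
        System.out.println("A=" + fromClientA + " B=" + fromClientB);
    }
}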

Or are we imagining spreading pairs key,INCR, key,DECR in which we
assume the writer client did not look at the existing value?


-- 
blog en: http://www.riffraff.info
blog it: http://riffraff.blogsome.com


Re: how to store list data in Apache Cassndra ?

2010-04-06 Thread David Strauss
Another option is to use a SuperColumnFamily, but that extends the depth
of all such values to be arrays. The name and age columns would
therefore also need to be SuperColumns -- just with a single sub-column
each.

Like many things in Cassandra, the preferred storage method depends on
your application's access patterns. It's quite unlike the normalization
procedure for an RDBMS, which is possible without knowing future queries.

On 2010-04-06 09:12, Michael Pearson wrote:
  Column Families are keyed attribute/value pairs; your 'girls' column
  will need to be serialised on save, and deserialised on load so that
  it can be treated as your intended array.  Pickle will do this for you
  (http://docs.python.org/library/pickle.html)
 
 eg:
 
  import pycassa
  import pickle
  client = pycassa.connect()
  cf = pycassa.ColumnFamily(client, 'mygame', 'user')
 
  key = '1234567890'
  value = {
  'name': 'Lee Li',
   'age': '21',
  'girls': pickle.dumps(['java', 'actionscript', 'python'])
  }
 
 cf.insert(key, value)
 
 hope that helps
 
 -michael
 
 
 On Tue, Apr 6, 2010 at 6:49 PM, Shuge Lee shuge@gmail.com wrote:
 Dear firends:

 how to store list data in Apache Cassndra ?

 For example:
 user['lee'] = {
 'name': 'lee',
  'age': '21',
 'girls': ['java', 'actionscript', 'python'],
 }
  Notice key `girls`

 I using pycassa (a python lib of cassandra)

 import pycassa
 client = pycassa.connect()
 cf = pycassa.ColumnFamily(client, 'mygame', 'user')

 key = '1234567890'
 value = {
 'name': 'Lee Li',

  'age': '21',
 'girls': ['java', 'actionscript', 'python'],
 }

 cf.insert(key, value)


  Oops, I get an error while saving a `value` like above.

 So, how to store list data in Apache Cassndra ?


 Thanks for reply.




 --
 Shuge Lee | Lee Li



-- 
David Strauss
   | da...@fourkitchens.com
Four Kitchens
   | http://fourkitchens.com
   | +1 512 454 6659 [office]
   | +1 512 870 8453 [direct]





i have one mistake in Cassandra.java when i build it

2010-04-06 Thread 叶江
hi:
   I want to run some experiments on Cassandra from Java, but when I write the
client, an error "cannot convert int to ConsistencyLevel" appears. How can
I solve this? Thanks very much.


Re: i have one mistake in Cassandra.java when i build it

2010-04-06 Thread Jonathan Ellis
This means you rebuilt the Thrift code with an old compiler.

If you look in lib/ the thrift jar is tagged with the svn revision we
built with.  Thrift has frequent regressions, so using that same revision
is the best way to avoid unpleasant surprises.

On Tue, Apr 6, 2010 at 4:34 AM, 叶江 yejiang...@gmail.com wrote:
 hi:
    I want to run some experiments on Cassandra from Java, but when I write the
 client, an error "cannot convert int to ConsistencyLevel" appears. How can
 I solve this? Thanks very much.


Re: Overwhelming a cluster with writes?

2010-04-06 Thread Jonathan Ellis
On Tue, Apr 6, 2010 at 2:13 AM, Ilya Maykov ivmay...@gmail.com wrote:
 That does sound similar. It's possible that the difference I'm seeing
 between ConsistencyLevel.ZERO and ConsistencyLevel.ALL is simply due
 to the fact that using ALL slows down the writers enough that the GC
 can keep up.

No, it's mostly due to ZERO meaning buffer this locally and write it
when it's convenient, and buffering takes memory.  If you check your
tpstats you will see the pending ops through the roof on the node
handling the thrift connections.
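
(A loose analogy, not Cassandra internals: an unbounded queue in front of a slow
worker accepts submissions instantly -- much as CL.ZERO acks before the write is
applied -- so the backlog, and the heap it occupies, grows instead of pushing back
on the producer.)

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class BacklogDemo {
    public static void main(String[] args) {
        ExecutorService slowWorker = Executors.newSingleThreadExecutor(); // unbounded queue
        for (int i = 0; i < 1000000; i++) {
            slowWorker.submit(new Runnable() {
                public void run() {
                    try { Thread.sleep(1); } catch (InterruptedException e) { }
                }
            });  // returns immediately; pending tasks pile up in memory
        }
        slowWorker.shutdown();
    }
}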


Re: Memcached protocol?

2010-04-06 Thread Jonathan Ellis
On Mon, Apr 5, 2010 at 6:48 PM, Tatu Saloranta tsalora...@gmail.com wrote:
 I would think that there is also possibility of losing some
 increments, or perhaps getting duplicate increments?
 It is not just isolation but also correctness that is hard to maintain.
 This can be more easily worked around in cases
 where there is additional data that can be used to resolve potentially
 ambiguous changes (like inferring which of shopping cart additions are
 real, which duplicates).
 With more work I am sure it is possible to get things mostly working,
 it's just question of cost/benefit for specific use cases.

Let me inject a couple useful references:

http://pl.atyp.us/wordpress/?p=2601
http://blog.basho.com/2010/04/05/why-vector-clocks-are-hard/


if cassandra isn't ideal for keep track of counts, how does digg count diggs?

2010-04-06 Thread S Ahmed
From what I read in another thread, Cassandra isn't 'ideal'
for keeping track of counts.

For example, I would understand this to mean keeping track of which stories
were dugg.

If this is true, how would a site like digg keep track of the 'dugg'
counter?

Also, I am assuming with eventual consistency the number *may* not be 100%
accurate.  If you wanted it to be accurate, would you just use the Quorum
flag? (I believe quorum is to ensure all writes are written to disk)
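
(For what it's worth, quorum consistency is usually described in terms of replica
overlap rather than disk syncing: with replication factor N, a quorum is
floor(N/2)+1 replicas, and quorum reads and quorum writes overlap in at least one
replica because R + W > N. A tiny illustration:)

public class QuorumOverlap {
    public static void main(String[] args) {
        int replicationFactor = 3;
        int quorum = replicationFactor / 2 + 1;                 // 2 of 3 replicas
        boolean overlap = quorum + quorum > replicationFactor;  // 2 + 2 > 3
        System.out.println("quorum=" + quorum + ", read/write overlap=" + overlap);
    }
}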


odd problem retrieving binary values using C++

2010-04-06 Thread Chris Beaumont
Hi all...

I am having a pretty tough time retrieving binary values out of my DB...
I am using cassandra 0.5.1 on Centos 5.4 with java 1.6.0-19

Here is the simple test I am trying to run in C++

/* snip initialization */
 {
transport->open();

ColumnPath new_col;
new_col.__isset.column = true; /* this is required! */
new_col.column_family.assign("Standard2");
new_col.super_column.assign("");
new_col.column.assign("testing");

char *data_cstr = "this\0 is\0 data!";
std::string data;
data.assign(data_cstr, 15);

printf("Data '%s' has length %lu\n", data.c_str(), data.length());
// This properly returns 15

client.insert("Keyspace1", "newone", new_col, data, 55, ONE);

ColumnOrSuperColumn ret_val;

client.get(ret_val, "Keyspace1", "newone", new_col, ONE);

printf("Column name retrieved is: %s\n", ret_val.column.name.c_str());
printf("Value in column retrieved is: %s\n", ret_val.column.value.c_str());
// This only ever returns 'this' (i.e., everything before the first \0)
// I understand null termination in %s... see below
printf("Value has length %lu\n", ret_val.column.value.length());
// and this gives me 4

transport->close();
  }
/* snip the rest too! */

Am I missing something major in proceeding this way? 

I have tried GDB and eventually all I get back is a string containing 'this'.
Here is the dumped content of Keyspace1/Standard2-1-Data.db...
od -c /u01/cassandra/data/Keyspace1/Standard2-1-Data.db

000  \0   -   1   1   5   5   7   1   6   5   7   6   3   3   4   2
020   7   0   7   9   0   1   4   5   2   8   3   5   8   0   2   3
040   7   5   1   9   9   5   2   8   :   n   e   w   o   n   e  \0
060  \0  \0 264  \0  \0  \0   U  \0  \0  \0 003 254 355  \0 005   s
100   r  \0 020   j   a   v   a   .   u   t   i   l   .   B   i   t
120   S   e   t   n 375 210   ~   9   4 253   ! 003  \0 001   [  \0
140 004   b   i   t   s   t  \0 002   [   J   x   p   u   r  \0 002
160   [   J   x 004 265 022 261   u 223 002  \0  \0   x   p  \0
200  \0  \0 001  \0 202  \b  \0  \0  \0  \0  \0   x  \0  \0  \0   
220  \0  \a   t   e   s   t   i   n   g  \0  \a   t   e   s   t   i
240   n   g  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
260  \0   % 200  \0  \0  \0 200  \0  \0  \0  \0  \0  \0  \0  \0  \0
300  \0 001  \0  \a   t   e   s   t   i   n   g  \0  \0  \0  \0  \0
320  \0  \0  \0   7  \0  \0  \0 017   t   h   i   s  \0   i   s
340  \0   d   a   t   a   !
347

This shows that the data is stored properly to the db file.

# bin/cassandra-cli -host localhost
Connected to localhost/9160
Welcome to cassandra CLI.
cassandra> get Keyspace1.Standard2['newone']
=> (column=testing, value=this is data!, timestamp=55)
Returned 1 results.

Shows the same thing! It's there !!!

I would lean towards a Thrift interface problem... 

In any case... I'd be thankful if someone had a pointer/workaround to this 
show-stopper
of mine...

Best

Chris.



  



Re: if cassandra isn't ideal for keep track of counts, how does digg count diggs?

2010-04-06 Thread S Ahmed
Chris,

When you say patch, does that mean for Cassandra or your own internal
codebase?

Sounds interesting thanks!

On Tue, Apr 6, 2010 at 12:54 PM, Chris Goffinet goffi...@digg.com wrote:

 That's not true. We have been using the Zookeeper work we posted on jira.
 That's what we are using internally and have been for months. We are now
 just wrapping up our vector clocks + distributed counter patch so we can
 begin transitioning away from the Zookeeper approach because there are
 problems with it long-term.

 -Chris

 On Apr 6, 2010, at 9:50 AM, Ryan King wrote:

  They don't use cassandra for it yet.
 
  -ryan
 
  On Tue, Apr 6, 2010 at 9:00 AM, S Ahmed sahmed1...@gmail.com wrote:
   From what I read in another thread, Cassandra isn't
  'ideal'
   for keeping track of counts.
   For example, I would understand this to mean keeping track of which
  stories
   were dugg.
   If this is true, how would a site like digg keep track of the 'dugg'
   counter?
   Also, I am assuming with eventual consistency the number *may* not be
  100%
   accurate.  If you wanted it to be accurate, would you just use the
  Quorum
   flag? (I believe quorum is to ensure all writes are written to disk)




Re: if cassandra isn't ideal for keep track of counts, how does digg count diggs?

2010-04-06 Thread Chris Goffinet
http://issues.apache.org/jira/browse/CASSANDRA-704
http://issues.apache.org/jira/browse/CASSANDRA-721

We have our own internal codebase of Cassandra at Digg. But we are using those 
above patches until we have the vector clock work cleaned up; that patch will 
also go to jira. Most likely the vector clock work will go into 0.7, but since 
we run 0.6 and built it for that version, we will share that patch too.

-Chris

On Apr 6, 2010, at 10:17 AM, S Ahmed wrote:

 Chris,
 
 When you say patch, does that mean for Cassandra or your own internal 
 codebase?  
 
 Sounds interesting thanks!
 
 On Tue, Apr 6, 2010 at 12:54 PM, Chris Goffinet goffi...@digg.com wrote:
 That's not true. We have been using the Zookeeper work we posted on jira. 
 That's what we are using internally and have been for months. We are now just 
 wrapping up our vector clocks + distributed counter patch so we can begin 
 transitioning away from the Zookeeper approach because there are problems 
 with it long-term.
 
 -Chris
 
 On Apr 6, 2010, at 9:50 AM, Ryan King wrote:
 
  They don't use cassandra for it yet.
 
  -ryan
 
  On Tue, Apr 6, 2010 at 9:00 AM, S Ahmed sahmed1...@gmail.com wrote:
   From what I read in another thread, Cassandra isn't 'ideal'
   for keeping track of counts.
   For example, I would understand this to mean keeping track of which stories
   were dugg.
   If this is true, how would a site like digg keep track of the 'dugg'
   counter?
   Also, I am assuming with eventual consistency the number *may* not be 100%
   accurate.  If you wanted it to be accurate, would you just use the Quorum
   flag? (I believe quorum is to ensure all writes are written to disk)
 
 



Re: how to store list data in Apache Cassndra ?

2010-04-06 Thread Tatu Saloranta
On Tue, Apr 6, 2010 at 8:06 AM, Shuge Lee shuge@gmail.com wrote:
     'girls': pickle.dumps(['java', 'actionscript', 'python'])

 I think this is a really bad idea, I can't do any search if using Pickle.

Just to be sure: are you thinking of traditional queries, lookups by
values (find entries that have certain element in a list value)?
If so, you may be in trouble anyway: you can only do efficient queries
by primary entry key, not by values (Cassandra at least can do range
queries on keys, but still).

-+ Tatu +-


Re: A question of 'referential integrity'...

2010-04-06 Thread Tatu Saloranta
On Tue, Apr 6, 2010 at 10:12 AM, Steve sjh_cassan...@shic.co.uk wrote:
 On 06/04/2010 15:26, Eric Evans wrote:
...
 I've read all about QUORUM, and it is generally useful, but as far as I
 can tell, it can't give me a transaction...

Correct. Only individual operations are atomic, and ordering of
insertions is not guaranteed.

I think there were some logged Jira issues to allow grouping of
operations into what seems to amount to transactions, which could help
a lot here... but I can't find it now (or maybe it has only been
discussed so far?).
If I understand this correctly, it would just mean that you could send
a sequence of operations, to be completed as a unit (first into
journal, then into memtable etc).

-+ Tatu +-


Re: Overwhelming a cluster with writes?

2010-04-06 Thread Tatu Saloranta
On Tue, Apr 6, 2010 at 8:17 AM, Jonathan Ellis jbel...@gmail.com wrote:
 On Tue, Apr 6, 2010 at 2:13 AM, Ilya Maykov ivmay...@gmail.com wrote:
 That does sound similar. It's possible that the difference I'm seeing
 between ConsistencyLevel.ZERO and ConsistencyLevel.ALL is simply due
 to the fact that using ALL slows down the writers enough that the GC
 can keep up.

 No, it's mostly due to ZERO meaning buffer this locally and write it
 when it's convenient, and buffering takes memory.  If you check your
 tpstats you will see the pending ops through the roof on the node
 handling the thrift connections.


This sounds like a great FAQ entry? (apologies if it's already included)
So that ideally users would only use this setting if they (think they)
know what they are doing. :-)

-+ Tatu +-


Re: How do vector clocks and conflicts work?

2010-04-06 Thread Tatu Saloranta
On Tue, Apr 6, 2010 at 8:45 AM, Mike Malone m...@simplegeo.com wrote:
 As long as the conflict resolver knows that two writers each tried to
 increment, then it can increment twice. The conflict resolver must know
 about the semantics of increment or decrement or string append or
 binary patch or whatever other merge strategy you choose. You'll register
 your strategy with Cassandra and it will apply it. Presumably it will also
 maintain enough context about what you were trying to accomplish to allow
 the merge strategy plugin to do it properly.


 That is to say, my understanding was that vector clocks would be
 required but not sufficient for reconciliation of concurrent value
 updates.

 The way I envisioned eventually consistent counters working would require
 something slightly more sophisticated... but not too bad. As incr/decr
 operations happen on distributed nodes, each node would keep a (vector
 clock, delta) tuple for that node's local changes. When a client fetched the
 value of the counter the vector clock deltas and the reconciled count would
 be combined into a single result. Similarly, when a replication /
 hinted-handoff / read-repair reconciliation occurred the counts would be
 merged into a single (vector clock, count) tuple.
 Maybe there's a more elegant solution, but that's how I had been thinking
 about this particular problem.

I doubt there is any simple and elegant solution -- if there was, it
would have been invented in the 50s. :-)

Given this, yes, something along these lines sounds realistic. It also
sounds like implementation would greatly benefit (if not require)
foundational support from core, as opposed to being done outside of
Cassandra (which I understand you are suggesting). I wasn't sure if
the idea was to try to do this completely separate (aside from vector
clock support).
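
(A rough sketch of the per-node delta idea Mike describes above -- simplified,
ignoring the vector clocks and decrements, and not actual Cassandra code: each
replica adds only to its own slot, and a read or repair merges by taking per-node
maxima and summing.)

import java.util.HashMap;
import java.util.Map;

class ShardedCounter {
    private final Map<String, Long> deltaByNode = new HashMap<String, Long>();

    void increment(String localNodeId, long by) {       // applied on the local replica
        Long cur = deltaByNode.get(localNodeId);
        deltaByNode.put(localNodeId, (cur == null ? 0L : cur) + by);
    }

    void mergeFrom(ShardedCounter other) {               // replication / read-repair
        for (Map.Entry<String, Long> e : other.deltaByNode.entrySet()) {
            Long cur = deltaByNode.get(e.getKey());
            // per-node totals only grow, so taking the max makes merging idempotent
            deltaByNode.put(e.getKey(), cur == null ? e.getValue() : Math.max(cur, e.getValue()));
        }
    }

    long value() {                                        // what a client read returns
        long sum = 0;
        for (long d : deltaByNode.values()) sum += d;
        return sum;
    }
}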

-+ Tatu +-


Net::Cassandra::Easy deletion failed

2010-04-06 Thread Mike Gallamore

Seems to be internal to java/cassandra itself.

I have some tests and I want to make sure that I have a clean slate 
each time I run the test. Clean as far as my code cares is that value 
is not defined. I'm running  bin/cassandra -f with the default 
install/options. So at the beginning of my test I run:


$rc = $c->mutate([$key], family => 'Standard1', deletions => { byname =>
['value']});


Alas, the cassandra terminal/cassandra itself barfs out:

ERROR 10:59:15,779 Error in ThreadPoolExecutor
java.lang.RuntimeException: java.lang.UnsupportedOperationException: 
This operation is not supported for Super Columns.
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.UnsupportedOperationException: This operation is 
not supported for Super Columns.

at org.apache.cassandra.db.SuperColumn.timestamp(SuperColumn.java:137)
at 
org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:65)
at 
org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:29)
at 
org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
at 
org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
at 
org.apache.cassandra.db.RowMutationSerializer.freezeTheMaps(RowMutation.java:334)
at 
org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:346)
at 
org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:319)
at 
org.apache.cassandra.db.RowMutation.getSerializedBuffer(RowMutation.java:275)

at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:200)
at 
org.apache.cassandra.service.StorageProxy$3.runMayThrow(StorageProxy.java:310)
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)

... 3 more
ERROR 10:59:15,786 Fatal exception in thread 
Thread[ROW-MUTATION-STAGE:21,5,main]
java.lang.RuntimeException: java.lang.UnsupportedOperationException: 
This operation is not supported for Super Columns.
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)

at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.UnsupportedOperationException: This operation is 
not supported for Super Columns.

at org.apache.cassandra.db.SuperColumn.timestamp(SuperColumn.java:137)
at 
org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:65)
at 
org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:29)
at 
org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
at 
org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
at 
org.apache.cassandra.db.RowMutationSerializer.freezeTheMaps(RowMutation.java:334)
at 
org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:346)
at 
org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:319)
at 
org.apache.cassandra.db.RowMutation.getSerializedBuffer(RowMutation.java:275)

at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:200)
at 
org.apache.cassandra.service.StorageProxy$3.runMayThrow(StorageProxy.java:310)
at 
org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)

... 3 more

Anyone have any ideas what I'm doing wrong? The value field is just a 
json encoded digit so something like (30) not a real supercolumn but 
the Net::Cassandra::Easy docs didn't have any examples of removing a non 
supercolumns data. Really what I'd like to do is delete the whole row, 
but again I didn't find any examples of how to do this.


Re: A question of 'referential integrity'...

2010-04-06 Thread Steve
On 06/04/2010 18:50, Benjamin Black wrote:
 I'm finding this exchange very confusing.  What exactly about
 Cassandra 'looks absolutely ideal' to you for your project?  The write
 performance, the symmetric, peer to peer architecture, etc?
   

Reasons I like Cassandra for this project:

* Columnar rather than tabular data structures with an extensible
  'schemata' - permitting evolution of back-end data structures to
  support new features without down-time.
* Decentralised architecture with fault tolerance/redundancy
  permitting high availability on shoestring budget hardware in an
  easily scalable pool - in spite of needing to track rapidly
  changing data that precludes meaningful backup.
* Easy to establish that data will be efficiently sharded - allowing
  many concurrent reads and writes - i.e. systemic IO bandwidth is
  scalable - both for reading and writing.
* Lightweight, free and open-source physical data model that
  minimises risk of vendor lock-in or insurmountable problems with
  glitches in commercial closed-source libraries.

A shorter answer might be that, in all ways other than depending upon
'referential integrity' between two 'maps' of hash-values, the data for
the rest of my application looks remarkably like that of large sites
that we know already use Cassandra.

I'm trying to establish the most effective Cassandra approach to achieve
the logical 'referential integrity' while minimising resource
(memory/disk/CPU) use in order to minimise hardware costs for any given
deployment scale - all the while, retaining the above advantages.



Re: A question of 'referential integrity'...

2010-04-06 Thread Steve
On 06/04/2010 18:53, Tatu Saloranta wrote:
 I've read all about QUORUM, and it is generally useful, but as far as I
 can tell, it can't give me a transaction...
 
 Correct. Only individual operations are atomic, and ordering of
 insertions is not guaranteed.
   
As I thought.
 I think there were some logged Jira issues to allow grouping of
 operations into what seems to amount to transactions, which could help
 a lot here... but I can't find it now (or maybe it has only been
 discussed so far?).
 If I understand this correctly, it would just mean that you could send
 a sequence of operations, to be completed as a unit (first into
 journal, then into memtable etc).
   

I think we're on the same page.  I need an atomic 'transaction'
affecting multiple keys - so I write a tuple of all the updates
(inserts/deletes) as a single value into a 'merge-pending' keyset... and
(somehow - perhaps with memtable) I modify data read from  other keysets
to be as-if this 'merge-pending' data had already been been applied to
the independent keysets to which it relates.  A process/thread on each
node would continuously attempt to apply the multiple updates from the
merge-pending data before deleting it and dropping the associated
merge-data from the in-memory transformations. Latency should be very
low (like with a log-based file-system) and throughput should be
reasonably high because there should be a lot of flexibility in batch
processing the 'merge-pending' data.

This way, if there's a failure during merging, there's sufficient
durable record to complete the merge before serving any more remote
requests.  To the remote client, it appears indistinguishable from an
atomic transaction that affected more than one key.
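
(A very rough sketch of that merge-pending idea at the application level -- the
KV interface and names here are purely illustrative, not a Cassandra API:)

import java.nio.charset.Charset;
import java.util.List;

interface KV {                                   // hypothetical minimal client, illustration only
    void put(String key, byte[] value);          // single-key writes are assumed atomic and durable
    void delete(String key);
}

class MergePending {
    static void applyAtomically(KV store, String txId, List<String[]> updates) {
        // 1. Record the whole batch durably under one "merge-pending" key.
        StringBuilder batch = new StringBuilder();
        for (String[] kv : updates) batch.append(kv[0]).append('=').append(kv[1]).append('\n');
        store.put("merge-pending:" + txId, batch.toString().getBytes(Charset.forName("UTF-8")));

        // 2. Apply each individual update.
        for (String[] kv : updates) store.put(kv[0], kv[1].getBytes(Charset.forName("UTF-8")));

        // 3. Drop the intent record. If the node crashes before this point, the
        //    surviving merge-pending entry is re-applied on restart before serving
        //    reads, so readers never see a half-applied batch.
        store.delete("merge-pending:" + txId);
    }
}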



Re: How do vector clocks and conflicts work?

2010-04-06 Thread gabriele renzi
On Tue, Apr 6, 2010 at 9:11 AM, Paul Prescod pres...@gmail.com wrote:
 This may be the blind leading the blind...
 On Mon, Apr 5, 2010 at 11:54 PM, Tatu Saloranta tsalora...@gmail.com
 wrote:
...


 I think the key is that this is not automatic -- there is no general
 mechanism for aggregating distinct modifications. Point being that you
 could choose one amongst right answers, but not what to do with
 concurrent modifications. So what is done instead is have
 application-specific resolution strategy which makes use of semantics
 of operations, to know how to combine such concurrent modifications
 into correct answer.

 I agree with all of that.


 I don't know if this is trivial for case of
 counter increments: especially since two concurrent increments give
 same new value; yet correct combined result would be one higher (both
 used base, added one).

 As long as the conflict resolver knows that two writers each tried to
 increment, then it can increment twice. The conflict resolver must know
 about the semantics of increment or decrement or string append or
 binary patch or whatever other merge strategy you choose. You'll register
 your strategy with Cassandra and it will apply it. Presumably it will also
 maintain enough context about what you were trying to accomplish to allow
 the merge strategy plugin to do it properly.

as long as operations are commutative, isn't the conflict resolution
simply apply all ? A large number of useful operations can be
implemented this way (numeric incr/decr, set ops etc)


problem with Net::Cassanda::Easy deleting columns

2010-04-06 Thread Mike Gallamore
Hello I tried to post this earlier but something seems to have gone wrong
with sending the message.

I have a test perl script that I'm using to test the behaviour of some of my
existing code. It is important that the values start in  a clean state at
the beginning of the tests, as I'm incrementing values, checking scores
etc during the test and need to test that the values I expect are actually
what gets stored. To try to clear this I'm using the following
Net::Cassandra::Easy call (its a perl thrift wrapper):

$rc = $c->mutate([$key], family => 'Standard1', deletions => {
byname => ['value']});

The perl module is pretty poorly documented as are all the other ones I've
looked at (if someone has a better one to use I'd be interested).
Particularly the examples show that something like this is to be used to
delete a supercolumn from a key, but there is no examples of regular columns
or how to delete a whole key row from the datastore. Really all my code
cares is that the value comes back undefined until the test has actually
added a value for the value column.

When I run the code java/cassandra barfs and the test dies (cassandra seems
to keep running happily other than the exception dumps). Here is what
cassandra dumps out to the terminal (running bin/cassandra -f for the test):

da...@vader$ bin/cassandra -f
ERROR 13:01:43,301 Error in ThreadPoolExecutor
java.lang.RuntimeException: java.lang.UnsupportedOperationException: This operation is not supported for Super Columns.
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.UnsupportedOperationException: This operation is not supported for Super Columns.
    at org.apache.cassandra.db.SuperColumn.timestamp(SuperColumn.java:137)
    at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:65)
    at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:29)
    at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
    at org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
    at org.apache.cassandra.db.RowMutationSerializer.freezeTheMaps(RowMutation.java:334)
    at org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:346)
    at org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:319)
    at org.apache.cassandra.db.RowMutation.getSerializedBuffer(RowMutation.java:275)
    at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:200)
    at org.apache.cassandra.service.StorageProxy$3.runMayThrow(StorageProxy.java:310)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    ... 3 more
ERROR 13:01:43,308 Fatal exception in thread Thread[ROW-MUTATION-STAGE:3,5,main]
java.lang.RuntimeException: java.lang.UnsupportedOperationException: This operation is not supported for Super Columns.
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:34)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:619)
Caused by: java.lang.UnsupportedOperationException: This operation is not supported for Super Columns.
    at org.apache.cassandra.db.SuperColumn.timestamp(SuperColumn.java:137)
    at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:65)
    at org.apache.cassandra.db.ColumnSerializer.serialize(ColumnSerializer.java:29)
    at org.apache.cassandra.db.ColumnFamilySerializer.serializeForSSTable(ColumnFamilySerializer.java:87)
    at org.apache.cassandra.db.ColumnFamilySerializer.serialize(ColumnFamilySerializer.java:73)
    at org.apache.cassandra.db.RowMutationSerializer.freezeTheMaps(RowMutation.java:334)
    at org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:346)
    at org.apache.cassandra.db.RowMutationSerializer.serialize(RowMutation.java:319)
    at org.apache.cassandra.db.RowMutation.getSerializedBuffer(RowMutation.java:275)
    at org.apache.cassandra.db.RowMutation.apply(RowMutation.java:200)
    at org.apache.cassandra.service.StorageProxy$3.runMayThrow(StorageProxy.java:310)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    ... 3 more

Anyone have an idea of what would cause this? Importantly, I don't
necessarily know that the value exists in Cassandra before I want to delete
it; I want more of a "delete if exists" kind of behaviour. The end result
should be that if the key exists the column gets removed, and otherwise the
row simply continues to not exist. Ideally I would delete the whole row
(there is only one column at the moment, but in the future there will be
more than one, so manually deleting the value for each of the columns is a
bit of a pain).


Re: Net::Cassandra::Easy deletion failed

2010-04-06 Thread Ted Zlatanov
On Tue, 06 Apr 2010 11:07:03 -0700 Mike Gallamore 
mike.e.gallam...@googlemail.com wrote: 

MG Seems to be internal to java/cassandra itself.
MG I have some tests and I want to make sure that I have a clean slate
MG each time I run the test. Clean as far as my code cares is that
MG value is not defined. I'm running  bin/cassandra -f with the
MG default install/options. So at the beginning of my test I run:

Mike, you can submit bugs and questions directly to me, here, or through
http://rt.cpan.org (the CPAN bug tracker).  It's a good idea to test an
operation from the CLI that comes with Cassandra to make sure the
problem is not with the Net::Cassandra::Easy module.

Also, if you set $Net::Cassandra::Easy::DEBUG to 1, you'll see the
actual Thrift objects that get constructed.  In this case (N::C::Easy
0.08) I was constructing a super_column parameter which was wrong.
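
For example, a minimal sketch of that check (the connection arguments here
are from memory of the module's synopsis and the keyspace/key/column names
are just your test placeholders, so adjust to taste):

use strict;
use warnings;
use Net::Cassandra::Easy;

$Net::Cassandra::Easy::DEBUG = 1;   # dump the Thrift objects as they are built

my $c = Net::Cassandra::Easy->new(server   => 'localhost',
                                  port     => 9160,
                                  keyspace => 'Keyspace1');
$c->connect();

# The deletion that was failing; with DEBUG on you can inspect the Deletion
# that gets constructed (in 0.08 it carried the bogus super_column parameter).
$c->mutate(['testkey'],
           family    => 'Standard1',
           deletions => { byname => ['value'] });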

MG $rc = $c->mutate([$key], family => 'Standard1', deletions => { byname =>
['value']});
...
MG Anyone have any ideas what I'm doing wrong? The value field is just a
MG JSON-encoded digit, so something like (30), not a real supercolumn, but
MG the Net::Cassandra::Easy docs didn't have any examples of removing a
MG non-supercolumn's data. Really what I'd like to do is delete the whole
MG row, but again I didn't find any examples of how to do this.

It's a bug in N::C::Easy.  I fixed it in 0.09 so it will work properly with:

$rc = $c->mutate([$key], family => 'Standard1', deletions => { standard => 1,
byname => ['column1', 'column2'] });

AFAIK I can't specify "delete all columns" in a non-super CF using a
Deletion, so byname is required (I end up filling the column_names field in
the predicate).  OTOH I can delete a whole SuperColumn, so deleting
everything under one name at once is possible in a super CF.
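
So for your original "clean slate before each test" case, the setup would
look roughly like this (only a sketch: the connection arguments are from
memory of the module's synopsis, and 'Super1' / 'my_supercolumn' are made-up
names to illustrate the super-CF form described above):

use strict;
use warnings;
use Net::Cassandra::Easy;

my $c = Net::Cassandra::Easy->new(server   => 'localhost',
                                  port     => 9160,
                                  keyspace => 'Keyspace1');
$c->connect();

my $key = 'testkey';

# Standard CF: there is no "delete everything", so name each column to clear.
$c->mutate([$key],
           family    => 'Standard1',
           deletions => { standard => 1, byname => ['value'] });

# Super CF: a whole SuperColumn can be deleted by name, clearing all of the
# subcolumns under it in one go.
$c->mutate([$key],
           family    => 'Super1',
           deletions => { byname => ['my_supercolumn'] });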

The docs and tests were updated as well.  Let me know if you have
problems; it worked for me.  In the next release I'll update cassidy.pl
to work with non-super CFs as well.  Sorry for the inconvenience.

Thanks
Ted



Re: Net::Cassandra::Easy deletion failed

2010-04-06 Thread Ted Zlatanov
On Tue, 06 Apr 2010 13:24:45 -0700 Mike Gallamore 
mike.e.gallam...@googlemail.com wrote: 

MG Thanks for the reply. The newest version of the module I see on CPAN
MG is 0.08b. I actually had 0.07 installed and am using 0.6beta3 for
MG cassandra. Is there somewhere else I should look for the 0.09 version
MG of the module? I'll also upgrade to the release candidate version of
MG Cassandra and see if that helps.

It takes a few hours for CPAN to update all its mirrors.  I'm attaching
0.09 here since it's a tiny tarball.

Ted



Net-Cassandra-Easy-0.09.tar.gz
Description: Binary data


Re: How do vector clocks and conflicts work?

2010-04-06 Thread Mike Malone
On Tue, Apr 6, 2010 at 11:03 AM, Tatu Saloranta tsalora...@gmail.com wrote:

 On Tue, Apr 6, 2010 at 8:45 AM, Mike Malone m...@simplegeo.com wrote:
  As long as the conflict resolver knows that two writers each tried to
  increment, then it can increment twice. The conflict resolver must know
  about the semantics of increment or decrement or string append or
  binary patch or whatever other merge strategy you choose. You'll register
  your strategy with Cassandra and it will apply it. Presumably it will also
  maintain enough context about what you were trying to accomplish to allow
  the merge strategy plugin to do it properly.
 
   That is to say, my understanding was that vector clocks would be
   required but not sufficient for reconciliation of concurrent value
   updates.
 
  The way I envisioned eventually consistent counters working would require
  something slightly more sophisticated... but not too bad. As incr/decr
  operations happen on distributed nodes, each node would keep a (vector
  clock, delta) tuple for that node's local changes. When a client fetched
  the value of the counter, the vector clock deltas and the reconciled count
  would be combined into a single result. Similarly, when a replication /
  hinted-handoff / read-repair reconciliation occurred, the counts would be
  merged into a single (vector clock, count) tuple.
  Maybe there's a more elegant solution, but that's how I had been thinking
  about this particular problem.

 I doubt there is any simple and elegant solution -- if there was, it
 would have been invented in the 50s. :-)

 Given this, yes, something along these lines sounds realistic. It also
 sounds like the implementation would greatly benefit from (if not require)
 foundational support from core, as opposed to being done outside of
 Cassandra (which I understand you are suggesting). I wasn't sure if
 the idea was to try to do this completely separately (aside from vector
 clock support).


I'd probably put it in core. Or at least put some more generic support for
this sort of conflict resolution in core. I'm looking forward to seeing
Digg's patch for this stuff.
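
To make that concrete, here's a rough sketch (in Perl, to match the other
code in this thread) of the per-node (clock, delta) bookkeeping and the
merge I have in mind; all names are made up, and it ignores decrements,
persistence, and everything else a real implementation would need:

use strict;
use warnings;

# Hypothetical sketch: each replica tracks, per originating node, how many of
# that node's updates it has seen (a vector-clock entry) and the delta that
# node has contributed to the counter.
sub new_view { return { clock => {}, delta => {} } }

# An increment applied locally on $node bumps that node's clock entry and
# adds to that node's share of the count.
sub local_increment {
    my ($view, $node, $amount) = @_;
    $view->{clock}{$node}++;
    $view->{delta}{$node} = ($view->{delta}{$node} || 0) + $amount;
}

# Reconciliation (replication / hinted handoff / read repair): for each node,
# keep whichever replica has seen more of that node's updates -- the higher
# clock entry wins and its delta comes along with it.
sub merge {
    my ($left, $right) = @_;
    my %nodes = map { $_ => 1 } (keys %{ $left->{clock} }, keys %{ $right->{clock} });
    my $out = new_view();
    for my $n (keys %nodes) {
        my $src = ($left->{clock}{$n} || 0) >= ($right->{clock}{$n} || 0) ? $left : $right;
        $out->{clock}{$n} = $src->{clock}{$n} || 0;
        $out->{delta}{$n} = $src->{delta}{$n} || 0;
    }
    return $out;
}

# The value a client reads is just the sum of every node's delta.
sub counter_value {
    my ($view) = @_;
    my $total = 0;
    $total += $_ for values %{ $view->{delta} };
    return $total;
}

# Two replicas each apply one increment, then reconcile: both survive.
my ($r1, $r2) = (new_view(), new_view());
local_increment($r1, 'nodeA', 1);
local_increment($r2, 'nodeB', 1);
my $merged = merge($r1, $r2);
print counter_value($merged), "\n";   # prints 2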

Mike


Re: Net::Cassandra::Easy deletion failed

2010-04-06 Thread Mike Gallamore

On 04/06/2010 01:36 PM, Ted Zlatanov wrote:

On Tue, 06 Apr 2010 13:24:45 -0700 Mike Gallamore
mike.e.gallam...@googlemail.com wrote:

MG  Thanks for the reply. The newest version of the module I see on CPAN
MG  is 0.08b. I actually had 0.07 installed and am using 0.6beta3 for
MG  cassandra. Is there somewhere else I should look for the 0.09 version
MG  of the module? I'll also upgrade to the release candidate version of
MG  Cassandra and see if that helps.

It takes a few hours for CPAN to update all its mirrors.  I'm attaching
0.09 here since it's a tiny tarball.

Ted

   
Great, it works. Or at least the Cassandra/Thrift part seems to work. My
tests don't pass, but I think those are actual logic errors in the tests
now; the column does appear to be getting cleared okay with the new version
of the module. Thanks.