Re: tombstones problem with 1.0.8

2012-03-27 Thread John Laban
(Radim:  I'm assuming you mean "do not delete already-deleted columns", as
Ross doesn't delete his rows.)

Just to be clear about Ross' situation:  he continually inserts columns and
later deletes columns from the same set of rows.  As long as he *doesn't* *keep
deleting already-deleted columns* (which refreshes the tombstone on them),
the deleted columns *should* get cleaned up, right?  (Even though the row
itself continually gets new columns inserted and other columns deleted?)

Thanks,
John



On Tue, Mar 27, 2012 at 2:21 AM, Radim Kolar h...@filez.com wrote:

 On 27.3.2012 11:13, Ross Black wrote:

  Any pointers on what I should be looking for in our application that
 would be stopping the deletion of tombstones?

 Do not delete already-deleted rows. On read, Cassandra returns deleted rows
 as empty in range slices.
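
For what it's worth, here is a rough client-side sketch of that check-before-delete idea (Scala with the Hector client, untested; the column family name, key/column types and serializers are all assumptions): only issue a delete when the column is still live, so already-deleted columns don't get their tombstones refreshed with a newer timestamp.  Note the read-then-delete is not atomic, so this is only best-effort.

import me.prettyprint.hector.api.Keyspace
import me.prettyprint.hector.api.factory.HFactory
import me.prettyprint.cassandra.serializers.{BytesArraySerializer, StringSerializer}

def deleteIfPresent(ks: Keyspace, cf: String, rowKey: String, col: String): Unit = {
  val ss = StringSerializer.get
  val query = HFactory.createColumnQuery(ks, ss, ss, BytesArraySerializer.get)
  query.setColumnFamily(cf)
  query.setKey(rowKey)
  query.setName(col)
  if (query.execute().get() != null) {   // column still live?
    // issue the delete (one tombstone) only once; re-deleting an already-deleted
    // column would just write a fresh tombstone with a newer timestamp
    HFactory.createMutator(ks, ss).delete(rowKey, cf, col, ss)
  }
}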



Internal error processing get_slice (NullPointerException)

2012-03-26 Thread John Laban
Has anyone seen this particular NPE before from Cassandra?

This is on 1.0.8.  It seems to happen transiently on multiple nodes in my
cluster, every so often, and goes away.


ERROR [Thrift:45] 2012-03-26 19:59:12,024 Cassandra.java (line 3041)
Internal error processing get_slice
java.lang.NullPointerException
    at org.apache.cassandra.db.SliceFromReadCommand.maybeGenerateRetryCommand(SliceFromReadCommand.java:76)
    at org.apache.cassandra.service.StorageProxy.fetchRows(StorageProxy.java:724)
    at org.apache.cassandra.service.StorageProxy.read(StorageProxy.java:564)
    at org.apache.cassandra.thrift.CassandraServer.readColumnFamily(CassandraServer.java:128)
    at org.apache.cassandra.thrift.CassandraServer.getSlice(CassandraServer.java:283)
    at org.apache.cassandra.thrift.CassandraServer.multigetSliceInternal(CassandraServer.java:365)
    at org.apache.cassandra.thrift.CassandraServer.get_slice(CassandraServer.java:326)
    at org.apache.cassandra.thrift.Cassandra$Processor$get_slice.process(Cassandra.java:3033)
    at org.apache.cassandra.thrift.Cassandra$Processor.process(Cassandra.java:2889)
    at org.apache.cassandra.thrift.CustomTThreadPoolServer$WorkerProcess.run(CustomTThreadPoolServer.java:187)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)



The line in question is (I think) the one below.  Since the ternary already
guards against row being null, it looks like row.cf itself can sometimes be
null even when the row is not?

int liveColumnsInRow = row != null ? row.cf.getLiveColumnCount() : 0;


Thanks,
John


Re: Composite keys and range queries

2012-03-14 Thread John Laban
Hmm, now I'm really confused.

 This may be of use to you
http://www.datastax.com/dev/blog/schema-in-cassandra-1-1

This article is what I actually used to come up with my schema here.  In
the "Clustering, composite keys, and more" section they're using a schema
very similar to how I'm trying to use it.  They define a composite key
with two parts, expecting the first part to be used as the partition key
and the second part to be used for ordering.

 The hash for (uuid-1 , p1) may be 100 and the hash for (uuid-1, p2) may
be 1 .

Why?  Shouldn't only uuid-1 be used as the partition key?  (So shouldn't
those two hash to the same location?)

I'm thinking of using supercolumns for this instead as I know they'll work
(where the row key is the uuid and the supercolumn name is the priority),
but aren't composite row keys supposed to essentially replace the need for
supercolumns?

Thanks, and sorry if I'm getting this all wrong,
John



On Wed, Mar 14, 2012 at 12:52 AM, aaron morton aa...@thelastpickle.com wrote:

 You are seeing this http://wiki.apache.org/cassandra/FAQ#range_rp

 The hash for (uuid-1 , p1) may be 100 and the hash for (uuid-1, p2) may be
 1 .

 You cannot do what you want to. Even if you passed a start of
 (uuid1,empty) and no finish, you would not get only rows where the key
 starts with uuid1.

 This may be of use to you
 http://www.datastax.com/dev/blog/schema-in-cassandra-1-1

 Or you can store all the priorities that are valid for an ID in another
 row.

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
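
To make the point above concrete: with RandomPartitioner the entire serialized composite row key is hashed, not just the UUID component, so the two keys land on unrelated tokens.  A tiny untested sketch (Scala, using Hector's Composite and CompositeSerializer; the MD5 here only approximates what the partitioner does internally):

import java.security.MessageDigest
import java.util.UUID
import me.prettyprint.hector.api.beans.Composite
import me.prettyprint.cassandra.serializers.CompositeSerializer

val id = UUID.randomUUID()

def tokenOf(priority: Int): String = {
  val key = new Composite(id, Integer.valueOf(priority))   // the (uuid, priority) row key
  val keyBytes = CompositeSerializer.get.toBytes(key)      // the whole key is serialized...
  MessageDigest.getInstance("MD5").digest(keyBytes)        // ...and the whole thing is hashed
    .map("%02x".format(_)).mkString
}

println(tokenOf(1))   // these two digests share nothing, even though
println(tokenOf(2))   // the UUID component is identical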

 On 14/03/2012, at 1:05 PM, John Laban wrote:

  Forwarding to the Cassandra mailing list as well, in case this is more
 of an issue on how I'm using Cassandra.
 
  Am I correct to assume that I can use range queries on composite row
 keys, even when using a RandomPartitioner, if I make sure that the first
 part of the composite key is fixed?
 
  Any help would be appreciated,
  John
 
 
 
  On Tue, Mar 13, 2012 at 12:15 PM, John Laban j...@pagerduty.com wrote:
  Hi,
 
  I have a column family that uses a composite key:
 
  (ID, priority) - ...
 
  Where the ID is a UUID and the priority is an integer.
 
  I'm trying to perform a range query now:  I want all the rows where the
 ID matches some fixed UUID, but within a range of priorities.  This is
 supported even if I'm using a RandomPartitioner, right?  (Because the first
 key in the composite key is the partition key, and the second part of the
 composite key is automatically ordered?)
 
  So I perform a range slices query:
 
  val rangeQuery = HFactory.createRangeSlicesQuery(keyspace,
      new CompositeSerializer, StringSerializer.get, BytesArraySerializer.get)
  rangeQuery.setColumnFamily(RouteColumnFamilyName).
      setKeys( new Composite(id, priorityStart), new Composite(id, priorityEnd) ).
      setRange( null, null, false, Int.MaxValue )
 
 
  But I get this error:
 
  me.prettyprint.hector.api.exceptions.HInvalidRequestException:
 InvalidRequestException(why:start key's md5 sorts after end key's md5.
  this is not allowed; you probably should not specify end key at all, under
 RandomPartitioner)
 
  Shouldn't they have the same md5, since they have the same partition key?
 
  Am I using the wrong query here, or does Hector not support composite
 range queries, or am I making some mistake in how I think Cassandra's
 composite keys work?
 
  Thanks,
  John
 
 




Re: Composite keys and range queries

2012-03-14 Thread John Laban
Ahhh, ok, I thought that CQL was just being brought up to date with
the functionality already built into composite keys, but I guess I was
mistaken there.

But I guess it's just providing a convenient abstraction, using composite
column names under the hood.  That's where I was confused, thanks.

So, in terms of composite column names vs supercolumns:  is the only
advantage to composite column names that you can do column slicing on
subsets of the subcolumns? I.e. if I don't mind loading all of the
subcolumns for a given supercolumn name in memory at once (since I need
them all anyway), is there any disadvantage to using supercolumns here?
 They seem a little cleaner and more straightforward for my use case, since
I don't have the advantage of the CQL composite key thing.

Thanks,
John


On Wed, Mar 14, 2012 at 12:53 PM, Jeremiah Jordan 
jeremiah.jor...@morningstar.com wrote:

   Right, so until the new CQL stuff exists to actually query with
  something smart enough to know about composite keys, you have to define
  and query them on your own.

 Row Key = UUID
 Column = CompositeColumn(string, string)

  You want to then use COLUMN slicing, not row ranges, to query the data,
  where you slice on priority as the first part of a composite column name.

  See the "Under the hood and historical notes" section of the blog post.
  You want to lay out your data per the "Physical representation of the
  denormalized timeline rows" diagram, where your UUID is the user_id from
  the example and your priority is the tweet_id.

 -Jeremiah
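
For concreteness, a rough sketch of that layout (Scala with Hector, untested): the row key is the UUID, the column name is a Composite whose first component is the priority, and the query slices on that first component.  The "Routes" CF name, the second "item id" component, and the +1 trick on the end bound are my assumptions, not something from the thread.

import java.util.UUID
import me.prettyprint.hector.api.Keyspace
import me.prettyprint.hector.api.beans.Composite
import me.prettyprint.hector.api.factory.HFactory
import me.prettyprint.cassandra.serializers.{BytesArraySerializer, CompositeSerializer, IntegerSerializer, StringSerializer, UUIDSerializer}

def writeEntry(ks: Keyspace, id: UUID, priority: Int, itemId: String, value: Array[Byte]): Unit = {
  val name = new Composite()
  name.addComponent(Integer.valueOf(priority), IntegerSerializer.get)   // priority first, so it is sliceable
  name.addComponent(itemId, StringSerializer.get)
  val mutator = HFactory.createMutator(ks, UUIDSerializer.get)
  mutator.insert(id, "Routes",
    HFactory.createColumn(name, value, CompositeSerializer.get, BytesArraySerializer.get))
}

def readPriorityRange(ks: Keyspace, id: UUID, pStart: Int, pEnd: Int) = {
  val start = new Composite()
  start.addComponent(Integer.valueOf(pStart), IntegerSerializer.get)
  val end = new Composite()
  end.addComponent(Integer.valueOf(pEnd + 1), IntegerSerializer.get)    // +1 keeps priority == pEnd inclusive
  val query = HFactory.createSliceQuery(ks, UUIDSerializer.get, CompositeSerializer.get, BytesArraySerializer.get)
  query.setColumnFamily("Routes")
  query.setKey(id)
  query.setRange(start, end, false, Int.MaxValue)   // one row, sliced by the priority component
  query.execute().get().getColumns
}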


  --
 *From:* John Laban [j...@pagerduty.com]
 *Sent:* Wednesday, March 14, 2012 12:37 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Composite keys and range queries

   Hmm, now I'm really confused.

   This may be of use to you
 http://www.datastax.com/dev/blog/schema-in-cassandra-1-1

   This article is what I actually used to come up with my schema here.  In
  the "Clustering, composite keys, and more" section they're using a schema
  very similar to how I'm trying to use it.  They define a composite key
 with two parts, expecting the first part to be used as the partition key
 and the second part to be used for ordering.

   The hash for (uuid-1 , p1) may be 100 and the hash for (uuid-1, p2)
 may be 1 .

  Why?  Shouldn't only uuid-1 be used as the partition key?  (So
 shouldn't those two hash to the same location?)

  I'm thinking of using supercolumns for this instead as I know they'll
 work (where the row key is the uuid and the supercolumn name is the
 priority), but aren't composite row keys supposed to essentially replace
 the need for supercolumns?

  Thanks, and sorry if I'm getting this all wrong,
 John



  On Wed, Mar 14, 2012 at 12:52 AM, aaron morton aa...@thelastpickle.com wrote:

 You are seeing this http://wiki.apache.org/cassandra/FAQ#range_rp

 The hash for (uuid-1 , p1) may be 100 and the hash for (uuid-1, p2) may
 be 1 .

 You cannot do what you want to. Even if you passed a start of
  (uuid1,empty) and no finish, you would not get only rows where the key
 starts with uuid1.

 This may be of use to you
 http://www.datastax.com/dev/blog/schema-in-cassandra-1-1

 Or you can store all the priorities that are valid for an ID in another
 row.

 Cheers

 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com

 On 14/03/2012, at 1:05 PM, John Laban wrote:

  Forwarding to the Cassandra mailing list as well, in case this is more
 of an issue on how I'm using Cassandra.
 
  Am I correct to assume that I can use range queries on composite row
 keys, even when using a RandomPartitioner, if I make sure that the first
 part of the composite key is fixed?
 
  Any help would be appreciated,
  John
 
 
 
  On Tue, Mar 13, 2012 at 12:15 PM, John Laban j...@pagerduty.com
 wrote:
  Hi,
 
  I have a column family that uses a composite key:
 
  (ID, priority) - ...
 
  Where the ID is a UUID and the priority is an integer.
 
  I'm trying to perform a range query now:  I want all the rows where the
 ID matches some fixed UUID, but within a range of priorities.  This is
 supported even if I'm using a RandomPartitioner, right?  (Because the first
 key in the composite key is the partition key, and the second part of the
 composite key is automatically ordered?)
 
  So I perform a range slices query:
 
  val rangeQuery = HFactory.createRangeSlicesQuery(keyspace,
      new CompositeSerializer, StringSerializer.get, BytesArraySerializer.get)
  rangeQuery.setColumnFamily(RouteColumnFamilyName).
      setKeys( new Composite(id, priorityStart), new Composite(id, priorityEnd) ).
      setRange( null, null, false, Int.MaxValue )
 
 
  But I get this error:
 
  me.prettyprint.hector.api.exceptions.HInvalidRequestException:
 InvalidRequestException(why:start key's md5 sorts after end key's md5.
  this is not allowed; you probably should not specify end key at all, under
 RandomPartitioner)
 
  Shouldn't they have the same md5

Re: Composite keys and range queries

2012-03-13 Thread John Laban
Forwarding to the Cassandra mailing list as well, in case this is more of
an issue on how I'm using Cassandra.

Am I correct to assume that I can use range queries on composite row keys,
even when using a RandomPartitioner, if I make sure that the first part of
the composite key is fixed?

Any help would be appreciated,
John



On Tue, Mar 13, 2012 at 12:15 PM, John Laban j...@pagerduty.com wrote:

 Hi,

 I have a column family that uses a composite key:

 (ID, priority) - ...

 Where the ID is a UUID and the priority is an integer.

 I'm trying to perform a range query now:  I want all the rows where the ID
 matches some fixed UUID, but within a range of priorities.  This is
 supported even if I'm using a RandomPartitioner, right?  (Because the first
 key in the composite key is the partition key, and the second part of the
 composite key is automatically ordered?)

 So I perform a range slices query:

 val rangeQuery = HFactory.createRangeSlicesQuery(keyspace,
     new CompositeSerializer, StringSerializer.get, BytesArraySerializer.get)

 rangeQuery.setColumnFamily(RouteColumnFamilyName).
     setKeys( new Composite(id, priorityStart), new Composite(id, priorityEnd) ).
     setRange( null, null, false, Int.MaxValue )


 But I get this error:

 me.prettyprint.hector.api.exceptions.HInvalidRequestException: 
 InvalidRequestException(why:start key's md5 sorts after end key's md5.  this 
 is not allowed; you probably should not specify end key at all, under 
 RandomPartitioner)


 Shouldn't they have the same md5, since they have the same partition key?

 Am I using the wrong query here, or does Hector not support composite range
 queries, or am I making some mistake in how I think Cassandra's composite
 keys work?

 Thanks,
 John




Re: best practices for simulating transactions in Cassandra

2011-12-15 Thread John Laban
I'm actually using Curator as a Zookeeper client myself.  I haven't used it
in production yet, but so far it seems well written and Jordan Zimmerman at
Netflix has been great on the support end as well.

I haven't tried Cages so I can't really compare, but I think one of the
main deciding factors between the two depends on which zk recipes you need.

John


On Thu, Dec 15, 2011 at 12:07 AM, Boris Yen yulin...@gmail.com wrote:

 I am not sure if this is the right thread to ask about this.

  I read that some people are using Cages + ZooKeeper. I was wondering if
  anyone has evaluated https://github.com/Netflix/curator? It seems to be a
  versatile package.

 On Tue, Dec 13, 2011 at 6:06 AM, John Laban j...@pagerduty.com wrote:

 Ok, great.  I'll be sure to look into the virtualization-specific NTP
 guides.

 Another benefit of using Cassandra over Zookeeper for locking is that you
 don't have to worry about losing your connection to Zookeeper (and with it
 your locks) while hammering away at data in Cassandra.  If you're using
 Cassandra for locks, then losing your locks means you've lost your connection
 to the datastore too.   (We're using long-ish session timeouts + connection listeners in ZK
 to mitigate that now.)

 John



 On Mon, Dec 12, 2011 at 12:55 PM, Dominic Williams 
 dwilli...@fightmymonster.com wrote:

 Hi John,

 On 12 December 2011 19:35, John Laban j...@pagerduty.com wrote:

 So I responded to your algorithm in another part of this thread (very
 interesting) but this part of the paper caught my attention:

  When client application code releases a lock, that lock must not
 actually be
  released for a period equal to one millisecond plus twice the maximum
 possible
  drift of the clocks in the client computers accessing the Cassandra
 databases

 I've been worried about this, and added some arbitrary delay in the
 releasing of my locks.  But I don't like it as it's (A) an arbitrary value
 and (B) it will - perhaps greatly - reduce the throughput of the more
 high-contention areas of my system.

 To fix (B) I'll probably just have to try to get rid of locks all
 together in these high-contention areas.

 To fix (A), I'd need to know what the maximum possible drift of my
 clocks will be.  How did you determine this?  What value do you use, out of
 curiosity?  What does the network layout of your client machines look like?
  (Are any of your hosts geographically separated or all running in the same
 DC?  What's the maximum latency between hosts?  etc?)  Do you monitor the
 clock skew on an ongoing basis?  Am I worrying too much?


  If you set up NTP carefully, no machine should drift more than 4ms, say. I
 forget where, but you'll find the best documentation on how to make a
 bullet-proof NTP setup on vendor sites for virtualization software (because
 virtualization software can cause drift so NTP setup has to be just so)

 What this means is that, for example, to be really safe when a thread
 releases a lock you should wait say 9ms. Some points:-
 -- since the sleep is performed before release, an isolated operation
 should not be delayed at all
  -- only a waiting thread or a thread requesting a lock immediately after it is
 released will be delayed, and no extra CPU or memory load is involved
 -- in practice for the vast majority of application layer data
 operations this restriction will have no effect on overall performance as
 experienced by a user, because such operations nearly always read and write
 to data with limited scope, for example the data of two users involved in
 some transaction
 -- the clocks issue does mean that you can't really serialize access to
 more broadly shared data where more than 5 or 10 such requests are made a
 second, say, but in reality even if the extra 9ms sleep on release wasn't
 necessary, variability in database operation execution time (say under
 load, or when something goes wrong) means trouble might occur serializing
 with that level of contention

 So in summary, although this drift thing seems bad at first, partly
 because it is a new consideration, in practice it's no big deal so long as
 you look after your clocks (and the main issue to watch out for is when
 application nodes running on virtualization software, hypervisors et al
 have setup issues that make their clocks drift under load, and it is a good
 idea to be wary of that)

 Best, Dominic
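
(A minimal sketch of that release rule, for reference.  The DistributedLock trait and the 4ms drift figure are assumptions for illustration, not any real lock client's API:)

trait DistributedLock { def release(): Unit }   // stand-in for whatever lock client is in use

def releaseSafely(lock: DistributedLock, maxClockDriftMs: Long = 4L): Unit = {
  val delayMs = 1L + 2L * maxClockDriftMs   // = 9ms with the 4ms drift assumed above
  Thread.sleep(delayMs)                     // the sleep happens *before* release, so an
  lock.release()                            // uncontended operation is never delayed
}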


 Sorry for all the questions but I'm very concerned about this
 particular problem :)

 Thanks,
 John


 On Mon, Dec 12, 2011 at 4:36 AM, Dominic Williams 
 dwilli...@fightmymonster.com wrote:

 Hi guys, just thought I'd chip in...

 Fight My Monster is still using Cages, which is working fine, but...

 I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are
 2 main reasons:-

 1. Although a fast ZooKeeper cluster can handle a lot of load (we
 aren't getting anywhere near to capacity and we do a *lot*
 of serialisation) at some point it will be necessary to start hashing lock
 paths onto separate ZooKeeper clusters

Re: best practices for simulating transactions in Cassandra

2011-12-12 Thread John Laban
 transactions are added to
 locking system in Pelops/Hector/Pycassa, Cassandra will provide better
 performance than ZooKeeper for storing snapshots, especially as transaction
 size increases

 Best, Dominic

 On 11 December 2011 01:53, Guy Incognito dnd1...@gmail.com wrote:

  you could try writing with the clock of the initial replay entry?

 On 06/12/2011 20:26, John Laban wrote:

 Ah, neat.  It is similar to what was proposed in (4) above with adding
 transactions to Cages, but instead of snapshotting the data to be rolled
 back (the before data), you snapshot the data to be replayed (the after
 data).  And then later, if you find that the transaction didn't complete,
 you just keep replaying the transaction until it takes.

  The part I don't understand with this approach though:  how do you
 ensure that someone else didn't change the data between your initial failed
 transaction and the later replaying of the transaction?  You could get lost
 writes in that situation.

  Dominic (in the Cages blog post) explained a workaround with that for
 his rollback proposal:  all subsequent readers or writers of that data
 would have to check for abandoned transactions and roll them back
 themselves before they could read the data.  I don't think this is possible
 with the XACT_LOG replay approach in these slides though, based on how
 the data is indexed (cassandra node token + timeUUID).


  PS:  How are you liking Cages?




 2011/12/6 Jérémy SEVELLEC jsevel...@gmail.com

 Hi John,

  I had exactly the same reflections.

  I'm using ZooKeeper and Cages to lock and isolate.

  but how to rollback?
 It's impossible so try replay!

  the idea is explained in this presentation
 http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting
 from slide 24)

  - insert your whole data into one column
 - make the job
 - remove (or expire) your column.

  if there is a problem during making the job, you keep the
 possibility to replay and replay and replay (synchronously or in a batch).

  Regards

  Jérémy


 2011/12/5 John Laban j...@pagerduty.com

 Hello,

  I'm building a system using Cassandra as a datastore and I have a
 few places where I am in need of transactions.

  I'm using ZooKeeper to provide locking when I'm in need of some
 concurrency control or isolation, so that solves that half of the puzzle.

  What I need now is to sometimes be able to get atomicity across
 multiple writes by simulating the begin/rollback/commit abilities of a
 relational DB.  In other words, there are places where I need to perform
 multiple updates/inserts, and if I fail partway through, I would ideally 
 be
 able to rollback the partially-applied updates.

  Now, I *know* this isn't possible with Cassandra.  What I'm looking
 for are all the best practices, or at least tips and tricks, so that I can
 get around this limitation in Cassandra and still maintain a consistent
 datastore.  (I am using quorum reads/writes so that eventual consistency
 doesn't kick my ass here as well.)

  Below are some ideas I've been able to dig up.  Please let me know
 if any of them don't make sense, or if there are better approaches:


  1) Updates to a row in a column family are atomic.  So try to model
 your data so that you would only ever need to update a single row in a
 single CF at once.  Essentially, you model your data around transactions.
  This is tricky but can certainly be done in some situations.

  2) If you are only dealing with multiple row *inserts* (and not
 updates), have one of the rows act as a 'commit' by essentially validating
 the presence of the other rows.  For example, say you were performing an
 operation where you wanted to create an Account row and 5 User rows all at
 once (this is an unlikely example, but bear with me).  You could insert 5
 rows into the Users CF, and then the 1 row into the Accounts CF, which 
 acts
 as the commit.  If something went wrong before the Account could be
 created, any Users that had been created so far would be orphaned and
 unusable, as your business logic can ensure that they can't exist without
 an Account.  You could also have an offline cleanup process that swept 
 away
 orphans.

  3) Try to model your updates as idempotent column inserts instead.
  How do you model updates as inserts?  Instead of munging the value
 directly, you could insert a column containing the operation you want to
 perform (like +5).  It would work kind of like the Consistent Vote
 Counting implementation: ( https://gist.github.com/41 ).  How do
 you make the inserts idempotent?  Make sure the column names correspond to
 a request ID or some other identifier that would be identical across
 re-drives of a given (perhaps originally failed) request.  This could 
 leave
 your datastore in a temporarily inconsistent state, but would eventually
 become consistent after a successful re-drive of the original request.

  4) You could take an approach like Dominic Williams proposed with
 Cages:
 http://ria101

Re: best practices for simulating transactions in Cassandra

2011-12-12 Thread John Laban
Hi Dominic,

So I responded to your algorithm in another part of this thread (very
interesting) but this part of the paper caught my attention:

 When client application code releases a lock, that lock must not actually
be
 released for a period equal to one millisecond plus twice the maximum
possible
 drift of the clocks in the client computers accessing the Cassandra
databases

I've been worried about this, and added some arbitrary delay in the
releasing of my locks.  But I don't like it as it's (A) an arbitrary value
and (B) it will - perhaps greatly - reduce the throughput of the more
high-contention areas of my system.

To fix (B) I'll probably just have to try to get rid of locks all together
in these high-contention areas.

To fix (A), I'd need to know what the maximum possible drift of my clocks
will be.  How did you determine this?  What value do you use, out of
curiosity?  What does the network layout of your client machines look like?
 (Are any of your hosts geographically separated or all running in the same
DC?  What's the maximum latency between hosts?  etc?)  Do you monitor the
clock skew on an ongoing basis?  Am I worrying too much?

Sorry for all the questions but I'm very concerned about this particular
problem :)

Thanks,
John


On Mon, Dec 12, 2011 at 4:36 AM, Dominic Williams 
dwilli...@fightmymonster.com wrote:

 Hi guys, just thought I'd chip in...

 Fight My Monster is still using Cages, which is working fine, but...

 I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are 2
 main reasons:-

 1. Although a fast ZooKeeper cluster can handle a lot of load (we aren't
 getting anywhere near to capacity and we do a *lot* of serialisation) at
 some point it will be necessary to start hashing lock paths onto separate
 ZooKeeper clusters, and I tend to believe that these days you should choose
 platforms that handle sharding themselves (e.g. choose Cassandra rather
 than MySQL)

 2. Why have more components in your system when you can have less!!! KISS

 Recently I therefore tried to devise an algorithm which can be used to add
 a distributed locking layer to clients such as Pelops, Hector, Pycassa etc.

 There is a doc describing the algorithm, to which may be added an appendix
 describing a protocol so that locking can be interoperable between the
 clients. That could be extended to describe a protocol for transactions.
 Word of warning this is a *beta* algorithm that has only been seen by a
 select group so far, and therefore not even 100% sure it works but there is
 a useful general discussion regarding serialization of reads/writes so I
 include it anyway (and since this algorithm is going to be out there now,
 if there's anyone out there who fancies doing a Z proof or disproof, that
 would be fantastic).
 http://media.fightmymonster.com/Shared/docs/Wait%20Chain%20Algorithm.pdf

 Final word on this re transactions: if/when transactions are added to
 locking system in Pelops/Hector/Pycassa, Cassandra will provide better
 performance than ZooKeeper for storing snapshots, especially as transaction
 size increases

 Best, Dominic

 On 11 December 2011 01:53, Guy Incognito dnd1...@gmail.com wrote:

  you could try writing with the clock of the initial replay entry?

 On 06/12/2011 20:26, John Laban wrote:

 Ah, neat.  It is similar to what was proposed in (4) above with adding
 transactions to Cages, but instead of snapshotting the data to be rolled
 back (the before data), you snapshot the data to be replayed (the after
 data).  And then later, if you find that the transaction didn't complete,
 you just keep replaying the transaction until it takes.

  The part I don't understand with this approach though:  how do you
 ensure that someone else didn't change the data between your initial failed
 transaction and the later replaying of the transaction?  You could get lost
 writes in that situation.

  Dominic (in the Cages blog post) explained a workaround with that for
 his rollback proposal:  all subsequent readers or writers of that data
 would have to check for abandoned transactions and roll them back
 themselves before they could read the data.  I don't think this is possible
 with the XACT_LOG replay approach in these slides though, based on how
 the data is indexed (cassandra node token + timeUUID).


  PS:  How are you liking Cages?




 2011/12/6 Jérémy SEVELLEC jsevel...@gmail.com

 Hi John,

  I had exactly the same reflections.

  I'm using ZooKeeper and Cages to lock and isolate.

  but how to rollback?
 It's impossible so try replay!

  the idea is explained in this presentation
 http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting
 from slide 24)

  - insert your whole data into one column
 - make the job
 - remove (or expire) your column.

  if there is a problem during making the job, you keep the
 possibility to replay and replay and replay (synchronously or in a batch).

  Regards

  Jérémy


 2011/12/5 John Laban j...@pagerduty.com

 Hello

Re: best practices for simulating transactions in Cassandra

2011-12-12 Thread John Laban
  be fairly simple to use a TTL again to
 make locks auto expire after N seconds; this would make it more like Google
 Chubby.

 It also allows for bad clients to game the system, but that's not
 something that could be dealt with using authorization APIs.

 For legacy reasons the linked code uses super columns but a regular
 column family will work just fine.

 -Jake
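
For reference, a rough sketch of that TTL idea (Scala with Hector, untested): write the lock-claim column with a TTL so an abandoned lock expires on its own.  The "Locks" CF, the "owner" column name and the 30-second default are invented for illustration.

import me.prettyprint.hector.api.Keyspace
import me.prettyprint.hector.api.factory.HFactory
import me.prettyprint.cassandra.serializers.StringSerializer

def claimLock(ks: Keyspace, lockName: String, clientId: String, ttlSeconds: Int = 30): Unit = {
  val ss = StringSerializer.get
  val claim = HFactory.createColumn("owner", clientId, ss, ss)
  claim.setTtl(ttlSeconds)   // Cassandra drops the column after the TTL, releasing an abandoned lock
  HFactory.createMutator(ks, ss).insert(lockName, "Locks", claim)
}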


 On Mon, Dec 12, 2011 at 7:36 AM, Dominic Williams 
 dwilli...@fightmymonster.com wrote:

 Hi guys, just thought I'd chip in...

 Fight My Monster is still using Cages, which is working fine, but...

 I'm looking at using Cassandra to replace Cages/ZooKeeper(!) There are
 2 main reasons:-

 1. Although a fast ZooKeeper cluster can handle a lot of load (we
 aren't getting anywhere near to capacity and we do a *lot*
 of serialisation) at some point it will be necessary to start hashing lock
 paths onto separate ZooKeeper clusters, and I tend to believe that these
 days you should choose platforms that handle sharding themselves (e.g.
 choose Cassandra rather than MySQL)

 2. Why have more components in your system when you can have less!!!
 KISS

 Recently I therefore tried to devise an algorithm which can be used to
 add a distributed locking layer to clients such as Pelops, Hector, Pycassa
 etc.

 There is a doc describing the algorithm, to which may be added an
 appendix describing a protocol so that locking can be interoperable between
 the clients. That could be extended to describe a protocol for
 transactions. Word of warning this is a *beta* algorithm that has only been
 seen by a select group so far, and therefore not even 100% sure it works
 but there is a useful general discussion regarding serialization of
 reads/writes so I include it anyway (and since this algorithm is going to
 be out there now, if there's anyone out there who fancies doing a Z proof
 or disproof, that would be fantastic).
 http://media.fightmymonster.com/Shared/docs/Wait%20Chain%20Algorithm.pdf

 Final word on this re transactions: if/when transactions are added to
 locking system in Pelops/Hector/Pycassa, Cassandra will provide better
 performance than ZooKeeper for storing snapshots, especially as transaction
 size increases

 Best, Dominic

 On 11 December 2011 01:53, Guy Incognito dnd1...@gmail.com wrote:

  you could try writing with the clock of the initial replay entry?

 On 06/12/2011 20:26, John Laban wrote:

 Ah, neat.  It is similar to what was proposed in (4) above with adding
 transactions to Cages, but instead of snapshotting the data to be rolled
 back (the before data), you snapshot the data to be replayed (the 
 after
 data).  And then later, if you find that the transaction didn't complete,
 you just keep replaying the transaction until it takes.

  The part I don't understand with this approach though:  how do you
 ensure that someone else didn't change the data between your initial 
 failed
 transaction and the later replaying of the transaction?  You could get 
 lost
 writes in that situation.

  Dominic (in the Cages blog post) explained a workaround with that
 for his rollback proposal:  all subsequent readers or writers of that data
 would have to check for abandoned transactions and roll them back
 themselves before they could read the data.  I don't think this is 
 possible
 with the XACT_LOG replay approach in these slides though, based on how
 the data is indexed (cassandra node token + timeUUID).


  PS:  How are you liking Cages?




 2011/12/6 Jérémy SEVELLEC jsevel...@gmail.com

 Hi John,

  I had exactly the same reflections.

  I'm using ZooKeeper and Cages to lock and isolate.

  but how to rollback?
 It's impossible so try replay!

  the idea is explained in this presentation
 http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting
 from slide 24)

  - insert your whole data into one column
 - make the job
 - remove (or expire) your column.

  if there is a problem during making the job, you keep the
 possibility to replay and replay and replay (synchronously or in a 
 batch).

  Regards

  Jérémy


 2011/12/5 John Laban j...@pagerduty.com

 Hello,

  I'm building a system using Cassandra as a datastore and I have a
  few places where I am in need of transactions.

  I'm using ZooKeeper to provide locking when I'm in need of some
 concurrency control or isolation, so that solves that half of the 
 puzzle.

  What I need now is to sometimes be able to get atomicity across
 multiple writes by simulating the begin/rollback/commit abilities of a
 relational DB.  In other words, there are places where I need to perform
 multiple updates/inserts, and if I fail partway through, I would 
 ideally be
 able to rollback the partially-applied updates.

  Now, I *know* this isn't possible with Cassandra.  What I'm
 looking for are all the best practices, or at least tips and tricks, so
 that I can get around this limitation in Cassandra and still maintain a
 consistent datastore.  (I am using quorum reads/writes so that eventual

Re: best practices for simulating transactions in Cassandra

2011-12-06 Thread John Laban
Ah, neat.  It is similar to what was proposed in (4) above with adding
transactions to Cages, but instead of snapshotting the data to be rolled
back (the "before" data), you snapshot the data to be replayed (the "after"
data).  And then later, if you find that the transaction didn't complete,
you just keep replaying the transaction until it takes.

The part I don't understand with this approach though:  how do you ensure
that someone else didn't change the data between your initial failed
transaction and the later replaying of the transaction?  You could get lost
writes in that situation.

Dominic (in the Cages blog post) explained a workaround with that for his
rollback proposal:  all subsequent readers or writers of that data would
have to check for abandoned transactions and roll them back themselves
before they could read the data.  I don't think this is possible with the
XACT_LOG replay approach in these slides though, based on how the data is
indexed (cassandra node token + timeUUID).


PS:  How are you liking Cages?




2011/12/6 Jérémy SEVELLEC jsevel...@gmail.com

 Hi John,

 I had exactly the same reflections.

 I'm using ZooKeeper and Cages to lock and isolate.

 but how to rollback?
 It's impossible so try replay!

 the idea is explained in this presentation
 http://www.slideshare.net/mattdennis/cassandra-data-modeling (starting
 from slide 24)

 - insert your whole data into one column
 - make the job
 - remove (or expire) your column.

 if there is a problem during making the job, you keep the possibility to
 replay and replay and replay (synchronously or in a batch).

 Regards

 Jérémy


 2011/12/5 John Laban j...@pagerduty.com

 Hello,

 I'm building a system using Cassandra as a datastore and I have a few
  places where I am in need of transactions.

 I'm using ZooKeeper to provide locking when I'm in need of some
 concurrency control or isolation, so that solves that half of the puzzle.

 What I need now is to sometimes be able to get atomicity across multiple
 writes by simulating the begin/rollback/commit abilities of a relational
 DB.  In other words, there are places where I need to perform multiple
 updates/inserts, and if I fail partway through, I would ideally be able to
 rollback the partially-applied updates.

 Now, I *know* this isn't possible with Cassandra.  What I'm looking for
 are all the best practices, or at least tips and tricks, so that I can get
 around this limitation in Cassandra and still maintain a consistent
 datastore.  (I am using quorum reads/writes so that eventual consistency
 doesn't kick my ass here as well.)

 Below are some ideas I've been able to dig up.  Please let me know if any
 of them don't make sense, or if there are better approaches:


 1) Updates to a row in a column family are atomic.  So try to model your
 data so that you would only ever need to update a single row in a single CF
 at once.  Essentially, you model your data around transactions.  This is
 tricky but can certainly be done in some situations.

 2) If you are only dealing with multiple row *inserts* (and not updates),
 have one of the rows act as a 'commit' by essentially validating the
 presence of the other rows.  For example, say you were performing an
 operation where you wanted to create an Account row and 5 User rows all at
 once (this is an unlikely example, but bear with me).  You could insert 5
 rows into the Users CF, and then the 1 row into the Accounts CF, which acts
 as the commit.  If something went wrong before the Account could be
 created, any Users that had been created so far would be orphaned and
 unusable, as your business logic can ensure that they can't exist without
 an Account.  You could also have an offline cleanup process that swept away
 orphans.

 3) Try to model your updates as idempotent column inserts instead.  How
 do you model updates as inserts?  Instead of munging the value directly,
 you could insert a column containing the operation you want to perform
 (like +5).  It would work kind of like the Consistent Vote Counting
 implementation: ( https://gist.github.com/41 ).  How do you make the
 inserts idempotent?  Make sure the column names correspond to a request ID
 or some other identifier that would be identical across re-drives of a
 given (perhaps originally failed) request.  This could leave your datastore
 in a temporarily inconsistent state, but would eventually become consistent
 after a successful re-drive of the original request.

 4) You could take an approach like Dominic Williams proposed with Cages:
 http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/
The gist is that you snapshot all the original values that you're about
 to munge somewhere else (in his case, ZooKeeper), make your updates, and
 then delete the snapshot (and that delete needs to be atomic).  If the
 snapshot data was never deleted, then subsequent accessors (even readers)
 of the data rows need to do the rollback

best practices for simulating transactions in Cassandra

2011-12-05 Thread John Laban
Hello,

I'm building a system using Cassandra as a datastore and I have a few
places where I am in need of transactions.

I'm using ZooKeeper to provide locking when I'm in need of some concurrency
control or isolation, so that solves that half of the puzzle.

What I need now is to sometimes be able to get atomicity across multiple
writes by simulating the begin/rollback/commit abilities of a relational
DB.  In other words, there are places where I need to perform multiple
updates/inserts, and if I fail partway through, I would ideally be able to
rollback the partially-applied updates.

Now, I *know* this isn't possible with Cassandra.  What I'm looking for are
all the best practices, or at least tips and tricks, so that I can get
around this limitation in Cassandra and still maintain a consistent
datastore.  (I am using quorum reads/writes so that eventual consistency
doesn't kick my ass here as well.)

Below are some ideas I've been able to dig up.  Please let me know if any
of them don't make sense, or if there are better approaches:


1) Updates to a row in a column family are atomic.  So try to model your
data so that you would only ever need to update a single row in a single CF
at once.  Essentially, you model your data around transactions.  This is
tricky but can certainly be done in some situations.

2) If you are only dealing with multiple row *inserts* (and not updates),
have one of the rows act as a 'commit' by essentially validating the
presence of the other rows.  For example, say you were performing an
operation where you wanted to create an Account row and 5 User rows all at
once (this is an unlikely example, but bear with me).  You could insert 5
rows into the Users CF, and then the 1 row into the Accounts CF, which acts
as the commit.  If something went wrong before the Account could be
created, any Users that had been created so far would be orphaned and
unusable, as your business logic can ensure that they can't exist without
an Account.  You could also have an offline cleanup process that swept away
orphans.

3) Try to model your updates as idempotent column inserts instead.  How do
you model updates as inserts?  Instead of munging the value directly, you
could insert a column containing the operation you want to perform (like
+5).  It would work kind of like the Consistent Vote Counting
implementation: ( https://gist.github.com/41 ).  How do you make the
inserts idempotent?  Make sure the column names correspond to a request ID
or some other identifier that would be identical across re-drives of a
given (perhaps originally failed) request.  This could leave your datastore
in a temporarily inconsistent state, but would eventually become consistent
after a successful re-drive of the original request.  (A rough sketch of this
approach follows this list.)

4) You could take an approach like Dominic Williams proposed with Cages:
http://ria101.wordpress.com/2010/05/12/locking-and-transactions-over-cassandra-using-cages/
  The gist is that you snapshot all the original values that you're about
to munge somewhere else (in his case, ZooKeeper), make your updates, and
then delete the snapshot (and that delete needs to be atomic).  If the
snapshot data was never deleted, then subsequent accessors (even readers)
of the data rows need to do the rollback of the previous transaction
themselves before they can read/write this data.  They do the rollback by
just overwriting the current values with what is in the snapshot.  It
offloads the work of the rollback to the next worker that accesses the
data.  This approach probably needs a generic/high-level programming layer
to handle all of the details and complexity, and it doesn't seem like it
was ever added to Cages.
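
To make approach (3) above concrete, here is a rough sketch (Scala with Hector, untested; the "AccountOps" CF and the column layout are assumptions): the operation is written as a column whose name is a caller-supplied request ID, so re-driving the same request rewrites the identical column instead of double-applying the change.

import me.prettyprint.hector.api.Keyspace
import me.prettyprint.hector.api.factory.HFactory
import me.prettyprint.cassandra.serializers.StringSerializer

def recordAdjustment(ks: Keyspace, accountId: String, requestId: String, delta: Long): Unit = {
  val ss = StringSerializer.get
  // column name = request ID, value = the operation ("+5" style); a re-drive of
  // the same request writes the same column again, so the folded result is unchanged
  val op = HFactory.createColumn(requestId, "+" + delta.toString, ss, ss)
  HFactory.createMutator(ks, ss).insert(accountId, "AccountOps", op)
}

A reader then folds all of the operation columns for the row (summing the deltas) to get the current value, as in the consistent-vote-counting pattern linked above.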


Are there other approaches or best practices that I missed?  I would be
very interested in hearing any opinions from those who have tackled these
problems before.

Thanks!
John