Hi, regarding the retry strategy, I understand that it might make sense, assuming that the client can actually perform a retry.
We are trying to build a fault-tolerance solution based on Cassandra. In some
scenarios, the client machine can go down during a transaction. Would it be bad
design to store all the data that needs to be consistent under one big key? In
that case the batch_mutate operations will not be big, since only a small part
is updated/added at a time, but at least we know that the operation either
succeeded or failed.

We basically have:

CF: usernames (similar to the Twitter model)
SCF: User_tree (it has all the information related to the user)

Thanks

On Mon, Jul 19, 2010 at 9:40 PM, Alex Yiu <bigcontentf...@gmail.com> wrote:
>
> Hi, Stuart,
>
> If I may paraphrase what Jonathan said: typically, your batch_mutate
> operation is idempotent. That is, you can replay/retry the same operation
> within a short timeframe without any undesirable side effect.
>
> The assumption behind the "short timeframe" here is that there is no other
> concurrent writer trying to write anything conflicting in an interleaving
> fashion. Imagine that there was another writer trying to write:
>
>> "some-uuid-1": {
>>     "path": "/foo/bar",
>>     "size": 100000
>> },
>
> ...
>
>> {
>> "/foo/bar": {
>>     "uuid": "some-uuid-1"
>> },
>
> Then there is a chance that the 4 write operations (two writes for "/a/b/c"
> into 2 CFs and two writes for "/foo/bar" into the same 2 CFs) would
> interleave with each other and create an undesirable result. I guess that
> is not a likely situation in your case.
>
> Hopefully, my email helps. See also:
> http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic
>
> Regards,
> Alex Yiu
>
>
> On Fri, Jul 9, 2010 at 11:50 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> typically you will update both as part of a batch_mutate, and if it
>> fails, retry the operation. re-writing any part that succeeded will
>> be harmless.
>>
>> On Thu, Jul 8, 2010 at 11:13 AM, Stuart Langridge
>> <stuart.langri...@canonical.com> wrote:
>> > Hi, Cassandra people!
>> >
>> > We're looking at Cassandra as a possible replacement for some parts of
>> > our database structures, and on an early look I'm a bit confused about
>> > atomicity guarantees and rollbacks and such, so I wanted to ask what
>> > standard practice is for dealing with the sorts of situations I outline
>> > below.
>> >
>> > Imagine that we're storing information about files. Each file has a path
>> > and a uuid, and sometimes we need to look up stuff about a file by its
>> > path and sometimes by its uuid. The best way to do this, as I understand
>> > it, is to store the data in Cassandra twice: once indexed by uuid and
>> > once by path. So, I have two ColumnFamilies, one indexed by uuid:
>> >
>> > {
>> >     "some-uuid-1": {
>> >         "path": "/a/b/c",
>> >         "size": 100000
>> >     },
>> >     "some-uuid-2": {
>> >         ...
>> >     },
>> >     ...
>> > }
>> >
>> > and one indexed by path:
>> >
>> > {
>> >     "/a/b/c": {
>> >         "uuid": "some-uuid-1",
>> >         "size": 100000
>> >     },
>> >     "/d/e/f": {
>> >         ...
>> >     },
>> >     ...
>> > }
>> >
>> > So, first, do please correct me if I've misunderstood the terminology
>> > here (I've shown a "short form" of ColumnFamily here, as per
>> > http://arin.me/blog/wtf-is-a-supercolumn-cassandra-data-model).
>> >
>> > The thing I don't quite get is: what happens when I want to add a new
>> > file? I need to add it to both these ColumnFamilies, but there's no
>> > "add it to both" atomic operation. What's the way that people handle the
>> > situation where I add to the first CF and then my program crashes, so I
>> > never added to the second? (Assume that there is lots more data than
>> > I've outlined above, so "put it all in one SuperColumnFamily, because
>> > that can be updated atomically" won't work, because it would end up with
>> > our entire database in one SCF.) Should we add to one, and then, if we
>> > fail to add to the other for some reason, continually retry until it
>> > works?
>> > Or have a "garbage collection" procedure which finds discrepancies
>> > between indexes like this, fixes them up, and runs from cron? We'd love
>> > to hear some advice on how to do this, or whether we're modelling the
>> > data in the wrong way and there's a better way which avoids these
>> > problems!
>> >
>> > sil
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of Riptano, the source for professional Cassandra support
>> http://riptano.com

--
Patricio.-
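The two repair strategies discussed in this thread (retry the idempotent dual write, and a cron-style sweep that fixes discrepancies between the two indexes) can be sketched in a few lines of Python. This is only an illustration, not real client code: plain dicts stand in for the two ColumnFamilies, and `write_file_record` / `repair_indexes` are hypothetical helper names, not part of any Cassandra API.

```python
files_by_uuid = {}   # stands in for the uuid-indexed CF
files_by_path = {}   # stands in for the path-indexed CF


def write_file_record(uuid, path, size, max_retries=3):
    """Write the record to both 'CFs'. Each write is last-write-wins on
    the same key/columns, so replaying a partially applied operation is
    harmless -- the idempotency Jonathan and Alex describe above."""
    for _ in range(max_retries):
        try:
            files_by_uuid[uuid] = {"path": path, "size": size}
            files_by_path[path] = {"uuid": uuid, "size": size}
            return True
        except Exception:
            # With a real client, batch_mutate could raise on a timeout;
            # re-running the whole loop body is safe.
            continue
    return False


def repair_indexes():
    """Cron-style 'garbage collection' sweep: find entries present in one
    index but missing from the other, and fill in the missing side.
    Returns the number of entries repaired."""
    repaired = 0
    for uuid, cols in files_by_uuid.items():
        if cols["path"] not in files_by_path:
            files_by_path[cols["path"]] = {"uuid": uuid, "size": cols["size"]}
            repaired += 1
    for path, cols in files_by_path.items():
        if cols["uuid"] not in files_by_uuid:
            files_by_uuid[cols["uuid"]] = {"path": path, "size": cols["size"]}
            repaired += 1
    return repaired
```

In a real deployment the repair sweep would page over the CFs with range queries rather than hold everything in memory, but the shape of the logic is the same: the writer retries the idempotent batch, and the sweep catches anything a crashed client left half-written.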