cli composite type literal with empty string component

2012-02-08 Thread Bryce Allen
I have a CF defined like this in CLI syntax:

create column family Test
with key_validation_class = UTF8Type
and comparator = 'CompositeType(AsciiType, UTF8Type)'
and default_validation_class = UTF8Type
and column_metadata = [
{ column_name : 'deleted:',
  validation_class : BooleanType },
{ column_name : 'version:',
  validation_class : LongType },
];

I expected these columns to map to ('deleted', '') and ('version', '')
in pycassa, but this is not the case:

  TEST.insert('r1', {('deleted', ''): False, ('version', ''): 1,
              ('a', 'b'): 'c'})
  AttributeError: 'int' object has no attribute 'encode'

  TEST.column_validators
  {'\x00\x07deleted\x00': 'BooleanType',
   '\x00\x07version\x00': 'LongType'}
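
As far as I can tell those packed names follow Cassandra's composite
encoding: each component is a 2-byte big-endian length, the raw bytes,
and an end-of-component byte. A rough sketch that reproduces them (my
own illustration, not pycassa code):

  import struct

  def pack_composite(components):
      # each component: 2-byte big-endian length, the bytes, then an
      # end-of-component byte (0 for a normal column name)
      packed = ''
      for c in components:
          packed += struct.pack('>H', len(c)) + c + '\x00'
      return packed

  pack_composite(('deleted',))     # -> '\x00\x07deleted\x00'
  pack_composite(('deleted', ''))  # -> '\x00\x07deleted\x00\x00\x00\x00'

So the CLI schema above produced single-component names, while a
two-tuple in pycassa produces two-component names.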

The obvious workaround is to use pycassa to define the schema:

  SYSTEM_MANAGER.create_column_family('test', 'Test2',
      key_validation_class=UTF8_TYPE,
      comparator_type=CompositeType(ASCII_TYPE, UTF8_TYPE),
      default_validation_class=UTF8_TYPE,
      column_validation_classes={
          ('version', ''): LONG_TYPE,
          ('deleted', ''): BOOLEAN_TYPE})

and this really does produce a different schema:
  
  TEST2.column_validators
  {'\x00\x07version\x00\x00\x00\x00': 'LongType',
   '\x00\x07deleted\x00\x00\x00\x00': 'BooleanType'}

To mimic what the CLI does, I leave off the last component instead of
using an empty string:

  SYSTEM_MANAGER.create_column_family('test', 'Test3',
      key_validation_class=UTF8_TYPE,
      comparator_type=CompositeType(ASCII_TYPE, UTF8_TYPE),
      default_validation_class=UTF8_TYPE,
      column_validation_classes={
          ('version',): LONG_TYPE,
          ('deleted',): BOOLEAN_TYPE})

  TEST3.column_validators
  {'\x00\x07deleted\x00': 'BooleanType',
   '\x00\x07version\x00': 'LongType'}

But I see no way to address these columns from pycassa. I have a
workaround, but I find the inconsistency perplexing, and would rather
not have to do the busywork to convert my schema syntax. Is there a way
to address columns with an empty string component in the CLI?

Thanks,
Bryce




Re: cli composite type literal with empty string component

2012-02-08 Thread Bryce Allen
Never mind; the issue with addressing composite column names with empty
components was fixed in the latest pycassa, which is why I was even
able to create them in the Test3 schema below. I get an error in 1.2.1,
which I was running before, but it all seems to work in 1.4.0.
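
For anyone curious, something like this now works for me in 1.4.0 (a
sketch against the Test3 schema quoted below; both lines raise an
error for me under 1.2.1):

  TEST3.insert('r1', {('deleted',): False, ('version',): 1})
  TEST3.get('r1', columns=[('deleted',), ('version',)])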

-Bryce

On Wed, 8 Feb 2012 10:25:07 -0600
Bryce Allen bal...@ci.uchicago.edu wrote:
 I have a CF defined like this in CLI syntax:
 
 create column family Test
 with key_validation_class = UTF8Type
 and comparator = 'CompositeType(AsciiType, UTF8Type)'
 and default_validation_class = UTF8Type
 and column_metadata = [
 { column_name : 'deleted:',
   validation_class : BooleanType },
 { column_name : 'version:',
   validation_class : LongType },
 ];
 
 I expected these columns to map to ('deleted', '') and ('version', '')
 in pycassa, but this is not the case:
 
   TEST.insert('r1', {('deleted', ''): False, ('version', ''): 1,
               ('a', 'b'): 'c'})
   AttributeError: 'int' object has no attribute 'encode'
 
   TEST.column_validators
   {'\x00\x07deleted\x00': 'BooleanType',
'\x00\x07version\x00': 'LongType'}
 
 The obvious workaround is to use pycassa to define the schema:
 
   SYSTEM_MANAGER.create_column_family('test', 'Test2',
       key_validation_class=UTF8_TYPE,
       comparator_type=CompositeType(ASCII_TYPE, UTF8_TYPE),
       default_validation_class=UTF8_TYPE,
       column_validation_classes={
           ('version', ''): LONG_TYPE,
           ('deleted', ''): BOOLEAN_TYPE})
 
 and this really does produce a different schema:
   
   TEST2.column_validators
   {'\x00\x07version\x00\x00\x00\x00': 'LongType',
'\x00\x07deleted\x00\x00\x00\x00': 'BooleanType'}
 
 To mimic what the CLI does, I leave off the last component instead of
 using an empty string:
 
   SYSTEM_MANAGER.create_column_family('test', 'Test3',
       key_validation_class=UTF8_TYPE,
       comparator_type=CompositeType(ASCII_TYPE, UTF8_TYPE),
       default_validation_class=UTF8_TYPE,
       column_validation_classes={
           ('version',): LONG_TYPE,
           ('deleted',): BOOLEAN_TYPE})
 
   TEST3.column_validators
   {'\x00\x07deleted\x00': 'BooleanType',
'\x00\x07version\x00': 'LongType'}
 
 But I see no way to address these columns from pycassa. I have a
 workaround, but I find the inconsistency perplexing, and would rather
 not have to do the busywork to convert my schema syntax. Is there a
 way to address columns with an empty string component in the CLI?
 
 Thanks,
 Bryce




Re: cli composite type literal with empty string component

2012-02-08 Thread Bryce Allen
In case anyone else is curious about what is going on here:
https://github.com/pycassa/pycassa/issues/112

The links to the Cassandra JIRA are instructive.

-Bryce

On Wed, 8 Feb 2012 10:59:37 -0600
Bryce Allen bal...@ci.uchicago.edu wrote:
 Never mind; the issue with addressing composite column names with
 empty components was fixed in the latest pycassa, which is why I was
 even able to create them in the Test3 schema below. I get an error in
 1.2.1, which I was running before, but it all seems to work in 1.4.0.
 
 -Bryce
 
 On Wed, 8 Feb 2012 10:25:07 -0600
 Bryce Allen bal...@ci.uchicago.edu wrote:
  I have a CF defined like this in CLI syntax:
  
  create column family Test
  with key_validation_class = UTF8Type
  and comparator = 'CompositeType(AsciiType, UTF8Type)'
  and default_validation_class = UTF8Type
  and column_metadata = [
  { column_name : 'deleted:',
validation_class : BooleanType },
  { column_name : 'version:',
validation_class : LongType },
  ];
  
  I expected these columns to map to ('deleted', '') and ('version',
  '') in pycassa, but this is not the case:
  
    TEST.insert('r1', {('deleted', ''): False, ('version', ''): 1,
                ('a', 'b'): 'c'})
    AttributeError: 'int' object has no attribute 'encode'
  
TEST.column_validators
{'\x00\x07deleted\x00': 'BooleanType',
 '\x00\x07version\x00': 'LongType'}
  
  The obvious workaround is to use pycassa to define the schema:
  
    SYSTEM_MANAGER.create_column_family('test', 'Test2',
        key_validation_class=UTF8_TYPE,
        comparator_type=CompositeType(ASCII_TYPE, UTF8_TYPE),
        default_validation_class=UTF8_TYPE,
        column_validation_classes={
            ('version', ''): LONG_TYPE,
            ('deleted', ''): BOOLEAN_TYPE})
  
  and this really does produce a different schema:

TEST2.column_validators
{'\x00\x07version\x00\x00\x00\x00': 'LongType',
 '\x00\x07deleted\x00\x00\x00\x00': 'BooleanType'}
  
  To mimic what the CLI does, I leave off the last component instead of
  using an empty string:
  
    SYSTEM_MANAGER.create_column_family('test', 'Test3',
        key_validation_class=UTF8_TYPE,
        comparator_type=CompositeType(ASCII_TYPE, UTF8_TYPE),
        default_validation_class=UTF8_TYPE,
        column_validation_classes={
            ('version',): LONG_TYPE,
            ('deleted',): BOOLEAN_TYPE})
  
TEST3.column_validators
{'\x00\x07deleted\x00': 'BooleanType',
 '\x00\x07version\x00': 'LongType'}
  
  But I see no way to address these columns from pycassa. I have a
  workaround, but I find the inconsistency perplexing, and would
  rather not have to do the busywork to convert my schema syntax. Is
  there a way to address columns with an empty string component in
  the CLI?
  
  Thanks,
  Bryce




Re: two dimensional slicing

2012-01-30 Thread Bryce Allen
 to do the index lookup). It's definitely not much more
complicated when using RP; I was caught up in some nuances of our old
model when I wrote the last email.

-Bryce



 
 Could you re-write the entire list every version update?
 
 CF: VersionedList
 row: list_name:version
 col_name: name
 col_value: last updated version
 
 So you slice one row at the upper version and discard all the columns
 where the value is less than the lower version?
 
 Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 27/01/2012, at 5:31 AM, Bryce Allen wrote:
 
  Thanks, comments inline:
  
  On Mon, 23 Jan 2012 20:59:34 +1300
  aaron morton aa...@thelastpickle.com wrote:
  It depends a bit on the data and the query patterns. 
  
  * How many versions do you have ? 
  We may have 10k versions in some cases, with up to a million names
   total in any given version, but more often around 10K. To manage this we
  are currently using two CFs, one for storing compacted complete
  lists and one for storing deltas on the compacted list. Based on
  usage, we will create a new compacted list and start writing deltas
  against that. We should be able to limit the number of deltas in a
  single row to below 100; I'd like to be able to keep it lower but
  I'm not sure we can maintain that under all load scenarios. The
  compacted lists are straightforward, but there are many ways to
  structure the deltas and they all have trade offs. A CF with
  composite columns that supported two dimensional slicing would be
  perfect.
  
  * How many names in each version ?
  We plan on limiting to a total of 1 million names, and around
  10,000 per version (by limiting the batch size), but many deltas
   will have fewer than 10 names.
  
   * When querying do you know the version numbers you want to query
   from? How many are there normally?
  Currently we don't know the version numbers in advance - they are
  timestamps, and we are querying for versions less than or equal to
  the desired timestamp. We have talked about using vector clock
  versions and maintaining an index mapping time to version numbers,
  in which case we would know the exact versions after the index
  lookup, at the expense of another RTT on every operation.
  
  * How frequent are the updates and the reads ?
  We expect reads to be more frequent than writes. Unfortunately we
  don't have solid numbers on what to expect, but I would guess 20x.
  Update operations will involve several reads to determine where to
  write.
  
  
  I would lean towards using two standard CF's, one to list all the
  version numbers (in a single row probably) and one to hold the
  names in a particular version. 
  
  To do your query slice the first CF and then run multi gets to the
  second. 
  
   That's probably not the best solution; if you can add some more info
  it may get better.
  I'm actually leaning back toward BOP, as I run into more issues
  and complexity with the RP models. I'd really like to implement both
  and compare them, but at this point I need to focus on one to get
  things working, so I'm trying to make a best initial guess.
  
  
  
  On 21/01/2012, at 6:20 AM, Bryce Allen wrote:
  
  I'm storing very large versioned lists of names, and I'd like to
  query a range of names within a given range of versions, which is
  a two dimensional slice, in a single query. This is easy to do
  using ByteOrderedPartitioner, but seems to require multiple (non
  parallel) queries and extra CFs when using RandomPartitioner.
  
  I see two approaches when using RP:
  
  1) Data is stored in a super column family, with one dimension
  being the super column names and the other the sub column names.
  Since slicing on sub columns requires a list of super column
  names, a second standard CF is needed to get a range of names
  before doing a query on the main super CF. With CASSANDRA-2710,
  the same is possible using a standard CF with composite types
  instead of a super CF.
  
  2) If one of the dimensions is small, a two dimensional slice
  isn't required. The data can be stored in a standard CF with
  linear ordering on a composite type (large_dimension,
  small_dimension). Data is queried based on the large dimension,
  and the client throws out the extra data in the other dimension.
  
  Neither of the above solutions are ideal. Does anyone else have a
  use case where two dimensional slicing is useful? Given the
  disadvantages of BOP, is it practical to make the composite column
  query model richer to support this sort of use case?
  
  Thanks,
  Bryce
  
 




Re: two dimensional slicing

2012-01-30 Thread Bryce Allen
On Mon, 30 Jan 2012 11:14:37 -0600
Bryce Allen bal...@ci.uchicago.edu wrote:
 With RP, the idea is to query many versions in ListVersionIndex
 starting at the desired version going backward, hoping that it will
 hit a compact version. We could also maintain a separate
 CompactVersion index, and accept another query.
Actually a better way to handle this is to store the latest compacted
version with each delta version in the index. When doing compaction, all
the deltas between it and the next compaction (or end) are updated to
point at the new compaction. E.g.:

ts0:  20;20 - compacted version
ts1:  21;20
ts2:  22;20
...
ts9:  29;20
ts10: 30;20
ts11: 31;20

compaction is done on version 30:

...
ts9:  29;20
ts10: 30;30 - new compacted version
ts11: 31;30

Perhaps compaction is a bad term because it already has meaning in
Cassandra, but I can't think of a better name at the moment.
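
In pseudo-Python the repointing looks something like this (a plain
dict stands in for the index row; the names are made up):

  def compact(index, new_compacted):
      # index maps ts -> (version, latest_compacted_version)
      for ts, (version, compacted) in index.items():
          # repoint deltas at or after the new compaction that still
          # point at an older one (i.e. up to the next compaction)
          if version >= new_compacted and compacted < new_compacted:
              index[ts] = (version, new_compacted)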

-Bryce




Re: two dimensional slicing

2012-01-26 Thread Bryce Allen
Thanks, comments inline:

On Mon, 23 Jan 2012 20:59:34 +1300
aaron morton aa...@thelastpickle.com wrote:
 It depends a bit on the data and the query patterns. 
 
 * How many versions do you have ? 
We may have 10k versions in some cases, with up to a million names
total in any given version, but more often around 10K. To manage this we are
currently using two CFs, one for storing compacted complete lists and
one for storing deltas on the compacted list. Based on usage, we will
create a new compacted list and start writing deltas against that. We
should be able to limit the number of deltas in a single row to below
100; I'd like to be able to keep it lower but I'm not sure we can
maintain that under all load scenarios. The compacted lists are
straightforward, but there are many ways to structure the deltas and
they all have trade offs. A CF with composite columns that supported
two dimensional slicing would be perfect.
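
Roughly, the read path for the two-CF model looks like this (a sketch
with made-up names and helpers, not our actual code):

  def read_list(compacted_cf, deltas_cf, list_name, version):
      # find the newest compacted list at or before the target version
      base_version, names = latest_compaction(compacted_cf, list_name,
                                              version)  # made-up helper
      # slice the delta row forward from that compaction to the version
      deltas = deltas_cf.get('%s:%d' % (list_name, base_version),
                             column_start=base_version,
                             column_finish=version)
      for _, delta in sorted(deltas.items()):
          names = apply_delta(names, delta)  # made-up helper
      return names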

 * How many names in each version ?
We plan on limiting to a total of 1 million names, and around 10,000 per
version (by limiting the batch size), but many deltas will have fewer
than 10 names.

 * When querying do you know the version numbers you want to query
 from? How many are there normally?
Currently we don't know the version numbers in advance - they are
timestamps, and we are querying for versions less than or equal to the
desired timestamp. We have talked about using vector clock versions and
maintaining an index mapping time to version numbers, in which case we
would know the exact versions after the index lookup, at the expense of
another RTT on every operation.

 * How frequent are the updates and the reads ?
We expect reads to be more frequent than writes. Unfortunately we don't
have solid numbers on what to expect, but I would guess 20x. Update
operations will involve several reads to determine where to write.


 I would lean towards using two standard CF's, one to list all the
 version numbers (in a single row probably) and one to hold the names
 in a particular version. 
 
 To do your query slice the first CF and then run multi gets to the
 second. 
 
 That's probably not the best solution; if you can add some more info
 it may get better.
I'm actually leaning back toward BOP, as I run into more issues
and complexity with the RP models. I'd really like to implement both
and compare them, but at this point I need to focus on one to get
things working, so I'm trying to make a best initial guess.


 
 On 21/01/2012, at 6:20 AM, Bryce Allen wrote:
 
  I'm storing very large versioned lists of names, and I'd like to
  query a range of names within a given range of versions, which is a
  two dimensional slice, in a single query. This is easy to do using
  ByteOrderedPartitioner, but seems to require multiple (non parallel)
  queries and extra CFs when using RandomPartitioner.
  
  I see two approaches when using RP:
  
  1) Data is stored in a super column family, with one dimension being
  the super column names and the other the sub column names. Since
  slicing on sub columns requires a list of super column names, a
  second standard CF is needed to get a range of names before doing a
  query on the main super CF. With CASSANDRA-2710, the same is
  possible using a standard CF with composite types instead of a
  super CF.
  
  2) If one of the dimensions is small, a two dimensional slice isn't
  required. The data can be stored in a standard CF with linear
  ordering on a composite type (large_dimension, small_dimension).
  Data is queried based on the large dimension, and the client throws
  out the extra data in the other dimension.
  
  Neither of the above solutions are ideal. Does anyone else have a
  use case where two dimensional slicing is useful? Given the
  disadvantages of BOP, is it practical to make the composite column
  query model richer to support this sort of use case?
  
  Thanks,
  Bryce
 




two dimensional slicing

2012-01-20 Thread Bryce Allen
I'm storing very large versioned lists of names, and I'd like to
query a range of names within a given range of versions, which is a two
dimensional slice, in a single query. This is easy to do using
ByteOrderedPartitioner, but seems to require multiple (non-parallel)
queries and extra CFs when using RandomPartitioner.

I see two approaches when using RP:

1) Data is stored in a super column family, with one dimension being
the super column names and the other the sub column names. Since
slicing on sub columns requires a list of super column names, a
second standard CF is needed to get a range of names before doing a
query on the main super CF. With CASSANDRA-2710, the same is possible
using a standard CF with composite types instead of a super CF.

2) If one of the dimensions is small, a two dimensional slice isn't
required. The data can be stored in a standard CF with linear ordering
on a composite type (large_dimension, small_dimension). Data is queried
based on the large dimension, and the client throws out the extra data
in the other dimension.
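
For example, approach 2 in pycassa might look like this (a sketch with
made-up names, assuming a composite comparator):

  cols = cf.get(row_key, column_start=(large_lo,),
                column_finish=(large_hi,), column_count=10000)
  # keep only the wanted slice of the small dimension client-side
  wanted = dict((name, value) for name, value in cols.items()
                if small_lo <= name[1] <= small_hi)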

Neither of the above solutions are ideal. Does anyone else have a use
case where two dimensional slicing is useful? Given the disadvantages of
BOP, is it practical to make the composite column query model richer to
support this sort of use case?

Thanks,
Bryce




Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Bryce Allen
On Fri, 6 Jan 2012 10:38:17 -0800
Mohit Anchlia mohitanch...@gmail.com wrote:
 It could be as simple as reading before writing to make sure that
 email doesn't exist. But I think you are looking at how to handle 2
 concurrent requests for same email? Only way I can think of is:
 
 1) Create new CF say tracker
 2) write email and time uuid to CF tracker
 3) read from CF tracker
 4) if you find a row other than yours then wait and read again from
  tracker after a few ms
 5) read from USER CF
 6) write if no rows in USER CF
 7) delete from tracker
 
 Please note you might have to modify this logic a little bit, but this
 should give you some ideas of how to approach this problem without
 locking.

Distributed locking is pretty subtle; I haven't seen a correct solution
that uses just Cassandra, even with QUORUM read/write. I suspect it's
not possible.

With the above proposal, in step 4 two processes could both have
inserted an entry in the tracker before either gets a chance to check,
so you need a way to order the requests. I don't think the timestamp
works for ordering, because it's set by the client (even the internal
timestamp is set by the client), and will likely be different from
when the data is actually committed and available to read by other
clients.

For example:

* At time 0ms, client 1 starts insert of u...@example.org
* At time 1ms, client 2 also starts insert for u...@example.org
* At time 2ms, client 2 data is committed
* At time 3ms, client 2 reads tracker and sees that it's the only one,
  so enters the critical section
* At time 4ms, client 1 data is committed
* At time 5ms, client 1 reads tracker, and sees that it is not the only
  one, but since it has the lowest timestamp (0ms vs 1ms), it enters
  the critical section.

I don't think Cassandra counters work for ordering either.

This approach is similar to the Zookeeper lock recipe:
http://zookeeper.apache.org/doc/current/recipes.html#sc_recipes_Locks
but zookeeper has sequence nodes, which provide a consistent way of
ordering the requests. Zookeeper also avoids the busy waiting.
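
For comparison, the ZooKeeper version of the uniqueness check is short
(a sketch using the kazoo client library; the USER CF helpers are made
up):

  from kazoo.client import KazooClient

  zk = KazooClient(hosts='zk1:2181')
  zk.start()
  with zk.Lock('/locks/email/' + email):  # blocks until acquired
      if not user_exists(email):   # QUORUM read of the USER CF
          create_user(email)       # QUORUM write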

I'd be happy to be proven wrong. But even if it is possible, if it
involves a lot of complexity and busy waiting it's probably not worth
it. There's a reason people are using Zookeeper with Cassandra.

-Bryce




Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Bryce Allen
On Fri, 6 Jan 2012 10:03:38 -0800
Drew Kutcharian d...@venarc.com wrote:
 I know that this can be done using a lock manager such as ZooKeeper
 or HazelCast, but the issue with using either of them is that if
 ZooKeeper or HazelCast is down, then you can't be sure about the
 reliability of the lock. So this potentially, in the very rare
 instance where the lock manager is down and two users are registering
 with the same email, can cause major issues.

For most applications, if the lock manager is down, you don't acquire
the lock, so you don't enter the critical section. Rather than allowing
inconsistency, you become unavailable (at least to writes that require
a lock).

-Bryce




Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Bryce Allen
This looks like it:
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Implementing-locks-using-cassandra-only-tp5527076p5527076.html

There's also some interesting JIRA tickets related to locking/CAS:
https://issues.apache.org/jira/browse/CASSANDRA-2686
https://issues.apache.org/jira/browse/CASSANDRA-48

-Bryce

On Fri, 06 Jan 2012 14:53:21 -0600
Jeremiah Jordan jeremiah.jor...@morningstar.com wrote:
 Correct, any kind of locking in Cassandra requires clocks that are in 
 sync, and requires you to wait the possible clock-out-of-sync time
 before reading to check if you got the lock, to prevent the issue you
 describe below.
 
 There was a pretty detailed discussion of locking with only Cassandra
 a month or so back on this list.
 
 -Jeremiah
 
 On 01/06/2012 02:42 PM, Bryce Allen wrote:
  On Fri, 6 Jan 2012 10:38:17 -0800
   Mohit Anchlia mohitanch...@gmail.com wrote:
  It could be as simple as reading before writing to make sure that
  email doesn't exist. But I think you are looking at how to handle 2
  concurrent requests for same email? Only way I can think of is:
 
  1) Create new CF say tracker
  2) write email and time uuid to CF tracker
  3) read from CF tracker
  4) if you find a row other than yours then wait and read again from
   tracker after a few ms
  5) read from USER CF
  6) write if no rows in USER CF
  7) delete from tracker
 
  Please note you might have to modify this logic a little bit, but
  this should give you some ideas of how to approach this problem
  without locking.
  Distributed locking is pretty subtle; I haven't seen a correct
  solution that uses just Cassandra, even with QUORUM read/write. I
  suspect it's not possible.
 
  With the above proposal, in step 4 two processes could both have
  inserted an entry in the tracker before either gets a chance to
  check, so you need a way to order the requests. I don't think the
  timestamp works for ordering, because it's set by the client (even
  the internal timestamp is set by the client), and will likely be
  different from when the data is actually committed and available to
  read by other clients.
 
  For example:
 
  * At time 0ms, client 1 starts insert of u...@example.org
  * At time 1ms, client 2 also starts insert for u...@example.org
  * At time 2ms, client 2 data is committed
  * At time 3ms, client 2 reads tracker and sees that it's the only
  one, so enters the critical section
  * At time 4ms, client 1 data is committed
  * At time 5ms, client 1 reads tracker, and sees that it is not the
  only one, but since it has the lowest timestamp (0ms vs 1ms), it enters
  enters the critical section.
 
  I don't think Cassandra counters work for ordering either.
 
  This approach is similar to the Zookeeper lock recipe:
  http://zookeeper.apache.org/doc/current/recipes.html#sc_recipes_Locks
  but zookeeper has sequence nodes, which provide a consistent way of
  ordering the requests. Zookeeper also avoids the busy waiting.
 
  I'd be happy to be proven wrong. But even if it is possible, if it
  involves a lot of complexity and busy waiting it's probably not
  worth it. There's a reason people are using Zookeeper with
  Cassandra.
 
  -Bryce




Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Bryce Allen
I don't think it's just clock drift. There is also the period of time
between when the client selects a timestamp, and when the data ends up
committed to cassandra. That drift seems harder to control when the
nodes and/or clients are under load.

I agree that it would be nice to have something like this in Cassandra
core, but from the JIRA tickets it looks like this has been tried
before, and for various reasons was not added. It's definitely
non-trivial to get right.

On Fri, 6 Jan 2012 13:33:02 -0800
Mohit Anchlia mohitanch...@gmail.com wrote:
 This looks like the right way to do it. But remember this still doesn't
 guarantee anything if your clocks drift way too much. But it's a trade-off with
 having to manage one additional component or use something internal to
 C*. It would be good to see similar functionality implemented in C* so
 that clients don't have to deal with it explicitly.
 
 On Fri, Jan 6, 2012 at 1:16 PM, Bryce Allen bal...@ci.uchicago.edu
 wrote:
  This looks like it:
  http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Implementing-locks-using-cassandra-only-tp5527076p5527076.html
 
  There's also some interesting JIRA tickets related to locking/CAS:
  https://issues.apache.org/jira/browse/CASSANDRA-2686
  https://issues.apache.org/jira/browse/CASSANDRA-48
 
  -Bryce
 
  On Fri, 06 Jan 2012 14:53:21 -0600
  Jeremiah Jordan jeremiah.jor...@morningstar.com wrote:
  Correct, any kind of locking in Cassandra requires clocks that are
  in sync, and requires you to wait the possible clock-out-of-sync time
  before reading to check if you got the lock, to prevent the issue
  you describe below.
 
  There was a pretty detailed discussion of locking with only
  Cassandra a month or so back on this list.
 
  -Jeremiah
 
  On 01/06/2012 02:42 PM, Bryce Allen wrote:
   On Fri, 6 Jan 2012 10:38:17 -0800
    Mohit Anchlia mohitanch...@gmail.com wrote:
   It could be as simple as reading before writing to make sure
   that email doesn't exist. But I think you are looking at how to
   handle 2 concurrent requests for same email? Only way I can
   think of is:
  
   1) Create new CF say tracker
   2) write email and time uuid to CF tracker
   3) read from CF tracker
   4) if you find a row other than yours then wait and read again
    from tracker after a few ms
   5) read from USER CF
   6) write if no rows in USER CF
   7) delete from tracker
  
   Please note you might have to modify this logic a little bit,
   but this should give you some ideas of how to approach this
   problem without locking.
   Distributed locking is pretty subtle; I haven't seen a correct
   solution that uses just Cassandra, even with QUORUM read/write. I
   suspect it's not possible.
  
   With the above proposal, in step 4 two processes could both have
   inserted an entry in the tracker before either gets a chance to
   check, so you need a way to order the requests. I don't think the
   timestamp works for ordering, because it's set by the client
   (even the internal timestamp is set by the client), and will
   likely be different from when the data is actually committed and
   available to read by other clients.
  
   For example:
  
   * At time 0ms, client 1 starts insert of u...@example.org
   * At time 1ms, client 2 also starts insert for u...@example.org
   * At time 2ms, client 2 data is committed
   * At time 3ms, client 2 reads tracker and sees that it's the only
   one, so enters the critical section
   * At time 4ms, client 1 data is committed
    * At time 5ms, client 1 reads tracker, and sees that it is not the
   only one, but since it has the lowest timestamp (0ms vs 1ms), it
   enters the critical section.
  
   I don't think Cassandra counters work for ordering either.
  
   This approach is similar to the Zookeeper lock recipe:
   http://zookeeper.apache.org/doc/current/recipes.html#sc_recipes_Locks
   but zookeeper has sequence nodes, which provide a consistent way
   of ordering the requests. Zookeeper also avoids the busy waiting.
  
   I'd be happy to be proven wrong. But even if it is possible, if
   it involves a lot of complexity and busy waiting it's probably
   not worth it. There's a reason people are using Zookeeper with
   Cassandra.
  
   -Bryce




Re: How to reliably achieve unique constraints with Cassandra?

2012-01-06 Thread Bryce Allen
That's a good question, and I'm not sure - I'm fairly new to both ZK
and Cassandra. I found this wiki page:
http://wiki.apache.org/hadoop/ZooKeeper/FailureScenarios
and I think the lock recipe still works, even if a stale read happens.
Assuming that wiki page is correct.

There is still subtlety to locking with ZK though; see the "Locks based
on ephemeral nodes" thread from the ZK mailing list in October:
http://mail-archives.apache.org/mod_mbox/zookeeper-user/201110.mbox/thread?0

-Bryce

On Fri, 6 Jan 2012 13:36:52 -0800
Drew Kutcharian d...@venarc.com wrote:
 Bryce, 
 
 I'm not sure about ZooKeeper, but I know if you have a partition
 between HazelCast nodes, than the nodes can acquire the same lock
 independently in each divided partition. How does ZooKeeper handle
 this situation?
 
 -- Drew
 
 
 On Jan 6, 2012, at 12:48 PM, Bryce Allen wrote:
 
  On Fri, 6 Jan 2012 10:03:38 -0800
  Drew Kutcharian d...@venarc.com wrote:
  I know that this can be done using a lock manager such as ZooKeeper
  or HazelCast, but the issue with using either of them is that if
  ZooKeeper or HazelCast is down, then you can't be sure about the
  reliability of the lock. So this potentially, in the very rare
  instance where the lock manager is down and two users are
  registering with the same email, can cause major issues.
  
   For most applications, if the lock manager is down, you don't
  acquire the lock, so you don't enter the critical section. Rather
  than allowing inconsistency, you become unavailable (at least to
  writes that require a lock).
  
  -Bryce
 




Re: Choosing a Partitioner Type for Random java.util.UUID Row Keys

2011-12-22 Thread Bryce Allen
Thanks, that definitely has advantages over using a super column. We
ran into thrift timeouts when the super column got large, and with the
super column range query there is no way (AFAIK) to batch the request at
the subcolumn level.
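
With composite columns the paging can happen at what used to be the
subcolumn level, e.g. (a rough sketch with made-up names; I haven't
benchmarked this):

  start = (ts,)
  while True:
      page = cf.get(row_key, column_start=start,
                    column_finish=(ts,), column_count=1000)
      process(page)
      if len(page) < 1000:
          break
      # start the next page at the last column seen; note the first
      # column returned will be a repeat
      start = page.keys()[-1]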

-Bryce

On Thu, 22 Dec 2011 10:06:58 +1300
aaron morton aa...@thelastpickle.com wrote:
 AFAIK there are no plans to kill the BOP, but I would still try to make
 your life easier by using the RP.
 
 My understanding of the problem is: at certain times you snapshot the
 files in a dir, and the main query you want to handle is "At what
 points between time t0 and time t1 did files x, y and z exist?"
 
 You could consider:
 
 1) Partition the time series data across rows, making the row key the
 timestamp for the start of the partition. If you have rollup
 partitions, consider making the row key timestamp : partition_size,
 e.g. 123456789.1d for a 1 day partition that starts at 123456789.
 2) In each row use column names of the form timestamp : file_name,
 where timestamp is the time of the snapshot.
 
 To query between two times (t0 and t1):
 
 1) Determine which partitions the time span covers; this will give
 you a list of rows.
 2) Execute a multi-get slice for all the rows using t0:* and t1:*
 (I'm using * here as a null; check with your client to see how to use
 composite columns.)
 
 Hope that helps. 
 Aaron
 
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 21/12/2011, at 9:03 AM, Bryce Allen wrote:
 
  I wasn't aware of CompositeColumns, thanks for the tip. However I
  think it still doesn't allow me to do the query I need - basically
  I need to do a timestamp range query, limiting only to certain file
  names at each timestamp. With BOP and a separate row for each
  timestamp, prefixed by a random UUID, and file names as column
  names, I can do this query. With CompositeColumns, I can only query
  one contiguous range, so I'd have to know the timestamps before
  hand to limit the file names. I can resolve this using indexes, but
  on paper it looks like this would be significantly slower (it would
  take me 5 round trips instead of 3 to complete each query, and the
  query is made multiple times on every single client request).
  
  The two down sides I've seen listed for BOP are balancing issues and
  hotspots. I can understand why RP is recommended, from the balancing
  issues alone. However these aren't problems for my application. Is
  there anything else I am missing? Does the Cassandra team plan on
  continuing to support BOP? I haven't completely ruled out RP, but I
  like having BOP as an option, it opens up interesting modeling
  alternatives that I think have real advantages for some
  (if uncommon) applications.
  
  Thanks,
  Bryce
  
  On Wed, 21 Dec 2011 08:08:16 +1300
  aaron morton aa...@thelastpickle.com wrote:
  Bryce, 
 Have you considered using CompositeColumns and a standard
  CF? Row key is the UUID column name is (timestamp : dir_entry) you
  can then slice all columns with a particular time stamp. 
  
 Even if you have a random key, I would use the RP unless
  you have an extreme use case. 
  
  Cheers
  
  -
  Aaron Morton
  Freelance Developer
  @aaronmorton
  http://www.thelastpickle.com
  
  On 21/12/2011, at 3:06 AM, Bryce Allen wrote:
  
  I think it comes down to how much you benefit from row range
  scans, and how confident you are that going forward all data will
  continue to use random row keys.
  
   I'm considering using BOP as a way of working around the non-indexed
   super column limitation. In my current schema, row keys
  are random UUIDs, super column names are timestamps, and columns
  contain a snapshot in time of directory contents, and could be
  quite large. If instead I use row keys that are
  (uuid)-(timestamp), and use a standard column family, I can do a
  row range query and select only specific columns. I'm still
  evaluating if I can do this with BOP - ideally the token would
  just use the first 128 bits of the key, and I haven't found any
  documentation on how it compares keys of different length.
  
  Another trick with BOP is to use MD5(rowkey)-rowkey for data that
  has non uniform row keys. I think it's reasonable to use if most
  data is uniform and benefits from range scans, but a few things
  are added that aren't/don't. This trick does make the keys larger,
  which increases storage cost and IO load, so it's probably a bad
  idea if a significant subset of the data requires it.
  
  Disclaimer - I wrote that wiki article to fill in a documentation
  gap, since there were no examples of BOP and I wasted a lot of
  time before I noticed the hex byte array vs decimal distinction
  for specifying the initial tokens (which to be fair is
  documented, just easy to miss on a skim). I'm also new to
  cassandra, I'm just describing what makes sense to me on paper.
  FWIW I confirmed that random UUIDs (type 4) row

Re: Choosing a Partitioner Type for Random java.util.UUID Row Keys

2011-12-20 Thread Bryce Allen
I think it comes down to how much you benefit from row range scans, and
how confident you are that going forward all data will continue to use
random row keys.

I'm considering using BOP as a way of working around the non-indexed
super column limitation. In my current schema, row keys are random
UUIDs, super column names are timestamps, and columns contain a
snapshot in time of directory contents, and could be quite large. If
instead I use row keys that are (uuid)-(timestamp), and use a standard
column family, I can do a row range query and select only specific
columns. I'm still evaluating if I can do this with BOP - ideally the
token would just use the first 128 bits of the key, and I haven't found
any documentation on how it compares keys of different length.
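
The row range query I have in mind looks roughly like this (a pycassa
sketch with made-up names; it requires BOP):

  # zero-pad the timestamp so byte order matches numeric order
  start = '%s-%010d' % (dir_uuid, t0)
  finish = '%s-%010d' % (dir_uuid, t1)
  for key, cols in cf.get_range(start=start, finish=finish,
                                columns=['fileA', 'fileB']):
      # one row per snapshot in [t0, t1], only the named files
      handle_snapshot(key, cols)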

Another trick with BOP is to use MD5(rowkey)-rowkey for data that has
non-uniform row keys. I think it's reasonable when most of the data is
uniform and benefits from range scans, but a few added items aren't
uniform or don't benefit. This trick does make the keys larger, which
increases storage cost and IO load, so it's probably a bad idea if a
significant subset of the data requires it.
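
That is, something like (sketch):

  import hashlib

  def bop_key(rowkey):
      # prefix with the MD5 digest so BOP spreads non-uniform keys,
      # keeping the original key appended so it can still be recovered
      return hashlib.md5(rowkey).digest() + rowkey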

Disclaimer - I wrote that wiki article to fill in a documentation gap,
since there were no examples of BOP and I wasted a lot of time before I
noticed the hex byte array vs decimal distinction for specifying the
initial tokens (which to be fair is documented, just easy to miss on a
skim). I'm also new to cassandra, I'm just describing what makes sense
to me on paper. FWIW I confirmed that random UUIDs (type 4) row keys
really do evenly distribute when using BOP.

-Bryce

On Mon, 19 Dec 2011 19:01:00 -0800
Drew Kutcharian d...@venarc.com wrote:
 Hey Guys,
 
 I just came across
 http://wiki.apache.org/cassandra/ByteOrderedPartitioner and it got me
 thinking. If the row keys are java.util.UUID which are generated
 randomly (and securely), then what type of partitioner would be the
 best? Since the key values are already random, would it make a
 difference to use RandomPartitioner or one can use
 ByteOrderedPartitioner or OrderPreservingPartitioner as well and get
 the same result?
 
 -- Drew
 




Re: Choosing a Partitioner Type for Random java.util.UUID Row Keys

2011-12-20 Thread Bryce Allen
I wasn't aware of CompositeColumns, thanks for the tip. However I think
it still doesn't allow me to do the query I need - basically I need to
do a timestamp range query, limiting only to certain file names at
each timestamp. With BOP and a separate row for each timestamp,
prefixed by a random UUID, and file names as column names, I can do this
query. With CompositeColumns, I can only query one contiguous range, so
I'd have to know the timestamps before hand to limit the file names. I
can resolve this using indexes, but on paper it looks like this would be
significantly slower (it would take me 5 round trips instead of 3 to
complete each query, and the query is made multiple times on every
single client request).

The two down sides I've seen listed for BOP are balancing issues and
hotspots. I can understand why RP is recommended, from the balancing
issues alone. However these aren't problems for my application. Is
there anything else I am missing? Does the Cassandra team plan on
continuing to support BOP? I haven't completely ruled out RP, but I
like having BOP as an option, it opens up interesting modeling
alternatives that I think have real advantages for some
(if uncommon) applications.

Thanks,
Bryce

On Wed, 21 Dec 2011 08:08:16 +1300
aaron morton aa...@thelastpickle.com wrote:
 Bryce, 
   Have you considered using CompositeColumns and a standard CF?
 Row key is the UUID, column name is (timestamp : dir_entry); you can
 then slice all columns with a particular timestamp.
 
   Even if you have a random key, I would use the RP unless you
 have an extreme use case. 
 
  Cheers
 
 -
 Aaron Morton
 Freelance Developer
 @aaronmorton
 http://www.thelastpickle.com
 
 On 21/12/2011, at 3:06 AM, Bryce Allen wrote:
 
  I think it comes down to how much you benefit from row range scans,
  and how confident you are that going forward all data will continue
  to use random row keys.
  
   I'm considering using BOP as a way of working around the non-indexed
  super column limitation. In my current schema, row keys are random
  UUIDs, super column names are timestamps, and columns contain a
  snapshot in time of directory contents, and could be quite large. If
  instead I use row keys that are (uuid)-(timestamp), and use a
  standard column family, I can do a row range query and select only
  specific columns. I'm still evaluating if I can do this with BOP -
  ideally the token would just use the first 128 bits of the key, and
  I haven't found any documentation on how it compares keys of
  different length.
  
  Another trick with BOP is to use MD5(rowkey)-rowkey for data that
  has non uniform row keys. I think it's reasonable to use if most
  data is uniform and benefits from range scans, but a few things are
  added that aren't/don't. This trick does make the keys larger,
  which increases storage cost and IO load, so it's probably a bad
  idea if a significant subset of the data requires it.
  
  Disclaimer - I wrote that wiki article to fill in a documentation
  gap, since there were no examples of BOP and I wasted a lot of time
  before I noticed the hex byte array vs decimal distinction for
  specifying the initial tokens (which to be fair is documented, just
  easy to miss on a skim). I'm also new to cassandra, I'm just
  describing what makes sense to me on paper. FWIW I confirmed that
  random UUIDs (type 4) row keys really do evenly distribute when
  using BOP.
  
  -Bryce
  
  On Mon, 19 Dec 2011 19:01:00 -0800
  Drew Kutcharian d...@venarc.com wrote:
  Hey Guys,
  
  I just came across
  http://wiki.apache.org/cassandra/ByteOrderedPartitioner and it got
  me thinking. If the row keys are java.util.UUID which are generated
  randomly (and securely), then what type of partitioner would be the
  best? Since the key values are already random, would it make a
  difference to use RandomPartitioner or one can use
   ByteOrderedPartitioner or OrderPreservingPartitioner as well and
  get the same result?
  
  -- Drew
  
 

