Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin
I believe the ByteOrderedPartitioner is being deprecated, and for good reason.  I 
would look at what you could achieve by using wide rows and the Murmur3Partitioner.



--
Colin
320-221-9531


 On Jun 6, 2014, at 5:27 PM, Kevin Burton bur...@spinn3r.com wrote:
 
 We have the requirement to have clients read from our tables while they're 
 being written.
 
 Basically, any write that we make to cassandra needs to be sent out over the 
 Internet to our customers.
 
 We also need them to resume so if they go offline, they can just pick up 
 where they left off.
 
 They need to do this in parallel, so if we have 20 cassandra nodes, they can 
 have 20 readers each efficiently (and without coordination) reading from our 
 tables.
 
 Here's how we're planning on doing it.
 
 We're going to use the ByteOrderedPartitioner .
 
 I'm writing with the timestamp as the primary key; however, in practice this 
 would yield hotspots.
 
 (I'm also aware that time isn't a very good PK in a distributed system, as I 
 can easily have a collision, so we're going to use a scheme similar to a UUID 
 to make it unique per writer).
 
 One node would take all the load, followed by the next node, etc.
 
 So my plan to avoid this is to prefix a slice ID to the timestamp.  This way 
 each piece of content has a unique ID, but the prefix will place it on a node.
 
 The slice ID is just a byte… so this means there are 255 buckets in which I 
 can place data.  
 
 This means I can have clients each start with a slice, and a timestamp, and 
 page through the data with tokens.
 
 This way I can have a client reading with 255 threads from 255 regions in the 
 cluster, in parallel, without any hot spots.
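 
 For illustration, a minimal sketch of that layout (a rough idea only; the table 
 and column names here are invented):
 
 -- Hypothetical sketch: under the ByteOrderedPartitioner the composite
 -- partition key is placed by its byte order, so each slice prefix owns a
 -- contiguous range of the ring that a reader can page through with token().
 CREATE TABLE content_stream (
     slice     int,        -- the one-byte prefix that spreads load
     ts        timestamp,  -- write time
     writer_id timeuuid,   -- uniquifier in case two writers share a timestamp
     payload   blob,
     PRIMARY KEY ((slice, ts, writer_id))
 );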
 
 Thoughts on this strategy?  
 
 -- 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 Skype: burtonator
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
 people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread DuyHai Doan
 One node would take all the load, followed by the next node -- with
this design, you are not exploiting the full power of the cluster. If only
one node takes all the load at a time, what is the point of having 20 or 10
nodes?

 You'd be better off using limited wide rows with bucketing to achieve this.

 You can have a look at this past thread, it may give you some ideas:
https://www.mail-archive.com/user@cassandra.apache.org/msg35666.html




On Sat, Jun 7, 2014 at 12:27 AM, Kevin Burton bur...@spinn3r.com wrote:

 We have the requirement to have clients read from our tables while they're
 being written.

 Basically, any write that we make to cassandra needs to be sent out over
 the Internet to our customers.

 We also need them to resume so if they go offline, they can just pick up
 where they left off.

 They need to do this in parallel, so if we have 20 cassandra nodes, they
 can have 20 readers each efficiently (and without coordination) reading
 from our tables.

 Here's how we're planning on doing it.

 We're going to use the ByteOrderedPartitioner .

 I'm writing with a primary key of the timestamp, however, in practice,
 this would yield hotspots.

 (I'm also aware that time isn't a very good PK in a distributed system as I
 can easily have a collision so we're going to use a scheme similar to a
 UUID to make it unique per writer).

 One node would take all the load, followed by the next node, etc.

 So my plan to stop this is to prefix a slice ID to the timestamp.  This
 way each piece of content has a unique ID, but the prefix will place it on
 a node.

 The slice ID is just a byte… so this means there are 255 buckets in which
 I can place data.

 This means I can have clients each start with a slice, and a timestamp,
 and page through the data with tokens.

 This way I can have a client reading with 255 threads from 255 regions in
 the cluster, in parallel, without any hot spots.

 Thoughts on this strategy?

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
I just checked the source and in 2.1.0 it's not deprecated.

So it *might* be *being* deprecated but I haven't seen anything stating
that.


On Sat, Jun 7, 2014 at 8:03 AM, Colin colpcl...@gmail.com wrote:

 I believe Byteorderedpartitioner is being deprecated and for good reason.
  I would look at what you could achieve by using wide rows and
 murmur3partitioner.



 --
 Colin
 320-221-9531


 On Jun 6, 2014, at 5:27 PM, Kevin Burton bur...@spinn3r.com wrote:

 We have the requirement to have clients read from our tables while they're
 being written.

 Basically, any write that we make to cassandra needs to be sent out over
 the Internet to our customers.

 We also need them to resume so if they go offline, they can just pick up
 where they left off.

 They need to do this in parallel, so if we have 20 cassandra nodes, they
 can have 20 readers each efficiently (and without coordination) reading
 from our tables.

 Here's how we're planning on doing it.

 We're going to use the ByteOrderedPartitioner .

 I'm writing with a primary key of the timestamp, however, in practice,
 this would yield hotspots.

 (I'm also aware that time isn't a very good PK in a distributed system as I
 can easily have a collision so we're going to use a scheme similar to a
 UUID to make it unique per writer).

 One node would take all the load, followed by the next node, etc.

 So my plan to stop this is to prefix a slice ID to the timestamp.  This
 way each piece of content has a unique ID, but the prefix will place it on
 a node.

 The slice ID is just a byte… so this means there are 255 buckets in which
 I can place data.

 This means I can have clients each start with a slice, and a timestamp,
 and page through the data with tokens.

 This way I can have a client reading with 255 threads from 255 regions in
 the cluster, in parallel, without any hot spots.

 Thoughts on this strategy?

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin Clark
It's an anti-pattern and there are better ways to do this.

I have implemented the paging algorithm you've described using wide rows
and bucketing.  This approach is a more efficient utilization of
Cassandra's built-in wholesome goodness.

Also, I wouldn't let any number of clients (potentially huge) connect directly to the
cluster to do this; put some type of app server in between to handle the
comms and fan-out.  You'll get better utilization of resources and less
overhead, in addition to flexibility over which data center you're utilizing
to serve requests.



--
Colin
320-221-9531


On Jun 7, 2014, at 12:28 PM, Kevin Burton bur...@spinn3r.com wrote:

I just checked the source and in 2.1.0 it's not deprecated.

So it *might* be *being* deprecated but I haven't seen anything stating
that.


On Sat, Jun 7, 2014 at 8:03 AM, Colin colpcl...@gmail.com wrote:

 I believe Byteorderedpartitioner is being deprecated and for good reason.
  I would look at what you could achieve by using wide rows and
 murmur3partitioner.



 --
 Colin
 320-221-9531


 On Jun 6, 2014, at 5:27 PM, Kevin Burton bur...@spinn3r.com wrote:

 We have the requirement to have clients read from our tables while they're
 being written.

 Basically, any write that we make to cassandra needs to be sent out over
 the Internet to our customers.

 We also need them to resume so if they go offline, they can just pick up
 where they left off.

 They need to do this in parallel, so if we have 20 cassandra nodes, they
 can have 20 readers each efficiently (and without coordination) reading
 from our tables.

 Here's how we're planning on doing it.

 We're going to use the ByteOrderedPartitioner .

 I'm writing with a primary key of the timestamp, however, in practice,
 this would yield hotspots.

 (I'm also aware that time isn't a very good PK in a distributed system as I
 can easily have a collision so we're going to use a scheme similar to a
 UUID to make it unique per writer).

 One node would take all the load, followed by the next node, etc.

 So my plan to stop this is to prefix a slice ID to the timestamp.  This
 way each piece of content has a unique ID, but the prefix will place it on
 a node.

 The slice ID is just a byte… so this means there are 255 buckets in which
 I can place data.

 This means I can have clients each start with a slice, and a timestamp,
 and page through the data with tokens.

 This way I can have a client reading with 255 threads from 255 regions in
 the cluster, in parallel, without any hot spots.

 Thoughts on this strategy?

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
On Sat, Jun 7, 2014 at 10:41 AM, Colin Clark co...@clark.ws wrote:

 It's an anti-pattern and there are better ways to do this.


Entirely possible :)

It would be nice to have a document with a bunch of common cassandra design
patterns.

I've been trying to track down a pattern for this, and a lot of it is
pieced together across individual blog posts, so one has to reverse
engineer it.


 I have implemented the paging algorithm you've described using wide rows
 and bucketing.  This approach is a more efficient utilization of
 Cassandra's built in wholesome goodness.


So.. I assume the general pattern is to:

create buckets.. you create, say, 2^16 buckets; this is your partition key.


Then you place a timestamp next to the bucket in a primary key.

So essentially:

primary key( bucket, timestamp )…

.. so to read from this bucket you essentially execute:

select * from foo where bucket = 100 and timestamp > 12345790 limit 1;
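
Putting that together, a minimal sketch of what I mean (assuming the
Murmur3Partitioner; table and column names are invented):

-- bucket spreads writes/reads across the cluster; ts orders rows inside a bucket.
CREATE TABLE foo (
    bucket  int,
    ts      timestamp,
    payload blob,
    PRIMARY KEY ((bucket), ts)
);

-- Each reader owns one or more buckets and pages forward by timestamp,
-- e.g. with a hypothetical page size of 1000:
SELECT * FROM foo WHERE bucket = 100 AND ts > 12345790 LIMIT 1000;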



 Also, I wouldn't let any number of clients (huge) connect directly the
 cluster to do this-put some type of app server in between to handle the
 comm's and fan out.  You'll get better utilization of resources and less
 overhead in addition to flexibility of which data center you're utilizing
 to serve requests.


this is interesting… since the partition is the bucket, you could make some
poor decisions based on the number of buckets.

For example,

if you use 2^64 buckets, the number of items in each bucket is going to be
rather small.  So you're going to have tons of queries each fetching 0-1
row (if you have a small amount of data).

But if you use very FEW buckets.. say 5, but you have a cluster of 1000
nodes, then you will have 5 of these buckets on 5 nodes, and the rest of
the nodes without any data.

Hm..

the byte ordered partitioner solves this problem because I can just pick a
fixed number of buckets and then this is the primary key prefix and the
data in a bucket can be split up across machines based on any arbitrary
split even in the middle of a 'bucket' …


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


A list of all potential problems when using byte ordered partitioner?

2014-06-07 Thread Kevin Burton
I believe I'm aware of the problems that can arise due to the byte ordered
partitioner.

Is there a full list of ALL the problems?  I want to make sure I'm not
missing anything.

The main problems I'm aware of are:

Natural inserts, where the key is something like a username, will tend
to have a radically uneven value distribution.

Another problem is that bulk loading data will sequentially overload one of
the nodes, followed by the next.  This will trigger a bottleneck in the
cluster, and your write throughput will only be as good as a single node's.

I believe my design would benefit from the byte ordered partitioner.

I'm going to use the MD5 hash of a unique identifier for the primary
key.  (MD5 probably, as it doesn't need to be secure and is faster than
SHA1, but perhaps SHA1 anyway just to avoid any potential collisions).

This way I can use the cluster in both modes… since I'm going to be using
a hash code as my primary key for some tables, it will behave like the
RandomPartitioner.

And for certain tables which are append-only , I can get range queries
across the cluster if I design my primary key correctly.
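
As a rough sketch of the two modes (illustrative names only, and assuming the
hashing is done client-side, since CQL has no built-in MD5 function):

-- Hash-keyed table: an MD5/SHA1 hex key computed by the client spreads rows
-- evenly even under an ordered partitioner, much like RandomPartitioner would.
CREATE TABLE documents (
    doc_hash text PRIMARY KEY,   -- client-computed hash of the natural id
    body     blob
);

-- Append-only table: the key is laid out so an ordered partitioner allows
-- range scans over the (bucket, ts) prefix.
CREATE TABLE content_stream (
    bucket int,
    ts     timestamp,
    body   blob,
    PRIMARY KEY ((bucket, ts))
);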

Kevin

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
Another way around this is to have a separate table storing the number of
buckets.

This way if you have too few buckets, you can just increase them in the
future.

Of course, the older data will still have too few buckets :-(
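
A minimal sketch of that lookup table (hypothetical names):

-- One row per logical stream, recording how many buckets its rows are spread
-- over, so readers know how far to fan out.  Changing the count only affects
-- data written after the change.
CREATE TABLE bucket_config (
    stream_name text PRIMARY KEY,
    num_buckets int
);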


On Sat, Jun 7, 2014 at 11:09 AM, Kevin Burton bur...@spinn3r.com wrote:


 On Sat, Jun 7, 2014 at 10:41 AM, Colin Clark co...@clark.ws wrote:

 It's an anti-pattern and there are better ways to do this.


 Entirely possible :)

 It would be nice to have a document with a bunch of common cassandra
 design patterns.

 I've been trying to track down a pattern for this and a lot of this is
 pieced in different places an individual blogs posts so one has to reverse
 engineer it.


 I have implemented the paging algorithm you've described using wide rows
 and bucketing.  This approach is a more efficient utilization of
 Cassandra's built in wholesome goodness.


 So.. I assume the general pattern is to:

 create a bucket.. you create like 2^16 buckets, this is your partition
 key.

 Then you place a timestamp next to the bucket in a primary key.

 So essentially:

 primary key( bucket, timestamp )…

 .. so to read from this bucket you essentially execute:

 select * from foo where bucket = 100 and timestamp > 12345790 limit 1;



 Also, I wouldn't let any number of clients (huge) connect directly the
 cluster to do this-put some type of app server in between to handle the
 comm's and fan out.  You'll get better utilization of resources and less
 overhead in addition to flexibility of which data center you're utilizing
 to serve requests.


 this is interesting… since the partition is the bucket, you could make
 some poor decisions based on the number of buckets.

 For example,

 if you use 2^64 buckets, the number of items in each bucket is going to be
 rather small.  So you're going to have tons of queries each fetching 0-1
 row (if you have a small amount of data).

 But if you use very FEW buckets.. say 5, but you have a cluster of 1000
 nodes, then you will have 5 of these buckets on 5 nodes, and the rest of
 the nodes without any data.

 Hm..

 the byte ordered partitioner solves this problem because I can just pick a
 fixed number of buckets and then this is the primary key prefix and the
 data in a bucket can be split up across machines based on any arbitrary
 split even in the middle of a 'bucket' …


 --

 Founder/CEO Spinn3r.com
  Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin
Maybe it makes sense to describe what you're trying to accomplish in more 
detail.

A common bucketing approach is along the lines of year, month, day, hour, 
minute, etc., and then using a timeuuid as a clustering column.
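
For example, a minimal sketch of that kind of bucketing (names are illustrative 
only):

-- One partition per (day, hour, minute) bucket; the timeuuid clustering column
-- keeps events in time order and unique within the bucket.
CREATE TABLE events_by_minute (
    day      text,      -- e.g. '2014-06-07'
    hour     int,
    minute   int,
    event_id timeuuid,
    payload  blob,
    PRIMARY KEY ((day, hour, minute), event_id)
);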

Depending upon the semantics of the transport protocol you plan on utilizing, 
either the client code could keep track of pagination, or the app server could, if 
you utilized some type of request/reply/ack flow.  You could keep sequence 
numbers for each client, and begin streaming data to them or allow querying 
upon reconnect, etc.

But again, more details of the use case might prove useful.

--
Colin
320-221-9531


 On Jun 7, 2014, at 1:53 PM, Kevin Burton bur...@spinn3r.com wrote:
 
 Another way around this is to have a separate table storing the number of 
 buckets.
 
 This way if you have too few buckets, you can just increase them in the 
 future.
 
 Of course, the older data will still have too few buckets :-(
 
 
 On Sat, Jun 7, 2014 at 11:09 AM, Kevin Burton bur...@spinn3r.com wrote:
 
 On Sat, Jun 7, 2014 at 10:41 AM, Colin Clark co...@clark.ws wrote:
 It's an anti-pattern and there are better ways to do this.
 
 Entirely possible :)
 
 It would be nice to have a document with a bunch of common cassandra design 
 patterns.
 
 I've been trying to track down a pattern for this and a lot of this is 
 pieced in different places an individual blogs posts so one has to reverse 
 engineer it.
  
 I have implemented the paging algorithm you've described using wide rows 
 and bucketing.  This approach is a more efficient utilization of 
 Cassandra's built in wholesome goodness.
 
 So.. I assume the general pattern is to:
 
 create a bucket.. you create like 2^16 buckets, this is your partition key.  
  
 
 Then you place a timestamp next to the bucket in a primary key.
 
 So essentially:
 
 primary key( bucket, timestamp )… 
 
 .. so to read from this bucket you essentially execute: 
 
 select * from foo where bucket = 100 and timestamp > 12345790 limit 1;
  
 
 Also, I wouldn't let any number of clients (huge) connect directly the 
 cluster to do this-put some type of app server in between to handle the 
 comm's and fan out.  You'll get better utilization of resources and less 
 overhead in addition to flexibility of which data center you're utilizing 
 to serve requests. 
 
 this is interesting… since the partition is the bucket, you could make some 
 poor decisions based on the number of buckets.
 
 For example, 
 
 if you use 2^64 buckets, the number of items in each bucket is going to be 
 rather small.  So you're going to have tons of queries each fetching 0-1 row 
 (if you have a small amount of data).
 
 But if you use very FEW buckets.. say 5, but you have a cluster of 1000 
 nodes, then you will have 5 of these buckets on 5 nodes, and the rest of the 
 nodes without any data.
 
 Hm..
 
 the byte ordered partitioner solves this problem because I can just pick a 
 fixed number of buckets and then this is the primary key prefix and the data 
 in a bucket can be split up across machines based on any arbitrary split 
 even in the middle of a 'bucket' …
 
 
 -- 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 Skype: burtonator
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
 people.
 
 
 
 -- 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 Skype: burtonator
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
 people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
On Sat, Jun 7, 2014 at 1:34 PM, Colin colpcl...@gmail.com wrote:

 Maybe it makes sense to describe what you're trying to accomplish in more
 detail.


Essentially, I'm appending writes of recent data by our crawler and
sending that data to our customers.

They need to sync to up-to-date writes… we need to get them the writes within
seconds.

A common bucketing approach is along the lines of year, month, day, hour,
 minute, etc and then use a timeuuid as a cluster column.


I mean that is acceptable.. but that means for that 1 minute interval, all
writes are going to that one node (and its replicas)

So that means the total cluster throughput is bottlenecked on the max disk
throughput of a single node.

Same thing for reads… unless our customers are lagged, they are all going
to stampede and ALL of them are going to read data from one node, in a one
minute timeframe.

That's no fun..  that will easily DoS our cluster.


 Depending upon the semantics of the transport protocol you plan on
 utilizing, either the client code keep track of pagination, or the app
 server could, if you utilized some type of request/reply/ack flow.  You
 could keep sequence numbers for each client, and begin streaming data to
 them or allowing query upon reconnect, etc.

 But again, more details of the use case might prove useful.


I think if we were to use just 100 buckets it would probably work just fine.
 We're probably not going to be at more than 100 nodes in the next year, and if
we are, that's still reasonable performance.

I mean if each box has a 400GB SSD that's 40TB of VERY fast data.

Kevin

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin
Then add seconds to the bucket.  Also, the data will get cached; it's not going 
to hit disk on every read.

Look at the key cache settings on the table.  Also, in 2.1 you have even more 
control over caching.

--
Colin
320-221-9531


 On Jun 7, 2014, at 4:30 PM, Kevin Burton bur...@spinn3r.com wrote:
 
 
 On Sat, Jun 7, 2014 at 1:34 PM, Colin colpcl...@gmail.com wrote:
 Maybe it makes sense to describe what you're trying to accomplish in more 
 detail.
 
 Essentially , I'm appending writes of recent data by our crawler and sending 
 that data to our customers.
  
 They need to sync to up to date writes…we need to get them writes within 
 seconds. 
 
 A common bucketing approach is along the lines of year, month, day, hour, 
 minute, etc and then use a timeuuid as a cluster column.  
 
 I mean that is acceptable.. but that means for that 1 minute interval, all 
 writes are going to that one node (and its replicas)
 
 So that means the total cluster throughput is bottlenecked on the max disk 
 throughput.
 
 Same thing for reads… unless our customers are lagged, they are all going to 
 stampede and ALL of them are going to read data from one node, in a one 
 minute timeframe.
 
 That's no fun..  that will easily DoS our cluster.
  
 Depending upon the semantics of the transport protocol you plan on 
 utilizing, either the client code keep track of pagination, or the app 
 server could, if you utilized some type of request/reply/ack flow.  You 
 could keep sequence numbers for each client, and begin streaming data to 
 them or allowing query upon reconnect, etc.
 
 But again, more details of the use case might prove useful.
 
 I think if we were to just 100 buckets it would probably work just fine.  
 We're probably not going to be more than 100 nodes in the next year and if we 
 are that's still reasonable performance.  
 
 I mean if each box has a 400GB SSD that's 40TB of VERY fast data. 
 
 Kevin
 
 -- 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 Skype: burtonator
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
 people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
Well, you could add milliseconds, but at best you're still bottlenecking most of
your writes on one box.. maybe 2-3 if there are ones that are lagging.

Anyway.. I think using 100 buckets is probably fine..

Kevin


On Sat, Jun 7, 2014 at 2:45 PM, Colin colpcl...@gmail.com wrote:

 Then add seconds to the bucket.  Also, the data will get cached; it's not
  going to hit disk on every read.

 Look at the key cache settings on the table.  Also, in 2.1 you have even
 more control over caching.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 4:30 PM, Kevin Burton bur...@spinn3r.com wrote:


 On Sat, Jun 7, 2014 at 1:34 PM, Colin colpcl...@gmail.com wrote:

 Maybe it makes sense to describe what you're trying to accomplish in more
 detail.


 Essentially , I'm appending writes of recent data by our crawler and
 sending that data to our customers.

 They need to sync to up to date writes…we need to get them writes within
 seconds.

 A common bucketing approach is along the lines of year, month, day, hour,
 minute, etc and then use a timeuuid as a cluster column.


 I mean that is acceptable.. but that means for that 1 minute interval, all
 writes are going to that one node (and its replicas)

 So that means the total cluster throughput is bottlenecked on the max disk
 throughput.

 Same thing for reads… unless our customers are lagged, they are all going
 to stampede and ALL of them are going to read data from one node, in a one
 minute timeframe.

 That's no fun..  that will easily DoS our cluster.


 Depending upon the semantics of the transport protocol you plan on
 utilizing, either the client code keep track of pagination, or the app
 server could, if you utilized some type of request/reply/ack flow.  You
 could keep sequence numbers for each client, and begin streaming data to
 them or allowing query upon reconnect, etc.

 But again, more details of the use case might prove useful.


 I think if we were to just 100 buckets it would probably work just fine.
  We're probably not going to be more than 100 nodes in the next year and if
 we are that's still reasonable performance.

 I mean if each box has a 400GB SSD that's 40TB of VERY fast data.

 Kevin

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin Clark
No, you're not; the partition key will get distributed across the cluster if
you're using random or murmur.  You could also ensure that by adding
another column, like source, to ensure distribution.  (Add the seconds to the
partition key, not the clustering columns.)
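
A rough sketch of what that looks like (illustrative names only):

-- Composite partition key: the per-second bucket plus a source column spreads
-- each second's writes over many partitions instead of a single one.
CREATE TABLE events (
    bucket   timestamp,   -- time truncated to the second
    source   text,        -- e.g. an identifier for the writing process
    event_id timeuuid,
    payload  blob,
    PRIMARY KEY ((bucket, source), event_id)
);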

I can almost guarantee that if you put too much thought into working
against what Cassandra offers out of the box, it will bite you later.

In fact, the use case that you're describing may best be served by a
queuing mechanism, and using Cassandra only for the underlying store.

I used this exact same approach in a use case that involved writing over a
million events/second to a cluster with no problems.  Initially, I thought
ordered partitioner was the way to go too.  And I used separate processes
to aggregate, conflate, and handle distribution to clients.

Just my two cents, but I also spend the majority of my days helping people
utilize Cassandra correctly, and rescuing those that haven't.

:)

--
Colin
320-221-9531


On Jun 7, 2014, at 6:53 PM, Kevin Burton bur...@spinn3r.com wrote:

Well, you could add milliseconds, but at best you're still bottlenecking most of
your writes on one box.. maybe 2-3 if there are ones that are lagging.

Anyway.. I think using 100 buckets is probably fine..

Kevin


On Sat, Jun 7, 2014 at 2:45 PM, Colin colpcl...@gmail.com wrote:

 Then add seconds to the bucket.  Also, the data will get cached; it's not
  going to hit disk on every read.

 Look at the key cache settings on the table.  Also, in 2.1 you have even
 more control over caching.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 4:30 PM, Kevin Burton bur...@spinn3r.com wrote:


 On Sat, Jun 7, 2014 at 1:34 PM, Colin colpcl...@gmail.com wrote:

 Maybe it makes sense to describe what you're trying to accomplish in more
 detail.


 Essentially , I'm appending writes of recent data by our crawler and
 sending that data to our customers.

 They need to sync to up to date writes…we need to get them writes within
 seconds.

 A common bucketing approach is along the lines of year, month, day, hour,
 minute, etc and then use a timeuuid as a cluster column.


 I mean that is acceptable.. but that means for that 1 minute interval, all
 writes are going to that one node (and its replicas)

 So that means the total cluster throughput is bottlenecked on the max disk
 throughput.

 Same thing for reads… unless our customers are lagged, they are all going
 to stampede and ALL of them are going to read data from one node, in a one
 minute timeframe.

 That's no fun..  that will easily DoS our cluster.


 Depending upon the semantics of the transport protocol you plan on
 utilizing, either the client code keep track of pagination, or the app
 server could, if you utilized some type of request/reply/ack flow.  You
 could keep sequence numbers for each client, and begin streaming data to
 them or allowing query upon reconnect, etc.

 But again, more details of the use case might prove useful.


 I think if we were to just 100 buckets it would probably work just fine.
  We're probably not going to be more than 100 nodes in the next year and if
 we are that's still reasonable performance.

 I mean if each box has a 400GB SSD that's 40TB of VERY fast data.

 Kevin

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: problem removing dead node from ring

2014-06-07 Thread Curious Patient
Hey all,

 OK, I gave removing the downed node from the Cassandra ring another try.

To recap what's going on, this is what my ring looks like with nodetool
status:

[root@beta-new:~] #nodetool status

Datacenter: datacenter1

===

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

--  Address Load   Tokens  Owns   Host ID
Rack

UN  10.10.1.94  178.38 KB  256 49.4%
fd2f76ae-8dcf-4e93-a37f-bf1e9088696e  rack1

DN  10.10.1.98 ?  256 50.6%
f2a48fc7-a362-43f5-9061-4bb3739fdeaf  rack1

So I followed the steps in this document one more time:

http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_replace_node_t.html

And setup the following in the cassandra.yaml according to the above
instructions:

cluster_name: 'Test Cluster'

num_tokens: 256

seed_provider:

listen_address: 10.10.1.153

auto_bootstrap: yes

broadcast_address: 10.10.1.153

endpoint_snitch: SimpleSnitch

initial_token: -9173731940639284976


The initial_token is the one belonging to the dead node that I'm trying to
get rid of.

I then make sure that the /var/lib/cassandra directory is completely empty
and run this startup command:

[root@cassandra1 cassandrahome]# ./bin/cassandra
-Dcassandra.replace_address=10.10.1.98 -f

Using the IP of the node I want to remove as the value of
cassandra.replace_address

And when I do, this is the error I get:

java.lang.RuntimeException: Cannot replace_address /10.10.1.98 because it
doesn't exist in gossip


So how can I get cassandra to realize that this node needs to be replaced
and that it SHOULDN'T exist in gossip because the node is down? That would
seem obvious to me, so why isn't it obvious to her? :)


Thanks

Tim









On Wed, Jun 4, 2014 at 4:36 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Jun 3, 2014 at 9:03 PM, Matthew Allen matthew.j.al...@gmail.com
 wrote:

 Thanks Robert, this makes perfect sense.  Do you know if CASSANDRA-6961
 will be ported to 1.2.x ?


 I just asked driftx, he said not gonna happen.


 And apologies if these appear to be dumb questions, but is a repair more
 suitable than a rebuild because the rebuild only contacts 1 replica (per
 range), which may itself contain stale data ?


 Exactly that.

 https://issues.apache.org/jira/browse/CASSANDRA-2434

 Discusses related issues in quite some detail. The tl;dr is that until
 2434 is resolved, streams do not necessarily come from the node departing
 the range, and therefore the unique replica count is decreased by
 changing cluster topology.

 =Rob



Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
Thanks for the feedback on this btw… it's helpful.  My notes below.

On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark co...@clark.ws wrote:

 No, you're not-the partition key will get distributed across the cluster
 if you're using random or murmur.


Yes… I'm aware.  But in practice this is how it will work…

If we create bucket b0, that will get hashed to h0…

So say I have 50 machines performing writes, they are all on the same time
thanks to ntpd, so they all compute b0 for the current bucket based on the
time.

That gets hashed to h0…

If h0 is hosted on node0 … then all writes go to node zero for that 1
second interval.

So all my writes are bottlenecking on one node.  That node is *changing*
over time… but they're not being dispatched in parallel over N nodes.  At
most, writes will only ever reach 1 node at a time.



 You could also ensure that by adding another column, like source to ensure
 distribution. (Add the seconds to the partition key, not the clustering
 columns)

 I can almost guarantee that if you put too much thought into working
 against what Cassandra offers out of the box, that it will bite you later.


Sure.. I'm trying to avoid the 'bite you later' issues. More so because I'm
sure there are Cassandra gotchas to worry about.  Everything has them.
 Just trying to avoid the land mines :-P


 In fact, the use case that you're describing may best be served by a
 queuing mechanism, and using Cassandra only for the underlying store.


Yes… that's what I'm doing.  We're using Apollo to fan out the queue, but
the writes go back into Cassandra and need to be read out sequentially.



 I used this exact same approach in a use case that involved writing over a
 million events/second to a cluster with no problems.  Initially, I thought
 ordered partitioner was the way to go too.  And I used separate processes
 to aggregate, conflate, and handle distribution to clients.



Yes. I think using 100 buckets will work for now.  Plus I don't have to
change the partitioner on our existing cluster and I'm lazy :)



 Just my two cents, but I also spend the majority of my days helping people
 utilize Cassandra correctly, and rescuing those that haven't.


Definitely appreciate the feedback!  Thanks!

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin Clark
Not if you add another column to the partition key; source for example.

I would really try to stay away from the ordered partitioner if at all
possible.

What ingestion rates are you expecting, in size and speed?

--
Colin
320-221-9531


On Jun 7, 2014, at 9:05 PM, Kevin Burton bur...@spinn3r.com wrote:


Thanks for the feedback on this btw.. .it's helpful.  My notes below.

On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark co...@clark.ws wrote:

 No, you're not-the partition key will get distributed across the cluster
 if you're using random or murmur.


Yes… I'm aware.  But in practice this is how it will work…

If we create bucket b0, that will get hashed to h0…

So say I have 50 machines performing writes, they are all on the same time
thanks to ntpd, so they all compute b0 for the current bucket based on the
time.

That gets hashed to h0…

If h0 is hosted on node0 … then all writes go to node zero for that 1
second interval.

So all my writes are bottlenecking on one node.  That node is *changing*
over time… but they're not being dispatched in parallel over N nodes.  At
most writes will only ever reach 1 node a time.



 You could also ensure that by adding another column, like source to ensure
 distribution. (Add the seconds to the partition key, not the clustering
 columns)

 I can almost guarantee that if you put too much thought into working
 against what Cassandra offers out of the box, that it will bite you later.


Sure.. I'm trying to avoid the 'bite you later' issues. More so because I'm
sure there are Cassandra gotchas to worry about.  Everything has them.
 Just trying to avoid the land mines :-P


 In fact, the use case that you're describing may best be served by a
 queuing mechanism, and using Cassandra only for the underlying store.


Yes… that's what I'm doing.  We're using apollo to fan out the queue, but
the writes go back into cassandra and needs to be read out sequentially.



 I used this exact same approach in a use case that involved writing over a
 million events/second to a cluster with no problems.  Initially, I thought
 ordered partitioner was the way to go too.  And I used separate processes
 to aggregate, conflate, and handle distribution to clients.



Yes. I think using 100 buckets will work for now.  Plus I don't have to
change the partitioner on our existing cluster and I'm lazy :)



 Just my two cents, but I also spend the majority of my days helping people
 utilize Cassandra correctly, and rescuing those that haven't.


Definitely appreciate the feedback!  Thanks!

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
What's 'source'?  You mean like the URL?

If source is too random, it's going to yield too many buckets.

Ingestion rates are fairly high but not insane.  About 4M inserts per
hour.. from 5-10GB…


On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark co...@clark.ws wrote:

 Not if you add another column to the partition key; source for example.

 I would really try to stay away from the ordered partitioner if at all
 possible.

 What ingestion rates are you expecting, in size and speed.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:05 PM, Kevin Burton bur...@spinn3r.com wrote:


 Thanks for the feedback on this btw.. .it's helpful.  My notes below.

 On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark co...@clark.ws wrote:

 No, you're not-the partition key will get distributed across the cluster
 if you're using random or murmur.


 Yes… I'm aware.  But in practice this is how it will work…

 If we create bucket b0, that will get hashed to h0…

 So say I have 50 machines performing writes, they are all on the same time
 thanks to ntpd, so they all compute b0 for the current bucket based on the
 time.

 That gets hashed to h0…

 If h0 is hosted on node0 … then all writes go to node zero for that 1
 second interval.

 So all my writes are bottlenecking on one node.  That node is *changing*
 over time… but they're not being dispatched in parallel over N nodes.  At
 most writes will only ever reach 1 node a time.



 You could also ensure that by adding another column, like source to
 ensure distribution. (Add the seconds to the partition key, not the
 clustering columns)

 I can almost guarantee that if you put too much thought into working
 against what Cassandra offers out of the box, that it will bite you later.


 Sure.. I'm trying to avoid the 'bite you later' issues. More so because
 I'm sure there are Cassandra gotchas to worry about.  Everything has them.
  Just trying to avoid the land mines :-P


 In fact, the use case that you're describing may best be served by a
 queuing mechanism, and using Cassandra only for the underlying store.


 Yes… that's what I'm doing.  We're using apollo to fan out the queue, but
 the writes go back into cassandra and needs to be read out sequentially.



 I used this exact same approach in a use case that involved writing over
 a million events/second to a cluster with no problems.  Initially, I
 thought ordered partitioner was the way to go too.  And I used separate
 processes to aggregate, conflate, and handle distribution to clients.



 Yes. I think using 100 buckets will work for now.  Plus I don't have to
 change the partitioner on our existing cluster and I'm lazy :)



 Just my two cents, but I also spend the majority of my days helping
 people utilize Cassandra correctly, and rescuing those that haven't.


 Definitely appreciate the feedback!  Thanks!

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Object mapper for CQL

2014-06-07 Thread Kevin Burton
Looks like the java-driver is working on an object mapper:

More modules including a simple object mapper will come shortly.
But of course I need one now …
I'm curious what others are doing here.

I don't want to pass around Row objects in my code if I can avoid it..
Ideally I would just run a query and get back a POJO.

Another issue is how these POJOs are generated.  Are they generated from
the schema?  Is the schema generated from the POJOs?  From a side file?

And granted, there are existing ORMs out there but I don't think any
support CQL.

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin Clark
With 100 nodes, that ingestion rate is actually quite low and I don't think
you'd need another column in the partition key.

You seem to be set in your current direction.  Let us know how it works out.

--
Colin
320-221-9531


On Jun 7, 2014, at 9:18 PM, Kevin Burton bur...@spinn3r.com wrote:

What's 'source' ? You mean like the URL?

If source too random it's going to yield too many buckets.

Ingestion rates are fairly high but not insane.  About 4M inserts per
hour.. from 5-10GB…


On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark co...@clark.ws wrote:

 Not if you add another column to the partition key; source for example.

 I would really try to stay away from the ordered partitioner if at all
 possible.

 What ingestion rates are you expecting, in size and speed.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:05 PM, Kevin Burton bur...@spinn3r.com wrote:


 Thanks for the feedback on this btw.. .it's helpful.  My notes below.

 On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark co...@clark.ws wrote:

 No, you're not-the partition key will get distributed across the cluster
 if you're using random or murmur.


 Yes… I'm aware.  But in practice this is how it will work…

 If we create bucket b0, that will get hashed to h0…

 So say I have 50 machines performing writes, they are all on the same time
 thanks to ntpd, so they all compute b0 for the current bucket based on the
 time.

 That gets hashed to h0…

 If h0 is hosted on node0 … then all writes go to node zero for that 1
 second interval.

 So all my writes are bottlenecking on one node.  That node is *changing*
 over time… but they're not being dispatched in parallel over N nodes.  At
 most writes will only ever reach 1 node a time.



 You could also ensure that by adding another column, like source to
 ensure distribution. (Add the seconds to the partition key, not the
 clustering columns)

 I can almost guarantee that if you put too much thought into working
 against what Cassandra offers out of the box, that it will bite you later.


 Sure.. I'm trying to avoid the 'bite you later' issues. More so because
 I'm sure there are Cassandra gotchas to worry about.  Everything has them.
  Just trying to avoid the land mines :-P


 In fact, the use case that you're describing may best be served by a
 queuing mechanism, and using Cassandra only for the underlying store.


 Yes… that's what I'm doing.  We're using apollo to fan out the queue, but
 the writes go back into cassandra and needs to be read out sequentially.



 I used this exact same approach in a use case that involved writing over
 a million events/second to a cluster with no problems.  Initially, I
 thought ordered partitioner was the way to go too.  And I used separate
 processes to aggregate, conflate, and handle distribution to clients.



 Yes. I think using 100 buckets will work for now.  Plus I don't have to
 change the partitioner on our existing cluster and I'm lazy :)



 Just my two cents, but I also spend the majority of my days helping
 people utilize Cassandra correctly, and rescuing those that haven't.


 Definitely appreciate the feedback!  Thanks!

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
Oh.. To start with we're going to use 2-10 nodes..

I think we're going to take the original strategy and just to use 100
buckets .. 0-99… then the timestamp under that..  I think it should be fine
and won't require an ordered partitioner. :)

Thanks!


On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark co...@clark.ws wrote:

 With 100 nodes, that ingestion rate is actually quite low and I don't
 think you'd need another column in the partition key.

 You seem to be set in your current direction.  Let us know how it works
 out.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:18 PM, Kevin Burton bur...@spinn3r.com wrote:

 What's 'source' ? You mean like the URL?

 If source too random it's going to yield too many buckets.

 Ingestion rates are fairly high but not insane.  About 4M inserts per
 hour.. from 5-10GB…


 On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark co...@clark.ws wrote:

 Not if you add another column to the partition key; source for example.

 I would really try to stay away from the ordered partitioner if at all
 possible.

 What ingestion rates are you expecting, in size and speed.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:05 PM, Kevin Burton bur...@spinn3r.com wrote:


 Thanks for the feedback on this btw.. .it's helpful.  My notes below.

 On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark co...@clark.ws wrote:

 No, you're not-the partition key will get distributed across the cluster
 if you're using random or murmur.


 Yes… I'm aware.  But in practice this is how it will work…

 If we create bucket b0, that will get hashed to h0…

 So say I have 50 machines performing writes, they are all on the same
 time thanks to ntpd, so they all compute b0 for the current bucket based on
 the time.

 That gets hashed to h0…

 If h0 is hosted on node0 … then all writes go to node zero for that 1
 second interval.

 So all my writes are bottlenecking on one node.  That node is *changing*
 over time… but they're not being dispatched in parallel over N nodes.  At
 most writes will only ever reach 1 node a time.



 You could also ensure that by adding another column, like source to
 ensure distribution. (Add the seconds to the partition key, not the
 clustering columns)

 I can almost guarantee that if you put too much thought into working
 against what Cassandra offers out of the box, that it will bite you later.


 Sure.. I'm trying to avoid the 'bite you later' issues. More so because
 I'm sure there are Cassandra gotchas to worry about.  Everything has them.
  Just trying to avoid the land mines :-P


 In fact, the use case that you're describing may best be served by a
 queuing mechanism, and using Cassandra only for the underlying store.


 Yes… that's what I'm doing.  We're using apollo to fan out the queue, but
 the writes go back into cassandra and needs to be read out sequentially.



 I used this exact same approach in a use case that involved writing over
 a million events/second to a cluster with no problems.  Initially, I
 thought ordered partitioner was the way to go too.  And I used separate
 processes to aggregate, conflate, and handle distribution to clients.



 Yes. I think using 100 buckets will work for now.  Plus I don't have to
 change the partitioner on our existing cluster and I'm lazy :)



 Just my two cents, but I also spend the majority of my days helping
 people utilize Cassandra correctly, and rescuing those that haven't.


 Definitely appreciate the feedback!  Thanks!

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
Skype: *burtonator*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
https://plus.google.com/102718274791889610666/posts
http://spinn3r.com
War is peace. Freedom is slavery. Ignorance is strength. Corporations are
people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin
To have any redundancy in the system, start with at least 3 nodes and a 
replication factor of 3.

Try to have at least 8 cores, 32 GB of RAM, and separate disks for the commit log and data.

Will you be replicating data across data centers?

--
Colin
320-221-9531


 On Jun 7, 2014, at 9:40 PM, Kevin Burton bur...@spinn3r.com wrote:
 
 Oh.. To start with we're going to use from 2-10 nodes.. 
 
 I think we're going to take the original strategy and just to use 100 buckets 
 .. 0-99… then the timestamp under that..  I think it should be fine and won't 
 require an ordered partitioner. :)
 
 Thanks!
 
 
 On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark co...@clark.ws wrote:
 With 100 nodes, that ingestion rate is actually quite low and I don't think 
 you'd need another column in the partition key.
 
 You seem to be set in your current direction.  Let us know how it works out.
 
 --
 Colin
 320-221-9531
 
 
 On Jun 7, 2014, at 9:18 PM, Kevin Burton bur...@spinn3r.com wrote:
 
 What's 'source' ? You mean like the URL?
 
 If source too random it's going to yield too many buckets.  
 
 Ingestion rates are fairly high but not insane.  About 4M inserts per 
 hour.. from 5-10GB… 
 
 
 On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark co...@clark.ws wrote:
 Not if you add another column to the partition key; source for example.  
 
 I would really try to stay away from the ordered partitioner if at all 
 possible.
 
 What ingestion rates are you expecting, in size and speed.
 
 --
 Colin
 320-221-9531
 
 
 On Jun 7, 2014, at 9:05 PM, Kevin Burton bur...@spinn3r.com wrote:
 
 
 Thanks for the feedback on this btw.. .it's helpful.  My notes below.
 
 On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark co...@clark.ws wrote:
 No, you're not-the partition key will get distributed across the cluster 
 if you're using random or murmur.
 
 Yes… I'm aware.  But in practice this is how it will work…
 
 If we create bucket b0, that will get hashed to h0…
 
 So say I have 50 machines performing writes, they are all on the same 
 time thanks to ntpd, so they all compute b0 for the current bucket based 
 on the time.
 
 That gets hashed to h0…
 
 If h0 is hosted on node0 … then all writes go to node zero for that 1 
 second interval.
 
 So all my writes are bottlenecking on one node.  That node is *changing* 
 over time… but they're not being dispatched in parallel over N nodes.  At 
 most writes will only ever reach 1 node a time.
 
  
 You could also ensure that by adding another column, like source to 
 ensure distribution. (Add the seconds to the partition key, not the 
 clustering columns)
 
 I can almost guarantee that if you put too much thought into working 
 against what Cassandra offers out of the box, that it will bite you 
 later.
 
 Sure.. I'm trying to avoid the 'bite you later' issues. More so because 
 I'm sure there are Cassandra gotchas to worry about.  Everything has 
 them.  Just trying to avoid the land mines :-P
  
 In fact, the use case that you're describing may best be served by a 
 queuing mechanism, and using Cassandra only for the underlying store.
 
 Yes… that's what I'm doing.  We're using apollo to fan out the queue, but 
 the writes go back into cassandra and needs to be read out sequentially.
  
 
 I used this exact same approach in a use case that involved writing over 
 a million events/second to a cluster with no problems.  Initially, I 
 thought ordered partitioner was the way to go too.  And I used separate 
 processes to aggregate, conflate, and handle distribution to clients.
 
 
 Yes. I think using 100 buckets will work for now.  Plus I don't have to 
 change the partitioner on our existing cluster and I'm lazy :)
  
 
 Just my two cents, but I also spend the majority of my days helping 
 people utilize Cassandra correctly, and rescuing those that haven't.
 
 Definitely appreciate the feedback!  Thanks!
  
 -- 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 Skype: burtonator
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
 people.
 
 
 
 -- 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 Skype: burtonator
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
 people.
 
 
 
 -- 
 Founder/CEO Spinn3r.com
 Location: San Francisco, CA
 Skype: burtonator
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are 
 people.


Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
Right now I'm just putting everything together as a proof of concept… so
just two cheap replicas for now.  And it's at 1/1th of the load.

If we lose data it's ok :)

I think our config will be 2-3x 400GB SSDs in RAID0, 3 replicas, 16 cores,
probably 48-64GB of RAM each box.

Just one datacenter for now…

We're probably going to be migrating to using linux containers at some
point.  This way we can have like 16GB, one 400GB SSD, 4 cores for each
image.  And we can ditch the RAID, which is nice. :)


On Sat, Jun 7, 2014 at 7:51 PM, Colin colpcl...@gmail.com wrote:

 To have any redundancy in the system, start with at least 3 nodes and a
 replication factor of 3.

 Try to have at least 8 cores, 32 gig ram, and separate disks for log and
 data.

 Will you be replicating data across data centers?

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:40 PM, Kevin Burton bur...@spinn3r.com wrote:

 Oh.. To start with we're going to use from 2-10 nodes..

 I think we're going to take the original strategy and just to use 100
 buckets .. 0-99… then the timestamp under that..  I think it should be fine
 and won't require an ordered partitioner. :)

 Thanks!


 On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark co...@clark.ws wrote:

 With 100 nodes, that ingestion rate is actually quite low and I don't
 think you'd need another column in the partition key.

 You seem to be set in your current direction.  Let us know how it works
 out.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:18 PM, Kevin Burton bur...@spinn3r.com wrote:

 What's 'source' ? You mean like the URL?

 If source too random it's going to yield too many buckets.

 Ingestion rates are fairly high but not insane.  About 4M inserts per
 hour.. from 5-10GB…


 On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark co...@clark.ws wrote:

 Not if you add another column to the partition key; source for example.

 I would really try to stay away from the ordered partitioner if at all
 possible.

 What ingestion rates are you expecting, in size and speed.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:05 PM, Kevin Burton bur...@spinn3r.com wrote:


 Thanks for the feedback on this btw.. .it's helpful.  My notes below.

 On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark co...@clark.ws wrote:

 No, you're not-the partition key will get distributed across the
 cluster if you're using random or murmur.


 Yes… I'm aware.  But in practice this is how it will work…

 If we create bucket b0, that will get hashed to h0…

 So say I have 50 machines performing writes, they are all on the same
 time thanks to ntpd, so they all compute b0 for the current bucket based on
 the time.

 That gets hashed to h0…

 If h0 is hosted on node0 … then all writes go to node zero for that 1
 second interval.

 So all my writes are bottlenecking on one node.  That node is *changing*
 over time… but they're not being dispatched in parallel over N nodes.  At
 most writes will only ever reach 1 node a time.



 You could also ensure that by adding another column, like source to
 ensure distribution. (Add the seconds to the partition key, not the
 clustering columns)

 I can almost guarantee that if you put too much thought into working
 against what Cassandra offers out of the box, that it will bite you later.


 Sure.. I'm trying to avoid the 'bite you later' issues. More so because
 I'm sure there are Cassandra gotchas to worry about.  Everything has them.
  Just trying to avoid the land mines :-P


 In fact, the use case that you're describing may best be served by a
 queuing mechanism, and using Cassandra only for the underlying store.


 Yes… that's what I'm doing.  We're using apollo to fan out the queue,
 but the writes go back into cassandra and needs to be read out sequentially.



 I used this exact same approach in a use case that involved writing
 over a million events/second to a cluster with no problems.  Initially, I
 thought ordered partitioner was the way to go too.  And I used separate
 processes to aggregate, conflate, and handle distribution to clients.



 Yes. I think using 100 buckets will work for now.  Plus I don't have to
 change the partitioner on our existing cluster and I'm lazy :)



 Just my two cents, but I also spend the majority of my days helping
 people utilize Cassandra correctly, and rescuing those that haven't.


 Definitely appreciate the feedback!  Thanks!

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations
 are people.




 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations 

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin Clark
Write Consistency Level + Read Consistency Level > Replication Factor
ensures your reads will read consistently, and having 3 nodes lets you
achieve redundancy in the event of node failure.

So writing with CL of local quorum and reading with CL of local quorum
(2+2 > 3) with replication factor of 3 ensures consistent reads and
protection against losing a node.

In the event of losing a node, you can downgrade the CL automatically and
accept a little eventual consistency.
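
A rough illustration of that rule with the DataStax Java driver; the contact
point, keyspace, and statement are illustrative, and
DowngradingConsistencyRetryPolicy is the driver's built-in policy for the
"downgrade the CL automatically" behaviour mentioned above:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ConsistencyLevel;
import com.datastax.driver.core.Session;
import com.datastax.driver.core.SimpleStatement;
import com.datastax.driver.core.Statement;
import com.datastax.driver.core.policies.DowngradingConsistencyRetryPolicy;

public class QuorumSketch {
    public static void main(String[] args) {
        // RF = 3, read and write at LOCAL_QUORUM (2 each): 2 + 2 > 3, so every
        // read overlaps at least one replica that has seen the latest write.
        Cluster cluster = Cluster.builder()
            .addContactPoint("127.0.0.1")
            // If quorum can't be met (e.g. a node is down), retry at a lower CL
            // and accept a little eventual consistency, as described above.
            .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
            .build();
        Session session = cluster.connect();

        Statement write = new SimpleStatement(
            "INSERT INTO feed.content (bucket, ts, body) VALUES (0, now(), 0x00)")
            .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
        session.execute(write);

        cluster.close();
    }
}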


--
Colin
320-221-9531


On Jun 7, 2014, at 10:03 PM, James Campbell ja...@breachintelligence.com
wrote:

 This is a basic question, but having heard that advice before, I'm curious
why the minimum recommended replication factor is three. Certainly it adds
redundancy and, I believe, meets a minimum threshold for Paxos. Are there
other reasons?
On Jun 7, 2014 10:52 PM, Colin colpcl...@gmail.com wrote:
 To have any redundancy in the system, start with at least 3 nodes and a
replication factor of 3.

 Try to have at least 8 cores, 32 gig ram, and separate disks for log and
data.

 Will you be replicating data across data centers?

-- 
Colin
320-221-9531


On Jun 7, 2014, at 9:40 PM, Kevin Burton bur...@spinn3r.com wrote:

  Oh.. To start with we're going to use from 2-10 nodes..

 I think we're going to take the original strategy and just to use 100
buckets .. 0-99… then the timestamp under that..  I think it should be fine
and won't require an ordered partitioner. :)

 Thanks!


On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark co...@clark.ws wrote:

  With 100 nodes, that ingestion rate is actually quite low and I don't
 think you'd need another column in the partition key.

  You seem to be set in your current direction.  Let us know how it works
 out.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:18 PM, Kevin Burton bur...@spinn3r.com wrote:

   What's 'source' ? You mean like the URL?

  If source too random it's going to yield too many buckets.

  Ingestion rates are fairly high but not insane.  About 4M inserts per
 hour.. from 5-10GB…


 On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark co...@clark.ws wrote:

  Not if you add another column to the partition key; source for example.


  I would really try to stay away from the ordered partitioner if at all
 possible.

  What ingestion rates are you expecting, in size and speed.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:05 PM, Kevin Burton bur...@spinn3r.com wrote:


  Thanks for the feedback on this btw.. .it's helpful.  My notes below.

 On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark co...@clark.ws wrote:

  No, you're not-the partition key will get distributed across the
 cluster if you're using random or murmur.


  Yes… I'm aware.  But in practice this is how it will work…

  If we create bucket b0, that will get hashed to h0…

  So say I have 50 machines performing writes, they are all on the same
 time thanks to ntpd, so they all compute b0 for the current bucket based on
 the time.

  That gets hashed to h0…

  If h0 is hosted on node0 … then all writes go to node zero for that 1
 second interval.

  So all my writes are bottlenecking on one node.  That node is
 *changing* over time… but they're not being dispatched in parallel over N
 nodes.  At most writes will only ever reach 1 node a time.



  You could also ensure that by adding another column, like source to
 ensure distribution. (Add the seconds to the partition key, not the
 clustering columns)

  I can almost guarantee that if you put too much thought into working
 against what Cassandra offers out of the box, that it will bite you later.


  Sure.. I'm trying to avoid the 'bite you later' issues. More so because
 I'm sure there are Cassandra gotchas to worry about.  Everything has them.
  Just trying to avoid the land mines :-P


  In fact, the use case that you're describing may best be served by a
 queuing mechanism, and using Cassandra only for the underlying store.


  Yes… that's what I'm doing.  We're using apollo to fan out the queue,
 but the writes go back into cassandra and needs to be read out sequentially.



  I used this exact same approach in a use case that involved writing
 over a million events/second to a cluster with no problems.  Initially, I
 thought ordered partitioner was the way to go too.  And I used separate
 processes to aggregate, conflate, and handle distribution to clients.



  Yes. I think using 100 buckets will work for now.  Plus I don't have to
 change the partitioner on our existing cluster and I'm lazy :)



  Just my two cents, but I also spend the majority of my days helping
 people utilize Cassandra correctly, and rescuing those that haven't.


  Definitely appreciate the feedback!  Thanks!

  --

  Founder/CEO Spinn3r.com
  Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
  http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. 

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Colin Clark
You won't need containers - running one instance of Cassandra in that
configuration will hum along quite nicely and will make use of the cores
and memory.

I'd forget the RAID anyway and just mount the disks separately (JBOD).

--
Colin
320-221-9531


On Jun 7, 2014, at 10:02 PM, Kevin Burton bur...@spinn3r.com wrote:

Right now I'm just putting everything together as a proof of concept… so
just two cheap replicas for now.  And it's at 1/1th of the load.

If we lose data it's ok :)

I think our config will be 2-3x 400GB SSDs in RAID0 , 3 replicas, 16 cores,
probably 48-64GB of RAM each box.

Just one datacenter for now…

We're probably going to be migrating to using linux containers at some
point.  This way we can have like 16GB , one 400GB SSD, 4 cores for each
image.  And we can ditch the RAID which is nice. :)


On Sat, Jun 7, 2014 at 7:51 PM, Colin colpcl...@gmail.com wrote:

 To have any redundancy in the system, start with at least 3 nodes and a
 replication factor of 3.

 Try to have at least 8 cores, 32 gig ram, and separate disks for log and
 data.

 Will you be replicating data across data centers?

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:40 PM, Kevin Burton bur...@spinn3r.com wrote:

 Oh.. To start with we're going to use from 2-10 nodes..

 I think we're going to take the original strategy and just to use 100
 buckets .. 0-99… then the timestamp under that..  I think it should be fine
 and won't require an ordered partitioner. :)

 Thanks!


 On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark co...@clark.ws wrote:

 With 100 nodes, that ingestion rate is actually quite low and I don't
 think you'd need another column in the partition key.

 You seem to be set in your current direction.  Let us know how it works
 out.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:18 PM, Kevin Burton bur...@spinn3r.com wrote:

 What's 'source' ? You mean like the URL?

 If source too random it's going to yield too many buckets.

 Ingestion rates are fairly high but not insane.  About 4M inserts per
 hour.. from 5-10GB…


 On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark co...@clark.ws wrote:

 Not if you add another column to the partition key; source for example.

 I would really try to stay away from the ordered partitioner if at all
 possible.

 What ingestion rates are you expecting, in size and speed.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:05 PM, Kevin Burton bur...@spinn3r.com wrote:


 Thanks for the feedback on this btw.. .it's helpful.  My notes below.

 On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark co...@clark.ws wrote:

 No, you're not-the partition key will get distributed across the
 cluster if you're using random or murmur.


 Yes… I'm aware.  But in practice this is how it will work…

 If we create bucket b0, that will get hashed to h0…

 So say I have 50 machines performing writes, they are all on the same
 time thanks to ntpd, so they all compute b0 for the current bucket based on
 the time.

 That gets hashed to h0…

 If h0 is hosted on node0 … then all writes go to node zero for that 1
 second interval.

 So all my writes are bottlenecking on one node.  That node is *changing*
 over time… but they're not being dispatched in parallel over N nodes.  At
 most writes will only ever reach 1 node a time.



 You could also ensure that by adding another column, like source to
 ensure distribution. (Add the seconds to the partition key, not the
 clustering columns)

 I can almost guarantee that if you put too much thought into working
 against what Cassandra offers out of the box, that it will bite you later.


 Sure.. I'm trying to avoid the 'bite you later' issues. More so because
 I'm sure there are Cassandra gotchas to worry about.  Everything has them.
  Just trying to avoid the land mines :-P


 In fact, the use case that you're describing may best be served by a
 queuing mechanism, and using Cassandra only for the underlying store.


 Yes… that's what I'm doing.  We're using apollo to fan out the queue,
 but the writes go back into cassandra and needs to be read out sequentially.



 I used this exact same approach in a use case that involved writing
 over a million events/second to a cluster with no problems.  Initially, I
 thought ordered partitioner was the way to go too.  And I used separate
 processes to aggregate, conflate, and handle distribution to clients.



 Yes. I think using 100 buckets will work for now.  Plus I don't have to
 change the partitioner on our existing cluster and I'm lazy :)



 Just my two cents, but I also spend the majority of my days helping
 people utilize Cassandra correctly, and rescuing those that haven't.


 Definitely appreciate the feedback!  Thanks!

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. 

Re: Data model for streaming a large table in real time.

2014-06-07 Thread Kevin Burton
We're using containers for other reasons, not just Cassandra.

Tightly constraining resources means we don't have to worry about Cassandra,
the JVM, or Linux doing something silly, using too many resources, and
taking down the whole box.


On Sat, Jun 7, 2014 at 8:25 PM, Colin Clark co...@clark.ws wrote:

 You won't need containers - running one instance of Cassandra in that
 configuration will hum along quite nicely and will make use of the cores
 and memory.

 I'd forget the raid anyway and just mount the disks separately (jbod)

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 10:02 PM, Kevin Burton bur...@spinn3r.com wrote:

 Right now I'm just putting everything together as a proof of concept… so
 just two cheap replicas for now.  And it's at 1/1th of the load.

 If we lose data it's ok :)

 I think our config will be 2-3x 400GB SSDs in RAID0 , 3 replicas, 16
 cores, probably 48-64GB of RAM each box.

 Just one datacenter for now…

 We're probably going to be migrating to using linux containers at some
 point.  This way we can have like 16GB , one 400GB SSD, 4 cores for each
 image.  And we can ditch the RAID which is nice. :)


 On Sat, Jun 7, 2014 at 7:51 PM, Colin colpcl...@gmail.com wrote:

 To have any redundancy in the system, start with at least 3 nodes and a
 replication factor of 3.

 Try to have at least 8 cores, 32 gig ram, and separate disks for log and
 data.

 Will you be replicating data across data centers?

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:40 PM, Kevin Burton bur...@spinn3r.com wrote:

 Oh.. To start with we're going to use from 2-10 nodes..

 I think we're going to take the original strategy and just to use 100
 buckets .. 0-99… then the timestamp under that..  I think it should be fine
 and won't require an ordered partitioner. :)

 Thanks!


 On Sat, Jun 7, 2014 at 7:38 PM, Colin Clark co...@clark.ws wrote:

 With 100 nodes, that ingestion rate is actually quite low and I don't
 think you'd need another column in the partition key.

 You seem to be set in your current direction.  Let us know how it works
 out.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:18 PM, Kevin Burton bur...@spinn3r.com wrote:

 What's 'source' ? You mean like the URL?

 If source too random it's going to yield too many buckets.

 Ingestion rates are fairly high but not insane.  About 4M inserts per
 hour.. from 5-10GB…


 On Sat, Jun 7, 2014 at 7:13 PM, Colin Clark co...@clark.ws wrote:

 Not if you add another column to the partition key; source for example.


 I would really try to stay away from the ordered partitioner if at all
 possible.

 What ingestion rates are you expecting, in size and speed.

 --
 Colin
 320-221-9531


 On Jun 7, 2014, at 9:05 PM, Kevin Burton bur...@spinn3r.com wrote:


 Thanks for the feedback on this btw.. .it's helpful.  My notes below.

 On Sat, Jun 7, 2014 at 5:14 PM, Colin Clark co...@clark.ws wrote:

 No, you're not-the partition key will get distributed across the
 cluster if you're using random or murmur.


 Yes… I'm aware.  But in practice this is how it will work…

 If we create bucket b0, that will get hashed to h0…

 So say I have 50 machines performing writes, they are all on the same
 time thanks to ntpd, so they all compute b0 for the current bucket based on
 the time.

 That gets hashed to h0…

 If h0 is hosted on node0 … then all writes go to node zero for that 1
 second interval.

 So all my writes are bottlenecking on one node.  That node is
 *changing* over time… but they're not being dispatched in parallel over N
 nodes.  At most writes will only ever reach 1 node a time.



 You could also ensure that by adding another column, like source to
 ensure distribution. (Add the seconds to the partition key, not the
 clustering columns)

 I can almost guarantee that if you put too much thought into working
 against what Cassandra offers out of the box, that it will bite you later.


 Sure.. I'm trying to avoid the 'bite you later' issues. More so because
 I'm sure there are Cassandra gotchas to worry about.  Everything has them.
  Just trying to avoid the land mines :-P


 In fact, the use case that you're describing may best be served by a
 queuing mechanism, and using Cassandra only for the underlying store.


 Yes… that's what I'm doing.  We're using apollo to fan out the queue,
 but the writes go back into cassandra and needs to be read out 
 sequentially.



 I used this exact same approach in a use case that involved writing
 over a million events/second to a cluster with no problems.  Initially, I
 thought ordered partitioner was the way to go too.  And I used separate
 processes to aggregate, conflate, and handle distribution to clients.



 Yes. I think using 100 buckets will work for now.  Plus I don't have to
 change the partitioner on our existing cluster and I'm lazy :)



 Just my two cents, but I also spend the majority of my days helping
 people utilize Cassandra correctly, and rescuing those that haven't.


 Definitely 

Re: Object mapper for CQL

2014-06-07 Thread Kuldeep Mishra
There is a high-level Java client for Cassandra that supports CQL: Kundera.
You can find it here: https://github.com/impetus-opensource/Kundera

Other useful links:
https://github.com/impetus-opensource/Kundera/wiki/Getting-Started-in-5-minutes
https://github.com/impetus-opensource/Kundera/wiki/Object-mapper

How to use CQL:
https://github.com/impetus-opensource/Kundera/wiki/Cassandra-Specific-Features

I hope this helps.
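
Since Kundera is JPA-compliant, the object mapping looks roughly like a
standard JPA entity.  A bare-bones sketch (the entity, columns, and the
"cassandra_pu" persistence unit are all made up for illustration; the
Cassandra-specific wiring goes in persistence.xml, which the wiki links
above cover):

import javax.persistence.Column;
import javax.persistence.Entity;
import javax.persistence.EntityManager;
import javax.persistence.EntityManagerFactory;
import javax.persistence.Id;
import javax.persistence.Persistence;
import javax.persistence.Table;

@Entity
@Table(name = "users")
public class User {
    @Id
    private String id;

    @Column(name = "first_name")
    private String firstName;

    public void setId(String id) { this.id = id; }
    public void setFirstName(String firstName) { this.firstName = firstName; }

    public static void main(String[] args) {
        // "cassandra_pu" is a hypothetical persistence unit declared in persistence.xml.
        EntityManagerFactory emf = Persistence.createEntityManagerFactory("cassandra_pu");
        EntityManager em = emf.createEntityManager();

        User u = new User();
        u.setId("u1");
        u.setFirstName("Kevin");
        em.persist(u);                           // mapped to an INSERT

        User loaded = em.find(User.class, "u1"); // mapped to a SELECT by primary key
        System.out.println(loaded.firstName);

        em.close();
        emf.close();
    }
}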





On Sun, Jun 8, 2014 at 7:53 AM, Kevin Burton bur...@spinn3r.com wrote:

 Looks like the java-driver is working on an object mapper:

  "More modules including a simple object mapper will come shortly."
 But of course I need one now …
 I'm curious what others are doing here.

 I don't want to pass around Row objects in my code if I can avoid it..
 Ideally I would just run a query and get back a POJO.

 Another issue is how are these POJOs generated.  Are they generated from
 the schema?  is the schema generated from the POJOs ?  From a side file?

 And granted, there are existing ORMs out there but I don't think any
 support CQL.

 --

 Founder/CEO Spinn3r.com
 Location: *San Francisco, CA*
 Skype: *burtonator*
 blog: http://burtonator.wordpress.com
 … or check out my Google+ profile
 https://plus.google.com/102718274791889610666/posts
 http://spinn3r.com
 War is peace. Freedom is slavery. Ignorance is strength. Corporations are
 people.




-- 
Thanks and Regards
Kuldeep Kumar Mishra
+919540965199