Re: Hackathon?!?
Ack, I agreed to speak at http://nosqleu.com/; I never did hear a final date, but they put up a schedule online (April 20-22). But the 22nd probably is a better date, and Eric and Stu are fully capable of representing Rackspace without me. :) -Jonathan

On Wed, Mar 10, 2010 at 10:50 PM, Chris Goffinet goffi...@digg.com wrote: We could do it on April 22 (1 week later), that's my birthday :-) What better way to celebrate haha. -Chris

On Mar 10, 2010, at 9:58 AM, Jonathan Ellis wrote: I'm in either way, but if we push it a week later then the twitter guys could (a) make it and (b) pimp it at their own conference.

On Wed, Mar 10, 2010 at 12:26 AM, Jeff Hodges jhod...@twitter.com wrote: Ah, hell. Thought this was the first day. Can't make it. -- Jeff

On Mar 9, 2010 9:32 PM, Ryan King r...@twitter.com wrote: I'm already committed to talking about cassandra that day at our company's developer conference (chirp.twitter.com). -ryan

On Tue, Mar 9, 2010 at 6:26 PM, Jeff Hodges jhod...@twitter.com wrote: I'm down. -- Jeff ...
Re: problem with running simple example using cassandra-cli with 0.6.0-beta2
Yes, I was expecting the column names to come back as strings the way they do with 0.5.1. Bill

On Thu, Mar 11, 2010 at 12:03 AM, Jonathan Ellis jbel...@gmail.com wrote: I think he means how the column names are rendered as bytes but the values are strings.

On Wed, Mar 10, 2010 at 5:22 PM, Brandon Williams dri...@gmail.com wrote: On Wed, Mar 10, 2010 at 5:09 PM, Bill Au bill.w...@gmail.com wrote: I am checking out 0.6.0-beta2 since I need the batch-mutate function. I am just trying to run the example in the cassandra-cli wiki: http://wiki.apache.org/cassandra/CassandraCli Here is what I am getting:

cassandra> set Keyspace1.Standard1['jsmith']['first'] = 'John'
Value inserted.
cassandra> get Keyspace1.Standard1['jsmith']
=> (column=6669727374, value=John, timestamp=1268261785077)
Returned 1 results.

The column name being returned by get (6669727374) does not match what was set (first). This is true for all column names.

cassandra> set Keyspace1.Standard1['jsmith']['last'] = 'Smith'
Value inserted.
cassandra> set Keyspace1.Standard1['jsmith']['age'] = '42'
Value inserted.
cassandra> get Keyspace1.Standard1['jsmith']
=> (column=6c617374, value=Smith, timestamp=1268262480130)
=> (column=6669727374, value=John, timestamp=1268261785077)
=> (column=616765, value=42, timestamp=1268262484133)
Returned 3 results.

Is this a problem in 0.6.0-beta2 or am I doing anything wrong? Bill

This is normal. You've added the 'first', 'last', and 'age' columns to the 'jsmith' row, and then asked for the entire row, so you got all 3 columns back. -Brandon
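[Editor's note] The hex strings in the CLI output are just the raw bytes of the column names, so they decode back to exactly the strings that were set. A quick illustration in plain Python (not part of the CLI itself):

```python
# Decode the hex-rendered column names from the CLI output back to strings.
hex_names = ["6669727374", "6c617374", "616765"]
decoded = [bytes.fromhex(h).decode("ascii") for h in hex_names]
print(decoded)  # ['first', 'last', 'age']
```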
Re: cassandra 0.6.0 beta 2 download contains beta 1?
On Wed, 2010-03-10 at 13:39 -0500, Vick Khera wrote: On Wed, Mar 10, 2010 at 11:30 AM, Eric Evans eev...@rackspace.com wrote: apache-cassandra-0.6.0-beta1.jar apache-cassandra-0.6.0-beta2.jar Ugh, my bad. I must have failed to `clean' in between the aborted beta1 and beta2. The beta2 also does not include the other support jar files like log4j. Not being a java person, I didn't know what to do so I just started my experimentation with the 0.5.1 release which has it all bundled. Yes, this is a new feature^H^H^H^H^Hcontroversy in that most of the third-party jars are no longer distributed by us, and must be fetched using `ant ivy-retrieve'. This is currently being disputed, see https://issues.apache.org/jira/browse/CASSANDRA-850 for more on that. For what it's worth, this was documented in both the changelog (CHANGES.txt) and the release notes (NEWS.txt), which you really should be reading. -- Eric Evans eev...@rackspace.com
Re: Effective allocation of multiple disks
On Wed, 2010-03-10 at 23:20 -0600, Jonathan Ellis wrote: On Wed, Mar 10, 2010 at 9:31 PM, Anthony Molinaro antho...@alumni.caltech.edu wrote: I would almost recommend just keeping things simple and removing multiple data directories from the config altogether and just documenting that you should plan on using OS level mechanisms for growing diskspace and io. I think that is a pretty sane suggestion actually. Or maybe leave the code as is and just document the situation more clearly? If you're adding more disks to increase storage capacity and you don't strictly need the extra IO, then multiple data directories might be preferable to other forms of aggregation (it's certainly simpler than say a volume manager). -- Eric Evans eev...@rackspace.com
Re: cassandra 0.6.0 beta 2 download contains beta 1?
On Thu, Mar 11, 2010 at 12:53 PM, Eric Evans eev...@rackspace.com wrote: Yes, this is a new feature^H^H^H^H^Hcontroversy in that most of the third-party jars are no longer distributed by us, and must be fetched using `ant ivy-retrieve'. This is currently being disputed, see https://issues.apache.org/jira/browse/CASSANDRA-850 for more on that. For what it's worth, this was documented in both the changelog (CHANGES.txt) and the release notes (NEWS.txt), which you really should be reading. As a newcomer, I started by reading the wiki and following examples. The quick-start guide failed, so I just backed out of the beta to the released version. The wiki recommends using the beta release to protect against on-disk format changes that may happen. Is it really good to make ant necessary to use the binary distribution? Might as well just stop distributing the binary if you need a developer environment to use it anyway. But now I know what to do to use beta2 so perhaps I'll try again with that. Thanks for the info.
Re: Effective allocation of multiple disks
Except that for a major compaction the whole thing gets put in one directory. That's the problem w/ the JBOD approach. On Thu, Mar 11, 2010 at 12:01 PM, Eric Evans eev...@rackspace.com wrote: On Wed, 2010-03-10 at 23:20 -0600, Jonathan Ellis wrote: On Wed, Mar 10, 2010 at 9:31 PM, Anthony Molinaro antho...@alumni.caltech.edu wrote: I would almost recommend just keeping things simple and removing multiple data directories from the config altogether and just documenting that you should plan on using OS level mechanisms for growing diskspace and io. I think that is a pretty sane suggestion actually. Or maybe leave the code as is and just document the situation more clearly? If you're adding more disks to increase storage capacity and you don't strictly need the extra IO, then multiple data directories might be preferable to other forms of aggregation (it's certainly simpler than say a volume manager). -- Eric Evans eev...@rackspace.com
Re: SuperColumn.getSubColumns() ordering
it's ordered by the column name as determined by the subcolumn comparator you declared in the definition, yes On Thu, Mar 11, 2010 at 12:24 PM, Matteo Caprari matteo.capr...@gmail.com wrote: Hi. If I iterate over SuperColumn.getSubColumn(), do I get columns sorted by the column name? Thanks. -- :Matteo Caprari matteo.capr...@gmail.com
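[Editor's note] For a BytesType-style comparator, "ordered by column name" just means a byte-wise lexicographic sort of the names, regardless of insertion order. A small sketch, with Python's built-in byte-string sort standing in for the comparator (an assumption for illustration; other comparators such as UTF8Type or LongType sort differently):

```python
# Subcolumns come back sorted by name, not by insertion order.
# sorted() on byte strings models a BytesType-style comparator.
inserted_order = [b"first", b"age", b"last"]
returned_order = sorted(inserted_order)
print(returned_order)  # [b'age', b'first', b'last']
```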
client.get_count query
What does this query return? Is there a way to do a range query and get the row count? (e.g. row start = 'TOW', row end = 'TOWZ') Thanks
Re: cassandra 0.6.0 beta 2 download contains beta 1?
On Thu, 2010-03-11 at 13:21 -0500, Vick Khera wrote: As a newcomer, I started by reading the wiki and following examples. The quick-start guide failed, so I just backed out of the beta to the released version. The wiki recommends using the beta release to protect against on-disk format changes that may happen.

Ok, that makes sense. http://wiki.apache.org/cassandra/GettingStarted has generally targeted the current release, so a change like this wouldn't be documented there until after making it into a stable.

Is it really good to make ant necessary to use the binary distribution?

If we're going to get into this in any depth, we should probably start another thread, but as I indicated in the ticket[1], this change was introduced to solve real problems we were having with dependency management. Basically, we traded:

* manual, tedious, and error-prone task(s)
* the legal requirement to document licensing and attribution

For:

* requiring network access (though I seem to be the only one that considers this a major drawback)
* requiring ant to be installed
* one extra prerequisite step (invoking `ant ivy-retrieve')

So what this really boils down to is a question of which is worse, the disease or the cure. Bear in mind though that despite the best of intentions, license and attribution were not being kept up with properly, which results in legal risks that are unacceptable to the ASF. It's also worth pointing out that the dependency tree has seen quite a bit of churn in addition to more than doubling in size (and every indication is that this will continue in the near-term).

Might as well just stop distributing the binary if you need a developer environment to use it anyway.

Maybe we should. :)

[1]: https://issues.apache.org/jira/browse/CASSANDRA-850 -- Eric Evans eev...@rackspace.com
Re: client.get_count query
On Thu, 2010-03-11 at 11:29 -0800, Sonny Heer wrote: What does this query return? It counts the number of columns in a row or super column. Try: http://wiki.apache.org/cassandra/API#get_count Is there a way to do a range query and get the row count? (e.g. row start = 'TOW', row end = 'TOWZ') No, there isn't. -- Eric Evans eev...@rackspace.com
Re: client.get_count query
I suspect you're looking for: https://issues.apache.org/jira/browse/CASSANDRA-653 cheers, jesse -- jesse mcconnell jesse.mcconn...@gmail.com On Thu, Mar 11, 2010 at 13:44, Sonny Heer sonnyh...@gmail.com wrote: Thanks. Are there plans to implement a row count feature? I have a model which doesn't store any columns since I could potentially have a large # of columns. So all the valuable information has been moved into the row key. On Thu, Mar 11, 2010 at 11:38 AM, Eric Evans eev...@rackspace.com wrote: On Thu, 2010-03-11 at 11:29 -0800, Sonny Heer wrote: What does this query return? It counts the number of columns in a row or super column. Try: http://wiki.apache.org/cassandra/API#get_count Is there a way to do a range query and get the row count? (e.g. row start = 'TOW', row end = 'TOWZ') No, there isn't. -- Eric Evans eev...@rackspace.com
Re: client.get_count query
On Thu, 2010-03-11 at 11:44 -0800, Sonny Heer wrote: Thanks. Are there plans to implement a row count feature? Not that I'm aware of. I have a model which doesn't store any columns since I could potentially have a large # of columns. So all the valuable information has been moved into the row key. This doesn't sound right; either there is a problem with your datamodel, or your choice of datastore. How many columns are we talking about here? -- Eric Evans eev...@rackspace.com
Re: Effective allocation of multiple disks
I'm still wondering what happens when you have something like 2 500GB disks, with 2 sstables which use up 250GB, one on each disk, then a major compaction occurs. Will it still compact and probably fill up a disk (especially with the 2x overhead of compaction mentioned either here or on the wiki)? Seems like you could easily get into a situation where you can't fix it without something like a volume manager, or a complete shutdown and move-data-to-a-bigger-disk upgrade.

I guess one way might be to treat each disk as a separate node (ie, give it some fraction of the keyspace based on its disk space); then when you add a directory to the config you would have to load balance, but only within that node. I'm sure that complicates ring maintenance, but maybe it's a better experience, as the multiple data directories should all fill uniformly? Just some other thoughts. -Anthony

On Thu, Mar 11, 2010 at 12:45:14PM -0600, Jonathan Ellis wrote: Except that for a major compaction the whole thing gets put in one directory. That's the problem w/ the JBOD approach. On Thu, Mar 11, 2010 at 12:01 PM, Eric Evans eev...@rackspace.com wrote: On Wed, 2010-03-10 at 23:20 -0600, Jonathan Ellis wrote: On Wed, Mar 10, 2010 at 9:31 PM, Anthony Molinaro antho...@alumni.caltech.edu wrote: I would almost recommend just keeping things simple and removing multiple data directories from the config altogether and just documenting that you should plan on using OS level mechanisms for growing diskspace and io. I think that is a pretty sane suggestion actually. Or maybe leave the code as is and just document the situation more clearly? If you're adding more disks to increase storage capacity and you don't strictly need the extra IO, then multiple data directories might be preferable to other forms of aggregation (it's certainly simpler than say a volume manager). -- Eric Evans eev...@rackspace.com -- Anthony Molinaro antho...@alumni.caltech.edu
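[Editor's note] To make the concern concrete: a major compaction writes its single merged output into one data directory, so that one disk needs room for the whole output even when total free space across disks would suffice. A back-of-the-envelope check (illustration of the arithmetic only, not Cassandra's actual allocation logic; worst case assumes nothing is purged during the merge):

```python
def major_compaction_fits(sstable_sizes_gb, disk_free_gb):
    """Can the single output file of a major compaction land on one disk?
    Worst-case output size = sum of input sstables (nothing expired)."""
    output_gb = sum(sstable_sizes_gb)
    return max(disk_free_gb) >= output_gb

# Anthony's scenario: two 500GB disks, each holding a 250GB sstable,
# so each has ~250GB free, but the merged output needs ~500GB in one place.
print(major_compaction_fits([250, 250], [250, 250]))  # False
```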
Re: Use Case scenario: Keeping a window of data + online analytics
Daniel, Can you provide more information (an example would be very nice) on using batch_mutate deletes to build a time-series store in Cassandra? I have been reading up on batch_mutate from the wiki: http://wiki.apache.org/cassandra/API It seems to me that since the outer map of mutation_map maps key to the inner map, it would be removing old data associated with the keys provided. Is it possible to remove old data based on timestamp for all keys? Is it also possible to remove old keys if there has been no new data associated with them? Bill

On Mon, Mar 8, 2010 at 8:44 AM, Daniel Lundin d...@eintr.org wrote: A few comments on building a time-series store in Cassandra... Using the timestamp dimension of columns, reusing columns, could prove quite useful. This allows simple use of batch_mutate deletes (new in 0.6) to purge old data outside the active time window.
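[Editor's note] For reference, the mutation_map that batch_mutate takes in 0.6 is nested: row key, then column family, then a list of mutations, where a Deletion carries a timestamp that tombstones columns written at or before it. The sketch below uses plain dicts in place of the generated Thrift structs, and the field names are approximations of those structs, so check the Thrift bindings for the exact types:

```python
import time

def purge_mutations(row_keys, column_family, cutoff_ts):
    """Build a batch_mutate-style map that tombstones everything written
    at or before cutoff_ts in each listed row. Plain dicts stand in for
    the Thrift Mutation/Deletion structs (field names approximate)."""
    return {
        key: {
            column_family: [
                # No column predicate: the deletion covers the whole row
                # up to the cutoff timestamp.
                {"deletion": {"timestamp": cutoff_ts}}
            ]
        }
        for key in row_keys
    }

# Purge tweets older than ~30 days (timestamps in microseconds).
cutoff = int(time.time() * 1e6) - 30 * 24 * 3600 * 10**6
mutation_map = purge_mutations(["user1", "user2"], "Tweets", cutoff)
```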
Re: client.get_count query
A lot. In the trillions (where each column name stores the valuable information and column values are empty). I read somewhere that column size should be in the single-digit MBs. Storing it in the key allows true horizontal scalability. Is this true? On Thu, Mar 11, 2010 at 11:59 AM, Eric Evans eev...@rackspace.com wrote: On Thu, 2010-03-11 at 11:44 -0800, Sonny Heer wrote: Thanks. Are there plans to implement a row count feature? Not that I'm aware of. I have a model which doesn't store any columns since I could potentially have a large # of columns. So all the valuable information has been moved into the row key. This doesn't sound right; either there is a problem with your datamodel, or your choice of datastore. How many columns are we talking about here? -- Eric Evans eev...@rackspace.com
Re: Effective allocation of multiple disks
On Thu, Mar 11, 2010 at 10:45 AM, Jonathan Ellis jbel...@gmail.com wrote: Except that for a major compaction the whole thing gets put in one directory. That's the problem w/ the JBOD approach. Even without major compaction, you can get significant imbalances in how much data is on each disk which will bottleneck your IO throughput. We're running JBOD right now, but going to switch to RAID 0 soon. -ryan
question about deleting from cassandra
Let's take Twitter as an example. All the tweets are timestamped. I want to keep only a month's worth of tweets for each user. The number of tweets that fit within this one-month window varies from user to user. What is the best way to accomplish this? There are millions of users. Do I need to loop through all of them and handle the delete one user at a time? Or is there a better way to do this? If a user has not posted a new tweet in more than a month, I also want to remove the user itself. Do I also need to loop through all the users one at a time? Bill
libcassandra - C++ Cassandra Client
We have developed a C++ client library based on the hector Java client for Cassandra that we intend on using for Drizzle integration. This library is still very much alpha and more features will be added while we work on drizzle integration. Connection pooling or failover is currently not implemented but will likely be added in the very near future. The source is available on github at: http://github.com/posulliv/libcassandra -Padraig
Re: libcassandra - C++ Cassandra Client
Cool! On Thu, Mar 11, 2010 at 11:12 PM, Padraig O'Sullivan osullivan.padr...@gmail.com wrote: We have developed a C++ client library based on the hector Java client for Cassandra that we intend on using for Drizzle integration. This library is still very much alpha and more features will be added while we work on drizzle integration. Connection pooling or failover is currently not implemented but will likely be added in the very near future. The source is available on github at: http://github.com/posulliv/libcassandra -Padraig
Re: libcassandra - C++ Cassandra Client
How is Drizzle being integrated with Cassandra? Are there any resources on the Internet that I could read up? Thanks Avinash On Thu, Mar 11, 2010 at 8:12 PM, Padraig O'Sullivan osullivan.padr...@gmail.com wrote: We have developed a C++ client library based on the hector Java client for Cassandra that we intend on using for Drizzle integration. This library is still very much alpha and more features will be added while we work on drizzle integration. Connection pooling or failover is currently not implemented but will likely be added in the very near future. The source is available on github at: http://github.com/posulliv/libcassandra -Padraig
Re: libcassandra - C++ Cassandra Client
On Thu, Mar 11, 2010 at 11:31 PM, Avinash Lakshman avinash.laksh...@gmail.com wrote: How is Drizzle being integrated with Cassandra? Are there any resources on the Internet that I could read up? The idea is to create a storage engine (along with some INFORMATION_SCHEMA tables probably) in drizzle on top of cassandra but we are just starting to think about it so there is no resources available right now. Keep an eye on the blueprints and code branches of drizzle for work on it. Any tasks we are working on, we will likely create a blueprint for it. Blueprints and code branches are available on launchpad - http://launchpad.net/drizzle Thanks Avinash On Thu, Mar 11, 2010 at 8:12 PM, Padraig O'Sullivan osullivan.padr...@gmail.com wrote: We have developed a C++ client library based on the hector Java client for Cassandra that we intend on using for Drizzle integration. This library is still very much alpha and more features will be added while we work on drizzle integration. Connection pooling or failover is currently not implemented but will likely be added in the very near future. The source is available on github at: http://github.com/posulliv/libcassandra -Padraig
Re: libcassandra - C++ Cassandra Client
On Mar 11, 2010, at 10:51 PM, Padraig O'Sullivan osullivan.padr...@gmail.com wrote: On Thu, Mar 11, 2010 at 11:31 PM, Avinash Lakshman avinash.laksh...@gmail.com wrote: How is Drizzle being integrated with Cassandra? Are there any resources on the Internet that I could read up? The idea is to create a storage engine (along with some INFORMATION_SCHEMA tables probably) in drizzle on top of cassandra but we are just starting to think about it so there is no resources available right now. Keep an eye on the blueprints and code branches of drizzle for work on it. Any tasks we are working on, we will likely create a blueprint for it. Blueprints and code branches are available on launchpad - http://launchpad.net/drizzle Thanks Avinash On Thu, Mar 11, 2010 at 8:12 PM, Padraig O'Sullivan osullivan.padr...@gmail.com wrote: We have developed a C++ client library based on the hector Java client for Cassandra that we intend on using for Drizzle integration. This library is still very much alpha and more features will be added while we work on drizzle integration. Connection pooling or failover is currently not implemented but will likely be added in the very near future. The source is available on github at: http://github.com/posulliv/libcassandra -Padraig Interesting. What would be an example use case? Would this be more appropriate for a static set of columns? I can imagine having access to a mysql dialect would make for easy access to some simple orm mapping using existing libraries among other things.
Re: question about deleting from cassandra
On 12 March 2010 03:34, Bill Au bill.w...@gmail.com wrote: Let's take Twitter as an example. All the tweets are timestamped. I want to keep only a month's worth of tweets for each user. The number of tweets that fit within this one month window varies from user to user. What is the best way to accomplish this?

This is the expiry problem that has been discussed on this list before. As far as I can see there are no easy ways to do it with 0.5.

If you use the ordered partitioner and make the first part of the keys a timestamp (or part of it), then you can get the keys and delete them. However, these deletes will be quite inefficient; currently each row must be deleted individually (there was a patch for range delete kicking around, I don't know if it's been accepted yet). But even if range delete is implemented, it's still quite inefficient and not really what you want, and it doesn't work with the RandomPartitioner.

If you have some metadata to say who tweeted within a given period (say 10 days or 30 days) and you store the tweets all in the same key per user per period (say with one column per tweet, or use supercolumns), then you can just delete one key per user per period.

One of the problems with using a time-based key with the ordered partitioner is that you're always going to have a data imbalance, so you may want to try hashing *part* of the key (the first part) so you can still range scan the next part. This may fix load balancing while still enabling you to use range scans to do data expiry. E.g. your key is: hash of day number + user id + timestamp. Then you can range scan the entire day's tweets to expire them, and range scan a given user's tweets for a given day efficiently (and doing this for 30 days is just 30 range scans). Putting a hash in there fixes load balancing with OPP. Mark
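[Editor's note] The key scheme Mark describes can be sketched as follows; the choice of MD5, the prefix length, and the `:`/`;` separators are arbitrary illustration choices, not anything Cassandra prescribes:

```python
import hashlib

def tweet_key(day_number, user_id, ts_micros):
    """Key = hash(day bucket) + user id + timestamp. The hash prefix
    spreads days across the ring under OPP, while all keys for one day
    stay contiguous and range-scannable. Zero-padding the timestamp
    keeps one user's tweets in time order."""
    bucket = hashlib.md5(str(day_number).encode()).hexdigest()[:8]
    return f"{bucket}:{user_id}:{ts_micros:020d}"

def day_scan_range(day_number):
    """Start/end keys covering every tweet in one day bucket
    (';' sorts immediately after ':' in ASCII)."""
    bucket = hashlib.md5(str(day_number).encode()).hexdigest()[:8]
    return bucket + ":", bucket + ";"

# Expiring a day's tweets = one range scan over [start, end).
key = tweet_key(14679, "jsmith", 1268261785077000)
start, end = day_scan_range(14679)
print(start <= key < end)  # True
```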