Re: question about deleting from cassandra
I guess you can also vote for this ticket:
https://issues.apache.org/jira/browse/CASSANDRA-699 :)

--
Sylvain

On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson wrote:
> On 12 March 2010 03:34, Bill Au wrote:
>>
>> Let's take Twitter as an example. All the tweets are timestamped. I want
>> to keep only a month's worth of tweets for each user. The number of tweets
>> that fit within this one-month window varies from user to user. What is the
>> best way to accomplish this?
>
> This is the "expiry" problem that has been discussed on this list before. As
> far as I can see there are no easy ways to do it with 0.5.
>
> If you use the ordered partitioner and make the first part of the keys a
> timestamp (or part of it), then you can get the keys and delete them.
>
> However, these deletes will be quite inefficient: currently each row must be
> deleted individually (there was a patch for range delete kicking around, I
> don't know if it's been accepted yet).
>
> But even if range delete is implemented, it's still quite inefficient and
> not really what you want, and it doesn't work with the RandomPartitioner.
>
> If you have some metadata to say who tweeted within a given period (say 10
> days or 30 days) and you store the tweets all in the same key per user per
> period (say with one column per tweet, or use supercolumns), then you can
> just delete one key per user per period.
>
> One of the problems with using a time-based key with the ordered partitioner
> is that you're always going to have a data imbalance, so you may want to try
> hashing *part* of the key (the first part) so you can still range scan the
> next part. This may fix load balancing while still enabling you to use range
> scans to do data expiry.
>
> e.g. your key is
>
> Hash of day number + user id + timestamp
>
> Then you can range scan the entire day's tweets to expire them, and range
> scan a given user's tweets for a given day efficiently (and doing this for
> 30 days is just 30 range scans).
>
> Putting a hash in there fixes load balancing with OPP.
>
> Mark
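For illustration, here is a minimal Python sketch of the key layout Mark describes (hash of day number + user id + timestamp). Everything concrete in it is an assumption made for the example and not something from the thread or from Cassandra's API: the 8-character MD5 prefix, the ':' separator, the zero-padded timestamp, and the function names. It only shows how such keys would sort and which start/end keys would bound a range scan under an order-preserving partitioner.

```python
import hashlib

SECONDS_PER_DAY = 86400


def day_number(unix_ts):
    """Whole days since the Unix epoch."""
    return unix_ts // SECONDS_PER_DAY


def day_prefix(day):
    """Short, stable hash of the day number (assumed: first 8 hex chars of MD5)."""
    return hashlib.md5(str(day).encode("ascii")).hexdigest()[:8]


def tweet_key(user_id, unix_ts):
    """Build 'hash(day):user:timestamp'. user_id is assumed not to contain ':'."""
    # Zero-padding keeps lexical order equal to chronological order.
    return "%s:%s:%010d" % (day_prefix(day_number(unix_ts)), user_id, unix_ts)


def day_range(day):
    """Start (inclusive) and end (exclusive) keys covering every tweet of a day."""
    prefix = day_prefix(day)
    # ';' is the ASCII character right after ':', so it closes the prefix range.
    return prefix + ":", prefix + ";"


def user_day_range(user_id, day):
    """Start (inclusive) and end (exclusive) keys for one user's tweets on a day."""
    base = day_prefix(day) + ":" + user_id
    return base + ":", base + ";"
```

Because the day number is hashed, consecutive days land in unrelated parts of the key space, which is the load-balancing effect Mark mentions; expiring a day then means scanning `day_range(day)` and deleting what comes back.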
Re: problem with running simple example using cassandra-cli with 0.6.0-beta2
Thanks. With 0.6.0-beta2 using Standard2 does show a human-readable column.

However, the behavior is definitely different between 0.5.1 and 0.6.0-beta2. I am using the binary distribution of 0.5.1:

cassandra> show version
0.5.1
cassandra> set Keyspace1.Standard1['jsmith']['first'] = 'John'
Value inserted.
cassandra> set Keyspace1.Standard1['jsmith']['last'] = 'Smith'
Value inserted.
cassandra> set Keyspace1.Standard1['jsmith']['age'] = '42'
Value inserted.
cassandra> get Keyspace1.Standard1['jsmith']
=> (column=last, value=Smith, timestamp=1268408466548)
=> (column=first, value=John, timestamp=1268408464036)
=> (column=age, value=42, timestamp=1268408468895)
Returned 3 results.

With 0.5.1 using Standard1 does show a human-readable column as documented in the Wiki. Not sure which one is the correct behavior here.

Bill

On Thu, Mar 11, 2010 at 1:22 PM, Eric Evans wrote:
> On Wed, 2010-03-10 at 18:09 -0500, Bill Au wrote:
> > I am checking out 0.6.0-beta2 since I need the batch-mutate function.
> > I am just trying to run the example in the cassandra-cli Wiki:
> >
> > http://wiki.apache.org/cassandra/CassandraCli
> >
> > Here is what I am getting:
> >
> > cassandra> set Keyspace1.Standard1['jsmith']['first'] = 'John'
> > Value inserted.
> > cassandra> get Keyspace1.Standard1['jsmith']
> > => (column=6669727374, value=John, timestamp=1268261785077)
> > Returned 1 results.
> >
> > The column name being returned by get (6669727374) does not match what
> > is set (first). This is true for all column names.
> >
> > cassandra> set Keyspace1.Standard1['jsmith']['last'] = 'Smith'
> > Value inserted.
> > cassandra> set Keyspace1.Standard1['jsmith']['age'] = '42'
> > Value inserted.
> > cassandra> get Keyspace1.Standard1['jsmith']
> > => (column=6c617374, value=Smith, timestamp=1268262480130)
> > => (column=6669727374, value=John, timestamp=1268261785077)
> > => (column=616765, value=42, timestamp=1268262484133)
> > Returned 3 results.
> >
> > Is this a problem in 0.6.0-beta2 or am I doing anything wrong?
>
> No, you're not doing anything wrong. What you're seeing is the hex
> representation of a BytesType, which is the comparator that Standard1 in
> the example config uses. This is the same for 0.5.1 too.
>
> If you haven't made any changes to the default config, try using
> Standard2 as the column family and you'll see a human-readable column
> name as expected (Standard2 uses a UTF8Type comparator).
>
> The wiki page has sample output that is confusing (it's probably
> cut-and-paste from a time when Standard1 used an ASCII or UTF8
> comparator); we should probably fix that.
>
> --
> Eric Evans
> eev...@rackspace.com
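To make the hex output above easier to read, here is a small Python sketch that converts between the hex strings the CLI prints for a BytesType column name and the text that was originally set. The helper names are made up for this example, and it assumes the names were written as UTF-8 (or ASCII) text.

```python
def hex_to_text(hex_name):
    """Decode a hex-printed column name such as '6669727374' back to text."""
    return bytes.fromhex(hex_name).decode("utf-8")


def text_to_hex(name):
    """Encode a text column name the way the CLI displays BytesType names."""
    return name.encode("utf-8").hex()


if __name__ == "__main__":
    # Column names taken from the CLI session quoted above.
    for hex_name in ("6c617374", "6669727374", "616765"):
        print(hex_name, "=>", hex_to_text(hex_name))
```

Run against the three names from the session, this maps 6c617374, 6669727374 and 616765 back to last, first and age, which matches the columns that were set.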
get_range_slice(s) question
I've noticed that both 0.5.1 and 0.6b2 return (ReplicationFactor) identical copies of the data stored in my keyspace whenever I make a call to get_range_slice or get_range_slices using ConsistencyLevel.QUORUM.

So with ReplicationFactor set to 2 for my application's KeySpace I get double the number of KeySlices that I expect to get. When using ConsistencyLevel.ONE I get only one KeySlice for each row.

The same routine running against the Standard1 keyspace with a ReplicationFactor of 1 returns only a single KeySlice for each row. A ReplicationFactor of three gives me three identical KeySlices when using ConsistencyLevel.QUORUM.

Is this the intended behavior of get_range_slices? I remember reading in one of the Dynamo papers that applications (and not Dynamo) are required to sort out any discrepancies in the data, but in this case there aren't any discrepancies.

Omer
Re: problem with running simple example using cassandra-cli with 0.6.0-beta2
On Fri, 2010-03-12 at 11:21 -0500, Bill Au wrote:
> Thanks. With 0.6.0-beta2 using Standard2 does show a human-readable
> column.
>
> However, the behavior is definitely different between 0.5.1 and
> 0.6.0-beta2. I am using the binary distribution of 0.5.1:
>
> cassandra> show version
> 0.5.1
> cassandra> set Keyspace1.Standard1['jsmith']['first'] = 'John'
> Value inserted.
> cassandra> set Keyspace1.Standard1['jsmith']['last'] = 'Smith'
> Value inserted.
> cassandra> set Keyspace1.Standard1['jsmith']['age'] = '42'
> Value inserted.
> cassandra> get Keyspace1.Standard1['jsmith']
> => (column=last, value=Smith, timestamp=1268408466548)
> => (column=first, value=John, timestamp=1268408464036)
> => (column=age, value=42, timestamp=1268408468895)
> Returned 3 results.
>
> With 0.5.1 using Standard1 does show a human-readable column as
> documented in the Wiki.

Right you are, my mistake. This changed in
https://issues.apache.org/jira/browse/CASSANDRA-661 (which occurred
between 0.5 and 0.6).

> Not sure which one is the correct behavior here.

The current behavior is correct. I'll update the examples to avoid future confusion.

--
Eric Evans
eev...@rackspace.com
Re: problem with running simple example using cassandra-cli with 0.6.0-beta2
Thanks for clearing this up for me. Bill On Fri, Mar 12, 2010 at 11:49 AM, Eric Evans wrote: > On Fri, 2010-03-12 at 11:21 -0500, Bill Au wrote: > > Thanks. With 0.6.0-beta2 using Standard2 does show a human-readable > > column. > > > > However, the behavior is definitely different between 0.5.1 and > > 0.6.0-beta2. I am using the binary distribution of 0.5.1: > > > > cassandra> show version > > 0.5.1 > > cassandra> set Keyspace1.Standard1['jsmith']['first'] = 'John' > > Value inserted. > > cassandra> set Keyspace1.Standard1['jsmith']['last'] = 'Smith' > > Value inserted. > > cassandra> set Keyspace1.Standard1['jsmith']['age'] = '42' > > Value inserted. > > cassandra> get Keyspace1.Standard1['jsmith'] > > => (column=last, value=Smith, timestamp=1268408466548) > > => (column=first, value=John, timestamp=1268408464036) > > => (column=age, value=42, timestamp=1268408468895) > > Returned 3 results. > > > > With 0.5.1 using Standard1 does show a human-readable column as > > documented > > in the Wiki. > > Right you are, my mistake. This changed in > https://issues.apache.org/jira/browse/CASSANDRA-661 (which occurred > between 0.5 and 0.6). > > > Not sure which one is the correct behavior here. > > The current behavior is correct. I'll update the examples to avoid > future confusion. > > -- > Eric Evans > eev...@rackspace.com > >
Re: Effective allocation of multiple disks
Ryan- Are you going to use software or hardware based RAID 0? Does anyone on the list have any data to compare the performance of hardware RAID 0 vs. software LVM RAID 0? I would think software RAID 0 would be fine since there is no actual computation being done... Thanks! -Eric On Thu, Mar 11, 2010 at 1:16 PM, Ryan King wrote: > > > Even without major compaction, you can get significant imbalances in > how much data is on each disk which will bottleneck your IO > throughput. We're running JBOD right now, but going to switch to RAID > 0 soon. > > -ryan >
How to force GC in Cassandra?
Suppose I insert a lot of new items but also delete a lot of items daily. It would be ideal if I could force GC to happen at midnight (when traffic is low). Is there any way to manually force GC to run? That way I could add a cron job to trigger GC at midnight. I tried nodetool and the JMX interface, but they don't seem to offer that.

-Weijun
Re: Effective allocation of multiple disks
On Thu, 11 Mar 2010 12:01:27 -0600 Eric Evans wrote: EE> On Wed, 2010-03-10 at 23:20 -0600, Jonathan Ellis wrote: >> On Wed, Mar 10, 2010 at 9:31 PM, Anthony Molinaro >> wrote: >> > I would almost recommend just keeping things simple and removing >> > multiple data directories from the config altogether and just >> > documenting that you should plan on using OS level mechanisms for >> > growing diskspace and io. >> >> I think that is a pretty sane suggestion actually. EE> Or maybe leave the code as is and just document the situation more EE> clearly? If you're adding more disks to increase storage capacity EE> and you don't strictly need the extra IO, then multiple data EE> directories might be preferable to other forms of aggregation (it's EE> certainly simpler than say a volume manager). Could Cassandra use a block device as raw storage? You avoid the filesystem overhead and it lets the sysadmin determine the best kind of device (RAID or not underneath) to allocate. Ted
Cassandra Demo/Tutorial Applications
I was looking at this from CASSANDRA-873 as well as hands-on homework (!) for my OSCON tutorial. I have a couple of questions and would appreciate insights:

A) CASSANDRA-873 suggests Lucandra as one demo application.
B) Are there other ideas that will bring out the various aspects of Cassandra?
C) What would be the goal of the demo apps? A tutorial to help folks learn the ins and outs of Cassandra? A showcase of capabilities? I think CASSANDRA-873 belongs to the latter; Twissandra most probably belongs to the former.
D) Hadoop on Cassandra might be a good demo/tutorial.
E) How would one structure the infrastructure for the demos/tutorials? What assumptions can we make in creating them? As AMIs to be run in EC2? Also to be run on 2-3 local machines for folks who can spare some? Or as multiple processes, all on one machine? What is an optimum configuration for learning and demo? We need to make it simple (to reflect the domain) but not simpler.
F) I am looking for ideas from developers and users, hence the cross-posting. I hope the Apache mailer is smart enough to dedup; I will find out soon ...

Cheers
Re: get_range_slice(s) question
That would be a bug, not intended behavior. Can you open a ticket? On Fri, Mar 12, 2010 at 11:48 AM, Omer van der Horst Jansen wrote: > I've noticed that both 0.5.1 and 0.6b2 return (ReplicationFactor) > identical copies of the data stored in my keyspace whenever I make a > call to get_range_slice or get_range_slices using > ConsistencyLevel.QUORUM. > > So with ReplicationFactor set to 2 for my application's KeySpace I get > double the number of KeySlices that I expect to get. When using > ConsistencyLevel.ONE I get only one KeySlice for each row. > > The same routine running against the Standard1 keyspace with a > ReplicationFactor of 1 returns only a single KeySlice for each row. A > ReplicationFactor of three gives me three identical KeySlices when using > ConsistencyLevel.QUORUM. > > Is this the intended behavior of get_range_slices? I remember reading in > one of the Dynamo papers that applications (and not Dynamo) are required > to sort out any discrepancies in the data, but in this case there aren't > any discrepancies. > > Omer > > > >
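Until the bug is fixed, a client could collapse the repeated results itself. The Python sketch below is only an illustration of that idea, not part of any Cassandra client: it assumes each result has already been reduced to a (key, columns) pair and that the copies really are identical, as Omer reports.

```python
def dedupe_key_slices(key_slices):
    """Keep the first (key, columns) pair seen for each row key, preserving order."""
    seen = set()
    unique = []
    for key, columns in key_slices:
        if key not in seen:
            seen.add(key)
            unique.append((key, columns))
    return unique
```

This only hides the symptom on the client side; the ticket is still the right place to get the server behavior corrected.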
Re: How to force GC in Cassandra?
I think you mean compaction? You can use nodeprobe / nodetool for that. http://wiki.apache.org/cassandra/NodeProbe On Fri, Mar 12, 2010 at 12:40 PM, Weijun Li wrote: > Suppose I insert a lot of new items but also delete a lot of new items > daily, it will be ideal if I can force GC to happen during mid night (when > traffic is low). Is there any way to manually force GC to be executed? In > this way I can add a cronjob to trigger gc in mid night. I tried nodetool > and the JMX interface but they don't seem to have that. > > -Weijun >
Re: Effective allocation of multiple disks
We're going to use software RAID.

-ryan

On Fri, Mar 12, 2010 at 9:24 AM, Eric Rosenberry wrote:
> Ryan-
> Are you going to use software or hardware based RAID 0?
>
> Does anyone on the list have any data to compare the performance of hardware
> RAID 0 vs. software LVM RAID 0?
> I would think software RAID 0 would be fine since there is no actual
> computation being done...
> Thanks!
>
> -Eric
>
> On Thu, Mar 11, 2010 at 1:16 PM, Ryan King wrote:
>>
>> Even without major compaction, you can get significant imbalances in
>> how much data is on each disk which will bottleneck your IO
>> throughput. We're running JBOD right now, but going to switch to RAID
>> 0 soon.
>>
>> -ryan
Grails Cassandra plugin
Folks-

I put together a quick n' dirty Grails plugin for Cassandra, wrapped with Hector. It's available at http://github.com/wolpert/grails-cassandra in its initial state. I wouldn't call it 'production-ready' yet. :-)

We're using Cassandra at work and I wanted an easy way to access Cassandra from a Grails application, but couldn't find anything. I have some plans on where I want it to go, but I'm open to suggestions. I'll submit the code to Grails plugins once I get a bit further along with it. It's pretty basic at this point.

--
Virtually, Ned Wolpert
"Settle thy studies, Faustus, and begin..." --Marlowe
Cassandra 0.5.1 get_key_range problem
Hello,

When using the get_key_range method with ConsistencyLevel.ONE, an entire block of keys is not returned. I loop over the get_key_range method, advancing the start key after each call (requesting 8K keys per call).

When running the program several times, I got the same results, with large key blocks not returned.

Then I changed the program to use ConsistencyLevel.ALL, and all the keys were returned as expected.

I changed the program back to use ConsistencyLevel.ONE and all the keys are now returned.

Has anyone else seen this issue?

I would have expected ConsistencyLevel.ONE to be able to return all the keys. My 6 node cluster uses a replication factor of 3.

Thanks for your help,
Jon
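The paging loop Jon describes (request a page, then advance the start key to the last key returned) can be sketched in Python as follows. This is a hedged illustration, not the Thrift API: `fetch_keys(start, count)` is a hypothetical stand-in for the range call, assumed to return keys in sorted order with an inclusive start key, and `make_fake_store` exists only so the sketch can run without a cluster. It does not explain the missing keys; it only shows what a correct loop should yield.

```python
import bisect


def make_fake_store(keys):
    """Return a fetch_keys(start, count) function over an in-memory sorted key list."""
    ordered = sorted(keys)

    def fetch_keys(start, count):
        # Inclusive start, at most `count` keys, in sorted order (assumed semantics).
        i = bisect.bisect_left(ordered, start)
        return ordered[i:i + count]

    return fetch_keys


def iterate_all_keys(fetch_keys, page_size):
    """Yield every key exactly once by paging with an advancing start key."""
    if page_size < 2:
        raise ValueError("page_size must be at least 2 to make progress")
    start = ""
    first_page = True
    while True:
        page = fetch_keys(start, page_size)
        fresh = page
        # With an inclusive start key, every page after the first repeats
        # the last key of the previous page; skip that repeat.
        if not first_page and page and page[0] == start:
            fresh = page[1:]
        for key in fresh:
            yield key
        if len(page) < page_size:
            break  # a short page means the end of the range was reached
        start = page[-1]
        first_page = False
```

Counting the keys yielded by a loop like this at each consistency level, and comparing the totals, is one way to pin down exactly which block goes missing.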
Re: Strategies for storing lexically ordered data in supercolumns
On Thu, Mar 11, 2010 at 12:54 AM, Peter Chang wrote:
> I'm wondering about good strategies for picking keys that I want to be
> lexically sorted in a super column family. For example, my data looks like
> this:
>
> [user1_uuid][connections][some_key_for_user2] = ""
> [user1_uuid][connections][some_key_for_user3] = ""
>
> I was thinking that I wanted some_key_for_user2 to be sorted by a user's
> name. So I was thinking I set the subcolumn compareWith to UTF8Type or
> BytesType and construct a key
>
> [user's lastname + user's firstname + user's uuid]
>
> This would result in sorted subcolumn and user list. That's fine. But I
> wonder what would happen if, say, a user changes their last name. Happens
> rarely but I imagine people getting married and modifying their name. Now
> the sort is no longer correct. There seems to be some bad consequences to
> creating keys based on data that can change.
>
> So what is the general (elegant, easy to maintain) strategy here? Always
> sort in your server-side code and don't bother trying to have the data
> sorted?

Having row keys based on something potentially volatile is something I would avoid, since the row key determines which machine the row belongs to and moving data between machines isn't a cheap operation.

What you'll probably want to do is make the key something unique (like a uuid), store the user's name as a column on the row (thus making it easy to update) and maintain a secondary index to get the name-based sorting you want. If you're expecting a few million users, maintaining the index in a special row will work fine (e.g., the row name is "NAMEINDEX" and the columns are the name+uuid, similar to what you described). If you have billions of users, you'll need to get a bit fancier (partition based on the first letter of the last name, for example).

-Brandon
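A rough in-memory model of the layout Brandon suggests, in Python, for illustration only. The dictionaries stand in for column families, `sorted()` stands in for the ordering a UTF8Type comparator would provide, and storing the uuid as the index column's value is an assumption of this sketch (the thread does not say what the value should be). None of the names below are Cassandra APIs.

```python
# Row key (uuid) -> columns of the user row. The name lives in columns,
# so changing it never changes the row key.
users = {}

# Models the single "NAMEINDEX" row: column name -> column value.
name_index = {}


def index_column(last, first, user_id):
    """Column name used for sorting: last name, first name, then uuid."""
    return "%s-%s-%s" % (last, first, user_id)


def add_user(user_id, first, last):
    users[user_id] = {"first": first, "last": last}
    # Assumption: keep the uuid as the value so it never has to be parsed
    # back out of the column name.
    name_index[index_column(last, first, user_id)] = user_id


def user_ids_by_name():
    """Return user ids in name order, as a slice of the index row would."""
    return [name_index[column] for column in sorted(name_index)]
```

The uuid suffix keeps two users with the same name from colliding on one column, which is the reason for including it in the column name at all.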
Re: Grails Cassandra plugin
Great! You should also link it from http://wiki.apache.org/cassandra/ClientExamples (click "Login" at the top to create an account.) On Fri, Mar 12, 2010 at 3:57 PM, Ned Wolpert wrote: > Folks- > > I put together a quick n' dirty grails plugin for Cassandra, wrapped with > Hector. Its available at http://github.com/wolpert/grails-cassandra in its > initial state. I wouldn't call it 'production-ready' yet. :-) > > We're using Cassandra at work and I wanted an easy way to access Cassandra > from a grails application, but couldn't find anything. I have some plans on > how where I want it to go, but I'm open to suggestions. I'll submit the code > to grails plugins once I get a bit further along with it. Its pretty basic > at this point. > > -- > Virtually, Ned Wolpert > "Settle thy studies, Faustus, and begin..." --Marlowe >
Re: Cassandra 0.5.1 get_key_range problem
get_key_range is deprecated. You should use get_range_slice. On Fri, Mar 12, 2010 at 3:59 PM, Jon Graham wrote: > Hello, > > When using the get_key_range method with ConsistencyLevel.ONE an entire > block of keys is not returned. > I loop over the get_key_range method, advancing the start key after each > call (requesting 8K keys per call). > > When running the program several times, I got the same results with large > key blocks not returned. > > Then, I change the program to use ConsistencyLevel.ALL, then all the keys > are returned as expected. > > Change the program back to use ConsistencyLevel.ONE and all the keys are now > returned. > > Has anyone else seen this issue? > > I would have expected ConsistencyLevel.ONE to be able to return all the > keys. My 6 node cluster uses > a replication factor of 3. > > Thanks for your help, > Jon >
Re: Grails Cassandra plugin
Document updated On Fri, Mar 12, 2010 at 2:50 PM, Jonathan Ellis wrote: > Great! > > You should also link it from > http://wiki.apache.org/cassandra/ClientExamples (click "Login" at the > top to create an account.) > > On Fri, Mar 12, 2010 at 3:57 PM, Ned Wolpert > wrote: > > Folks- > > > > I put together a quick n' dirty grails plugin for Cassandra, wrapped > with > > Hector. Its available at http://github.com/wolpert/grails-cassandra in > its > > initial state. I wouldn't call it 'production-ready' yet. :-) > > > > We're using Cassandra at work and I wanted an easy way to access > Cassandra > > from a grails application, but couldn't find anything. I have some plans > on > > how where I want it to go, but I'm open to suggestions. I'll submit the > code > > to grails plugins once I get a bit further along with it. Its pretty > basic > > at this point. > > > > -- > > Virtually, Ned Wolpert > > "Settle thy studies, Faustus, and begin..." --Marlowe > > > -- Virtually, Ned Wolpert "Settle thy studies, Faustus, and begin..." --Marlowe
Re: Cassandra 0.5.1 get_key_range problem
Thanks once again Jonathan,

I don't mind switching to an updated API call.

Was there any known issue like the one I described with the get_key_range method? Could the use of certain start/end keys, return counts or consistency levels contribute to the issue I'm seeing?

Best Regards,
Jon

On Fri, Mar 12, 2010 at 1:53 PM, Jonathan Ellis wrote:
> get_key_range is deprecated. You should use get_range_slice.
>
> On Fri, Mar 12, 2010 at 3:59 PM, Jon Graham wrote:
> > Hello,
> >
> > When using the get_key_range method with ConsistencyLevel.ONE an entire
> > block of keys is not returned.
> > I loop over the get_key_range method, advancing the start key after each
> > call (requesting 8K keys per call).
> >
> > When running the program several times, I got the same results with large
> > key blocks not returned.
> >
> > Then, I change the program to use ConsistencyLevel.ALL, then all the keys
> > are returned as expected.
> >
> > Change the program back to use ConsistencyLevel.ONE and all the keys are
> > now returned.
> >
> > Has anyone else seen this issue?
> >
> > I would have expected ConsistencyLevel.ONE to be able to return all the
> > keys. My 6 node cluster uses
> > a replication factor of 3.
> >
> > Thanks for your help,
> > Jon
Re: Grails Cassandra plugin
great, I'm happy you found Hector useful :) btw, in hector 0.5.0-8 I added some interesting performance JMX counters so may be worth to update yours from 0.5.0-6 to -8 when you have time. On Fri, Mar 12, 2010 at 11:55 PM, Ned Wolpert wrote: > Document updated > > > On Fri, Mar 12, 2010 at 2:50 PM, Jonathan Ellis wrote: > >> Great! >> >> You should also link it from >> http://wiki.apache.org/cassandra/ClientExamples (click "Login" at the >> top to create an account.) >> >> On Fri, Mar 12, 2010 at 3:57 PM, Ned Wolpert >> wrote: >> > Folks- >> > >> > I put together a quick n' dirty grails plugin for Cassandra, wrapped >> with >> > Hector. Its available at http://github.com/wolpert/grails-cassandra in >> its >> > initial state. I wouldn't call it 'production-ready' yet. :-) >> > >> > We're using Cassandra at work and I wanted an easy way to access >> Cassandra >> > from a grails application, but couldn't find anything. I have some plans >> on >> > how where I want it to go, but I'm open to suggestions. I'll submit the >> code >> > to grails plugins once I get a bit further along with it. Its pretty >> basic >> > at this point. >> > >> > -- >> > Virtually, Ned Wolpert >> > "Settle thy studies, Faustus, and begin..." --Marlowe >> > >> > > > > -- > Virtually, Ned Wolpert > > "Settle thy studies, Faustus, and begin..." --Marlowe >
Re: SuperColumn.getSubColumns() ordering
Thanks. On Thu, Mar 11, 2010 at 6:46 PM, Jonathan Ellis wrote: > it's ordered by the column name as determined by the subcolumn > comparator you declared in the definition, yes > > On Thu, Mar 11, 2010 at 12:24 PM, Matteo Caprari > wrote: >> Hi. >> >> If I iterate over SuperColumn.getSubColumn(), do I get >> columns sorted by the column name? >> >> Thanks. >> -- >> :Matteo Caprari >> matteo.capr...@gmail.com >> > -- :Matteo Caprari matteo.capr...@gmail.com
Re: Grails Cassandra plugin
I added an issue in my github project for the update. Since I have your ear, in hector, if the cassandra server restarts (one server in the pool) hector will not try to reconnect to the cassandra server even if its listening. Is that a known issue? On Fri, Mar 12, 2010 at 3:35 PM, Ran Tavory wrote: > great, I'm happy you found Hector useful :) > btw, in hector 0.5.0-8 I added some interesting performance JMX counters so > may be worth to update yours from 0.5.0-6 to -8 when you have time. > > > On Fri, Mar 12, 2010 at 11:55 PM, Ned Wolpert > wrote: > >> Document updated >> >> >> On Fri, Mar 12, 2010 at 2:50 PM, Jonathan Ellis wrote: >> >>> Great! >>> >>> You should also link it from >>> http://wiki.apache.org/cassandra/ClientExamples (click "Login" at the >>> top to create an account.) >>> >>> On Fri, Mar 12, 2010 at 3:57 PM, Ned Wolpert >>> wrote: >>> > Folks- >>> > >>> > I put together a quick n' dirty grails plugin for Cassandra, wrapped >>> with >>> > Hector. Its available at http://github.com/wolpert/grails-cassandra in >>> its >>> > initial state. I wouldn't call it 'production-ready' yet. :-) >>> > >>> > We're using Cassandra at work and I wanted an easy way to access >>> Cassandra >>> > from a grails application, but couldn't find anything. I have some >>> plans on >>> > how where I want it to go, but I'm open to suggestions. I'll submit the >>> code >>> > to grails plugins once I get a bit further along with it. Its pretty >>> basic >>> > at this point. >>> > >>> > -- >>> > Virtually, Ned Wolpert >>> > "Settle thy studies, Faustus, and begin..." --Marlowe >>> > >>> >> >> >> >> -- >> Virtually, Ned Wolpert >> >> "Settle thy studies, Faustus, and begin..." --Marlowe >> > > -- Virtually, Ned Wolpert "Settle thy studies, Faustus, and begin..." --Marlowe
Re: Strategies for storing lexically ordered data in supercolumns
But wouldn't name + UUID be considered volatile? That was the crux of my questions. On Fri, Mar 12, 2010 at 1:07 PM, Brandon Williams wrote: > On Thu, Mar 11, 2010 at 12:54 AM, Peter Chang wrote: > >> I'm wondering about good strategies for picking keys that I want to be >> lexically sorted in a super column family. For example, my data looks like >> this: >> >> [user1_uuid][connections][some_key_for_user2] = "" >> [user1_uuid][connections][some_key_for_user3] = "" >> >> I was thinking that I wanted some_key_for_user2 to be sorted by a user's >> name. So I was thinking I set the subcolumn compareWith to UTF8Type or >> BytesType and construct a key >> >> [user's lastname + user's firstname + user's uuid] >> >> This would result in sorted subcolumn and user list. That's fine. But I >> wonder what would happen if, say, a user changes their last name. Happens >> rarely but I imagine people getting married and modifying their name. Now >> the sort is no longer correct. There seems to be some bad consequences to >> creating keys based on data that can change. >> >> So what is the general (elegant, easy to maintain) strategy here? Always >> sort in your server-side code and don't bother trying to have the data >> sorted? >> > > Having row keys based on something potentially volatile is something I > would avoid since that determines which machine the row belongs to and > moving data between machines isn't a cheap operation. > > What you'll probably want to do is make the key something unique (like a > uuid), store the user's name as a column on the row (thus making it easy to > update) and maintain a secondary index to get the named-based sorting you > want. If you're expecting a few million users, maintaining the index in a > special row will work fine (eg, the row name is "NAMEINDEX" and the columns > are the name+uuid similar to what you described.) 
If you have billions of > users, you'll need to get a bit fancier (partition based on letter of the > last name, for example.) > > -Brandon >
Re: Strategies for storing lexically ordered data in supercolumns
On Fri, Mar 12, 2010 at 7:07 PM, Peter Chang wrote: > But wouldn't name + UUID be considered volatile? That was the crux of my > questions. It would, but the distinction here is that it is now a column, not a row key. -Brandon
Re: Strategies for storing lexically ordered data in supercolumns
My original post is probably confusing. I was originally talking about columns and I don't see what the solution is.

"So I was thinking I set the subcolumn compareWith to UTF8Type or BytesType and construct a key [for the subcolumn, not a row key]

[user's lastname + user's firstname + user's uuid]

This would result in sorted subcolumn and user list."

Nevertheless, I still don't see/understand the solution. Let's say the person's name changes. The sort is no longer valid. That column value would need to be changed in order for the sort to be correct.

On Fri, Mar 12, 2010 at 5:10 PM, Brandon Williams wrote:
> On Fri, Mar 12, 2010 at 7:07 PM, Peter Chang wrote:
>
>> But wouldn't name + UUID be considered volatile? That was the crux of my
>> questions.
>
> It would, but the distinction here is that it is now a column, not a row
> key.
>
> -Brandon
Re: Strategies for storing lexically ordered data in supercolumns
To be more explicit:

['500c9280-2cdd-11df-869b-005056c1']['connections']['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c1']
['500c9280-2cdd-11df-869b-005056c1']['connections']['Jones-Jim-1a6dd756b0-2ca1-11df-b937-005056c1']

But Alyssa gets married and changes her name to Zamboni. The next time I read these subcolumns the users will not be sorted.

On Fri, Mar 12, 2010 at 5:21 PM, Peter Chang wrote:
> My original post is probably confusing. I was originally talking about
> columns and I don't see what the solution is.
>
> "So I was thinking I set the subcolumn compareWith to UTF8Type or
> BytesType and construct a key [for the subcolumn, not a row key]
>
> [user's lastname + user's firstname + user's uuid]
>
> This would result in sorted subcolumn and user list."
>
> Nevertheless, I still don't see/understand the solution. Let's say the
> person's name changes. The sort is no longer valid. That column value would
> need to be changed in order for the sort to be correct.
>
> On Fri, Mar 12, 2010 at 5:10 PM, Brandon Williams wrote:
>
>> On Fri, Mar 12, 2010 at 7:07 PM, Peter Chang wrote:
>>
>>> But wouldn't name + UUID be considered volatile? That was the crux of my
>>> questions.
>>
>> It would, but the distinction here is that it is now a column, not a row
>> key.
>>
>> -Brandon
Re: Strategies for storing lexically ordered data in supercolumns
On Fri, Mar 12, 2010 at 7:21 PM, Peter Chang wrote:
> My original post is probably confusing. I was originally talking about
> columns and I don't see what the solution is.

Sorry, I misunderstood.

> "So I was thinking I set the subcolumn compareWith to UTF8Type or
> BytesType and construct a key [for the subcolumn, not a row key]
>
> [user's lastname + user's firstname + user's uuid]
>
> This would result in sorted subcolumn and user list."
>
> Nevertheless, I still don't see/understand the solution. Let's say the
> person's name changes. The sort is no longer valid. That column value would
> need to be changed in order for the sort to be correct.

When their name changes, you delete the existing column and insert a new one with the correct name, which will then sort correctly.

-Brandon
Re: Strategies for storing lexically ordered data in supercolumns
Yes, I can update that one entry. But what if that subcolumn key is used across many different places?

['Jones-Bob']['connections']['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c1']
['Crabtree-Sam']['connections']['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c1']
['Rice-Brown']['connections']['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c1']
...

I can update every single entry, but now I need to keep track of them (which I guess I'm doing anyway). I was wondering if there was a more elegant solution, but it seems unlikely based on the given constraints.

On Fri, Mar 12, 2010 at 5:26 PM, Brandon Williams wrote:
> On Fri, Mar 12, 2010 at 7:21 PM, Peter Chang wrote:
>
>> My original post is probably confusing. I was originally talking about
>> columns and I don't see what the solution is.
>
> Sorry, I misunderstood.
>
>> "So I was thinking I set the subcolumn compareWith to UTF8Type or
>> BytesType and construct a key [for the subcolumn, not a row key]
>>
>> [user's lastname + user's firstname + user's uuid]
>>
>> This would result in sorted subcolumn and user list."
>>
>> Nevertheless, I still don't see/understand the solution. Let's say the
>> person's name changes. The sort is no longer valid. That column value would
>> need to be changed in order for the sort to be correct.
>
> When their name changes, you delete the existing column and insert a new
> one with the correct name, which will then sort correctly.
>
> -Brandon
Re: Strategies for storing lexically ordered data in supercolumns
On Fri, Mar 12, 2010 at 7:46 PM, Peter Chang wrote:
> Yes, I can update that one entry. But what if that subcolumn key is used
> across many different places?
>
> ['Jones-Bob']['connections']['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c1']
> ['Crabtree-Sam']['connections']['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c1']
> ['Rice-Brown']['connections']['Hacker-Alyssa-1ab54760-2ca8-11df-aabd-005056c1']
> ...
>
> I can update every single entry but now I need to keep track of them (which
> I guess I'm doing anyway). I was wondering if there was a more elegant
> solution but it seems unlikely based on the given constraints.

You have to update them all and track them, correct. What you're looking for sounds like transaction support, which Cassandra does not have. On the bright side, writes are cheap.

-Brandon
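One way to do the tracking is a reverse index from each user to the rows that hold a column for them. The following is only a sketch under that assumption: the store is a pair of in-memory dicts rather than a Cassandra client, and `ConnectionStore`, `add_connection`, and `rename_user` are hypothetical names invented for the example. The loop is not atomic, which is exactly the missing-transactions caveat above: a failure halfway through leaves some rows on the old name.

```python
from collections import defaultdict


class ConnectionStore:
    """In-memory stand-in: rows of 'connections' plus a reverse index per user."""

    def __init__(self):
        self.rows = defaultdict(dict)     # row key -> {subcolumn name: value}
        self.used_in = defaultdict(set)   # user id -> row keys that reference the user

    def add_connection(self, row_key, user_id, name, value=""):
        self.rows[row_key][name] = value
        self.used_in[user_id].add(row_key)  # remember where this user appears

    def rename_user(self, user_id, old_name, new_name):
        """Delete the old subcolumn and insert the new one in every tracked row."""
        updated = 0
        for row_key in sorted(self.used_in[user_id]):
            row = self.rows[row_key]
            if old_name in row:
                row[new_name] = row.pop(old_name)
                updated += 1
        return updated
```

Each rename costs one extra lookup in the reverse index plus one delete and one insert per referencing row, which is tolerable precisely because writes are cheap.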
Re: Cassandra Demo/Tutorial Applications
On Fri, Mar 12, 2010 at 1:55 PM, Krishna Sankar wrote:
> I was looking at this from CASSANDRA-873 as well as hands-on homework (!)
> for my OSCON tutorial. Have couple of questions. Would appreciate insights:
>
> A) Cassandra-873 suggests Lucandra as one demo application
> B) Are there other ideas that will bring out the various aspects of
> Cassandra ?

multi-user blog (single-user is too easy :)
 - extra credit: with full-text search using lucandra
discussion forum
 - also w/ FTS

> C) What would be the goal of demo apps ? Tutorial to help folks learn the
> ins and outs of Cassandra ? Show case capabilities ? I think Cassandra-873
> belongs to the latter; Twissandra most probably belongs to the former.

I think you nailed it.

> D) Hadoop on Cassandra might be a good demo/tutorial

Sure, I'll buy that. I can't think of any standalone projects for that, but "compute a twissandra tag cloud" would be pretty cool. (Might need to write a twissandra bot to load stuff in to make an interesting cloud. :)

> E) How would one structure the infrastructure for the demo/tutorials ? What
> assumptions can we make in creating them ? As AMIs to be run in EC2 ?

I'd probably go with "virtualbox images" as being simpler for people who don't have an AWS key already. (VB can read vmware player images, i think. But there is no free vmware for OS X, so you'd want to check that before going w/ vmware format.)

Or just have people d/l cassandra and a configuration xml. Probably easier than teaching people to use virtualbox who haven't before.

> Also
> to be run on 2-3 local machines for folks who can spare some ? Or as
> multiple processes - all in one machine ?

You're not going to have time to teach cluster management. Keep it to 1.
Re: Cassandra Demo/Tutorial Applications
There are several large data sets on the net you could use to build a demo with: search logs, Wikipedia, UK government data.

DBpedia may be interesting, as they have some of the data already extracted.

---
Sent from my phone
Ian Holsman - 703 879-3128

On 13/03/2010, at 4:46 PM, Jonathan Ellis wrote:
> On Fri, Mar 12, 2010 at 1:55 PM, Krishna Sankar wrote:
>> I was looking at this from CASSANDRA-873 as well as hands-on homework (!)
>> for my OSCON tutorial. Have couple of questions. Would appreciate insights:
>>
>> A) Cassandra-873 suggests Lucandra as one demo application
>> B) Are there other ideas that will bring out the various aspects of
>> Cassandra ?
>
> multi-user blog (single-user is too easy :)
>  - extra credit: with full-text search using lucandra
> discussion forum
>  - also w/ FTS
>
>> C) What would be the goal of demo apps ? Tutorial to help folks learn the
>> ins and outs of Cassandra ? Show case capabilities ? I think Cassandra-873
>> belongs to the latter; Twissandra most probably belongs to the former.
>
> I think you nailed it.
>
>> D) Hadoop on Cassandra might be a good demo/tutorial
>
> Sure, I'll buy that. I can't think of any standalone projects for
> that, but "compute a twissandra tag cloud" would be pretty cool.
> (Might need to write a twissandra bot to load stuff in to make an
> interesting cloud. :)
>
>> E) How would one structure the infrastructure for the demo/tutorials ? What
>> assumptions can we make in creating them ? As AMIs to be run in EC2 ?
>
> I'd probably go with "virtualbox images" as being simpler for people
> who don't have an AWS key already. (VB can read vmware player images,
> i think. But there is no free vmware for OS X, so you'd want to check
> that before going w/ vmware format.)
>
> Or just have people d/l cassandra and a configuration xml. Probably
> easier than teaching people to use virtualbox who haven't before.
>
>> Also to be run on 2-3 local machines for folks who can spare some ? Or as
>> multiple processes - all in one machine ?
>
> You're not going to have time to teach cluster management. Keep it to 1.
About the replication strategy of Cassandra
Hi all.

I am interested in the architecture of Cassandra.

Cassandra offers replication policies such as "Rack Unaware", "Rack Aware (within a datacenter)", and "Datacenter Aware". The application has to select one of these policies.

The algorithm used when the "Rack Aware (within a datacenter)" or "Datacenter Aware" strategy is selected might be a little difficult.

In Cassandra, ZooKeeper was chosen for the node election algorithm that the system uses.

1. What points should I note when selecting a replication strategy for Cassandra?

2. Regarding the Zab protocol adopted by ZooKeeper: the weak point of the Paxos protocol used in Chubby is latency. Is the Zab protocol better than this Paxos protocol?

---
Kazuki Aranami
Twitter: http://twitter.com/kimtea
http://d.hatena.ne.jp/kazuki-aranami/
---
Re: Incr/Decr Counters in Cassandra
I badly need this for my work. Let me know if I can do anything to speed it up :)

Regards,

On Wed, Nov 4, 2009 at 1:32 PM, Chris Goffinet wrote:
> Hey,
>
> At Digg we've been thinking about counters in Cassandra. In a lot of our
> use cases we need this type of support from a distributed storage system.
> Anyone else out there who has such needs as well? Zookeeper actually has
> such support and we might use that if we can't get the support in Cassandra.
>
> ---
> Chris Goffinet
> goffi...@digg.com