Re: Can the same key exist for two rows in two different column families without clashing?
Thanks Stephen for the great explanation! On Wed, Feb 2, 2011 at 4:31 PM, Stephen Connolly stephen.alan.conno...@gmail.com wrote: On 2 February 2011 10:03, Ertio Lew ertio...@gmail.com wrote: Can the same key exist for two rows in two different column families without clashing? In other words, does the same algorithm need to be enforced for generating keys across different column families, or can different key-generation algorithms be used on a per-column-family basis? I have tried this out and found that they can, but I wanted to know if there may be any problems associated with it. Thanks. Ertio Lew

It is a bad analogy for many reasons, but if you replace "row key" with "primary key" and "column family" with "table" then you might get an answer. A better analogy is to think of the following:

public class Keyspace {
    public final Map<String, Map<String, byte[]>> columnFamily1;
    public final Map<String, Map<String, byte[]>> columnFamily2;
    public final Map<String, Map<String, Map<String, byte[]>>> superColumnFamily3;
}

(still not quite correct, but mostly so for our purposes). You are asking: given

Keyspace keyspace;
String key1 = makeKeyAlg1();
keyspace.columnFamily1.put(key1, ...);
String key2 = makeKeyAlg2();
keyspace.columnFamily2.put(key2, ...);

when key1.equals(key2), is there a problem? They are two separate maps... why would there be? -Stephen
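Stephen's analogy can be made runnable. The sketch below (class, field, and key names are illustrative, not Cassandra API) shows that the same row key in two independent maps never clashes:

```java
import java.util.HashMap;
import java.util.Map;

public class KeyspaceAnalogy {
    // Two independent "column families" modeled as separate maps
    static final Map<String, Map<String, byte[]>> columnFamily1 = new HashMap<>();
    static final Map<String, Map<String, byte[]>> columnFamily2 = new HashMap<>();

    public static void main(String[] args) {
        String key = "user:42"; // the same row key used in both families

        Map<String, byte[]> row1 = new HashMap<>();
        row1.put("name", "alice".getBytes());
        columnFamily1.put(key, row1);

        Map<String, byte[]> row2 = new HashMap<>();
        row2.put("score", "100".getBytes());
        columnFamily2.put(key, row2);

        // The families are separate maps, so the identical key cannot clash
        System.out.println(columnFamily1.get(key).containsKey("name"));  // true
        System.out.println(columnFamily2.get(key).containsKey("score")); // true
    }
}
```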
Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
Hey all, I need to store supercolumns, each with around 8 subcolumns; all the data for a supercolumn is written at once, and all subcolumns need to be retrieved together. The data in each subcolumn is not big; it just contains keys to other rows. Would it be preferable to have a supercolumn family, or just a standard column family with all the subcolumn data serialized into a single column? Thanks Aditya Narayan
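For concreteness, the serialized-column alternative could look something like the sketch below. The delimiter and helper names are illustrative assumptions, not anything from Cassandra itself:

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

public class SerializedColumn {
    // Pack several small sub-values into one column value; '\u0000' is an
    // arbitrary delimiter, chosen assuming it cannot appear in the row keys.
    static byte[] pack(List<String> rowKeys) {
        return String.join("\u0000", rowKeys).getBytes(StandardCharsets.UTF_8);
    }

    static List<String> unpack(byte[] columnValue) {
        return Arrays.asList(new String(columnValue, StandardCharsets.UTF_8).split("\u0000"));
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("rowA", "rowB", "rowC");
        byte[] value = pack(keys);          // written as a single column value
        System.out.println(unpack(value));  // [rowA, rowB, rowC]
    }
}
```

The trade-off to keep in mind: a single serialized column is read and rewritten as a unit, whereas separate subcolumns can be updated individually.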
Re: reduced cached mem; resident set size growth
On 01/28/2011 09:19 PM, Chris Burroughs wrote: Thanks Oleg and Zhu. I swear that wasn't a new hotspot version when I checked, but that's obviously not the case. I'll update one node to the latest as soon as I can and report back. RSS over 48 hours with Java 6 update 23: http://img716.imageshack.us/img716/5202/u2348hours.png I'll continue monitoring, but RSS still appears to grow without bound. Zhu reported a similar problem with Ubuntu 10.04. While possible, it would seem extraordinarily unlikely that there is a glibc or kernel bug affecting us both.
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
Actually, I am trying to use Cassandra to display to users of my application the list of all Reminders they have set for themselves. I need to store rows containing the timeline of daily reminders put by the users, for themselves, on the application. The reminders need to be presented to the user in chronological order, like a news feed. Each reminder has certain tags associated with it (so that, at times, the user may also choose to see the reminders filtered by tags, in chronological order). So I thought of a schema something like this:
- The details of each reminder may be stored as a separate row in a column family.
- For presenting the timeline of reminders to the user, the timeline row of each user would contain the IDs/keys (of the reminder rows) as the supercolumn names, and the subcolumns inside those supercolumns would contain the list of tags associated with the particular reminder. All tags are set at once during the first write. The number of tags (subcolumns) will be around 8 at maximum.
Any comments, suggestions and feedback on the schema design are requested. Thanks Aditya Narayan
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
To reiterate, so I know we're both on the same page, your schema would be something like this:
- A column family (as you describe) to store the details of a reminder. One reminder per row. The row key would be a TimeUUID.
- A super column family to store the reminders for each user, for each day. The row key would be something like: MMDD:user_id. The column names would simply be the TimeUUIDs of the messages. The sub column names would be the tag names of the various reminders.
The idea is that you would then get a slice of each row for a user, for a day, that would only contain sub column names with the tags you're looking for? Then, based upon the column names returned, you'd look up the reminders. That seems like a solid schema to me. Bill-
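Assembling the per-user, per-day row key described in this thread might look like the sketch below. The MMDD:user_id shape is taken from the thread; the helper name is illustrative:

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class TimelineRowKey {
    // Day-based bucket key in the "MMDD:user_id" shape discussed in the thread
    static String rowKey(LocalDate day, String userId) {
        return day.format(DateTimeFormatter.ofPattern("MMdd")) + ":" + userId;
    }

    public static void main(String[] args) {
        System.out.println(rowKey(LocalDate.of(2011, 2, 2), "user42")); // 0202:user42
    }
}
```

Note that a month-and-day key repeats every year; if keys must stay unique across years, a pattern that includes the year would be needed.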
Re: Does HH work (or make sense) for counters?
When you create a counter column family, there is an option called replicate_on_write. When this option is off, then during a write the increment is written to only one node and not replicated at all; in particular, it is not hinted to any node. While unsafe, if you can accept its potential consequences, this option can make sense if you want to sustain very fast increments, because replication for counters (unlike for normal writes) implies a read. Right now, replicate_on_write is off by default, but if you turn it on, HH should work as expected (otherwise, that would likely be a bug). Sylvain

On Tue, Feb 1, 2011 at 8:23 PM, Narendra Sharma narendra.sha...@gmail.com wrote: Version: Cassandra 0.7.1 (built from trunk) Setup: - Cluster of 2 nodes (say A and B) - HH enabled - Using the default Keyspace definition in cassandra.yaml - Using the SuperCounter1 CF Client: - Using CL of ONE. I started the two Cassandra nodes, created the schema, and then shut down one of the instances (say B). Executed counter update and read operations on A with CL=ONE. Everything worked fine; all counters were returned with correct values. Then started node B and waited for a couple of minutes. Executed only counter read operations on B with CL=ONE. Initially got no counters for any of the rows. On the second (and subsequent) tries, got counters for only one (always the same) row out of ten. After doing one read with CL=QUORUM, reads with CL=ONE started returning correct data. Thanks, Naren
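For reference, turning the option on at creation time in the 0.7/0.8-era cassandra-cli looks roughly like the fragment below. The column family name is illustrative, and the exact attribute syntax should be checked against `help create column family;` in your version:

```
create column family Counters
    with default_validation_class = CounterColumnType
    and replicate_on_write = true;
```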
Commit log compaction
How often and by what criteria is the commit log compacted/truncated? Thanks, Maxim
Secondary indexes on super columns
Hi! I would like to know if secondary indexes are foreseen for super columns / columns inside of super columns? If yes, will it be in the near future? Thanks a lot in advance Sébastien Druon
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
I think you got exactly what I wanted to convey, except for a few things I want to clarify: I was thinking of a single row containing all reminders (not split by day). History of the reminders needs to be maintained for some time; after a certain time (say 3 or 6 months) they may be deleted via the TTL facility. While presenting the reminders timeline to the user, the latest supercolumns (around 50 from the start_end) will be picked up, their subcolumn values will be compared to the tags the user has chosen to see, and, corresponding to the filtered subcolumn values (tags), the rows with the reminder details would be picked up. Is a supercolumn a preferable choice for this? Can there be a better schema than this? -Aditya Narayan
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
Any time I see/hear "a single row containing all ..." I get nervous. That single row is going to reside on a single node. That is potentially a lot of load (I don't know your system) for that single node. Why wouldn't you split it by at least user? If it won't be a lot of load, then why are you using Cassandra? This seems like something that could easily fit into an SQL/relational style DB. If it's too much data (millions of users, 100s of millions of reminders) for a standard SQL/relational model, then it's probably too much for a single row. I'm not familiar with the TTL functionality of Cassandra... sorry, I cannot help/comment there, still learning :-) Yea, my $0.02 is that this is an effective way to leverage super columns. Bill-
unsubscribe
Sent from my iPad
Subscribe
Sent from my iPad
Re: Secondary indexes on super columns
On Wed, Feb 2, 2011 at 7:37 AM, Sébastien Druon sdr...@spotuse.com wrote: Hi! I would like to know if secondary indexes are foreseen for super columns / columns inside of super columns? No. If yes, will it be in the near future? Probably not. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: unsubscribe
http://wiki.apache.org/cassandra/FAQ#unsubscribe On Wed, Feb 2, 2011 at 7:55 AM, JJ jjcha...@gmail.com wrote: Sent from my iPad -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: CQL
On Wed, 2011-02-02 at 06:57 +, Vivek Mishra wrote: I am trying to run CQL from a java client and facing one issue. Keyspace is passed as null. When I execute Use Keyspace1 followed by my Select query it is still not working. Can you provide some minimal sample code that demonstrates the problem you're seeing? -- Eric Evans eev...@rackspace.com
Re: unsubscribe
On Wed, 2011-02-02 at 07:55 -0800, JJ wrote: Sent from my iPad This won't work (even from an iPad), you need to mail user-unsubscr...@cassandra.apache.org -- Eric Evans eev...@rackspace.com
Re: cassandra as session store
We're using Cassandra as the back end for a home-grown session management system. That system was originally built back in 2005 using BerkeleyDB/Java and a data distribution system that used UDP multicast; maintenance was becoming increasingly painful. I wrote a prototype replacement service using Cassandra 0.6 but decided to wait for the availability of official TTL support in 0.7 before switching over. The new system has been running in production now for a little over a week. My main issue is that Cassandra is using far more disk space than I expected it to. The vast bulk of disk space seems to be used by *Index.db files. I'm hoping that the 10-day GCGraceSeconds interval that kicks in on Friday will help me there. Most of our apps that use this service generate their own session keys, I assume by hashing and salting a user ID and/or calling something like java.util.UUID.randomUUID(). My schema is currently very simple -- there's a single CF containing a (binary) payload column and a column that indicates whether or not the data has been compressed. We have a few rogue apps that store humongous XML documents in the session, and compression helps deal with that. That's also why memcached wasn't going to work in our scenario.

On Tue, Feb 1, 2011 at 12:18 PM, Kallin Nagelberg kallin.nagelb...@gmail.com wrote: Hey, I am currently investigating Cassandra for storing what are effectively web sessions. Our production environment has about 10 high-end servers behind a load balancer, and we'd like to add distributed session support. My main concerns are performance, consistency, and the ability to create unique session keys. The last thing we would want is users picking up each other's sessions. After spending a few days investigating Cassandra, I'm thinking of creating a single keyspace with a single super-column-family.
The SCF would store a few standard columns, and a supercolumn of arbitrary session attributes, like:

0s809sdf8s908sf90s: {
    prop1: x,
    created: timestamp,
    lastAccessed: timestamp,
    prop2: y,
    arbitraryProperties: {
        someRandomProperty1: xxyyzz,
        someRandomProperty2: xxyyzz,
        someRandomProperty3: xxyyzz
    }
}

Does this sound like a reasonable use case? We are on a tight timeline, and I'm currently on the fence about getting something like this up and running. Thanks, -Kal
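The thread mentions apps generating their own session keys, for example via java.util.UUID.randomUUID(). A minimal sketch of that approach (the class and helper names are illustrative):

```java
import java.util.UUID;

public class SessionKeys {
    // Random (version 4) UUIDs carry ~122 bits of randomness, so collisions
    // between independently generated session keys are astronomically unlikely.
    static String newSessionKey() {
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        String a = newSessionKey();
        String b = newSessionKey();
        System.out.println(a);           // e.g. 3f1c9d2e-... (36-character string)
        System.out.println(a.equals(b)); // independent random keys differ
    }
}
```

This addresses the "users picking up each other's sessions" concern through unguessability rather than coordination: no central counter is needed across the 10 servers.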
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
You got me wrong, perhaps. I am already splitting the row on a per-user basis, of course; otherwise the schema wouldn't make sense for my usage. The row contains only *reminders of a single user*, sorted in chronological order. The reminder IDs are stored as supercolumn names, and the subcolumns contain the tags for that reminder.
changing JMX port in 0.7
An instance of Cassandra starts and is listening on the ports described below:

Port  Description                              Defined In
9160  Client traffic via the Thrift protocol   cassandra.yaml
7000  Cluster traffic via gossip               cassandra.yaml
8080  Port for monitoring attributes via JMX   cassandra.in.sh

My $CASSANDRA_HOME/conf/cassandra.in.sh has no configuration for JMX. In $CASSANDRA_HOME/conf/cassandra-env.sh: JMX_PORT=8080. When I change this, the port change isn't reflected. I am starting cassandra with: cassandra -f. I'd like to change the default port... -sd -- Sasha Dolgy sasha.do...@gmail.com
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
I did not understand before... sorry. Again, depending upon how many reminders you have for a single user, this could be a long/wide row. Again, it really comes down to how many reminders we are talking about and how often they will be read/written. While a single row can contain millions (maybe more) of columns, that doesn't mean it's a good idea. I'm working on a logging system with Cassandra and ran into this same type of problem. Do I put all of the messages for a single system into a single row keyed off that system's name? I quickly came to the answer of no, and now I break my row keys into POSIX_timestamp:system, where my timestamps are buckets of five minutes. This nicely distributes the load across the nodes in my system. Bill-
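Bill's five-minute bucketing can be sketched as follows. The POSIX_timestamp:system key shape is from the thread; the rounding helper is illustrative:

```java
public class BucketedRowKey {
    static final long BUCKET_SECONDS = 300; // five-minute buckets

    // Round an epoch-seconds timestamp down to its bucket boundary and
    // build a "POSIX_timestamp:system" row key, as described in the thread.
    static String rowKey(long epochSeconds, String system) {
        long bucket = (epochSeconds / BUCKET_SECONDS) * BUCKET_SECONDS;
        return bucket + ":" + system;
    }

    public static void main(String[] args) {
        // 1296667077 and 1296667199 fall within the same 300-second window,
        // so both map to the same row key
        System.out.println(rowKey(1296667077L, "web01")); // 1296666900:web01
        System.out.println(rowKey(1296667199L, "web01")); // 1296666900:web01
    }
}
```

Because consecutive time windows hash to different rows, writes rotate across the cluster instead of hammering whichever node owns a single system-named row.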
Re: changing JMX port in 0.7
Silly me. On Windows it has to be changed in $CASSANDRA_HOME/bin/cassandra.bat. -sd -- Sasha Dolgy sasha.do...@gmail.com
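For reference, the change on each platform looks roughly like the fragment below. The port number 9090 is just an example, and the exact JVM-option line in cassandra.bat should be checked against your copy of the file:

```
# conf/cassandra-env.sh (Linux / OS X)
JMX_PORT="9090"

rem bin/cassandra.bat (Windows) -- the port is a standard JMX JVM option
-Dcom.sun.management.jmxremote.port=9090
```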
Re: changing JMX port in 0.7
:-) On Wed, Feb 2, 2011 at 10:14 PM, Sasha Dolgy sdo...@gmail.com wrote: Silly me. On windows it has to be changed in $CASSANDRA_HOME/bin/cassandra.bat -- Sasha Dolgy sasha.do...@gmail.com
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
@Bill Thank you Bill! @Cassandra users Can others also leave their suggestions and comments about my schema, please? Also my question about whether to use a supercolumn or, alternatively, just store the data (that would otherwise be stored in subcolumns) serialized into a single column in a standard type column family. Thanks -Aditya Narayan On Wed, Feb 2, 2011 at 10:11 PM, William R Speirs bill.spe...@gmail.com wrote: I did not understand before... sorry. Again, depending upon how many reminders you have for a single user, this could be a long/wide row. Again, it really comes down to how many reminders we are talking about and how often they will be read/written. While a single row can contain millions (maybe more) of columns, that doesn't mean it's a good idea. I'm working on a logging system with Cassandra and ran into this same type of problem. Do I put all of the messages for a single system into a single row keyed off that system's name? I quickly came to the answer of no, and now I break my row keys into POSIX_timestamp:system where my timestamps are buckets for every 5 minutes. This nicely distributes the load across the nodes in my system. Bill- On 02/02/2011 11:18 AM, Aditya Narayan wrote: You got me wrong perhaps.. I am already splitting the row on a per-user basis of course, otherwise the schema wouldn't make sense for my usage. The row contains only *reminders of a single user* sorted in chronological order. The reminder Ids are stored as supercolumn names and subcolumns contain tags for that reminder. On Wed, Feb 2, 2011 at 9:19 PM, William R Speirs bill.spe...@gmail.com wrote: Any time I see/hear a single row containing all ... I get nervous. That single row is going to reside on a single node. That is potentially a lot of load (I don't know the system) for that single node. Why wouldn't you split it by at least user? If it won't be a lot of load, then why are you using Cassandra? This seems like something that could easily fit into an SQL/relational style DB.
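Bill's 5-minute bucketing can be sketched as a small key function. The key shape follows his POSIX_timestamp:system description; the exact formatting is an assumption:

```python
# Sketch of bucketed row keys, "POSIX_timestamp:system", where the timestamp
# is floored to a 5-minute bucket. Exact key format is an assumption based on
# the description in the thread.
BUCKET_SECONDS = 5 * 60

def row_key(system_name, posix_ts):
    bucket = posix_ts - (posix_ts % BUCKET_SECONDS)  # floor to the bucket start
    return f"{bucket}:{system_name}"
```

All messages logged for a system within the same 5-minute window land in one row, so no single row (and hence no single node) grows without bound.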
If it's too much data (millions of users, 100s of millions of reminders) for a standard SQL/relational model, then it's probably too much for a single row. I'm not familiar with the TTL functionality of Cassandra... sorry, cannot help/comment there, still learning :-) Yea, my $0.02 is that this is an effective way to leverage super columns. Bill- On 02/02/2011 10:43 AM, Aditya Narayan wrote: I think you got exactly what I wanted to convey, except for a few things I want to clarify: I was thinking of a single row containing all reminders (not split by day). History of the reminders needs to be maintained for some time. After a certain time (say 3 or 6 months) they may be deleted by the TTL facility. While presenting the reminders timeline to the user, the latest supercolumns (around 50 from the start/end) will be picked up and their subcolumn values will be compared to the tags the user has chosen to see; corresponding to the filtered subcolumn values (tags), the rows of the reminder details would be picked up.. Is a supercolumn a preferable choice for this? Can there be a better schema than this? -Aditya Narayan On Wed, Feb 2, 2011 at 8:54 PM, William R Speirs bill.spe...@gmail.com wrote: To reiterate, so I know we're both on the same page, your schema would be something like this: - A column family (as you describe) to store the details of a reminder. One reminder per row. The row key would be a TimeUUID. - A super column family to store the reminders for each user, for each day. The row key would be something like: MMDD:user_id. The column names would simply be the TimeUUID of the messages. The sub column names would be the tag names of the various reminders. The idea is that you would then get a slice of each row for a user, for a day, that would only contain sub column names with the tags you're looking for? Then based upon the column names returned, you'd look up the reminders. That seems like a solid schema to me.
Bill- On 02/02/2011 09:37 AM, Aditya Narayan wrote: Actually, I am trying to use Cassandra to display to users on my application the list of all reminders set by themselves for themselves, on the application. I need to store rows containing the timeline of daily reminders put by the users, for themselves, on the application. The reminders need to be presented to the user in chronological order, like a news feed. Each reminder has certain tags associated with it (so that, at times, the user may also choose to see the reminders filtered by tags, in chronological order). So I thought of a schema something like this: - Each reminder's details may be stored as a separate row in a column family. - For presenting the timeline of reminders set by a user, the timeline row of each user would contain the Id/Key(s) (of the reminder rows) as the supercolumn names and the subcolumns inside that
Re: cassandra as session store
Sounds like you're seeing the bug in 0.7.0 preventing deletion of non-Data.db files (i.e. your Index.db) post-compaction. This is fixed for 0.7.1. (https://issues.apache.org/jira/browse/CASSANDRA-2059) On Wed, Feb 2, 2011 at 8:15 AM, Omer van der Horst Jansen ome...@gmail.com wrote: We're using Cassandra as the back end for a home-grown session management system. That system was originally built back in 2005 using BerkeleyDB/Java and a data distribution system that used UDP multicast. Maintenance was becoming increasingly painful. I wrote a prototype replacement service using Cassandra 0.6 but decided to wait for the availability of official TTL support in 0.7 before switching over. The new system has been running in production now for a little over a week. My main issue is that Cassandra is using far more disk space than I expected it to. The vast bulk of disk space seems to be used for *Index.db files. I'm hoping that the 10-day GCGraceSeconds interval that kicks in on Friday will help me there. Most of our apps that use this service generate their own session keys, I assume by hashing and salting a user ID and/or calling something like java.util.UUID.randomUUID(). My schema is currently very simple -- there's a single CF containing a (binary) payload column and a column that indicates whether or not the data has been compressed. We have a few rogue apps that store humongous XML documents in the session and compression helps to deal with that. That's also why memcached wasn't going to work in our scenario. On Tue, Feb 1, 2011 at 12:18 PM, Kallin Nagelberg kallin.nagelb...@gmail.com wrote: Hey, I am currently investigating Cassandra for storing what are effectively web sessions. Our production environment has about 10 high-end servers behind a load balancer, and we'd like to add distributed session support. My main concerns are performance, consistency, and the ability to create unique session keys.
The last thing we would want is users picking up each other's sessions. After spending a few days investigating Cassandra I'm thinking of creating a single keyspace with a single super-column-family. The scf would store a few standard columns, and a supercolumn of arbitrary session attributes, like: 0s809sdf8s908sf90s: { prop1: x, created : timestamp, lastAccessed: timestamp, prop2: y, arbitraryProperties : { someRandomProperty1:xxyyzz, someRandomProperty2:xxyyzz, someRandomProperty3:xxyyzz } } Does this sound like a reasonable use case? We are on a tight timeline and I'm currently on the fence about getting something like this up and running. Thanks, -Kal -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
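Kal's session layout can be sketched like this. Random UUIDs stand in for the app-generated session keys mentioned in the thread; the field names follow his example, and everything else is an assumption:

```python
import time
import uuid

def new_session_key():
    # Random (version 4) UUIDs make key collisions astronomically unlikely,
    # so users cannot pick up each other's sessions via a key clash alone.
    return uuid.uuid4().hex

def new_session(**attrs):
    # Mirrors the proposed supercolumn layout: a few fixed columns plus an
    # arbitraryProperties map of app-specific attributes (names from the
    # example above; structure here is a sketch, not a client API).
    now = time.time()
    return {
        "created": now,
        "lastAccessed": now,
        "arbitraryProperties": dict(attrs),
    }

sessions = {}
key = new_session_key()
sessions[key] = new_session(someRandomProperty1="xxyyzz")
```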
quick shout-out to the riptano/datastax folks!
Just a quick shout-out to the riptano folks and becoming part of/forming DataStax! Congrats!
Re: unsubscribe
Can't the mailing list server be changed to treat messages with unsubscribe as the subject as an unsubscribe as well? Otherwise it will just keep happening, as people simply don't remember or take the time to find out. Just my 2 cents... Groets, Hugo. On 2 feb 2011, at 16:54, Jonathan Ellis jbel...@gmail.com wrote: http://wiki.apache.org/cassandra/FAQ#unsubscribe On Wed, Feb 2, 2011 at 7:55 AM, JJ jjcha...@gmail.com wrote: Sent from my iPad -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: unsubscribe
I'm afraid that would unsubscribe us, no? On Wed, Feb 2, 2011 at 6:37 PM, F. Hugo Zwaal h...@unitedgames.com wrote: Can't the mailinglist server be changed to treat messages with unsubscribe as subject as an unsubscribe as well? Otherwise it will just keep happening, as people simply don't remember or take time to find out? Just my 2 cents... Groets, Hugo.
Re: unsubscribe
To make it short.. No it can't. Bye, Norman (ASF Infrastructure Team) 2011/2/2 F. Hugo Zwaal h...@unitedgames.com: Can't the mailinglist server be changed to treat messages with unsubscribe as subject as an unsubscribe as well? Otherwise it will just keep happening, as people simply don't remember or take time to find out? Just my 2 cents... Groets, Hugo. On 2 feb 2011, at 16:54, Jonathan Ellis jbel...@gmail.com wrote: http://wiki.apache.org/cassandra/FAQ#unsubscribe On Wed, Feb 2, 2011 at 7:55 AM, JJ jjcha...@gmail.com wrote: Sent from my iPad -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Slow network writes
Hello, I try to make a little cluster of 2 Cassandra (0.7.0) nodes, and I made a little test in PHP:

<?php
define("LIBPATH", "lib/");
define("RECORDSSETCOUNT", 100);
require_once("thrift/Thrift.php");
require_once("thrift/transport/TSocket.php");
require_once("thrift/transport/TFramedTransport.php");
require_once("thrift/protocol/TBinaryProtocol.php");
require_once(LIBPATH."cassandra/Cassandra.php");
require_once(LIBPATH."cassandra/cassandra_types.php");
//-
$transport = new TFramedTransport(new TSocket("10.24.84.4", 9160));
$protocol = new TBinaryProtocolAccelerated($transport);
$client = new CassandraClient($protocol);
$transport->open();
$client->set_keyspace("test");
//-
$l_row = array("qw" => 12, "as" => 67, "df" => "df", "id" => "uid", "uid" => 1212);
$l_begin = microtime(true);
for($i = 0; $i < 100; ++$i)
{
    $l_columns = array();
    foreach($l_row as $l_key => $l_value)
    {
        $l_columns[] = new cassandra_Column(array("name" => $l_key, "value" => $l_value, "timestamp" => time()));
    };
    $l_supercolumn = new cassandra_SuperColumn(array("name" => $l_row["id"], "columns" => $l_columns));
    $l_c_or_sc = new cassandra_ColumnOrSuperColumn(array("super_column" => $l_supercolumn));
    $l_mutation = new cassandra_Mutation(array("column_or_supercolumn" => $l_c_or_sc));
    $client->batch_mutate(array($l_row["uid"] => array('adsdfsdfsd' => array($l_mutation))), cassandra_ConsistencyLevel::ONE);
    if($i && !($i % 1000))
    {
        print (microtime(true) - $l_begin)."\n";
        $l_begin = microtime(true);
    };
};
print "done\n";
sleep(20);
?>

When I run this test on the same machine that runs the Cassandra daemon with ip (10.24.84.4) I got the following results: 0.64255094528198 0.53704404830933 0.4430079460144 0.43299198150635 But when I switch the test to the other Cassandra daemon with ip (10.24.84.7), so the test and the Cassandra daemon work on separate machines, I got the following results: 2.4974539279938 2.3667190074921 2.2672221660614 2.3015670776367 2.2397489547729 So in my case performance degrades up to 5 times. Why does this happen, and how can I solve it?
Latency of my network is good, ping gives: PING 10.24.84.7 (10.24.84.7) 56(84) bytes of data. 64 bytes from 10.24.84.7: icmp_seq=1 ttl=64 time=0.758 ms 64 bytes from 10.24.84.7: icmp_seq=2 ttl=64 time=0.696 ms 64 bytes from 10.24.84.7: icmp_seq=3 ttl=64 time=0.687 ms 64 bytes from 10.24.84.7: icmp_seq=4 ttl=64 time=0.735 ms 64 bytes from 10.24.84.7: icmp_seq=5 ttl=64 time=0.689 ms 64 bytes from 10.24.84.7: icmp_seq=6 ttl=64 time=0.631 ms 64 bytes from 10.24.84.7: icmp_seq=7 ttl=64 time=0.379 ms PS: my system is Linux 2.6.32-311-ec2 #23-Ubuntu SMP Thu Dec 2 11:14:35 UTC 2010 x86_64 GNU/Linux
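The slowdown is consistent with paying one network round trip per synchronous batch_mutate call. A rough back-of-the-envelope check, using only the numbers reported in this message (everything else is assumption):

```python
# Rough model: each synchronous batch_mutate costs one round trip plus
# server-side work, so a single-threaded client is latency-bound.
local_per_op_ms = 0.44   # ~0.44 s per 1000 ops against the local node
rtt_ms = 0.70            # typical ping time reported above

# Expected remote per-op time if the only extra cost were the raw RTT:
expected_remote_ms = local_per_op_ms + rtt_ms

# The observed remote time was ~2.3 ms/op; the gap beyond the raw RTT is
# TCP/Thrift overhead, but either way a single-threaded client measures
# latency, not cluster throughput.
sequential_ops_per_sec = 1000.0 / 2.3
```

At ~2.3 ms per synchronous call, one thread tops out around 430 ops/s no matter how fast the cluster is, which is why the answer further down recommends a multithreaded client.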
Re: reduced cached mem; resident set size growth
On Wed, Feb 2, 2011 at 6:22 AM, Chris Burroughs chris.burrou...@gmail.com wrote: On 01/28/2011 09:19 PM, Chris Burroughs wrote: Thanks Oleg and Zhu. I swear that wasn't a new hotspot version when I checked, but that's obviously not the case. I'll update one node to the latest as soon as I can and report back. RSS over 48 hours with java 6 update 23: http://img716.imageshack.us/img716/5202/u2348hours.png I'll continue monitoring but RSS still appears to grow without bounds. Zhu reported a similar problem with Ubuntu 10.04. While possible, it would seem extraordinarily unlikely that there is a glibc or kernel bug affecting us both. We're seeing a similar problem with one of our clusters (but over a longer time scale). It's possible that it's not a leak, but just fragmentation. Unless you've told it otherwise, the JVM uses glibc's malloc implementation for off-heap allocations. We're currently running a test with jemalloc on one node to see if the problem goes away. -ryan
Re: reduced cached mem; resident set size growth
On 02/02/2011 12:49 PM, Ryan King wrote: We're seeing a similar problem with one of our clusters (but over a longer time scale). It's possible that it's not a leak, but just fragmentation. Unless you've told it otherwise, the JVM uses glibc's malloc implementation for off-heap allocations. We're currently running a test with jemalloc on one node to see if the problem goes away. Thanks Ryan. Is it over a longer time scale because of some action taken to mitigate the problem, or has it always been that long for you?
Re: reduced cached mem; resident set size growth
On Wed, Feb 2, 2011 at 10:29 AM, Chris Burroughs chris.burrou...@gmail.com wrote: On 02/02/2011 12:49 PM, Ryan King wrote: We're seeing a similar problem with one of our clusters (but over a longer time scale). It's possible that it's not a leak, but just fragmentation. Unless you've told it otherwise, the JVM uses glibc's malloc implementation for off-heap allocations. We're currently running a test with jemalloc on one node to see if the problem goes away. Thanks Ryan. Is it over a longer time scale because of some action taken to mitigate the problem, or has it always been that long for you? My guess is that it's a longer timeframe because the cluster is really low traffic (around 100 qps across 10 nodes). -ryan
0.7.0 mx4j, get attribute
I'm using 0.7.0 and experimenting with the new mx4j support. http://host:port/mbean?objectname=org.apache.cassandra.request%3Atype%3DReadStage returns a nice pretty HTML page. For purposes of monitoring I would like to get a single attribute as XML. The docs [1] describe a getattribute endpoint, but I have been unable to get anything other than a blank response from it. mx4j does not seem to include any logging for troubleshooting. Example: http://host:port/getattribute?objectname=org.apache.cassandra.request%3atype%3dReadStage&attribute=PendingTasks returns 200 OK with no data. If anyone could point out what embarrassingly simple mistake I am making I would be much obliged. [1] http://mx4j.sourceforge.net/docs/ch05.html
Re: 0.7.0 mx4j, get attribute
On Wed, Feb 2, 2011 at 10:40 AM, Chris Burroughs chris.burrou...@gmail.com wrote: I'm using 0.7.0 and experimenting with the new mx4j support. http://host:port/mbean?objectname=org.apache.cassandra.request%3Atype%3DReadStage returns a nice pretty HTML page. For purposes of monitoring I would like to get a single attribute as XML. The docs [1] describe a getattribute endpoint, but I have been unable to get anything other than a blank response from it. mx4j does not seem to include any logging for troubleshooting. Example: http://host:port/getattribute?objectname=org.apache.cassandra.request%3atype%3dReadStage&attribute=PendingTasks returns 200 OK with no data. If anyone could point out what embarrassingly simple mistake I am making I would be much obliged. [1] http://mx4j.sourceforge.net/docs/ch05.html Note that many objects in cassandra aren't initialized until they're used for the first time. -ryan
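One thing worth double-checking with hand-built query strings is escaping: the `&` between `objectname` and `attribute` is easy to lose, and the `:`/`=` inside the ObjectName must be percent-encoded. A small helper (the parameter names come from the message above; host and port are placeholders):

```python
from urllib.parse import urlencode

def mx4j_getattribute_url(host, port, objectname, attribute):
    # urlencode percent-encodes the ':' and '=' inside the ObjectName and
    # emits a proper '&' separator between the two parameters.
    query = urlencode({"objectname": objectname, "attribute": attribute})
    return f"http://{host}:{port}/getattribute?{query}"

url = mx4j_getattribute_url(
    "localhost", 8081,
    "org.apache.cassandra.request:type=ReadStage", "PendingTasks")
```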
Re: unsubscribe
How about adding an autosignature with unsubscription info? /Janne On Feb 2, 2011, at 19:42 , Norman Maurer wrote: To make it short.. No it can't. Bye, Norman (ASF Infrastructure Team) 2011/2/2 F. Hugo Zwaal h...@unitedgames.com: Can't the mailinglist server be changed to treat messages with unsubscribe as subject as an unsubscribe as well? Otherwise it will just keep happening, as people simply don't remember or take time to find out? Just my 2 cents... Groets, Hugo. On 2 feb 2011, at 16:54, Jonathan Ellis jbel...@gmail.com wrote: http://wiki.apache.org/cassandra/FAQ#unsubscribe On Wed, Feb 2, 2011 at 7:55 AM, JJ jjcha...@gmail.com wrote: Sent from my iPad -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Counters in 0.8 -- conditional?
I'm looking at http://wiki.apache.org/cassandra/Counters So, the counter feature -- it doesn't seem to count rows based on criteria, such as an index condition. Is that correct? Yes, it's just about supporting counters in and of themselves (which is non-trivial in a distributed system). It is unrelated to counting rows or columns, unless the application happens to use them for that. -- / Peter Schuller
Re: Commit log compaction
Thank you. So what exactly is the condition that causes the older commit log files to actually be removed? I observe that they are indeed rotated out when the threshold is reached, but then new ones are placed in the directory and the older ones are still there. Thanks, Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Commit-log-compaction-tp5985221p5986399.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Counters in 0.8 -- conditional?
Thanks. Just wanted to note that counting the number of rows where foo=bar is a fairly ubiquitous task in db applications. In the case of big data, shipping all that data to the client just to count something isn't optimal at all. Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Counters-in-0-8-conditional-tp5985214p5986442.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: py_stress error in Cassandra 0.7
As the README suggests, you need to run ant gen-thrift-py first. On Wed, Feb 2, 2011 at 2:53 PM, shan...@accenture.com wrote: Hi, I am trying to get the py_stress to work in Cassandra 0.7. I keep getting this error: ubuntu@ip-10-114-85-218:~/apache-cassandra-0.7.0/contrib/py_stress$ python stress.py Traceback (most recent call last): File stress.py, line 520, in module make_keyspaces() File stress.py, line 185, in make_keyspaces cfams = [CfDef(keyspace='Keyspace1', name='Standard1', column_metadata=colms), NameError: global name 'CfDef' is not defined Any suggestions? Thanks, Shan (Susie) Lu, Analyst Accenture Technology Labs - Silicon Valley cell +1 425.749.2546 email shan...@accenture.com This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the email by you is prohibited.
Re: quick shout-out to the riptano/datastax folks!
Thanks, Dave! On Wed, Feb 2, 2011 at 9:17 AM, Dave Viner davevi...@gmail.com wrote: Just a quick shout-out to the riptano folks and becoming part of/forming DataStax! Congrats! -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Slow network writes
You need to use multiple threads to measure throughput. I strongly recommend starting with contrib/stress from the source distribution, which is multithreaded out of the box. On Wed, Feb 2, 2011 at 9:43 AM, ruslan usifov ruslan.usi...@gmail.com wrote: [quoted test script and timing results trimmed] -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
Can I have some more feedback about my schema, perhaps somewhat more critical/harsh? Thanks again, Aditya Narayan On Wed, Feb 2, 2011 at 10:27 PM, Aditya Narayan ady...@gmail.com wrote: [earlier schema discussion quoted in full; trimmed]
Re: Commit log compaction
On Wed, Feb 2, 2011 at 12:29 PM, buddhasystem potek...@bnl.gov wrote: Thank you. So what is exactly the condition that causes the older commit log files to actually be removed? Commit log segments (whose size is controllable via the commitlog_rotation_threshold_in_mb option) are eligible for removal when they do not contain any data that has yet to be flushed to memtables. -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
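The removal condition can be sketched as a toy model: a segment tracks which column families still have unflushed data in it and becomes eligible for deletion only once that set is empty. The names here are hypothetical; this is not Cassandra's actual implementation:

```python
class Segment:
    """Toy model of commit log segment retention (hypothetical names)."""

    def __init__(self):
        self.dirty_cfs = set()   # CFs with data not yet flushed to SSTables

    def write(self, cf):
        self.dirty_cfs.add(cf)   # a mutation for cf lands in this segment

    def on_flush(self, cf):
        self.dirty_cfs.discard(cf)  # cf's memtable was flushed to disk

    def removable(self):
        return not self.dirty_cfs   # nothing unflushed remains

seg = Segment()
seg.write("Users")
seg.write("Timeline")
seg.on_flush("Users")
# The segment still holds unflushed Timeline data, so it must be kept:
keep = not seg.removable()
```

This matches the behavior Maxim observed: rotated-out segments linger until every memtable with data in them has been flushed.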
Re: Pig not reading all cassandra data
I noticed in the jobtracker log that when the pig job kicks off, I get the following info message: 2011-02-02 09:13:07,269 INFO org.apache.hadoop.mapred.JobInProgress: Input size for job job_201101241634_0193 = 0. Number of splits = 1 So I looked at the job.split file that is created for the Pig job and compared it to the job.split file created for the map-reduce job. The map reduce file contains an entry for each split, whereas the job.split file for the Pig job contains just the one split. I added some code to the ColumnFamilyInputFormat to output what it thinks it sees as it should be creating input splits for the pig jobs, and the call to getSplits() appears to be returning the correct list of splits. I can't figure out where it goes wrong though when the splits should be written to the job.split file. Does anybody know the specific class responsible for creating that file in a Pig job, and why it might be affected by using the pig CassandraStorage module? Is anyone else successfully running Pig jobs against a 0.7 cluster? Thanks, Matt
Does an unused ColumnFamily consume resources better used by live CFs?
We have an old test CF and I was wondering if it might be taking resources better used by our app's CFs. Thank you. David
Re: how to change compare_with
I tried help update column family. It gave me:

valid attributes are:
- column_type: Super or Standard
- comment: Human-readable column family description. Any string is acceptable
- rows_cached: Number or percentage of rows to cache
- row_cache_save_period: Period with which to persist the row cache, in seconds
- keys_cached: Number or percentage of keys to cache
- key_cache_save_period: Period with which to persist the key cache, in seconds
- read_repair_chance: Probability (0.0-1.0) with which to perform read repairs on CL.ONE reads
- gc_grace: Discard tombstones after this many seconds
- column_metadata: null
- memtable_operations: Flush memtables after this many operations
- memtable_throughput: ... or after this many bytes have been written
- memtable_flush_after: ... or after this many seconds
- default_validation_class: null
- min_compaction_threshold: Avoid minor compactions of less than this number of sstable files
- max_compaction_threshold: Compact no more than this number of sstable files at once
- column_metadata: Metadata which describes columns of column family. Supported format is [{ k:v, k:v, ... }, { ... }, ...] Valid attributes: column_name, validation_class (see comparator), index_type (integer), index_name.

So what is to be used? And also, if possible, please provide information on how to do that in Java using Hector. Thank you. Vedarth Kulkarni, TYBSc (Computer Science). On Thu, Feb 3, 2011 at 2:58 AM, Jonathan Ellis jbel...@gmail.com wrote: On Wed, Feb 2, 2011 at 12:48 PM, Vedarth Kulkarni vedar...@gmail.com wrote: Hello there, I am using Cassandra 0.7. Is there any way to change the 'compare_with' from my program? I am using Hector and I am programming in Java. Yes. Is it possible to change it from the bin/cassandra-cli? Yes. help update column family; -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Cassandra memory needs
Hi All, I am trying to understand the relationship between data set/SSTable(s) size and the Cassandra heap. Q1. Here is the memory calc from the wiki: For a rough rule of thumb, Cassandra's internal datastructures will require about memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal caches. This formula does not depend on the data set size. Does this mean that, provided Cassandra has sufficient disk space to accommodate a growing data set, it can run in fixed memory for a bulk load? Am I right that the memory impact of compacting increasing SSTable sizes is capped by the parameter in_memory_compaction_limit_in_mb? Q2. What would I need to monitor to predict ahead of time the need to double the number of nodes, assuming sufficient storage per node? Is there a simple rule of thumb saying that for a heap of size X a node can handle an SSTable of size Y? I do realize that I/O and CPU play a role here, but could that be reduced to a factor: Y = f(X) * z where z is 1 for a specified server config? I am assuming a random partitioner and a fixed number of write clients. Q3. Does the formula account for deserialization during reads? What does 1G represent? Thank you very much, Oleg
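The wiki rule of thumb in Q1 translates directly to a quick estimate. The formula is exactly the quoted one; the cache term is left as an input since it depends entirely on configuration:

```python
def rough_heap_mb(memtable_throughput_mb, hot_cfs, cache_mb=0):
    # Wiki rule of thumb: memtable_throughput_in_mb * 3 * number of hot CFs
    # + 1G + internal caches. The cache size is configuration-dependent, so
    # it is passed in rather than guessed.
    return memtable_throughput_mb * 3 * hot_cfs + 1024 + cache_mb

# e.g. 64 MB memtables, 5 hot column families, 256 MB of caches:
estimate = rough_heap_mb(64, 5, cache_mb=256)  # 2240 MB
```

Note that, as the formula suggests, none of the terms scale with the on-disk data set size, only with memtable settings, CF count, and cache configuration.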
Re: Does an unused ColumnFamily consume resources better used by live CFs?
Not if it's been flushed since the last time it was written to. On Wed, Feb 2, 2011 at 1:34 PM, David Dabbs dmda...@gmail.com wrote: We have an old “test” CF and I was wondering if it might be taking resources better used by our app’s CFs. Thank you. David -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: Slow network writes
Is it possible that the key 1212 maps to the first node? I am assuming RF=1. You could try random keys to test this theory... Oleg
RE: py_stress error in Cassandra 0.7
I tried running with the 0.7 version and get this error:

Buildfile: build.xml

gen-thrift-py:
     [echo] Generating Thrift Python code from /home/ubuntu/apache-cassandra-0.7.0/interface/cassandra.thrift
     [exec] [WARNING:/home/ubuntu/apache-cassandra-0.7.0/interface/cassandra.thrift:375] Constant strings should be quoted: ConsistencyLevel.ONE
     [exec] [FAILURE:/home/ubuntu/apache-cassandra-0.7.0/interface/cassandra.thrift:375] type error: const consistency_level was declared as enum

BUILD FAILED
/home/ubuntu/apache-cassandra-0.7.0/build.xml:250: exec returned: 1

Total time: 0 seconds

Thank you, Shan (Susie) Lu, Accenture Tech Labs SV email shan...@accenture.com From: Brandon Williams [mailto:dri...@gmail.com] Sent: Wednesday, February 02, 2011 1:18 PM To: user@cassandra.apache.org Subject: Re: py_stress error in Cassandra 0.7 As the README suggests, you need to run ant gen-thrift-py first. On Wed, Feb 2, 2011 at 2:53 PM, shan...@accenture.com wrote: Hi, I am trying to get py_stress to work in Cassandra 0.7. I keep getting this error:

ubuntu@ip-10-114-85-218:~/apache-cassandra-0.7.0/contrib/py_stress$ python stress.py
Traceback (most recent call last):
  File "stress.py", line 520, in <module>
    make_keyspaces()
  File "stress.py", line 185, in make_keyspaces
    cfams = [CfDef(keyspace='Keyspace1', name='Standard1', column_metadata=colms),
NameError: global name 'CfDef' is not defined

Any suggestions? Thanks, Shan (Susie) Lu, Analyst Accenture Technology Labs - Silicon Valley cell +1 425.749.2546 email shan...@accenture.com This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, please notify the sender immediately and delete the original.
Any other use of the email by you is prohibited.
Re: py_stress error in Cassandra 0.7
That means you have an old version of the Thrift compiler. On Wed, Feb 2, 2011 at 1:54 PM, shan...@accenture.com wrote: I tried running with the 0.7 version and get this error: Buildfile: build.xml gen-thrift-py: [echo] Generating Thrift Python code from /home/ubuntu/apache-cassandra-0.7.0/interface/cassandra.thrift [exec] [WARNING:/home/ubuntu/apache-cassandra-0.7.0/interface/cassandra.thrift:375] Constant strings should be quoted: ConsistencyLevel.ONE [exec] [exec] [exec] [FAILURE:/home/ubuntu/apache-cassandra-0.7.0/interface/cassandra.thrift:375] type error: const consistency_level was declared as enum BUILD FAILED /home/ubuntu/apache-cassandra-0.7.0/build.xml:250: exec returned: 1 Total time: 0 seconds Thank you, Shan (Susie) Lu, Accenture Tech Labs SV email shan...@accenture.com From: Brandon Williams [mailto:dri...@gmail.com] Sent: Wednesday, February 02, 2011 1:18 PM To: user@cassandra.apache.org Subject: Re: py_stress error in Cassandra 0.7 As the README suggests, you need to run ant gen-thrift-py first. On Wed, Feb 2, 2011 at 2:53 PM, shan...@accenture.com wrote: Hi, I am trying to get the py_stress to work in Cassandra 0.7. I keep getting this error: ubuntu@ip-10-114-85-218:~/apache-cassandra-0.7.0/contrib/py_stress$ python stress.py Traceback (most recent call last): File stress.py, line 520, in module make_keyspaces() File stress.py, line 185, in make_keyspaces cfams = [CfDef(keyspace='Keyspace1', name='Standard1', column_metadata=colms), NameError: global name 'CfDef' is not defined Any suggestions? Thanks, Shan (Susie) Lu, Analyst Accenture Technology Labs - Silicon Valley cell +1 425.749.2546 email shan...@accenture.com This message is for the designated recipient only and may contain privileged, proprietary, or otherwise private information. If you have received it in error, please notify the sender immediately and delete the original. Any other use of the email by you is prohibited. 
-- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: py_stress error in Cassandra 0.7
Have you generated the Cassandra Thrift interface? You will need to install Thrift first: http://wiki.apache.org/cassandra/InstallThrift Then, in the interface directory under Cassandra's home, you can run: thrift --gen py cassandra.thrift If the above does not install the generated cassandra module, copy it manually to the site-packages directory of your Python installation. On my server it is in /usr/lib/python/site-packages I hope this helps... Oleg
Re: Counters in 0.8 -- conditional?
Thanks. Just wanted to note that counting the number of rows where foo=bar is a fairly ubiquitous task in db applications. In the case of big data, shipping all this data to the client just to count something isn't optimal at all. You can ask Cassandra to do the counting, but the cost is still going to involve reading the data on the Cassandra end. Hence, O(n) rather than O(1). (It would obviously be nice if counts could be done in O(1), but it's not trivial to implement, or obvious how to do it in a way that is generally useful. Even non-distributed databases like PostgreSQL have issues with that.) -- / Peter Schuller
Re: Slow network writes
2011/2/3 Oleg Proudnikov ol...@cloudorange.com Is it possible that the key 1212 maps to the first node? I am assuming RF=1. You could try random keys to test this theory... Yes, you are right: 1212 goes to the first node. I distributed the tokens as described in Operations (http://wiki.apache.org/cassandra/Operations): 0 and 85070591730234615865843651857942052864. So the delay in my second experiment (where I got a big delay on insert) appears to be the result of communication delays between the nodes?
Re: Counters in 0.8 -- conditional?
Thanks. Yes I know it's by no means trivial. I thought in case there was an index on the column on which I want to place condition, the index machinery itself can do the counting (i.e. when the index is updated, the counter is incremented). It doesn't seem too orthogonal to the current implementation, at least from my very limited experience. Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Counters-in-0-8-conditional-tp5985214p5986871.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
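Maxim's idea above (letting the index machinery keep the count current) can be sketched in a few lines. This is purely an illustrative toy, not Cassandra's index implementation:

```python
class CountingIndex:
    """Toy secondary index that keeps per-value row sets current,
    so 'count rows where foo == value' becomes an O(1) lookup."""

    def __init__(self):
        self.rows_by_value = {}  # indexed value -> set of row keys

    def update(self, row_key, old_value, new_value):
        # Called whenever a row's indexed column changes.
        if old_value is not None:
            self.rows_by_value.get(old_value, set()).discard(row_key)
        if new_value is not None:
            self.rows_by_value.setdefault(new_value, set()).add(row_key)

    def count(self, value):
        # No row scan needed: the index already knows the answer.
        return len(self.rows_by_value.get(value, set()))
```

The hard part in a distributed store is keeping such counts consistent across replicas and concurrent writers, which is why this isn't as simple as it looks here.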
Re: Slow network writes
ruslan usifov ruslan.usifov at gmail.com writes: 2011/2/3 Oleg Proudnikov olegp at cloudorange.com Is it possible that the key 1212 maps to the first node? I am assuming RF=1. You could try random keys to test this theory... Yes, you are right: 1212 goes to the first node. I distributed the tokens as described in Operations (http://wiki.apache.org/cassandra/Operations): 0 and 85070591730234615865843651857942052864. So the delay in my second experiment (where I got a big delay on insert) appears to be the result of communication delays between the nodes? That was the theory, assuming you are using a replication factor of 1. It is difficult to say where the key falls just by looking at the ring - the random partitioner could throw this key onto either node. After writing 1 million rows you could actually see some SSTables in the data directory on one node and none on the other.
Re: Cassandra memory needs
I am trying to understand the relationship between data set/SSTable(s) size and Cassandra heap. http://wiki.apache.org/cassandra/LargeDataSetConsiderations For a rough rule of thumb, Cassandra's internal datastructures will require about memtable_throughput_in_mb * 3 * number of hot CFs + 1G + internal caches. This formula does not depend on the data set size. Does this mean that provided Cassandra has sufficient disk space to accommodate growing data set, it can run in fixed memory for bulk load? No, for reasons that I hope are covered at the above URL. The calculation you refer to has more to do with how you tweak your memtables for performance, which is only loosely coupled to data size. The cost of index sampling and bloom filters is very directly related to database size, however (see the wiki URL). It is essentially a trade-off: where a typical b-tree database would simply start demanding additional seeks as the index size grows larger, Cassandra limits the seeks but instead has stricter memory requirements. If you're only looking to smack huge amounts of data into the database without ever reading them, or reading them very very rarely, it is sub-optimal from a memory perspective. Note though that these are memory requirements per row key, rather than per byte of data. Am I right that memory impact of compacting increasing SSTable sizes is capped by a parameter in_memory_compaction_limit_in_mb? That limits the amount of memory allocated for individual row compactions, yes, and will put a cap on the GC pressure generated, in addition to allowing huge rows to be compacted independently of heap size. Q2. What would I need to monitor to predict ahead the need to double the number of nodes assuming sufficient storage per node? Is there a simple rule of thumb saying that for a heap of size X a node can handle SSTable of size Y?
I do realize that the i/o and CPU play a role here but could that be reduced to a factor: Y = f(X) * z where z is 1 for a specified server config. I am assuming random partitioner and a fixed number of write clients. Disregarding memtable tweaking that will have more to do with throughput, the most important factor in terms of scaling memory requirements w.r.t. data size, is the number of row keys and the length of the average row. I recommend just empirically inserting say 10 million rows with realistic row keys and observing the size of the resulting index and bloom filter files. Take into account to what extent compaction will cause memory usage to temporarily spike. Also take into account that if you plan on having very large rows, the indexes will begin having more than one entry per row (see column_index_size_in_kb in the configuration). If your use-case is somehow truly extreme in the sense of huge data sets with little to no requirement on query efficiency, the per row key costs can be cut down by adjusting index_interval in the configuration to affect the cost of index sampling, and the target false positive rates of bloom filters could be adjusted (in source, not conf) to cut down on that. But really, that would be an unusual thing to do I think and I wouldn't recommend touching that without careful consideration and deep understanding of your expected use-case. Q3. Does the formula account for deserialization during reads? What does 1G represent? I don't know the background of that particular wiki statement, but my guess is that 1G is just sort of a general gut feel good to have base memory size rather than something very specifically calculated. -- / Peter Schuller
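To put rough numbers on the per-row-key bloom filter cost mentioned above, the standard Bloom filter sizing formula can be used. This is a generic estimate of my own, not Cassandra's exact target false positive rate or implementation:

```python
import math

def bloom_filter_bytes(num_keys, false_positive_rate):
    # Optimal Bloom filter size: m = -n * ln(p) / (ln 2)^2 bits,
    # where n is the number of keys and p the false positive rate.
    bits = -num_keys * math.log(false_positive_rate) / (math.log(2) ** 2)
    return int(bits / 8)

# e.g. 10 million row keys at a 1% false positive rate is roughly 11 MB:
print(bloom_filter_bytes(10 * 1000 * 1000, 0.01) // (1024 * 1024))
```

As Peter suggests, inserting a realistic number of rows and measuring the actual filter and index files is still the most reliable approach.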
Re: how to change compare_with
I think Jonathan misspoke. You cannot change the 'compare_with' attribute of an existing column family. The solution is to create a new column family with the data type that you need. See 'help create column family;' -- Tyler Hobbs Software Engineer, DataStax http://datastax.com/ Maintainer of the pycassa http://github.com/pycassa/pycassa Cassandra Python client library
Re: Cassandra memory needs
Oleg, I just wanted to add that I confirmed the importance of that rule of thumb the hard way. I created two extra CFs and was able to reliably crash the nodes during writes. I guess for the final setting I'll rely on results of my testing. But it's also important to not cause the swap death of your machine (i.e. when you go too high on JVM memory). Regards Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Cassandra-memory-needs-tp5986663p5986911.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: how to change compare_with
On Wed, Feb 2, 2011 at 3:01 PM, Tyler Hobbs ty...@datastax.com wrote: I think Jonathan misspoke. I thought I was mistaken, but I was wrong. :) You cannot change the 'compare_with' attribute of an existing column family. You can, but it's up to you to make sure that the new type makes sense. Most frequently, you see this when changing from BytesType to something more structured. (If you screw up and specify a compare_with that is nonsensical for your data, just change it back.) -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
How do I get 0.7.1?
Thanks. Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-do-I-get-0-7-1-tp5986927p5986927.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: Slow network writes
Jonathan, where do I find that contrib/stress? Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/Slow-network-writes-tp5985757p5986937.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
Re: How do I get 0.7.1?
I don't think 0.7.1 is out yet, so you'll have to wait. On Wed, Feb 2, 2011 at 3:17 PM, buddhasystem potek...@bnl.gov wrote: Thanks. Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-do-I-get-0-7-1-tp5986927p5986927.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com. -- Salvador Fuentes Jr.
Re: how to change compare_with
Not only does the type need to make sense, but it also needs to sort in exactly the same order as the previous type did... in which case there would be no reason to change it? We should probably just say no, you cannot do this, and explicitly prevent it. On Wed, Feb 2, 2011 at 3:14 PM, Jonathan Ellis jbel...@gmail.com wrote: On Wed, Feb 2, 2011 at 3:01 PM, Tyler Hobbs ty...@datastax.com wrote: I think Jonathan misspoke. I thought I was mistaken, but I was wrong. :) You cannot change the 'compare_with' attribute of an existing column family. You can, but it's up to you to make sure that the new type makes sense. Most frequently, you see this when changing from BytesType to something more structured. (If you screw up and specify a compare_with that is nonsensical for your data, just change it back.) -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Re: How do I get 0.7.1?
The take #2 vote was canceled due to a couple of issues... take #3 has not been called yet. - Stephen --- Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using swype to type on the screen On 2 Feb 2011 23:29, Sal Fuentes fuente...@gmail.com wrote:
Re: how to change compare_with
Correct. But with more and more clients being able to do intelligent things based on metadata it's not just decoration. (UTF8Type, LexicalUUIDType, BytesType, and AsciiType all have the same ordering. I believe IntegerType and LongType are equivalent orderings as well.) On Wed, Feb 2, 2011 at 3:35 PM, Stu Hood stuh...@gmail.com wrote: Not only does the type need to make sense, but it also needs to sort in exactly the same order as the previous type did... in which case there would be no reason to change it? We should probably just say no, you cannot do this, and explicitly prevent it. On Wed, Feb 2, 2011 at 3:14 PM, Jonathan Ellis jbel...@gmail.com wrote: On Wed, Feb 2, 2011 at 3:01 PM, Tyler Hobbs ty...@datastax.com wrote: I think Jonathan misspoke. I thought I was mistaken, but I was wrong. :) You cannot change the 'compare_with' attribute of an existing column family. You can, but it's up to you to make sure that the new type makes sense. Most frequently, you see this when changing from BytesType to something more structured. (If you screw up and specify a compare_with that is nonsensical for your data, just change it back.) -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
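A quick way to see why comparator types with fixed-width big-endian encodings sort compatibly: for non-negative values, byte-wise order matches numeric order, so a bytes comparator ranks them identically. This is an illustrative sketch, not Cassandra's serializer code:

```python
import struct

# Pack longs as 8-byte big-endian values ("">q"") and sort them
# byte-wise, the way a raw-bytes comparator would:
vals = [3, 100, 7, 0, 2 ** 40]
packed = sorted(struct.pack(">q", v) for v in vals)

# For non-negative longs, byte-wise order equals numeric order.
# (Negative values would sort after positive ones because of the
# sign bit, which is why the equivalence only holds for >= 0.)
assert [struct.unpack(">q", p)[0] for p in packed] == sorted(vals)
```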
Re: How do I get 0.7.1?
Stephen, sorry I didn't understand your missive. Maxim -- View this message in context: http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/How-do-I-get-0-7-1-tp5986927p5987184.html Sent from the cassandra-u...@incubator.apache.org mailing list archive at Nabble.com.
unsubscribe
unsubscribe
Re: unsubscribe
http://wiki.apache.org/cassandra/FAQ#unsubscribe How do I unsubscribe from the email list? Send an email to user-unsubscr...@cassandra.apache.org On Wed, Feb 2, 2011 at 5:24 PM, Ronald Bradford ronald.bradf...@gmail.com wrote: unsubscribe
rolling window of data
Hi, We're trying to use Cassandra 0.7 to store a rolling window of log data (e.g. last 90 days). We use the timestamp of the log entries as the column names so we can do time range queries. Everything seems to be working fine, but it's not clear if there is an efficient way to delete data that is more than 90 days old. Originally I thought that using a slice range on a deletion would do the trick, but that apparently is not supported yet. Another idea I had was to store the timestamp of the log entry as Cassandra's timestamp and pass in artificial timestamps to remove (thrift API), but that seems hacky. Does anyone know if there is a good way to support this kind of rolling window of data efficiently? Thanks. -Jeffrey
Re: rolling window of data
This project may provide some inspiration for you: https://github.com/thobbs/logsandra Not sure if it has a rolling window, if you find out let me know :) Aaron On 03 Feb, 2011, at 06:08 PM, Jeffrey Wang jw...@palantir.com wrote: Hi, We're trying to use Cassandra 0.7 to store a rolling window of log data (e.g. last 90 days). We use the timestamp of the log entries as the column names so we can do time range queries. Everything seems to be working fine, but it's not clear if there is an efficient way to delete data that is more than 90 days old. Originally I thought that using a slice range on a deletion would do the trick, but that apparently is not supported yet. Another idea I had was to store the timestamp of the log entry as Cassandra's timestamp and pass in artificial timestamps to remove (thrift API), but that seems hacky. Does anyone know if there is a good way to support this kind of rolling window of data efficiently? Thanks. -Jeffrey
Tracking down read latency
Hello. We're encountering some high read latency issues, but our main Cassandra expert is out of office so it falls to me. We're more read than write, though there doesn't seem to be many pending reads. I have seen active/pending row-read at three or four, though.

Pool Name                 Active  Pending  Completed
FILEUTILS-DELETE-POOL     0       0        46
STREAM-STAGE              0       0        0
RESPONSE-STAGE            0       0        17471880
ROW-READ-STAGE            1       1        37652361
LB-OPERATIONS             0       0        0
MISCELLANEOUS-POOL        0       0        0
GMFD                      0       0        154630
LB-TARGET                 0       0        0
CONSISTENCY-MANAGER       0       0        2993464
ROW-MUTATION-STAGE        0       0        16383305
MESSAGE-STREAMING-POOL    0       0        0
LOAD-BALANCER-STAGE       0       0        0
FLUSH-SORTER-POOL         0       0        0
MEMTABLE-POST-FLUSHER     0       0        116
FLUSH-WRITER-POOL         0       0        116
AE-SERVICE-STAGE          0       0        0
HINTED-HANDOFF-POOL       0       0        16

Does the high iops on our data mean we need to tune key or other caches?

$ iostat
Linux 2.6.18-194.11.3.el5    02/03/2011
avg-cpu:  %user  %nice  %system  %iowait  %steal  %idle
           1.00   0.00     0.25     1.32    0.00  97.43

Device:  tps     Blk_read/s  Blk_wrtn/s  Blk_read      Blk_wrtn
sda      1.52    6.66        20.53       6262          273904650
sda1     0.00    0.00        0.00        2484          18
sda2     1.46    5.68        18.95       75741946      252831120
sda3     0.06    0.98        1.58        13141400      21073512      # data here
sdb      103.06  13964.72    2718.28     186315436859  36266884800
sdb1     103.06  13964.72    2718.28     186315436235  36266884800   # commit logs here
sdc      1.47    1.71        309.36      22800725      4127423000
sdc1     1.47    1.71        309.36      22799901      4127423000

We're running on a beefy 64-bit Nehalem, so mmap should be available/possible. I need to check with our Cassandra lead when he's available as to why we're not using mmap or auto. From /opt/cassandra/conf/storage-conf.xml:

<DiskAccessMode>standard</DiskAccessMode>

Heap size is 16gb.
JVM_OPTS=" \
        -ea \
        -Xms16G \
        -Xmx16G \
        -XX:+UseParNewGC \
        -XX:+UseConcMarkSweepGC \
        -XX:+CMSParallelRemarkEnabled \
        -XX:SurvivorRatio=8 \
        -XX:MaxTenuringThreshold=1 \
        -XX:CMSInitiatingOccupancyFraction=75 \
        -XX:+UseCMSInitiatingOccupancyOnly \
        -XX:+HeapDumpOnOutOfMemoryError \
        -XX:+UseCompressedOops \
        -XX:+UseThreadPriorities \
        -XX:ThreadPriorityPolicy=42 \
        -Dcassandra.compaction.priority=1"

If I've omitted any key info, please advise and I'll provide. Thanks, David
RE: rolling window of data
Thanks for the link, but unfortunately it doesn't look like it uses a rolling window. As far as I can tell, log entries just keep getting inserted into Cassandra. -Jeffrey From: Aaron Morton [mailto:aa...@thelastpickle.com] Sent: Wednesday, February 02, 2011 9:21 PM To: user@cassandra.apache.org Subject: Re: rolling window of data This project may provide some inspiration for you https://github.com/thobbs/logsandra Not sure if it has a rolling window, if you find out let me know :) Aaron On 03 Feb, 2011,at 06:08 PM, Jeffrey Wang jw...@palantir.com wrote: Hi, We're trying to use Cassandra 0.7 to store a rolling window of log data (e.g. last 90 days). We use the timestamp of the log entries as the column names so we can do time range queries. Everything seems to be working fine, but it's not clear if there is an efficient way to delete data that is more than 90 days old. Originally I thought that using a slice range on a deletion would do the trick, but that apparently is not supported yet. Another idea I had was to store the timestamp of the log entry as Cassandra's timestamp and pass in artificial timestamps to remove (thrift API), but that seems hacky. Does anyone know if there is a good way to support this kind of rolling window of data efficiently? Thanks. -Jeffrey
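Since slice-range deletion isn't supported, one common workaround (my suggestion, not something proposed in this thread) is to bucket entries into one row per log source per day, so expiring old data becomes deleting whole rows rather than slicing columns. A sketch of the key scheme (names are illustrative):

```python
from datetime import date, timedelta

def bucket_key(source, day):
    # One row per log source per day, e.g. "web1:2011-02-02".
    # Columns inside the row are still timestamp-named, so time
    # range queries within a day work as before.
    return "%s:%s" % (source, day.isoformat())

def expired_buckets(source, today, window_days=90, history_days=120):
    # Row keys that have fallen out of the rolling window and can
    # simply be removed wholesale (one row deletion each).
    return [bucket_key(source, today - timedelta(days=d))
            for d in range(window_days + 1, history_days + 1)]
```

Queries spanning multiple days then need a multiget over the day keys, which is the price paid for cheap expiry.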
Re: Tracking down read latency
On Wed, Feb 2, 2011 at 9:35 PM, David Dabbs dmda...@gmail.com wrote: We’re encountering some high read latency issues. What is reporting high read latency? We're more read than write, though there doesn't seem to be many pending reads. I have seen active/pending row-read at three or four, though. In general if you were I/O bound on reads (the most common pathological case) you would see much higher row-read stage pending. [ sane looking tpstats ] Your tpstats does not look like a node which is struggling. avg-cpu: %user %nice %system %iowait %steal %idle 1.00 0.00 0.25 1.32 0.00 97.43 Your system also seems to not be breaking a sweat. We're running on a beefy 64-bit Nehalem, so mmap should be available/possible. I need to check with our Cassandra lead when he's available as to why we're not using mmap or auto. Probably because of : https://issues.apache.org/jira/browse/CASSANDRA-1214 Heap size is 16gb. 16gb out of how much total? Do your GC logs seem to indicate reasonable GC performance? Do all nodes generally have a complete view of the ring and all nodes generally seem to be up? =Rob
Re: Schema Design Question : Supercolumn family or just a Standard column family with columns containing serialized aggregate data?
On Wed, Feb 2, 2011 at 3:27 PM, Aditya Narayan ady...@gmail.com wrote: Can I have some more feedback about my schema perhaps somewhat more criticisive/harsh ? It sounds reasonable to me. Since you're writing/reading all of the subcolumns at the same time, I would opt for a standard column with the tags serialized into a column value. I don't think you need to worry about row lengths here. Depending on the reminder size and how many times it's likely to be repeated in the timeline, you could explore denormalizing a bit more by storing the reminders in the timelines themselves, perhaps with a separate row per (user, tag) combination. This would cut down on your seeks quite a bit, but it may not be necessary at this point (or at all). -- Tyler Hobbs Software Engineer, DataStax http://datastax.com/ Maintainer of the pycassa http://github.com/pycassa/pycassa Cassandra Python client library
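Since all of the subcolumns are written and read together, serializing them into a single standard column value is straightforward. A sketch using JSON as the encoding (one reasonable choice among several; any stable serialization would do):

```python
import json

def pack_subcolumns(subcols):
    # dict of subcolumn name -> value, packed into one column value.
    # sort_keys makes the encoding deterministic.
    return json.dumps(subcols, sort_keys=True).encode("utf-8")

def unpack_subcolumns(blob):
    return json.loads(blob.decode("utf-8"))

# All ~8 "subcolumns" round-trip through a single column value:
row = {"tag1": "rowkey42", "tag2": "rowkey77"}
assert unpack_subcolumns(pack_subcolumns(row)) == row
```

The trade-off is that individual subcolumns can no longer be read or updated without rewriting the whole blob, which is fine here since they always travel together.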
Re: how to change compare_with
Thank you. I got it from the examples provided by Hector. Vedarth Kulkarni, TYBSc (Computer Science). On Thu, Feb 3, 2011 at 6:22 AM, Jonathan Ellis jbel...@gmail.com wrote: Correct. But with more and more clients being able to do intelligent things based on metadata it's not just decoration. (UTF8Type, LexicalUUIDType, BytesType, and AsciiType all have the same ordering. I believe IntegerType and LongType are equivalent orderings as well.) On Wed, Feb 2, 2011 at 3:35 PM, Stu Hood stuh...@gmail.com wrote: Not only does the type need to make sense, but it also needs to sort in exactly the same order as the previous type did... in which case there would be no reason to change it? We should probably just say no, you cannot do this, and explicitly prevent it. On Wed, Feb 2, 2011 at 3:14 PM, Jonathan Ellis jbel...@gmail.com wrote: On Wed, Feb 2, 2011 at 3:01 PM, Tyler Hobbs ty...@datastax.com wrote: I think Jonathan mispoke. I thought I was mistaken, but I was wrong. :) You cannot change the 'compare_with' attribute of an existing column family. You can, but it's up to you to make sure that the new type makes sense. Most frequently, you see this when changing from BytesType to something more structured. (If you screw up and specify a compare_with that is nonsensical for your data, just change it back.) -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com -- Jonathan Ellis Project Chair, Apache Cassandra co-founder of DataStax, the source for professional Cassandra support http://www.datastax.com
Is consistency level QUORUM broken on Cassandra 0.7.0 and 0.6.11?
As noted in this issue: https://issues.apache.org/jira/browse/CASSANDRA-2081. Does this mean that QUORUM doesn't work on 0.7.0 and 0.6.11?
performance degradation in cluster
The first time I ran a single instance of Cassandra and my application on a system (16GB RAM and 8 cores), the time taken was 480 sec. When I added one more system (this time running 2 instances of Cassandra in a cluster) and ran the application from a single client, I found the time taken increased to 1000 sec. I also found that the data distribution was very odd across the two systems (one system had about 2.5GB of data and the other 140MB). Is any configuration required while running Cassandra in a cluster other than adding seeds? Thanks Regards, abhinav
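Uneven load like that (2.5GB vs 140MB) usually means the nodes' initial tokens are unbalanced. Evenly spaced RandomPartitioner tokens can be computed with a one-liner; for two nodes this yields the same values the Operations wiki quotes:

```python
def balanced_tokens(nodes):
    # Evenly spaced initial tokens across the RandomPartitioner's
    # 0 .. 2**127 token range, one per node.
    return [2 ** 127 // nodes * i for i in range(nodes)]

# Two nodes: 0 and 85070591730234615865843651857942052864
print(balanced_tokens(2))
```

Each node's computed token would then be set as its initial_token in the configuration (or applied with nodetool move).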
Re: Slow network writes
2011/2/3 Oleg Proudnikov ol...@cloudorange.com ruslan usifov ruslan.usifov at gmail.com writes: 2011/2/3 Oleg Proudnikov olegp at cloudorange.com Is it possible that the key 1212 maps to the first node? I am assuming RF=1. You could try random keys to test this theory... Yes, you are right: 1212 goes to the first node. I distributed the tokens as described in Operations (http://wiki.apache.org/cassandra/Operations): 0 and 85070591730234615865843651857942052864. So the delay in my second experiment (where I got a big delay on insert) appears to be the result of communication delays between the nodes? That was the theory, assuming you are using a replication factor of 1. It is difficult to say where the key falls just by looking at the ring - the random partitioner could throw this key onto either node. After writing 1 million rows

Hmm, this is very simple to calculate for the random partitioner; this Python script does it:

from hashlib import md5

def tokens(nodes):
    l_retval = []
    for x in xrange(nodes):
        l_retval.append(2 ** 127 / nodes * x)
    return l_retval

def wherekey(key, orderednodetokens):
    l_m = md5()
    l_m.update(key)
    l_keytoken = long(l_m.hexdigest(), 16)
    l_found = False
    l_i = 0
    for l_nodetoken in orderednodetokens:
        if l_keytoken <= l_nodetoken:
            l_found = True
            break
        l_i += 1
    if l_found:
        return l_i
    return 0

ring = tokens(2)
print wherekey("1212", ring)

So for key 1212, node 0 will be chosen (10.24.84.4 in my case).