Re: Heap is N.N full. Immediately on startup
(I am monitoring via a VisualVM plugin that shows generation sizes.) I've increased the index sampling rate from 64 to 512 and the bloom filter fp ratio from the default to 0.1 (and rebuilt sstables) - still getting 1-2 sstable reads with LCS. However, at startup I see a 5GB old gen (which seems very stable at around 5.5GB under moderate 90:10 read:write load - a couple hundred q/s). Is that a realistic size with these values and a ~200GB CF with 300-400 million rows? (No secondary indexes, skinny rows of 1-5K each.)

Also, hinted handoff is off, as full GC pauses of 1-3 seconds on nodes I have not yet attended to (sampling rate at 64, bloom fp default) are kicking off the failure detector even at a phi_convict_threshold of 12 (apparently?) and everything is in an up/down dance :-)

Thanks!
Andras

From: aaron morton aa...@thelastpickle.com
Reply-To: user@cassandra.apache.org
Date: Friday 22 February 2013 17:44
To: user@cassandra.apache.org
Subject: Re: Heap is N.N full. Immediately on startup

To get a good idea of how GC is performing, turn on the GC logging in cassandra-env.sh. After a full CMS GC event, see how big the tenured heap is. If it's not reducing enough, then GC will never get far enough ahead.

Cheers
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 22/02/2013, at 8:37 AM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com wrote:

Thank you - indeed my index interval is 64 with a CF of 300M rows, and the bloom filter false positive chance was the default. Raising the index interval to 512 didn't fix this alone, so I guess I'll have to set the bloom filter to some reasonable value and scrub.
From: aaron morton aa...@thelastpickle.com
Reply-To: user@cassandra.apache.org
Date: Thursday 21 February 2013 17:58
To: user@cassandra.apache.org
Subject: Re: Heap is N.N full. Immediately on startup

My first guess would be the bloom filter and index sampling from lots-o-rows.
Check the row count in cfstats.
Check the bloom filter size in cfstats.

Background on memory requirements: http://www.mail-archive.com/user@cassandra.apache.org/msg25762.html

Cheers
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 20/02/2013, at 11:27 PM, Andras Szerdahelyi andras.szerdahe...@ignitionone.com wrote:

Hey list,

Any ideas (before I take a heap dump) what might be consuming my 8GB JVM heap at startup in Cassandra 1.1.6, besides:
* row cache: not persisted, and at 0 keys when this warning is produced
* memtables: no write traffic at startup; my app's column families are durable_writes:false
* pending tasks: no pending tasks, except for 928 compactions (not sure where those are coming from)

I drew these conclusions from the StatusLogger output below:

INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 GCInspector.java (line 122) GC for ConcurrentMarkSweep: 14959 ms for 2 collections, 7017934560 used; max is 8375238656
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,198 StatusLogger.java (line 57) Pool Name  Active  Pending  Blocked
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,199 StatusLogger.java (line 72) ReadStage  0  0  0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) RequestResponseStage  0  0  0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) ReadRepairStage  0  0  0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,200 StatusLogger.java (line 72) MutationStage  0-1  0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) ReplicateOnWriteStage  0  0  0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) GossipStage  0  0  0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) AntiEntropyStage  0  0  0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) MigrationStage  0  0  0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,201 StatusLogger.java (line 72) StreamStage  0  0  0
INFO [ScheduledTasks:1] 2013-02-20 05:13:25,202 StatusLogger.java (line 72)
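For reference, the settings discussed in this thread live in three places in the 1.1.x line. This is a hedged sketch, not config taken from the thread - the file paths and the CF name "MyCF" are illustrative:

```
# cassandra.yaml (node-wide, restart required)
index_interval: 512            # default 128; larger = less heap, slower key lookups
phi_convict_threshold: 12      # default 8; raise to tolerate long GC pauses

# cassandra-cli (per column family; rebuild sstables afterwards to apply)
update column family MyCF with bloom_filter_fp_chance = 0.1;

# cassandra-env.sh -- enable GC logging to watch tenured-heap occupancy
# after CMS collections, as Aaron suggests
JVM_OPTS="$JVM_OPTS -XX:+PrintGCDetails -XX:+PrintGCDateStamps"
JVM_OPTS="$JVM_OPTS -Xloggc:/var/log/cassandra/gc.log"
```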
Re: Data Model - Additional Column Families or one CF?
Greetings! Thank you very much for sharing your insight and experience. I am trying to migrate a normalized schema - a 1 TB database. The data is hierarchical: child entities carry foreign keys to the parent entities. There are several specialization hierarchies, like ShapeTable with Circle, Square, Rectangle, etc. Often child records are added to a parent. Some fields are updated. Rarely, some child records are deleted.

What is the ideal schema for Cassandra? Should I create one ColumnFamily with many SuperColumns, one for each Shape? I am trying to utilize the partitioning feature to distribute the load around the world according to the jurisdiction of the parent record. I am particularly interested in learning how to define the Shape CF but insert Rectangle and Circle rows with different columns.

Thank you for your help.
Regards,
Raman
please explain read path when key not in database
Hello! Please explain how this works when I request a key which is not in the database:

* The closest node (as determined by proximity sorting as described above) will be sent a command to perform an actual data read (i.e., return data to the co-ordinating node).
* As required by consistency level, additional nodes may be sent digest commands, asking them to perform the read locally but send back the digest only.
  o For example, at replication factor 3 a read at consistency level QUORUM would require one digest read in addition to the data read sent to the closest node. (See ReadCallback http://wiki.apache.org/cassandra/ReadCallback, instantiated by StorageProxy http://wiki.apache.org/cassandra/StorageProxy)

I have multi-DC with NetworkTopologyStrategy and RF:1 per datacenter, and reads are at consistency level ONE. If the local node responsible for the key replied that it has no data for this key - will the coordinator send digest commands?

Thanks!
RE: Read Perf
Thanks. For our case, the number of rows will more or less be the same. The only thing that changes is the columns, and they keep getting added.

-----Original Message-----
From: Hiller, Dean [mailto:dean.hil...@nrel.gov]
Sent: 26 February 2013 09:21
To: user@cassandra.apache.org
Subject: Re: Read Perf

To find stuff on disk, there is a bloom filter for each file in memory. Per the docs, 1 billion rows takes about 2 Gig of RAM, so it really will have a huge dependency on your number of rows. As you get more rows, you may need to raise the bloom filter false positive chance to use less RAM, but that means slower reads. I.e. as you add more rows, you will have slower reads on a single machine. We hit the RAM limit on one machine with 1 billion rows, so we are in the process of tweaking the ratio from 0.000744 (the default) to 0.1 to give us more time to solve it. Since we see no I/O load on our machines (or rather extremely little), we plan on moving to leveled compaction, where 0.1 is the default in new releases; the new size-tiered default I think is 0.01. I.e. if you store more data per row, this is not as much of an issue, but still something to consider. (Also, rows have a limit I think as well on data size, but I'm not sure what that is. I know the column limit on a row is in the millions, somewhere lower than 10 million.)

Later,
Dean

From: Kanwar Sangha kan...@mavenir.com
Reply-To: user@cassandra.apache.org
Date: Monday, February 25, 2013 8:31 PM
To: user@cassandra.apache.org
Subject: Read Perf

Hi - I am doing a performance run using a modified YCSB client. I was able to populate 8TB on a node and then ran some read workloads. I am seeing an average TPS of 930 ops/sec for random reads. There is no key cache/row cache.

Question - Will the read TPS degrade if the data size increases to say 20 TB, 50 TB, 100 TB? If I understand correctly, the reads should remain constant irrespective of the data size, since we eventually have sorted SSTables and a binary search would be done on the index filter to find the row?

Thanks,
Kanwar
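The bloom filter memory figure quoted above (~2 Gig for 1 billion rows at the old default fp chance) can be sanity-checked with the textbook Bloom-filter sizing formula. A rough sketch - this is the general optimum, not Cassandra's exact implementation, which sizes filters per sstable:

```python
import math

def bloom_bytes(n_rows, fp_chance):
    """Memory for an optimally sized Bloom filter:
    bits = -n * ln(p) / (ln 2)^2  (standard sizing formula)."""
    bits = -n_rows * math.log(fp_chance) / (math.log(2) ** 2)
    return bits / 8  # bytes

rows = 1_000_000_000
for p in (0.000744, 0.1):  # old size-tiered default vs relaxed setting
    print(f"fp_chance={p}: ~{bloom_bytes(rows, p) / 1e9:.2f} GB")
# ~1.87 GB at the old default (consistent with the "2 Gig" rule of thumb),
# dropping to ~0.60 GB at fp_chance = 0.1
```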
Re: Data Model - Additional Column Families or one CF?
The bottleneck is RAM; each CF uses more RAM. We tried to go above 15,000 column families and that hurt big time, so we added a feature to PlayOrm and now have 60,000 virtual column families all in one column family. This turned out to be a HUGE benefit, as it is now easy to modify cluster settings for those 60,000 tables in one shot. Right now, we are modifying the false positive ratio to free up RAM on just the one column family that all 60,000 run in.

You can always follow the same pattern PlayOrm uses, which is just to prefix every row key, but we preferred to have something else do the heavy lifting for us (i.e. PlayOrm); plus the command line tool isn't too bad for inspecting the data with PlayOrm either.

Of course, we now have 7 billion rows and that is a new issue for us. While we could just add more nodes and scale, they are pushing us to make it even more cost effective, so we are trying to scale on a single node first, then scale out.

Later,
Dean

From: Javier Sotelo javier.a.sot...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Tuesday, February 26, 2013 12:27 AM
To: user@cassandra.apache.org
Subject: Re: Data Model - Additional Column Families or one CF?

Aaron,

Would 50 CFs be pushing it? According to http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management, "This has been tested to work across hundreds or even thousands of ColumnFamilies." What is the bottleneck, IO?

Thanks,
Javier

On Sun, Feb 24, 2013 at 5:51 PM, Adam Venturella aventure...@gmail.com wrote:

Thanks Aaron, this was a big help!

On Thu, Feb 21, 2013 at 9:27 AM, aaron morton aa...@thelastpickle.com wrote:

If you have a limited / known number (say 30) of types, I would create a CF for each of them. If the number of types is unknown or very large, I would have one CF with the row key you described. Generally I avoid data models that require new CFs as the data grows. Additionally, having different CFs allows you to use different cache settings, compaction settings and even storage mediums.

Cheers
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 21/02/2013, at 7:43 AM, Adam Venturella aventure...@gmail.com wrote:

My data needs only require me to store JSON, and I can handle this in one column family by prefixing row keys with a type, for example: comments:{message_id}, where "comments:" is the prefix and {message_id} is some row key to a message object in the same column family. In this case comments:{message_id} would be a wide row using comment creation time and descending clustering order to sort the messages as they are added.

My question is: would I be better off splitting comments into their own column family, or is storing them in with the messages column family sufficient (they are all messages, after all)? Or do column families really just provide a nice organizational front for data? I'm just storing JSON.
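The row-key prefixing described above (comments:{message_id} - the same pattern PlayOrm automates as virtual column families) is a simple key composition. A minimal sketch, with function names invented for illustration:

```python
def vkey(virtual_cf, row_key):
    """Compose a physical row key from a virtual-CF prefix and a logical key.
    ':' is safe as a separator only if virtual CF names never contain it."""
    return f"{virtual_cf}:{row_key}".encode()

def split_vkey(physical):
    """Recover (virtual_cf, logical_key) from a physical row key."""
    cf, _, key = physical.decode().partition(":")
    return cf, key

k = vkey("comments", "message-42")
assert k == b"comments:message-42"
assert split_vkey(k) == ("comments", "message-42")
```

One caveat of the pattern: all virtual CFs share the physical CF's settings (compaction, caching, bloom fp chance), which is exactly the trade-off Dean describes - one-shot tuning, but no per-type tuning.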
Re: Read Perf
In that case, make sure you don't plan on going into the millions, or test the limit, as I'm pretty sure it can't go above 10 million (from previous posts on this list).

Dean

On 2/26/13 8:23 AM, Kanwar Sangha kan...@mavenir.com wrote:

Thanks. For our case, the number of rows will more or less be the same. The only thing that changes is the columns, and they keep getting added.
Re: Data Model - Additional Column Families or one CF?
Oh, and 50 CFs should be fine.

Dean

From: Javier Sotelo javier.a.sot...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Tuesday, February 26, 2013 12:27 AM
To: user@cassandra.apache.org
Subject: Re: Data Model - Additional Column Families or one CF?

Aaron,

Would 50 CFs be pushing it? According to http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-improved-memory-and-disk-space-management, "This has been tested to work across hundreds or even thousands of ColumnFamilies." What is the bottleneck, IO?

Thanks,
Javier
Re: please explain read path when key not in database
This is my understanding from using Cassandra for around 2 years… (though I still make mistakes sometimes).

For a CL.ONE read: depending on the client, the request may go through one of its known nodes (the co-ordinating node), which forwards it to the real node (clients like Astyanax/Hector read in the ring information and usually go direct, so for CL.ONE no co-ordination is really needed). The node the request finally gets to may not have the data yet and will return no row, while the other 2 nodes might have it.

For a CL.QUORUM read with RF=3: the client goes to a node with the data (again, depending on the client), and that node sends off a request to one of the other 2. Let's say A does not have the row yet, but B has it: the results are compared, the latest wins, and a repair for that row is kicked off to get all nodes in sync on it.

"If the local node responsible for the key replied that it has no data for this key - will the coordinator send digest commands?"

It looks like CL.ONE does trigger a read repair, according to this doc (found by googling "CL_ONE read repair cassandra"):
http://cassandra-user-incubator-apache-org.3065146.n2.nabble.com/CL-ONE-reads-RR-badness-threshold-interaction-td6247418.html
http://wiki.apache.org/cassandra/ReadRepair

Later,
Dean
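Dean's description of the data-read/digest-read flow can be sketched as simplified pseudocode - this is an illustration of what ReadCallback/StorageProxy do, not Cassandra's actual classes, and the node names are invented:

```python
import hashlib

def digest(value):
    # A missing row digests to a fixed sentinel so it can still be compared.
    return hashlib.md5(b"<missing>" if value is None else value).digest()

def coordinator_read(replicas, consistency):
    """replicas: ordered dict of node -> stored value (None = no data).
    The coordinator sends one data read plus (consistency - 1) digest reads."""
    nodes = list(replicas)
    data_node, digest_nodes = nodes[0], nodes[1:consistency]
    data = replicas[data_node]
    mismatched = [n for n in digest_nodes if digest(replicas[n]) != digest(data)]
    # Real Cassandra resolves mismatches by timestamp and writes the winner
    # back to stale replicas (read repair); here we only report the conflict.
    return data, mismatched

# CL.ONE: only the closest node is consulted, so a replica missing the row
# returns "not found" even though the other replicas have it.
assert coordinator_read({"A": None, "B": b"row", "C": b"row"}, 1) == (None, [])
# CL.QUORUM at RF=3 (2 nodes): the digest read exposes the mismatch.
assert coordinator_read({"A": None, "B": b"row", "C": b"row"}, 2) == (None, ["B"])
```

This also answers the RF:1-per-DC question structurally: with only one replica per datacenter and CL.ONE, there is no second replica to send a digest to.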
Re: Read Perf
Depends. Are you:

1. Reading the same size of data as the data set grows? (Reading more data does generally get slower - e.g. reading 1MB vs. 10MB.)
2. Reading the same number of columns as the data set grows?
3. Never reading in the entire row?

If the answer to all of the above is yes, yes, yes, then it should be fine, but it is always better to test. ALSO, a big note: you MUST test doing a read repair, as that will slow things down BIG TIME. We only have 130GB per node, and in general Cassandra is made for 300GB to 500GB per node on a 1TB drive (typical config). This is due to the maintenance, so TEST your maintenance stuff before you get burned there. Just run nodetool upgradesstables and time it. This definitely gets slower as your data grows and gives you a good idea of how long operations will take. Better yet, take a node completely out, wipe it, put it back in, and see how long it takes to get all the data back in by running the read repair. With 10TB, you will have a lot of issues, I imagine.

Dean

On 2/26/13 8:43 AM, Kanwar Sangha kan...@mavenir.com wrote:

Yep. So the read will remain constant in this case?
Re: Data Model - Additional Column Families or one CF?
Thanks Dean, very helpful info.

Javier

On Tue, Feb 26, 2013 at 7:33 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

Oh, and 50 CFs should be fine.

Dean
Re: Data Model - Additional Column Families or one CF?
http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/whytf_would_i_need_with

On Tue, Feb 26, 2013 at 12:28 PM, Javier Sotelo javier.a.sot...@gmail.com wrote:

Thanks Dean, very helpful info.

Javier
Re: Authentication and Authorization with Cassandra 1.2.2.
does this help? The links at the bottom show the CQL statements to add/modify users: http://www.datastax.com/docs/1.2/security/native_authentication

On Feb 26, 2013, at 4:06 PM, C.F.Scheidecker Antunes cf.antu...@gmail.com wrote:

Hello all,

Cassandra has changed and now has a default authentication and authorization mechanism. The classes org.apache.cassandra.auth.PasswordAuthenticator (authentication) and org.apache.cassandra.auth.CassandraAuthorizer (authorization) provide that. They both write to a keyspace called system_auth, and there are 2 column families used for it, namely credentials and permissions.

The permissions table is defined in CassandraAuthorizer as follows:

CREATE TABLE system_auth.permissions (
  username text,
  resource text,
  permissions set<text>,
  PRIMARY KEY (username, resource)
) WITH gc_grace_seconds = (90 * 24 * 60 * 60) // 3 months

The credentials table is created in PasswordAuthenticator as follows:

CREATE TABLE system_auth.credentials (
  username text,
  salted_hash text,       // salt + hash + number of rounds
  options map<text,text>, // for future extensions
  PRIMARY KEY (username)
) WITH gc_grace_seconds = (90 * 24 * 60 * 60) // 3 months

The password is hashed as BCrypt.hashpw(password, BCrypt.gensalt(GENSALT_LOG2_ROUNDS)), where GENSALT_LOG2_ROUNDS is set to 10.

Out of the box, the keyspace system_auth is there, but the CFs are not defined when one issues "describe system_auth" inside the cassandra-cli application. The configuration file says: "PasswordAuthenticator relies on username/password pairs to authenticate users. It keeps usernames and hashed passwords in the system_auth.credentials table. Please increase the system_auth keyspace replication factor if you use this authenticator."

In the configuration file /etc/cassandra/cassandra.yaml I have set:

authenticator: org.apache.cassandra.auth.PasswordAuthenticator
authorizer: org.apache.cassandra.auth.CassandraAuthorizer

Therefore I have 3 questions:

1) How can I increase the replication factor if the keyspace system_auth is already there? Can I do this? Currently the replication factor is 1:

[cassandra@system_auth] describe;
Keyspace: system_auth:
  Replication Strategy: org.apache.cassandra.locator.SimpleStrategy
  Durable Writes: true
  Options: [replication_factor:1]
  Column Families:

2) Shall I create the CFs credentials and permissions via cassandra-cli as well? If I issue a select command from cqlsh I can see:

cqlsh:system_auth> SELECT * FROM credentials;

 username  | options | salted_hash
-----------+---------+-------------
 cassandra |    null |

even though there is no credentials CF defined in the schema yet.

3) What is the process of adding more users? Shall I do it via cassandra-cli and/or cqlsh? How shall I specify the read and write privileges, as well as the keyspaces for which a user has rights? Something like this:

OpsCenter.rw=carlos
system.rw=carlos
system_traces.rw=carlos
nando.rw=carlos
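For questions 1 and 3, the usual approach (hedged - please verify against the 1.2 security docs linked above; the keyspace, password and user names below are illustrative) is to do both from cqlsh: ALTER KEYSPACE to raise the replication factor, then CREATE USER / GRANT for accounts and per-keyspace privileges:

```sql
-- 1) Raise replication so every node holds the auth data
--    (run nodetool repair on system_auth on each node afterwards):
ALTER KEYSPACE system_auth
  WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};

-- 3) Add a user and grant per-keyspace rights (run as the default superuser):
CREATE USER carlos WITH PASSWORD 'secret' NOSUPERUSER;
GRANT ALL ON KEYSPACE nando TO carlos;
GRANT SELECT ON KEYSPACE "OpsCenter" TO carlos;
```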
Re: Bulk Loading-Unable to select from CQL3 tables with NO COMPACT STORAGE option after Bulk Loading - Cassandra version 1.2.1
CQL 3 tables that do not use compact storage use Composite Types, which other code may not be expecting. Take a look at the CQL 3 table definitions through cassandra-cli, and you may see the changes you need to make when creating the SSTables.

Cheers
Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 26/02/2013, at 3:44 AM, praveen.akun...@wipro.com wrote:

Hi All,

I am using the bulk loader program provided on the Datastax website: http://www.datastax.com/dev/blog/bulk-loading

I am able to load data into tables created with the COMPACT STORAGE option and also into tables created without this option. However, I am unable to read data from the table created without COMPACT STORAGE. I created 2 tables as below:

CREATE TABLE TABLE1 (
  field1 text PRIMARY KEY,
  field2 text,
  field3 text,
  field4 text
) WITH COMPACT STORAGE;

CREATE TABLE TABLE2 (
  field1 text PRIMARY KEY,
  field2 text,
  field3 text,
  field4 text
);

Now, I loaded these 2 tables using the Java bulk loader program (create SSTables and load them using the sstableloader utility). I can read the data from TABLE1, but when I try to read data from TABLE2, I get a timeout from both cqlsh and the cli.

Is this expected behavior, or am I doing something wrong? Can anyone please help?

Thanks & Best Regards,
Praveen
Re: no other nodes seen on priam cluster
Hi Marcelo,

A few questions: Have you added the Priam java agent to Cassandra's JVM arguments (e.g. -javaagent:$CASS_HOME/lib/priam-cass-extensions-1.1.15.jar), and does the web container running Priam have permission to write to the Cassandra config directory? Also, what do the Priam logs say?

If you want to get up and running quickly with Cassandra, AWS and Priam, check out www.instaclustr.com. We deploy Cassandra under your AWS account and you have full root access to the nodes if you want to explore and play around, plus there is a free tier which is great for experimenting and trying Cassandra out.

Cheers
Ben

On Wed, Feb 27, 2013 at 6:09 AM, Marcelo Elias Del Valle mvall...@gmail.com wrote:

Hello, I am using Cassandra 1.2.1 and I am trying to set up a Priam cluster on AWS with two nodes. However, I can't get both nodes up and running because of a weird error (at least to me). When I start both nodes, they are able to connect to each other and do some communication. However, after some seconds I just see "java.lang.RuntimeException: No other nodes seen!", so they disconnect and die. I tested all ports (7000, 9160 and 7199) between both nodes and there is no firewall. On the second node, before the above exception, I get a broken pipe, as shown below. Any hint?

DEBUG 18:54:31,776 attempting to connect to /10.224.238.170
DEBUG 18:54:32,402 Reseting version for /10.224.238.170
DEBUG 18:54:32,778 Connection version 6 from /10.224.238.170
DEBUG 18:54:32,779 Upgrading incoming connection to be compressed
DEBUG 18:54:32,779 Max version for /10.224.238.170 is 6
DEBUG 18:54:32,779 Setting version 6 for /10.224.238.170
DEBUG 18:54:32,780 set version for /10.224.238.170 to 6
DEBUG 18:54:33,455 Disseminating load info ...
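For reference, the agent flag Ben mentions normally goes into cassandra-env.sh; a sketch, where $CASS_HOME and the jar version are taken from his example and may differ on your install:

```shell
# cassandra-env.sh -- append the Priam agent to Cassandra's JVM options.
# The path and the 1.1.15 version are assumptions; match your layout.
JVM_OPTS="$JVM_OPTS -javaagent:$CASS_HOME/lib/priam-cass-extensions-1.1.15.jar"
```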
DEBUG 18:54:59,082 Reseting version for /10.224.238.170
DEBUG 18:55:00,405 error writing to /10.224.238.170
java.io.IOException: Broken pipe
    at sun.nio.ch.FileDispatcher.write0(Native Method)
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:29)
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:72)
    at sun.nio.ch.IOUtil.write(IOUtil.java:43)
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:334)
    at java.nio.channels.Channels.writeFullyImpl(Channels.java:59)
    at java.nio.channels.Channels.writeFully(Channels.java:81)
    at java.nio.channels.Channels.access$000(Channels.java:47)
    at java.nio.channels.Channels$1.write(Channels.java:155)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:65)
    at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:123)
    at org.xerial.snappy.SnappyOutputStream.flush(SnappyOutputStream.java:272)
    at java.io.DataOutputStream.flush(DataOutputStream.java:106)
    at org.apache.cassandra.net.OutboundTcpConnection.writeConnected(OutboundTcpConnection.java:189)
    at org.apache.cassandra.net.OutboundTcpConnection.run(OutboundTcpConnection.java:143)
DEBUG 18:55:01,405 attempting to connect to /10.224.238.170
DEBUG 18:55:01,461 Started replayAllFailedBatches
DEBUG 18:55:01,462 forceFlush requested but everything is clean in batchlog
DEBUG 18:55:01,463 Finished replayAllFailedBatches
INFO 18:55:01,472 JOINING: schema complete, ready to bootstrap
DEBUG 18:55:01,473 ... got ring + schema info
INFO 18:55:01,473 JOINING: getting bootstrap token
ERROR 18:55:01,475 Exception encountered during startup
java.lang.RuntimeException: No other nodes seen! Unable to bootstrap. If you intended to start a single-node cluster, you should make sure your broadcast_address (or listen_address) is listed as a seed. Otherwise, you need to determine why the seed being contacted has no knowledge of the rest of the cluster. Usually, this can be solved by giving all nodes the same seed list.
and on the first node:

DEBUG 18:54:30,833 Disseminating load info ...
DEBUG 18:54:31,532 Connection version 6 from /10.242.139.159
DEBUG 18:54:31,533 Upgrading incoming connection to be compressed
DEBUG 18:54:31,534 Max version for /10.242.139.159 is 6
DEBUG 18:54:31,534 Setting version 6 for /10.242.139.159
DEBUG 18:54:31,534 set version for /10.242.139.159 to 6
DEBUG 18:54:31,542 Reseting version for /10.242.139.159
DEBUG 18:54:31,791 Connection version 6 from /10.242.139.159
DEBUG 18:54:31,792 Upgrading incoming connection to be compressed
DEBUG 18:54:31,792 Max version for /10.242.139.159 is 6
DEBUG 18:54:31,792 Setting version 6 for /10.242.139.159
DEBUG 18:54:31,793 set version for /10.242.139.159 to 6
INFO 18:54:32,414 Node /10.242.139.159 is now part of the cluster
DEBUG 18:54:32,415 Resetting pool for /10.242.139.159
DEBUG 18:54:32,415 removing expire time for endpoint : /10.242.139.159
INFO 18:54:32,415 InetAddress /10.242.139.159 is now UP
DEBUG 18:54:32,789 attempting to connect to ec2-75-101-233-115.compute-1.amazonaws.com/10.242.139.159
DEBUG 18:54:58,840 Started replayAllFailedBatches
DEBUG
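For what it's worth, the "No other nodes seen" exception usually comes down to the seed list: all nodes must share the same list, and a node's own broadcast_address (or listen_address) must appear in it if that node is meant to be a seed. Under Priam the list is normally supplied by Priam's own seed provider, but the effective configuration should end up equivalent to something like this cassandra.yaml fragment (addresses taken from the logs above):

```yaml
# cassandra.yaml -- sketch of a consistent two-node seed list.
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.224.238.170,10.242.139.159"
```

If Priam is generating a different (or empty) list on one node, that would explain why each node briefly gossips and then decides it is alone.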