Re: data model question : finding out the n most recent changes items
What you described sounds like the most appropriate:

CREATE TABLE user_file (
    user_id uuid,
    modified_date timestamp,
    file_id timeuuid,
    PRIMARY KEY (user_id, modified_date)
);

If you normally need more information about the file, then either store that as additional fields or pack the data using something like JSON or Protobuf.

> my return list may still not be accurate because a single directory could have a lot of modification changes. I basically end up pulling out a series of modification timestamps for the same directory.

Not sure I understand the problem.

Cheers
-----------------
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 10/07/2013, at 6:51 PM, Jimmy Lin y2klyf+w...@gmail.com wrote:

I have an application that needs to find out the n most recently modified files for a given user id. I started out with a few tables but still couldn't get what I want; I hope someone can point me in the right direction. See my tables below.

#1 won't work, because file_id's timeuuid contains the creation time, not the modification time.

#2 won't work, because I can't order by a non-primary-key column (modified_date).

#3, #4: although I can now get a time series of modification times for each file belonging to a user, my return list may still not be accurate because a single directory could have a lot of modification changes. I basically end up pulling out a series of modification timestamps for the same directory.

Any suggestion? Thanks

#1
CREATE TABLE user_file (
    user_id uuid,
    file_id timeuuid,
    PRIMARY KEY (user_id, file_id)
);

#2
CREATE TABLE user_file (
    user_id uuid,
    file_id timeuuid,
    modified_date timestamp,
    PRIMARY KEY (user_id, file_id)
);

#3
CREATE TABLE user_file (
    user_id uuid,
    file_id timeuuid,
    modified_date timestamp,
    PRIMARY KEY (user_id, file_id, modified_date)
);

#4
CREATE TABLE user_file (
    user_id uuid,
    modified_date timestamp,
    file_id timeuuid,
    PRIMARY KEY (user_id, modified_date, file_id)
);
Re: data model question : finding out the n most recent changes items
What I mean is, I really just want the last modified date instead of a series of timestamps, and still be able to sort or order by it. (Maybe I should rephrase my question as: how do I sort or order by a last-modified column in a row?)

CREATE TABLE user_file (
    user_id uuid,
    modified_date timestamp,
    file_id timeuuid,
    PRIMARY KEY (user_id, modified_date)
);

E.g. user1 updates file A 3 times in a row, then updates file B, then updates file A again:

insert into user_file values (user1_uuid, date1, file_a_uuid);
insert into user_file values (user1_uuid, date2, file_a_uuid);
insert into user_file values (user1_uuid, date3, file_a_uuid);
insert into user_file values (user1_uuid, date4, file_b_uuid);
insert into user_file values (user1_uuid, date5, file_a_uuid);

# trying to get the top 3 most recently changed files
select * from user_file where user_id=user1_uuid limit 3

Using CQL, I will get 3 rows back (all file A):

(user1_uuid, date1, file_a_uuid)
(user1_uuid, date2, file_a_uuid)
(user1_uuid, date3, file_a_uuid)

What I want is (file A AND file B):

(user1_uuid, date1, file_a_uuid)
(user1_uuid, date4, file_b_uuid)

So how do I order by / sort by the last-modified column in a row?

thanks

On Thu, Jul 11, 2013 at 12:00 AM, aaron morton aa...@thelastpickle.com wrote:

> What you described sounds like the most appropriate:
>
> CREATE TABLE user_file ( user_id uuid, modified_date timestamp, file_id timeuuid, PRIMARY KEY (user_id, modified_date) );
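The behaviour Jimmy describes can be reproduced outside Cassandra. The sketch below (illustrative Python, not from the thread; names and timestamps are made up) simulates the partition's clustering order to show why a LIMIT over the raw modification log returns one file several times, and what client-side de-duplication looks like:

```python
# Sketch: the partition is a log of (modified_date, file_id) rows in
# clustering order. A plain LIMIT returns row-by-row, so a file that
# changed repeatedly dominates the result.

rows = [  # (modified_date, file_id) as in Jimmy's example
    (1, "file_a"), (2, "file_a"), (3, "file_a"), (4, "file_b"), (5, "file_a"),
]

# LIMIT 3 over the log: three rows, all file_a
print(rows[:3])

def latest_n_files(rows, n):
    """Scan newest-first, keeping only the first time each file appears."""
    seen, out = set(), []
    for ts, f in sorted(rows, reverse=True):
        if f not in seen:
            seen.add(f)
            out.append((ts, f))
        if len(out) == n:
            break
    return out

print(latest_n_files(rows, 3))  # [(5, 'file_a'), (4, 'file_b')]
```

The de-duplication works, but it has to read past the duplicates first, which is exactly the cost Jimmy wants to push into the storage engine.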
RE: data model question : finding out the n most recent changes items
Hi,

Do you need to store the history of updates to a file?

If this is not required, then you can make the user id and file id the row key. You then simply update the modified_date timestamp, and there will be only one row per file per user.

Thanks and Regards
M. Lohith Samaga

-----Original Message-----
From: y2k...@gmail.com on behalf of Jimmy Lin
Sent: Thu 11-Jul-13 13:09
To: user@cassandra.apache.org
Subject: Re: data model question : finding out the n most recent changes items

> What I mean is, I really just want the last modified date instead of a series of timestamps, and still be able to sort or order by it.
Re: data model question : finding out the n most recent changes items
Thanks for the suggestion.

I don't care about the history of update times to a file, BUT I do want to order by it. The reason is that without it, if I have 10k+ files belonging to a user, I have to fetch the last modified time of all 10k+ files and sort through them in my application, returning only the top N. Kind of expensive. I would like to see if it is possible to rely on Cassandra's native storage ordering to achieve this.

CREATE TABLE user_file (
    user_id uuid,
    file_id timeuuid,
    last_modified_time timestamp,
    PRIMARY KEY (user_id, file_id)
);

select * from user_file where user_id=user1_uuid order by last_modified_time limit 10

The above CQL is invalid, because last_modified_time is not part of the compound key and is not allowed to be used for ordering.

On Thu, Jul 11, 2013 at 12:51 AM, Lohith Samaga M lohith.sam...@mphasis.com wrote:

> Do you need to store the history of updates to a file? If this is not required, then you can make the user id and file id the row key. You then simply update the modified_date timestamp, and there will be only one row per file per user.
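One common workaround, offered here as a sketch rather than as the thread's conclusion, is to keep modified_date in the clustering key and, on each update, delete the file's previous row before inserting the new one (a read-before-write), so the partition holds exactly one row per file in time order. A minimal Python simulation of that pattern (names and data are illustrative, not from the thread):

```python
# Sketch: one entry per file, ordered by timestamp, maintained by a
# delete-then-insert on every modification. `partition` plays the
# role of the time-clustered partition; `lookup` plays the role of a
# per-file "current timestamp" table you would read before writing.

def touch(partition, lookup, file_id, ts):
    """Record a modification, keeping exactly one entry per file."""
    old_ts = lookup.get(file_id)              # read the previous timestamp
    if old_ts is not None:
        partition.remove((old_ts, file_id))   # DELETE the old clustering row
    partition.append((ts, file_id))           # INSERT the new clustering row
    lookup[file_id] = ts                      # update the per-file lookup

def most_recent(partition, n):
    """Simulates SELECT ... ORDER BY modified_date DESC LIMIT n."""
    return sorted(partition, reverse=True)[:n]

partition, lookup = [], {}
for ts, f in [(1, "a"), (2, "a"), (3, "a"), (4, "b"), (5, "a")]:
    touch(partition, lookup, f, ts)

print(most_recent(partition, 3))  # [(5, 'a'), (4, 'b')]
```

The trade-off is an extra read and delete per write, and the delete leaves a tombstone, so it is not free; it simply moves the de-duplication cost from read time to write time.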
Alternate major compaction
Hi

About a year ago we did a major compaction in our Cassandra cluster (a n00b mistake, I know), and since then we've had huge sstables that never get compacted, so we were condemned to repeat the major compaction process every once in a while. (We are using the SizeTieredCompaction strategy; we have not yet evaluated LeveledCompaction, because it has its downsides and we've had no time to test them all in our environment.)

I was trying to find a way out of this situation — that is, to do something like a major compaction that writes small sstables rather than the huge ones a major compaction produces — and I couldn't find it in the documentation. I tried cleanup and scrub/upgradesstables, but they don't do that (as the documentation states). Then I tried deleting all data on a node and bootstrapping it (or nodetool rebuild-ing it), hoping that this way the sstables would get cleaned of deleted records and updates. But the wiped node just copied the sstables from another node as they were, cleaning nothing.

So I tried a new approach: I switched the compaction strategy (SizeTiered to Leveled), forcing the sstables to be rewritten from scratch, and then switched it back (Leveled to SizeTiered). It took a while (but so does the major compaction process), and it worked: I have smaller sstables, and I've regained a lot of disk space.

I'm happy with the results, but it doesn't seem an orthodox way of cleaning up sstables. What do you think — is it somehow wrong or crazy? Is there a different way to achieve the same thing?

Let's take an example. Suppose you have a write-only column family (no updates and no deletes, so no need for LeveledCompaction, because SizeTiered works perfectly and requires less I/O) and you mistakenly run a major compaction on it. After a few months you need more space, so you delete half the data, and you find out that you're not freeing half the disk space, because most of those records were in the major-compacted sstables. How can you free the disk space? Waiting will do you no good, because the huge sstable won't get compacted anytime soon. You can run another major compaction, but that would just postpone the real problem. Then you can switch the compaction strategy and switch it back, as I just did. Is there any other way?

--
Tomàs Núñez
IT-Sysprod, Groupalia
www.groupalia.com
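Why a huge sstable "won't get compacted anytime soon" follows from how size-tiered compaction picks candidates: sstables are grouped into buckets of similar size, and a bucket is only compacted once it has enough members. The sketch below is a rough, illustrative simulation, not Cassandra's actual code; the bucket_low/bucket_high/min_threshold values mirror commonly cited defaults but are assumptions here:

```python
# Rough sketch of size-tiered bucketing: each sstable joins a bucket
# whose average size is within [bucket_low, bucket_high] of its own
# size, and only buckets with >= min_threshold members are compacted.
# A single major-compacted monster has no similarly sized peers, so
# its bucket never reaches the threshold.

def buckets(sizes, bucket_low=0.5, bucket_high=1.5):
    out = []
    for size in sorted(sizes):
        for b in out:
            avg = sum(b) / len(b)
            if bucket_low * avg <= size <= bucket_high * avg:
                b.append(size)
                break
        else:
            out.append([size])  # no similar bucket: start a new one
    return out

def compaction_candidates(sizes, min_threshold=4):
    return [b for b in buckets(sizes) if len(b) >= min_threshold]

# Four small flushed sstables plus one 200 GB major-compacted monster:
sizes = [10, 11, 12, 13, 200_000]
print(compaction_candidates(sizes))  # [[10, 11, 12, 13]]
```

Under this model the four small tables compact among themselves forever while the monster sits alone, which matches the behaviour described in the post.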
listen_address and rpc_address address on different interface
Hello,

I was wondering if anyone has measured the performance improvement from having the listen address and client address bound to different interfaces. We have a 2 Gbit connection serving both at the moment, and it doesn't come close to being saturated. But being very keen on fast reads at the 99th percentile, we're interested in even the smallest improvements.

Next question: has anyone ever moved an existing node so that the listen address and client access address are bound to different addresses?

Our problem: currently our only address is a DNS entry, which we would like to keep bound to client access. If we were to take down a node, change the listen address, and re-join the ring, the other nodes would mark the node as dead when we take it down and assume we have a new node when we bring it back on a different address. Lots of wasted rebalancing and compaction would start. We use Cassandra 1.2.4 with vnodes. Not sure there is any way around this.

So, back to question one: am I wasting my time?

Thanks,
Chris
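For reference, a hypothetical cassandra.yaml fragment for the split being discussed; the interface addresses below are made up, not taken from the poster's setup:

```yaml
# Internode (gossip/streaming) traffic on a private interface:
listen_address: 10.0.1.12

# Client traffic (Thrift / native protocol) on the interface the
# public DNS entry points at:
rpc_address: 192.168.50.12
```

Whether the split is worth it is exactly the poster's open question; the fragment only shows where the two addresses are configured.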
Re: High performance hardware with lot of data per node - Global learning about configuration
Hi,

We also recently migrated to 3 hi1.4xlarge boxes (RAID 0 SSD), and the disk I/O performance is definitely better than on the earlier non-SSD servers; we are serving up to 14k reads/s with a latency of 3-3.5 ms/op.

I wanted to share our config options and ask about the data backup strategy for RAID 0. We are using C* 1.2.6 with a key_cache and row_cache of 300MB. I have not changed/modified any other parameter except for going with multithreaded GC. I will be playing around with other factors and update everyone if I find something interesting.

Also, I just wanted to share our backup strategy and see if I can learn something useful from how others are backing up their RAID 0. I am using tablesnap to upload SSTables to S3, and I have attached a separate EBS volume to every box and set up rsync to mirror Cassandra data from RAID 0 to EBS. I would really appreciate it if you could share how you are taking backups.

Thanks

On Jul 9, 2013, at 7:11 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

Hi,

Using C* 1.2.2.

We recently dropped our 18 m1.xlarge (4 CPU, 15GB RAM, 4 RAID-0 disks) servers to get 3 hi1.4xlarge (16 CPU, 60GB RAM, 2 RAID-0 SSD) servers instead, for about the same price. We tried it after reading a benchmark published by Netflix. It is awesome, and I recommend it to anyone who is using more than 18 xLarge servers or can afford these high-cost / high-performance EC2 instances. SSD gives very good throughput with awesome latency.

Yet we had about 200 GB of data per server and now have about 1 TB. To alleviate memory pressure inside the heap I had to reduce the index sampling: I changed the index_interval value from 128 to 512, with no visible impact on latency but a great improvement inside the heap, which doesn't complain about any pressure anymore.

Is there more tuning I could use, more tricks that could be useful when using big servers with a lot of data per node and relatively high throughput?

The SSDs are at 20-40% of their throughput capacity (according to OpsCenter), the CPU load almost never exceeds 5 or 6 (with 16 CPUs), and 15 GB of RAM is used out of 60 GB. At this point I have kept my previous configuration, which is almost the default one from the DataStax community AMI. Here is the relevant part of it; consider any property not listed here to be configured as default:

cassandra.yaml:

key_cache_size_in_mb: (empty, so the default of 100MB; hit rate between 88% and 92%, good enough?)
row_cache_size_in_mb: 0 (not usable in our use case: a lot of different and random reads)
flush_largest_memtables_at: 0.80
reduce_cache_sizes_at: 0.90
concurrent_reads: 32 (I am thinking of increasing this to 64 or more, since I have just a few servers handling more concurrency)
concurrent_writes: 32 (I am thinking of increasing this to 64 or more too)
memtable_total_space_in_mb: 1024 (to avoid having a full heap; should I use a bigger value, and why?)
rpc_server_type: sync (I tried hsha and got the "ERROR 12:02:18,971 Read an invalid frame size of 0. Are you using TFramedTransport on the client side?" error. No idea how to fix this, and I use 5 different clients for different purposes: Hector, Cassie, phpCassa, Astyanax, Helenus...)
multithreaded_compaction: false (should I try enabling this now that I use SSDs?)
compaction_throughput_mb_per_sec: 16 (I will definitely raise this to 32 or even more)
cross_node_timeout: true
endpoint_snitch: Ec2MultiRegionSnitch
index_interval: 512

cassandra-env.sh:

I am not sure how to tune the heap, so I mainly use the defaults:

MAX_HEAP_SIZE=8G
HEAP_NEWSIZE=400M (I tried higher values, and they produced bigger GC times: 1600 ms instead of the 200 ms I get now with 400M)
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=70
-XX:+UseCMSInitiatingOccupancyOnly

Does this configuration seem coherent?

Right now performance is correct, with latency around 5 ms almost all the time. What can I do to handle more data per node and keep this performance, or improve it further?

I know this is a long message, but if you have any comment or insight, even on part of it, don't hesitate to share it. I guess this kind of comment on configuration is usable by the entire community.

Alain
IllegalArgumentException on query with AbstractCompositeType
Hi,

I've been tearing my hair out trying to figure out why this query fails. In fact, it only fails on machines with slower CPUs and after having previously run some other JUnit tests. I'm running JUnit tests against an embedded Cassandra server, which works well in pretty much all other cases, but this one is flaky. I've tried to rule out timing issues by placing a 10-second delay just before this query, in case the data somehow wasn't getting into the db in a timely manner, but that has no effect. I've also tried removing the ORDER BY clause, which seems to be the place in the code where it's getting hung up, but that also has no effect. The ALLOW FILTERING clause has no effect either.

DEBUG [Native-Transport-Requests:16] 2013-07-10 16:28:21,993 Message.java (line 277) Received: QUERY SELECT * FROM conv_msgdata_by_participant_cql WHERE entityConversationId='bulktestfromus...@test.cacontact_811b5efc-b621-4361-9dc9-2e4755be7d89' AND messageId < '2013-07-10T20:29:09.773Zzz' ORDER BY messageId DESC LIMIT 15 ALLOW FILTERING;
ERROR [ReadStage:34] 2013-07-10 16:28:21,995 CassandraDaemon.java (line 132) Exception in thread Thread[ReadStage:34,5,main]
java.lang.RuntimeException: java.lang.IllegalArgumentException
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1582)
    at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:886)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:908)
    at java.lang.Thread.run(Thread.java:662)
Caused by: java.lang.IllegalArgumentException
    at java.nio.Buffer.limit(Buffer.java:247)
    at org.apache.cassandra.db.marshal.AbstractCompositeType.getBytes(AbstractCompositeType.java:51)
    at org.apache.cassandra.db.marshal.AbstractCompositeType.getWithShortLength(AbstractCompositeType.java:60)
    at org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:78)
    at org.apache.cassandra.db.marshal.AbstractCompositeType.compare(AbstractCompositeType.java:31)
    at org.apache.cassandra.db.columniterator.IndexedSliceReader$BlockFetcher.isColumnBeforeSliceFinish(IndexedSliceReader.java:216)
    at org.apache.cassandra.db.columniterator.IndexedSliceReader$SimpleBlockFetcher.<init>(IndexedSliceReader.java:450)
    at org.apache.cassandra.db.columniterator.IndexedSliceReader.<init>(IndexedSliceReader.java:85)
    at org.apache.cassandra.db.columniterator.SSTableSliceIterator.createReader(SSTableSliceIterator.java:68)
    at org.apache.cassandra.db.columniterator.SSTableSliceIterator.<init>(SSTableSliceIterator.java:44)
    at org.apache.cassandra.db.filter.SliceQueryFilter.getSSTableColumnIterator(SliceQueryFilter.java:101)
    at org.apache.cassandra.db.filter.QueryFilter.getSSTableColumnIterator(QueryFilter.java:68)
    at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:275)
    at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:65)
    at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1363)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1220)
    at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1132)
    at org.apache.cassandra.db.Table.getRow(Table.java:355)
    at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:70)
    at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1052)
    at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:1578)

Here's the table it's querying:

CREATE TABLE conv_msgdata_by_participant_cql (
    entityConversationId text,
    messageId text,
    jsonMessage text,
    msgReadFlag boolean,
    msgReadDate text,
    PRIMARY KEY (entityConversationId, messageId)
);

CREATE INDEX ON conv_msgdata_by_participant_cql(msgReadFlag);

Any ideas?

Thanks,
Anne
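For context on the trace (an illustration, not a diagnosis from the thread): a CompositeType cell name is encoded as a sequence of components, each consisting of a 2-byte big-endian length, the component bytes, and a one-byte end-of-component marker. The getWithShortLength call in the trace reads that length and slices the buffer, and a declared length that overruns the remaining bytes is the kind of input that makes Buffer.limit throw IllegalArgumentException. A toy Python decoder of that layout, showing the failure mode:

```python
import struct

def decode_composite(buf: bytes):
    """Toy decoder for the composite layout: <len:2><bytes><eoc:1> repeated."""
    parts, i = [], 0
    while i < len(buf):
        (n,) = struct.unpack_from(">H", buf, i)  # 2-byte component length
        i += 2
        if i + n + 1 > len(buf):
            # the real code hits this as IllegalArgumentException in Buffer.limit
            raise ValueError("component length %d overruns buffer" % n)
        parts.append(buf[i:i + n])
        i += n + 1  # skip the end-of-component byte
    return parts

good = (struct.pack(">H", 3) + b"foo" + b"\x00"
        + struct.pack(">H", 2) + b"hi" + b"\x00")
print(decode_composite(good))  # [b'foo', b'hi']

corrupt = struct.pack(">H", 300) + b"foo"  # declared length > remaining bytes
try:
    decode_composite(corrupt)
except ValueError as e:
    print("decode failed:", e)
```

In other words, the exception suggests the comparator is being handed bytes it cannot parse as a composite; why that happens only on slow machines after other tests is the open question.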
Re: Alternate major compaction
Thanks Takenori. It looks like the tool provides some good info that people can use. It would be great if you could share it with the community.

On Thu, Jul 11, 2013 at 6:51 AM, Takenori Sato ts...@cloudian.com wrote:

Hi,

I think this is a common headache for users running a large Cassandra cluster in production. Running a major compaction is not the only cause; there are more. For example, I see two typical scenarios:

1. backup use case
2. active wide row

In case 1, say a piece of data is removed a year later. This means the tombstone on the row is one year away from the original row. To remove an expired row entirely, a compaction set has to include all of its fragments. So when are the original, one-year-old row and the tombstoned row included in the same compaction set? It is likely to take one year.

In case 2, such an active wide row exists in most of the sstable files, and it typically contains many expired columns. But none of them would be removed entirely, because a compaction set practically never includes all the row fragments.

By the way, there is a very convenient MBean API available: CompactionManager's forceUserDefinedCompaction. You can invoke a minor compaction on a file set you define. So the question is how to find an optimal set of sstable files. To that end, I wrote a tool that checks for garbage and prints out some useful information for finding such an optimal set. Here's a simple log output:

# /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db
[Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504071)]
===
ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
===
hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db
---
TOTAL, 40, 40
===

REMAINNING_SSTABLE_FILES lists any other sstable files that contain the respective row. So the following is an optimal set:

# /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db
[Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504131)]
===
ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
===
hello5/100.txt.1373502926003, 223, 0, YES, YES
---
TOTAL, 223, 0
===

The tool relies on SSTableReader and an aggregation iterator, as Cassandra does in compaction. I was considering sharing this with the community, so let me know if anyone is interested. Note that it is based on 1.0.7, so I will need to check and update it for newer versions.

Thanks,
Takenori

On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núñez tomas.nu...@groupalia.com wrote:

> About a year ago we did a major compaction in our Cassandra cluster (a n00b mistake, I know), and since then we've had huge sstables that never get compacted, so we were condemned to repeat the major compaction process every once in a while...
Re: Alternate major compaction
Perhaps I should already know this but why is running a major compaction considered so bad? We're running 1.1.6. Thanks. On Thu, Jul 11, 2013 at 7:51 AM, Takenori Sato ts...@cloudian.com wrote: Hi, I think it is a common headache for users running a large Cassandra cluster in production. Running a major compaction is not the only cause, but more. For example, I see two typical scenario. 1. backup use case 2. active wide row In the case of 1, say, one data is removed a year later. This means, tombstone on the row is 1 year away from the original row. To remove an expired row entirely, a compaction set has to include all the rows. So, when do the original, 1 year old row, and the tombstoned row are included in a compaction set? It is likely to take one year. In the case of 2, such an active wide row exists in most of sstable files. And it typically contains many expired columns. But none of them wouldn't be removed entirely because a compaction set practically do not include all the row fragments. Btw, there is a very convenient MBean API is available. It is CompactionManager's forceUserDefinedCompaction. You can invoke a minor compaction on a file set you define. So the question is how to find an optimal set of sstable files. Then, I wrote a tool to check garbage, and print outs some useful information to find such an optimal set. Here's a simple log output. # /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504071)] === ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES === hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db --- TOTAL, 40, 40 === REMAINNING_SSTABLE_FILES means any other sstable files that contain the respective row. So, the following is an optimal set. 
# /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504131)] === ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES === hello5/100.txt.1373502926003, 223, 0, YES, YES --- TOTAL, 223, 0 === This tool relies on SSTableReader and an aggregation iterator, as Cassandra does in compaction. I was considering sharing this with the community, so let me know if anyone is interested. Ah, note that it is based on 1.0.7, so I will need to check and update it for newer versions. Thanks, Takenori On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núnez tomas.nu...@groupalia.com wrote: Hi, About a year ago we did a major compaction in our Cassandra cluster (a n00b mistake, I know), and since then we've had huge sstables that never get compacted, and we were condemned to repeat the major compaction process every once in a while (we are using the SizeTieredCompaction strategy, and we've not evaluated LeveledCompaction yet, because it has its downsides and we've had no time to test them all in our environment). I was trying to find a way to solve this situation (that is, do something like a major compaction that writes small sstables, not huge ones as a major compaction does), and I couldn't find it in the documentation. I tried cleanup and scrub/upgradesstables, but they don't do that (as the documentation states). Then I tried deleting all data in a node and then bootstrapping it (or nodetool rebuild-ing it), hoping that this way the sstables would get cleaned of deleted records and updates. But the wiped node just copied the sstables from another node as they were, cleaning nothing. So I tried a new approach: I switched the sstable compaction strategy (SizeTiered to Leveled), forcing the sstables to be rewritten from scratch, and then switched it back (Leveled to SizeTiered).
It took a while (but so does the major compaction process) and it worked: I have smaller sstables, and I've regained a lot of disk space. I'm happy with the results, but it doesn't seem an orthodox way of cleaning the sstables. What do you think, is it wrong or crazy? Is there a different way to achieve the same thing? Let's put an example: Suppose you have a write-only
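Takenori's tool output above can be turned into a compaction set mechanically: start from the sstable you inspected and pull in every file named in the REMAINNING_SSTABLE_FILES column. A sketch of that idea (hypothetical helper, not part of the tool; the row tuples mirror the sample output in this thread):

```python
# Hypothetical helper: given rows parsed from the checksstablegarbage
# output, compute the closure of sstables that must be compacted
# together for the tombstoned rows to be purged.

def optimal_compaction_set(seed_sstables, rows):
    """rows: list of (row_key, remaining_files) pairs, where
    remaining_files comes from the REMAINNING_SSTABLE_FILES column.
    Returns the seed sstables plus every file still holding a fragment."""
    chosen = set(seed_sstables)
    for _row_key, remaining in rows:
        chosen.update(remaining)
    return sorted(chosen)

# The sample output says hello5/100.txt... also lives in hc-3:
rows = [("hello5/100.txt.1373502926003", ["Test5_BLOB-hc-3-Data.db"])]
print(optimal_compaction_set(["Test5_BLOB-hc-4-Data.db"], rows))
# -> ['Test5_BLOB-hc-3-Data.db', 'Test5_BLOB-hc-4-Data.db']
```

That resulting pair is exactly the set Takenori passes to the second checksstablegarbage invocation, where COMPACTED_SIZE drops to 0.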
Re: Cassandra performance tuning...
You should be able to set the key_validation_class on the column family to use a different data type for the row keys. You may not be able to change this for a CF with existing data without some trouble due to a mismatch of data types; if that's a concern you'll have to create a separate CF and migrate your data. On Wed, Jul 10, 2013 at 2:20 PM, Tony Anecito adanec...@yahoo.com wrote: Hi All, I am trying to compare Cassandra to a relational database. I am getting around 2-3 msec response time using the Datastax driver and Java 1.7.0_05 64-bit JRE, while the other database is under 500 microseconds for the JDBC SQL preparedStatement execute. One of the major differences is that Cassandra uses text for the default primary key in the column family, while in the SQL table I use int, which is faster. Can the primary column family key data type be changed to an int? I also know Cassandra uses varint for IntegerType and I am not sure that will be what I need, but I will try it if I can change the key column to that. If I try Int32Type for the primary key I suspect I will need to reload the data after that change. I have looked at the default Java options in the Cassandra bat file and they seem a good starting point, but I am just starting to tune now that I can get column family caching to work. Regards, -Tony
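For what it's worth, in cassandra-cli of that era the change would look something like this (a sketch; the CF name user_data is hypothetical, and as noted above, existing rows keyed with the old type may cause trouble):

```
update column family user_data with key_validation_class = Int32Type;
```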
Re: Alternate major compaction
Information is only deleted from Cassandra during a compaction. Using SizeTieredCompaction, compaction only occurs when a number of similarly sized sstables are combined into a new sstable. When you perform a major compaction, all sstables are combined into one, very large, sstable. As a result, any tombstoned data in that large sstable will only be removed once a number of similarly large sstables exist. This means tombstoned data may be trapped in that sstable for a very long time (or indefinitely, depending on your use case). -Mike On Jul 11, 2013, at 9:31 AM, Brian Tarbox wrote: Perhaps I should already know this, but why is running a major compaction considered so bad? We're running 1.1.6. Thanks. On Thu, Jul 11, 2013 at 7:51 AM, Takenori Sato ts...@cloudian.com wrote: Hi, I think it is a common headache for users running a large Cassandra cluster in production. Running a major compaction is not the only cause; there are more. For example, I see two typical scenarios: 1. backup use case 2. active wide row. In the case of 1, say, a piece of data is removed a year later. This means the tombstone on the row is 1 year away from the original row. To remove an expired row entirely, a compaction set has to include all the row fragments. So, when are the original, 1-year-old row and the tombstoned row included in the same compaction set? It is likely to take one year. In the case of 2, such an active wide row exists in most of the sstable files, and it typically contains many expired columns. But none of them would be removed entirely, because a compaction set practically never includes all the row fragments. Btw, there is a very convenient MBean API available: CompactionManager's forceUserDefinedCompaction. You can invoke a minor compaction on a file set you define. So the question is how to find an optimal set of sstable files. To that end, I wrote a tool that checks for garbage and prints out some useful information to find such an optimal set. Here's a simple log output.
# /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504071)] === ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES === hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db --- TOTAL, 40, 40 === REMAINNING_SSTABLE_FILES lists any other sstable files that contain the respective row. So, the following is an optimal set. # /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db [Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504131)] === ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES === hello5/100.txt.1373502926003, 223, 0, YES, YES --- TOTAL, 223, 0 === This tool relies on SSTableReader and an aggregation iterator, as Cassandra does in compaction. I was considering sharing this with the community, so let me know if anyone is interested. Ah, note that it is based on 1.0.7, so I will need to check and update it for newer versions. Thanks, Takenori On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núnez tomas.nu...@groupalia.com wrote: Hi, About a year ago we did a major compaction in our Cassandra cluster (a n00b mistake, I know), and since then we've had huge sstables that never get compacted, and we were condemned to repeat the major compaction process every once in a while (we are using the SizeTieredCompaction strategy, and we've not evaluated LeveledCompaction yet, because it has its downsides and we've had no time to test them all in our environment). I was trying to find a way to solve this situation (that is, do something like a major compaction that writes small sstables, not huge ones as a major compaction does), and I couldn't find it in the documentation. I tried cleanup and scrub/upgradesstables, but they don't do that (as the documentation states).
Then I tried deleting all data in a node and then bootstrapping it (or nodetool rebuild-ing it), hoping that this way the sstables would get cleaned from deleted records and updates. But the deleted node just copied the
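Mike's point about similarly sized sstables can be seen in a toy model of size-tiered bucketing. This is not Cassandra's actual code; the 0.5x-1.5x-of-average grouping and the minimum of 4 sstables per compaction echo the strategy's defaults, but the numbers here are purely illustrative:

```python
def buckets(sizes, min_threshold=4):
    """Greedy toy model of SizeTieredCompactionStrategy bucketing: an
    sstable joins a bucket if its size is within 0.5x-1.5x of the
    bucket's current average size."""
    groups = []
    for size in sorted(sizes):
        for g in groups:
            avg = sum(g) / len(g)
            if 0.5 * avg <= size <= 1.5 * avg:
                g.append(size)
                break
        else:
            groups.append([size])
    # Only buckets with at least min_threshold members get compacted.
    return [g for g in groups if len(g) >= min_threshold]

# Four 10 MB sstables form a compactable bucket; the lone 1000 MB
# sstable left by a major compaction never finds similarly sized peers,
# so its tombstones stay trapped.
print(buckets([10, 10, 10, 10, 1000]))  # -> [[10, 10, 10, 10]]
```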
Re: High performance hardware with lot of data per node - Global learning about configuration
We've also noticed very good read and write latencies with the hi1.4xls compared to our previous instance classes. We actually ran a mixed cluster of hi1.4xls and m2.4xls to watch a side-by-side comparison. Despite the significant improvement in underlying hardware, we've noticed that streaming performance with 1.2.6+vnodes is a lot slower than we would expect. Bootstrapping a node into a ring with large storage loads can take 6+ hours. We have a JIRA open that describes our current config: https://issues.apache.org/jira/browse/CASSANDRA-5726 Aiman: We also use tablesnap for our backups. We're using a slightly modified version [1]. We currently back up every sst as soon as it hits disk (tablesnap's inotify), but we're considering moving to a periodic snapshot approach, as the sst churn after going from 24 nodes to 6 nodes is quite high. Mike [1]: https://github.com/librato/tablesnap On Thu, Jul 11, 2013 at 7:33 AM, Aiman Parvaiz ai...@grapheffect.com wrote: Hi, We also recently migrated to 3 hi1.4xlarge boxes (Raid0 SSD) and the disk IO performance is definitely better than on the earlier non-SSD servers; we are serving up to 14k reads/s with a latency of 3-3.5 ms/op. I wanted to share our config options and ask about the data backup strategy for Raid0. We are using C* 1.2.6 with a key_cache and row_cache of 300MB. I have not changed/modified any other parameter except for going with multithreaded GC. I will be playing around with other factors and update everyone if I find something interesting. Also, I just wanted to share our backup strategy and see if I can learn something useful from how others are taking backups of their Raid0. I am using tablesnap to upload SSTables to S3, and I have attached a separate EBS volume to every box and set up rsync to mirror Cassandra data from Raid0 to EBS. I would really appreciate it if you guys can share how you are taking backups. Thanks On Jul 9, 2013, at 7:11 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi, Using C*1.2.2.
We recently dropped our 18 m1.xLarge (4 CPU, 15GB RAM, 4 Raid-0 disks) servers to get 3 hi1.4xLarge (16 CPU, 60GB RAM, 2 Raid-0 SSD) servers instead, for about the same price. We tried it after reading some benchmarks published by Netflix. It is awesome and I recommend it to anyone who is using more than 18 xLarge servers or can afford these high cost / high performance EC2 instances. SSD gives a very good throughput with an awesome latency. Yet, we had about 200 GB of data per server and now about 1 TB. To alleviate memory pressure inside the heap I had to reduce the index sampling. I changed the index_interval value from 128 to 512, with no visible impact on latency, but a great improvement inside the heap, which doesn't complain about any pressure anymore. Is there some more tuning I could use, more tricks that could be useful while using big servers, with a lot of data per node and relatively high throughput? The SSDs are at 20-40% of their throughput capacity (according to OpsCenter), the CPU almost never reaches a load bigger than 5 or 6 (with 16 CPUs), and 15 GB of RAM is used out of 60GB. At this point I have kept my previous configuration, which is almost the default one from the Datastax community AMI. Here is part of it; you can consider that any property not listed here is configured as default: cassandra.yaml key_cache_size_in_mb: (empty) - so default - 100MB (hit rate between 88% and 92%, good enough?) row_cache_size_in_mb: 0 (not usable in our use case, a lot of different and random reads) flush_largest_memtables_at: 0.80 reduce_cache_sizes_at: 0.90 concurrent_reads: 32 (I am thinking of increasing this to 64 or more since I have just a few servers handling more concurrency) concurrent_writes: 32 (I am thinking of increasing this to 64 or more too) memtable_total_space_in_mb: 1024 (to avoid having a full heap; should I use a bigger value, and why?) rpc_server_type: sync (I tried hsha and had the ERROR 12:02:18,971 Read an invalid frame size of 0.
Are you using TFramedTransport on the client side? error). No idea how to fix this, and I use 5 different clients for different purposes (Hector, Cassie, phpCassa, Astyanax, Helenus)... multithreaded_compaction: false (Should I try enabling this since I now use SSDs?) compaction_throughput_mb_per_sec: 16 (I will definitely up this to 32 or even more) cross_node_timeout: true endpoint_snitch: Ec2MultiRegionSnitch index_interval: 512 cassandra-env.sh I am not sure about how to tune the heap, so I mainly use the defaults MAX_HEAP_SIZE=8G HEAP_NEWSIZE=400M (I tried higher values, and they produced bigger GC times (1600 ms instead of the 200 ms I get now with 400M)) -XX:+UseParNewGC -XX:+UseConcMarkSweepGC -XX:+CMSParallelRemarkEnabled -XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:CMSInitiatingOccupancyFraction=70 -XX:+UseCMSInitiatingOccupancyOnly Does this configuration
Token Aware Routing: Routing Key Vs Composite Key with vnodes
Hi All, I am a bit confused about how the underlying token aware routing works in the case of a composite key. Let's say I have a column family like this: USERS( uuid userId, text firstname, text lastname, int age, PRIMARY KEY(userId, firstname, lastname)) My question is: do we need to have the values of userId, firstname and lastname available at the same time to create the token from the composite key, or can we get the right token just by looking at the routing key userId? Looking at the Datastax driver code is a bit confusing; it seems that it calculates the token only when all the values of the composite key are available, or am I missing something? Thanks, Haithem
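For what it's worth, in the schema above only userId is the partition key; firstname and lastname are clustering columns, so the token, and hence the replica choice, depends on userId alone. A sketch of RandomPartitioner-style token computation (MD5 of the serialized partition key bytes; Murmur3Partitioner uses a different hash but the same principle, and the example key bytes are hypothetical):

```python
import hashlib

def random_partitioner_token(partition_key: bytes) -> int:
    """Token the way RandomPartitioner computes it: the absolute value
    of the MD5 digest of the raw partition key bytes, read as a signed
    big-endian integer."""
    digest = hashlib.md5(partition_key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))

user_id = b"\x12" * 16  # hypothetical serialized uuid
# Clustering columns (firstname, lastname) never enter the hash, so the
# same userId always routes to the same token:
print(random_partitioner_token(user_id) == random_partitioner_token(user_id))  # -> True
```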
Re: Working with libcql
On 2013-07-09 11:46, Shubham Mittal wrote: yeah I tried that and below is the output I get LOG: resolving remote host localhost:9160 libcql is an implementation of the new binary transport protocol: https://git-wip-us.apache.org/repos/asf?p=cassandra.git;a=blob_plain;f=doc/native_protocol.spec;hb=refs/heads/cassandra-1.2 It is not a thrift transport. By default it uses port 9042. You'll have to activate it on the server: write (or uncomment) start_native_transport: true in conf/cassandra.yaml. According to the posted log, you connect to the thrift transport port, 9160. As you send a frame of the new transport protocol to the old thrift protocol, the server does not understand it and closes your connection. Regards, Sorin LOG: resolved remote host, attempting to connect LOG: connection successful to remote host LOG: sending message: 0x0105 {version: 0x01, flags: 0x00, stream: 0x00, opcode: 0x05, length: 0} OPTIONS LOG: wrote to socket 8 bytes LOG: error reading header End of file and I checked all the keyspaces in my cluster, it changes nothing in the cluster. I couldn't understand the code much. What is this code supposed to do anyways? On Tue, Jul 9, 2013 at 4:20 AM, aaron morton aa...@thelastpickle.com wrote: Did you see the demo app? It seems to have a few examples of reading data. https://github.com/mstump/libcql/blob/master/demo/main.cpp#L85 Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 9/07/2013, at 1:14 AM, Shubham Mittal smsmitta...@gmail.com wrote: Hi, I found out that there exists a C++ client, libcql, for Cassandra, but its github repository just provides an example of how to connect to Cassandra. Has anyone written some code using libcql to read and write data to a Cassandra DB? If so, kindly share it. Thanks
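The 8-byte OPTIONS frame in the log above can be reproduced in a few lines; the field layout (version, flags, stream, opcode, then a 4-byte big-endian body length) follows the v1 native protocol spec Sorin links:

```python
import struct

# v1 native protocol header: version 0x01 (request), flags 0x00,
# stream 0x00, opcode 0x05 (OPTIONS), body length 0 -- exactly the
# 8 bytes libcql logged as "wrote to socket 8 bytes".
frame = struct.pack(">BBBBI", 0x01, 0x00, 0x00, 0x05, 0)
print(frame.hex(), len(frame))  # -> 0100000500000000 8
```

A thrift server on port 9160 reads those first four bytes as a (nonsensical) frame size, which is why it gives up and closes the connection.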
Re: alter column family ?
Hi Rob, Are the schemas held somewhere else? Going through the process that you sent, when I restart the nodes the original schemas show up (btw, you were correct in your assessment: even though the schemas show they are the same with the gossipinfo command, they are not the same when looking at them with cassandra-cli, not even close on 2 of the nodes). So, I went through the process of clearing out the system CFs. In steps 4 and 5, when the Cassandra nodes restarted, two of them (the ones with the incorrect schemas) complained about the schema and loaded what looks like a generic one. But all of them have schemas, and 2 are correct and one is not. This means I cannot execute step 7, since the schema now exists with that name on all the nodes. For example, the incorrect schema is called MySchema; after the restart and the messages complaining about CFs not existing, there is a schema called MySchema, correct on 2 nodes and incorrect on 2 nodes. I have also tried to force the node with the incorrect schema to come up on its own by shutting down the cluster except for a node with a correct schema. I went through the same steps and brought that node down and back up, with the same results. Thoughts? Ideas? Jim From: Robert Coli rc...@eventbrite.com Reply-To: user@cassandra.apache.org Date: Tue, 9 Jul 2013 17:10:53 -0700 To: user@cassandra.apache.org Subject: Re: alter column family ?
On Tue, Jul 9, 2013 at 11:52 AM, Langston, Jim jim.langs...@compuware.com wrote: On the command (4 node cluster): nodetool gossipinfo -h localhost |grep SCHEMA |sort | uniq -c | sort -n 4 SCHEMA:60edeaa8-70a4-3825-90a5-d7746ffa8e4d If your schemas actually agree (and given that you're on 1.1.2), you are probably encountering https://issues.apache.org/jira/browse/CASSANDRA-4432, which is one of the 1.1.2-era stuck-schema issues I was referring to earlier. On the second part, I have the same Cassandra version in staging and production, with staging being a smaller cluster. Not sure what you mean by nuking schemas (i.e. delete directories?) I like when googling things returns related threads in which I have previously advised people to do a detailed list of things, heh: http://mail-archives.apache.org/mod_mbox/cassandra-user/201208.mbox/%3CCAN1VBD-01aD7wT2w1eyY2KpHwcj+CoMjvE4=j5zaswybmw_...@mail.gmail.com%3E Here's a slightly clarified version of these steps...
0. Dump your existing schema to schema_definition_file.
1. Take all nodes out of service.
2. Run nodetool drain on each and verify that they have drained (grep -i DRAINED system.log).
3. Stop cassandra on each node.
4. Move /var/lib/cassandra/data/system out of the way.
5. Move /var/lib/cassandra/saved_caches/system-* out of the way.
6. Start all nodes.
7. cassandra-cli schema_definition_file on one node only (includes create keyspace and create column family entries). Note: you should not literally do this; you should break your schema_definition_file into individual statements and wait for schema agreement between each DDL statement.
8. Put the nodes back in service.
9. Done.
=Rob
Re: alter column family ?
On Thu, Jul 11, 2013 at 9:17 AM, Langston, Jim jim.langs...@compuware.com wrote: Are the schemas held somewhere else? Going through the process that you sent, when I restart the nodes, the original schemas show up If you do not stop all nodes at once and then remove the system CFs, the existing schema will re-propagate via Gossip. To be clear, I was suggesting that you dump the schema with cassandra-cli, erase the current schema with the cluster down, bring the cluster back up (NOW WITH NO SCHEMA) and then load the schema from the dump via cassandra-cli. Also, in case I didn't mention it before, you should upgrade your version of Cassandra ASAP. :) =Rob
Re: Logging Cassandra Reads/Writes
Aaron, Thanks for the references! I'll try the things you mentioned and see how it goes! Best, Mohammad On Wed, Jul 10, 2013 at 8:07 PM, aaron morton wrote: Some info on request tracing: http://www.datastax.com/dev/blog/tracing-in-cassandra-1-2 1) Is it possible to log which node provides the real data in a read operation? It's available at the DEBUG level of logging. You probably just want to enable it on the org.apache.cassandra.db.StorageProxy class; see log4j-server.properties for info. 2) Also, is it possible to log the different delays involved in each operation -- for example, 0.1 seconds to get digests from all nodes, 1 second to transfer data, etc.? Not applicable; as you've seen, we send requests to all replicas at the same time. There is more logging that will show when the responses are processed; try turning DEBUG logging on for a small 3 node cluster and send one request. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 10/07/2013, at 8:58 AM, Mohit Anchlia wrote: There is a new tracing feature in Cassandra 1.2 that might help you with this. On Tue, Jul 9, 2013 at 1:31 PM, Blair Zajac wrote: No idea on the logging, I'm pretty new to Cassandra. Regards, Blair On Jul 9, 2013, at 12:50 PM, hajjat wrote: Blair, thanks for the clarification! My friend actually just told me the same. Any idea on how to do logging? Thanks!
-- *Mohammad Hajjat* *Ph.D. Student* *Electrical and Computer Engineering* *Purdue University*
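The per-class DEBUG logging Aaron mentions would be a one-line addition to conf/log4j-server.properties, something like the following (a sketch using standard log4j property syntax; scoping it to the one class keeps the rest of the server at its current level):

```
log4j.logger.org.apache.cassandra.db.StorageProxy=DEBUG
```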
Re: Alternate major compaction
On Thu, Jul 11, 2013 at 2:46 AM, Tomàs Núnez tomas.nu...@groupalia.com wrote: Hi, About a year ago we did a major compaction in our Cassandra cluster (a n00b mistake, I know), and since then we've had huge sstables that never get compacted, and we were condemned to repeat the major compaction process every once in a while (we are using the SizeTieredCompaction strategy, and we've not evaluated LeveledCompaction yet, because it has its downsides and we've had no time to test them all in our environment). I was trying to find a way to solve this situation (that is, do something like a major compaction that writes small sstables, not huge ones as a major compaction does), and I couldn't find it in the documentation. https://github.com/pcmanus/cassandra/tree/sstable_split 1) run sstable_split on One Big SSTable (being careful to avoid name collisions if done with the node running) 2) stop node 3) remove One Big SSTable 4) start node This approach is significantly more I/O-efficient than your online solution, but it does require a node restart and messing around directly with SSTables. Your online solution is clever! If you choose to use this tool, please let us know the result. With some feedback, pcmanus (Sylvain) is likely to merge it into Cassandra as a useful tool for dealing with, for example, this situation. =Rob
Re: node tool ring displays 33.33% owns on 3 node cluster with replication
Thanks Rob! I was able to confirm with getendpoints. Cheers, ~Jason From: Robert Coli rc...@eventbrite.commailto:rc...@eventbrite.com Reply-To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Date: Wednesday, July 10, 2013 4:09 PM To: user@cassandra.apache.orgmailto:user@cassandra.apache.org user@cassandra.apache.orgmailto:user@cassandra.apache.org Cc: Francois Richard frich...@yahoo-inc.commailto:frich...@yahoo-inc.com Subject: Re: node tool ring displays 33.33% owns on 3 node cluster with replication On Wed, Jul 10, 2013 at 4:04 PM, Jason Tyler jaty...@yahoo-inc.commailto:jaty...@yahoo-inc.com wrote: Is this simply a display issue, or have I lost replication? Almost certainly just a display issue. Do nodetool -h localhost getendpoints keyspace columnfamily 0, which will tell you the endpoints for the non-transformed key 0. It should give you 3 endpoints. You could also do this test with a known existing key and then go to those nodes and verify that they have that data on disk via sstable2json. (FWIW, it is an odd display issue/bug if it is one. Because it has reverted to pre-1.1 behavior...) =Rob
Re: alter column family ?
Yes, I got the gist of what you were after, even making sure I broke out the schema dump and loaded the statements in individually, but I haven't gotten that far. It feels like the 2 nodes that are not coming up with the right schema are not seeing the nodes with the correct one. And yes, I hear the beat of the upgrade drum; I was hoping to do one step at a time so I don't carry my problem over. Jim From: Robert Coli rc...@eventbrite.com Reply-To: user@cassandra.apache.org Date: Thu, 11 Jul 2013 09:43:43 -0700 To: user@cassandra.apache.org Subject: Re: alter column family ? On Thu, Jul 11, 2013 at 9:17 AM, Langston, Jim jim.langs...@compuware.com wrote: Are the schemas held somewhere else? Going through the process that you sent, when I restart the nodes, the original schemas show up If you do not stop all nodes at once and then remove the system CFs, the existing schema will re-propagate via Gossip. To be clear, I was suggesting that you dump the schema with cassandra-cli, erase the current schema with the cluster down, bring the cluster back up (NOW WITH NO SCHEMA) and then load the schema from the dump via cassandra-cli. Also, in case I didn't mention it before, you should upgrade your version of Cassandra ASAP. :) =Rob
Re: alter column family ?
On Thu, Jul 11, 2013 at 10:16 AM, Langston, Jim jim.langs...@compuware.comwrote: It feels like the 2 node that are not coming up with the right schema are not seeing the nodes with the correct ones. At the time that the nodes come up, they should have no schema other than the system columnfamilies. Only once all 3 nodes see each other should you be re-creating the schema. I'm not understanding your above sentence in light of this? =Rob
Re: High performance hardware with lot of data per node - Global learning about configuration
Thanks for the info Mike. We ran into a race condition which was killing tablesnap; I want to share the problem and the solution/workaround, and maybe someone can throw some light on the effects of the solution. tablesnap was getting killed with this error message: Failed uploading %s. Aborting.\n%s Looking at the code took me to the following: def worker(self): bucket = self.get_bucket() while True: f = self.fileq.get() keyname = self.build_keyname(f) try: self.upload_sstable(bucket, keyname, f) except: self.log.critical("Failed uploading %s. Aborting.\n%s" % (f, format_exc())) # Brute force kill self os.kill(os.getpid(), signal.SIGKILL) self.fileq.task_done() It builds the filename, and then before it can upload the file, the file disappears (which is possible). I simply commented out the line which kills tablesnap if the file is not found; it fixes the issue we were having, but I would appreciate it if someone has any insight into any ill effects this might have on the backup or restoration process. Thanks On Jul 11, 2013, at 7:03 AM, Mike Heffner m...@librato.com wrote: We've also noticed very good read and write latencies with the hi1.4xls compared to our previous instance classes. We actually ran a mixed cluster of hi1.4xls and m2.4xls to watch a side-by-side comparison. Despite the significant improvement in underlying hardware, we've noticed that streaming performance with 1.2.6+vnodes is a lot slower than we would expect. Bootstrapping a node into a ring with large storage loads can take 6+ hours. We have a JIRA open that describes our current config: https://issues.apache.org/jira/browse/CASSANDRA-5726 Aiman: We also use tablesnap for our backups. We're using a slightly modified version [1]. We currently back up every sst as soon as it hits disk (tablesnap's inotify), but we're considering moving to a periodic snapshot approach, as the sst churn after going from 24 nodes to 6 nodes is quite high.
Mike [1]: https://github.com/librato/tablesnap On Thu, Jul 11, 2013 at 7:33 AM, Aiman Parvaiz ai...@grapheffect.com wrote: Hi, We also recently migrated to 3 hi.4xlarge boxes(Raid0 SSD) and the disk IO performance is definitely better than the earlier non SSD servers, we are serving up to 14k reads/s with a latency of 3-3.5 ms/op. I wanted to share our config options and ask about the data back up strategy for Raid0. We are using C* 1.2.6 with key_chache and row_cache of 300MB I have not changed/ modified any other parameter except for going with multithreaded GC. I will be playing around with other factors and update everyone if I find something interesting. Also, just wanted to share backup strategy and see if I can get something useful from how others are taking backup of their raid0. I am using tablesnap to upload SSTables to s3 and I have attached a separate EBS volume to every box and have set up rsync to mirror Cassandra data from Raid0 to EBS. I would really appreciate if you guys can share how you taking backups. Thanks On Jul 9, 2013, at 7:11 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi, Using C*1.2.2. We recently dropped our 18 m1.xLarge (4CPU, 15GB RAM, 4 Raid-0 Disks) servers to get 3 hi1.4xLarge (16CPU, 60GB RAM, 2 Raid-0 SSD) servers instead, for about the same price. We tried it after reading some benchmark published by Netflix. It is awesome and I recommend it to anyone who is using more than 18 xLarge server or can afford these high cost / high performance EC2 instances. SSD gives a very good throughput with an awesome latency. Yet, we had about 200 GB data per server and now about 1 TB. To alleviate memory pressure inside the heap I had to reduce the index sampling. I changed the index_interval value from 128 to 512, with no visible impact on latency, but a great improvement inside the heap which doesn't complain about any pressure anymore. 
Is there some more tuning I could use, more tricks that could be useful while using big servers, with a lot of data per node and relatively high throughput ? SSD are at 20-40 % of their throughput capacity (according to OpsCenter), CPU almost never reach a bigger load than 5 or 6 (with 16 CPU), 15 GB RAM used out of 60GB. At this point I have kept my previous configuration, which is almost the default one from the Datastax community AMI. There is a part of it, you can consider that any property that is not in here is configured as default : cassandra.yaml key_cache_size_in_mb: (empty) - so default - 100MB (hit rate between 88 % and 92 %, good enough ?) row_cache_size_in_mb: 0 (not usable in our use case, a lot of different and random reads) flush_largest_memtables_at: 0.80 reduce_cache_sizes_at: 0.90
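On the tablesnap race described earlier in this thread, rather than commenting out the kill entirely, a gentler variant would treat a vanished sstable as expected compaction churn and only abort on real upload failures. A sketch of that idea (simplified shape only; the upload callable is a stand-in, not tablesnap's actual classes):

```python
import os

def upload_all(files, upload):
    """Upload each file, skipping ones that compaction removed between
    listing and upload; anything else still counts as a failure."""
    uploaded, skipped = [], []
    for path in files:
        try:
            if not os.path.exists(path):
                # A vanished sstable is normal churn, not a fatal error.
                raise FileNotFoundError(path)
            upload(path)
            uploaded.append(path)
        except FileNotFoundError:
            skipped.append(path)  # log and move on instead of SIGKILL
    return uploaded, skipped

done, gone = upload_all(["/no/such/sstable-Data.db"], upload=lambda p: None)
print(done, gone)  # -> [] ['/no/such/sstable-Data.db']
```

Distinguishing "file gone" from "upload failed" keeps the original fail-fast behaviour for genuine S3 errors, which the blanket commented-out kill does not.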
Re: alter column family ?
Thanks Rob, I went through the whole sequence again and have now gotten to the point of being able to try to pull in the schema, but am now getting this error from the one node I'm executing on:

[default@unknown] create keyspace OTracker
... with placement_strategy = 'SimpleStrategy'
... and strategy_options = {replication_factor : 3}
... and durable_writes = true;
9209ec36-3b3f-3e24-9dfb-8a45a5b29a2a
Waiting for schema agreement...
... schemas agree across the cluster
NotFoundException()
[default@unknown]

All the nodes see each other and are available; all contain only a system schema, and none have an OTracker schema. Jim

From: Robert Coli rc...@eventbrite.com
Reply-To: user@cassandra.apache.org
Date: Thu, 11 Jul 2013 10:35:43 -0700
To: user@cassandra.apache.org
Subject: Re: alter column family ?

On Thu, Jul 11, 2013 at 10:16 AM, Langston, Jim jim.langs...@compuware.com wrote: It feels like the 2 nodes that are not coming up with the right schema are not seeing the nodes with the correct ones. At the time that the nodes come up, they should have no schema other than the system columnfamilies. Only once all 3 nodes see each other should you be re-creating the schema. I'm not understanding your above sentence in light of this? =Rob
Re: alter column family ?
On Thu, Jul 11, 2013 at 11:00 AM, Langston, Jim jim.langs...@compuware.com wrote: I went through the whole sequence again and now have gotten to the point of being able to try and pull in the schema, but now getting this error from the one node I'm executing on. [default@unknown] create keyspace OTracker 9209ec36-3b3f-3e24-9dfb-8a45a5b29a2a Waiting for schema agreement... ... schemas agree across the cluster NotFoundException()

This is pretty unusual.

All the nodes see each other and are available, all only contain a system schema, none have a OTracker schema

If you look in the logs for schema-related activity when you try to create OTracker, what do you see? Do you see the above UUID schema version in the logs? At this point I am unable to suggest anything other than upgrading to the head of the 1.1 line and trying to create your keyspace there. There should be no chance of old state being implicated in your now-stuck schema, so it seems likely that the problem has re-occurred due to the version of Cassandra you are running. Sorry I am unable to be of more assistance, and that my advice appears to have left your cluster in worse condition than when you started. I probably mentioned this before, but will do so again: if you have the old system keyspace directories, you can stop Cassandra on all nodes and then revert to them. =Rob
Re: alter column family ?
I was just looking at a bug with uppercase; could that be the error? And yes, I definitely saved off the original system keyspaces. I'm tailing the logs when running cassandra-cli, but I do not see anything in them. Jim

From: Robert Coli rc...@eventbrite.com
Reply-To: user@cassandra.apache.org
Date: Thu, 11 Jul 2013 11:07:55 -0700
To: user@cassandra.apache.org
Subject: Re: alter column family ?

On Thu, Jul 11, 2013 at 11:00 AM, Langston, Jim jim.langs...@compuware.com wrote: I went through the whole sequence again and now have gotten to the point of being able to try and pull in the schema, but now getting this error from the one node I'm executing on. [default@unknown] create keyspace OTracker 9209ec36-3b3f-3e24-9dfb-8a45a5b29a2a Waiting for schema agreement... ... schemas agree across the cluster NotFoundException()

This is pretty unusual.

All the nodes see each other and are available, all only contain a system schema, none have a OTracker schema

If you look in the logs for schema-related activity when you try to create OTracker, what do you see? Do you see the above UUID schema version in the logs? At this point I am unable to suggest anything other than upgrading to the head of the 1.1 line and trying to create your keyspace there. There should be no chance of old state being implicated in your now-stuck schema, so it seems likely that the problem has re-occurred due to the version of Cassandra you are running. Sorry I am unable to be of more assistance, and that my advice appears to have left your cluster in worse condition than when you started. I probably mentioned this before, but will do so again: if you have the old system keyspace directories, you can stop Cassandra on all nodes and then revert to them. =Rob
merge sstables
Hello, I have around 2000 small sstables of about 5 MB each. Is there a way I can merge them into bigger ones? Thanks, chandra
Re: merge sstables
I assume you are using the leveled compaction strategy, because you have 5 MB sstables and 5 MB is the default size for leveled compaction. To change this default, you can run the following in the cassandra-cli:

update column family cf_name with compaction_strategy_options = {sstable_size_in_mb: 256};

To force the current sstables to be rewritten, I think you'll need to issue a nodetool scrub on each node. Someone please correct me if I'm wrong on this. Faraaz

On Thu, Jul 11, 2013 at 11:34:08AM -0700, chandra Varahala wrote: Hello , I have small size of sstables like 5mb around 2000 files. Is there a way i can merge into bigger size ? thanks chandra
Re: merge sstables
Yes, but nodetool scrub is not working. Thanks, chandra

On Thu, Jul 11, 2013 at 2:39 PM, Faraaz Sareshwala fsareshw...@quantcast.com wrote: I assume you are using the leveled compaction strategy because you have 5mb sstables and 5mb is the default size for leveled compaction. To change this default, you can run the following in the cassandra-cli: update column family cf_name with compaction_strategy_options = {sstable_size_in_mb: 256}; To force the current sstables to be rewritten, I think you'll need to issue a nodetool scrub on each node. Someone please correct me if I'm wrong on this. Faraaz On Thu, Jul 11, 2013 at 11:34:08AM -0700, chandra Varahala wrote: Hello , I have small size of sstables like 5mb around 2000 files. Is there a way i can merge into bigger size ? thanks chandra
Re: High performance hardware with lot of data per node - Global learning about configuration
Aiman, I believe that is one of the cases we added a check for: https://github.com/librato/tablesnap/blob/master/tablesnap#L203-L207 Mike

On Thu, Jul 11, 2013 at 1:54 PM, Aiman Parvaiz ai...@grapheffect.com wrote:

Thanks for the info Mike. We ran into a race condition which was killing tablesnap; I want to share the problem and the solution/workaround, and maybe someone can throw some light on the effects of the solution. tablesnap was getting killed with this error message: "Failed uploading %s. Aborting.\n%s". Looking at the code took me to the following:

def worker(self):
    bucket = self.get_bucket()

    while True:
        f = self.fileq.get()
        keyname = self.build_keyname(f)
        try:
            self.upload_sstable(bucket, keyname, f)
        except:
            self.log.critical("Failed uploading %s. Aborting.\n%s" %
                              (f, format_exc()))
            # Brute force kill self
            os.kill(os.getpid(), signal.SIGKILL)
        self.fileq.task_done()

It builds the filename, and then before it can upload it the file disappears (which is possible). I simply commented out the line which kills tablesnap if the file is not found. That fixes the issue we were having, but I would appreciate it if someone has any insight into ill effects this might have on the backup or restoration process. Thanks

On Jul 11, 2013, at 7:03 AM, Mike Heffner m...@librato.com wrote:

We've also noticed very good read and write latencies with the hi1.4xls compared to our previous instance classes. We actually ran a mixed cluster of hi1.4xls and m2.4xls to watch a side-by-side comparison. Despite the significant improvement in underlying hardware, we've noticed that streaming performance with 1.2.6+vnodes is a lot slower than we would expect. Bootstrapping a node into a ring with large storage loads can take 6+ hours. We have a JIRA open that describes our current config: https://issues.apache.org/jira/browse/CASSANDRA-5726 Aiman: We also use tablesnap for our backups. We're using a slightly modified version [1].
We currently back up every sstable as soon as it hits disk (tablesnap's inotify), but we're considering moving to a periodic snapshot approach, as the sstable churn after going from 24 nodes to 6 nodes is quite high. Mike

[1]: https://github.com/librato/tablesnap

On Thu, Jul 11, 2013 at 7:33 AM, Aiman Parvaiz ai...@grapheffect.com wrote:

Hi, We also recently migrated to 3 hi1.4xlarge boxes (RAID0 SSD) and the disk IO performance is definitely better than on the earlier non-SSD servers; we are serving up to 14k reads/s with a latency of 3-3.5 ms/op. I wanted to share our config options and ask about the data backup strategy for RAID0. We are using C* 1.2.6 with a key_cache and row_cache of 300MB. I have not changed/modified any other parameter except for going with multithreaded GC. I will be playing around with other factors and will update everyone if I find something interesting. Also, I just wanted to share our backup strategy and see if I can learn something useful from how others are backing up their RAID0. I am using tablesnap to upload SSTables to S3, and I have attached a separate EBS volume to every box and set up rsync to mirror Cassandra data from RAID0 to EBS. I would really appreciate it if you guys can share how you are taking backups. Thanks

On Jul 9, 2013, at 7:11 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

Hi, Using C* 1.2.2. We recently dropped our 18 m1.xlarge (4 CPU, 15GB RAM, 4 RAID-0 disks) servers to get 3 hi1.4xlarge (16 CPU, 60GB RAM, 2 RAID-0 SSD) servers instead, for about the same price. We tried it after reading a benchmark published by Netflix. It is awesome, and I recommend it to anyone who is using more than 18 xlarge servers or can afford these high-cost / high-performance EC2 instances. SSD gives very good throughput with awesome latency. However, we had about 200 GB of data per server before and now have about 1 TB. To alleviate memory pressure inside the heap I had to reduce the index sampling.
I changed the index_interval value from 128 to 512, with no visible impact on latency, but a great improvement inside the heap, which doesn't complain about any pressure anymore. Is there more tuning I could use, more tricks that could be useful with big servers, a lot of data per node and relatively high throughput? The SSDs are at 20-40% of their throughput capacity (according to OpsCenter), the CPU load almost never exceeds 5 or 6 (with 16 CPUs), and 15 GB of RAM is used out of 60GB. At this point I have kept my previous configuration, which is almost the default one from the DataStax community AMI. Here is part of it; you can assume that any property not listed here is configured with its default: cassandra.yaml key_cache_size_in_mb: (empty) - so default - 100MB (hit rate between 88 %
Re: merge sstables
Scrub will keep the file sizes the same. You need to move all sstables back to L0; the way to do this is to remove the .json manifest file which holds the level information.

On Thu, Jul 11, 2013 at 11:48 AM, chandra Varahala hadoopandcassan...@gmail.com wrote: yes, but nodetool scrub is not working .. thanks chandra On Thu, Jul 11, 2013 at 2:39 PM, Faraaz Sareshwala fsareshw...@quantcast.com wrote: I assume you are using the leveled compaction strategy because you have 5mb sstables and 5mb is the default size for leveled compaction. To change this default, you can run the following in the cassandra-cli: update column family cf_name with compaction_strategy_options = {sstable_size_in_mb: 256}; To force the current sstables to be rewritten, I think you'll need to issue a nodetool scrub on each node. Someone please correct me if I'm wrong on this. Faraaz On Thu, Jul 11, 2013 at 11:34:08AM -0700, chandra Varahala wrote: Hello , I have small size of sstables like 5mb around 2000 files. Is there a way i can merge into bigger size ? thanks chandra
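A minimal sketch of sankalp's manifest-removal step, assuming the 1.0/1.1-era on-disk layout where leveled compaction keeps a per-column-family .json manifest in the data directory (the directory path and file names below are hypothetical, and the node must be stopped before touching its data files):

```python
import os

def remove_lcs_manifests(cf_dir: str) -> list:
    """Delete leveled-compaction manifest files (*.json) in a column
    family data directory, so all sstables drop back to L0 when the
    node restarts and the manifest is rebuilt."""
    removed = []
    for name in sorted(os.listdir(cf_dir)):
        if name.endswith(".json"):
            path = os.path.join(cf_dir, name)
            os.remove(path)
            removed.append(name)
    return removed

# Hypothetical usage (node stopped first):
# remove_lcs_manifests("/cassandra_data/UserData")
```

On restart, compaction then re-levels everything from L0, which is what merges the small files, subject to the L0-backlog caveat Rob quotes in the next message of this thread.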
Re: merge sstables
On Thu, Jul 11, 2013 at 1:52 PM, sankalp kohli kohlisank...@gmail.com wrote: Scrub will keep the file size same. You need to move all sstables to be L0. The way to do this is to remove the json file which has level information.

This will work, but I believe it is subject to this?

./src/java/org/apache/cassandra/db/compaction/LeveledManifest.java line 228 of 577

// LevelDB gives each level a score of how much data it contains vs its ideal amount, and
// compacts the level with the highest score. But this falls apart spectacularly once you
// get behind. Consider this set of levels:
//   L0: 988 [ideal: 4]
//   L1: 117 [ideal: 10]
//   L2: 12  [ideal: 100]
//
// The problem is that L0 has a much higher score (almost 250) than L1 (11), so what we'll
// do is compact a batch of MAX_COMPACTING_L0 sstables with all 117 L1 sstables, and put the
// result (say, 120 sstables) in L1. Then we'll compact the next batch of MAX_COMPACTING_L0,
// and so forth. So we spend most of our i/o rewriting the L1 data with each batch.
//
// If we could just do *all* L0 a single time with L1, that would be ideal. But we can't
// -- see the javadoc for MAX_COMPACTING_L0.
//
// LevelDB's way around this is to simply block writes if L0 compaction falls behind.
// We don't have that luxury.
//
// So instead, we force compacting higher levels first. This may not minimize the number
// of reads done as quickly in the short term, but it minimizes the i/o needed to compact
// optimally, which gives us a long term win.

Ideal would be something like a major compaction for LCS which allows the end user to change the resulting SSTable sizes without forcing everything back to L0. =Rob
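The scoring behavior the quoted comment describes can be reproduced numerically. This sketch just applies the comment's own numbers (sstable counts vs ideal amounts per level):

```python
def level_scores(actual, ideal):
    # Score each level by how much data it contains vs its ideal
    # amount, as described in the LeveledManifest comment.
    return [a / i for a, i in zip(actual, ideal)]

# Counts from the comment: L0, L1, L2 actual vs ideal.
scores = level_scores([988, 117, 12], [4, 10, 100])
print(scores)  # → [247.0, 11.7, 0.12]
```

L0's score dwarfs the others, which is why a naive highest-score-first policy would keep re-compacting L1 batch after batch once L0 falls behind.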
Rhombus - A time-series object store for Cassandra
Hello, Just wanted to share a project that we have been working on. It's a time-series object store for Cassandra. We tried to generalize the common use cases for storing time-series data in Cassandra and automatically handle the denormalization, indexing, and wide row sharding. It currently exists as a Java Library. We have it deployed as a web service in a Dropwizard app server with a REST style interface. The plan is to eventually release that Dropwizard app too. The project and explanation is available on Github at: https://github.com/Pardot/Rhombus I would love to hear feedback. Many Thanks, Rob
Re: Token Aware Routing: Routing Key Vs Composite Key with vnodes
It is my understanding that you must have all parts of the partition key in order to calculate the token. The partition key is the first part of the primary key, in your case the userId. You should be able to get the token from the userId alone. Give it a try:

cqlsh> select userId, token(userId) from users limit 10;

On 07/11/2013 08:54 AM, Haithem Jarraya wrote: Hi All, I am a bit confused about how the underlying token-aware routing works in the case of a composite key. Let's say I have a column family like this:

USERS( uuid userId, text firstname, text lastname, int age, PRIMARY KEY(userId, firstname, lastname))

My question is: do we need to have the values of userId, firstName and lastName available at the same time to create the token from the composite key, or can we get the right token just by looking at the routing key userId? Looking at the DataStax driver code is a bit confusing; it seems that it calculates the token only when all the values of the composite key are available, or am I missing something? Thanks, Haithem

-- Colin Blower / Software Engineer / Barracuda Networks Inc. +1 408-342-5576 (o)
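Colin's point, that the token is a function of the partition key alone, can be illustrated with a RandomPartitioner-style MD5 token. This is a sketch, not the driver's actual code; the 1.2 default is Murmur3Partitioner, but it likewise hashes only the serialized partition key, never the clustering columns:

```python
import hashlib
import uuid

def random_partitioner_token(partition_key: bytes) -> int:
    # RandomPartitioner-style token: |MD5(key)| as a big integer.
    digest = hashlib.md5(partition_key).digest()
    return abs(int.from_bytes(digest, "big", signed=True))

user_id = uuid.uuid4()
# Only userId's bytes feed the hash; firstname/lastname (the
# clustering columns) never enter the token calculation, so the
# driver only needs userId to pick a replica.
token = random_partitioner_token(user_id.bytes)
```

Any confusion in the driver code is about having all *components of a composite partition key* available; here the partition key is the single column userId, so the token is computable from it alone.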
Re: merge sstables
He has around 10 GB of data, so it should not be bad. This problem arises when you have a lot of data.

On Thu, Jul 11, 2013 at 2:10 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Jul 11, 2013 at 1:52 PM, sankalp kohli kohlisank...@gmail.com wrote: Scrub will keep the file size same. You need to move all sstables to be L0. The way to do this is to remove the json file which has level information.

This will work, but I believe it is subject to this?

./src/java/org/apache/cassandra/db/compaction/LeveledManifest.java line 228 of 577

// LevelDB gives each level a score of how much data it contains vs its ideal amount, and
// compacts the level with the highest score. But this falls apart spectacularly once you
// get behind. Consider this set of levels:
//   L0: 988 [ideal: 4]
//   L1: 117 [ideal: 10]
//   L2: 12  [ideal: 100]
//
// The problem is that L0 has a much higher score (almost 250) than L1 (11), so what we'll
// do is compact a batch of MAX_COMPACTING_L0 sstables with all 117 L1 sstables, and put the
// result (say, 120 sstables) in L1. Then we'll compact the next batch of MAX_COMPACTING_L0,
// and so forth. So we spend most of our i/o rewriting the L1 data with each batch.
//
// If we could just do *all* L0 a single time with L1, that would be ideal. But we can't
// -- see the javadoc for MAX_COMPACTING_L0.
//
// LevelDB's way around this is to simply block writes if L0 compaction falls behind.
// We don't have that luxury.
//
// So instead, we force compacting higher levels first. This may not minimize the number
// of reads done as quickly in the short term, but it minimizes the i/o needed to compact
// optimally, which gives us a long term win.

Ideal would be something like a major compaction for LCS which allows the end user to change the resulting SSTable sizes without forcing everything back to L0. =Rob
Re: listen_address and rpc_address address on different interface
On Thu, Jul 11, 2013 at 2:53 AM, Christopher Wirt chris.w...@struq.com wrote: If we were to take down a node, change the listen address, and then re-join the ring, the other nodes will mark the node as dead when we take it down and assume we have a new node when we bring it back on a different address. Lots of wasted rebalancing and compaction will start. We use Cassandra 1.2.4 w/vnodes.

In theory you can:
1) stop cassandra
2) change ip/config/etc.
3) restart cassandra with auto_bootstrap=false in cassandra.yaml

I believe this should just work, because the node knows from the system keyspace which tokens it is claiming; it simply announces to the cluster that it is now responsible for each of those ranges, and the other nodes should just say OK. If you do this, please let us know the results! Obviously you should try it first on a non-production cluster...

So back to question one, am I wasting my time? My hunch is probably, but it is just a hunch. =Rob
How many DCs can you have in a cluster?
In this C* Summit 2013 talk titled "A Deep Dive Into How Cassandra Resolves Inconsistent Data" [1], Jason Brown of Netflix mentions that they have 5 data centers in the same cluster: two in the US, one in Europe, one in Brazil and one in Asia (I'm going from memory now, since I don't want to watch the video again). Is there a practical limit on how many data centers one can have in a single cluster? Thanks, Blair [1] http://www.youtube.com/watch?v=VRZk-NhfX18&list=PLqcm6qE9lgKJzVvwHprow9h7KMpb5hcUU&index=57
Re: Alternate major compaction
Hi, I made the repository public. You can now check it out from here: https://github.com/cloudian/support-tools checksstablegarbage is the tool. Enjoy, and any feedback is welcome. Thanks, - Takenori

On Thu, Jul 11, 2013 at 10:12 PM, srmore comom...@gmail.com wrote: Thanks Takenori, it looks like the tool provides some good info that people can use. It would be great if you can share it with the community.

On Thu, Jul 11, 2013 at 6:51 AM, Takenori Sato ts...@cloudian.com wrote:

Hi, I think this is a common headache for users running a large Cassandra cluster in production. Running a major compaction is not the only cause; there are more. For example, I see two typical scenarios: 1. the backup use case, and 2. an active wide row. In case 1, say a piece of data is removed a year later. This means the tombstone on the row is one year away from the original row. To remove an expired row entirely, a compaction set has to include all of its fragments. So when are the original, one-year-old row and the tombstoned row included in the same compaction set? It is likely to take one year. In case 2, such an active wide row exists in most of the sstable files, and it typically contains many expired columns. But none of them would be removed entirely, because a compaction set practically never includes all the row fragments. By the way, there is a very convenient MBean API available: CompactionManager's forceUserDefinedCompaction. You can invoke a minor compaction on a file set you define. So the question is how to find an optimal set of sstable files. To that end, I wrote a tool that checks for garbage and prints out some useful information to help find such an optimal set. Here's a simple log output.
# /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db
[Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504071)]
===
ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
===
hello5/100.txt.1373502926003, 40, 40, YES, YES, Test5_BLOB-hc-3-Data.db
---
TOTAL, 40, 40
===

REMAINNING_SSTABLE_FILES means any other sstable files that contain the respective row. So the following is an optimal set:

# /opt/cassandra/bin/checksstablegarbage -e /cassandra_data/UserData/Test5_BLOB-hc-4-Data.db /cassandra_data/UserData/Test5_BLOB-hc-3-Data.db
[Keyspace, ColumnFamily, gcGraceSeconds(gcBefore)] = [UserData, Test5_BLOB, 300(1373504131)]
===
ROW_KEY, TOTAL_SIZE, COMPACTED_SIZE, TOMBSTONED, EXPIRED, REMAINNING_SSTABLE_FILES
===
hello5/100.txt.1373502926003, 223, 0, YES, YES
---
TOTAL, 223, 0
===

This tool relies on SSTableReader and an aggregation iterator, just as Cassandra does in compaction. I was considering sharing this with the community, so let me know if anyone is interested. Note that it is based on 1.0.7, so I will need to check and update it for newer versions. Thanks, Takenori

On Thu, Jul 11, 2013 at 6:46 PM, Tomàs Núnez tomas.nu...@groupalia.com wrote:

Hi, About a year ago we did a major compaction in our Cassandra cluster (a n00b mistake, I know), and since then we've had huge sstables that never get compacted, and we were condemned to repeat the major compaction process every once in a while (we are using the SizeTieredCompaction strategy, and we haven't evaluated LeveledCompaction yet, because it has its downsides and we've had no time to test them all in our environment). I was trying to find a way to resolve this situation (that is, do something like a major compaction that writes small sstables, not huge ones as major compaction does), and I couldn't find it in the documentation. I tried cleanup and scrub/upgradesstables, but they don't do that (as the documentation states).
Then I tried deleting all the data on a node and bootstrapping it (or nodetool rebuild-ing it), hoping that this way the sstables would get cleaned of deleted records and updates. But the rebuilt node just copied the sstables from another node as they were, cleaning nothing. So I tried a new approach: I switched the compaction strategy (SizeTiered to Leveled), forcing the sstables to be rewritten from scratch, and then switched it back (Leveled to SizeTiered). It took a while (but so does the major compaction process) and it worked: I have smaller sstables,