Re: Failing operations repair
It would be really great to look at your slides. Do you have any plans to share your presentation?

On Sat, Jun 9, 2012 at 1:14 AM, Віталій Тимчишин tiv...@gmail.com wrote:

Thanks a lot. I was not sure if the coordinator somehow tries to roll back transactions that failed to reach their consistency level. (Then again, I could not imagine a way to do this without 2-phase commit :) )

2012/6/8 aaron morton aa...@thelastpickle.com

I am making some cassandra presentations in Kyiv and would like to check that I am telling people the truth :)

Thanks for spreading the word :)

1) A failed (from the client-side view) operation may still be applied to the cluster.

Yes. If you fail with UnavailableException it is because, from the coordinator's view of the cluster, fewer than CL nodes are available. So retry. It is a somewhat similar story with TimedOutException.

2) The coordinator does not try to roll back an operation that failed because it was processed by fewer than consistency-level nodes.

Correct.

3) Hinted handoff works only for successful operations.

HH will be stored if the coordinator proceeds with the request. In 1.X a hint is stored on the coordinator if a replica is down when the request starts, or if the node does not reply within rpc_timeout.

4) Counters are not reliable because of (1).

If you get a TimedOutException when writing a counter you should not re-send the request.

5) Read repair may help propagate an operation that failed its consistency level but was persisted to some nodes.

Yes. It works in the background and by default is only enabled on 10% of requests. Note that RR is not the same as the consistency level for the read. If you read at a CL above ONE, the results from the CL nodes are always compared and differences resolved. RR is concerned with the replicas not involved in the CL read.

6) Manual repair is still needed because of (2) and (3).

Manual repair is *the* way to achieve consistency of data on disk.
HH and RR are optimisations designed to reduce the chance of a Digest Mismatch during a read at CL ONE. Repair is also essential for distributing tombstones before they are purged by compaction.

P.S. If some points apply only to some cassandra versions, I will be happy to know this too.

Assume everything is for version 1.X.

Thanks
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 8/06/2012, at 1:20 AM, Віталій Тимчишин wrote:

Hello. I am making some cassandra presentations in Kyiv and would like to check that I am telling people the truth :) Could the community tell me if the following points are true:

1) A failed (from the client-side view) operation may still be applied to the cluster.
2) The coordinator does not try to roll back an operation that failed because it was processed by fewer than consistency-level nodes.
3) Hinted handoff works only for successful operations.
4) Counters are not reliable because of (1).
5) Read repair may help propagate an operation that failed its consistency level but was persisted to some nodes.
6) Manual repair is still needed because of (2) and (3).

P.S. If some points apply only to some cassandra versions, I will be happy to know this too.

--
Best regards,
Vitalii Tymchyshyn
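The retry guidance in the thread above (retry on UnavailableException; never re-send a timed-out counter increment) can be sketched as a client-side policy. This is an illustration of the semantics only: the exception classes below are local stand-ins for the Thrift-level errors, and `do_write` is whatever callable performs the actual write in your client; none of this is a real client API.

```python
# Sketch of a client-side retry policy matching the answers above.
# Assumptions: exception classes are local stand-ins, not the real
# Thrift types; `do_write` is the caller's write operation.

class UnavailableException(Exception):
    """Coordinator saw fewer than CL live replicas: the write was not started."""

class TimedOutException(Exception):
    """CL replicas did not ack within rpc_timeout: the write MAY still apply."""

def write_with_retry(do_write, is_counter=False, retries=3):
    last_error = None
    for _ in range(retries):
        try:
            return do_write()
        except UnavailableException as e:
            # Nothing was applied, so retrying is always safe.
            last_error = e
        except TimedOutException as e:
            if is_counter:
                # The increment may already be applied on some replicas;
                # re-sending it risks double counting, so surface the error.
                raise
            # A normal write with the same timestamp is idempotent, so a
            # retry at worst re-applies the same value.
            last_error = e
    raise last_error
```

The asymmetry for counters is exactly point (4) above: a plain write can be retried safely, a timed-out counter increment cannot.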
Re: cassandra read latency help
You may also consider disabling the key/row cache altogether. 1M rows * 400 bytes = 400 MB of data, which can easily sit in the fs cache, so you will access your hot keys at thousands of qps without hitting disk at all. Enabling compression can make the situation even better.

On Thu, May 31, 2012 at 12:01 PM, Gurpreet Singh gurpreet.si...@gmail.com wrote:

Aaron, thanks for your email. The test kinda resembles how the actual application will be. It is going to be a simple key-value store with 500 million keys per node. The traffic will be read heavy in steady state, and some keys will have a lot more traffic than others. The expected hot rows are estimated at anywhere between 50 and 1 million keys.

I have already populated this test system with 500 million keys and compacted it all to 1 file to check the size of the bloom filter and the index. This is how I am estimating my memory for 500 million keys; please correct me if I am wrong or missing any step.

bloom filter: 1 gig
index samples: the Index file is 8.5 gig, and I believe it covers all keys. Index interval is 128, hence in RAM this would be (8.5g / 128) * 10 (factor for datastructure overhead) = 664 MB (let's say 1 gig)
key cache size (3 million): 3 gigs
memtable_total_space_mb: 2 gigs

This totals 7 gigs; my heap size is 8 gigs. Is there anything else that I am missing here? When I run top right now, it shows java at 96% memory; that's a concern because there is no write load. Should I be looking at any other number here?

Off-heap row cache: 500,000 - 750,000 rows, ~3 to 5 gigs (avg row size = 250-500 bytes)

My test system has 16 gigs RAM; the production system will mostly have 32 gigs RAM and 12 spindles instead of the 6 I am testing with.

I changed the underlying filesystem from xfs to ext2, and I am seeing better results, though not the best. The cfstats latency is down to 20 ms at 35 qps read load. Row cache hit rate is 0.21, key cache = 0.75.
Measuring from the client side, I am seeing roughly 10-15 ms per key; I would want even less though, so any tips would greatly help. In production I am hoping the row cache hit rate will be higher.

The biggest thing affecting my system right now is the "Invalid frame size of 0" error that the cassandra server seems to be printing. It is causing read timeouts every minute or two. I haven't been able to figure out a way to fix this one. I see someone else also reported it, but I am not sure whether the problem is in hector, cassandra or thrift.

Thanks
Gurpreet

On Wed, May 30, 2012 at 4:38 PM, aaron morton aa...@thelastpickle.com wrote:

80 ms per request sounds high. I'm doing some guessing here; I am guessing memory usage is the problem.

* I assume you are no longer seeing excessive GC activity.
* The key cache will not get used when you hit the row cache. I would disable the row cache if you have a random workload, which it looks like you do.
* 500 million is a lot of keys to have on a single node. At the default index sample of every 128 keys it will have about 4 million samples, which is probably taking up a lot of memory.

Is this testing a real-world scenario or an abstract benchmark? IMHO you will get more insight from testing something that resembles your application.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 26/05/2012, at 8:48 PM, Gurpreet Singh wrote:

Hi Aaron, here is the latest on this. I switched to a node with 6 disks, am running some read tests, and I am seeing something weird.

setup: 1 node, cassandra 1.0.9, 8 cpu, 16 gig RAM, 6 7200 rpm SATA data disks striped at 512 kb, commitlog mirrored.
1 keyspace with just 1 column family
random partitioner
total number of keys: 500 million (the keys are just longs from 1 to 500 million)
avg key size: 8 bytes
bloom filter size: 1 gig
total disk usage: 70 gigs, compacted to 1 sstable
mean compacted row size: 149 bytes
heap size: 8 gigs
keycache size: 2 million (takes around 2 gigs in RAM)
rowcache size: 1 million (off-heap)
memtable_total_space_mb: 2 gigs

test: trying to do 5 reads per second. Each read is a multigetslice query for just 1 key, 2 columns.

observations:
row cache hit rate: 0.4
key cache hit rate: 0.0 (this will increase later on as the system moves to steady state)
cfstats: 80 ms
iostat (every 5 seconds):
  r/s: 400
  %util: 20% (all disks at equal utilization)
  await: 65-70 ms (for each disk)
  svctm: 2.11 ms (for each disk)
  r-kB/s: 35000

Why this is weird: 5 reads per second is causing a latency of 80 ms per request (according to cfstats). Isn't this too high? And 35 MB/s is being read from disk, which is again very strange; this number is way too high when the avg row size is just 149 bytes. Even index reads should not cause this much data to be read from disk. What I understand is that each read request translates to 2 disk accesses
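Gurpreet's heap budget above is easy to sanity-check with arithmetic. All figures come from the posts in this thread; the 10x per-sample datastructure overhead is the poster's own assumption, not a measured value (the small difference from the quoted 664 MB is just decimal vs binary units).

```python
# Back-of-envelope heap estimate for 500M keys, using the thread's figures.
GB = 1024 ** 3
MB = 1024 ** 2

bloom_filter   = 1 * GB
index_file     = 8.5 * GB    # on-disk index covering all keys
index_interval = 128         # default index sample rate
overhead       = 10          # assumed per-sample datastructure overhead
index_samples  = index_file / index_interval * overhead

key_cache      = 3 * GB      # 3 million entries
memtables      = 2 * GB      # memtable_total_space_mb

total = bloom_filter + index_samples + key_cache + memtables
print(round(index_samples / MB))   # 680 (MB), close to the post's 664 MB
print(round(total / GB, 1))        # 6.7 (GB) against an 8 GB heap
```

The estimate leaves only about a gigabyte of headroom on an 8 GB heap, which fits Aaron's point that at 500 million keys per node the index samples alone become a significant consumer of memory.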
Re: cassandra read latency help
But I think it's a bad idea, since the hot data will be evenly distributed between multiple sstables and filesystem pages.

On Thu, May 31, 2012 at 1:08 PM, crypto five cryptof...@gmail.com wrote:

You may also consider disabling the key/row cache altogether. 1M rows * 400 bytes = 400 MB of data, which can easily sit in the fs cache, so you will access your hot keys at thousands of qps without hitting disk at all. Enabling compression can make the situation even better.
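Two quick checks on the figures running through this thread. The inputs come from the posts above; the per-read size is simple division on the iostat numbers, not anything Cassandra reports directly.

```python
# 1) The page-cache argument: the estimated hot set fits in RAM many times over.
hot_rows = 1_000_000
avg_row_bytes = 400
hot_set_mib = hot_rows * avg_row_bytes // 2**20
print(hot_set_mib)        # 381 (MiB), tiny next to 16 GB of RAM

# 2) Why 35 MB/s of disk reads looks wrong for 149-byte rows.
reads_per_sec = 400       # iostat r/s
read_kb_per_sec = 35_000  # iostat r-kB/s
print(read_kb_per_sec / reads_per_sec)   # 87.5 KB per disk read
# 87.5 KB per read against 149-byte rows points at large readahead
# (the array is striped at 512 kb), not at the row data itself.
```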
Re: Cassandra dying when gets many deletes
I agree with your observations. On the other hand, I found that ColumnFamily.size() doesn't calculate the object size correctly. It doesn't count the sizes of two object fields, and it returns 0 if there are no objects in the columns container. I increased the initial size variable value to 24, which is the size of two objects (I didn't know the correct value), and cassandra started calculating the live ratio correctly, increasing the throughput value and flushing memtables.

On Tue, Apr 24, 2012 at 2:00 AM, Vitalii Tymchyshyn tiv...@gmail.com wrote:

Hello. For me, the "there are no dirty column families" in your message suggests it's possibly the same problem. The issue is that column families that get only full-row deletes do not get ANY SINGLE dirty byte accounted, and so can't be picked by the flusher. No ratio can help, simply because it is multiplied by 0. Check your cfstats.

On 24.04.12 09:54, crypto five wrote:

Thank you Vitalii. Looking at Jonathan's answer to your patch, I think it's probably not my case. I see that LiveRatio is calculated in my case, but the calculations look strange:

WARN [MemoryMeter:1] 2012-04-23 23:29:48,430 Memtable.java (line 181) setting live ratio to maximum of 64 instead of Infinity
INFO [MemoryMeter:1] 2012-04-23 23:29:48,432 Memtable.java (line 186) CFS(Keyspace='lexems', ColumnFamily='countersCF') liveRatio is 64.0 (just-counted was 64.0). calculation took 63355ms for 0 columns

Looking at the comment in the code, "If it gets higher than 64 something is probably broken.", it looks like that's probably the problem. Not sure how to investigate it.

2012/4/23 Віталій Тимчишин tiv...@gmail.com

See https://issues.apache.org/jira/browse/CASSANDRA-3741. I did post a fix there that helped me.

2012/4/24 crypto five cryptof...@gmail.com

Hi, I have 50 million rows in a column family on a 4 GB RAM box. I allocated 2 GB to cassandra. I have a program which traverses this CF and cleans some data there, generating about 20k delete statements per second.
After about 3 million deletions, cassandra stops responding to queries: it doesn't react to the CLI, nodetool etc. I see in the logs that it tries to free some memory but can't, even if I wait a whole day. I also see the following in the logs:

INFO [ScheduledTasks:1] 2012-04-23 18:38:13,333 StorageService.java (line 2647) Unable to reduce heap usage since there are no dirty column families

When I look at a memory dump, I see that memory goes to ConcurrentSkipListMap (10%), HeapByteBuffer (13%), DecoratedKey (6%), int[] (6%), BigInteger (8.2%), ConcurrentSkipListMap$HeadIndex (7.2%), ColumnFamily (6.5%), ThreadSafeSortedColumns (13.7%), long[] (5.9%).

What can I do to stop cassandra dying? Why can't it free the memory? Any ideas? Thank you.

--
Best regards,
Vitalii Tymchyshyn
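A toy model of the flushing problem discussed in this thread. The names and structure below are illustrative, not Cassandra's actual Memtable/flusher code; the point is only that a live ratio multiplied by zero accounted bytes is always zero, no matter how the ratio is capped.

```python
# Simplified model (assumed names) of memtable heap estimation.
def estimated_memtable_heap(accounted_dirty_bytes, live_ratio):
    # Cassandra estimates a memtable's heap use as the accounted
    # ("throughput") bytes scaled by the measured live ratio.
    return accounted_dirty_bytes * live_ratio

# Per the bug discussed above, a full-row delete was accounted as 0 bytes:
row_delete_bytes = 0

# Even at the maximum capped live ratio of 64 the estimate stays 0 ...
print(estimated_memtable_heap(row_delete_bytes, 64))   # 0
# ... so the flusher reports "no dirty column families" and the tombstones
# pile up on the heap until the node stops responding.

# Once each delete is accounted (e.g. starting the size at 24 bytes, as in
# the fix above), the estimate becomes nonzero and the memtable can be
# selected for flushing. For 3 million deletions:
print(estimated_memtable_heap(24 * 3_000_000, 64) // 2**20)   # 4394 (MiB)
```

The second figure also shows why a 2 GB heap is overwhelmed long before 3 million unaccounted tombstones are flushed.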
Re: Cassandra dying when gets many deletes
Thank you Vitalii. Looking at Jonathan's answer to your patch, I think it's probably not my case. I see that LiveRatio is calculated in my case, but the calculations look strange:

WARN [MemoryMeter:1] 2012-04-23 23:29:48,430 Memtable.java (line 181) setting live ratio to maximum of 64 instead of Infinity
INFO [MemoryMeter:1] 2012-04-23 23:29:48,432 Memtable.java (line 186) CFS(Keyspace='lexems', ColumnFamily='countersCF') liveRatio is 64.0 (just-counted was 64.0). calculation took 63355ms for 0 columns

Looking at the comment in the code, "If it gets higher than 64 something is probably broken.", it looks like that's probably the problem. Not sure how to investigate it.

2012/4/23 Віталій Тимчишин tiv...@gmail.com

See https://issues.apache.org/jira/browse/CASSANDRA-3741. I did post a fix there that helped me.
Cassandra dying when gets many deletes
Hi,

I have 50 million rows in a column family on a 4 GB RAM box. I allocated 2 GB to cassandra. I have a program which traverses this CF and cleans some data there, generating about 20k delete statements per second. After about 3 million deletions, cassandra stops responding to queries: it doesn't react to the CLI, nodetool etc. I see in the logs that it tries to free some memory but can't, even if I wait a whole day. I also see the following in the logs:

INFO [ScheduledTasks:1] 2012-04-23 18:38:13,333 StorageService.java (line 2647) Unable to reduce heap usage since there are no dirty column families

When I look at a memory dump, I see that memory goes to ConcurrentSkipListMap (10%), HeapByteBuffer (13%), DecoratedKey (6%), int[] (6%), BigInteger (8.2%), ConcurrentSkipListMap$HeadIndex (7.2%), ColumnFamily (6.5%), ThreadSafeSortedColumns (13.7%), long[] (5.9%).

What can I do to stop cassandra dying? Why can't it free the memory? Any ideas? Thank you.