Re: Upgrading from 1.2 to 2.1 questions
Sure, but the question is really about going from 1.2 to 2.0 ...

On 2015-02-02 13:59:27 +0000, Kai Wang said: I would not use 2.1.2 for production yet. It doesn't seem stable enough based on the feedback I see here. The newest 2.0.12 may be a better option.

On Feb 2, 2015 8:43 AM, Sibbald, Charles charles.sibb...@bskyb.com wrote: Hi Oleg, What is the minor version of 1.2? I am looking to do the same for 1.2.14 in a very large cluster. Regards, Charles

On 02/02/2015 13:33, Oleg Dulin oleg.du...@gmail.com wrote: Dear Distinguished Colleagues: We'd like to upgrade our cluster from 1.2 to 2.0 and then to 2.1. We are using the Pelops Thrift client, which has long been abandoned by its authors. I've read that 2.x has changes to the Thrift protocol that make it incompatible with 1.2 (and of course now the link to that site eludes me). If that is true, we need to first upgrade our Thrift client and then upgrade Cassandra. Let's start by confirming whether that is indeed the case -- if it is, I have my work cut out for me. Does anyone know for sure? Regards, Oleg

Information in this email including any attachments may be privileged, confidential and is intended exclusively for the addressee. The views expressed may not be official policy, but the personal views of the originator. If you have received it in error, please notify the sender by return e-mail and delete it from your system. You should not reproduce, distribute, store, retransmit, use or disclose its contents to anyone. Please note we reserve the right to monitor all e-mail communication through our internal and external networks. SKY and the SKY marks are trademarks of British Sky Broadcasting Group plc and Sky International AG and are used under licence. British Sky Broadcasting Limited (Registration No. 2906991), Sky-In-Home Service Limited (Registration No. 2067075) and Sky Subscribers Services Limited (Registration No. 2340150) are direct or indirect subsidiaries of British Sky Broadcasting Group plc (Registration No. 2247735). All of the companies mentioned in this paragraph are incorporated in England and Wales and share the same registered office at Grant Way, Isleworth, Middlesex TW7 5QD.
Re: Upgrading from 1.2 to 2.1 questions
Our minor version is 1.2.15 ... I am not looking forward to the experience, and would like to gather as much information as possible. This presents an opportunity to also review the data structures we use and possibly move them out of Cassandra. Oleg

On 2015-02-02 13:42:52 +0000, Sibbald, Charles said: Hi Oleg, What is the minor version of 1.2? I am looking to do the same for 1.2.14 in a very large cluster. Regards, Charles

On 02/02/2015 13:33, Oleg Dulin oleg.du...@gmail.com wrote: Dear Distinguished Colleagues: We'd like to upgrade our cluster from 1.2 to 2.0 and then to 2.1. We are using the Pelops Thrift client, which has long been abandoned by its authors. ...
Upgrading from 1.2 to 2.1 questions
Dear Distinguished Colleagues: We'd like to upgrade our cluster from 1.2 to 2.0 and then to 2.1. We are using the Pelops Thrift client, which has long been abandoned by its authors. I've read that 2.x has changes to the Thrift protocol that make it incompatible with 1.2 (and of course now the link to that site eludes me). If that is true, we need to first upgrade our Thrift client and then upgrade Cassandra. Let's start by confirming whether that is indeed the case -- if it is, I have my work cut out for me. Does anyone know for sure? Regards, Oleg
Re: Upgrading from 1.2 to 2.1 questions
What about Java clients that were built for 1.2, and how do they work with 2.0?

On 2015-02-02 14:32:53 +0000, Carlos Rolo said: Using Pycassa (https://github.com/pycassa/pycassa) I had no trouble with the clients writing/reading from 1.2.x to 2.0.x (can't recall the minor versions off the top of my head right now). Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com

On Mon, Feb 2, 2015 at 3:21 PM, Oleg Dulin oleg.du...@gmail.com wrote: Sure, but the question is really about going from 1.2 to 2.0 ...

-- Regards, Oleg Dulin http://www.olegdulin.com
EC2 Snitch load imbalance
I have a setup with 6 Cassandra nodes (1.2.18), using RandomPartitioner, not using vnodes -- this is a legacy cluster. We went from 3 nodes to 6 in the last few days to add capacity. However, there appears to be an imbalance:

Datacenter: us-east
==========
Replicas: 2

Address    Rack  Status  State   Load       Owns    Token
                                                    113427455640312821154458202477256070484
x.x.x.73   1d    Up      Normal  154.64 GB  33.33%  85070591730234615865843651857942052863
x.x.x.251  1a    Up      Normal  62.26 GB   16.67%  28356863910078205288614550619314017621
x.x.x.238  1b    Up      Normal  243.7 GB   50.00%  56713727820156410577229101238628035242
x.x.x.25   1a    Up      Normal  169.3 GB   33.33%  210
x.x.x.162  1b    Up      Normal  118.24 GB  50.00%  141784319550391026443072753096570088105
x.x.x.208  1d    Up      Normal  226.85 GB  16.67%  113427455640312821154458202477256070484

What is the cause of this imbalance? How can I rectify it? Regards, Oleg
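For reference, with RandomPartitioner a balanced N-node single-DC ring spaces initial tokens evenly over 0 .. 2**127. A minimal sketch (my own, not from the thread) to compute the ideal tokens:

```python
# Ideal initial tokens for an N-node RandomPartitioner ring:
# evenly spaced over the token range 0 .. 2**127 - 1.
def balanced_tokens(n):
    return [i * (2**127 // n) for i in range(n)]

for t in balanced_tokens(6):
    print(t)
```

Comparing the output with the ring above, five of the six tokens match the even spacing exactly and the sixth (210) sits a hair above the ideal 0, so the tokens are effectively balanced; the uneven Owns% and load come from somewhere other than token assignment.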
Re: EC2 Snitch load imbalance
Thanks Mark. The output in my original post is with the keyspace specified.

On 2014-10-28 12:00:15 +0000, Mark Reddy said: Oleg, if you are running nodetool status, be sure to specify the keyspace also. If you don't specify the keyspace the results will be nonsense. https://issues.apache.org/jira/browse/CASSANDRA-7173 Regards, Mark

On 28 October 2014 10:35, Oleg Dulin oleg.du...@gmail.com wrote: I have a setup with 6 Cassandra nodes (1.2.18), using RandomPartitioner, not using vnodes -- this is a legacy cluster. We went from 3 nodes to 6 in the last few days to add capacity. However, there appears to be an imbalance: ...
Moving Cassandra from EC2 Classic into VPC
Dear Colleagues: I need to move Cassandra from EC2 Classic into a VPC. What I was thinking is that I can create a new data center within the VPC and rebuild it from my existing one (switching to vnodes while I am at it). However, I don't understand how the EC2Snitch will deal with this. Another idea I had was taking the EC2Snitch configuration and converting it into a PropertyFileSnitch. But I still don't understand how to perform this move, since it would require my newly created VPC instances to have public IPs -- something I would like to avoid. Any thoughts are appreciated. Regards, Oleg
Re: Moving Cassandra from EC2 Classic into VPC
I get that, but if you read my opening post, I have an existing cluster in EC2 Classic that I have no idea how to move to a VPC cleanly.

On 2014-09-08 19:52:28 +0000, Bram Avontuur said: I have set up Cassandra in a VPC with the EC2Snitch and it works without issues. I didn't need to do anything special to the configuration. I have created instances in 2 availability zones, and it automatically picks them up as 2 different data racks. Just make sure your nodes can see each other in the VPC, e.g. set up a security group that allows connections from other nodes in the same group. There should be no need to use public IPs if whatever talks to Cassandra is also within your VPC. Hope this helps. Bram

On Mon, Sep 8, 2014 at 3:34 PM, Oleg Dulin oleg.du...@gmail.com wrote: Dear Colleagues: I need to move Cassandra from EC2 Classic into a VPC. ...
Options for expanding Cassandra cluster on AWS
Distinguished Colleagues: Our current Cassandra cluster on AWS looks like this: 3 nodes in N. Virginia, one per availability zone, RF=3. Each node is a c3.4xlarge with 2x160 GB SSDs in RAID-0 (~300 GB of SSD on each node). Works great; I find it the most optimal configuration for a Cassandra node. But the time is coming soon when I need to expand storage capacity. I have the following options in front of me:

1) Add 3 more c3.4xlarge nodes. This keeps the amount of data on each node reasonable, and all repairs and other tasks can complete in a reasonable amount of time. The downside is that c3.4xlarge instances are pricey.

2) Add provisioned EBS volumes. These days I can get SSD-backed EBS with up to 4000 provisioned IOPS. I can add those volumes to the data_file_directories list in cassandra.yaml, and I expect Cassandra can deal with that JBOD-style. The upside is that it is much cheaper than option #1 above; the downside is that it is a much slower configuration and repairs can take longer.

I'd appreciate any input on this topic. Thanks in advance, Oleg
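As a rough sanity check on option 1, here is some back-of-the-envelope capacity arithmetic. The 50% compaction-headroom figure is the usual rule of thumb for size-tiered compaction, not a number from this thread:

```python
# Rough usable-capacity estimate for the current 3-node cluster.
# Assumptions: ~300 GB of local SSD per node, keep ~50% of disk free
# as compaction headroom, and RF=3 stores every row on all three nodes.
nodes = 3
disk_per_node_gb = 300
headroom = 0.5
rf = 3

usable_per_node_gb = disk_per_node_gb * headroom      # 150 GB per node
cluster_usable_gb = usable_per_node_gb * nodes        # 450 GB across the cluster
unique_data_gb = cluster_usable_gb / rf               # 150 GB of unique data

print(unique_data_gb)
```

Under those assumptions, doubling to six nodes doubles the unique-data capacity to roughly 300 GB, while option 2 raises per-node disk instead, at the cost of slower I/O.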
ANNOUNCEMENT: cassandra-aws project
Colleagues: I'd like to announce a pet project I started: https://github.com/olegdulin/cassandra-aws

What I would like to accomplish as an end goal is an Amazon Marketplace AMI that makes it easy to configure a new Cassandra cluster or add new nodes to an existing Cassandra cluster, without having to jump through hoops. Ideally I'd like to do for Cassandra what RDS does for PostgreSQL in AWS, for instance, but I am not sure if ultimately that is possible.

To get started, I shared some notes in the wiki as well as a couple of scripts I used to simplify things for myself. I put those scripts together from input I received on the #cassandra IRC channel and this mailing list, and I am very grateful to the community for helping me through this -- so this is my contribution back.

Consider this email a solicitation for help. I am open to discussions, contributions, suggestions, anything you can help with. Regards, Oleg
Re: ANNOUNCEMENT: cassandra-aws project
I guess I didn't know about the ComboAMI! Thanks! I'll look into this. I have been rolling my own AMIs for a simple reason -- we have environments both on-premises and in AWS. I wanted them to be structurally the same, so I used our on-prem configurations as a starting point. Regards, Oleg

On 2014-06-06 15:25:44 +0000, Michael Shuler said:

On 06/06/2014 09:57 AM, Oleg Dulin wrote: I'd like to announce a pet project I started: https://github.com/olegdulin/cassandra-aws

Cool :) https://github.com/riptano/ComboAMI is the DataStax AMI repo.

What I would like to accomplish as an end goal is an Amazon Marketplace AMI that makes it easy to configure a new Cassandra cluster or add new nodes to an existing Cassandra cluster, without having to jump through hoops. ...

Is there something that ComboAMI doesn't cover for your needs, or is there some area that could be improved upon?

To get started, I shared some notes in the wiki as well as a couple of scripts I used to simplify things for myself. ... Consider this email a solicitation for help. ...

Would it be less overall work to implement changes you'd like to see by contributing them to ComboAMI? I fully support lots of variations of tools - whatever makes things easiest for people to do exactly what they need, or in languages they're comfortable with, etc.
How to balance this cluster out ?
I have a cluster that looks like this:

Datacenter: us-east
==========
Replicas: 2

Address  Rack  Status  State   Load       Owns    Token
                                                  113427455640312821154458202477256070484
*.*.*.1  1b    Up      Normal  141.88 GB  66.67%  56713727820156410577229101238628035242
*.*.*.2  1a    Up      Normal  113.2 GB   66.67%  210
*.*.*.3  1d    Up      Normal  102.37 GB  66.67%  113427455640312821154458202477256070484

Obviously, the first node in 1b has 40% more data than the others. If I wanted to rebalance this cluster, how would I go about that? Would shifting the tokens accomplish what I need, and which tokens? Regards, Oleg
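One quick check before moving anything: compute the primary-range size implied by each token in the listing. This is a sketch of my own using the tokens shown above:

```python
# Fraction of the RandomPartitioner ring (0 .. 2**127) covered by each
# token's primary range, from the nodetool ring output above.
RING = 2**127
tokens = sorted([
    56713727820156410577229101238628035242,   # *.*.*.1
    210,                                       # *.*.*.2
    113427455640312821154458202477256070484,  # *.*.*.3
])
fractions = []
for i, t in enumerate(tokens):
    size = (t - tokens[i - 1]) % RING   # range (previous token, t], wrapping
    fractions.append(size / RING)
print([round(f, 4) for f in fractions])
```

All three ranges come out at ~33.3%, so the tokens are already effectively balanced; shifting them would not explain or fix a 40% load difference, which more likely comes from data skew or rack-aware replica placement.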
Re: How to rebalance a cluster?
I keep asking the same question, it seems -- a sign of insanity. Cassandra version 1.2, not using vnodes (legacy).

On 2014-03-07 19:37:48 +0000, Robert Coli said:

On Fri, Mar 7, 2014 at 6:00 AM, Oleg Dulin oleg.du...@gmail.com wrote: I have the following situation:

10.194.2.5  RAC1  Up  Normal  378.6 GB   50.00%  0
10.194.2.4  RAC1  Up  Normal  427.5 GB   50.00%  127605887595351923798765477786913079295
10.194.2.7  RAC1  Up  Normal  350.63 GB  50.00%  85070591730234615865843651857942052864
10.194.2.6  RAC1  Up  Normal  314.42 GB  50.00%  42535295865117307932921825928971026432

As you can see, the 2.4 node has over 100 GB more data than 2.6. You can definitely see the imbalance. It also happens to be the heaviest-loaded node by CPU usage.

The first step is to understand why. Are you using vnodes? What version of Cassandra?

What would be a clean way to rebalance? If I use a move operation followed by cleanup, would it require a repair afterwards?

Move is not, as I understand it, subject to CASSANDRA-2434, so should not require a post-move repair. =Rob
How safe is nodetool move in 1.2 ?
I need to rebalance my cluster. I am sure this question has been asked before -- will 1.2 continue to serve reads and writes correctly while a move is in progress? Need this for my sanity. -- Regards, Oleg Dulin http://www.olegdulin.com
More node imbalance questions
At a different customer, I have this situation:

10.194.2.5  RAC1  Up  Normal  192.2 GB   50.00%  0
10.194.2.4  RAC1  Up  Normal  348.07 GB  50.00%  127605887595351923798765477786913079295
10.194.2.7  RAC1  Up  Normal  387.31 GB  50.00%  85070591730234615865843651857942052864
10.194.2.6  RAC1  Up  Normal  454.97 GB  50.00%  42535295865117307932921825928971026432

Is my understanding correct that I should just move tokens around by proportional amounts to bring the disk utilization in line? -- Regards, Oleg Dulin http://www.olegdulin.com
Re: Commitlog questions
Parag: To answer your questions:

1) The default is just that, a default. I wouldn't advise raising it, though: the bigger it is, the longer it takes to restart the node.

2) I think they just use fsync. There is no queue. All files in Cassandra use java.nio buffers, but they need to be fsynced periodically. Look at the commitlog_sync parameters in cassandra.yaml; the comments there explain how it works. I believe the difference between periodic and batch is just that -- periodic fsyncs every 10 seconds, while batch fsyncs whenever there were changes within a (much smaller) time window.

On 2014-04-09 10:06:52 +0000, Parag Patel said:

1) Why is the default 4GB? Has anyone changed this? What are some aspects to consider when determining the commitlog size?
2) If the commitlog is in periodic mode, there is a property to set a time interval to flush the incoming mutations to disk. This implies that there is a queue inside Cassandra to hold this data in memory until it is flushed.
a. Is there a name for this queue?
b. Is there a limit for this queue?
c. Are there any tuning parameters for this queue?

Thanks, Parag

-- Regards, Oleg Dulin http://www.olegdulin.com
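To illustrate the periodic-vs-batch distinction, here is a toy model of the durability semantics as I read them from the cassandra.yaml comments. It is not Cassandra's actual code, just a sketch of the acknowledgement difference:

```python
# Toy model of the durability semantics of the two commitlog_sync modes.
class PeriodicLog:
    """periodic: writes are acked immediately; a background fsync runs
    every commitlog_sync_period_in_ms, so up to one period's worth of
    acked writes can be lost in a crash."""
    def __init__(self):
        self.acked, self.durable = [], []
    def write(self, mutation):
        self.acked.append(mutation)        # client sees success right away
    def periodic_fsync(self):              # fired by a background timer
        self.durable = list(self.acked)
    def lost_on_crash(self):
        return [m for m in self.acked if m not in self.durable]

class BatchLog:
    """batch: writes are grouped for up to commitlog_sync_batch_window_in_ms
    and acked only after the fsync, so an acked write is never lost."""
    def __init__(self):
        self.pending, self.acked, self.durable = [], [], []
    def write(self, mutation):
        self.pending.append(mutation)      # not yet acked to the client
    def batch_fsync(self):                 # fires within the batch window
        self.durable += self.pending
        self.acked += self.pending
        self.pending = []
```

A crash between write() and the next fsync loses already-acknowledged data only in the periodic case, which is the trade-off the yaml comments describe.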
Why is my cluster imbalanced ?
I added two more nodes on Friday, and moved tokens around. For four nodes, the tokens should be:

Node #1: 0
Node #2: 42535295865117307932921825928971026432
Node #3: 85070591730234615865843651857942052864
Node #4: 127605887595351923798765477786913079296

And yet my ring status shows this (for a specific keyspace). RF=2.

Datacenter: us-east
==========
Replicas: 2

Address  Rack  Status  State   Load      Owns     Token
                                                  42535295865117307932921825928971026432
x.x.x.1  1b    Up      Normal  13.51 GB  25.00%   127605887595351923798765477786913079296
x.x.x.2  1b    Up      Normal  4.46 GB   25.00%   85070591730234615865843651857942052164
x.x.x.3  1a    Up      Normal  62.58 GB  100.00%  0
x.x.x.4  1b    Up      Normal  66.71 GB  50.00%   42535295865117307932921825928971026432

Datacenter: us-west
==========
Replicas: 1

Address  Rack  Status  State   Load      Owns     Token
x.x.x.5  1b    Up      Normal  62.72 GB  100.00%  100

-- Regards, Oleg Dulin http://www.olegdulin.com
Re: Why is my cluster imbalanced ?
Excellent, thanks.

On 2014-04-07 12:23:51 +0000, Tupshin Harper said: Your us-east datacenter has RF=2 and 2 racks, which is the right way to do it (I would rarely recommend using a different number of racks than your RF). But by having three nodes on one rack (1b) and only one on the other (1a), you are telling Cassandra to distribute the data so that no two copies of the same partition exist on the same rack. So with rack ownership of 100% and 100% respectively, there is no even way to distribute your data among those four nodes. tl;dr: Switch node 2 to rack 1a. -Tupshin

On Mon, Apr 7, 2014 at 8:08 AM, Oleg Dulin oleg.du...@gmail.com wrote: I added two more nodes on Friday, and moved tokens around. ...

-- Regards, Oleg Dulin http://www.olegdulin.com
Re: Why is my cluster imbalanced ?
Tupshin: For EC2, with 3 availability zones in us-east, would you recommend RF=3? That would make sense, wouldn't it... That's what I'll do for production. Oleg

On 2014-04-07 12:23:51 +0000, Tupshin Harper said: Your us-east datacenter has RF=2 and 2 racks, which is the right way to do it (I would rarely recommend using a different number of racks than your RF). But by having three nodes on one rack (1b) and only one on the other (1a), you are telling Cassandra to distribute the data so that no two copies of the same partition exist on the same rack. So with rack ownership of 100% and 100% respectively, there is no even way to distribute your data among those four nodes. tl;dr: Switch node 2 to rack 1a. -Tupshin

On Mon, Apr 7, 2014 at 8:08 AM, Oleg Dulin oleg.du...@gmail.com wrote: I added two more nodes on Friday, and moved tokens around. ...

-- Regards, Oleg Dulin http://www.olegdulin.com
Re: need help with Cassandra 1.2 Full GCing -- output of jmap histogram
Sigh, so I am back to where I started from... I did lower gc_grace... jmap -histo:live shows the heap is stuffed with DeletedColumn and ExpiringColumn instances. This is extremely frustrating.

On 2014-03-11 19:24:50 +0000, Oleg Dulin said: Good news is that since I lowered the gc_grace period it collected over 100 gigs of tombstones and seems much happier now. Oleg

On 2014-03-10 13:33:43 +0000, Jonathan Lacefield said: Hello, You have several options: 1) going forward, lower gc_grace_seconds ... 2) you could also lower the tombstone compaction threshold and interval ... 3) to clean out old tombstones you could always run a manual compaction, though these aren't typically recommended ... Hope this helps. Jonathan Lacefield, Solutions Architect, DataStax (404) 822 3487

On Mon, Mar 10, 2014 at 6:41 AM, Oleg Dulin oleg.du...@gmail.com wrote: I get that :) What I'd like to know is how to fix that :) ...

-- Regards, Oleg Dulin http://www.olegdulin.com
Need help understanding hinted_handoff_throttle_in_kb
I came across something in the Cassandra configuration that made me concerned. The default value for hinted_handoff_throttle_in_kb is 1024, i.e. one megabyte per second. I have four nodes and RF=2. I have the hint window set to 24 hours, to avoid having to do repairs if I take longer than that to reboot a node. What got me thinking, though, is this: if I'm generating gigabytes' worth of hints during the day, and across four nodes the throttle becomes 250 KB per second, that is too slow to replay all of my hints properly. Is that right? I need to understand this setting better. I would like to make sure that all of my hints get replayed. What is a recommended setting? Any input is greatly appreciated. Regards, Oleg
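The arithmetic behind that worry, as a sketch: the 5 GB backlog is an assumed number for illustration, and the "divided across nodes" reading of the throttle is taken from the question itself.

```python
# How long a hint backlog would take to replay at the default throttle.
throttle_kb_per_s = 1024          # hinted_handoff_throttle_in_kb default
delivering_nodes = 4              # per the question: throttle split across nodes
backlog_gb = 5                    # assumed backlog size for illustration

effective_kb_per_s = throttle_kb_per_s / delivering_nodes   # 256 KB/s
seconds = backlog_gb * 1024 * 1024 / effective_kb_per_s
hours = seconds / 3600
print(round(hours, 1))            # roughly 5.7 hours for a 5 GB backlog
```

So under these assumptions the concern is plausible: a multi-gigabyte hint backlog takes hours to drain at the default throttle, and raising hinted_handoff_throttle_in_kb trades faster replay for more load on the recovering node.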
1.2: Why can't I see what is in hints CF ?
Check this out:

[default@system] list hints limit 10;
Using default cell limit of 100
null
TimedOutException()
	at org.apache.cassandra.thrift.Cassandra$get_range_slices_result.read(Cassandra.java:12932)
	at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:78)
	at org.apache.cassandra.thrift.Cassandra$Client.recv_get_range_slices(Cassandra.java:734)
	at org.apache.cassandra.thrift.Cassandra$Client.get_range_slices(Cassandra.java:718)
	at org.apache.cassandra.cli.CliClient.executeList(CliClient.java:1495)
	at org.apache.cassandra.cli.CliClient.executeCLIStatement(CliClient.java:279)
	at org.apache.cassandra.cli.CliMain.processStatementInteractive(CliMain.java:213)
	at org.apache.cassandra.cli.CliMain.main(CliMain.java:339)

My nodes are accumulating hints and I am wondering what in the world is going on... -- Regards, Oleg Dulin http://www.olegdulin.com
Re: How to guarantee consistency between counter and materialized view?
Robert Coli rc...@eventbrite.com wrote:

On Tue, Mar 11, 2014 at 4:30 PM, ziju feng pkdog...@gmail.com wrote: Is there any way to guarantee a counter's value ...

no. =Rob

I wouldn't use Cassandra for counters... Use something like Redis if that is what you want.
Re: need help with Cassandra 1.2 Full GCing -- output of jmap histogram
Good news is that since I lowered the gc_grace period it collected over 100 gigs of tombstones and seems much happier now. Oleg

On 2014-03-10 13:33:43 +0000, Jonathan Lacefield said: Hello, You have several options:

1) Going forward, lower gc_grace_seconds (http://www.datastax.com/documentation/cassandra/1.2/cassandra/configuration/configStorage_r.html?pagename=docs&version=1.2&file=configuration/storage_configuration#gc-grace-seconds) - this is very use-case specific. The default is 10 days. Some users will put this at 0 for specific use cases.

2) You could also lower the tombstone compaction threshold and interval to get tombstone compaction to fire more often on your tables/CFs: https://datastax.jira.com/wiki/pages/viewpage.action?pageId=54493436

3) To clean out old tombstones you could always run a manual compaction, though these aren't typically recommended: http://www.datastax.com/documentation/cassandra/1.2/cassandra/tools/toolsNodetool_r.html

For 1 and 2, be sure your disks can keep up with compaction to ensure tombstone, or other, compaction fires regularly enough to clean out old tombstones. Also, you probably want to ensure you are using Leveled Compaction: http://www.datastax.com/dev/blog/when-to-use-leveled-compaction. Again, this assumes your disk system can handle the increased I/O from Leveled Compaction. Also, you may be running into this with the older version of Cassandra: https://issues.apache.org/jira/browse/CASSANDRA-6541

Hope this helps. Jonathan Lacefield, Solutions Architect, DataStax (404) 822 3487

On Mon, Mar 10, 2014 at 6:41 AM, Oleg Dulin oleg.du...@gmail.com wrote: I get that :) What I'd like to know is how to fix that :) ...

-- Regards, Oleg Dulin http://www.olegdulin.com
Re: need help with Cassandra 1.2 Full GCing -- output of jmap histogram
I get that :) What I'd like to know is how to fix that :)

On 2014-03-09 20:24:54 +0000, Takenori Sato said: You have millions of org.apache.cassandra.db.DeletedColumn instances on the snapshot. This means you have lots of column tombstones, which, I guess, are read into memory by slice queries.

On Sun, Mar 9, 2014 at 10:55 PM, Oleg Dulin oleg.du...@gmail.com wrote: I am trying to understand why one of my nodes keeps running full GCs. I have Xmx set to 8 gigs, and the memtable total size is 2 gigs. Consider the top entries from jmap -histo:live @ http://pastebin.com/UaatHfpJ

-- Regards, Oleg Dulin http://www.olegdulin.com
need help with Cassandra 1.2 Full GCing -- output of jmap histogram
I am trying to understand why one of my nodes keeps running full GCs. I have Xmx set to 8 gigs, and the memtable total size is 2 gigs. Consider the top entries from jmap -histo:live @ http://pastebin.com/UaatHfpJ -- Regards, Oleg Dulin http://www.olegdulin.com
How to rebalance a cluster?
I have the following situation:

10.194.2.5  RAC1  Up  Normal  378.6 GB   50.00%  0
10.194.2.4  RAC1  Up  Normal  427.5 GB   50.00%  127605887595351923798765477786913079295
10.194.2.7  RAC1  Up  Normal  350.63 GB  50.00%  85070591730234615865843651857942052864
10.194.2.6  RAC1  Up  Normal  314.42 GB  50.00%  42535295865117307932921825928971026432

As you can see, the 2.4 node has over 100 GB more data than 2.6. You can definitely see the imbalance. It also happens to be the heaviest-loaded node by CPU usage. What would be a clean way to rebalance? If I use a move operation followed by cleanup, would it require a repair afterwards? -- Regards, Oleg Dulin http://www.olegdulin.com
Re: Cass 1.2.11 : java.lang.AssertionError: originally calculated column size
Bumping this up -- anything ? anyone ? On 2014-02-13 16:01:50 +, Oleg Dulin said: I am getting these exceptions on one of the nodes, quite often, during compactions: java.lang.AssertionError: originally calculated column size of 84562492 but now it is 84562600 Usually this is on the same column family. I believe this is preventing compactions from completing, and subsequently causing other performance issues for me. Is there a way to fix that ? Would nodetool scrub take care of this ? -- Regards, Oleg Dulin http://www.olegdulin.com
Cass 1.2.11 : java.lang.AssertionError: originally calculated column size
I am getting these exceptions on one of the nodes, quite often, during compactions: java.lang.AssertionError: originally calculated column size of 84562492 but now it is 84562600 Usually this is on the same column family. I believe this is preventing compactions from completing, and subsequently causing other performance issues for me. Is there a way to fix that ? Would nodetool scrub take care of this ? -- Regards, Oleg Dulin http://www.olegdulin.com
Cass 1.2.11: Replacing a node procedure
Dear Distinguished Colleagues: I have a situation where in the production environment one of the machines is overheating and needs to be serviced. Now, the landscape looks like this: 4 machines in the primary DC, 4 machines in the DR DC. Replication factor is 2. I also have a QA environment with 4 machines in a single DC, RF=2 as well. We need to work with the manufacturer to figure out what is wrong with the machine. The proposed course of action is the following: 1) Take the faulty prod machine (let's call it X) out of production. 2) Take a healthy QA machine (let's call it Y) out of QA. 3) Plug the QA machine into the prod cluster and rebuild it. 4) Plug the prod machine into the QA cluster, leave it alone, and let the manufacturer service it to their liking until they say it is fixed, at which point we will just leave it in QA. So basically we are talking about replacing a dead node. I found this: http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/operations/ops_replace_node_t.html I am not using vnodes, just plain vanilla tokens and RandomPartitioner, so that procedure doesn't apply. I need some help putting together a step-by-step checklist of what I would need to do. -- Regards, Oleg Dulin http://www.olegdulin.com
Re: Cass 1.2.11: Replacing a node procedure
Here is what I am thinking. 1) Add the new node with (token - 1) of the old one and let it bootstrap. 2) Once it has bootstrapped, remove the old node from the ring. Now, it is #2 that I need clarification on. Do I use decommission or remove? How long should I expect those processes to run? Regards, Oleg On 2014-02-13 22:01:10 +, Oleg Dulin said: Dear Distinguished Colleagues: I have a situation where in the production environment one of the machines is overheating and needs to be serviced. Now, the landscape looks like this: 4 machines in the primary DC, 4 machines in the DR DC. Replication factor is 2. I also have a QA environment with 4 machines in a single DC, RF=2 as well. We need to work with the manufacturer to figure out what is wrong with the machine. The proposed course of action is the following: 1) Take the faulty prod machine (let's call it X) out of production. 2) Take a healthy QA machine (let's call it Y) out of QA. 3) Plug the QA machine into the prod cluster and rebuild it. 4) Plug the prod machine into the QA cluster, leave it alone, and let the manufacturer service it to their liking until they say it is fixed, at which point we will just leave it in QA. So basically we are talking about replacing a dead node. I found this: http://www.datastax.com/documentation/cassandra/1.2/webhelp/index.html#cassandra/operations/ops_replace_node_t.html I am not using vnodes, just plain vanilla tokens and RandomPartitioner, so that procedure doesn't apply. I need some help putting together a step-by-step checklist of what I would need to do. -- Regards, Oleg Dulin http://www.olegdulin.com
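A hedged sketch of that sequence (exact command names vary by release; on 1.2 the removal command is nodetool removenode with a host ID, on earlier releases nodetool removetoken with the token):

```shell
# 1) On replacement node Y, set initial_token in cassandra.yaml to
#    (X's token - 1), then start Y and let it bootstrap.

# 2) Once Y shows Up/Normal in `nodetool ring`, retire X.
#    If X can still run, prefer decommission, executed on X itself:
nodetool decommission
#    If X stays down, remove it from any live node instead:
nodetool removetoken <token-of-X>    # `removenode <host-id>` on 1.2+

# 3) On the remaining nodes whose ranges changed:
nodetool cleanup
```

Decommission streams X's own data to the new owners and can take hours for tens of gigabytes; removetoken instead re-replicates the ranges from surviving replicas.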
Re: Thrift CAS usage
On 2014-02-12 23:11:01 +, mahesh rajamani said: Hi, I am using the CAS feature through the Thrift cas api. I am able to set the expected column with some value and use cas through the Thrift api. But I am not sure what I should set the expected column list to in order to achieve an IF NOT EXISTS condition for a column. Can someone help me on this? -- Regards, Mahesh Rajamani Read the column first... -- Regards, Oleg Dulin http://www.olegdulin.com
Re: Cassandra 1.2 : OutOfMemoryError: unable to create new native thread
I figured it out. Another process on that machine was leaking threads. All is well! Thanks guys! Oleg On 2013-12-16 13:48:39 +, Maciej Miklas said: cassandra-env.sh has the option JVM_OPTS="$JVM_OPTS -Xss180k". It will give this error if you start Cassandra with Java 7, so increase the value or remove the option. Regards, Maciej On Mon, Dec 16, 2013 at 2:37 PM, srmore comom...@gmail.com wrote: What is your thread stack size (xss)? Try increasing that; that could help. Sometimes the limitation is imposed by the host provider (e.g. Amazon EC2 etc.). Thanks, Sandeep On Mon, Dec 16, 2013 at 6:53 AM, Oleg Dulin oleg.du...@gmail.com wrote: Hi guys! I believe my limits settings are correct. Here is the output of ulimit -a:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1547135
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 10
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 32768
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

However, I just had a couple of Cassandra nodes go down over the weekend for no apparent reason with the following error:

java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:691)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1017)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1163)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

Any input is greatly appreciated.
-- Regards, Oleg Dulin http://www.olegdulin.com
Cassandra 1.2 : OutOfMemoryError: unable to create new native thread
Hi guys! I believe my limits settings are correct. Here is the output of ulimit -a:

core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 1547135
max locked memory (kbytes, -l) unlimited
max memory size (kbytes, -m) unlimited
open files (-n) 10
pipe size (512 bytes, -p) 8
POSIX message queues (bytes, -q) 819200
real-time priority (-r) 0
stack size (kbytes, -s) 8192
cpu time (seconds, -t) unlimited
max user processes (-u) 32768
virtual memory (kbytes, -v) unlimited
file locks (-x) unlimited

However, I just had a couple of Cassandra nodes go down over the weekend for no apparent reason with the following error:

java.lang.OutOfMemoryError: unable to create new native thread
at java.lang.Thread.start0(Native Method)
at java.lang.Thread.start(Thread.java:691)
at java.util.concurrent.ThreadPoolExecutor.addWorker(ThreadPoolExecutor.java:949)
at java.util.concurrent.ThreadPoolExecutor.processWorkerExit(ThreadPoolExecutor.java:1017)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1163)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

Any input is greatly appreciated. -- Regards, Oleg Dulin http://www.olegdulin.com
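The resolution in this thread (a leaking neighbour process) fits the arithmetic: the JVM usually fails to spawn a thread because the per-user process limit or address space runs out, not the heap. A rough sketch of the numbers involved, assuming the -Xss180k mentioned above:

```python
# Back-of-the-envelope numbers for "unable to create new native thread".
# Values come from the ulimit output and cassandra-env.sh in this thread.
xss_kb = 180            # JVM per-thread stack size (-Xss180k)
max_user_procs = 32768  # ulimit -u (threads count against this on Linux)

stack_kb_at_limit = xss_kb * max_user_procs
print("stack space if every allowed thread existed:",
      stack_kb_at_limit // 1024, "MB")
```

At 180 KB per stack, even the full 32768-thread budget only needs about 5.7 GB of address space, so on a 64-bit box it is typically the process-count limit, shared with every other process owned by the same user, that is hit first.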
Re: Cassandra and bug track type number sequencing.
If you want sequential numbers, you can't trust distributed counters from Cassandra. However, you could use Redis for this. Additionally, you could use a random UUID and only show the customer the first 6 characters -- it is unique enough... Oleg On 2013-12-16 09:33:39 +, Jacob Rhoden said: Hi Guys, As per the subject, is there any way at all to easily associate small numbers in systems where users traditionally associate “bug/request” tickets with short numbers? In this use case I imagine the requirements would be as follows: • The numbers don't necessarily need to be sequential, they just need to be short enough for a user to read out loud. • The numbers must be unique. • It doesn't need to scale, i.e. a typical “request” system is not getting hundreds of requests per second. In an ideal world, we could do away with associating “requests” with numbers, but it's so ubiquitous I'm not sure you can sell doing away with short number codes. I am toying with the idea of a Cassandra table that makes available short “blocks” of numbers that an app server can hold “reservations” on, i.e.:

create table request_id_block(
    start int,
    end int,
    uuid uuid,
    reserved_by int,
    reserved_until bigint,
    primary key(start, end));

Will having an app server mark a block as reserved (QUORUM) and then reading it back (QUORUM) be enough for an app server to know it owns that block of numbers? Best regards, Jacob -- Regards, Oleg Dulin http://www.olegdulin.com
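On Jacob's QUORUM question: a plain write-then-read at QUORUM is not a lock. Two app servers can both write reserved_by, both read back a quorum that includes their own write, and both conclude they own the block; you need compare-and-set (Cassandra 2.0's CAS) or an external allocator for the reservation step itself. An in-memory sketch of the block-handout idea (class and method names are hypothetical, not an API):

```python
# In-memory stand-in for the proposed request_id_block table. The
# reserve() step is where a real system needs compare-and-set, not a
# plain QUORUM write, to be race-free across app servers.
class BlockAllocator:
    def __init__(self, block_size=100):
        self.block_size = block_size
        self.next_start = 1
        self.owners = {}  # block start -> reserving app server id

    def reserve(self, server_id):
        start = self.next_start
        self.next_start += self.block_size
        self.owners[start] = server_id          # "write reserved_by"
        assert self.owners[start] == server_id  # "read it back"
        return range(start, start + self.block_size)

alloc = BlockAllocator()
print(list(alloc.reserve(server_id=1))[:3])  # short, human-readable ids
```

Each app server then hands out numbers from its reserved block locally, which keeps the short-number requirement without a round trip per ticket.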
Re: 1.1.11: system keyspace is filling up
What happens if they are not being successfully delivered ? Will they eventually TTL-out ? Also, do I need to truncate hints on every node or is it replicated ? Oleg On 2013-11-04 21:34:55 +, Robert Coli said: On Mon, Nov 4, 2013 at 11:34 AM, Oleg Dulin oleg.du...@gmail.com wrote: I have a dual DC setup, 4 nodes, RF=4 in each. The one that is used as primary has its system keyspace fill up with 200 gigs of data, majority of which is hints. Why does this happen ? How can I clean it up ? If you have this many hints, you probably have flapping / frequent network partition, or very overloaded nodes. If you compare the number of hints to the number of dropped messages, that would be informative. If you're hinting because you're dropping, increase capacity. If you're hinting because of partition, figure out why there's so much partition. WRT cleaning up hints, they will automatically be cleaned up eventually, as long as they are successfully being delivered. If you need to manually clean them up you can truncate system.hints keyspace. =Rob -- Regards, Oleg Dulin http://www.olegdulin.com
Re: Cass 1.1.11 out of memory during compaction ?
If I do that, wouldn't I need to scrub my sstables? Takenori Sato ts...@cloudian.com wrote: Try increasing column_index_size_in_kb. A slice query to get some ranges (SliceFromReadCommand) requires reading all the column indexes for the row, and thus could hit OOM if you have a very wide row. On Sun, Nov 3, 2013 at 11:54 PM, Oleg Dulin oleg.du...@gmail.com wrote: Cass 1.1.11 ran out of memory on me with this exception (see below). My parameters are 8 gig heap, new gen is 1200M.

ERROR [ReadStage:55887] 2013-11-02 23:35:18,419 AbstractCassandraDaemon.java (line 132) Exception in thread Thread[ReadStage:55887,5,main]
java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:323)
at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:398)
at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:380)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:88)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:83)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:73)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:37)
at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:179)
at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:121)
at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:48)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:116)
at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:147)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:126)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:100)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:117)
at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:140)
at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:292)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:64)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1362)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1224)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1159)
at org.apache.cassandra.db.Table.getRow(Table.java:378)
at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:51)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

Any thoughts? This is a dual data center setup, with 4 nodes in each DC and RF=2 in each. -- Regards, Oleg Dulin http://www.olegdulin.com
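For reference, the setting suggested above lives in cassandra.yaml; a minimal fragment (64 is the default; the right value depends on your row widths):

```yaml
# cassandra.yaml
# Granularity of the per-row column index. Larger values mean fewer
# index entries to load when slicing a very wide row, at the cost of
# coarser seeks within the row.
column_index_size_in_kb: 256
```

On the scrub question: as far as I know the column index is written per sstable, so no scrub is required for correctness; existing sstables simply keep the old granularity until they are rewritten by compaction (or by scrub/upgradesstables if you want it sooner).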
1.1.11: system keyspace is filling up
I have a dual DC setup, 4 nodes, RF=4 in each. The one that is used as primary has its system keyspace fill up with 200 gigs of data, majority of which is hints. Why does this happen ? How can I clean it up ? -- Regards, Oleg Dulin http://www.olegdulin.com
Cass 1.1.11 out of memory during compaction ?
Cass 1.1.11 ran out of memory on me with this exception (see below). My parameters are 8 gig heap, new gen is 1200M.

ERROR [ReadStage:55887] 2013-11-02 23:35:18,419 AbstractCassandraDaemon.java (line 132) Exception in thread Thread[ReadStage:55887,5,main]
java.lang.OutOfMemoryError: Java heap space
at org.apache.cassandra.io.util.RandomAccessReader.readBytes(RandomAccessReader.java:323)
at org.apache.cassandra.utils.ByteBufferUtil.read(ByteBufferUtil.java:398)
at org.apache.cassandra.utils.ByteBufferUtil.readWithShortLength(ByteBufferUtil.java:380)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:88)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:83)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:73)
at org.apache.cassandra.db.ColumnSerializer.deserialize(ColumnSerializer.java:37)
at org.apache.cassandra.db.columniterator.IndexedSliceReader$IndexedBlockFetcher.getNextBlock(IndexedSliceReader.java:179)
at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:121)
at org.apache.cassandra.db.columniterator.IndexedSliceReader.computeNext(IndexedSliceReader.java:48)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at org.apache.cassandra.db.columniterator.SSTableSliceIterator.hasNext(SSTableSliceIterator.java:116)
at org.apache.cassandra.utils.MergeIterator$Candidate.advance(MergeIterator.java:147)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.advance(MergeIterator.java:126)
at org.apache.cassandra.utils.MergeIterator$ManyToOne.computeNext(MergeIterator.java:100)
at com.google.common.collect.AbstractIterator.tryToComputeNext(AbstractIterator.java:140)
at com.google.common.collect.AbstractIterator.hasNext(AbstractIterator.java:135)
at org.apache.cassandra.db.filter.SliceQueryFilter.collectReducedColumns(SliceQueryFilter.java:117)
at org.apache.cassandra.db.filter.QueryFilter.collateColumns(QueryFilter.java:140)
at org.apache.cassandra.db.CollationController.collectAllData(CollationController.java:292)
at org.apache.cassandra.db.CollationController.getTopLevelColumns(CollationController.java:64)
at org.apache.cassandra.db.ColumnFamilyStore.getTopLevelColumns(ColumnFamilyStore.java:1362)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1224)
at org.apache.cassandra.db.ColumnFamilyStore.getColumnFamily(ColumnFamilyStore.java:1159)
at org.apache.cassandra.db.Table.getRow(Table.java:378)
at org.apache.cassandra.db.SliceFromReadCommand.getRow(SliceFromReadCommand.java:69)
at org.apache.cassandra.db.ReadVerbHandler.doVerb(ReadVerbHandler.java:51)
at org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:59)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)

Any thoughts? This is a dual data center setup, with 4 nodes in each DC and RF=2 in each. -- Regards, Oleg Dulin http://www.olegdulin.com
Frustration with repair process in 1.1.11
First I need to vent. <rant> One of my Cassandra clusters is a dual data center setup, with DC1 acting as primary and DC2 acting as a hot backup. Well, guess what? I am pretty sure that it falls behind on replication. So I am told I need to run repair. I run repair (with -pr) on DC2. The first time I run it it gets *stuck* (i.e. frozen) within the first 30 seconds, with no error or any sort of message. I then run it again -- and it completes in seconds on each node, with about 50 gigs of data on each. That seems suspicious, so I do some research. I am told on IRC that running repair -pr will only do the repair on 100 tokens (the offset from DC1 to DC2)… Seriously??? The repair process is, indeed, a joke: https://issues.apache.org/jira/browse/CASSANDRA-5396 . Repair is the worst thing you can do to your cluster; it consumes enormous resources and can leave your cluster in an inconsistent state. Oh, and by the way, you must run it every week…. Whoever invented that process must not live in the real world, with real applications. </rant> No… let's have a constructive conversation. How do I know, with certainty, that my DC2 cluster is up to date on replication? I have a few options: 1) I set read repair chance to 100% on critical column families and I write a tool to scan every CF, every column of every row. This strikes me as very silly. Q1: Do I need to scan every column, or is looking at one column enough to trigger a read repair? 2) Can someone explain to me how repair works, such that I don't totally trash my cluster or spill into the work week? Is there any improvement and clarity in 1.2? How about 2.0? -- Regards, Oleg Dulin http://www.olegdulin.com
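The "100 tokens" remark from IRC checks out arithmetically. With the interleaved token layout described elsewhere in this thread (each DC2 token = the corresponding DC1 token + 100), a DC2 node's *primary* range, which is all that repair -pr touches, is just the 100 tokens between its DC1 neighbour and itself, which is why the runs finish in seconds. A rough check of that claim:

```python
# Primary-range widths for DC2 nodes under the interleaved token layout
# (DC2 token = DC1 token + 100). Token values follow that layout.
dc1 = [0,
       42535295865117307932921825928971026432,
       85070591730234615865843651857942052864,
       127605887595351923798765477786913079296]
dc2 = [t + 100 for t in dc1]

ring = sorted(dc1 + dc2)
ring_size = 2 ** 127  # RandomPartitioner token space

for tok in dc2:
    prev = ring[ring.index(tok) - 1]   # predecessor on the ring
    width = (tok - prev) % ring_size   # primary range = (prev, tok]
    print("DC2 token", tok, "primary range width:", width)
```

So repairing DC2 with -pr verifies almost none of the data; running repair without -pr (or running -pr across *all* nodes in both DCs) is what covers the whole ring.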
Too many open files with Cassandra 1.2.11
Got this error:

WARN [Thread-8] 2013-10-29 02:58:24,565 CustomTThreadPoolServer.java (line 122) Transport error occurred during acceptance of message.
org.apache.thrift.transport.TTransportException: java.net.SocketException: Too many open files
at org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:109)
at org.apache.cassandra.thrift.TCustomServerSocket.acceptImpl(TCustomServerSocket.java:36)
at org.apache.thrift.transport.TServerTransport.accept(TServerTransport.java:31)
at org.apache.cassandra.thrift.CustomTThreadPoolServer.serve(CustomTThreadPoolServer.java:110)
at org.apache.cassandra.thrift.ThriftServer$ThriftServerThread.run(ThriftServer.java:111)

I haven't seen this since the 1.0 days; I thought 1.1.11 had it all fixed. ulimit outputs unlimited. What could cause this? Any help is greatly appreciated. -- Regards, Oleg Dulin http://www.olegdulin.com
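A couple of hedged checks that usually narrow this down (a user-level ulimit of unlimited does not guarantee the running JVM inherited that limit):

```shell
# What limit did the Cassandra process actually inherit?
cat /proc/$(pgrep -f CassandraDaemon)/limits | grep 'open files'

# How many descriptors is it really holding (sockets + sstables)?
lsof -n -p $(pgrep -f CassandraDaemon) | wc -l

# Raise the limit persistently for the cassandra user:
# /etc/security/limits.conf
#   cassandra  -  nofile  100000
```

A very large sstable count (e.g. compaction falling behind) can legitimately exhaust even a generous nofile limit, so checking sstable counts in nodetool cfstats is worthwhile too.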
Adding a data center with data already in place
I am using Cassandra 1.1.11 and plan on upgrading soon, but in the meantime here is what happened. I couldn't run repairs because of a slow WAN pipe, so I removed the second data center from the cluster. Today I need to bring that data center back in. It is now 2-3 days out of date. I have two options: 1) Treat this as a new data center and let the nodes sync from scratch, or 2) Bring the nodes back up with all the data in place and do a repair. We are talking about 30-40 GB per node. There are 4 nodes in both data centers, with RF=2. -- Regards, Oleg Dulin http://www.olegdulin.com
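A hedged sketch of how the two options map onto nodetool (DC names are from this setup; verify the commands exist in your exact 1.1.11 build before relying on them):

```shell
# Option 1: nodes rejoin the re-added DC empty (auto_bootstrap: false,
# correct tokens), then each one pulls a full copy from the healthy DC:
nodetool rebuild DC1

# Option 2: bring nodes back with their 2-3 day old data and reconcile:
nodetool repair -pr    # run per node, per keyspace as needed
```

For 30-40 GB per node over a slow WAN, option 2 only has to ship the few days of delta, which is usually far less traffic than a full rebuild.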
Unbalanced ring mystery multi-DC issue with 1.1.11
Consider this output from nodetool ring:

Address  DC   Rack  Status  State   Load      Effective-Ownership  Token
                                                                   127605887595351923798765477786913079396
dc1.5    DC1  RAC1  Up      Normal  32.07 GB  50.00%               0
dc2.100  DC2  RAC1  Up      Normal  8.21 GB   50.00%               100
dc1.6    DC1  RAC1  Up      Normal  32.82 GB  50.00%               42535295865117307932921825928971026432
dc2.101  DC2  RAC1  Up      Normal  12.41 GB  50.00%               42535295865117307932921825928971026532
dc1.7    DC1  RAC1  Up      Normal  28.37 GB  50.00%               85070591730234615865843651857942052864
dc2.102  DC2  RAC1  Up      Normal  12.27 GB  50.00%               85070591730234615865843651857942052964
dc1.8    DC1  RAC1  Up      Normal  27.34 GB  50.00%               127605887595351923798765477786913079296
dc2.103  DC2  RAC1  Up      Normal  13.46 GB  50.00%               127605887595351923798765477786913079396

I concealed IPs and DC names for confidentiality. All of the data loading was happening against DC1 at a pretty brisk rate of, say, 200K writes per minute. Note how my tokens are offset by 100. Shouldn't that mean that the load on each node should be roughly identical? In DC1 it is roughly around 30 GB on each node. In DC2 it is almost 1/3rd of the nearest DC1 node by token range. To verify that the nodes are in sync, I ran nodetool -h localhost repair MyKeySpace --partitioner-range on each node in DC2. Watching the logs, I see that the repair went really quickly and all column families are in sync! I need help making sense of this. Is this because DC1 is not fully compacted? Is it because DC2 is not fully synced and I am not checking correctly? How can I tell whether replication is still in progress (note, I started my load yesterday at 9:50am)? -- Regards, Oleg Dulin http://www.olegdulin.com
Re: Unbalanced ring mystery multi-DC issue with 1.1.11
Wanted to add one more thing: I can also tell that the numbers are not consistent across DCs this way -- I have a column family with really wide rows (a couple million columns). DC1 reports higher column counts than DC2. DC2 only becomes consistent after I run the command a couple of times and trigger a read repair. But why would the nodetool repair logs show that everything is in sync? Regards, Oleg On 2013-09-27 10:23:45 +, Oleg Dulin said: Consider this output from nodetool ring:

Address  DC   Rack  Status  State   Load      Effective-Ownership  Token
                                                                   127605887595351923798765477786913079396
dc1.5    DC1  RAC1  Up      Normal  32.07 GB  50.00%               0
dc2.100  DC2  RAC1  Up      Normal  8.21 GB   50.00%               100
dc1.6    DC1  RAC1  Up      Normal  32.82 GB  50.00%               42535295865117307932921825928971026432
dc2.101  DC2  RAC1  Up      Normal  12.41 GB  50.00%               42535295865117307932921825928971026532
dc1.7    DC1  RAC1  Up      Normal  28.37 GB  50.00%               85070591730234615865843651857942052864
dc2.102  DC2  RAC1  Up      Normal  12.27 GB  50.00%               85070591730234615865843651857942052964
dc1.8    DC1  RAC1  Up      Normal  27.34 GB  50.00%               127605887595351923798765477786913079296
dc2.103  DC2  RAC1  Up      Normal  13.46 GB  50.00%               127605887595351923798765477786913079396

I concealed IPs and DC names for confidentiality. All of the data loading was happening against DC1 at a pretty brisk rate of, say, 200K writes per minute. Note how my tokens are offset by 100. Shouldn't that mean that the load on each node should be roughly identical? In DC1 it is roughly around 30 GB on each node. In DC2 it is almost 1/3rd of the nearest DC1 node by token range. To verify that the nodes are in sync, I ran nodetool -h localhost repair MyKeySpace --partitioner-range on each node in DC2. Watching the logs, I see that the repair went really quickly and all column families are in sync! I need help making sense of this. Is this because DC1 is not fully compacted? Is it because DC2 is not fully synced and I am not checking correctly? How can I tell whether replication is still in progress (note, I started my load yesterday at 9:50am)? -- Regards, Oleg Dulin http://www.olegdulin.com
Re: Unbalanced ring mystery multi-DC issue with 1.1.11
Here is some more information. I am running a full repair on one of the nodes and I am observing strange behavior. Both DCs were up during the data load, but repair is reporting a lot of out-of-sync data. Why would that be? Is there a way for me to tell whether the WAN may be dropping hinted handoff traffic? Regards, Oleg On 2013-09-27 10:35:34 +, Oleg Dulin said: Wanted to add one more thing: I can also tell that the numbers are not consistent across DCs this way -- I have a column family with really wide rows (a couple million columns). DC1 reports higher column counts than DC2. DC2 only becomes consistent after I run the command a couple of times and trigger a read repair. But why would the nodetool repair logs show that everything is in sync? Regards, Oleg On 2013-09-27 10:23:45 +, Oleg Dulin said: Consider this output from nodetool ring:

Address  DC   Rack  Status  State   Load      Effective-Ownership  Token
                                                                   127605887595351923798765477786913079396
dc1.5    DC1  RAC1  Up      Normal  32.07 GB  50.00%               0
dc2.100  DC2  RAC1  Up      Normal  8.21 GB   50.00%               100
dc1.6    DC1  RAC1  Up      Normal  32.82 GB  50.00%               42535295865117307932921825928971026432
dc2.101  DC2  RAC1  Up      Normal  12.41 GB  50.00%               42535295865117307932921825928971026532
dc1.7    DC1  RAC1  Up      Normal  28.37 GB  50.00%               85070591730234615865843651857942052864
dc2.102  DC2  RAC1  Up      Normal  12.27 GB  50.00%               85070591730234615865843651857942052964
dc1.8    DC1  RAC1  Up      Normal  27.34 GB  50.00%               127605887595351923798765477786913079296
dc2.103  DC2  RAC1  Up      Normal  13.46 GB  50.00%               127605887595351923798765477786913079396

I concealed IPs and DC names for confidentiality. All of the data loading was happening against DC1 at a pretty brisk rate of, say, 200K writes per minute. Note how my tokens are offset by 100. Shouldn't that mean that the load on each node should be roughly identical? In DC1 it is roughly around 30 GB on each node. In DC2 it is almost 1/3rd of the nearest DC1 node by token range. To verify that the nodes are in sync, I ran nodetool -h localhost repair MyKeySpace --partitioner-range on each node in DC2. Watching the logs, I see that the repair went really quickly and all column families are in sync! I need help making sense of this. Is this because DC1 is not fully compacted? Is it because DC2 is not fully synced and I am not checking correctly? How can I tell whether replication is still in progress (note, I started my load yesterday at 9:50am)? -- Regards, Oleg Dulin http://www.olegdulin.com
Need help configuring WAN replication over slow WAN
Here is the problem: my customer has a 45 Megabit connection to their off-site DR data center. They have about 500 GB worth of data, and that connection is shared. Needless to say, this is not an optimal configuration. To replicate all of that in real time would take a week. My primary cluster is 4 nodes, RF=2. The DR cluster is also 4 nodes, RF=2. I need a way to set up the primary cluster, populate all the data, and then transfer it to the DR cluster. One suggestion is: 1) Set up the primary cluster, plus configure a Mac Mini as a backup data center but on the same network. 2) Populate the data. 3) Physically take the Mac Mini to the DR data center, transfer its data to one of the nodes, and then run nodetool cleanup to move the data around among the nodes. Now… this doesn't strike me as optimal. I feel like I'll need to run repair on the new cluster, which defeats the purpose -- it'll just hog the 45 Megabit pipe… Somehow I need a way to load all the data into the primary cluster, then ship it over to the backup in a more timely fashion… Any suggestions are greatly appreciated. Also, I need a way to know whether replication is up to date. -- Regards, Oleg Dulin http://www.olegdulin.com
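One way to avoid streaming 500 GB over the shared 45 Mbit link is to move the initial copy on disk. A hedged sketch (host names, keyspace, and column family are illustrative; verify refresh exists in your release before relying on it):

```shell
# On each primary node: snapshot the keyspace
nodetool snapshot MyKeySpace

# Copy the snapshot sstables to portable storage, ship it to the DR
# site, and place the files under the matching data directories on the
# DR nodes (data/<keyspace>/<columnfamily>/).

# On each DR node: pick up the new sstables
nodetool refresh MyKeySpace MyColumnFamily   # or restart the node

# Finally, reconcile whatever was written after the snapshot:
nodetool repair -pr MyKeySpace
```

The closing repair then only has to move the post-snapshot delta over the WAN. For monitoring whether replication keeps up afterwards, watching hinted handoff counts and nodetool netstats on each node is about the best signal available on 1.1.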
Pycassa xget not parsing composite column name properly
I have a column family defined as:

create column family LSItemIdsByFieldValueIndex_Integer
  with column_type = 'Standard'
  and comparator = 'CompositeType(org.apache.cassandra.db.marshal.IntegerType,org.apache.cassandra.db.marshal.UTF8Type)'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'UTF8Type';

This snippet of code:

result = searchIndex.get_range(column_count=1)
for key, columns in result:
    print '\t', key
    indexData = searchIndex[indexCF].xget(key)
    for name, value in indexData:
        print name

does not correctly print the column name parsed into a tuple of two parts. Am I doing something wrong here? -- Regards, Oleg Dulin http://www.olegdulin.com
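If the installed pycassa version simply isn't decoding CompositeType names from xget, the raw name can be unpacked by hand. A minimal sketch, assuming the standard composite wire layout of a 2-byte big-endian length, the component bytes, and one end-of-component byte per component (this is a fallback illustration, not the pycassa API):

```python
import struct

def unpack_composite(raw):
    """Split a packed CompositeType column name into component byte strings."""
    parts, i = [], 0
    while i < len(raw):
        (length,) = struct.unpack('>H', raw[i:i + 2])
        parts.append(raw[i + 2:i + 2 + length])
        i += 2 + length + 1  # skip the end-of-component byte
    return parts

def pack_composite(parts):
    """Inverse of unpack_composite, for round-trip testing."""
    return b''.join(struct.pack('>H', len(p)) + p + b'\x00' for p in parts)

# IntegerType is a variable-length big-endian integer; UTF8Type is UTF-8.
raw = pack_composite([(42).to_bytes(1, 'big'), b'fieldvalue'])
num_bytes, text = unpack_composite(raw)
print(int.from_bytes(num_bytes, 'big'), text.decode('utf-8'))
```

If I remember right, later pycassa releases fixed xget to return composite names as tuples, so upgrading the client is probably the cleaner fix.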
Re: Running Cassandra with no open TCP ports
Mark: That raises a question -- why are you using Cassandra for this? There are simpler NoSQL stores than Cassandra that are better suited for embedding. Oleg On 2013-05-28 02:24:48 +, Mark Mccraw said: Hi All, I'm using Cassandra as an embedded datastore for a small service that doesn't need (or want) to act as a database service in any way. Moreover, we may want to start up multiple instances of the application, and right now whenever that happens, we get port conflicts on 7000 because Cassandra is listening for connections. I couldn't find an obvious way to disable listening on any port. Is there an easy way? Thanks! Mark -- Regards, Oleg Dulin http://www.olegdulin.com
Re: Iterating through large numbers of rows with JDBC
On 2013-05-11 14:42:32 +, Robert Wille said: I'm using the JDBC driver to access Cassandra. I'm wondering if its possible to iterate through a large number of records (e.g. to perform maintenance on a large column family). I tried calling Connection.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY), but it times out, so I'm guessing that cursors aren't supported. Is there another way to do this, or do I need to use a different API? Thanks in advance Robert If you feel that you need to iterate through a large number of rows then you are probably not using a correct data model. Can you describe your use case ? -- Regards, Oleg Dulin NYC Java Big Data Engineer http://www.olegdulin.com/
Cassandra 1.1.11 compression: how to tell if it works?
I have a column family with really wide rows set to use Snappy like this: compression_options = {'sstable_compression' : 'org.apache.cassandra.io.compress.SnappyCompressor'} My understanding is that if a file is compressed, I should not be able to use the strings command to view its contents. But it seems like I can view the contents like this: strings *-Data.db At what point does compression start? How can I confirm it is working? -- Regards, Oleg Dulin NYC Java Big Data Engineer http://www.olegdulin.com/
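One likely explanation, hedged: compression only applies to sstables written after it was enabled, and older -Data.db files stay plain until rewritten by compaction or nodetool scrub/upgradesstables; those older files would be what strings is finding. A compressed sstable also has a companion *-CompressionInfo.db file you can look for on disk. As a stand-in demonstration of chunked compression (using zlib, since Snappy isn't in the Python stdlib):

```python
import zlib

# Cassandra compresses sstable data in fixed-size chunks (64 KB by
# default); zlib stands in for Snappy to show the effect on one chunk.
chunk = (b'some repetitive wide-row column data ' * 2000)[:65536]
compressed = zlib.compress(chunk)
print(len(chunk), '->', len(compressed), 'bytes')
```

Comparing the on-disk -Data.db size before and after a forced rewrite of the column family is the most direct confirmation that compression has kicked in.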
How much heap does Cassandra 1.1.11 really need ?
Here is my question. A 16 GB heap can't possibly be a good setup, but it is the best I can do. The default never worked well for me, and 8 GB doesn't work well either: it can't keep up with flushing memtables. It is possible that someone at some point broke something in the config files. If I were to look for hints there, what should I look at ?

Look at my gc log from Cassandra. It starts off like this:

2013-04-29T08:53:44.548-0400: 5.386: [GC 1677824K->11345K(16567552K), 0.0509880 secs]
2013-04-29T08:53:47.701-0400: 8.539: [GC 1689169K->42027K(16567552K), 0.1269180 secs]
2013-04-29T08:54:05.361-0400: 26.199: [GC 1719851K->231763K(16567552K), 0.1436070 secs]
2013-04-29T08:55:44.797-0400: 125.635: [GC 1909587K->1480096K(16567552K), 1.2626270 secs]
2013-04-29T08:58:44.367-0400: 305.205: [GC 3157920K->2358588K(16567552K), 1.1198150 secs]
2013-04-29T09:01:12.167-0400: 453.005: [GC 4036412K->3634298K(16567552K), 1.0098650 secs]
2013-04-29T09:03:35.204-0400: 596.042: [GC 5312122K->4339703K(16567552K), 0.4597180 secs]
2013-04-29T09:04:51.562-0400: 672.400: [GC 6017527K->4956381K(16567552K), 0.5361800 secs]
2013-04-29T09:04:59.205-0400: 680.043: [GC 6634205K->5131825K(16567552K), 0.1741690 secs]
2013-04-29T09:05:06.638-0400: 687.476: [GC 6809649K->5027933K(16567552K), 0.0607470 secs]
2013-04-29T09:05:13.908-0400: 694.747: [GC 6705757K->5012439K(16567552K), 0.0624410 secs]
2013-04-29T09:05:20.909-0400: 701.747: [GC 6690263K->5039538K(16567552K), 0.0618750 secs]
2013-04-29T09:06:35.914-0400: 776.752: [GC 6717362K->5819204K(16567552K), 0.5738550 secs]
2013-04-29T09:08:05.589-0400: 866.428: [GC 7497028K->6678597K(16567552K), 0.6781900 secs]
2013-04-29T09:08:12.458-0400: 873.296: [GC 8356421K->6865736K(16567552K), 0.1423040 secs]
2013-04-29T09:08:18.690-0400: 879.529: [GC 8543560K->6742902K(16567552K), 0.0516470 secs]
2013-04-29T09:08:24.914-0400: 885.752: [GC 8420726K->6725877K(16567552K), 0.0517290 secs]
2013-04-29T09:08:31.008-0400: 891.846: [GC 8403701K->6741781K(16567552K), 0.0532540 secs]
2013-04-29T09:08:37.201-0400: 898.039: [GC 8419605K->6759614K(16567552K), 0.0563290 secs]
2013-04-29T09:08:43.493-0400: 904.331: [GC 8437438K->6772147K(16567552K), 0.0569580 secs]
2013-04-29T09:08:49.757-0400: 910.595: [GC 8449971K->6776883K(16567552K), 0.0558070 secs]
2013-04-29T09:08:55.973-0400: 916.812: [GC 8454707K->6789404K(16567552K), 0.0577230 secs]

...and look what it is today:

2013-05-03T07:17:13.519-0400: 339814.357: [GC 9178946K->9176740K(16567552K), 0.0265830 secs]
2013-05-03T07:17:19.556-0400: 339820.394: [GC 10854564K->9178449K(16567552K), 0.0253180 secs]
2013-05-03T07:17:24.390-0400: 339825.228: [GC 10856273K->9179073K(16567552K), 0.0266450 secs]
2013-05-03T07:17:30.729-0400: 339831.567: [GC 10856897K->9178629K(16567552K), 0.0261150 secs]
2013-05-03T07:17:35.584-0400: 339836.422: [GC 10856453K->9178586K(16567552K), 0.0250870 secs]
2013-05-03T07:17:38.514-0400: 339839.352: [GC 10856410K->9179314K(16567552K), 0.0258120 secs]
2013-05-03T07:17:43.200-0400: 339844.038: [GC 10857138K->9180160K(16567552K), 0.0250150 secs]
2013-05-03T07:17:46.566-0400: 339847.404: [GC 10857984K->9179071K(16567552K), 0.0264420 secs]
2013-05-03T07:17:52.913-0400: 339853.751: [GC 10856895K->9179870K(16567552K), 0.0262430 secs]
2013-05-03T07:17:58.303-0400: 339859.141: [GC 10857694K->9179209K(16567552K), 0.0255130 secs]
2013-05-03T07:18:03.427-0400: 339864.265: [GC 10857033K->9178316K(16567552K), 0.0263140 secs]
2013-05-03T07:18:11.657-0400: 339872.495: [GC 10856140K->9178351K(16567552K), 0.0265340 secs]
2013-05-03T07:18:17.429-0400: 339878.267: [GC 10856175K->9179067K(16567552K), 0.0254820 secs]
2013-05-03T07:18:21.251-0400: 339882.089: [GC 10856891K->9179680K(16567552K), 0.0264210 secs]
2013-05-03T07:18:25.062-0400: 339885.900: [GC 10857504K->9178985K(16567552K), 0.0267200 secs]

--
Regards,
Oleg Dulin
NYC Java Big Data Engineer
http://www.olegdulin.com/
Re: How much heap does Cassandra 1.1.11 really need ?
What constitutes an extreme write ?

On 2013-05-03 15:45:33, Edward Capriolo said:

If your writes are so extreme that memtables are flushing all the time, the best you can do is turn off all caches, move bloom filters off heap, and then instruct Cassandra to use large portions of the heap as memtables.

On Fri, May 3, 2013 at 11:40 AM, Bryan Talbot btal...@aeriagames.com wrote:

It's true that a 16GB heap is generally not a good idea; however, it's not clear from the data provided what problem you're trying to solve. What is it that you don't like about the default settings?

-Bryan

On Fri, May 3, 2013 at 4:27 AM, Oleg Dulin oleg.du...@gmail.com wrote:

[original question and GC log snipped -- quoted in full in the previous message]

--
Regards,
Oleg Dulin
NYC Java Big Data Engineer
http://www.olegdulin.com/
Cass 1.1.1 and 1.1.11 Exception during compactions
We saw this exception with 1.1.1 and also with 1.1.11 (we upgraded for unrelated reasons, to fix the FD leak during slice queries) -- name of the CF replaced with * for confidentiality:

ERROR [CompactionExecutor:36] 2013-04-29 07:50:49,060 AbstractCassandraDaemon.java (line 132) Exception in thread Thread[CompactionExecutor:36,1,main]
java.lang.RuntimeException: Last written key DecoratedKey(138024912283272996716128964353306009224, 61386330356130622d61362d376330612d666531662d373738616630636265396535) >= current key DecoratedKey(127065377405949402743383718901402082101, 64323962636163652d646561372d333039322d386166322d663064346132363963386131) writing into *-tmp-hf-7372-Data.db
    at org.apache.cassandra.io.sstable.SSTableWriter.beforeAppend(SSTableWriter.java:134)
    at org.apache.cassandra.io.sstable.SSTableWriter.append(SSTableWriter.java:153)
    at org.apache.cassandra.db.compaction.CompactionTask.execute(CompactionTask.java:160)
    at org.apache.cassandra.db.compaction.LeveledCompactionTask.execute(LeveledCompactionTask.java:50)
    at org.apache.cassandra.db.compaction.CompactionManager$2.runMayThrow(CompactionManager.java:164)
    at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:30)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
    at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:334)
    at java.util.concurrent.FutureTask.run(FutureTask.java:166)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:722)

Any thoughts ? Should I be concerned about data being lost ?

--
Regards,
Oleg Dulin
NYC Java Big Data Engineer
http://www.olegdulin.com/
Re: Replication factor and performance questions
Should all be under 400 Gigs on each. My question is -- is there additional overhead with replicas making requests to one another for keys they don't have ? How much of an overhead is that ?

On 2012-11-05 17:00:37, Michael Kjellman said:

Rule of thumb is to try to keep nodes under 400GB. Compactions/repairs/move operations etc. become a nightmare otherwise. How much data do you expect to have on each node? Also depends on caches, bloom filters, etc.

On 11/5/12 8:57 AM, Oleg Dulin oleg.du...@gmail.com wrote:

I have 4 nodes at my disposal. I can configure them like this:

1) RF=1, each node has 25% of the data. On random reads, how big is the performance penalty if a node needs to look for data on another replica ?

2) RF=2, each node has 50% of the data. Same question ?

--
Regards,
Oleg Dulin
NYC Java Big Data Engineer
http://www.olegdulin.com/

'Like' us on Facebook for exclusive content and other resources on all Barracuda Networks solutions. Visit http://barracudanetworks.com/facebook

--
Regards,
Oleg Dulin
NYC Java Big Data Engineer
http://www.olegdulin.com/
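Back-of-the-envelope arithmetic for the two layouts in the question (illustrative only; it assumes the random partitioner spreads keys evenly and that a client connects to a random node):

```python
def layout(nodes, rf, total_data_gb):
    # Fraction of the data each node stores, and the chance that the node
    # a client happens to connect to owns a replica of the requested key.
    # When it does not, the coordinator must forward the read to a replica,
    # which is the "additional overhead" asked about above.
    per_node_gb = total_data_gb * rf / nodes
    p_local_read = rf / nodes
    return per_node_gb, p_local_read

# Hypothetical 1 TB of raw data across the 4 nodes from the question:
print(layout(4, 1, 1000))  # RF=1: 250 GB per node, 25% of reads served locally
print(layout(4, 2, 1000))  # RF=2: 500 GB per node, 50% of reads served locally
```

So RF=2 halves the number of forwarded reads at the cost of doubling the data per node; the per-request penalty for a forwarded read is one extra network hop inside the cluster.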
Re: Text searches and free form queries
It works pretty fast.

Cool. Just keep an eye out for how big the Lucene token row gets. Cheers

Indeed, it may get out of hand, but for now we are OK -- for the foreseeable future, I would say. Should it get larger, I can split it up into rows -- i.e. all tokens that start with a, all tokens that start with b, etc.
1.1.1 is repair still needed ?
My understanding is that the repair has to happen within the gc_grace period. But in 1.1.1 you can set gc_grace per CF. A couple of my CFs that are frequently updated have a gc_grace of 1 hour, but we do run a weekly repair. So the question is, is this still needed ? Do we even need to run nodetool repair ? If gc_grace is 10 days on all other CFs, are we saying that as long as we restart that node within the 10-day period we don't need to run nodetool repair ? The reason I bring this up is because the repair once in a while runs for more than a day on some of these nodes (500+ Gigs of data), and it is causing slowness with read requests.

--
Regards,
Oleg Dulin
NYC Java Big Data Engineer
http://www.olegdulin.com/
Re: Text searches and free form queries
So, what I ended up doing is this -- as I write my records into the main CF, I tokenize some fields that I want to search on using Lucene and write an index into a separate CF, such that my columns are a composite of: luceneToken:record key. I can then search my records by doing a slice for each Lucene token in the search query and then do an intersection of the sets. It works pretty fast.

Regards, Oleg

On 2012-09-05 01:28:44, aaron morton said:

AFAIK if you want to keep it inside Cassandra then DSE, roll your own from scratch, or start with https://github.com/tjake/Solandra . Outside of Cassandra I've heard of people using Elastic Search or Solr which I *think* is now faster at updating the index. Hope that helps.

- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 4/09/2012, at 3:00 AM, Andrey V. Panov panov.a...@gmail.com wrote:

Some one did search on Lucene, but for very fresh data they build search index in memory so data become available for search without delays.

On 3 September 2012 22:25, Oleg Dulin oleg.du...@gmail.com wrote: Dear Distinguished Colleagues:

--
Regards,
Oleg Dulin
NYC Java Big Data Engineer
http://www.olegdulin.com/
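For illustration, the slice-per-token / intersect scheme described above can be sketched with an in-memory stand-in for the index CF (the dict below plays the role of the token rows; the function names and the crude analyzer are illustrative, not the Pelops or Lucene API):

```python
import re

# Stand-in for the index CF: token -> set of record keys. In Cassandra this
# would be one row per token, with composite columns luceneToken:recordKey
# written alongside the record itself.
index = {}

def tokenize(text):
    # Crude analyzer standing in for Lucene's: lowercase word split.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def index_record(key, text):
    for token in tokenize(text):
        index.setdefault(token, set()).add(key)

def search(query):
    # One slice per query token, then intersect the resulting key sets,
    # mirroring the "slice for each token, intersect" approach in the post.
    key_sets = [index.get(token, set()) for token in tokenize(query)]
    if not key_sets:
        return set()
    return set.intersection(*key_sets)

index_record("inv-1", "Red widget for ACME")
index_record("inv-2", "Blue widget for ACME")
print(search("acme widget"))  # both records match
print(search("red widget"))   # only inv-1
```

The intersection step is the expensive part when a token row is hot; that is why the reply above warns about the token row growing large.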
Re: Cassandra 1.1.1 on Java 7
So, my experiment didn't quite work out. I was hoping to use the G1 collector to minimize pauses -- the pauses didn't really go away, but what's worse is I think the memtable memory calculations are driven by CMS, so my memtables would fill up and cause Cassandra to run out of heap :(

On 2012-09-09 19:04:41, Jeremy Hanna said:

Starting with 1.6.0_34, you'll need xss set to 180k. It's updated with the forthcoming 1.1.5 as well as the next minor rev of 1.0.x (1.0.12). https://issues.apache.org/jira/browse/CASSANDRA-4631 See also the comments on https://issues.apache.org/jira/browse/CASSANDRA-4602 for the reference to what required a higher stack.

On Sep 9, 2012, at 12:47 PM, Christopher Keller cnkel...@gmail.com wrote:

This is necessary under the later versions of 1.6v35 as well. Nodetool will show the cluster as being down even though individual nodes will be up.

--Chris

On Sep 9, 2012, at 7:13 AM, dong.yajun dongt...@gmail.com wrote:

After running for a while, you should set -Xss to more than 160k when you are using JDK 1.7.

On Sun, Sep 9, 2012 at 3:39 AM, Peter Schuller peter.schul...@infidyne.com wrote:

Has anyone tried running 1.1.1 on Java 7? Have been running JDK 1.7 on several clusters on 1.1 for a while now.

-- / Peter Schuller (@scode, http://worldmodscode.wordpress.com)

-- Ric Dong, Newegg Ecommerce, MIS department

-- The downside of being better than everyone else is that people tend to assume you're pretentious.

--
Regards,
Oleg Dulin
NYC Java Big Data Engineer
http://www.olegdulin.com/
Re: Long-life TTL and extending TTL
You should create an index where you store references to your records. You can use composite column names where column name = composite(timestamp, key). Then you would get a slice of all columns where the timestamp part of the composite is <= TTL in the past, iterate through them, and delete the items.

Regards, Oleg

On 2012-09-10 09:47:31, Robin Verlangen said:

Hi there, I'm working on a project that might want to set TTL to roughly 7 years. However it might occur that the TTL should be reduced or extended. Is there any way of updating the TTL without being in need of rewriting the data back again? This would cause way too much overhead for this. If not, is running a Map/Reduce task on the whole data set the best option, or should I think of a different approach for this challenge? My last question is regarding a long-term TTL: does this have any negative impact on the cluster? Maybe during compaction, repair, reading/writing?

Best regards,
Robin Verlangen
Software engineer
W http://www.robinverlangen.nl
E ro...@us2.nl

Disclaimer: The information contained in this message and attachments is intended solely for the attention and use of the named addressee and may be confidential. If you are not the intended recipient, you are reminded that the information remains the property of the sender. You must not use, disclose, distribute, copy, print or rely on this e-mail. If you have received this message in error, please contact the sender immediately and irrevocably delete this message and any copies.
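A minimal sketch of the expiry-index idea above, with a sorted list standing in for the timestamp-ordered composite-column row (all names are illustrative; in Cassandra the slice and the deletes would go through your client library):

```python
import bisect

# Stand-in for the index row: (timestamp, record_key) pairs kept sorted,
# playing the role of composite columns ordered by the timestamp component.
expiry_index = []

def record_written(key, ts):
    bisect.insort(expiry_index, (ts, key))

def purge_older_than(cutoff_ts):
    # Equivalent to slicing the row from its start up to composite(cutoff_ts, *),
    # then deleting each hit. chr(0x10FFFF) is a max-valued key sentinel so the
    # slice is inclusive of every key at exactly cutoff_ts.
    i = bisect.bisect_right(expiry_index, (cutoff_ts, chr(0x10FFFF)))
    expired = [key for _, key in expiry_index[:i]]
    del expiry_index[:i]
    return expired  # caller deletes these records and their index columns

record_written("a", 100)
record_written("b", 200)
record_written("c", 300)
print(purge_older_than(200))  # ['a', 'b']
```

Extending or reducing the "TTL" then costs one index-column delete and one insert per record, instead of rewriting the record itself.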
Re: High commit size
It is memory-mapped I/O. I wouldn't worry about it. BTW, Windows might not be the best choice to run Cassandra on. My experience running Cassandra on Windows has not been a positive one. We no longer support Windows as our production platform.

Regards, Oleg

On 2012-09-10 09:00:02, Rene Kochen said:

Hi all, On my test cluster I have three Windows Server 2008 R2 machines running Cassandra 1.0.11. If I use memory-mapped IO (the default), then the nodes freeze after a while. Paging is disabled. The private bytes are OK (8GB); that is the amount I use in the -Xms and -Xmx arguments. The virtual size is big, as expected because of the memory-mapped IO. However, the working set size (size in RAM) is 24 GB (my total RAM usage). If I look with Process Explorer at the physical memory section, I see a very high value in the WS Sharable section. Anyone has a clue what is going on here? Many thanks!

Rene

--
Regards,
Oleg Dulin
NYC Java Big Data Engineer
http://www.olegdulin.com/
JVM 7, Cass 1.1.1 and G1 garbage collector
I am currently profiling a Cassandra 1.1.1 set up using G1 and JVM 7. It is my feeble attempt to reduce Full GC pauses. Has anyone had any experience with this ? Anyone tried it ? -- Regards, Oleg Dulin NYC Java Big Data Engineer http://www.olegdulin.com/
Cassandra 1.1.1 on Java 7
Has anyone tried running 1.1.1 on Java 7? I know Datastax does not recommend it for DSE, is there a reason why ? Regards, Oleg
Text searches and free form queries
Dear Distinguished Colleagues:

I need to add full-text search and somewhat free-form queries to my application. Our data is made up of items that are stored in a single column family, and we have a bunch of secondary indices for lookups. An item has header fields and data fields, and the structure of the items CF is a super column family with the row key being the item's natural ID, a super column for the header, and a super column for the data. Our application is made up of several redundant/load-balanced servers all pointing at a Cassandra cluster. Our servers run embedded Jetty.

I need to be able to find items by a combination of field values. Currently I have an index for items by field value which works reasonably well. I could also add support for data types and index items by fields of appropriate types, so we can do range queries on items. Ultimately, though, what we want is full-text search with suggestions and human-language sensitivity. We want to search by date ranges, by field values, etc. I did some homework on this topic, and here is what I see as options:

1) Use an SQL database as a helper. This is rather clunky, and I am not sure what it gets us, since just about anything that can be done in SQL can be done in Cassandra with proper structures. Then the problem here also is: where am I going to get an open source database that can handle the workload ? Probably nowhere, nor do I get natural language support.

2) Each of our servers can index data using Lucene, but again we have to come up with a clunky mechanism where either one of the servers does the indexing and results are replicated, or each server does its own indexing.

3) We can use Solr as is; perhaps with some small modifications it can run within our server JVM -- since we already run embedded Jetty. I like this idea, actually, but I know that Solr indexing doesn't take advantage of Cassandra.
4) Datastax Enterprise with search, presumably, supports Solr indexing of existing column families -- but for the life of me I couldn't figure out how exactly it does that. The Wikipedia example shows that Solr can create column families based on Solr schemas that I can then query using Cassandra itself (which is great), and supposedly I can modify those column families directly and Solr will reindex them (which is even better), but I am not sure how that fits into our server design. The other concern is locking into a commercial product, something I am very much worried about.

So, one possibility I can see is using Solr embedded within our own server solution but storing its indexes in the file system outside of Cassandra. This is not optimal, and maybe over time I can add my own support for storing the Solr index in Cassandra w/o relying on the Datastax solution.

In any case, what are your thoughts and experiences ?

Regards, Oleg
Deleting a row from a counter CF
I get this:

InvalidRequestException(why: invalid operation for commutative columnfamily)

Any thoughts ? We use Pelops...
Data aggregation -- help me design a solution
Here are my requirements. We use Cassandra. I get millions of invoice line items into the system. As I load them I need to build up some data structures:

* Invoice line items by invoice id (each line item has an invoice id on it), with total dollar value
* Invoice line items by customer id, with total dollar value
* Invoice line items by territory, with total dollar value

In all of those cases, what we want is to see the total by a given attribute, that's all there is to it. Line items may change daily, i.e. a territory may change or they may correct the values. In this case I need to update the aggregations accordingly.

Here are my ideas:

- I can use counters and store the data in buckets
- I can just store the data in buckets and do the math in Java

In both cases the challenge is that the items can be updated. Which means I need to look up a current version of an item and decide how to proceed. That puts a huge performance penalty on the application (# of line items we receive is in the millions and we need to process them in a timely fashion).

Help me out here -- any ideas on how I could design this in Cassandra ?

Regards, Oleg
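One way to sketch the "look up the current version, apply the delta" approach mentioned above, in plain Python with dicts standing in for the counter CFs and the item lookup (all names are illustrative, not a Cassandra API):

```python
from collections import defaultdict

# Stand-ins for counter CFs keyed by attribute value.
totals_by_customer = defaultdict(float)
totals_by_territory = defaultdict(float)
# Last-seen version of each line item -- the lookup the post worries
# about, needed to turn an update into a pair of counter deltas.
current = {}

def apply_line_item(item_id, customer, territory, amount):
    prev = current.get(item_id)
    if prev is not None:
        # Back out the old values; this handles both territory moves
        # and corrected dollar amounts.
        totals_by_customer[prev["customer"]] -= prev["amount"]
        totals_by_territory[prev["territory"]] -= prev["amount"]
    totals_by_customer[customer] += amount
    totals_by_territory[territory] += amount
    current[item_id] = {"customer": customer, "territory": territory,
                        "amount": amount}

apply_line_item("li-1", "cust-A", "east", 100.0)
apply_line_item("li-2", "cust-A", "west", 50.0)
apply_line_item("li-1", "cust-A", "west", 80.0)  # correction + territory move
print(totals_by_customer["cust-A"])  # 130.0
print(totals_by_territory["west"])   # 130.0
```

The point of the delta formulation is that the expensive aggregate rows are only ever incremented, which maps onto counter columns; the per-item read remains, but it is a single-row lookup rather than a rescan of any bucket.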
Wide rows and reads
Here is my flow: One process write a really wide row (250K+ supercolumns, each one with 5 subcolumns, for the total of 1K or so per supercolumn) Second process comes in literally 2-3 seconds later and starts reading from it. My observation is that nothing good happens. It is ridiculously slow to read. It seems that if I wait long enough, the reads from that row will be much faster. Could someone enlighten me as to what exactly happens when I do this ? Regards, Oleg
Supercolumn behavior on writes
Does a write to a sub column involve deserialization of the entire super column ? Thanks, Oleg
Disappearing keyspaces in Cassandra 1.1
I am using Cassandra 1.1.0 in a 3-node environment. I just truncated a few column families and then restarted the nodes. Now it says my keyspace doesn't exist. The data for the keyspace is still in the data directory. Does anyone know what could have caused this?
Data corruption issues with 1.1
I can't quite describe what happened, but essentially one day I found that my column values that are supposed to be UTF-8 strings started getting bogus characters. Is there a known data corruption issue with 1.1 ?
nodetool repair -- should I schedule a weekly one ?
We have a 3-node cluster. We use RF of 3 and CL of ONE for both reads and writes…. Is there a reason I should schedule a regular nodetool repair job ? Thanks, Oleg
TimedOutException()
We are using Cassandra 1.1.0 with an older Pelops version, but I don't think that in itself is a problem here. I am getting this exception:

TimedOutException()
    at org.apache.cassandra.thrift.Cassandra$get_slice_result.read(Cassandra.java:7660)
    at org.apache.cassandra.thrift.Cassandra$Client.recv_get_slice(Cassandra.java:570)
    at org.apache.cassandra.thrift.Cassandra$Client.get_slice(Cassandra.java:542)
    at org.scale7.cassandra.pelops.Selector$3.execute(Selector.java:683)
    at org.scale7.cassandra.pelops.Selector$3.execute(Selector.java:680)
    at org.scale7.cassandra.pelops.Operand.tryOperation(Operand.java:82)

Is my understanding correct that this is where Cassandra is telling us it can't accomplish something within that timeout value -- as opposed to a network timeout ? Where is it set ?

Thanks, Oleg
Re: TimedOutException()
Tyler Hobbs ty...@datastax.com wrote:

On Fri, Jun 1, 2012 at 9:39 AM, Oleg Dulin oleg.du...@gmail.com wrote: Is my understanding correct that this is where cassandra is telling us it can't accomplish something within that timeout value -- as opposed to network timeout ? Where is it set ?

That's correct. Basically, the coordinator sees that a replica has not responded (or can not respond) before hitting a timeout. This is controlled by rpc_timeout_in_ms in cassandra.yaml.

-- Tyler Hobbs, DataStax, http://datastax.com/

So if we are using the random partitioner and a read consistency of ONE, what does that mean ? We have a 3-node cluster, use write/read consistency of ONE, and a replication factor of 3. Does the node we are connecting to try to proxy requests ? Wouldn't our configuration ensure all nodes have replicas ?
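For reference, the setting Tyler mentions lives in cassandra.yaml; in the 1.0/1.1 era the stock value looked like this (shown only as an illustration -- check your own yaml for the value actually in effect):

```yaml
# cassandra.yaml -- how long the coordinator waits for replica responses
# before answering the client with a TimedOutException
rpc_timeout_in_ms: 10000
```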
Renaming a keyspace in 1.1
Is it possible ? How ?
Data aggregation - averages, sums, etc.
Dear distinguished colleagues:

I am trying to come up with a data model that lets me do aggregations, such as sums and averages. Here are my requirements:

1. Data may be updated concurrently.
2. I want to avoid changing the schema; we have a multi-tenant cloud solution that is driven by configuration. The schema is the same for all customers.

Here is what I have at my disposal:

1. We have a proprietary distributed in-memory column store that acts as a buffer between the server and Cassandra. Frequent reads are not a problem.
2. I know I have counter columns. I can do sums. But can I do averages ?

One of the ideas is to record data as it comes in, organized by time, and periodically aggregate it. Thoughts ?
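Counters alone can't store an average, but one common workaround (not from this thread -- a standard pattern, sketched here under that assumption) is to keep two counters per bucket, a running sum and a running count, and divide at read time. Plain dicts stand in for the counter CFs; with real Cassandra counters you would increment integer columns, e.g. dollar amounts in cents:

```python
from collections import defaultdict

# Two counter columns per bucket: a running sum and a running count.
sums = defaultdict(int)
counts = defaultdict(int)

def record(bucket, value):
    # Two counter increments. Increments commute, so this stays correct
    # under concurrent writers -- which is requirement 1 above.
    sums[bucket] += value
    counts[bucket] += 1

def average(bucket):
    # The average is derived at read time, never stored.
    return sums[bucket] / counts[bucket] if counts[bucket] else 0.0

record("2012-05", 10)
record("2012-05", 20)
record("2012-05", 30)
print(average("2012-05"))  # 20.0
```

Note the caveat with real counters: the two increments are not atomic together, so a reader can briefly see a sum/count pair that is off by one sample.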
Re: how can we get (a lot) more performance from cassandra
Please do keep us posted. We have a somewhat similar Cassandra utilization pattern, and I would like to know what your solution is...

On 2012-05-16 20:38:37, Yiming Sun said:

Thanks Oleg. Another caveat from our side is, we have a very large data space (imagine picking 100 items out of 3 million; the chance of having 2 items from the same bin is pretty low). We will experiment with row cache, and hopefully it will help, not the opposite (the tuning guide says row cache could be detrimental in some circumstances). -- Y.

On Wed, May 16, 2012 at 4:25 PM, Oleg Dulin oleg.du...@gmail.com wrote:

Indeed. This is how we are trying to solve this problem. Our application has a built-in cache that resembles a supercolumn or standard-column data structure and has an API that resembles a combination of Pelops selector and mutator. You can do something like that for Hector. The cache is constrained and uses LRU to purge unused items and keep memory usage steady. It is not perfect and we still have bugs, but it cuts down on 90% of Cassandra reads.

On 2012-05-16 20:07:11, Mike Peters said:

Hi Yiming, Cassandra is optimized for write-heavy environments. If you have a read-heavy application, you shouldn't be running your reads through Cassandra. On the bright side -- Cassandra read throughput will remain consistent, regardless of your volume. But you are going to have to wrap your reads with memcache (or redis), so that the bulk of your reads can be served from memory.

Thanks, Mike Peters

On 5/16/2012 3:59 PM, Yiming Sun wrote:

Hello, I asked the question as a follow-up under a different thread, so I figure I should ask here instead in case the other one gets buried, and besides, I have a little more information. We find the lack of performance disturbing, as we are only able to get about 3-4MB/sec read performance out of Cassandra. We are using Cassandra as the backend for an IR repository of digital texts. It is a read-mostly repository with occasional writes.
Each row represents a book volume, and each column of a row represents a page of the volume. Granted, the data size is small -- the average size of a column text is 2-3KB, and each row has about 250 columns (varies quite a bit from one volume to another). Currently we are running a 3-node cluster, and will soon be upgraded to a 6-node setup. Each node is a VM with 4 cores and 16GB of memory. All VMs use SAN as disk storage.

To retrieve a volume, a slice query is used via Hector that specifies the row key (the volume) and a list of column keys (pages), and the consistency level is set to ONE. It is typical to retrieve multiple volumes per request. The read rate that I have been seeing is about 3-4 MB/sec, and that is reading the raw bytes... using the string serializer the rate is even lower, about 2.2MB/sec.

The server log shows the GC ParNew frequently gets longer than 200ms, often in the range of 4-5 seconds. But nowhere near 15 seconds (which is an indication that the JVM heap is being swapped out). Currently we have not added JNA. From a blog post, it seems JNA is able to increase the performance by 13%, and we are hoping to increase the performance by something more like 1300% (3-4 MB/sec is just disturbingly low). And we are hesitant to disable swap entirely since one of the nodes is running a couple of other services.

Do you have any suggestions on how we may boost the performance? Thanks! -- Y.
Configuring cassandra cluster with host preferences
I am running my processes on the same nodes as Cassandra. What I'd like is for Pelops, when it hands out a connection, to give preference to the Cassandra node local to the host my process is on. Is it possible ? How ?

Regards,
Oleg Dulin

Please note my new office #: 732-917-0159