Re: new node gets no data
ahh, I think you may have hit a corner case here. Is the RF still 1?

INFO [AntiEntropySessions:1] 2012-03-16 06:15:13,727 AntiEntropyService.java (line 663) [repair #%s] No neighbors to repair with on range %s: session completed

means there are no nodes which share the range with this node, so there is nothing to repair. To put it another way: as far as 10.80.161.101 is concerned, none of the keys it is responsible for are stored on another node, so there are no other nodes that could be involved in a repair session.

It looks like some data may have been written to 161.101, so I think the safest approach would be:
* increase the RF to 2
* repair
* decrease the RF to 1

When you added the node, was auto_bootstrap enabled? I would have thought that would stream data from the first node to the new one.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 16/03/2012, at 7:22 PM, Thorsten von Eicken wrote:

Thanks for the suggestion Aaron, unfortunately, that seems to do absolutely nothing:

# nodetool -h localhost repair
INFO [RMI TCP Connection(160)-127.0.0.1] 2012-03-16 06:15:13,718 StorageService.java (line 1770) Starting repair command #1, repairing 1 ranges.
INFO [AntiEntropySessions:1] 2012-03-16 06:15:13,727 AntiEntropyService.java (line 658) [repair #6472b290-6f2f-11e1--472739b10cff] new session: will sync /10.80.161.101 on range (0,85070591730234615865843651857942052864] for rslog_production.[users, req_text, req_attr_idx, req_word_idx, req_word_freq, sessions, requests, info]
INFO [AntiEntropySessions:1] 2012-03-16 06:15:13,727 AntiEntropyService.java (line 663) [repair #%s] No neighbors to repair with on range %s: session completed
INFO [RMI TCP Connection(160)-127.0.0.1] 2012-03-16 06:15:13,727 StorageService.java (line 1807) Repair command #1 completed successfully

Stumped...
TvE

On 3/15/2012 6:41 PM, aaron morton wrote:

Try running nodetool repair on 10.80.161.101 and then cleanup on 10.102.37.168 if everything is ok.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 16/03/2012, at 6:45 AM, Thorsten von Eicken wrote:

I added a second node to a single-node ring. RF=1. I can't get the new node to receive any data. Logs look fine. Here's what nodetool reports:

# nodetool -h localhost ring
Address        DC          Rack   Status  State   Load       Owns    Token
                                                                      85070591730234615865843651857942052864
10.102.37.168  datacenter1 rack1  Up      Normal  807.81 GB  50.00%  0
10.80.161.101  datacenter1 rack1  Up      Normal  1.15 MB    50.00%  85070591730234615865843651857942052864

Just a little imbalance. Yes, I use partitioner: org.apache.cassandra.dht.RandomPartitioner

I tried moving the new node's token up/down by 1 and it triggers the log messages you'd expect, but no data gets transferred. How do I troubleshoot this?
Below are the log messages I see when restarting the new node: INFO [main] 2012-03-15 17:31:08,616 AbstractCassandraDaemon.java (line 120) JVM vendor/version: Java HotSpot(TM) 64-Bit Server VM/1.6.0_24 INFO [main] 2012-03-15 17:31:14,812 CommitLog.java (line 178) Log replay complete, 8 replayed mutations INFO [main] 2012-03-15 17:31:14,825 StorageService.java (line 390) Cassandra version: 1.0.6 INFO [main] 2012-03-15 17:31:14,825 StorageService.java (line 391) Thrift API version: 19.19.0 INFO [main] 2012-03-15 17:31:14,825 StorageService.java (line 404) Loading persisted ring state INFO [main] 2012-03-15 17:31:14,834 StorageService.java (line 482) Starting up server gossip INFO [main] 2012-03-15 17:31:15,372 MessagingService.java (line 247) Starting Encrypted Messaging Service on SSL port 7000 INFO [main] 2012-03-15 17:31:15,376 MessagingService.java (line 268) Starting Messaging Service on port 7001 INFO [main] 2012-03-15 17:31:15,401 StorageService.java (line 579) Using saved token 85070591730234615865843651857942052864 INFO [main] 2012-03-15 17:31:15,402 ColumnFamilyStore.java (line 692) Enqueuing flush of Memtable-LocationInfo@645492252(53/66 serialized/live bytes, 2 ops) INFO [FlushWriter:1] 2012-03-15 17:31:15,403 Memtable.java (line 240) Writing Memtable-LocationInfo@645492252(53/66 serialized/live bytes, 2 ops) INFO [FlushWriter:1] 2012-03-15 17:31:15,421 Memtable.java (line 277) Completed flushing /mnt/ebs/data/system/LocationInfo-hc-32-Data.db (163 bytes) INFO [main] 2012-03-15 17:31:15,424 StorageService.java (line 948) Node /10.80.161.101 state jump to normal INFO [main] 2012-03-15 17:31:15,434 StorageService.java (line 589) Bootstrap/Replace/Move completed! Now serving reads. # describe keyspace Keyspace: rslog_production: Replication Strategy: org.apache.cassandra.locator.SimpleStrategy Durable Writes: true Options: [replication_factor:1] Column Families:
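One way to carry out the "increase the RF, repair, decrease the RF" procedure Aaron suggests is via cassandra-cli and nodetool. This is only a sketch: the keyspace name and node addresses are taken from the thread above, and the exact strategy_options syntax differs between cassandra-cli versions, so check it against your own release first.

use rslog_production;
update keyspace rslog_production with strategy_options = {replication_factor:2};

# on each node, stream the data that the higher RF now requires
nodetool -h 10.80.161.101 repair rslog_production
nodetool -h 10.102.37.168 repair rslog_production

update keyspace rslog_production with strategy_options = {replication_factor:1};

# finally remove data each node no longer owns at RF=1
nodetool -h 10.102.37.168 cleanup
nodetool -h 10.80.161.101 cleanup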
Re: Bootstrapping a new node to a running cluster
I think your original plan is sound.

1. Up the RF to 4.
2. Add the node with auto_bootstrap true.
3. Once bootstrapping has finished the new node has all the data it needs.
4. Check for secondary index creation using describe in the CLI to see which are built. You can also see progress using nodetool compactionstats.

I'm a bit puzzled though, I just tried to increase the RF to 3 in a cluster with N=2. It serves reads and writes without issues at CL.ONE. Is the described restriction something that will be implemented in the future?

I had a quick glance at the code. IIRC there was an explicit check for RF > N, but I cannot find it any more. I'm guessing we now rely on a normal Unavailable failure if there are not enough UP nodes.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 16/03/2012, at 8:56 PM, Mikael Wikblom wrote:

ok, thank you both for the clarification. So the correct approach would be to bootstrap the new node and run repair on each of the nodes in the cluster.

I'm a bit puzzled though, I just tried to increase the RF to 3 in a cluster with N=2. It serves reads and writes without issues at CL.ONE. Is the described restriction something that will be implemented in the future?

Thank you
Regards

On 03/16/2012 03:07 AM, aaron morton wrote:

The documentation is correct. I was mistakenly remembering discussions in the past about RF > #nodes.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 16/03/2012, at 4:34 AM, Doğan Çeçen wrote:

I'm not sure why this is not allowed. As long as I do not use CL.ALL there will be enough nodes available to satisfy the read / write (at least when I look at ReadCallback and the WriteResponseHandler). Or am I missing something here?

According to http://www.datastax.com/docs/1.0/cluster_architecture/replication

As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, it is possible to increase the replication factor, and then add the desired number of nodes afterwards. When the replication factor exceeds the number of nodes, writes will be rejected, but reads will be served as long as the desired consistency level can be met.

--
() ascii ribbon campaign - against html e-mail
/\ www.asciiribbon.org - against proprietary attachments

--
Mikael Wikblom
Software Architect
SiteVision AB
019-217058
mikael.wikb...@sitevision.se
http://www.sitevision.se
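For step 2 of the plan above, a minimal sketch of the relevant cassandra.yaml settings on the joining node. This assumes a 1.0.x node; auto_bootstrap defaults to true and may not appear in the shipped yaml, so setting it explicitly is only for clarity, and the cluster name and seed address are placeholders.

# cassandra.yaml on the new node (sketch)
cluster_name: 'MyCluster'      # must match the existing cluster
initial_token:                 # leave blank, or set an explicit balanced token
auto_bootstrap: true           # stream data for the ranges the node takes over
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "10.0.0.1"  # an existing node, never the new node itself

After the join, step 4 can be watched with:

nodetool -h <new-node> compactionstats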
Re: Bootstrapping a new node to a running cluster
ok, thank you for your time! Cheers On 03/16/2012 10:12 AM, aaron morton wrote: I think your original plan is sound. 1. Up the RF to 4. 2. Add the node with auto_bootstrap true 3. Once bootrapping has finished the new node has all the data it needs. 4. Check for secondary index creation using describe in the CLI to see which are build. You can also see progress using nodetool compactionstats I'm a bit puzzled though, I just tried to increase R to 3 in a cluster with N=2. It serves reads and writes without issues CL.one. Is the described restriction is something that will be implemented in the future? I had a quick glance at the code. IIRC there was an explicit check if RF N, but I cannot find it any more. I'm guessing we now rely on a normal UnavailableFailure if there are not enough UP nodes. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com On 16/03/2012, at 8:56 PM, Mikael Wikblom wrote: ok, thank you both for the clarification. So the correct approach would be to bootstrap the new node and run repair on each of the nodes in the cluster. I'm a bit puzzled though, I just tried to increase R to 3 in a cluster with N=2. It serves reads and writes without issues CL.one. Is the described restriction is something that will be implemented in the future? Thank you Regards On 03/16/2012 03:07 AM, aaron morton wrote: The documentation is correct. I was mistakenly remembering discussions in the past about RF #nodes. Cheers - Aaron Morton Freelance Developer @aaronmorton http://www.thelastpickle.com http://www.thelastpickle.com/ On 16/03/2012, at 4:34 AM, Doğan Çeçen wrote: I'm not sure why this is not allowed. As long as I do not use CL.all there will be enough nodes available to satisfy the read / write (at least when I look at ReadCallback and the WriteResponseHandler). Or am I missing something here? According to http://www.datastax.com/docs/1.0/cluster_architecture/replication As a general rule, the replication factor should not exceed the number of nodes in the cluster. However, it is possible to increase replication factor, and then add the desired number of nodes afterwards. When replication factor exceeds the number of nodes, writes will be rejected, but reads will be served as long as the desired consistency level can be met. -- () ascii ribbon campaign - against html e-mail /\ www.asciiribbon.org http://www.asciiribbon.org/ - against proprietary attachments -- Mikael Wikblom Software Architect SiteVision AB 019-217058 mikael.wikb...@sitevision.se http://www.sitevision.se -- Mikael Wikblom Software Architect SiteVision AB 019-217058 mikael.wikb...@sitevision.se http://www.sitevision.se
Re: Datastax Enterprise mixed workload cluster configuration
Hi,

Since this thread already contains the system setup, I just want to ask another question: if you have 3 data centers (DC1, DC2 and DC3) and a keyspace whose strategy options give each DC one replica, and you only write to the nodes in DC1, what path do the replicas take, assuming you've correctly interleaved and evenly spaced the tokens of all the nodes? If you write a record to a node in DC1, will it replicate it to the node in DC2, and the node in DC2 then replicate it to the node in DC3? Or will the node in DC1 replicate the record to both DC2 and DC3?

Cheers,
Alex

On Thu, Mar 15, 2012 at 11:26 PM, Alexandru Sicoe adsi...@gmail.com wrote:

Sorry for that last message, I was confused because I thought I needed to use the DseSimpleSnitch, but of course I can use the PropertyFileSnitch and that allows me to get the configuration with 3 data centers explained.

Cheers,
Alex

On Thu, Mar 15, 2012 at 10:56 AM, Alexandru Sicoe adsi...@gmail.com wrote:

Thanks Tyler, I see that cassandra.yaml has endpoint_snitch: com.datastax.bdp.snitch.DseSimpleSnitch. Will this pick up the configuration from the cassandra-topology.properties file as the PropertyFileSnitch does? Or is there some other way of telling it which nodes are in which DC?

Cheers,
Alex

On Wed, Mar 14, 2012 at 9:09 PM, Tyler Hobbs ty...@datastax.com wrote:

Yes, you can do this. You will want to have three DCs: DC1 with [1, 2, 3], DC2 with [4, 5, 6], and DC3 with [7, 8, 9].

For your normal data keyspace, the replication strategy should be NTS, and the strategy_options should have some replicas in each of the three DCs. For example: {DC1: 3, DC2: 3, DC3: 3} if you need that level of replication in each one (although you probably only want an RF of 1 for DC3).

Your clients that are performing writes should only open connections against the nodes in DC1, and you should write at CL.ONE or CL.LOCAL_QUORUM. Likewise for reads, your clients should only connect to nodes in DC2, and you should read at CL.ONE or CL.LOCAL_QUORUM. The nodes in DC3 should run as analytics nodes. I believe the default CL for m/r jobs is ONE, which would work.

As far as tokens go, interleaving all three DCs and evenly spacing the tokens will work. For example, the ordering of your nodes might be [1, 4, 7, 2, 5, 8, 3, 6, 9].

On Wed, Mar 14, 2012 at 12:05 PM, Alexandru Sicoe adsi...@gmail.com wrote:

Hi everyone,

I want to test out the Datastax Enterprise software to have a mixed workload setup with an analytics and a real time part. However I am not sure how to configure it to achieve what I want:

I will have 3 real machines on one side of a gateway (1,2,3) and 6 VMs on the other (4-9). 1,2,3 will each have a normal Cassandra node that just takes data directly from my data sources. I want them to replicate the data to the other 6 VMs. Now, out of those 6 VMs, 4,5,6 will run normal Cassandra nodes and 7,8,9 will run Analytics nodes. So I only want to write to 1,2,3, I only want to serve user reads from 4,5,6, and I want to do analytics on 7,8,9.

Can I achieve this by configuring 1,2,3,4,5,6 as normal nodes and the rest as analytics nodes? If I alternate the tokens as it's explained in http://www.datastax.com/docs/1.0/datastax_enterprise/init_dse_cluster#init-dse is it analogous to achieving something like 3 DCs each getting their own replica?

Thanks,
Alex

--
Tyler Hobbs
DataStax
http://datastax.com/
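A sketch of the keyspace and snitch configuration Tyler describes, assuming the PropertyFileSnitch is in use. All IP addresses and the keyspace name are placeholders, and the strategy_options syntax should be checked against your cassandra-cli version.

conf/cassandra-topology.properties (one line per node; placeholder IPs):

192.168.1.1=DC1:RAC1
192.168.1.2=DC1:RAC1
192.168.1.3=DC1:RAC1
192.168.2.4=DC2:RAC1
192.168.2.5=DC2:RAC1
192.168.2.6=DC2:RAC1
192.168.3.7=DC3:RAC1
192.168.3.8=DC3:RAC1
192.168.3.9=DC3:RAC1

In cassandra-cli, the keyspace with one replica group per DC (3/3/1 as suggested above):

create keyspace MyKeyspace
  with placement_strategy = 'org.apache.cassandra.locator.NetworkTopologyStrategy'
  and strategy_options = {DC1:3, DC2:3, DC3:1};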
Re: CASSANDRA-2388 - ColumnFamilyRecordReader fails for a given split because a host is down
Sorry for such a late reply. I'm not always keeping up with the mailing list.

Is the following scenario covered by 2388? I have a test cluster of 6 nodes with a replication factor of 3. Each server can execute hadoop tasks. 1 cassandra node is down for the test. The job is kicked off from the node 1 jobtracker.

A task is executed from node 1, and fails because the local cassandra instance is down
retry on node 6, this tries to connect to node 1 and fails
retry on node 5, this tries to connect to node 1 and fails
retry on node 4, this tries to connect to node 1 and fails

After 4 failures the task is killed and the job fails. Node 2 and 3, which contain the other replicas, never run the task. The node selection seems to be random. I can modify the cassandra code to check connectivity in ColumnFamilyRecordReader but I suspect this is fixing the wrong problem.

There are two problems here. 1) hadoop's jobtracker isn't preferencing tasks to tasktrackers that would provide data locality. 2) connections to replica nodes are never attempted directly; instead the task must fail and be re-submitted to another tasktracker, which hopefully is a replica node.

[snip]

but this comment from mck seems to say it should work
http://mail-archives.apache.org/mod_mbox/cassandra-user/201109.mbox/%3C1315253057.7466.222.camel@localhost%3E

not in your case.

ColumnFamilyInputFormat splits the query into InputSplits. This is done via the api calls describe_ring and describe_splits. These InputSplits (ColumnFamilySplit) each have a list of locations, which are the replica nodes.

Now hadoop is supposed to preference sending tasks to tasktrackers based on the split's locations. This is problem (1). I haven't seen it actually work. The closest information I got is http://abel-perez.com/hadoop-task-assignment

Problem (2) is that ColumnFamilyRecordReader.getLocation() returns the address from the list of locations for the current split that matches the localhost. This preferences data locality. If none of the locations is local then it simply returns the first location in the list. This explains your use case not working.

One fix for you to experiment with is to increase the allowed task failures (I think it is mapred.max.tracker.failures) to the number of nodes you have. Then each node would be (randomly) tried before the task is killed and the job fails.

~mck

--
Friendship with the upright, with the truthful and with the well informed is beneficial. Friendship with those who flatter, with those who are meek and who compromise with principles, and with those who talk cleverly is harmful. Confucius

| http://github.com/finn-no | http://tech.finn.no |
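A sketch of mck's workaround, set per job in the Hadoop driver code. The property names are only candidates: mck is unsure whether mapred.max.tracker.failures is the right knob, and on many Hadoop 0.20/1.x installs the per-task retry limit is mapred.map.max.attempts (default 4, which matches the four failures seen above), so verify both against the Hadoop version in use before relying on either.

import org.apache.hadoop.mapred.JobConf;

public class RetryTuning {
    public static void main(String[] args) {
        JobConf conf = new JobConf(RetryTuning.class);
        // Property mck mentions (name unverified): failures tolerated per tasktracker for this job.
        conf.setInt("mapred.max.tracker.failures", 6);
        // Per-task attempt limit on 0.20/1.x Hadoop; raising it lets more replica nodes be tried.
        conf.setInt("mapred.map.max.attempts", 6);
        // ... configure ColumnFamilyInputFormat and submit the job as usual ...
    }
}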
Re: Single Node Cassandra Installation
You'll need to read or write at quorum (or higher) to get consistent data from the cluster, so you may as well do both at quorum.

Now that you mention it, I was wrong about downtime: with a two node cluster, reads or writes at quorum mean both nodes need to be online. Perhaps you could have an emergency switch in your application which flips to a consistency level of ONE if one of your Cassandra servers goes down? Just make sure it's set back to quorum when the second one returns, or again you could end up with inconsistent data.

On Fri, Mar 16, 2012 at 2:04 AM, Drew Kutcharian d...@venarc.com wrote:

Thanks for the comments, I guess I will end up doing a 2 node cluster with replica count 2 and read consistency 1.

-- Drew

On Mar 15, 2012, at 4:20 PM, Thomas van Neerijnen wrote:

So long as data loss and downtime are acceptable risks, a one node cluster is fine. Personally this is usually only acceptable on my workstation; even my dev environment is redundant, because servers fail, usually when you least want them to, like for example when you've decided to save costs by waiting before implementing redundancy. Could a failure end up costing you more than you've saved? I'd rather get cheaper servers (maybe even used off ebay??) so I could have at least two of them.

If you do go with a one node solution, although I haven't tried it myself Priam looks like a good place to start for backups; otherwise roll your own with incremental snapshotting turned on and a watch on the snapshot directory. Storage on something like S3 or Cloud Files is very cheap so there's no good excuse for no backups.

On Thu, Mar 15, 2012 at 7:12 PM, R. Verlangen ro...@us2.nl wrote:

Hi Drew,

One other disadvantage is the lack of consistency level and replication. Both are part of the high availability / redundancy. So you would really need to back up your single-node-cluster to some other external location.

Good luck!

2012/3/15 Drew Kutcharian d...@venarc.com

Hi,

We are working on a project that initially is going to have very little data, but we would like to use Cassandra to ease the future scalability. Due to budget constraints, we were thinking to run a single node Cassandra for now and then add more nodes as required.

I was wondering if it is recommended to run a single node cassandra in production? Are there any other issues besides lack of high availability?

Thanks,

Drew
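A minimal sketch of the "emergency switch" Thomas describes. The class and method names are made up for illustration and are not part of any Cassandra client library; the point is a single, centrally controlled flag that downgrades the consistency level from QUORUM to ONE while one of the two nodes is down, and is flipped back once it returns.

// Hypothetical helper: pick the consistency level for a two-node, RF=2 cluster.
public final class EmergencyConsistency {
    public enum Level { ONE, QUORUM }

    // Flip via JMX, a config reload, or an admin endpoint when a node goes down.
    private static volatile boolean degraded = false;

    public static void setDegraded(boolean d) { degraded = d; }

    // Use the returned level for both reads and writes.
    public static Level current() {
        return degraded ? Level.ONE : Level.QUORUM;
    }

    public static void main(String[] args) {
        System.out.println(current());   // QUORUM while both nodes are up
        setDegraded(true);
        System.out.println(current());   // ONE while a node is down
    }
}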
Re: 1.0.8 with Leveled compaction - Possible issues
Perfect.. this helped a lot - and I can confirm that I have run into the same issue as described in:
http://mail-archives.apache.org/mod_mbox/cassandra-user/201203.mbox/%3CCALqbeQbQ=d-hORVhA-LHOo_a5j46fQrsZMm+OQgfkgR=4rr...@mail.gmail.com%3E

where it goes down when it tries to move files up to a higher level that is out of bounds. Nice that I could get an overview of the levels by looking in the .json file as well.

Any timeframe on when we can expect 1.0.9 to be released?

/Johan

--
Johan Elmerfjord | Sr. Systems Administration/Mgr, EMEA | Adobe Systems, Product Technical Operations | p. +45 3231 6008 | x86008 | cell. +46 735 101 444 | jelme...@adobe.com

On Thu, 2012-03-15 at 17:00 -0700, Watanabe Maki wrote:

update column family with the LCS option + upgradesstables should convert all of your sstables.
Set the log4j config org.apache.cassandra.db.compaction=DEBUG in conf/log4j-server.properties and retry your procedure to find out what is happening.

maki

On 2012/03/16, at 7:05, Johan Elmerfjord jelme...@adobe.com wrote:

Hi, I'm testing the community-version of Cassandra 1.0.8. We are currently on 0.8.7 in our production-setup.

We have 3 Column Families that each take between 20 and 35 GB on disk per node (8*2 nodes total). We would like to change to Leveled Compaction - and even try compression as well to reduce the space needed for compactions. We are running on SSD-drives as latency is a key issue.

As a test I have imported one Column Family from 3 production-nodes to a 3 node test-cluster. The data on the 3 nodes ranges from 19-33GB (with at least one large SSTable (Tiered size - recently compacted)). After loading this data to the 3 test-nodes, and running scrub and repair, I took a backup of the data so I have a good test-set to work on.

Then I changed to leveled compaction, using the cassandra-cli:
UPDATE COLUMN FAMILY TestCF1 WITH compaction_strategy=LeveledCompactionStrategy;

I could see the change being written to the logfile on all nodes. Then I don't know for sure if I need to run anything else to make the change happen - or if it's just to wait. My test-cluster does not receive new data.

For this KS/CF and on each of the nodes I have tried some or several of: upgradesstable, scrub, compact, cleanup and repair - each task taking between 40 minutes and 4 hours. With the exception of compact, which returns almost immediately with no visible compactions made. On some nodes I ended up with over 3 files with the default 5MB size for leveled compaction; on another node it didn't look like anything had been done and I still have a 19GB SSTable.

I then made another change:
UPDATE COLUMN FAMILY TestCF1 WITH compaction_strategy=LeveledCompactionStrategy AND compaction_strategy_options=[{sstable_size_in_mb: 64}];
WARNING: [{}] strategy_options syntax is deprecated, please use {}

Which is probably wrong in the documentation - and should be:
UPDATE COLUMN FAMILY TestCF1 WITH compaction_strategy=LeveledCompactionStrategy AND compaction_strategy_options={sstable_size_in_mb: 64};

I think that we will be able to find the data in 3 searches with a 64MB size - and still only use around 700MB while doing compactions - and keep the number of files ~3000 per CF.

A few days later it looks like I still have a mix between the original huge SSTables, 5MB ones - and some nodes have 64MB files as well. Do I need to do something special to clean this up? I have tried another scrub / upgradesstables / cleanup - but nothing seems to change anything for me.

Finally I have also tried to enable compression:
UPDATE COLUMN FAMILY TestCF1 WITH compression_options=[{sstable_compression:SnappyCompressor, chunk_length_kb:64}];
- which results in the same [{}] warning.

As you can see below, this created CompressionInfo.db files on some nodes - but not on all.

Is there a way I can force Tiered sstables to be converted into Leveled ones - and then to compression as well?
Why are the original files (Tiered Sized SSTables) still present on testnode1 - when is it supposed to delete them?
Can I change back and forth between compression (on/off - or chunksizes) - and between Leveled vs Size Tiered compaction?
Is there a way to see if the node is done - or waiting for something?
When is it safe to apply another setting - does it have to complete one reorg before moving on to the next?

Any input or own experiences are warmly welcome.

Best regards, Johan

Some lines of example directory-listings below:

Some files for testnode 3 (looks like it still has the original Size Tiered files around, and a mixture of compressed 64MB files - and 5MB files?)
total 19G
drwxr-xr-x 3 cass cass 4.0K Mar 13 17:11 snapshots
-rw-r--r-- 1 cass cass 6.0G Mar 13 18:42 TestCF1-hc-6346-Index.db
-rw-r--r-- 1 cass cass 1.3M
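Putting maki's advice and the corrected syntax together, the conversion of one CF might look like the following sketch (1.0.8 cassandra-cli and nodetool; TestKS is a placeholder keyspace name, and upgradesstables must run on every node so the existing SSTables are rewritten under the new strategy and compression settings):

UPDATE COLUMN FAMILY TestCF1
  WITH compaction_strategy=LeveledCompactionStrategy
  AND compaction_strategy_options={sstable_size_in_mb: 64};

UPDATE COLUMN FAMILY TestCF1
  WITH compression_options={sstable_compression:SnappyCompressor, chunk_length_kb:64};

# on each node, rewrite the existing SSTables under the new settings
nodetool -h <node> upgradesstables TestKS TestCF1

# watch progress
nodetool -h <node> compactionstats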
RE: Large hints column family
I took the "reset the world" approach; things are much better now and the hints table is staying empty. Bit disconcerting that it could get so large and not be able to recover itself, but at least there was a solution.

Thanks

From: aaron morton [mailto:aa...@thelastpickle.com]
Sent: Thursday, March 15, 2012 7:24 PM
To: user@cassandra.apache.org
Subject: Re: Large hints column family

These messages make it look like the node is having trouble delivering hints.

INFO [HintedHandoff:1] 2012-03-13 16:13:34,188 HintedHandOffManager.java (line 284) Endpoint /192.168.20.4 died before hint delivery, aborting
INFO [HintedHandoff:1] 2012-03-13 17:03:50,986 HintedHandOffManager.java (line 354) Timed out replaying hints to /192.168.20.3; aborting further deliveries

Take another look at the logs on this machine and on 20.4 and 20.3. I would be looking into why so many hints are being stored. GC? Are there also logs about dropped messages?

If you want to reset the world, make sure the nodes have all run repair and then drop the hints. Either via JMX, or by stopping the node and deleting the files on disk.

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 16/03/2012, at 12:58 PM, Bryce Godfrey wrote:

We were having some occasional memory pressure issues, but we just added some more RAM a few days ago to the nodes and things are running more smoothly now; in general nodes have not been going up and down. I tried to do a list HintsColumnFamily from cassandra-cli and it locks my Cassandra node and never returns, forcing me to kill the Cassandra process and restart it to get the node back.

Here are my settings, which I believe are defaults since I don't remember changing them:

hinted_handoff_enabled: true
max_hint_window_in_ms: 360 # one hour
hinted_handoff_throttle_delay_in_ms: 50

Grepping for Hinted in the system log I get these:

INFO [HintedHandoff:1] 2012-03-13 16:13:22,215 HintedHandOffManager.java (line 373) Finished hinted handoff of 852703 rows to endpoint /192.168.20.3
INFO [HintedHandoff:1] 2012-03-13 16:13:34,188 HintedHandOffManager.java (line 284) Endpoint /192.168.20.4 died before hint delivery, aborting
INFO [ScheduledTasks:1] 2012-03-13 16:15:32,569 StatusLogger.java (line 65) HintedHandoff 1 1 0
INFO [HintedHandoff:1] 2012-03-13 16:15:44,362 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3
INFO [HintedHandoff:1] 2012-03-13 16:21:37,266 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3
INFO [ScheduledTasks:1] 2012-03-13 16:23:07,662 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-13 16:25:49,330 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-13 16:30:52,503 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [ScheduledTasks:1] 2012-03-13 16:42:22,202 StatusLogger.java (line 65) HintedHandoff 1 2 0
INFO [HintedHandoff:1] 2012-03-13 17:03:50,986 HintedHandOffManager.java (line 354) Timed out replaying hints to /192.168.20.3; aborting further deliveries
INFO [HintedHandoff:1] 2012-03-13 17:03:50,986 ColumnFamilyStore.java (line 704) Enqueuing flush of Memtable-HintsColumnFamily@661547256(34298224/74465815 serialized/live bytes, 78808 ops)
INFO [HintedHandoff:1] 2012-03-13 17:11:00,098 HintedHandOffManager.java (line 373) Finished hinted handoff of 44160 rows to endpoint /192.168.20.3
INFO [HintedHandoff:1] 2012-03-13 17:11:36,596
HintedHandOffManager.java (line 296) Started hinted handoff for token: 56713727820156407428984779325531226112 with IP: /192.168.20.4 INFO [ScheduledTasks:1] 2012-03-13 17:12:25,248 StatusLogger.java (line 65) HintedHandoff 1 2 0 INFO [HintedHandoff:1] 2012-03-13 18:47:56,151 HintedHandOffManager.java (line 296) Started hinted handoff for token: 113427455640312814857969558651062452224 with IP: /192.168.20.3 INFO [ScheduledTasks:1] 2012-03-13 18:50:24,326 StatusLogger.java (line 65) HintedHandoff 1 2 0 INFO [ScheduledTasks:1] 2012-03-14 12:12:48,177 StatusLogger.java (line 65) HintedHandoff 1 2 0 INFO [ScheduledTasks:1] 2012-03-14 12:13:57,685 StatusLogger.java (line 65) HintedHandoff 1 2 0 INFO [ScheduledTasks:1] 2012-03-14 12:14:57,258 StatusLogger.java (line 65) HintedHandoff 1 2 0 INFO [ScheduledTasks:1] 2012-03-14 12:14:58,260 StatusLogger.java (line 65) HintedHandoff 1 2 0 INFO [ScheduledTasks:1] 2012-03-14 12:15:59,093 StatusLogger.java (line 65) HintedHandoff
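A sketch of the "drop the hints" step Aaron describes, done the file-deletion way. It assumes a typical data directory layout (/var/lib/cassandra/data on many installs; adjust the path to match your data_file_directories setting), and it should only be run after repair has completed on all nodes.

# on the node holding the oversized hints, after repair has completed everywhere
nodetool -h localhost drain       # flush memtables and stop accepting writes
sudo service cassandra stop       # or however you manage the process

# remove the stored hints for the system keyspace
rm /var/lib/cassandra/data/system/HintsColumnFamily-*

sudo service cassandra start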
Order rows numerically
If I define my rowkeys to be Integer (key_validation_class=IntegerType), how can I order the rows numerically? ByteOrderedPartitioner orders lexically, and retrieval using get_range does not seem to return keys in a meaningful order. If I were to change the rowkey to be UTF8 (key_validation_class=UTF8Type), BOP still does not give numerical ordering. For a range of rowkeys from 1 to 2, I get 1, 10, 11, ..., 2 (lexical ordering). Any workaround for this? Thanks.
RE: 0.8.1 Vs 1.0.7
I would guess more aggressive compaction settings. Did you update rows or insert some twice? If you run major compaction a couple of times on the 0.8.1 cluster, does the data size get smaller? You can use the describe command to check if compression got turned on.

-Jeremiah

From: Ravikumar Govindarajan [ravikumar.govindara...@gmail.com]
Sent: Thursday, March 15, 2012 4:41 AM
To: user@cassandra.apache.org
Subject: 0.8.1 Vs 1.0.7

Hi,

I ran some data import tests for cassandra 0.8.1 and 1.0.7. The results were a little bit surprising.

0.8.1, SimpleStrategy, Rep_Factor=3, QUORUM writes, RP, SimpleSnitch
XXX.XXX.XXX.A datacenter1 rack1 Up Normal 140.61 GB 12.50%
XXX.XXX.XXX.B datacenter1 rack1 Up Normal 139.92 GB 12.50%
XXX.XXX.XXX.C datacenter1 rack1 Up Normal 138.81 GB 12.50%
XXX.XXX.XXX.D datacenter1 rack1 Up Normal 139.78 GB 12.50%
XXX.XXX.XXX.E datacenter1 rack1 Up Normal 137.44 GB 12.50%
XXX.XXX.XXX.F datacenter1 rack1 Up Normal 138.48 GB 12.50%
XXX.XXX.XXX.G datacenter1 rack1 Up Normal 140.52 GB 12.50%
XXX.XXX.XXX.H datacenter1 rack1 Up Normal 145.24 GB 12.50%

1.0.7, NTS, Rep_Factor {DC1:3, DC2:2}, LOCAL_QUORUM writes, RP [DC2 machines yet to join ring], PropertyFileSnitch
XXX.XXX.XXX.A DC1 RAC1 Up Normal 48.72 GB 12.50%
XXX.XXX.XXX.B DC1 RAC1 Up Normal 51.23 GB 12.50%
XXX.XXX.XXX.C DC1 RAC1 Up Normal 52.4 GB 12.50%
XXX.XXX.XXX.D DC1 RAC1 Up Normal 49.64 GB 12.50%
XXX.XXX.XXX.E DC1 RAC1 Up Normal 48.5 GB 12.50%
XXX.XXX.XXX.F DC1 RAC1 Up Normal 53.38 GB 12.50%
XXX.XXX.XXX.G DC1 RAC1 Up Normal 51.11 GB 12.50%
XXX.XXX.XXX.H DC1 RAC1 Up Normal 53.36 GB 12.50%

There seems to be a 3x saving in size for the same dataset running 1.0.7. I have not enabled compression for any of the CFs. Will it be enabled by default when creating a new CF in 1.0.7?

cassandra.yaml is also mostly identical.

Thanks and Regards,
Ravi
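Two quick checks along the lines Jeremiah suggests (a sketch only; the keyspace and column family names are placeholders):

# 0.8.1 cluster: trigger a major compaction on one node, then compare the load
nodetool -h <node> compact MyKeyspace MyCF

# 1.0.7 cluster: inspect the CF definitions for compression settings, in cassandra-cli
use MyKeyspace;
describe;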
Re: Single Node Cassandra Installation
Doing reads and writes at CL=1 with RF=2 N=2 does not imply that the reads will be inconsistent. It's more complicated than the simple counting of blocked replicas. It is easy to support the notion that it will be largely consistent, in fact very consistent for most use cases. By default Cassandra tries to write to both nodes, always. Writes will only fail (on a node) if it is down, and even then hinted handoff will attempt to keep both nodes in sync when the troubled node comes back up. The point of having two nodes is to have read and write availability in the face of transient failure. If you are interested there is a good exposition of what 'consistency' means in a system like Cassandra from the link below[1]. [1] http://www.eecs.berkeley.edu/~pbailis/projects/pbs/ On Fri, Mar 16, 2012 at 6:50 AM, Thomas van Neerijnen t...@bossastudios.comwrote: You'll need to either read or write at at least quorum to get consistent data from the cluster so you may as well do both. Now that you mention it, I was wrong about downtime, with a two node cluster reads or writes at quorum will mean both nodes need to be online. Perhaps you could have an emergency switch in your application which flips to consistency of 1 if one of your Cassandra servers goes down? Just make sure it's set back to quorum when the second one returns or again you could end up with inconsistent data. On Fri, Mar 16, 2012 at 2:04 AM, Drew Kutcharian d...@venarc.com wrote: Thanks for the comments, I guess I will end up doing a 2 node cluster with replica count 2 and read consistency 1. -- Drew On Mar 15, 2012, at 4:20 PM, Thomas van Neerijnen wrote: So long as data loss and downtime are acceptable risks a one node cluster is fine. Personally this is usually only acceptable on my workstation, even my dev environment is redundant, because servers fail, usually when you least want them to, like for example when you've decided to save costs by waiting before implementing redundancy. Could a failure end up costing you more than you've saved? I'd rather get cheaper servers (maybe even used off ebay??) so I could have at least two of them. If you do go with a one node solution, altho I haven't tried it myself Priam looks like a good place to start for backups, otherwise roll your own with incremental snapshotting turned on and a watch on the snapshot directory. Storage on something like S3 or Cloud Files is very cheap so there's no good excuse for no backups. On Thu, Mar 15, 2012 at 7:12 PM, R. Verlangen ro...@us2.nl wrote: Hi Drew, One other disadvantage is the lack of consistency level and replication. Both ware part of the high availability / redundancy. So you would really need to backup your single-node-cluster to some other external location. Good luck! 2012/3/15 Drew Kutcharian d...@venarc.com Hi, We are working on a project that initially is going to have very little data, but we would like to use Cassandra to ease the future scalability. Due to budget constraints, we were thinking to run a single node Cassandra for now and then add more nodes as required. I was wondering if it is recommended to run a single node cassandra in production? Are there any other issues besides lack of high availability? Thanks, Drew -- Ben Coverston DataStax -- The Apache Cassandra Company
cassandra-cli and unreachable status confusion
Hi guys,

While creating a schema on our cluster today I didn't get any errors, even when some of the hosts in the cluster were unreachable (not the ones in the same data centre, but in another region). The cli kept showing that all nodes were agreeing. Afterwards, when I did describe cluster, I did get the appropriate unreachable messages for the nodes that were timing out on connections.

Can someone please explain: at the time of schema creation, do the nodes just talk to the other nodes within the DC for agreement, or do they have to talk to each and every node in the whole cluster before agreeing on schema changes?

cheers,
Shoaib
Re: 1.0.8 with Leveled compaction - Possible issues
The Cassandra team has been released new version every month last half year. So I anticipate they will release 1.0.9 before April. Just my forecast:-) maki On 2012/03/16, at 22:41, Johan Elmerfjord jelme...@adobe.com wrote: Perfect.. this helped a lot - and I can confirm that I have run in to the same issue as described in: http://mail-archives.apache.org/mod_mbox/cassandra-user/201203.mbox/%3CCALqbeQbQ=d-hORVhA-LHOo_a5j46fQrsZMm+OQgfkgR=4rr...@mail.gmail.com%3E Where it goes down when it tries to move up files to a higher level - that is out of bounds. Nice that I could get a overview of the levels by looking in the .json-file as well. Any timeframe on when we can expect 1.0.9 to be released? /Johan -- Johan Elmerfjord | Sr. Systems Administration/Mgr, EMEA | Adobe Systems, Product Technical Operations | p. +45 3231 6008 | x86008 | cell. +46 735 101 444 | jelme...@adobe.com On Thu, 2012-03-15 at 17:00 -0700, Watanabe Maki wrote: update column family with LCS option + upgradesstables should convert all of your sstables. Set lig4j config: org.apache.cassandra.db.compaction=DEBUG in conf/log4j-server.properties and retry your procedure to find what is happen. maki On 2012/03/16, at 7:05, Johan Elmerfjord jelme...@adobe.com wrote: Hi, I'm testing the community-version of Cassandra 1.0.8. We are currently on 0.8.7 in our production-setup. We have 3 Column Families that each takes between 20 and 35 GB on disk per node. (8*2 nodes total) We would like to change to Leveled Compaction - and even try compression as well to reduce the space needed for compactions. We are running on SSD-drives as latency is a key-issue. As test I have imported one Column Family from 3 production-nodes to a 3 node test-cluster. The data on the 3 nodes ranges from 19-33GB. (with at least one large SSTable (Tiered size - recently compacted)). After loading this data to the 3 test-nodes, and running scrub and repair, I took a backup of the data so I have good test-set of data to work on. Then I changed changed to leveled compaction, using the cassandra-cli: UPDATE COLUMN FAMILY TestCF1 WITH compaction_strategy=LeveledCompactionStrategy; I could see the change being written to the logfile on all nodes. Then I don't know for for sure if I need to run anything else to make the change happen - or if it's just to wait. My test-cluster does not receive new data. For this KS CF and on each of the nodes I have tried some or several of: upgradesstable, scrub, compact, cleanup and repair - each task taking between 40 minutes and 4 hours. With the exception of compact that returns almost immediately with no visible compactions made. On some node I ended up with over 3 files with the default 5MB size for leveled compaction, on another node it didn't look like anything has been done and I still have a 19GB SSTable. I then made another change. UPDATE COLUMN FAMILY TestCF1 WITH compaction_strategy=LeveledCompactionStrategy AND compaction_strategy_options=[{sstable_size_in_mb: 64}]; WARNING: [{}] strategy_options syntax is deprecated, please use {} Which is probably wrong in the documentation - and should be: UPDATE COLUMN FAMILY TestCF1 WITH compaction_strategy=LeveledCompactionStrategy AND compaction_strategy_options={sstable_size_in_mb: 64}; I think that we will be able to find the data in 3 searches with a 64MB size - and still only use around 700MB while doing compactions - and keep the number of files ~3000 per CF. 
A few days later it looks like I still have a mix between original huge SStables, 5MB once - and some nodes has 64MB files as well. Do I need to do something special to clean this up? I have tried another scrub /upgradesstables/clean - but nothing seems to do any change to me. Finally I have also tried to enable compression: UPDATE COLUMN FAMILY TestCF1 WITH compression_options=[{sstable_compression:SnappyCompressor, chunk_length_kb:64}]; - which results in the same [{}] - warning. As you can see below - this created CompressionInfo.db - files on some nodes - but not on all. Is there a way I can force Teired sstables to be converted into Leveled once - and then to compression as well? Why are the original file (Tiered Sized SSTables still present on testnode1 - when is it supposed to delete them? Can I change back and forth between compression (on/off - or chunksizes) - and between Leveled vs Size Tiered compaction? Is there a way to see if the node is done - or waiting for something? When is it safe to apply another setting - does it have to complete one reorg before moving on to the next? Any input or own experiences are warmly welcome. Best regards, Johan Some lines of example directory-listings below.: Some files for testnode 3. (looks like it's still have the original Size Tiered files around, and a mixture of compressed 64MB
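The log4j change maki suggests above would look something like this in conf/log4j-server.properties. This is a sketch of standard log4j logger syntax rather than a line copied from a shipped config, so keep the rest of the file as it is.

# conf/log4j-server.properties: turn on compaction debug logging
log4j.logger.org.apache.cassandra.db.compaction=DEBUG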
Re: Order rows numerically
How about padding the keys with leading zeros? E.g. 0001, 0002, etc.

maki

On 2012/03/17, at 6:29, A J s5a...@gmail.com wrote:

If I define my rowkeys to be Integer (key_validation_class=IntegerType), how can I order the rows numerically? ByteOrderedPartitioner orders lexically, and retrieval using get_range does not seem to return keys in a meaningful order. If I were to change the rowkey to be UTF8 (key_validation_class=UTF8Type), BOP still does not give numerical ordering. For a range of rowkeys from 1 to 2, I get 1, 10, 11, ..., 2 (lexical ordering). Any workaround for this? Thanks.
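A small sketch of maki's suggestion: pad every key to a fixed width so that byte-wise ordering under the ByteOrderedPartitioner matches numeric ordering. The class name is illustrative only; the one requirement is that the width covers the largest key you will ever write.

// Zero-pad integer keys so lexical (byte) order equals numeric order under BOP.
public final class PaddedKeys {
    // 10 digits is enough for any non-negative 32-bit int.
    static String pad(long key) {
        return String.format("%010d", key);
    }

    public static void main(String[] args) {
        System.out.println(pad(1));   // 0000000001
        System.out.println(pad(2));   // 0000000002
        System.out.println(pad(10));  // 0000000010 -> now sorts after 2, not after 1
    }
}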
Re: Order rows numerically
if your keys are 1-n and you are using BOP, then almost certainly your ring will be massively unbalanced with the first node getting clobbered. You'll have bigger issues than getting lexical ordering. I'd try to rethink your design so that you don't need BOP. On 03/16/2012 06:49 PM, Watanabe Maki wrote: How about to fill zeros before smaller digits? Ex. 0001, 0002, etc maki On 2012/03/17, at 6:29, A Js5a...@gmail.com wrote: If I define my rowkeys to be Integer (key_validation_class=IntegerType) , how can I order the rows numerically ? ByteOrderedPartitioner orders lexically and retrieval using get_range does not seem to make sense in order. If I were to change rowkey to be UTF8 (key_validation_class=UTF8Type), BOP still does not give numerical enough. For range of rowkey from 1 to 2, I get 1, 10,11.,2 (lexical ordering). Any workaround for this ? Thanks.
Re: Question regarding secondary indices
Thanks Aaron for the response. I see those logs.

I had one more question. It looks like sstableloader takes only one directory at a time. Is it possible to load multiple directories in one call, something like sstableloader /drive1/keyspace1 /drive2/keyspace1...? This way one can take advantage of the speedup that you get from reading across multiple drives. Or alternatively, is it possible to run multiple instances of sstableloader on the same machine concurrently?

Thanks!

On Thu, Mar 15, 2012 at 6:54 PM, aaron morton aa...@thelastpickle.com wrote:

You should see a log line with "Index build of {} complete". You can also see which indexes are built using the describe command in cassandra-cli:

[default@XX] describe;
Keyspace: XX:
...
Column Families:
ColumnFamily: XXX
...
Built indexes: []

Cheers
- Aaron Morton
Freelance Developer
@aaronmorton
http://www.thelastpickle.com

On 16/03/2012, at 10:04 AM, Sanjeev Kulkarni wrote:

Hi,

I'm using a 4 node cassandra cluster running 0.8.10 with rf=3. It's a brand new setup. I have a single column family which contains about 10 columns. I have enabled secondary indices on 3 of them. I used sstableloader to bulk load some data into this cluster. I poked around the logs and saw messages like "Submitting index build of attr_001 ..", which indicates that cassandra has started building indices. How will I know when the building of the indices is done? Are there some log messages that I should look for?

Thanks!
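A sketch of the two checks Aaron mentions. The keyspace name is a placeholder, the log path is a common default that may differ on your install, and the exact log wording can vary between releases, so treat the grep patterns as a starting point.

# grep the system log for the start and end of each index build
grep "Submitting index build" /var/log/cassandra/system.log
grep "Index build of" /var/log/cassandra/system.log    # look for "... complete"

# in cassandra-cli: an index only appears under "Built indexes" once it is usable
use MyKeyspace;
describe;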