upgradesstables/cleanup/compaction strategy change
Hi! I have a 2.0.13 cluster which I have just extended, and I'm now looking into upgrading it to 2.1.

* The cleanup after the extension is only partially done.
* I'm also looking into changing a few tables to Leveled Compaction Strategy.

In the interest of speeding things up by avoiding unnecessary rewrites of data, I'm pondering whether I can:

1. Upgrade to 2.1, then run cleanup instead of upgradesstables, getting cleanup + upgrade of the sstable format to "ka" at the same time?
2. Upgrade to 2.1, then change the compaction strategy, getting LCS + upgrade of the sstable format to "ka" at the same time?

Comments on that?

Thanks,
\EF
Re: Extending a partially upgraded cluster - supported
On 2016-05-18 20:19, Jeff Jirsa wrote:
> You can’t stream between versions, so in order to grow the cluster, you’ll need to be entirely on 2.0 or entirely on 2.1.

OK. I was sure you can't stream between a 2.0 node and a 2.1 node, but if I understand you correctly, you also can't stream between two 2.1 nodes unless the sstables on the source node have been upgraded to "ka", i.e. the 2.1 sstable version?

Looks like it's extend first, upgrade later, given that we're a bit close on disk capacity.

Thanks,
\EF

> If you go to 2.1 first, be sure you run upgradesstables before you try to extend the cluster.
>
> On 5/18/16, 11:17 AM, "Erik Forsberg" <forsb...@opera.com> wrote:
> > Hi! I have a 2.0.13 cluster which I need to do two things with:
> >
> > * Extend it
> > * Upgrade to 2.1.14
> >
> > I'm pondering in what order to do things. Is it a supported operation to extend a partially upgraded cluster, i.e. a cluster upgraded to 2.1 where not all sstables have been upgraded? If I do that, will the sstables written on the new nodes be in the 2.1 format when I add them? Or will they be written in the 2.0 format, so I'll have to run upgradesstables anyway? The cleanup I do on the existing nodes will write the new 2.1 format, right?
> >
> > There might be other reasons not to do this, one being that it's seldom wise to do many operations at once. So please enlighten me on how bad an idea this is :-)
> >
> > Thanks,
> > \EF
Extending a partially upgraded cluster - supported
Hi! I have a 2.0.13 cluster which I need to do two things with:

* Extend it
* Upgrade to 2.1.14

I'm pondering in what order to do things. Is it a supported operation to extend a partially upgraded cluster, i.e. a cluster upgraded to 2.1 where not all sstables have been upgraded?

If I do that, will the sstables written on the new nodes be in the 2.1 format when I add them? Or will they be written in the 2.0 format, so I'll have to run upgradesstables anyway? The cleanup I do on the existing nodes will write the new 2.1 format, right?

There might be other reasons not to do this, one being that it's seldom wise to do many operations at once. So please enlighten me on how bad an idea this is :-)

Thanks,
\EF
Lots of hints, but only on a few nodes
I have this situation where a few (like, 3-4 out of 84) nodes misbehave: very long GC pauses, dropping out of the cluster etc. This happens while loading data (via CQL), and analyzing metrics it looks like on these few nodes, a lot of hints are being generated close to the time when they start to misbehave. Since this is Cassandra 2.0.13, which has a less than optimal hints implementation, large numbers of hints are a GC troublemaker.

Again looking at metrics, it looks like hints are being generated for a large number of nodes, so it doesn't look like the destination nodes are at fault.

So, I'm confused. Any hints (pun intended) on what could cause a few nodes to generate more hints than the rest of the cluster?

Regards,
\EF
Re: A few misbehaving nodes
On 2016-04-19 15:54, sai krishnam raju potturi wrote:
> hi; do we see any hung process like Repairs on those 3 nodes? what does "nodetool netstats" show?

No hung processes from what I can see.

root@cssa02-06:~# nodetool tpstats
Pool Name                    Active   Pending      Completed   Blocked  All time blocked
ReadStage                         0         0        1530227         0                 0
RequestResponseStage              0         0       19230947         0                 0
MutationStage                     0         0       37059234         0                 0
ReadRepairStage                   0         0          80178         0                 0
ReplicateOnWriteStage             0         0              0         0                 0
GossipStage                       0         0          43003         0                 0
CacheCleanupExecutor              0         0              0         0                 0
MigrationStage                    0         0              0         0                 0
MemoryMeter                       0         0            267         0                 0
FlushWriter                       0         0            202         0                 5
ValidationExecutor                0         0            212         0                 0
InternalResponseStage             0         0              0         0                 0
AntiEntropyStage                  0         0            427         0                 0
MemtablePostFlusher               0         0            669         0                 0
MiscStage                         0         0            212         0                 0
PendingRangeCalculator            0         0             70         0                 0
CompactionExecutor                0         0           1206         0                 0
commitlog_archiver                0         0              0         0                 0
HintedHandoff                     0         1            113         0                 0

Message type           Dropped
RANGE_SLICE                  1
READ_REPAIR                  0
PAGED_RANGE                  0
BINARY                       0
READ                       219
MUTATION                     3
_TRACE                       0
REQUEST_RESPONSE             2
COUNTER_MUTATION             0

root@cssa02-06:~# nodetool netstats
Mode: NORMAL
Not sending any streams.
Read Repair Statistics:
Attempted: 75317
Mismatch (Blocking): 0
Mismatch (Background): 11
Pool Name                    Active   Pending      Completed
Commands                        n/a         1       19248846
Responses                       n/a         0       19875699

\EF
How are writes handled while adding nodes to cluster?
Hi! How are writes handled while I'm adding a node to a cluster, i.e. while the new node is in JOINING state? Are they queued up as hinted handoffs, or are they being written to the joining node? In the former case I guess I have to make sure my max_hint_window_in_ms is long enough for the node to become NORMAL or hints will get dropped and I must do repair. Am I right? Thanks, \EF
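A quick arithmetic sketch of the worry in that last sentence (the 3-hour figure is just the stock default for max_hint_window_in_ms; yours may differ):

```python
# Sketch: will hints written at the start of a bootstrap still be inside
# the hint window when the new node reaches NORMAL? The 3 h default for
# max_hint_window_in_ms is an assumption; check your cassandra.yaml.

def hints_survive(bootstrap_seconds, max_hint_window_in_ms=3 * 60 * 60 * 1000):
    """True if the whole bootstrap fits inside the hint window, so no
    hints are dropped and no repair is needed afterwards."""
    return bootstrap_seconds * 1000 <= max_hint_window_in_ms

print(hints_survive(2 * 3600))  # 2 h bootstrap fits a 3 h window -> True
print(hints_survive(5 * 3600))  # 5 h bootstrap does not -> False
```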
One node misbehaving (lots of GC), ideas?
Hi! We're having problems with one node (out of 56 in total) misbehaving. Symptoms are:

* High number of full CMS old space collections during early morning when we're doing bulkloads. Yes, bulkloads, not CQL, and only a few thrift insertions.
* Really long stop-the-world GC events (I've seen up to 50 seconds) for both CMS and ParNew.
* CPU usage higher during early morning hours compared to other nodes.
* The large number of garbage collections *seems* to correspond to doing a lot of compactions (SizeTiered for most of our CFs, Leveled for a few small ones).
* Node losing track of what other nodes are up and keeping that state until restart (this I think is a bug triggered by the GC behaviour, with the stop-the-world pauses making the node not accept gossip connections from other nodes).

This is on 2.0.13 with vnodes (256 per node). All other nodes behave normally, with a few (2-3) full CMS old space collections in the same 3h period in which the trouble node does some 30. Heap space is 8G, with NEW_SIZE set to 800M. With 6G/800M the problem was even worse (it seems; this is a bit hard to debug as it happens *almost* every night).

nodetool status shows that although we have a certain unbalance in the cluster, this node is neither the most nor the least loaded. I.e. we have between 1.6% and 2.1% in the Owns column, and the troublesome node reports 1.7%. All nodes are under puppet control, so configuration is the same everywhere.

We're running NetworkTopologyStrategy with rack awareness, and here's a deviation from recommended settings - we have slightly varying numbers of nodes in the racks:

15 cssa01
15 cssa02
13 cssa03
13 cssa04

The affected node is in the cssa04 rack. Could this mean I have some kind of hotspot situation? Why would that show up as more GC work?

I'm quite puzzled here, so I'm looking for hints on how to identify what is causing this.

Regards,
\EF
Re: Cluster status instability
To elaborate a bit on what Marcin said:

* Once a node starts to believe that a few other nodes are down, it seems to stay that way for a very long time (hours). I'm not even sure it will recover without a restart.
* I've tried to stop and then start gossip with nodetool on the node that thinks several other nodes are down. Did not help.
* nodetool gossipinfo, when run on an affected node, claims STATUS:NORMAL for all nodes (including the ones marked as down in status output).
* It is quite possible that the problem starts at the time of day when we have a lot of bulkloading going on. But why does it then stay for several hours after the load goes down?
* I have the feeling this started with our upgrade from 1.2.18 to 2.0.12 about a month ago, but I have no hard data to back that up.

Regarding region/snitch - this is not an AWS deployment, we run in our own datacenter with GossipingPropertyFileSnitch.

Right now I have this situation with one node (04-05) thinking that there are 4 nodes down. The rest of the cluster (56 nodes in total) thinks all nodes are up. Load on the cluster right now is minimal, there's no GC going on. Heap usage is approximately 3.5/6 GB.
root@cssa04-05:~# nodetool status | grep DN
DN  2001:4c28:1:413:0:1:2:5    1.07 TB    256  1.8%  114ff46e-57d0-40dd-87fb-3e4259e96c16  rack2
DN  2001:4c28:1:413:0:1:2:6    1.06 TB    256  1.8%  b161a6f3-b940-4bba-9aa3-cfb0fc1fe759  rack2
DN  2001:4c28:1:413:0:1:2:13   896.82 GB  256  1.6%  4a488366-0db9-4887-b538-4c5048a6d756  rack2
DN  2001:4c28:1:413:0:1:3:7    1.04 TB    256  1.8%  95cf2cdb-d364-4b30-9b91-df4c37f3d670  rack3

Excerpt from nodetool gossipinfo showing one node that status thinks is down (2:5) and one that status thinks is up (3:12):

/2001:4c28:1:413:0:1:2:5
  generation:1427712750
  heartbeat:2310212
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack2
  LOAD:1.172524771195E12
  INTERNAL_IP:2001:4c28:1:413:0:1:2:5
  HOST_ID:114ff46e-57d0-40dd-87fb-3e4259e96c16
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100493381707736523347375230104768602825
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda
/2001:4c28:1:413:0:1:3:12
  generation:1427714889
  heartbeat:2305710
  NET_VERSION:7
  RPC_ADDRESS:0.0.0.0
  RELEASE_VERSION:2.0.13
  RACK:rack3
  LOAD:1.047542503234E12
  INTERNAL_IP:2001:4c28:1:413:0:1:3:12
  HOST_ID:bb20ddcb-0a14-4d91-b90d-fb27536d6b00
  DC:iceland
  SEVERITY:0.0
  STATUS:NORMAL,100163259989151698942931348962560111256
  SCHEMA:4b994277-19a5-3458-b157-f69ef9ad3cda

I also tried disablegossip + enablegossip on 02-05 to see if that made 04-05 mark it as up, with no success.

Please let me know what other debug information I can provide.

Regards,
\EF

On Thu, Apr 2, 2015 at 6:56 PM, daemeon reiydelle <daeme...@gmail.com> wrote:
> Do you happen to be using a tool like Nagios or Ganglia that is able to report utilization (CPU, load, disk I/O, network)? There are plugins for both that will also notify you (depending on whether you enabled the intermediate GC logging) about what is happening.

On Thu, Apr 2, 2015 at 8:35 AM, Jan <cne...@yahoo.com> wrote:
> Marcin: are all your nodes within the same region? If not in the same region, what is the snitch type that you are using?
Jan/

On Thursday, April 2, 2015 3:28 AM, Michal Michalski <michal.michal...@boxever.com> wrote:

Hey Marcin,

Are they actually going up and down repeatedly (flapping), or do they just go down and never come back? There might be different reasons for flapping nodes, but to list what I have at the top of my head right now:

1. Network issues. I don't think it's your case, but you can read about the issues some people are having when deploying C* on AWS EC2 (keyword to look for: phi_convict_threshold).
2. Heavy load. The node is under heavy load because of a massive number of reads / writes / bulkloads, or e.g. unthrottled compaction etc., which may result in extensive GC.

Could any of these be a problem in your case? I'd start by investigating GC logs, e.g. to see how long the stop-the-world full GC takes (GC logs should be on by default from what I can see [1]).

[1] https://issues.apache.org/jira/browse/CASSANDRA-5319

Michał

Kind regards,
Michał Michalski, michal.michal...@boxever.com

On 2 April 2015 at 11:05, Marcin Pietraszek <mpietras...@opera.com> wrote:

Hi! We have a 56 node cluster with C* 2.0.13 + the CASSANDRA-9036 patch installed. Assume we have nodes A, B, C, D, E. On some irregular basis, one of those nodes starts to report that a subset of the other nodes is in DN state, although the C* daemon on all nodes is running:

A$ nodetool status
UN B
DN C
DN D
UN E

B$ nodetool status
UN A
UN C
UN D
UN E

C$ nodetool status
DN A
UN B
UN D
UN E

After a restart of node A, C and D report that A is in UN, and A also claims that the whole cluster is in UN state. Right now I don't have any clear steps to reproduce that situation; do you guys have any idea what could be causing such behaviour? How this
changes to metricsReporterConfigFile require restart of cassandra?
Hi! I was pleased to find out that cassandra 2.0.x has added support for pluggable metrics export, which even includes a graphite metrics sender. Question: Will changes to the metricsReporterConfigFile require a restart of cassandra to take effect? I.e, if I want to add a new exported metric to that file, will I have to restart my cluster? Thanks, \EF
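For context, a sketch of what I mean by that file - this is roughly the shape of a graphite section for the metrics-reporter-config library, with made-up host, prefix and pattern; the authoritative schema is the library's own documentation, so treat the field names below as assumptions to verify:

```yaml
graphite:
  -
    period: 60
    timeunit: 'SECONDS'
    prefix: 'cassandra.cluster1'        # made-up prefix
    hosts:
      - host: 'graphite.example.com'    # placeholder host
        port: 2003
    predicate:
      color: 'white'
      useQualifiedName: true
      patterns:
        - '^org.apache.cassandra.metrics.+'
```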
Anonymous user in permissions system?
Hi! Is there such a thing as the anonymous/unauthenticated user in the cassandra permissions system? What I would like to do is to grant select, i.e. provide read-only access, to users which have not presented a username and password. Then grant update/insert to other users which have presented a username and (correct) password. Doable? Regards, \EF
Re: Working with legacy data via CQL
On 2014-11-19 01:37, Robert Coli wrote: Thanks, I can reproduce the issue with that, and I should be able to look into it tomorrow. FWIW, I believe the issue is server-side, not in the driver. I may be able to suggest a workaround once I figure out what's going on. Is there a JIRA tracking this issue? I like being aware of potential issues with legacy tables ... :D I created one, just for you! :-) https://issues.apache.org/jira/browse/CASSANDRA-8339 \EF
Re: Working with legacy data via CQL
On 2014-11-15 01:24, Tyler Hobbs wrote:
> What version of cassandra did you originally create the column family in? Have you made any schema changes to it through cql or cassandra-cli, or has it always been exactly the same?

Oh, that's a tough question given that the cluster has been around since 2011. The CF was probably created in Cassandra 0.7 or 0.8 via thrift calls from pycassa, and I don't think there have been any schema changes to it since.

Thanks,
\EF

On Wed, Nov 12, 2014 at 2:06 AM, Erik Forsberg <forsb...@opera.com> wrote:

On 2014-11-11 19:40, Alex Popescu wrote:
> On Tuesday, November 11, 2014, Erik Forsberg <forsb...@opera.com> wrote:
> You'll have better chances to get an answer about the Python driver on its own mailing list https://groups.google.com/a/lists.datastax.com/forum/#!forum/python-driver-user

As I said, this also happens when using cqlsh:

cqlsh:test> SELECT column1,value from "Users" where key = a6b07340-047c-4d4c-9a02-1b59eabf611c and column1 = 'date_created';

 column1      | value
--------------+------------------------------
 date_created | '\x00\x00\x00\x00Ta\xf3\xe0'

(1 rows)

Failed to decode value '\x00\x00\x00\x00Ta\xf3\xe0' (for column 'value') as text: 'utf8' codec can't decode byte 0xf3 in position 6: unexpected end of data

So let me rephrase: How do I work with data where the table has metadata that makes some columns differ from the main validation class? From cqlsh, or the python driver, or any driver?

Thanks,
\EF

--
Tyler Hobbs
DataStax <http://datastax.com/>
Re: Working with legacy data via CQL
On 2014-11-17 09:56, Erik Forsberg wrote:
> On 2014-11-15 01:24, Tyler Hobbs wrote:
> > What version of cassandra did you originally create the column family in? Have you made any schema changes to it through cql or cassandra-cli, or has it always been exactly the same?
> Oh, that's a tough question given that the cluster has been around since 2011. The CF was probably created in Cassandra 0.7 or 0.8 via thrift calls from pycassa, and I don't think there have been any schema changes to it since.

Actually, I don't think it matters. I created a minimal repeatable set of python code (see below). Running that against a 2.0.11 server - creating a fresh keyspace and CF, inserting some data with thrift/pycassa, then trying to extract the data that has a different validation class - the python-driver and cqlsh bail out.

cqlsh example after running the below script:

cqlsh:badcql> select * from "Users" where column1 = 'default_account_id' ALLOW FILTERING;
value \xf9\x8bu}!\xe9C\xbb\xa7=\xd0\x8a\xff';\xe5 (in col 'value') can't be deserialized as text: 'utf8' codec can't decode byte 0xf9 in position 0: invalid start byte

cqlsh:badcql> select * from "Users" where column1 = 'date_created' ALLOW FILTERING;
value '\x00\x00\x00\x00Ti\xe0\xbe' (in col 'value') can't be deserialized as text: 'utf8' codec can't decode bytes in position 6-7: unexpected end of data

So the question remains - how do I work with this data from cqlsh and / or the python driver?
Thanks,
\EF

--repeatable example--

#!/usr/bin/env python
# Run this in a virtualenv with pycassa and cassandra-driver installed via pip
import pycassa
import cassandra
import calendar
import traceback
import time
from uuid import uuid4

keyspace = "badcql"

sysmanager = pycassa.system_manager.SystemManager("localhost")
sysmanager.create_keyspace(keyspace, strategy_options={'replication_factor': '1'})
sysmanager.create_column_family(keyspace, "Users",
    key_validation_class=pycassa.system_manager.LEXICAL_UUID_TYPE,
    comparator_type=pycassa.system_manager.ASCII_TYPE,
    default_validation_class=pycassa.system_manager.UTF8_TYPE)
sysmanager.create_index(keyspace, "Users", "username", pycassa.system_manager.UTF8_TYPE)
sysmanager.create_index(keyspace, "Users", "email", pycassa.system_manager.UTF8_TYPE)
sysmanager.alter_column(keyspace, "Users", "default_account_id", pycassa.system_manager.LEXICAL_UUID_TYPE)
sysmanager.create_index(keyspace, "Users", "active", pycassa.system_manager.INT_TYPE)
sysmanager.alter_column(keyspace, "Users", "date_created", pycassa.system_manager.LONG_TYPE)

pool = pycassa.pool.ConnectionPool(keyspace, ['localhost:9160'])
cf = pycassa.ColumnFamily(pool, "Users")
user_uuid = uuid4()
cf.insert(user_uuid, {'username': 'test_username',
                      'auth_method': 'ldap',
                      'email': 't...@example.com',
                      'active': 1,
                      'date_created': long(calendar.timegm(time.gmtime())),
                      'default_account_id': uuid4()})

from cassandra.cluster import Cluster
cassandra_cluster = Cluster(["localhost"])
cassandra_session = cassandra_cluster.connect(keyspace)

print "username", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'username',))
print "email", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'email',))

try:
    print "default_account_id", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'default_account_id',))
except Exception as e:
    print "Exception trying to get default_account_id", traceback.format_exc()

cassandra_session = cassandra_cluster.connect(keyspace)
try:
    print "active", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'active',))
except Exception as e:
    print "Exception trying to get active", traceback.format_exc()

cassandra_session = cassandra_cluster.connect(keyspace)
try:
    print "date_created", cassandra_session.execute('SELECT value from "Users" where key = %s and column1 = %s', (user_uuid, 'date_created',))
except Exception as e:
    print "Exception trying to get date_created", traceback.format_exc()

-- end of example --
Re: Working with legacy data via CQL
On 2014-11-11 19:40, Alex Popescu wrote:
> On Tuesday, November 11, 2014, Erik Forsberg <forsb...@opera.com> wrote:
> You'll have better chances to get an answer about the Python driver on its own mailing list https://groups.google.com/a/lists.datastax.com/forum/#!forum/python-driver-user

As I said, this also happens when using cqlsh:

cqlsh:test> SELECT column1,value from "Users" where key = a6b07340-047c-4d4c-9a02-1b59eabf611c and column1 = 'date_created';

 column1      | value
--------------+------------------------------
 date_created | '\x00\x00\x00\x00Ta\xf3\xe0'

(1 rows)

Failed to decode value '\x00\x00\x00\x00Ta\xf3\xe0' (for column 'value') as text: 'utf8' codec can't decode byte 0xf3 in position 6: unexpected end of data

So let me rephrase: How do I work with data where the table has metadata that makes some columns differ from the main validation class? From cqlsh, or the python driver, or any driver?

Thanks,
\EF
Working with legacy data via CQL
Hi! I have some data in a table created using thrift. In cassandra-cli, the 'show schema' output for this table is:

create column family Users
  with column_type = 'Standard'
  and comparator = 'AsciiType'
  and default_validation_class = 'UTF8Type'
  and key_validation_class = 'LexicalUUIDType'
  and column_metadata = [
    {column_name : 'date_created', validation_class : LongType},
    {column_name : 'active', validation_class : IntegerType, index_name : 'Users_active_idx_1', index_type : 0},
    {column_name : 'email', validation_class : UTF8Type, index_name : 'Users_email_idx_1', index_type : 0},
    {column_name : 'username', validation_class : UTF8Type, index_name : 'Users_username_idx_1', index_type : 0},
    {column_name : 'default_account_id', validation_class : LexicalUUIDType}];

From cqlsh, it looks like this:

[cqlsh 4.1.1 | Cassandra 2.0.11 | CQL spec 3.1.1 | Thrift protocol 19.39.0]
Use HELP for help.
cqlsh:test> describe table "Users";

CREATE TABLE "Users" (
  key 'org.apache.cassandra.db.marshal.LexicalUUIDType',
  column1 ascii,
  active varint,
  date_created bigint,
  default_account_id 'org.apache.cassandra.db.marshal.LexicalUUIDType',
  email text,
  username text,
  value text,
  PRIMARY KEY ((key), column1)
) WITH COMPACT STORAGE;

CREATE INDEX Users_active_idx_12 ON "Users" (active);
CREATE INDEX Users_email_idx_12 ON "Users" (email);
CREATE INDEX Users_username_idx_12 ON "Users" (username);

Now, when I try to extract data from this using cqlsh or the python-driver, I have no problems getting data for the columns which are actually UTF8, but for those where column_metadata has been set to something else, there's trouble.
Example using the python driver:

-- snip --
In [8]: u = uuid.UUID("a6b07340-047c-4d4c-9a02-1b59eabf611c")

In [9]: sess.execute('SELECT column1,value from "Users" where key = %s and column1 = %s', [u, 'username'])
Out[9]: [Row(column1='username', value=u'uc6vf')]

In [10]: sess.execute('SELECT column1,value from "Users" where key = %s and column1 = %s', [u, 'date_created'])
---------------------------------------------------------------------------
UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-10-d06f98a160e1> in <module>()
----> 1 sess.execute('SELECT column1,value from "Users" where key = %s and column1 = %s', [u, 'date_created'])

/home/forsberg/dev/virtualenvs/ospapi/local/lib/python2.7/site-packages/cassandra/cluster.pyc in execute(self, query, parameters, timeout, trace)
   1279         future = self.execute_async(query, parameters, trace)
   1280         try:
-> 1281             result = future.result(timeout)
   1282         finally:
   1283             if trace:

/home/forsberg/dev/virtualenvs/ospapi/local/lib/python2.7/site-packages/cassandra/cluster.pyc in result(self, timeout)
   2742             return PagedResult(self, self._final_result)
   2743         elif self._final_exception:
-> 2744             raise self._final_exception
   2745         else:
   2746             raise OperationTimedOut(errors=self._errors, last_host=self._current_host)

UnicodeDecodeError: 'utf8' codec can't decode byte 0xf3 in position 6: unexpected end of data
-- snap --

cqlsh gives me similar errors. Can I tell the python driver to parse some column values as integers, or is this an unsupported case? For sure this is an ugly table, but I have data in it, and I would like to avoid having to rewrite all my tools at once, so if I could support it from CQL that would be great.

Regards,
\EF
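One workaround sketch (plain Python, not a driver feature): grab the raw bytes before UTF-8 decoding and unpack them yourself according to the thrift validation class. Using the date_created value from the cqlsh output earlier in the thread:

```python
import struct
import uuid
import datetime

# Raw value of the 'date_created' column as shown by cqlsh.
raw = b'\x00\x00\x00\x00Ta\xf3\xe0'

# LongType is a big-endian signed 64-bit integer.
(date_created,) = struct.unpack('>q', raw)
print(date_created)  # 1415705568, i.e. seconds since the epoch
print(datetime.datetime.utcfromtimestamp(date_created))  # 2014-11-11 11:32:48

# A LexicalUUIDType value is just the 16 raw bytes of the UUID:
def decode_lexical_uuid(raw_bytes):
    return uuid.UUID(bytes=raw_bytes)
```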
Running out of disk at bootstrap in low-disk situation
Hi! We have unfortunately managed to put ourselves in a situation where we are really close to full disks on our existing 27 nodes. We are now trying to add 15 more nodes, but are running into problems with running out of disk space on the new nodes while they are joining.

We're using vnodes, on Cassandra 1.2.18 (yes, I know that's old, and I'll upgrade as soon as I'm out of this problematic situation). I've added all 15 nodes, with some time in between - definitely more than the 2-minute rule. But it seems like compaction is not keeping up with the incoming data. Or at least that's my theory.

What are the recommended settings to avoid this problem? I have now set the compaction throughput to 0 for unlimited compaction bandwidth, hoping that will help (will it?)

Will it help to lower the streaming throughput too? I'm unsure about the latter, since from observation it seems that compaction will not start until the node has finished streaming from a source node. With 27 nodes sharing the incoming bandwidth, all of them will take equally long to finish, and only then can the compaction occur. I guess I could limit the streaming bandwidth on some of the source nodes too. Or am I completely wrong here?

Other ideas most welcome.

Regards,
\EF
Restart joining node
Hi! On the same subject as before - due to full disk during bootstrap, my joining nodes are stuck. What's the correct procedure here, will a plain restart of the node do the right thing, i.e. continue where bootstrap stopped, or is it better to clean the data directories before new start of daemon? Regards, \EF
Re: LeveledCompaction, streaming bulkload, and lots of small sstables
On 2014-08-18 19:52, Robert Coli wrote:
> On Mon, Aug 18, 2014 at 6:21 AM, Erik Forsberg <forsb...@opera.com> wrote:
> > Is there some configuration knob I can tune to make this happen faster? I'm getting a bit confused by the description for min_sstable_size, bucket_high, bucket_low etc - and I'm not sure if they apply in this case.
> You probably don't want to use multi-threaded compaction, it is removed upstream.
>
> nodetool setcompactionthroughput 0
>
> Assuming you have enough IO headroom etc.

OK. I disabled multithreaded compaction and gave it a bit more throughput to play with, but I still don't think that's the full story. What I see is the following:

1) My hadoop cluster is bulkloading around 1000 sstables to the Cassandra cluster.
2) Cassandra will start compacting. With SizeTiered, I would see multiple ongoing compactions on the CF in question, each taking on 32 sstables and compacting them to one, all of them running at the same time. With Leveled, I see only one compaction, taking on 32 sstables and compacting them to one. When that finishes, it starts another one.

So it's essentially a serial process, and it takes much longer than it does with SizeTiered. While this compaction is ongoing, read performance is not very good.

http://www.datastax.com/dev/blog/performance-improvements-in-cassandra-1-2 mentions that LCS is parallelized in Cassandra 1.2, but maybe that patch doesn't cover my use case (although I realize that my use case is maybe a bit weird).

So my question is whether this is something I can tune? I'm running 1.2.18 now, but am strongly considering an upgrade to 2.0.X.

Regards,
\EF
LeveledCompaction, streaming bulkload, and lots of small sstables
Hi! I'm bulkloading via streaming from Hadoop to my Cassandra cluster. This results in a rather large set of relatively small (~1MiB) sstables, as the number of mappers that generate sstables on the hadoop cluster is high.

With SizeTieredCompactionStrategy, the cassandra cluster would quickly compact all these small sstables into decently sized sstables. With LeveledCompactionStrategy however, it takes much longer. I have multithreaded_compaction: true, but it is only taking on 32 sstables at a time in one single compaction task, so when it starts with ~1500 sstables, it takes quite some time. I'm not running out of I/O.

Is there some configuration knob I can tune to make this happen faster? I'm getting a bit confused by the description for min_sstable_size, bucket_high, bucket_low etc - and I'm not sure if they apply in this case.

I'm pondering options for decreasing the number of sstables being streamed from the hadoop side, but whether that is possible remains to be seen.

Thanks!
\EF
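To put a number on why this is slow, a toy model (my own illustration, not Cassandra's actual compaction planner) of merging ~1500 tables one 32-way task at a time:

```python
def serial_compactions(n_sstables, fan_in=32):
    """Count the sequential 32-way compactions needed to merge everything
    down to one sstable when only a single task runs at a time (the
    behaviour observed with LCS here). Toy model, not the real planner."""
    tasks = 0
    while n_sstables > 1:
        merged = min(fan_in, n_sstables)
        n_sstables = n_sstables - merged + 1  # merged tables become one
        tasks += 1
    return tasks

print(serial_compactions(1500))  # 49 compactions back-to-back
print(serial_compactions(32))    # a single compaction
```

With SizeTiered running many of these tasks concurrently, the wall-clock difference is roughly the degree of parallelism lost.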
sstableloader and ttls
Hi! If I use sstableloader to load data to a cluster, and the source sstables contain some columns where the TTL has expired, i.e. the sstable has not yet been compacted - will those entries be properly removed on the destination side? Thanks, \EF
Running sstableloader from live Cassandra server
Hi! I'm looking into moving some data from one Cassandra cluster to another, both of them running Cassandra 1.2.13 (or maybe some later 1.2 version, if that helps me avoid some fatal bug). Sstableloader will probably be the right tool for me, and given the size of my tables, I will want to run sstableloader on the source cluster. At the same time, that source cluster needs to keep running to serve data to clients.

If I understand the docs right, this means I will have to:

1. Bring up a new network interface on each of my source nodes. No problem, I have an IPv6 /64 to choose from :-)
2. Put a cassandra.yaml in the classpath of sstableloader that differs from the one in /etc/cassandra/conf, i.e. the one used by the source cluster's cassandra, with the following:
   * listen_address set to my new interface.
   * rpc_address set to my new interface.
   * rpc_port set as on the destination cluster (i.e. 9160)
   * cluster_name set as on the destination cluster.
   * storage_port as on the destination cluster (i.e. 7000)

Given the above, I should be able to run sstableloader on the nodes of my source cluster, even with the source cluster's cassandra daemon running. Am I right, or did I miss anything?

Thanks,
\EF
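To make step 2 concrete, the overriding cassandra.yaml I have in mind would look something like this (the cluster name and addresses are placeholders, and this is my reading of the docs rather than a tested config):

```yaml
cluster_name: 'destination-cluster'   # as on the destination cluster (placeholder name)
listen_address: '2001:db8::100'       # the new, extra interface (placeholder address)
rpc_address: '2001:db8::100'          # placeholder address
rpc_port: 9160                        # as on the destination cluster
storage_port: 7000                    # as on the destination cluster
```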
EOFException in bulkloader, then IllegalStateException
Hi! I'm bulkloading from Hadoop to Cassandra. Currently in the process of moving to new hardware for both Hadoop and Cassandra, and while test-running bulkload, I see the following error:

Exception in thread "Streaming to /2001:4c28:1:413:0:1:1:12:1" java.lang.RuntimeException: java.io.EOFException
        at com.google.common.base.Throwables.propagate(Throwables.java:155)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)
Caused by: java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:375)
        at org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:193)
        at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180)
        at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        ... 3 more

I see no exceptions related to this on the destination node (2001:4c28:1:413:0:1:1:12:1).
This makes the whole map task fail with:

2014-01-27 10:46:50,878 ERROR org.apache.hadoop.security.UserGroupInformation: PriviledgedActionException as:forsberg (auth:SIMPLE) cause:java.io.IOException: Too many hosts failed: [/2001:4c28:1:413:0:1:1:12]
2014-01-27 10:46:50,878 WARN org.apache.hadoop.mapred.Child: Error running child
java.io.IOException: Too many hosts failed: [/2001:4c28:1:413:0:1:1:12]
        at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:244)
        at org.apache.cassandra.hadoop.BulkRecordWriter.close(BulkRecordWriter.java:209)
        at org.apache.hadoop.mapred.MapTask$NewDirectOutputCollector.close(MapTask.java:540)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:650)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:322)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:266)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1278)
        at org.apache.hadoop.mapred.Child.main(Child.java:260)
2014-01-27 10:46:50,880 INFO org.apache.hadoop.mapred.Task: Runnning cleanup for the task

The failed task was on hadoop worker node hdp01-12-4. However, hadoop later retries this map task on a different hadoop worker node (hdp01-10-2), and that retry succeeds. So that's weird, but I could live with it.
Now, however, comes the real trouble - the hadoop job does not finish, due to one task running on hdp01-12-4 being stuck with this:

Exception in thread "Streaming to /2001:4c28:1:413:0:1:1:12:1" java.lang.IllegalStateException: target reports current file is /opera/log2/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_000473_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db but is /opera/log6/hadoop/mapred/local/taskTracker/forsberg/jobcache/job_201401161243_0288/attempt_201401161243_0288_m_00_0/work/tmp/iceland_test/Data_hourly/iceland_test-Data_hourly-ib-1-Data.db
        at org.apache.cassandra.streaming.StreamOutSession.validateCurrentFile(StreamOutSession.java:154)
        at org.apache.cassandra.streaming.StreamReplyVerbHandler.doVerb(StreamReplyVerbHandler.java:45)
        at org.apache.cassandra.streaming.FileStreamTask.receiveReply(FileStreamTask.java:199)
        at org.apache.cassandra.streaming.FileStreamTask.stream(FileStreamTask.java:180)
        at org.apache.cassandra.streaming.FileStreamTask.runMayThrow(FileStreamTask.java:91)
        at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:895)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:918)
        at java.lang.Thread.run(Thread.java:662)

This just sits there forever, or at least until the hadoop task timeout kicks in.

So two questions here:

1) Any clues on what might cause the first EOFException? It seems to appear for *some* of my bulkloads. Not all, but frequently enough to be a problem - every 10th bulkload I do seems to have the problem.

2) The second problem I have a feeling could be related to https://issues.apache.org/jira/browse/CASSANDRA-4223, but with the extra quirk that in the bulkload case, we have *multiple java processes* creating streaming sessions on the same host, so streaming session IDs are not unique.
I'm thinking 2) happens because the EOFException made the streaming session in 1) sit around on the target node without being closed. This is on Cassandra 1.2.1. I know that's pretty old, but I would like to avoid upgrading until I have made this migration from old to new hardware. Upgrading to 1.2.13 might be an option. Thanks, \EF
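To make the session-ID collision in question 2 concrete (this is my reading of the mechanism, mirroring the CASSANDRA-4223 discussion rather than quoting Cassandra's actual code): if every JVM derives its stream session IDs from a per-process counter, two mapper processes on the same source host can present identical IDs to the target node. A toy sketch:

```python
import itertools


def make_session_ids(n):
    """Per-process counter, as in the problematic scheme: every JVM
    starts counting from 1, so two processes on one host emit the
    same sequence of session IDs."""
    counter = itertools.count(1)
    return [next(counter) for _ in range(n)]


def make_qualified_ids(pid, n):
    """Collision-free variant: qualify the counter with something
    process-unique (pids here are made-up for illustration)."""
    counter = itertools.count(1)
    return [(pid, next(counter)) for _ in range(n)]


# Two mapper JVMs streaming from the same host
jvm_a = make_session_ids(5)
jvm_b = make_session_ids(5)
print("colliding ids:", sorted(set(jvm_a) & set(jvm_b)))  # all five collide

qualified_a = make_qualified_ids(101, 5)
qualified_b = make_qualified_ids(202, 5)
print("colliding ids:", set(qualified_a) & set(qualified_b))  # none
```

On the target node, a colliding (host, session id) pair would make replies for one mapper's session get routed to the other's, which matches the "target reports current file is X but is Y" symptom above.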
Re: EOFException in bulkloader, then IllegalStateException
On 2014-01-27 12:56, Erik Forsberg wrote: This is on Cassandra 1.2.1. I know that's pretty old, but I would like to avoid upgrading until I have made this migration from old to new hardware. Upgrading to 1.2.13 might be an option. Update: Exactly the same behaviour on Cassandra 1.2.13. Thanks, \EF
Graveyard compactions, when do they occur?
Hi! I was trying out the truncate command in cassandra-cli. http://wiki.apache.org/cassandra/CassandraCli08 says "A snapshot of the data is created, which is deleted asynchronously during a 'graveyard' compaction." When do graveyard compactions happen? Do I have to trigger them somehow? Thanks, \EF
On Bloom filters and Key Cache
Hi! We're currently testing Cassandra with a large number of row keys per node - nodetool cfstats approximates the number of keys to something like 700M per node. This seems to have caused very large heap consumption. After reading http://wiki.apache.org/cassandra/LargeDataSetConsiderations I think I've tracked this down to the bloom filters and the sampled index entries. Regarding bloom filters, have I understood correctly that they are stored on the heap, and that the Bloom Filter Space Used reported by 'nodetool cfstats' is an approximation of the heap space used by bloom filters? It reports the on-disk size, but if I understand CASSANDRA-3497 correctly, the on-disk size is smaller than the on-heap size? I understand that increasing bloom_filter_fp_chance will decrease the bloom filter size, but at the cost of worse performance when asking for keys that don't exist - and I do have a fair amount of queries for keys that don't exist. How much would it help to trade bloom filter space for key cache space, i.e. shrink the bloom filters but grow the key cache? Will the key cache cache negative results, i.e. the fact that a key didn't exist? Regards, \EF
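For a rough sense of scale, the textbook Bloom filter sizing formula m/n = -ln(p)/(ln 2)^2 bits per key can estimate the filter memory from the key count and fp chance. This is a sketch of the general formula, not Cassandra's exact on-heap layout, which adds its own overhead:

```python
import math


def bloom_filter_bits_per_key(fp_chance):
    # Optimal Bloom filter sizing: m/n = -ln(p) / (ln 2)^2
    return -math.log(fp_chance) / (math.log(2) ** 2)


def bloom_filter_mib(num_keys, fp_chance):
    bits = num_keys * bloom_filter_bits_per_key(fp_chance)
    return bits / 8 / 1024 / 1024


# ~700M keys per node, as reported by nodetool cfstats
for p in (0.01, 0.1):
    print(f"fp_chance={p}: ~{bloom_filter_mib(700_000_000, p):.0f} MiB per node")
```

At 700M keys, fp_chance=0.01 works out to roughly 800 MiB of filter bits per node, and raising fp_chance to 0.1 halves that - which would explain why this column family alone puts serious pressure on the heap.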
sstable size increase at compaction
Hi! We're using the bulkloader to load data to Cassandra. During and after bulkloading, the minor compaction process seems to result in larger sstables being created. An example:

  INFO [CompactionExecutor:105] 2012-03-21 15:18:46,608 CompactionTask.java (line 115) Compacting [SSTableReader(path='/cassandra/OSP5/Data/OSP5-Data-hc-1755-Data.db'), (REMOVED A BUNCH OF OTHER SSTABLE PATHS), SSTableReader(path='/cassandra/OSP5/Data/OSP5-Data-hc-1749-Data.db'), SSTableReader(path='/cassandra/OSP5/Data/OSP5-Data-hc-1753-Data.db')]
  INFO [CompactionExecutor:105] 2012-03-21 15:30:04,188 CompactionTask.java (line 226) Compacted to [/cassandra/OSP5/Data/OSP5-Data-hc-3270-Data.db,]. 84,214,484 to 105,498,673 (~125% of original) bytes for 2,132,056 keys at 0.148486MB/s. Time: 677,580ms.

The sstables are compressed (DeflateCompressor with chunk size 128) on the Hadoop cluster before being transferred to Cassandra, and the CF has the same compression settings:

  [default@Keyspace1] describe Data;
    ColumnFamily: Data (Super)
      Key Validation Class: org.apache.cassandra.db.marshal.UTF8Type
      Default column value validator: org.apache.cassandra.db.marshal.LongType
      Columns sorted by: org.apache.cassandra.db.marshal.LongType/org.apache.cassandra.db.marshal.UTF8Type
      GC grace seconds: 864000
      Compaction min/max thresholds: 4/32
      Read repair chance: 1.0
      DC Local Read repair chance: 0.0
      Replicate on write: true
      Caching: KEYS_ONLY
      Bloom Filter FP chance: 0.01
      Built indexes: []
      Compaction Strategy: org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy
      Compression Options:
        chunk_length_kb: 128
        sstable_compression: org.apache.cassandra.io.compress.DeflateCompressor

Any clues on this? Regards, \EF
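One factor worth ruling out (a hypothesis, not a confirmed diagnosis of the 125% figure above): sstable compression is applied per chunk, so when compaction merges and reorders rows, the redundancy that falls inside any one 128 KiB chunk can change, and with it the compressed size. A small sketch with synthetic data showing how strongly Deflate's output depends on what is visible within a single chunk:

```python
import random
import zlib

random.seed(1)
# Synthetic stand-in for sstable data: a 16 KiB random block repeated
# 16 times, so the redundancy is only visible across >16 KiB of context.
block = bytes(random.getrandbits(8) for _ in range(16 * 1024))
data = block * 16  # 256 KiB total


def deflate_chunked(payload, chunk_size):
    """Compress each chunk independently, as chunked sstable
    compression does, and return the total compressed size."""
    return sum(len(zlib.compress(payload[i:i + chunk_size], 9))
               for i in range(0, len(payload), chunk_size))


small_chunks = deflate_chunked(data, 4 * 1024)     # redundancy invisible
large_chunks = deflate_chunked(data, 128 * 1024)   # redundancy visible
print(f"4 KiB chunks:   {small_chunks} bytes")
print(f"128 KiB chunks: {large_chunks} bytes")
```

With 4 KiB chunks every chunk looks like incompressible random data, while 128 KiB chunks compress to a fraction of the input; the same sensitivity applies when compaction changes which rows land in which chunk, even with the chunk size held constant.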
Re: sstable size increase at compaction
On 2012-03-21 16:36, Erik Forsberg wrote: Hi! We're using the bulkloader to load data to Cassandra. During and after bulkloading, the minor compaction process seems to result in larger sstables being created. An example: This is on Cassandra 1.1, btw. \EF
Re: Max TTL?
On 2012-02-20 21:20, aaron morton wrote: Nothing obvious. Samarth (working on the same project) found that his patch for CASSANDRA-3754 was cleaned up a bit too much, which caused a negative ttl. https://issues.apache.org/jira/browse/CASSANDRA-3754?focusedCommentId=13212395&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13212395 So problem found. Regards, \EF
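A toy model of why a TTL that goes negative in the write path makes columns vanish at insertion (an illustration of the mechanism, not Cassandra's actual code): the column carries an expiration timestamp of insertion time plus TTL, and a negative TTL puts that timestamp in the past immediately.

```python
import time


def column_is_live(ttl_seconds, now=None):
    # Sketch: localExpirationTime = insertion time + TTL (seconds);
    # the column is treated as dead once the clock reaches that point.
    now = int(time.time()) if now is None else now
    local_expiration_time = now + ttl_seconds
    return now < local_expiration_time


assert column_is_live(9_622_973)        # ~111 days: survives
assert not column_is_live(-11_824_305)  # negative TTL: already expired
```

That matches the symptom in the original post: writes with the larger TTL "worked" but the columns appeared to be deleted instantly, because the buggy code path had flipped the TTL's sign.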
Max TTL?
Hi! When setting ttl on columns, is there a maximum value (other than MAXINT, 2**31-1) that can be used? I have a very odd behaviour here, where I try to set ttl to 9 622 973 (~111 days) which works, but setting it to 11 824 305 (~137 days) does not - it seems columns are deleted instantly at insertion. This is using the BulkOutputFormat. And it could be a problem with our code, i.e. the code using BulkOutputFormat. So, uhm, just asking to see if we're hitting something obvious. Regards, \EF
Streaming sessions from BulkOutputFormat job being listed long after they were killed
Hi! If I run a hadoop job that uses BulkOutputFormat to write data to Cassandra, and that hadoop job is aborted, i.e. streaming sessions are not completed, the streaming sessions seem to hang around for a very long time (I've observed at least 12-15h) in the output of 'nodetool netstats'. To me it seems like they go away only after a restart of Cassandra. Is this a known behaviour? Does it cause any problems, for example consuming memory, or should I just ignore it? Regards, \EF
Recommended configuration for good streaming performance?
Hi! We're experimenting with streaming from Hadoop to Cassandra using BulkOutputFormat, on the cassandra-1.1 branch. Are there any specific settings we should tune on the Cassandra servers in order to get the best streaming performance? Our Cassandra servers have 16 cores (including HT cores) and 24GiB of RAM. They have two disks each. So far we've configured them with the commitlog on one disk and sstables on the other, but since streaming doesn't use the commitlog (correct?), maybe it makes sense to have sstables on both disks, doubling the available I/O? Thoughts on the number of parallel streaming clients? Thanks, \EF
Can I use BulkOutputFormat from 1.1 to load data to older Cassandra versions?
Hi! Can the new BulkOutputFormat (https://issues.apache.org/jira/browse/CASSANDRA-3045) be used to load data to servers running cassandra 0.8.7 and/or Cassandra 1.0.6? I'm thinking of using jar files from the development version to load data onto a production cluster which I want to keep on a production version of Cassandra. Can I do that, or does BulkOutputFormat require an API level that is only in the development version of Cassandra? Thanks, \EF
Re: Multiple large disks in server - setup considerations
On Tue, 31 May 2011 13:23:36 -0500 Jonathan Ellis jbel...@gmail.com wrote: Have you read http://wiki.apache.org/cassandra/CassandraHardware ? I had, but it was a while ago so I guess I kind of deserved an RTFM! :-) After re-reading it, I still want to know: * If we disregard the performance hit caused by having the commitlog on the same physical device as parts of the data, are there any other grave effects on Cassandra's functionality with a setup like that? * How does Cassandra handle a case where one of the disks in a striped RAID0 partition goes bad and is replaced? Is the only option to wipe everything from that node and reinit the node, or will it handle corrupt files? I.e, what's the recommended thing to do from an operations point of view when a disk dies on one of the nodes in a RAID0 Cassandra setup? What will cause the least risk for data loss? What will be the fastest way to get the node up to speed with the rest of the cluster? Thanks, \EF On Tue, May 31, 2011 at 7:47 AM, Erik Forsberg forsb...@opera.com wrote: Hi! I'm considering setting up a small (4-6 nodes) Cassandra cluster on machines that each have 3x2TB disks. There's no hardware RAID in the machine, and if there were, it could only stripe single disks together, not parts of disks. I'm planning RF=2 (or higher). I'm pondering what the best disk configuration is. Two alternatives: 1) Make small partition on first disk for Linux installation and commit log. Use Linux' software RAID0 to stripe the remaining space on disk1 + the two remaining disks into one large XFS partition. 2) Make small partition on first disk for Linux installation and commit log. Mount rest of disk 1 as /var/cassandra1, then disk2 as /var/cassandra2 and disk3 as /var/cassandra3. Is it unwise to put the commit log on the same physical disk as some of the data? I guess it could impact write performance, but maybe it's bad from a data consistency point of view? 
How does Cassandra handle replacement of a bad disk in the two alternatives? With option 1) I guess there's risk of files being corrupt. With option 2) they will simply be missing after replacing the disk with a new one. With option 2) I guess I'm limiting the size of the total amount of data in the largest CF at compaction to, hmm.. the free space on the disk with most free space, correct? Comments welcome! Thanks, \EF -- Erik Forsberg forsb...@opera.com Developer, Opera Software - http://www.opera.com/
Multiple large disks in server - setup considerations
Hi! I'm considering setting up a small (4-6 nodes) Cassandra cluster on machines that each have 3x2TB disks. There's no hardware RAID in the machine, and if there were, it could only stripe single disks together, not parts of disks. I'm planning RF=2 (or higher). I'm pondering what the best disk configuration is. Two alternatives: 1) Make small partition on first disk for Linux installation and commit log. Use Linux' software RAID0 to stripe the remaining space on disk1 + the two remaining disks into one large XFS partition. 2) Make small partition on first disk for Linux installation and commit log. Mount rest of disk 1 as /var/cassandra1, then disk2 as /var/cassandra2 and disk3 as /var/cassandra3. Is it unwise to put the commit log on the same physical disk as some of the data? I guess it could impact write performance, but maybe it's bad from a data consistency point of view? How does Cassandra handle replacement of a bad disk in the two alternatives? With option 1) I guess there's risk of files being corrupt. With option 2) they will simply be missing after replacing the disk with a new one. With option 2) I guess I'm limiting the size of the total amount of data in the largest CF at compaction to, hmm.. the free space on the disk with most free space, correct? Comments welcome! Thanks, \EF -- Erik Forsberg forsb...@opera.com Developer, Opera Software - http://www.opera.com/
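The last guess above can be made concrete. With one data directory per disk (option 2), a single compaction writes its output sstable into one directory, so the largest output a compaction can produce is bounded by the free space on the most-free directory. This is a simplification that ignores how Cassandra actually picks among its data directories:

```python
def max_compaction_output_gb(free_gb_per_directory):
    """Upper bound on one compaction's output in the JBOD layout:
    the output sstable must fit entirely within a single data
    directory, so the most-free disk sets the ceiling."""
    return max(free_gb_per_directory)


# Three ~2 TB disks with different amounts of free space (made-up numbers)
print(max_compaction_output_gb([400, 950, 700]))  # bounded by the 950 GB disk
```

Under option 1 (RAID0), the same compaction could instead use the combined free space of all three disks, at the cost of losing the whole array when any one disk dies.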