CQL Clarification
Hello, Just wondering if I can get a quick clarification on some simple CQL. We utilize Thrift CQL queries to access our cassandra setup. As clarified in a previous question I had, when using CQL and Thrift, timestamps on the cassandra column data are assigned by the server, not the client, unless AND TIMESTAMP is utilized in the query, for example: http://www.datastax.com/docs/1.0/references/cql/UPDATE

According to the Datastax documentation, this timestamp should be: "Values serialized with the timestamp type are encoded as 64-bit signed integers representing a number of milliseconds since the standard base time known as the epoch: January 1 1970 at 00:00:00 GMT."

However, my testing showed that updates didn't work when I used a timestamp of this format. Looking at the Cassandra code, it appears that cassandra will assign a timestamp of System.currentTimeMillis() * 1000 when a timestamp is not specified, which would be the number of microseconds since the standard base time. In my test environment, setting the timestamp to be the current time * 1000 seems to work. It seems that if you have an older installation without TIMESTAMP being specified in the CQL, or a mixed environment, the timestamp should be * 1000. Just making sure I'm reading everything properly... improperly setting the timestamp could cause us some serious damage. Thanks, -Mike
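To make the workaround concrete, a minimal Java sketch of building a CQL 2 UPDATE with an explicit microsecond timestamp; the table, column, and key here are hypothetical, and the resulting string would be passed to whatever Thrift CQL execution call the client already uses:

    public class ExplicitWriteTimestamp {
        public static void main(String[] args) {
            // Cassandra's server-assigned default is microseconds since the epoch
            // (System.currentTimeMillis() * 1000), so an explicit TIMESTAMP must use
            // the same scale or it can lose conflict resolution to server-stamped writes.
            long micros = System.currentTimeMillis() * 1000;
            String cql = "UPDATE users USING CONSISTENCY QUORUM AND TIMESTAMP "
                    + micros + " SET status = 2 WHERE user_name = 'alice'";
            System.out.println(cql);
        }
    }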
Re: Really odd issue (AWS related?)
Hello, We've done some additional monitoring, and I think we have more information. We've been collecting vmstat information every minute, attempting to catch a node with issues. So it appears that the cassandra node runs fine, then suddenly, without any correlation to any event that I can identify, the I/O wait time goes way up, and stays up indefinitely. Even non-cassandra I/O activities (such as snapshots and backups) start causing large I/O wait times when they typically would not. Previous to an issue, we would typically see I/O wait times of 3-4% with very few processes blocked on I/O. Once this issue manifests itself, I/O wait times for the same activities jump to 30-40% with many blocked processes. The I/O wait times do go back down when there is literally no activity.

- Updating the node to the latest Amazon Linux patches and rebooting the instance doesn't correct the issue.
- Backing up the node and replacing the instance does correct the issue. I/O wait times return to normal.

One relatively recent change we've made is that we upgraded to m1.xlarge instances, which have 4 ephemeral drives available. We create a logical volume from the 4 drives with the idea that we should be able to get increased I/O throughput. When we ran m1.large instances, we had the same setup, although it was only using 2 ephemeral drives. We chose LVM over mdadm because we were having issues getting mdadm to create the raid volume reliably on restart (and research showed that this was a common problem). LVM just worked (and had worked for months before this upgrade). For reference, this is the script we used to create the logical volume:

vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
blockdev --setra 65536 /dev/mnt_vg/mnt_lv
sleep 2
mkfs.xfs /dev/mnt_vg/mnt_lv
sleep 3
mkdir -p /data
mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
sleep 3

Another tidbit... thus far (and this may be only a coincidence), we've only had to replace DB nodes within a single availability zone within us-east. Other availability zones in the same region have yet to show an issue. It looks like I'm going to need to replace a third DB node today. Any advice would be appreciated. Thanks, -Mike

On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote: Thanks. We weren't monitoring this value when the issue occurred, and this particular issue has not appeared for a couple of days (knock on wood). Will keep an eye out though, -Mike

On Apr 26, 2013, at 5:32 AM, Jason Wee wrote: top command? st : time stolen from this vm by the hypervisor jason

On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote: Sorry, Not sure what CPU steal is :) I have AWS console with detailed monitoring enabled... things seem to track close to the minute, so I can see the CPU load go to 0... then jump at about the minute Cassandra reports the dropped messages, -Mike

On Apr 25, 2013, at 9:50 PM, aaron morton wrote: The messages appear right after the node wakes up. Are you tracking CPU steal ? - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote: On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com wrote: Another related question. Once we see messages being dropped on one node, our cassandra client appears to see this, reporting errors. We use LOCAL_QUORUM with a RF of 3 on all queries. Any idea why clients would see an error?
If only one node reports an error, shouldn't the consistency level prevent the client from seeing an issue? If the client is talking to a broken/degraded coordinator node, RF/CL are unable to protect it from RPCTimeout. If it is unable to coordinate the request in a timely fashion, your clients will get errors. =Rob
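For readers wondering about the arithmetic behind Rob's answer, a small illustrative sketch: the consistency level only governs how many replica acknowledgements are required, not the health of the coordinator collecting them.

    public class QuorumMath {
        public static void main(String[] args) {
            int rf = 3;                // replication factor from the thread
            int quorum = (rf / 2) + 1; // LOCAL_QUORUM with RF=3 requires 2 replicas
            System.out.println("replica acks required: " + quorum);
            // Even with 2 healthy replicas, a degraded coordinator that cannot
            // dispatch or collect those acks in time returns a timeout to the client.
        }
    }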
Re: Secondary Index on table with a lot of data crashes Cassandra
What are we doing wrong? Can it be that Cassandra is actually trying to read all the CF data rather than just the keys! (actually, it doesn't need to go to the users CF at all - all the data it needs is in the index CF) Data is not stored as a BTree; that's the RDBMS approach. We hit the in-memory bloom filter, then perhaps the -index.db and finally the -data.db. While in this edge case it may be possible to serve your query just from the -index.db, there is no optimisation in place for that.

Select user_name from users where status = 2; Always crashes. What is the error ?

2. understand if there is something in this use case which indicates that we are not using Cassandra the way it is meant. Just like an RDBMS database, these are fastest when you use the primary key, a bit slower when you use a non-primary index, and slowest when you do not use an index at all.

Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 25/04/2013, at 8:32 PM, moshe.kr...@barclays.com wrote: IMHO: user_name is not a column, it is the row key. Therefore, according to http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ , the row does not contain a relevant column index, which causes the iterator to read each column (including value) of each row. I believe that instead of referring to user_name as if it were a column, you need to refer to it via the reserved word "KEY", e.g.: Select KEY from users where status = 2; Always glad to share a theory with a friend….

From: Tamar Rosen [mailto:ta...@correlor.com] Sent: Thursday, April 25, 2013 11:04 AM To: user@cassandra.apache.org Subject: Secondary Index on table with a lot of data crashes Cassandra

Hi, We have a case of a reproducible crash, probably due to out of memory, but I don't understand why. The installation is currently single node. We have a column family with approx 5 rows. In CQL, the CF definition is:

CREATE TABLE users (
  user_name text PRIMARY KEY,
  big_json text,
  status int
);

Each big_json can have 500K or more of data. There is also a secondary index on the status column. Status can have various values; over 90% of all rows have status = 2. Calling: Select user_name from users limit 8; is pretty fast. Calling: Select user_name from users where status = 1; is slower, even though much less data is returned. Calling: Select user_name from users where status = 2; always crashes. What are we doing wrong? Can it be that Cassandra is actually trying to read all the CF data rather than just the keys! (actually, it doesn't need to go to the users CF at all - all the data it needs is in the index CF) Also, in the code I am doing the same using an Astyanax index query with pagination, and the behavior is the same. Please help me: 1. solve the immediate issue 2. understand if there is something in this use case which indicates that we are not using Cassandra the way it is meant. Thanks, Tamar Rosen Correlor.com
Re: 1.2.3 and 1.2.4 memory usage growth on idle cluster
INFO 11:10:56,273 GC for ParNew: 1039 ms for 1 collections, 6631277912 used; max is 10630070272

It depends on the settings. It looks like you are using non-default JVM settings. I'd recommend restoring the default JVM settings as a start.

Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 25/04/2013, at 9:30 PM, Igor i...@4friends.od.ua wrote: Hello. Has anybody seen memory problems on an idle cluster? I have an 8-node ring with cassandra 1.2.3 which has never been used and has stayed idle for several weeks. Yesterday, when I decided to upgrade it to 1.2.4, I found lots of messages like:

INFO 11:10:56,273 GC for ParNew: 1039 ms for 1 collections, 6631277912 used; max is 10630070272
INFO 11:10:56,273 Pool Name                    Active   Pending   Blocked
INFO 11:10:56,275 ReadStage                         0         0         0
INFO 11:10:56,276 RequestResponseStage              0         0         0
INFO 11:10:56,276 ReadRepairStage                   0         0         0
INFO 11:10:56,277 MutationStage                     0         0         0
INFO 11:10:56,277 ReplicateOnWriteStage             0         0         0
INFO 11:10:56,278 GossipStage                       0         0         0
INFO 11:10:56,278 AntiEntropyStage                  0         0         0
INFO 11:10:56,278 MigrationStage                    0         0         0
INFO 11:10:56,279 MemtablePostFlusher               0         0         0
INFO 11:10:56,279 FlushWriter                       0         0         0
INFO 11:10:56,280 MiscStage                         0         0         0
INFO 11:10:56,280 commitlog_archiver                0         0         0
INFO 11:10:56,280 InternalResponseStage             0         0         0
INFO 11:10:56,281 HintedHandoff                     0         0         0
INFO 11:10:56,281 CompactionManager                 0         0
INFO 11:10:56,281 MessagingService                n/a       0,0
INFO 11:10:56,281 Cache Type   Size   Capacity    KeysToSave   Provider
INFO 11:10:56,281 KeyCache     7368   104857600   all
INFO 11:10:56,281 RowCache     0      0           all          org.apache.cassandra.cache.SerializingCacheProvider
INFO 11:10:56,281 ColumnFamily                  Memtable ops,data
INFO 11:10:56,281 system.local                  4,52
INFO 11:10:56,281 system.peers                  30,6093
INFO 11:10:56,282 system.batchlog               0,0
INFO 11:10:56,282 system.NodeIdInfo             0,0
INFO 11:10:56,282 system.LocationInfo           0,0
INFO 11:10:56,282 system.Schema                 0,0
INFO 11:10:56,282 system.Migrations             0,0
INFO 11:10:56,282 system.schema_keyspaces       0,0
INFO 11:10:56,282 system.schema_columns         0,0
INFO 11:10:56,282 system.schema_columnfamilies  0,0
INFO 11:10:56,282 system.IndexInfo              0,0
INFO 11:10:56,282 system.range_xfers            0,0
INFO 11:10:56,282 system.peer_events            0,0
INFO 11:10:56,283 system.hints                  0,0
INFO 11:10:56,283 system.HintsColumnFamily      0,0
INFO 11:10:56,283 system_auth.users             0,0
INFO 11:10:56,283 system_traces.sessions        0,0
INFO 11:10:56,283 system_traces.events          0,0
INFO 11:11:21,205 GC for ParNew: 1035 ms for 1 collections, 6633037168 used; max is 10630070272

So you can see there is no activity at all, and yet, as the Java heap graph shows, the heap constantly grows. I plan to use this ring in prod, but this strange behaviour confuses me.
Re: CQL indexing
This discussion belongs on the user list; also, please only email one list at a time. This article discusses the improvements to secondary indexes in 1.2: http://www.datastax.com/dev/blog/improving-secondary-index-write-performance-in-1-2 If you have some more specific questions let us know. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 26/04/2013, at 7:01 PM, Sri Ramya ramya.1...@gmail.com wrote: Hi. In CQL, to perform a query based on columns you have to create an index on that column. What exactly happens when we create an index on a column? What might the index column family contain?
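To answer the "what might the index column family contain" part in code terms, a rough illustrative model using plain Java collections (the real structure is a hidden, node-local column family, not a Map):

    import java.util.*;

    public class IndexModel {
        public static void main(String[] args) {
            // The indexed value acts as the row key of the internal index CF,
            // and each column name in that row is a row key from the base CF.
            Map<Integer, SortedSet<String>> statusIndex = new HashMap<>();
            statusIndex.computeIfAbsent(2, k -> new TreeSet<>()).add("alice");
            statusIndex.computeIfAbsent(2, k -> new TreeSet<>()).add("bob");
            statusIndex.computeIfAbsent(1, k -> new TreeSet<>()).add("carol");

            // "WHERE status = 2" first reads the index row for the value 2, then
            // fetches each matching row from the base CF -- which is also why a
            // value shared by 90% of rows (as in the thread above) yields one
            // enormous index row.
            System.out.println("rows with status=2: " + statusIndex.get(2));
        }
    }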
Re: Many creation/inserts in parallel
At first many CF are being created in parallel (about 1000 CF). Can you explain this in a bit more detail? By "in parallel" do you mean multiple threads creating CFs at the same time? I would also recommend taking a second look at your data model; you probably do not want to create so many CFs. During tests we're receiving some exceptions from driver, e.g.: The CF you are trying to read / write from does not exist. Check that the table exists using cqlsh / cassandra-cli, and check your code to make sure it was created. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 26/04/2013, at 10:49 PM, Sasha Yanushkevich yanus...@gmail.com wrote: Hi All. We are testing Cassandra 1.2.3 (3 nodes with RF:2) with the FluentCassandra driver. At first many CF are being created in parallel (about 1000 CF). After creation is done, many insertions of small amounts of data into the DB follow. During the tests we're receiving some exceptions from the driver, e.g.: FluentCassandra.Operations.CassandraOperationException: unconfigured columnfamily table_78_9 and FluentCassandra.Operations.CassandraOperationException: Connection to Cassandra has timed out Though in Cassandra's logs there are no exceptions. What should we do to handle these exceptions? -- Best regards, Alexander
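One pattern that often explains "unconfigured columnfamily" right after bulk schema changes is writing before the schema has propagated to every node. A hedged sketch of the usual guard, assuming a connected Thrift Cassandra.Client (the 200 ms back-off is arbitrary):

    import java.util.List;
    import java.util.Map;
    import org.apache.cassandra.thrift.Cassandra;

    public class SchemaAgreement {
        // Block until all reachable nodes report a single schema version, so
        // freshly created column families are visible cluster-wide before use.
        static void waitForAgreement(Cassandra.Client client) throws Exception {
            while (true) {
                Map<String, List<String>> versions = client.describe_schema_versions();
                versions.remove("UNREACHABLE"); // ignore nodes that are down
                if (versions.size() <= 1) {
                    return; // one schema version everywhere: safe to proceed
                }
                Thread.sleep(200);
            }
        }
    }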
Re: Really odd issue (AWS related?)
I forgot to mention: when things go really bad, I'm seeing I/O waits in the 80-95% range. I restarted cassandra once when a node was in this situation, and it took 45 minutes to start (primarily reading SSTables). Typically, a node would start in about 5 minutes. Thanks, -Mike

On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote: Hello, We've done some additional monitoring, and I think we have more information. We've been collecting vmstat information every minute, attempting to catch a node with issues. So it appears that the cassandra node runs fine, then suddenly, without any correlation to any event that I can identify, the I/O wait time goes way up, and stays up indefinitely. Even non-cassandra I/O activities (such as snapshots and backups) start causing large I/O wait times when they typically would not. Previous to an issue, we would typically see I/O wait times of 3-4% with very few processes blocked on I/O. Once this issue manifests itself, I/O wait times for the same activities jump to 30-40% with many blocked processes. The I/O wait times do go back down when there is literally no activity.

- Updating the node to the latest Amazon Linux patches and rebooting the instance doesn't correct the issue.
- Backing up the node and replacing the instance does correct the issue. I/O wait times return to normal.

One relatively recent change we've made is that we upgraded to m1.xlarge instances, which have 4 ephemeral drives available. We create a logical volume from the 4 drives with the idea that we should be able to get increased I/O throughput. When we ran m1.large instances, we had the same setup, although it was only using 2 ephemeral drives. We chose LVM over mdadm because we were having issues getting mdadm to create the raid volume reliably on restart (and research showed that this was a common problem). LVM just worked (and had worked for months before this upgrade). For reference, this is the script we used to create the logical volume:

vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
blockdev --setra 65536 /dev/mnt_vg/mnt_lv
sleep 2
mkfs.xfs /dev/mnt_vg/mnt_lv
sleep 3
mkdir -p /data
mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
sleep 3

Another tidbit... thus far (and this may be only a coincidence), we've only had to replace DB nodes within a single availability zone within us-east. Other availability zones in the same region have yet to show an issue. It looks like I'm going to need to replace a third DB node today. Any advice would be appreciated. Thanks, -Mike

On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote: Thanks. We weren't monitoring this value when the issue occurred, and this particular issue has not appeared for a couple of days (knock on wood). Will keep an eye out though, -Mike

On Apr 26, 2013, at 5:32 AM, Jason Wee wrote: top command? st : time stolen from this vm by the hypervisor jason

On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote: Sorry, Not sure what CPU steal is :) I have AWS console with detailed monitoring enabled... things seem to track close to the minute, so I can see the CPU load go to 0... then jump at about the minute Cassandra reports the dropped messages, -Mike

On Apr 25, 2013, at 9:50 PM, aaron morton wrote: The messages appear right after the node wakes up. Are you tracking CPU steal ?
- Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote: On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com wrote: Another related question. Once we see messages being dropped on one node, our cassandra client appears to see this, reporting errors. We use LOCAL_QUORUM with a RF of 3 on all queries. Any idea why clients would see an error? If only one node reports an error, shouldn't the consistency level prevent the client from seeing an issue? If the client is talking to a broken/degraded coordinator node, RF/CL are unable to protect it from RPCTimeout. If it is unable to coordinate the request in a timely fashion, your clients will get errors. =Rob
Re: Deletes, null values
What's your table definition ? select '1228#16857','1228#16866','1228#16875','1237#16544','1237#16553' from myCF where key = 'all'; The output looks correct to me. CQL tables return values, including null, for all of the selected columns. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 27/04/2013, at 12:48 AM, Sorin Manolache sor...@gmail.com wrote: On 2013-04-26 11:55, Alain RODRIGUEZ wrote: Of course: From CQL 2 (cqlsh -2): delete '183#16684','183#16714','183#16717' from myCF where key = 'all'; And selecting this data as follows gives me the result above: select '1228#16857','1228#16866','1228#16875','1237#16544','1237#16553' from myCF where key = 'all'; From thrift (phpCassa client):

$pool = new ConnectionPool('myKeyspace', array('192.168.100.201'), 6, 0, 3, 3);
$my_cf = new ColumnFamily($pool, 'myCF', true, true, ConsistencyLevel::QUORUM, ConsistencyLevel::QUORUM);
$my_cf->remove('all', array('1228#16857','1228#16866','1228#16875'));

I see. I'm sorry, I know nothing about phpCassa. I use batch_mutate with deletions and it works. But I guess phpCassa must use the same thrift primitives. Sorin

2013/4/25 Sorin Manolache sor...@gmail.com On 2013-04-25 11:48, Alain RODRIGUEZ wrote: Hi, I tried to delete some columns using cql2 as well as thrift on C* 1.2.2 and instead of being unreachable, deleted columns have a null value. I am using no value in this CF; the only information I use is the existence of the column. So when I select all the columns for a given key I have the following returned:

 1228#16857 | 1228#16866 | 1228#16875 | 1237#16544 | 1237#16553
------------+------------+------------+------------+------------
       null |       null |       null |            |

This is quite annoying since my app thinks that I have 5 columns there when I should have 2 only. I first thought that this was a visible marker of tombstones, but they didn't vanish after a major compaction. How can I get rid of these null/ghost columns and why does it happen? I do something similar but I don't see null values. Could you please post the code where you delete the columns? Sorin
Re: Is Cassandra oversized for this kind of use case?
Sounds like something C* would be good at. I would do some searching on time series data in cassandra, such as http://www.datastax.com/dev/blog/advanced-time-series-with-cassandra And definitely consider storing data at the smallest level of granularity. On the analytics side there is good news and not-so-good news. First, the good news: reads do not block writes, as they do in a traditional RDBMS (without MVCC) running with a transaction isolation of Repeatable Read or higher. The not-so-good news: it's not as easy to support the wide range of analytical queries that you are used to with SQL using the standard Thrift/CQL API. If you need very flexible analysis I recommend looking into Hive / Pig with Hadoop. DataStax Enterprise is a commercial product but free for development, and a great way to learn without having to worry about the setup: http://www.datastax.com/ You may also be interested in http://www.pentaho.com/ or http://www.karmasphere.com/ Hope that helps. - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 27/04/2013, at 5:26 AM, Hiller, Dean dean.hil...@nrel.gov wrote: I would at least start with 3 cheap nodes with RF=3 and start with CL=TWO on writes and reads, most likely, getting your feet wet. Don't buy very expensive computers like a lot do getting into the game for the first time… every time I walk into a new gig, they seem to think they need to spend 6-10k per node. I think this kind of scenario sounds fine for cassandra. When you say virtualize, I believe you mean use VMs… many use Amazon VMs, and there is stuff to configure if you are on Amazon specifically for this. If you are on your own VMs, you do need to worry about two nodes ending up on the same hardware stealing resources from each other, and about hardware failure as well. I.e. the idea in noSQL is you typically have 3 copies of all data, so if one node goes down, you are still live with CL=TWO. Also, plan on doing ~300GB per node typically, depending on how it works out in testing. Later, Dean

From: Marc Teufel teufel.m...@googlemail.com Reply-To: user@cassandra.apache.org Date: Friday, April 26, 2013 10:59 AM To: user@cassandra.apache.org Subject: Re: Is Cassandra oversized for this kind of use case?

Okay, one billion rows of data is a lot; compared to that I am far far away - means I can stay with Oracle? Maybe. But you're right when you say it's not only about big data but also about your need. So storing the data is one part, doing analytical analysis is the second. I do a lot of calculations and queries to generate management criteria about how the production is going right now, how the production went the last week, month, years and so on. Saving in a 5 minute rhythm is only a compromise to reduce the amount of data - maybe in the future the use case will change and is about to store the status of each machine as soon as it changes. This will of course increase the amount of data and the complexity of my queries again. And sure I show live data today... 5 minute old live data... but if I tell the CEO that I am also able to work with real live data, I am sure this is what he wants to get ;-) Can you recommend me to use Cassandra for this kind of scenario or is this oversized? Does it make sense to start with 2 nodes? Can I virtualize these two nodes?
Thx a lot for your assistance. Marc

2013/4/26 Hiller, Dean dean.hil...@nrel.gov Well, it depends more on what you will do with the data. I know I was on a Sybase (RDBMS) with 1 billion rows, but it was getting close to not being able to handle more (constraints had to be turned off, all sorts of optimizations done, and expert consultants brought in). BUT there are other use cases where noSQL is great (i.e. it is not just great for big-data-type systems). It is great for really high write throughput, as you can add more nodes and handle more writes/second than an RDBMS very easily, yet you may be doing so many deletes that the system constantly stays at a small data set. You may want to analyze the data constantly or in near real time, involving huge amounts of reads/second, in which case noSQL can be better as well. I.e. noSQL is not just for big data. I know with PlayOrm for cassandra, we have handled many different use cases out there. Later, Dean

From: Marc Teufel teufel.m...@googlemail.com Reply-To:
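To put rough numbers on Dean's ~300GB-per-node guideline for this use case, a back-of-envelope sketch (the machine count and per-sample size here are hypothetical, purely to show the arithmetic):

    public class SizingEstimate {
        public static void main(String[] args) {
            int machines = 200;               // hypothetical plant size
            int intervalSec = 300;            // the 5-minute rhythm from the thread
            long samplesPerDay = machines * (86_400L / intervalSec); // 57,600/day
            long bytesPerSample = 200;        // hypothetical serialized row size
            double gbPerYear = samplesPerDay * 365L * bytesPerSample / 1e9;
            // Storing every status change instead would multiply this figure.
            System.out.printf("~%.1f GB/year raw%n", gbPerYear);
        }
    }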
question about internode_compression
When internode_compression is enabled, will the compression algorithm used be the same as whatever I am using for sstable_compression? - John
Re: cost estimate about some Cassandra patchs
Does anyone know enough of the inner workings of Cassandra to tell me how much work is needed to patch Cassandra to enable such communication vectorization/batching? Assuming you mean have the coordinator send multiple row read/write requests in a single message to replicas: pretty sure this has been raised as a ticket before, but I cannot find one now. It would be a significant change, and I'm not sure how big the benefit is. To send the messages the coordinator places them in a queue; there is little delay sending. Then it waits on them async. So there may be some saving on networking, but from the coordinator's point of view I think the impact is minimal. What is your use case? Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 27/04/2013, at 4:04 AM, DE VITO Dominique dominique.dev...@thalesgroup.com wrote: Hi, We have created a new partitioner that groups some rows with **different** row keys on the same replicas. But neither batch_mutate nor multiget_slice is able to take advantage of this partitioner-defined placement to vectorize/batch communications between the coordinator and the replicas. Does anyone know enough of the inner workings of Cassandra to tell me how much work is needed to patch Cassandra to enable such communication vectorization/batching? Thanks. Regards, Dominique
Re: Adding nodes in 1.2 with vnodes requires huge disks
We're going to try running a shuffle before adding a new node again... maybe that will help I don't think it will hurt, but I doubt it will help. It seems when new nodes join, they are streamed *all* sstables in the cluster. How many nodes did you join, and what was num_tokens? Did you notice streaming from all nodes (in the logs), or are you saying this in response to the cluster load increasing? The purple line machine, I just stopped the joining process because the main cluster was dropping mutation messages at this point on a few nodes (and it still had dozens of sstables to stream.) Which were the new nodes? Can you show the output from nodetool status? Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 27/04/2013, at 9:35 AM, Bryan Talbot btal...@aeriagames.com wrote: I believe that nodetool rebuild is used to add a new datacenter, not just a new host to an existing cluster. Is that what you ran to add the node? -Bryan On Fri, Apr 26, 2013 at 1:27 PM, John Watson j...@disqus.com wrote: Small relief we're not the only ones that had this issue. We're going to try running a shuffle before adding a new node again... maybe that will help - John On Fri, Apr 26, 2013 at 5:07 AM, Francisco Nogueira Calmon Sobral fsob...@igcorp.com.br wrote: I am using the same version and observed something similar. I've added a new node, but the instructions from Datastax did not work for me. Then I ran nodetool rebuild on the new node. After this command finished, it contained two times the load of the other nodes. Even when I ran nodetool cleanup on the older nodes, the situation was the same. The problem only seemed to disappear when nodetool repair was applied to all nodes. Regards, Francisco Sobral. On Apr 25, 2013, at 4:57 PM, John Watson j...@disqus.com wrote: After finally upgrading to 1.2.3 from 1.1.9, enabling vnodes, and running upgradesstables, I figured it would be safe to start adding nodes to the cluster. Guess not? It seems when new nodes join, they are streamed *all* sstables in the cluster. https://dl.dropbox.com/s/bampemkvlfck2dt/Screen%20Shot%202013-04-25%20at%2012.35.24%20PM.png The gray line machine ran out of disk space and for some reason cascaded into errors in the cluster about 'no host id' when trying to store hints for it (even though it hadn't joined yet). The purple line machine, I just stopped the joining process because the main cluster was dropping mutation messages at this point on a few nodes (and it still had dozens of sstables to stream.) I followed this: http://www.datastax.com/docs/1.2/operations/add_replace_nodes Is there something missing in that documentation? Thanks, John
cassandra-shuffle time to completion and required disk space
The amount of time/space cassandra-shuffle requires when upgrading to vnodes should really be made apparent in the documentation (when some is written). The only semi-noticeable remark about the exorbitant amount of time is a bullet point in http://wiki.apache.org/cassandra/VirtualNodes/Balance : "Shuffling will entail moving a lot of data around the cluster and so has the potential to consume a lot of disk and network I/O, and to take a considerable amount of time. For this to be an online operation, the shuffle will need to operate on a lower priority basis to other streaming operations, and should be expected to take days or weeks to complete." We tried running shuffle on a QA version of our cluster and 2 things were brought to light:

- Even with no reads/writes it was going to take 20 days
- Each machine needed enough free disk space to potentially hold the entire cluster's sstables on disk

Regards, John
Re: Really odd issue (AWS related?)
Hi Mike, We had issues with the ephemeral drives when we first got started, although we never got to the bottom of it, so I can't help much with troubleshooting unfortunately. Contrary to a lot of the comments on the mailing list, we've actually had a lot more success with EBS drives (PIOPs!). I'd definitely suggest trying 4 EBS drives striped (RAID 0) and using PIOPS. You could be having a noisy neighbour problem; I don't believe that m1.large or m1.xlarge instances get all of the actual hardware, and virtualisation on EC2 still sucks at isolating resources. We've also had more success with Ubuntu on EC2 - not so much with our Cassandra nodes, but some of our other services didn't run as well on Amazon Linux AMIs. Alex

On Sun, Apr 28, 2013 at 7:12 PM, Michael Theroux mthero...@yahoo.com wrote: I forgot to mention: when things go really bad, I'm seeing I/O waits in the 80-95% range. I restarted cassandra once when a node was in this situation, and it took 45 minutes to start (primarily reading SSTables). Typically, a node would start in about 5 minutes. Thanks, -Mike

On Apr 28, 2013, at 12:37 PM, Michael Theroux wrote: Hello, We've done some additional monitoring, and I think we have more information. We've been collecting vmstat information every minute, attempting to catch a node with issues. So it appears that the cassandra node runs fine, then suddenly, without any correlation to any event that I can identify, the I/O wait time goes way up, and stays up indefinitely. Even non-cassandra I/O activities (such as snapshots and backups) start causing large I/O wait times when they typically would not. Previous to an issue, we would typically see I/O wait times of 3-4% with very few processes blocked on I/O. Once this issue manifests itself, I/O wait times for the same activities jump to 30-40% with many blocked processes. The I/O wait times do go back down when there is literally no activity.

- Updating the node to the latest Amazon Linux patches and rebooting the instance doesn't correct the issue.
- Backing up the node and replacing the instance does correct the issue. I/O wait times return to normal.

One relatively recent change we've made is that we upgraded to m1.xlarge instances, which have 4 ephemeral drives available. We create a logical volume from the 4 drives with the idea that we should be able to get increased I/O throughput. When we ran m1.large instances, we had the same setup, although it was only using 2 ephemeral drives. We chose LVM over mdadm because we were having issues getting mdadm to create the raid volume reliably on restart (and research showed that this was a common problem). LVM just worked (and had worked for months before this upgrade). For reference, this is the script we used to create the logical volume:

vgcreate mnt_vg /dev/sdb /dev/sdc /dev/sdd /dev/sde
lvcreate -L 1600G -n mnt_lv -i 4 mnt_vg -I 256K
blockdev --setra 65536 /dev/mnt_vg/mnt_lv
sleep 2
mkfs.xfs /dev/mnt_vg/mnt_lv
sleep 3
mkdir -p /data
mount -t xfs -o noatime /dev/mnt_vg/mnt_lv /data
sleep 3

Another tidbit... thus far (and this may be only a coincidence), we've only had to replace DB nodes within a single availability zone within us-east. Other availability zones in the same region have yet to show an issue. It looks like I'm going to need to replace a third DB node today. Any advice would be appreciated. Thanks, -Mike

On Apr 26, 2013, at 10:14 AM, Michael Theroux wrote: Thanks.
We weren't monitoring this value when the issue occurred, and this particular issue has not appeared for a couple of days (knock on wood). Will keep an eye out though, -Mike

On Apr 26, 2013, at 5:32 AM, Jason Wee wrote: top command? st : time stolen from this vm by the hypervisor jason

On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote: Sorry, Not sure what CPU steal is :) I have AWS console with detailed monitoring enabled... things seem to track close to the minute, so I can see the CPU load go to 0... then jump at about the minute Cassandra reports the dropped messages, -Mike

On Apr 25, 2013, at 9:50 PM, aaron morton wrote: The messages appear right after the node wakes up. Are you tracking CPU steal ? - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote: On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com wrote: Another related question. Once we see messages being dropped on one node, our cassandra client appears to see this, reporting errors. We use LOCAL_QUORUM with a RF of 3 on all queries. Any idea why clients would see an error? If only one node reports an error, shouldn't the consistency level prevent the client from seeing an issue? If the client is talking to a broken/degraded coordinator node, RF/CL are unable to protect it from RPCTimeout. If it is unable to coordinate the request in a timely fashion, your clients will get errors. =Rob
setcompactionthroughput and setstreamthroughput have no effect
Running these 2 commands is a noop IO-wise:

nodetool setcompactionthroughput 0
nodetool setstreamthroughput 0

If trying to recover or rebuild nodes, it would be super helpful to get more than ~120mbit/s of streaming throughput (per session, or ~500mbit total) and ~5% IO utilization on an (8) 15k-disk RAID10 (per cf). Even enabling multithreaded_compaction gives marginal improvements (1 additional thread doesn't help all that much and was only measurable in CPU usage). I understand that these processes should take lower priority than servicing reads and writes. However, in emergencies it would be a nice feature to have a switch to recover a cluster ASAP. Thanks, John
Re: CQL Clarification
I think this is some confusion between the two different usages of "timestamp". The timestamp stored with the column value (not a column of the timestamp type) is stored at microsecond scale; it's just a 64-bit int, and we do not use it as a time value. Each mutation in a single request will have a different timestamp, as per https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/service/QueryState.java#L48 A column of type timestamp is internally stored as a DateType, which is milliseconds past the epoch: https://github.com/apache/cassandra/blob/cassandra-1.2/src/java/org/apache/cassandra/db/marshal/DateType.java Does that help? - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com

On 29/04/2013, at 3:42 AM, Michael Theroux mthero...@yahoo.com wrote: Hello, Just wondering if I can get a quick clarification on some simple CQL. We utilize Thrift CQL queries to access our cassandra setup. As clarified in a previous question I had, when using CQL and Thrift, timestamps on the cassandra column data are assigned by the server, not the client, unless AND TIMESTAMP is utilized in the query, for example: http://www.datastax.com/docs/1.0/references/cql/UPDATE According to the Datastax documentation, this timestamp should be: "Values serialized with the timestamp type are encoded as 64-bit signed integers representing a number of milliseconds since the standard base time known as the epoch: January 1 1970 at 00:00:00 GMT." However, my testing showed that updates didn't work when I used a timestamp of this format. Looking at the Cassandra code, it appears that cassandra will assign a timestamp of System.currentTimeMillis() * 1000 when a timestamp is not specified, which would be the number of microseconds since the standard base time. In my test environment, setting the timestamp to be the current time * 1000 seems to work. It seems that if you have an older installation without TIMESTAMP being specified in the CQL, or a mixed environment, the timestamp should be * 1000. Just making sure I'm reading everything properly... improperly setting the timestamp could cause us some serious damage. Thanks, -Mike
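A compact illustration of the distinction Aaron draws (the variable names are illustrative; only the two scales matter):

    import java.util.Date;

    public class TwoTimestamps {
        public static void main(String[] args) {
            // 1) The write timestamp attached to every column: an opaque 64-bit
            //    int compared only for conflict resolution; the server default
            //    is microseconds since the epoch.
            long writeTimestampMicros = System.currentTimeMillis() * 1000;

            // 2) A column whose CQL type is timestamp: serialized by DateType
            //    as milliseconds since the epoch, i.e. Date.getTime().
            long columnValueMillis = new Date().getTime();

            System.out.println("write timestamp (us):  " + writeTimestampMicros);
            System.out.println("timestamp column (ms): " + columnValueMillis);
        }
    }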
Re: setcompactionthroughput and setstreamthroughput have no effect
Out of curiosity, why did you decide to set it to 0 rather than 9? Does any documentation anywhere say that setting it to 0 disables the feature? I have set streamthroughput higher and seen node-join improvements. The features do work; however, they are probably not your limiting factor. Remember that for streaming you are setting megabytes per second, but network cards are measured in megabits per second. On Sun, Apr 28, 2013 at 5:28 PM, John Watson j...@disqus.com wrote: Running these 2 commands is a noop IO-wise: nodetool setcompactionthroughput 0 nodetool setstreamthroughput 0 If trying to recover or rebuild nodes, it would be super helpful to get more than ~120mbit/s of streaming throughput (per session, or ~500mbit total) and ~5% IO utilization on an (8) 15k-disk RAID10 (per cf). Even enabling multithreaded_compaction gives marginal improvements (1 additional thread doesn't help all that much and was only measurable in CPU usage). I understand that these processes should take lower priority than servicing reads and writes. However, in emergencies it would be a nice feature to have a switch to recover a cluster ASAP. Thanks, John
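The unit trap Edward mentions is easy to verify against the numbers from the thread:

    public class ThroughputUnits {
        public static void main(String[] args) {
            // nodetool setstreamthroughput takes megabytes per second, but the
            // ~120 mbit/s observed per session is megabits: divide by 8 to compare.
            double observedMbit = 120.0;
            double observedMB = observedMbit / 8.0; // = 15 MB/s
            System.out.printf("%.0f Mbit/s is only %.1f MB/s%n", observedMbit, observedMB);
        }
    }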
Re: question about internode_compression
It uses Snappy Compression with the default block size. There may be a case for allowing configuration, for example so the LZ4Compressor can be used. Feel free to raise a ticket at https://issues.apache.org/jira/browse/CASSANDRA Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 29/04/2013, at 8:39 AM, John Sanda john.sa...@gmail.com wrote: When internode_compression is enabled, will the compression algorithm used be the same as whatever I am using for sstable_compression? - John
Re: setcompactionthroughput and setstreamthroughput have no effect
The help command says 0 to disable:

setcompactionthroughput value_in_mb - Set the MB/s throughput cap for compaction in the system, or 0 to disable throttling.
setstreamthroughput value_in_mb - Set the MB/s throughput cap for streaming in the system, or 0 to disable throttling.

I also set both to 1000 and it had no effect either (just in case the documentation was incorrect.) On Sun, Apr 28, 2013 at 2:43 PM, Edward Capriolo edlinuxg...@gmail.com wrote: Out of curiosity, why did you decide to set it to 0 rather than 9? Does any documentation anywhere say that setting it to 0 disables the feature? I have set streamthroughput higher and seen node-join improvements. The features do work; however, they are probably not your limiting factor. Remember that for streaming you are setting megabytes per second, but network cards are measured in megabits per second. On Sun, Apr 28, 2013 at 5:28 PM, John Watson j...@disqus.com wrote: Running these 2 commands is a noop IO-wise: nodetool setcompactionthroughput 0 nodetool setstreamthroughput 0 If trying to recover or rebuild nodes, it would be super helpful to get more than ~120mbit/s of streaming throughput (per session, or ~500mbit total) and ~5% IO utilization on an (8) 15k-disk RAID10 (per cf). Even enabling multithreaded_compaction gives marginal improvements (1 additional thread doesn't help all that much and was only measurable in CPU usage). I understand that these processes should take lower priority than servicing reads and writes. However, in emergencies it would be a nice feature to have a switch to recover a cluster ASAP. Thanks, John
Re: cassandra-shuffle time to completion and required disk space
Can you provide some info on the number of nodes, node load, cluster load, etc.? AFAIK shuffle was not an easy thing to test and does not get much real-world use, as only some people will run it and they (normally) use it once. Any info you can provide may help improve the process. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 29/04/2013, at 9:21 AM, John Watson j...@disqus.com wrote: The amount of time/space cassandra-shuffle requires when upgrading to vnodes should really be made apparent in the documentation (when some is written). The only semi-noticeable remark about the exorbitant amount of time is a bullet point in http://wiki.apache.org/cassandra/VirtualNodes/Balance : "Shuffling will entail moving a lot of data around the cluster and so has the potential to consume a lot of disk and network I/O, and to take a considerable amount of time. For this to be an online operation, the shuffle will need to operate on a lower priority basis to other streaming operations, and should be expected to take days or weeks to complete." We tried running shuffle on a QA version of our cluster and 2 things were brought to light: - Even with no reads/writes it was going to take 20 days - Each machine needed enough free disk space to potentially hold the entire cluster's sstables on disk Regards, John
Re: Adding nodes in 1.2 with vnodes requires huge disks
On Sun, Apr 28, 2013 at 2:19 PM, aaron morton aa...@thelastpickle.com wrote: We're going to try running a shuffle before adding a new node again... maybe that will help I don't think it will hurt, but I doubt it will help. We had to bail on shuffle since we need to add capacity ASAP and not in 20 days. It seems when new nodes join, they are streamed *all* sstables in the cluster. How many nodes did you join, and what was num_tokens? Did you notice streaming from all nodes (in the logs), or are you saying this in response to the cluster load increasing? Was only adding 2 nodes at the time (planning to add a total of 12.) Starting with a cluster of 12, but now 11 since 1 node entered some weird state when one of the new nodes ran out of disk space. num_tokens is set to 256 on all nodes. Yes, nearly all current nodes were streaming to the new ones (which was great until disk space was an issue.) The purple line machine, I just stopped the joining process because the main cluster was dropping mutation messages at this point on a few nodes (and it still had dozens of sstables to stream.) Which were the new nodes? Can you show the output from nodetool status? The new nodes are the purple and gray lines above all the others. nodetool status doesn't show joining nodes. I think I saw a bug already filed for this, but I can't seem to find it. Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 27/04/2013, at 9:35 AM, Bryan Talbot btal...@aeriagames.com wrote: I believe that nodetool rebuild is used to add a new datacenter, not just a new host to an existing cluster. Is that what you ran to add the node? -Bryan On Fri, Apr 26, 2013 at 1:27 PM, John Watson j...@disqus.com wrote: Small relief we're not the only ones that had this issue. We're going to try running a shuffle before adding a new node again... maybe that will help - John On Fri, Apr 26, 2013 at 5:07 AM, Francisco Nogueira Calmon Sobral fsob...@igcorp.com.br wrote: I am using the same version and observed something similar. I've added a new node, but the instructions from Datastax did not work for me. Then I ran nodetool rebuild on the new node. After this command finished, it contained two times the load of the other nodes. Even when I ran nodetool cleanup on the older nodes, the situation was the same. The problem only seemed to disappear when nodetool repair was applied to all nodes. Regards, Francisco Sobral. On Apr 25, 2013, at 4:57 PM, John Watson j...@disqus.com wrote: After finally upgrading to 1.2.3 from 1.1.9, enabling vnodes, and running upgradesstables, I figured it would be safe to start adding nodes to the cluster. Guess not? It seems when new nodes join, they are streamed *all* sstables in the cluster. https://dl.dropbox.com/s/bampemkvlfck2dt/Screen%20Shot%202013-04-25%20at%2012.35.24%20PM.png The gray line machine ran out of disk space and for some reason cascaded into errors in the cluster about 'no host id' when trying to store hints for it (even though it hadn't joined yet). The purple line machine, I just stopped the joining process because the main cluster was dropping mutation messages at this point on a few nodes (and it still had dozens of sstables to stream.) I followed this: http://www.datastax.com/docs/1.2/operations/add_replace_nodes Is there something missing in that documentation? Thanks, John
Re: CQL Clarification
Yes, that does help. So, in the link I provided: http://www.datastax.com/docs/1.0/references/cql/UPDATE It states: "You can specify these options: Consistency level, Time-to-live (TTL), Timestamp for the written columns." Where "timestamp" is a link to "Working with dates and times" and mentions the 64-bit millisecond value. Is that incorrect? -Mike

On Apr 28, 2013, at 11:42 AM, Michael Theroux wrote: Hello, Just wondering if I can get a quick clarification on some simple CQL. We utilize Thrift CQL queries to access our cassandra setup. As clarified in a previous question I had, when using CQL and Thrift, timestamps on the cassandra column data are assigned by the server, not the client, unless AND TIMESTAMP is utilized in the query, for example: http://www.datastax.com/docs/1.0/references/cql/UPDATE According to the Datastax documentation, this timestamp should be: "Values serialized with the timestamp type are encoded as 64-bit signed integers representing a number of milliseconds since the standard base time known as the epoch: January 1 1970 at 00:00:00 GMT." However, my testing showed that updates didn't work when I used a timestamp of this format. Looking at the Cassandra code, it appears that cassandra will assign a timestamp of System.currentTimeMillis() * 1000 when a timestamp is not specified, which would be the number of microseconds since the standard base time. In my test environment, setting the timestamp to be the current time * 1000 seems to work. It seems that if you have an older installation without TIMESTAMP being specified in the CQL, or a mixed environment, the timestamp should be * 1000. Just making sure I'm reading everything properly... improperly setting the timestamp could cause us some serious damage. Thanks, -Mike
Re: cassandra-shuffle time to completion and required disk space
- 11 nodes
- 1 keyspace
- 256 vnodes per node
- upgraded 1.1.9 to 1.2.3 a week ago

These were taken just before starting shuffle (ran repair/cleanup the day before). During shuffle all reads/writes to the cluster were disabled.

nodetool status keyspace:

Load       Tokens  Owns (effective)  Host ID
80.95 GB   256     16.7%             754f9f4c-4ba7-4495-97e7-1f5b6755cb27
87.15 GB   256     16.7%             93f4400a-09d9-4ca0-b6a6-9bcca2427450
98.16 GB   256     16.7%             ff821e8e-b2ca-48a9-ac3f-8234b16329ce
142.6 GB   253     100.0%            339c474f-cf19-4ada-9a47-8b10912d5eb3
77.64 GB   256     16.7%             e59a02b3-8b91-4abd-990e-b3cb2a494950
194.31 GB  256     25.0%             6d726cbf-147d-426e-a735-e14928c95e45
221.94 GB  256     33.3%             83ca527c-60c5-4ea0-89a8-de53b92b99c8
87.61 GB   256     16.7%             c3ea4026-551b-4a14-a346-480e8c1fe283
101.02 GB  256     16.7%             df7ba879-74ad-400b-b371-91b45dcbed37
172.44 GB  256     25.0%             78192d73-be0b-4d49-a129-9bec0770efed
108.5 GB   256     16.7%             9889280a-1433-439e-bb84-6b7e7f44d761

nodetool status:

Load       Tokens  Owns   Host ID
142.6 GB   253     97.5%  339c474f-cf19-4ada-9a47-8b10912d5eb3
172.44 GB  256     0.1%   78192d73-be0b-4d49-a129-9bec0770efed
221.94 GB  256     0.4%   83ca527c-60c5-4ea0-89a8-de53b92b99c8
194.31 GB  256     0.1%   6d726cbf-147d-426e-a735-e14928c95e45
77.64 GB   256     0.3%   e59a02b3-8b91-4abd-990e-b3cb2a494950
87.15 GB   256     0.4%   93f4400a-09d9-4ca0-b6a6-9bcca2427450
98.16 GB   256     0.1%   ff821e8e-b2ca-48a9-ac3f-8234b16329ce
87.61 GB   256     0.3%   c3ea4026-551b-4a14-a346-480e8c1fe283
80.95 GB   256     0.4%   754f9f4c-4ba7-4495-97e7-1f5b6755cb27
108.5 GB   256     0.1%   9889280a-1433-439e-bb84-6b7e7f44d761
101.02 GB  256     0.3%   df7ba879-74ad-400b-b371-91b45dcbed37

Here's an image of the actual disk usage during shuffle: https://dl.dropbox.com/s/bx57j1z5c2spqo0/shuffle%20disk%20space.png A little after 00:00 I disabled/cleared the xfers and restarted the cluster (those drops around 00:15 are the restarts) before running cleanup. The disks are only 540G, and whenever cassandra runs out of disk space, bad things seem to happen. We were just barely able to run cleanup without running out of space after the failed shuffle. After the restart:

Load       Tokens  Owns (effective)  Host ID
131.73 GB  256     16.7%             754f9f4c-4ba7-4495-97e7-1f5b6755cb27
418.88 GB  255     16.7%             93f4400a-09d9-4ca0-b6a6-9bcca2427450
171.19 GB  255     8.5%              ff821e8e-b2ca-48a9-ac3f-8234b16329ce
142.61 GB  253     100.0%            339c474f-cf19-4ada-9a47-8b10912d5eb3
178.83 GB  257     24.9%             e59a02b3-8b91-4abd-990e-b3cb2a494950
442.32 GB  257     25.0%             6d726cbf-147d-426e-a735-e14928c95e45
185.28 GB  257     16.7%             c3ea4026-551b-4a14-a346-480e8c1fe283
274.47 GB  255     33.3%             83ca527c-60c5-4ea0-89a8-de53b92b99c8
210.73 GB  256     16.7%             df7ba879-74ad-400b-b371-91b45dcbed37
274.49 GB  256     25.0%             78192d73-be0b-4d49-a129-9bec0770efed
106.47 GB  256     16.7%             9889280a-1433-439e-bb84-6b7e7f44d761

It's currently still running cleanup, so output from status will be a little inaccurate. I have everything instrumented by Metrics being pushed into Graphite, so if there are graphs/data from there that may help, please let me know. Thanks, John

On Sun, Apr 28, 2013 at 2:52 PM, aaron morton aa...@thelastpickle.com wrote: Can you provide some info on the number of nodes, node load, cluster load, etc.? AFAIK shuffle was not an easy thing to test and does not get much real-world use, as only some people will run it and they (normally) use it once. Any info you can provide may help improve the process.
Cheers - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 29/04/2013, at 9:21 AM, John Watson j...@disqus.com wrote: The amount of time/space cassandra-shuffle requires when upgrading to vnodes should really be made apparent in the documentation (when some is written). The only semi-noticeable remark about the exorbitant amount of time is a bullet point in http://wiki.apache.org/cassandra/VirtualNodes/Balance : "Shuffling will entail moving a lot of data around the cluster and so has the potential to consume a lot of disk and network I/O, and to take a considerable amount of time. For this to be an online operation, the shuffle will need to operate on a lower priority basis to other streaming operations, and should be expected to take days or weeks to complete." We tried running shuffle on a QA version of our cluster and 2 things were brought to light: - Even with no reads/writes it was going to take 20 days - Each machine needed enough free disk space to potentially hold the entire cluster's sstables on disk Regards, John
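As a closing sanity check on the disk-space point, a small sketch summing the pre-shuffle loads John posted above against the 540G disks:

    public class ShuffleDiskCheck {
        public static void main(String[] args) {
            // Per-node loads (GB) from the pre-shuffle nodetool status output.
            double[] loadsGb = {80.95, 87.15, 98.16, 142.6, 77.64, 194.31,
                                221.94, 87.61, 101.02, 172.44, 108.5};
            double total = 0;
            for (double gb : loadsGb) total += gb;
            // ~1372 GB of cluster data vs 540 GB of local disk per node: if a
            // shuffle can stream a node an arbitrary slice of the cluster before
            // cleanup runs, exhausting disk is a real risk.
            System.out.printf("cluster total: %.1f GB; local disk: 540 GB%n", total);
        }
    }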