CQL indexing
Hi, In CQL, to perform a query filtered on a column you have to create an index on that column. What exactly happens when we create an index on a column? What might the index column family contain?
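For context: a secondary index is maintained internally as a hidden local column family in which each indexed value becomes a row key whose columns are the keys of the matching rows. A minimal CQL sketch of the pattern being asked about (table and column names are hypothetical):

    CREATE INDEX ON users (state);
    -- the index now lets this query run without specifying the row key
    SELECT * FROM users WHERE state = 'TX';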
Re: Unable to drop secondary index
On 26.04.2013 03:45, aaron morton wrote: You can drop the hints via JMX, or by stopping the node and deleting the SSTables. Thanks for the advice :-) That's more or less what I did. I paused hints delivery first, then upgraded the whole cluster to C* with the CASSANDRA-5179 patch applied, removing the SSTables before restart, so it's fine now :-) Now I'm leaving for 3 weeks and when I'm back I'll have to revisit the schemas problem - you can't get bored with Cassandra! ;-) M.
Re: Really odd issue (AWS related?)
In the top command, st is the time stolen from this VM by the hypervisor. jason On Fri, Apr 26, 2013 at 9:54 AM, Michael Theroux mthero...@yahoo.com wrote: Sorry, not sure what CPU steal is :) I have the AWS console with detailed monitoring enabled... things seem to track close to the minute, so I can see the CPU load go to 0... then jump at about the minute Cassandra reports the dropped messages. -Mike On Apr 25, 2013, at 9:50 PM, aaron morton wrote: The messages appear right after the node wakes up. Are you tracking CPU steal? - Aaron Morton Freelance Cassandra Consultant New Zealand @aaronmorton http://www.thelastpickle.com On 25/04/2013, at 4:15 AM, Robert Coli rc...@eventbrite.com wrote: On Wed, Apr 24, 2013 at 5:03 AM, Michael Theroux mthero...@yahoo.com wrote: Another related question. Once we see messages being dropped on one node, our Cassandra client appears to see this, reporting errors. We use LOCAL_QUORUM with an RF of 3 on all queries. Any idea why clients would see an error? If only one node reports an error, shouldn't the consistency level prevent the client from seeing an issue? If the client is talking to a broken/degraded coordinator node, RF/CL are unable to protect it from RPCTimeout. If it is unable to coordinate the request in a timely fashion, your clients will get errors. =Rob
Re: Deletes, null values
Of course: From CQL 2 (cqlsh -2): delete '183#16684','183#16714','183#16717' from myCF where key = 'all'; And selecting this data as follows gives me the result above: select '1228#16857','1228#16866','1228#16875','1237#16544','1237#16553' from myCF where key = 'all'; From thrift (phpCassa client): $pool = new ConnectionPool('myKeyspace', array('192.168.100.201'), 6, 0, 3, 3); $my_cf = new ColumnFamily($pool, 'myCF', true, true, ConsistencyLevel::QUORUM, ConsistencyLevel::QUORUM); $my_cf->remove('all', array('1228#16857','1228#16866','1228#16875')); 2013/4/25 Sorin Manolache sor...@gmail.com On 2013-04-25 11:48, Alain RODRIGUEZ wrote: Hi, I tried to delete some columns using CQL 2 as well as thrift on C* 1.2.2, and instead of being unreachable, the deleted columns have a null value. I use no value in this CF; the only information I use is the existence of the column. So when I select all the columns for a given key I have the following returned: 1228#16857 | 1228#16866 | 1228#16875 | 1237#16544 | 1237#16553 ------------+------------+------------+------------+------------ null | null | null | | This is quite annoying since my app thinks that I have 5 columns there when I should have only 2. I first thought that this was a visible marker of tombstones, but they didn't vanish after a major compaction. How can I get rid of these null/ghost columns and why does it happen? I do something similar but I don't see null values. Could you please post the code where you delete the columns? Sorin
Re: How to change existing cluster to multi-center
I just asked this exact same question, though perhaps after reading a bit more of the doc than you did. You may want to read this thread: http://grokbase.com/t/cassandra/user/134j85av4x/ec2snitch-to-ec2multiregionsnitch You may also want to read some doc; DataStax explains things quite well and updates the doc regularly. Hope this helps. 2013/4/25 Daning Wang dan...@netseer.com Hi All, We have an 8-node cluster (replication factor 3), with about 50G of data on each node. We need to change the cluster to a multi-center environment (to EC2); the data needs to have one replica on EC2. Here is the plan: - Change the cluster config to multi-center. - Add 2 or 3 nodes in the other center, which is EC2. - Change the replication factor to sync the data to the other center. We have not done the test yet; is this doable? The main concern is that since the connection to EC2 is slow, it will take a long time to stream the data (should be more than 100G) at the beginning. Anybody who has done this before, please shed some light. Thanks in advance, Daning
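The replication-factor step of such a plan is a schema change; a minimal CQL 3 sketch, with hypothetical keyspace and data center names (the real names must match what your snitch reports):

    ALTER KEYSPACE myKeyspace WITH replication =
      {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'ec2-east': 1};

After altering replication, nodetool rebuild on the new data center's nodes streams the existing data over (as also noted later in this digest).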
Re: Deletes, null values
I copied the wrong query: In CQL 2 it was: delete '1228#16857','1228#16866','1228#16875' from myCF where key = 'all'; Sorry about the mistake. 2013/4/26 Alain RODRIGUEZ arodr...@gmail.com [quoted message above]
Re: vnodes and load balancing - 1.2.4
Some extra information you could provide which will help debug this: the logs from those 3 nodes which have no data, and the output of nodetool ring. Before seeing those I can only guess, but my guess would be that in the logs on those 3 nodes you will see this: Calculating new tokens and this: Split previous range (blah, blah] into long list of tokens If that is the case then it means you accidentally started those three nodes with the default configuration (single token), subsequently changed num_tokens, and then joined them into the cluster. What happens when you do this is that the node thinks it used to be responsible for a single range and is being migrated to vnodes, so it splits its single range (now a very small part of the keyspace) into 256 smaller ranges, and ends up with just a tiny portion of the ring assigned to it. To fix this you'll need to decommission those 3 nodes, remove all data from them, then bootstrap them in again with the correct configuration from the start (see the config sketch below). Sam On 26 April 2013 06:07, David McNelis dmcne...@gmail.com wrote: So, I had 7 nodes that I set up using vnodes, 256 tokens each, no problem. I added two 512-token nodes, no problem, things seemed to balance. The next 3 nodes I added, all at 256 tokens, have a cumulative load of 116MB (whereas the other nodes are at ~100GB and ~200GB for 256 and 512 tokens respectively). Anyone else seen this in 1.2.4? The nodes seem to join the cluster OK, and I have num_tokens set and have tried both an empty initial_token and a commented-out initial_token, with no change. I see nothing streaming with netstats either, though these nodes were added days apart. At first I thought I must have a hot key or something, but that doesn't seem to be the case, since the node I thought that one was on has evened out over the past couple of days with no new nodes added. I really *DON'T* want to deal with another shuffle... but what options do I have, since vnodes supposedly make balancing the cluster unneeded? (which, at the moment, seems like a load of bullshit). -- Sam Overton Acunu | http://www.acunu.com | @acunu
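A sketch of the relevant cassandra.yaml lines for a vnode bootstrap, using the values discussed in this thread; they must be in place before the node's very first start:

    num_tokens: 256
    # initial_token: leave this commented out when using vnodes

If a node was ever started without num_tokens, wipe its data directories before bootstrapping it again with this configuration, as described above.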
Many creation/inserts in parallel
Hi All, We are testing Cassandra 1.2.3 (3 nodes with RF 2) with the FluentCassandra driver. First, many CFs are created in parallel (about 1000 CFs). After creation is done, many insertions of small amounts of data into the DB follow. During the tests we receive some exceptions from the driver, e.g.: FluentCassandra.Operations.CassandraOperationException: unconfigured columnfamily table_78_9 and FluentCassandra.Operations.CassandraOperationException: Connection to Cassandra has timed out Yet in Cassandra's logs there are no exceptions. What should we do to handle these exceptions? -- Best regards, Alexander
Re: Adding nodes in 1.2 with vnodes requires huge disks
I am using the same version and observed something similar. I added a new node, but the instructions from DataStax did not work for me. Then I ran nodetool rebuild on the new node. After this command finished, the node contained twice the load of the other nodes. Even when I ran nodetool cleanup on the older nodes, the situation was the same. The problem only seemed to disappear when nodetool repair was applied to all nodes. Regards, Francisco Sobral. On Apr 25, 2013, at 4:57 PM, John Watson j...@disqus.com wrote: After finally upgrading to 1.2.3 from 1.1.9, enabling vnodes, and running upgradesstables, I figured it would be safe to start adding nodes to the cluster. Guess not? It seems that when new nodes join, they are streamed *all* sstables in the cluster. https://dl.dropbox.com/s/bampemkvlfck2dt/Screen%20Shot%202013-04-25%20at%2012.35.24%20PM.png The gray-line machine ran out of disk space, and for some reason this cascaded into errors in the cluster about 'no host id' when trying to store hints for it (even though it hadn't joined yet). The purple-line machine, I just stopped the joining process because the main cluster was dropping mutation messages at this point on a few nodes (and it still had dozens of sstables to stream). I followed this: http://www.datastax.com/docs/1.2/operations/add_replace_nodes Is there something missing in that documentation? Thanks, John
Slow retrieval using secondary indexes
Hi all! We are using Cassandra 1.2.1 with an 8-node cluster running on Amazon. We started with 6 nodes and added the other 2 later. When performing some reads in Cassandra, we observed a large difference between gets using the primary key and gets using secondary indexes: [default@Sessions] get Users where mahoutUserid = 30127944399716352; --- RowKey: STQ0TTNII2LS211YYJI4GEV80M1SE8 = (column=mahoutUserid, value=30127944399716352, timestamp=1366820944696000) 1 Row Returned. Elapsed time: 3508 msec(s). [default@Sessions] get Users['STQ0TTNII2LS211YYJI4GEV80M1SE8']; = (column=mahoutUserid, value=30127944399716352, timestamp=1366820944696000) Returned 1 results. Elapsed time: 3.06 msec(s). In our model the secondary index is also unique, as the primary key is. Is it better, in this case, to create another CF mapping the secondary index to the key? Best regards, Francisco Sobral.
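The mapping CF suggested at the end of the message is the usual alternative for unique values; a minimal CQL 3 sketch, with hypothetical table and column names:

    -- manual index: one lookup row per unique mahout id
    CREATE TABLE users_by_mahout_id (
        mahout_userid bigint PRIMARY KEY,
        user_key      text
    );
    -- write it alongside the main row, then resolve in two direct key reads:
    SELECT user_key FROM users_by_mahout_id WHERE mahout_userid = 30127944399716352;

Both steps are then plain key lookups instead of a scatter-gather across the nodes' index column families, which is typically where built-in secondary index reads get slow on multi-node clusters.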
Re: Deletes, null values
On 2013-04-26 11:55, Alain RODRIGUEZ wrote: [quoted message above] I see. I'm sorry, I know nothing about phpCassa. I use batch_mutate with deletions and it works. But I guess phpCassa must use the same thrift primitives. Sorin
Latest PlayOrm released for cassandra and mongodb
PlayOrm now supports mongodb and cassandra with a query language that is portable across both systems as well. https://github.com/deanhiller/playorm Later, Dean
Re: Really odd issue (AWS related?)
Thanks. We weren't monitoring this value when the issue occurred, and this particular issue has not appeared for a couple of days (knock on wood). Will keep an eye out though, -Mike On Apr 26, 2013, at 5:32 AM, Jason Wee wrote: [quoted thread above]
Is Cassandra oversized for this kind of use case?
I hope the Cassandra community can help me reach a decision. The project I am working on is located in an industrial plant; machines are connected to a server, and every 5 minutes I get data from the machines about their status. We are talking about a production floor with 100+ machines, so the data volume is considerable: per machine, one row every 5th minute means 12 rows per hour, roughly 120 rows per production day; across all machines that is 12,000+ rows per day, which multiplied by 20 working days is 240,000 rows per month and 2,880,000 rows per year. I have to keep the last 3 years and must be able to do analytics on this data, so in the end I am dealing with roughly 10 million rows (12 columns holding text and numbers each). Okay, this kind of big data is not really big data, is it, but for me it is a lot of data to handle anyway. Currently I hold all this data in an Oracle database, but doing analytics on so many rows is not the good and modern way, I think. As the company is successful it will grow, meaning more machines and again more data to handle... So I thought maybe Big Data technologies are a possible solution for storing my data. Meanwhile I know Apache Hadoop is not the right tool for this kind of thing because it does not scale down. But maybe Cassandra? This is my question to you: do you think Cassandra is the right store for this kind of data? I am thinking about 2 nodes, maybe virtual. Let me know what you think. And if Cassandra is not the right tool, please tell me, and if you know any alternatives, please tell me those as well. Maybe I am already doing the right thing by storing that much data in an Oracle database, and maybe one of you is doing the same - if so, please let me know too. Thank you very much. Web: http://www.teufel.net
Re: Is Cassandra oversized for this kind of use case?
Well, it depends more on what you will do with the data. I was once on a Sybase (RDBMS) system with 1 billion rows, but it was getting close to not being able to handle more (constraints had to be turned off, all sorts of optimizations done, and expert consultants brought in). BUT there are other use cases that noSQL is great for (i.e. it is not just great for big-data-type systems). It is great for really high write throughput, as you can add more nodes and handle more writes/second than an RDBMS very easily, yet you may be doing so many deletes that the system constantly stays at a small data set. You may want to analyze the data constantly or in near real time, involving huge amounts of reads/second, in which case noSQL can be better as well. I.e. noSQL is not just for big data. I know with PlayOrm for Cassandra we have handled many different use cases out there. Later, Dean From: Marc Teufel teufel.m...@googlemail.com [quoted message above]
Re: Performance / limitations of WHERE ... IN queries
Thanks very much, Aaron, for your answer! Thierry You are effectively doing a multiget. Getting more than one row at a time is normally faster, but there will be a drop-off point where the improvements slow down. Run some tests. Also consider that each row you request creates RF commands spread around the thread pools. If one client requests 100s or 1000s of rows, this can delay other client requests. Cheers
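For reference, the kind of query under discussion, as a minimal CQL 3 sketch with a hypothetical table:

    SELECT * FROM users WHERE user_id IN (101, 102, 103);

The coordinator fans this out as one read per listed key (times RF replicas, per Aaron's note above), which is why very long IN lists can crowd out other requests.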
CQL update and TTL
Apparently when I update a column via CQL that already has a TTL, the update resets the TTL to null. So if there was already a TTL on all the columns I inserted as part of a composite column set, the specific column that I updated will not expire while the others are getting expired. Is this how it is expected to work, or is it a bug? Thanks in advance Shahryar
Re: CQL update and TTL
This seems to be the correct behavior. An update refreshes the TTL, as it does in memcache for example. What I do not know, though, is whether this behavior can be changed somehow to keep the initial TTL; this might be useful in some use cases. Alain 2013/4/26 Shahryar Sedghi shsed...@gmail.com [quoted message above]
Re: CQL update and TTL
The issue is, I can get the original TTL using a select and use it for the update; however, since the TTL cannot be a bind variable (using ?), this will exhaust the prepared statement cache, because I have tons of updates like this and every one will have a different signature due to the changing TTL. I am using 1.2.3 now. Thanks On Fri, Apr 26, 2013 at 11:35 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: [quoted thread above] -- Life is what happens while you are making other plans. ~ John Lennon
Re: CQL update and TTL
This is indeed intended. That behavior is largely dictated by how the storage engine works, and in particular by the fact that an update does no read internally. Yet, what I do not know is whether this behavior can be changed somehow to keep the initial TTL There's nothing like that supported, no. You have to read the value first to get its TTL, and then insert whatever update you want with the TTL you've just fetched. And since we couldn't do it much more efficiently server side, we prefer not doing it; that way the performance impact is very explicit. -- Sylvain 2013/4/26 Shahryar Sedghi shsed...@gmail.com [quoted thread above]
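A minimal CQL 3 sketch of the read-then-rewrite pattern Sylvain describes, with hypothetical table and column names:

    -- fetch the remaining TTL of the column being updated
    SELECT TTL(val) FROM myCF WHERE key = 'all';
    -- reissue the write, carrying the fetched TTL forward (here: 3600 seconds)
    UPDATE myCF USING TTL 3600 SET val = 'new value' WHERE key = 'all';

Note that TTL() returns the time remaining, so the rewritten column expires at roughly the original deadline rather than getting a fresh full TTL.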
Re: CQL update and TTL
That is more or less what I was guessing; thanks for the clarification. 2013/4/26 Sylvain Lebresne sylv...@datastax.com [quoted thread above]
Re: CQL update and TTL
is there a way to either make TTL dynamic (using ?) Not at this time. There is https://issues.apache.org/jira/browse/CASSANDRA-4450 open for that, but it's not done yet. tell the engine not to cache the prepared statement. I am using the new CQL Java Driver. In that case, just don't use a prepared statement; use a normal, non-prepared query. Yes, normal statements will be slightly slower, but if you really have to update a column while preserving its TTL, then as said above you will have to do a read followed by a write, so the whole thing won't be excessively efficient anyway, and hence I doubt not using prepared statements will be the blocking part performance-wise. -- Sylvain On Fri, Apr 26, 2013 at 11:42 AM, Sylvain Lebresne sylv...@datastax.com wrote: [quoted thread above]
cost estimate about some Cassandra patchs
Hi, We have created a new partitioner that groups some rows with *different* row keys on the same replicas. But neither batch_mutate nor multiget_slice is able to take advantage of this partitioner-defined placement to vectorize/batch communications between the coordinator and the replicas. Does anyone know enough of the inner workings of Cassandra to tell me how much work would be needed to patch Cassandra to enable such communication vectorization/batching? Thanks. Regards, Dominique
Re: Is Cassandra oversized for this kind of use case?
Okay, one billion rows of data is a lot; compared to that I am far, far away - does that mean I can stay with Oracle? Maybe. But you're right when you say it's not only about big data but also about your needs. Storing the data is one part; doing analytical analysis is the second. I do a lot of calculations and queries to generate management criteria about how production is going right now, and how production went over the last week, month, years, and so on. Saving in a 5-minute rhythm is only a compromise to reduce the amount of data - maybe in the future the use case will change to storing the status of each machine as soon as it changes. This will of course increase the amount of data and the complexity of my queries again. And sure, I show live data today... 5-minute-old live data... but if I tell the CEO that I am also able to work with real live data, I am sure this is what he wants ;-) Can you recommend Cassandra for this kind of scenario, or is it oversized? Does it make sense to start with 2 nodes? Can I virtualize these two nodes? Thanks a lot for your assistance. Marc 2013/4/26 Hiller, Dean dean.hil...@nrel.gov [quoted thread above] -- Mail: teufel.m...@gmail.com Web: http://www.teufel.net
Re: Is Cassandra oversized for this kind of use case?
I would at least start with 3 cheap nodes with RF=3 and CL=TWO on writes and reads while getting your feet wet. Don't buy very expensive computers like a lot of people do when getting into the game for the first time... every time I walk into a new gig, they seem to think they need to spend 6-10k per node. I think this kind of scenario sounds fine for Cassandra (a keyspace sketch follows below). When you say virtualize, I believe you mean use VMs... many use Amazon VMs, and there is stuff to configure if you are on Amazon specifically for this. If you are on your own VMs, you do need to worry about whether two nodes end up on the same hardware stealing resources from each other, or what happens if that hardware fails. I.e. the idea in noSQL is that you typically have 3 copies of all data, so if one node goes down, you are still live with CL=TWO. Also, plan on roughly 300GB per node, depending on how it works out in testing. Later, Dean From: Marc Teufel teufel.m...@googlemail.com [quoted thread above]
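A minimal cqlsh sketch of the starting point Dean describes, with a hypothetical keyspace name; the consistency level is a per-request client setting, shown here via the cqlsh CONSISTENCY command:

    CREATE KEYSPACE machine_data
      WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3};
    CONSISTENCY TWO;

With RF=3, reads and writes at consistency TWO keep succeeding while any single node is down.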
Re: vnodes and load balancing - 1.2.4
On Fri, Apr 26, 2013 at 3:48 AM, Sam Overton s...@acunu.com wrote: If that is the case then it means you accidentally started those three nodes with the default configuration (single-token) and then subsequently changed (num_tokens) and then joined them into the cluster. This would seem to be another reason why the debian package auto-starting cassandra could be hazardous? =Rob
Re: Adding nodes in 1.2 with vnodes requires huge disks
It's a small relief that we're not the only ones who had this issue. We're going to try running a shuffle before adding a new node again... maybe that will help. - John On Fri, Apr 26, 2013 at 5:07 AM, Francisco Nogueira Calmon Sobral fsob...@igcorp.com.br wrote: [quoted thread above]
Re: Adding nodes in 1.2 with vnodes requires huge disks
I believe that nodetool rebuild is used to add a new datacenter, not just a new host to an existing cluster. Is that what you ran to add the node? -Bryan On Fri, Apr 26, 2013 at 1:27 PM, John Watson j...@disqus.com wrote: [quoted thread above]