Re: How to speed up SELECT * query in Cassandra
Well, I have always wondered how Cassandra can be used in a Hadoop-like environment where you basically need to do a full table scan. I have to say that our experience is that Cassandra is perfect for writing and for reading specific values by key, but definitely not for reading all of the data out of it. Some of our projects found that doing that with a non-trivial amount of data in a timely manner is close to impossible in many situations. We are slowly moving to storing the data in HDFS and possibly reprocessing it on a daily basis for such use cases (statistics). This is nothing against Cassandra; it cannot be perfect for everything. But I am really interested in how it can work well with Spark/Hadoop, where you basically need to read all the data as well (as far as I understand it).

Jirka H.

On 02/11/2015 01:51 PM, DuyHai Doan wrote:
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra"
Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement?

On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"cassandra makes a very poor datawarehouse or long-term time series store"
Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it is only not good for storing temporary data. Netflix has a large amount of data and it is all stored in Cassandra, AFAIK.
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra."
I am not sure about the current state of Spark support for Cassandra, but I guess that if you create a map/reduce job, the intermediate map results will still be stored in HDFS, as happens with Hadoop; is this right?
I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part Spark or Hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results?

From: user@cassandra.apache.org
Subject: Re: How to speed up SELECT * query in Cassandra

I use Spark with Cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of Cassandra?), and my question is always: why aren't you updating both places at once? For example, we use Hadoop and Cassandra in conjunction with each other: we use a message bus to store every event in both, aggregate in both, but only keep current data in Cassandra (Cassandra makes a very poor data warehouse or long-term time series store), and then use services to process queries that merge data from Hadoop and Cassandra. Also, Spark on HDFS gives more flexibility in terms of large datasets and performance. The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra.

--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote:
On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"If you use Cassandra enterprise, you can use Hive, AFAIK."
Even better, you can use Spark/Shark with DSE.
Cheers,
Jens

--
Jens Rantil, Backend engineer, Tink AB
Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
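As an aside to the full-scan discussion above: the usual way to make a SELECT * scan tractable from a plain client (without Spark) is to split it into per-token-range queries that can run in parallel. A rough sketch of the range bookkeeping in Python, with the driver session left out and the table/column names purely illustrative:

```python
# Sketch: split a full-table scan into token-range slices on the
# Murmur3 ring. Only the range arithmetic is shown; a real client
# would run one query per range, in parallel, via a driver session.

MIN_TOKEN = -2**63       # Murmur3Partitioner lower bound
MAX_TOKEN = 2**63 - 1    # Murmur3Partitioner upper bound

def token_ranges(n_splits):
    """Yield (start, end] token ranges covering the whole ring."""
    step = (MAX_TOKEN - MIN_TOKEN) // n_splits
    start = MIN_TOKEN
    for i in range(n_splits):
        # Force the last range to end exactly at MAX_TOKEN.
        end = MAX_TOKEN if i == n_splits - 1 else start + step
        yield (start, end)
        start = end

def scan_queries(table, pk, n_splits):
    """Build one SELECT per token range (parameters bound separately)."""
    return [
        (f"SELECT * FROM {table} WHERE token({pk}) > ? AND token({pk}) <= ?",
         (lo, hi))
        for lo, hi in token_ranges(n_splits)
    ]
```

Note the simplification: the first range uses a strict `>` on MIN_TOKEN, so a row hashing exactly to the minimum token would be missed; real range-scanning clients special-case that wraparound.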
How to use CqlBulkOutputFormat with MultipleOutputs
I have been trying to import data into multiple column families through a Hadoop job. I was able to use CqlOutputFormat to move data to a single column family, but I don't think it supports imports to multiple column families. From some searching I saw that CqlBulkOutputFormat has support for writing to multiple column families, but I have not been able to get it working, and I could not find any examples of this. It would be great if someone could help me with an example of using CqlBulkOutputFormat with MultipleOutputs.
Re: changes to metricsReporterConfigFile requires restart of cassandra?
AFAIK, yes. If you only want a subset of the metrics, I would suggest exporting them all and filtering on the Graphite side.

On Wed, Feb 11, 2015 at 6:54 AM, Erik Forsberg forsb...@opera.com wrote:
Hi! I was pleased to find out that Cassandra 2.0.x has added support for pluggable metrics export, which even includes a Graphite metrics sender. Question: will changes to the metricsReporterConfigFile require a restart of Cassandra to take effect? I.e., if I want to add a new exported metric to that file, will I have to restart my cluster? Thanks, \EF
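For reference, the pluggable reporters discussed above are configured via the metrics-reporter-config library's YAML file. A sketch of what such a file looks like for a Graphite reporter (the hostname, prefix, and pattern below are placeholders; check the library's documentation for the exact schema):

```yaml
# Sketch of a metricsReporterConfigFile for a Graphite sender.
# Host, prefix and patterns are illustrative placeholders.
graphite:
  -
    period: 60
    timeunit: 'SECONDS'
    prefix: 'cassandra-node1'
    hosts:
      - host: 'graphite.example.com'
        port: 2003
    predicate:
      color: 'white'
      useQualifiedName: true
      patterns:
        - '^org.apache.cassandra.metrics.+'
```

The `predicate` patterns can filter at the source, but as noted above it is often simpler to ship everything and filter on the Graphite side.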
Re: best supported spark connector for Cassandra
I am using the Calliope cassandra-spark connector (http://tuplejump.github.io/calliope/), which is quite handy and easy to use! The only problem is that it is a bit outdated; it works with Spark 1.1.0. Hopefully a new version comes soon. best, /Shahab

On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
I just finished a Scala course; this will be a nice exercise to check what I learned :D Thanks for the answer!

From: user@cassandra.apache.org
Subject: Re: best supported spark connector for Cassandra

Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336 Start digging from this all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark. Take some time to dig down their code to understand the logic.

On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
Taking the opportunity that Spark was being discussed in another thread, I decided to start a new one, as I have an interest in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use Hadoop to process Cassandra data. My experience was horrible, and we reached the conclusion that it was faster to develop an internal tool than to insist on Hadoop _for our specific case_. From what I can see, Spark is starting to be known as a better Hadoop, and it seems the market is going this way now. I can also see that I have many more options for deciding how to integrate Cassandra using the Spark RDD concept than using the ColumnFamilyInputFormat.
I have found this Java driver made by DataStax: https://github.com/datastax/spark-cassandra-connector I have also found Python Cassandra support in Spark's repo, but it still seems experimental: https://github.com/apache/spark/tree/master/examples/src/main/python Finally, I have found Stratio Deep: https://github.com/Stratio/deep-spark It seems the Stratio guys have also forked Cassandra; I am still a little confused about that. Question: which driver should I use if I want to use Java? And which if I want to use Python? I think the way Spark integrates with Cassandra makes all the difference in the world, from my past experience, so I would like to know more about it, but I don't even know which source code I should start looking at... I would like to integrate using Python and/or C++, but I wonder whether it wouldn't pay off to use the Java driver instead. Thanks in advance
Re: Recommissioned a node
Yes, including the system and commitlog directories. Then when it starts, it is like a brand-new node and will bootstrap to join.

On Wed, Feb 11, 2015 at 8:56 AM, Stefano Ortolani ostef...@gmail.com wrote:
Hi Eric, thanks for your answer. The reason it got recommissioned was simply that the machine got restarted (with auto_bootstrap set to true). A cleaner, and correct, recommission would have just required wiping the data folder, am I correct? Or would I have needed to change something else in the node configuration? Cheers, Stefano

On Wed, Feb 11, 2015 at 6:47 AM, Eric Stevens migh...@gmail.com wrote:
AFAIK it should be OK after the repair completed (it was missing all writes while it was decommissioning and while it was offline, and nobody would have been keeping hinted handoffs for it, so repair was the right thing to do). Unless RF=N, you are now due for a cleanup on the other nodes. Generally speaking, though, this was probably not a good idea. When the node came back online, it rejoined the cluster immediately and would have been serving client requests without having a consistent view of the data. A safer approach would be to wipe the data directory and bootstrap it as a clean new member. I'm curious what prompted that cycle of decommission then recommission.

On Tue, Feb 10, 2015 at 10:13 PM, Stefano Ortolani ostef...@gmail.com wrote:
Hi, I recommissioned a node after decommissioning it. That happened (1) after a successful decommission (checked), (2) without wiping the data directory on the node, (3) simply by restarting the Cassandra service. The node now reports itself healthy and up and running. Knowing that I issued the repair command and patiently waited for its completion, can I assume the cluster and its internals (replicas, balance between those) to be healthy and as new? Regards, Stefano
Re: Recommissioned a node
It could, because the tombstones that mark data as deleted may have been removed; there would then be nothing saying that data is gone. If you are worried about it, turn up your gc_grace_seconds. Also, don't revive nodes back into a cluster with old data sitting on them.

On Wed Feb 11 2015 at 11:18:19 AM Stefano Ortolani ostef...@gmail.com wrote:
Hi Robert, it all happened within 30 minutes, so way before the default gc_grace_seconds (864000), so I should be fine. However, this is quite shocking if you ask me. The mere possibility of getting to an inconsistent state just by restarting a node is appalling... Can other people confirm that a restart after gc_grace_seconds had passed would have violated consistency permanently? Cheers, Stefano

On Wed, Feb 11, 2015 at 10:56 AM, Robert Coli rc...@eventbrite.com wrote:
On Tue, Feb 10, 2015 at 9:13 PM, Stefano Ortolani ostef...@gmail.com wrote:
"I recommissioned a node after decommissioning it. That happened (1) after a successful decommission (checked), (2) without wiping the data directory on the node, (3) simply by restarting the Cassandra service. The node now reports itself healthy and up and running. Knowing that I issued the repair command and patiently waited for its completion, can I assume the cluster and its internals (replicas, balance between those) to be healthy and as new?"
Did you recommission before or after gc_grace_seconds passed? If after, you have violated consistency in a manner that, in my understanding, one cannot recover from. If before, you're pretty much fine. However, this is a longstanding issue that I personally consider a bug: your decommissioned node doesn't forget its state. In my opinion, once you tell it to leave the cluster, it should forget everything it knew as a member of that cluster. If you file this behavior as a JIRA bug, please let the list know. =Rob
Re: Pagination support on Java Driver Query API
Hi Eric, thanks for your reply. I am using Cassandra 2.0.11, and in that version I cannot append a condition on the last clustering key column value of the last row in the previous batch. It fails with "Preceding column is either not restricted or by a non-EQ relation", which means I need to specify an equality condition for all preceding clustering key columns. With this I cannot get the pagination correct. Thanks, Ajay

"I can't believe that everyone reads/processes all rows at once (without pagination)."

Probably not too many people try to read all rows in a table as a single rolling operation with a standard client driver. But those who do would use token() to keep track of where they are and be able to resume with that as well. It sounds, though, like you're talking about paginating a subset of data: larger than you want to process as a unit, but prefiltered by some other criteria which prevents you from being able to rely on token(). For this there is no general-purpose solution, but it typically involves maintaining your own paging state: keeping track of the last partitioning and clustering key seen, and using that to construct your next query. For example, we have client queries which can span several partitioning keys. We make sure that the list of partition keys generated by a given client query, List(Pq), is deterministic; then our paging state is the index offset of the final Pq in the response, plus the value of the final clustering column. A query coming in with a paging state attached to it starts the next set of queries from the provided Pq offset, where clusteringKey > the provided value. So if you can just track the partition key offset (if spanning multiple partitions) and the clustering key offset, you can construct your next query from those instead.

On Tue, Feb 10, 2015 at 6:58 PM, Ajay ajay.ga...@gmail.com wrote:
Thanks, Alex. But is there any workaround possible? I can't believe that everyone reads/processes all rows at once (without pagination).
Thanks, Ajay

On Feb 10, 2015 11:46 PM, Alex Popescu al...@datastax.com wrote:
On Tue, Feb 10, 2015 at 4:59 AM, Ajay ajay.ga...@gmail.com wrote:
"1) The Java driver implicitly supports pagination in the ResultSet (using Iterator), which can be controlled through the fetch size. But it is limited in that we cannot skip or go back; the fetch state is not exposed."
Cassandra doesn't support skipping, so this is not really a limitation of the driver.
--
[:-a) Alex Popescu, Sen. Product Manager @ DataStax, @al3xandru
To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-user+unsubscr...@lists.datastax.com.
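Eric's paging-state bookkeeping above can be sketched in plain Python. The CQL query itself is stubbed out as a `rows_for` callback and all names are illustrative; the point is that, across a deterministic list of partition keys, the only state a client has to persist between pages is (index of last partition key, last clustering value seen):

```python
# Sketch of client-side paging state: (pq_index, last_clustering).
# rows_for(pk, after) stands in for a CQL query returning ordered
# clustering values strictly greater than `after` for partition pk.

def next_page(partition_keys, rows_for, page_size, state=None):
    """Return (rows, new_state); new_state is None when exhausted."""
    idx, after = state if state else (0, None)
    page = []
    while idx < len(partition_keys) and len(page) < page_size:
        # A real client would push this truncation down as a CQL LIMIT.
        got = rows_for(partition_keys[idx], after)[: page_size - len(page)]
        page.extend((partition_keys[idx], ck) for ck in got)
        if got and len(page) == page_size:
            # Page full: resume later within this partition.
            return page, (idx, got[-1])
        # Partition exhausted (or empty): move to the next one.
        idx, after = idx + 1, None
    return page, None
```

A resumed call simply passes the returned state back in, so no server-side cursor is needed.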
Re: Pagination support on Java Driver Query API
Basically I am trying different queries with your approach. One such query is like: SELECT * FROM mycf WHERE <condition on partition key> ORDER BY ck1 ASC, ck2 DESC, where ck1 and ck2 are clustering keys in that order. How do we achieve pagination support here? Thanks, Ajay

On Feb 11, 2015 11:16 PM, Ajay ajay.ga...@gmail.com wrote:
Hi Eric, thanks for your reply. I am using Cassandra 2.0.11, and in that version I cannot append a condition on the last clustering key column value of the last row in the previous batch. It fails with "Preceding column is either not restricted or by a non-EQ relation", which means I need to specify an equality condition for all preceding clustering key columns. With this I cannot get the pagination correct. Thanks, Ajay

"I can't believe that everyone reads/processes all rows at once (without pagination)."

Probably not too many people try to read all rows in a table as a single rolling operation with a standard client driver. But those who do would use token() to keep track of where they are and be able to resume with that as well. It sounds, though, like you're talking about paginating a subset of data: larger than you want to process as a unit, but prefiltered by some other criteria which prevents you from being able to rely on token(). For this there is no general-purpose solution, but it typically involves maintaining your own paging state: keeping track of the last partitioning and clustering key seen, and using that to construct your next query. For example, we have client queries which can span several partitioning keys. We make sure that the list of partition keys generated by a given client query, List(Pq), is deterministic; then our paging state is the index offset of the final Pq in the response, plus the value of the final clustering column. A query coming in with a paging state attached to it starts the next set of queries from the provided Pq offset, where clusteringKey > the provided value.
So if you can just track the partition key offset (if spanning multiple partitions) and the clustering key offset, you can construct your next query from those instead.

On Tue, Feb 10, 2015 at 6:58 PM, Ajay ajay.ga...@gmail.com wrote:
Thanks, Alex. But is there any workaround possible? I can't believe that everyone reads/processes all rows at once (without pagination). Thanks, Ajay

On Feb 10, 2015 11:46 PM, Alex Popescu al...@datastax.com wrote:
On Tue, Feb 10, 2015 at 4:59 AM, Ajay ajay.ga...@gmail.com wrote:
"1) The Java driver implicitly supports pagination in the ResultSet (using Iterator), which can be controlled through the fetch size. But it is limited in that we cannot skip or go back; the fetch state is not exposed."
Cassandra doesn't support skipping, so this is not really a limitation of the driver.
--
[:-a) Alex Popescu, Sen. Product Manager @ DataStax, @al3xandru
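For the mixed-order case asked about above (ORDER BY ck1 ASC, ck2 DESC), the same last-row bookkeeping still works: the next page is every row sorting after the last (ck1, ck2) seen, i.e. ck1 > c1, or ck1 = c1 and ck2 < c2. CQL cannot express that OR in one statement, so a client would typically issue two queries per page (the "same ck1, smaller ck2" remainder, then "ck1 > c1"). A sketch of the predicate over an in-memory list, with illustrative names and numeric clustering values assumed:

```python
# Sketch: resuming a page under ORDER BY ck1 ASC, ck2 DESC.
# A row sorts after (c1, c2) iff ck1 > c1, or ck1 == c1 and ck2 < c2.

def sort_key(row):
    ck1, ck2 = row
    return (ck1, -ck2)  # ASC on ck1, DESC on ck2 (numeric ck2 assumed)

def rows_after(rows, last):
    """All rows that would appear after `last` in the query's order."""
    c1, c2 = last
    return [r for r in sorted(rows, key=sort_key)
            if r[0] > c1 or (r[0] == c1 and r[1] < c2)]
```

Against Cassandra, `rows_after` would become those two CQL queries concatenated, with a LIMIT on each.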
Re: Recommissioned a node
And after decreasing your RF (rare, but it happens).

On Wed Feb 11 2015 at 11:31:38 AM Robert Coli rc...@eventbrite.com wrote:
On Wed, Feb 11, 2015 at 11:20 AM, Jonathan Haddad j...@jonhaddad.com wrote:
"It could, because the tombstones that mark data as deleted may have been removed; there would then be nothing saying that data is gone. If you are worried about it, turn up your gc_grace_seconds. Also, don't revive nodes back into a cluster with old data sitting on them."
Also, run cleanup after range movements: https://issues.apache.org/jira/browse/CASSANDRA-7764 =Rob
Re: How to speed up SELECT * query in Cassandra
For your information, Colin: http://en.wikipedia.org/wiki/List_of_fallacies. Look at "Burden of proof": http://en.wikipedia.org/wiki/Philosophic_burden_of_proof You stated: "The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra". It's up to YOU to prove it right, not up to me to prove it wrong. All the other blah blah is trolling. Come back to me once you have some decent benchmarks supporting your statement; until then, the question is closed.

On Wed, Feb 11, 2015 at 3:17 PM, Colin co...@clark.ws wrote:
Did you want me to include specific examples from my employment at DataStax, or start from the ground up? All Spark on Cassandra is, is a better alternative to the previous use of Hive. The fact that DataStax hasn't provided any benchmarks themselves, other than glossy marketing statements, pretty much says it all: where are your benchmarks? Maybe you could combine it with the in-memory option to really boogie... :) (If I find time, I might just write a blog post about exactly how to do this. It involves the use of Parquet and partitioning with clustering, it doesn't cost anything to do, and it's in production at my company.)
--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 6:51 AM, DuyHai Doan doanduy...@gmail.com wrote:
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra"
Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement?

On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"cassandra makes a very poor datawarehouse or long-term time series store"
Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it is only not good for storing temporary data.
Netflix has a large amount of data and it is all stored in Cassandra, AFAIK.
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra."
I am not sure about the current state of Spark support for Cassandra, but I guess that if you create a map/reduce job, the intermediate map results will still be stored in HDFS, as happens with Hadoop; is this right? I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part Spark or Hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results?

From: user@cassandra.apache.org
Subject: Re: How to speed up SELECT * query in Cassandra

I use Spark with Cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of Cassandra?), and my question is always: why aren't you updating both places at once? For example, we use Hadoop and Cassandra in conjunction with each other: we use a message bus to store every event in both, aggregate in both, but only keep current data in Cassandra (Cassandra makes a very poor data warehouse or long-term time series store), and then use services to process queries that merge data from Hadoop and Cassandra. Also, Spark on HDFS gives more flexibility in terms of large datasets and performance. The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra.
--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote:
On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"If you use Cassandra enterprise, you can use Hive, AFAIK."
Even better, you can use Spark/Shark with DSE.
Cheers,
Jens

--
Jens Rantil, Backend engineer, Tink AB
Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
Adding new node - OPSCenter problems
Hello all, I have added 3 new nodes to an existing cluster. I must point out that I copied the cassandra.yaml file from an existing node and just changed listen_address, per the instructions here: "Adding nodes to an existing cluster" (DataStax Cassandra 2.0 documentation on www.datastax.com; steps to add nodes when using virtual nodes). I installed the DataStax agents, but in OpsCenter I see the new nodes in a different, empty cluster. Also, OpsCenter does not see the datacenter for the new nodes. They are all in the same datacenter, and the cluster name is the same for all 9 nodes. Any ideas?
changes to metricsReporterConfigFile requires restart of cassandra?
Hi! I was pleased to find out that Cassandra 2.0.x has added support for pluggable metrics export, which even includes a Graphite metrics sender. Question: will changes to the metricsReporterConfigFile require a restart of Cassandra to take effect? I.e., if I want to add a new exported metric to that file, will I have to restart my cluster? Thanks, \EF
Re: How to speed up SELECT * query in Cassandra
Did you want me to include specific examples from my employment at DataStax, or start from the ground up? All Spark on Cassandra is, is a better alternative to the previous use of Hive. The fact that DataStax hasn't provided any benchmarks themselves, other than glossy marketing statements, pretty much says it all: where are your benchmarks? Maybe you could combine it with the in-memory option to really boogie... :) (If I find time, I might just write a blog post about exactly how to do this. It involves the use of Parquet and partitioning with clustering, it doesn't cost anything to do, and it's in production at my company.)
--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 6:51 AM, DuyHai Doan doanduy...@gmail.com wrote:
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra"
Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement?

On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"cassandra makes a very poor datawarehouse or long-term time series store"
Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it is only not good for storing temporary data. Netflix has a large amount of data and it is all stored in Cassandra, AFAIK.
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra."
I am not sure about the current state of Spark support for Cassandra, but I guess that if you create a map reduce job, the intermediate map results will still be stored in HDFS, as happens with Hadoop; is this right?
I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part Spark or Hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results?

From: user@cassandra.apache.org
Subject: Re: How to speed up SELECT * query in Cassandra

I use Spark with Cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of Cassandra?), and my question is always: why aren't you updating both places at once? For example, we use Hadoop and Cassandra in conjunction with each other: we use a message bus to store every event in both, aggregate in both, but only keep current data in Cassandra (Cassandra makes a very poor data warehouse or long-term time series store), and then use services to process queries that merge data from Hadoop and Cassandra. Also, Spark on HDFS gives more flexibility in terms of large datasets and performance. The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra.
--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote:
On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"If you use Cassandra enterprise, you can use Hive, AFAIK."
Even better, you can use Spark/Shark with DSE.
Cheers,
Jens
--
Jens Rantil, Backend engineer, Tink AB
Email: jens.ran...@tink.se
Phone: +46 708 84 18 32
Web: www.tink.se
Re: Adding new node - OPSCenter problems
Hello, nodetool status shows the existing nodes as UN and the 3 new ones as UJ. What is strange is that in the Owns column the 3 new nodes have "?" instead of a percentage value. In the Rack column I see that all of them are in rack1.

On Wednesday, February 11, 2015 4:50 PM, Carlos Rolo r...@pythian.com wrote:
Hello, what is the output of nodetool status? All nodes should appear; otherwise there is some configuration error. Regards,
Carlos Juzarte Rolo, Cassandra Consultant, Pythian - Love your data
rolo@pythian | Twitter: cjrolo | LinkedIn: linkedin.com/in/carlosjuzarterolo
Tel: 1649 | www.pythian.com

On Wed, Feb 11, 2015 at 3:46 PM, Batranut Bogdan batra...@yahoo.com wrote:
Hello all, I have added 3 new nodes to an existing cluster. I must point out that I copied the cassandra.yaml file from an existing node and just changed listen_address, per the instructions here: "Adding nodes to an existing cluster" (DataStax Cassandra 2.0 documentation on www.datastax.com). I installed the DataStax agents, but in OpsCenter I see the new nodes in a different, empty cluster. Also, OpsCenter does not see the datacenter for the new nodes. They are all in the same datacenter, and the cluster name is the same for all 9 nodes. Any ideas?
Re: How to speed up SELECT * query in Cassandra
No, the question isn't closed. You don't get to decide that. I don't run a website making claims regarding Cassandra and Spark; your employer does. Again, where are your benchmarks? I will publish mine, then we'll see what you've got.
--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 8:39 AM, DuyHai Doan doanduy...@gmail.com wrote:
For your information, Colin: http://en.wikipedia.org/wiki/List_of_fallacies. Look at "Burden of proof". You stated: "The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra". It's up to YOU to prove it right, not up to me to prove it wrong. All the other blah blah is trolling. Come back to me once you have some decent benchmarks supporting your statement; until then, the question is closed.

On Wed, Feb 11, 2015 at 3:17 PM, Colin co...@clark.ws wrote:
Did you want me to include specific examples from my employment at DataStax, or start from the ground up? All Spark on Cassandra is, is a better alternative to the previous use of Hive. The fact that DataStax hasn't provided any benchmarks themselves, other than glossy marketing statements, pretty much says it all: where are your benchmarks? Maybe you could combine it with the in-memory option to really boogie... :) (If I find time, I might just write a blog post about exactly how to do this. It involves the use of Parquet and partitioning with clustering, it doesn't cost anything to do, and it's in production at my company.)
--
Colin Clark
+1 612 859 6129
Skype colin.p.clark

On Feb 11, 2015, at 6:51 AM, DuyHai Doan doanduy...@gmail.com wrote:
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra"
Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement?
On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:
"cassandra makes a very poor datawarehouse or long-term time series store"
Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it is only not good for storing temporary data. Netflix has a large amount of data and it is all stored in Cassandra, AFAIK.
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra."
I am not sure about the current state of Spark support for Cassandra, but I guess that if you create a map/reduce job, the intermediate map results will still be stored in HDFS, as happens with Hadoop; is this right? I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part Spark or Hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results?

From: user@cassandra.apache.org
Subject: Re: How to speed up SELECT * query in Cassandra

I use Spark with Cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of Cassandra?), and my question is always: why aren't you updating both places at once? For example, we use Hadoop and Cassandra in conjunction with each other: we use a message bus to store every event in both, aggregate in both, but only keep current data in Cassandra (Cassandra makes a very poor data warehouse or long-term time series store), and then use services to process queries that merge data from Hadoop and Cassandra. Also, Spark on HDFS gives more flexibility in terms of large datasets and performance.
The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra -- Colin Clark +1 612 859 6129 Skype colin.p.clark On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote: On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: If you use Cassandra enterprise, you can use hive, AFAIK. Even better, you can use Spark/Shark with DSE. Cheers, Jens -- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook Linkedin Twitter
Re: best supported spark connector for Cassandra
I just finished a scala course; nice exercise to check what I learned :D Thanks for the answer! From: user@cassandra.apache.org Subject: Re: best supported spark connector for Cassandra Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336 Start digging from this all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark. Take some time to dig through their code to understand the logic. On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: Taking the opportunity that Spark was being discussed in another thread, I decided to start a new one, as I have interest in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use hadoop to process Cassandra data. My experience was horrible and we reached the conclusion it was faster to develop an internal tool than insist on Hadoop _for our specific case_. From what I can see, Spark is starting to be known as a better hadoop, and it seems the market is going this way now. I can also see I have many more options to decide how to integrate Cassandra using the Spark RDD concept than using the ColumnFamilyInputFormat. I have found this java driver made by Datastax: https://github.com/datastax/spark-cassandra-connector I have also found python Cassandra support in spark's repo, but it seems experimental yet: https://github.com/apache/spark/tree/master/examples/src/main/python Finally, I have found stratio deep: https://github.com/Stratio/deep-spark It seems the Stratio guys have forked Cassandra as well; I am still a little confused about it. 
Question: which driver should I use if I want to use Java? And which if I want to use python? I think the way Spark integrates with Cassandra makes all the difference in the world, from my past experience, so I would like to know more about it, but I don't even know which source code I should start looking at... I would like to integrate using python and/or C++, but I wonder if it wouldn't pay off to use the java driver instead. Thanks in advance
Re: Adding new node - OPSCenter problems
Hello, What is the output of nodetool status? All nodes should appear, otherwise there is some configuration error. Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com On Wed, Feb 11, 2015 at 3:46 PM, Batranut Bogdan batra...@yahoo.com wrote: Hello all, I have added 3 new nodes to the existing cluster. I must point out that I have copied the cassandra.yaml file from an existing node and just changed listen_address per the instructions here: Adding nodes to an existing cluster | DataStax Cassandra 2.0 Documentation http://www.datastax.com/documentation/cassandra/2.0/cassandra/operations/ops_add_node_to_cluster_t.html I installed the DataStax agents, but in OpsCenter I see the new nodes in a different, empty cluster. Also, OpsCenter does not see the datacenter for the new nodes. They are all in the same datacenter, and the name of the cluster is the same for all 9 nodes. Any ideas?
Re: Recommissioned a node
Hi Eric, thanks for your answer. The reason why it got recommissioned was simply that the machine got restarted (with auto_bootstrap set to true). A cleaner, and correct, recommission would have just required wiping the data folder, am I correct? Or would I have needed to change something else in the node configuration? Cheers, Stefano On Wed, Feb 11, 2015 at 6:47 AM, Eric Stevens migh...@gmail.com wrote: AFAIK it should be ok after the repair completed (it was missing all writes while it was decommissioning and while it was offline, and nobody would have been keeping hinted handoffs for it, so repair was the right thing to do). Unless RF=N, you're now due for a cleanup on the other nodes. Generally speaking though, this was probably not a good idea. When the node came back online, it rejoined the cluster immediately and would have been serving client requests without having a consistent view of the data. A safer approach would be to wipe the data directory and bootstrap it as a clean new member. I'm curious what prompted that cycle of decommission then recommission. On Tue, Feb 10, 2015 at 10:13 PM, Stefano Ortolani ostef...@gmail.com wrote: Hi, I recommissioned a node after decommissioning it. That happened (1) after a successful decommission (checked), (2) without wiping the data directory on the node, (3) simply by restarting the cassandra service. The node now reports itself healthy and up and running. Knowing that I issued the repair command and patiently waited for its completion, can I assume the cluster, and its internals (replicas, balance between those), to be healthy and as new? Regards, Stefano
Re: Pagination support on Java Driver Query API
"I can't believe that everyone reads / processes all rows at once (without pagination)." Probably not too many people try to read all rows in a table as a single rolling operation with a standard client driver. But those who do would use token() to keep track of where they are and be able to resume with that as well. But it sounds like you're talking about paginating a subset of data - larger than you want to process as a unit, but prefiltered by some other criteria which prevents you from being able to rely on token(). For this there is no general-purpose solution, but it typically involves maintaining your own paging state, typically keeping track of the last partitioning and clustering key seen, and using that to construct your next query. For example, we have client queries which can span several partitioning keys. We make sure that the list of partition keys generated by a given client query, List(Pq), is deterministic; then our paging state is the index offset of the final Pq in the response, plus the value of the final clustering column. A query coming in with a paging state attached to it starts the next set of queries from the provided Pq offset where clusteringKey > the provided value. So if you can just track the partition key offset (if spanning multiple partitions) and the clustering key offset, you can construct your next query from those instead. On Tue, Feb 10, 2015 at 6:58 PM, Ajay ajay.ga...@gmail.com wrote: Thanks Alex. But is there any workaround possible? I can't believe that everyone reads / processes all rows at once (without pagination). Thanks Ajay On Feb 10, 2015 11:46 PM, Alex Popescu al...@datastax.com wrote: On Tue, Feb 10, 2015 at 4:59 AM, Ajay ajay.ga...@gmail.com wrote: 1) The Java driver implicitly supports pagination in the ResultSet (using Iterator), which can be controlled through FetchSize. But it is limited in a way that we cannot skip or go back; the FetchState is not exposed. 
Cassandra doesn't support skipping so this is not really a limitation of the driver. -- [:-a) Alex Popescu Sen. Product Manager @ DataStax @al3xandru To unsubscribe from this group and stop receiving emails from it, send an email to java-driver-user+unsubscr...@lists.datastax.com.
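The manual paging-state approach described above can be sketched driver-agnostically. The function below only builds the next CQL string from a (partition-key offset, last clustering key) pair; the name `next_page_query` and the `pk`/`ck` column names are illustrative assumptions, not part of any driver API:

```python
# Hypothetical sketch: resume a multi-partition scan from a saved paging
# state of the form (index of the partition key being served, last
# clustering key seen in that partition), as described in the thread.

def next_page_query(table, partition_keys, paging_state, page_size):
    """Build the next SELECT from paging_state, or the initial one if None."""
    if paging_state is None:
        pq_index, last_ck = 0, None
    else:
        pq_index, last_ck = paging_state
    pk = partition_keys[pq_index]
    if last_ck is None:
        # First page for this partition key.
        return "SELECT * FROM %s WHERE pk = %s LIMIT %d" % (table, pk, page_size)
    # Resume within the same partition, past the last clustering key seen.
    return ("SELECT * FROM %s WHERE pk = %s AND ck > %s LIMIT %d"
            % (table, pk, last_ck, page_size))

first = next_page_query("events", [101, 102], None, 50)
resumed = next_page_query("events", [101, 102], (1, 42), 50)
```

The caller advances the partition-key offset once a page comes back with fewer than `page_size` rows, i.e. when the current partition is exhausted.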
Safely delete tmplink files - 2.1.2
Hi, we are expecting the 2.1.3 release to fix the deletion of tmplink files. In the meantime, is it safe to delete these files without shutting down Cassandra? Thanks,
Re: Two problems with Cassandra
On Wed, Feb 11, 2015 at 2:22 AM, Pavel Velikhov pavel.velik...@gmail.com wrote: 2. While trying to update the full dataset with a simple transformation (again via the python driver), single-node and clustered Cassandra run out of memory no matter what settings I try, even if I put a lot of sleeps into the mix. However, simpler transformations (updating just one column, especially when there is a lot of processing overhead) work just fine. What does a simple transformation mean here? Assuming a reasonably sized heap, OOM sounds like you're trying to update a large number of large partitions in a single operation. In general, in Cassandra, you're best off interacting with a single or small number of partitions in any given interaction. =Rob
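Rob's advice, touching a bounded number of rows per interaction instead of materializing everything, can be illustrated with a small driver-agnostic batching helper (the helper name is an assumption; with the python driver you would feed it a paged result set):

```python
from itertools import islice

def in_batches(rows, batch_size):
    """Yield lists of at most batch_size rows; only one batch is held in
    memory at a time, so a full-table update never materializes the whole
    result set (rows may be any iterator, e.g. a paged driver result)."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch

# Transform rows a bounded batch at a time instead of all at once.
batches = list(in_batches(range(10), 4))
```

Each batch can then be transformed and written back before the next one is fetched, which keeps heap usage on both the client and the coordinator roughly constant.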
Re: Safely delete tmplink files - 2.1.2
On Wed, Feb 11, 2015 at 7:52 AM, Demian Berjman dberj...@despegar.com wrote: Hi, we are expecting the 2.1.3 release to fix the deletion of tmplink files. In the meantime, is it safe to delete these files without shutting down Cassandra? If I were experiencing issues with 2.1.2, I would downgrade to 2.1.1, FWIW. My belief is that it is safe to delete these files (but they may not actually be deleted, because Cassandra may have them open). Before doing anything in production I would ask on the JIRA ticket for the issue. Also, in case you are running 2.1.2 in production... https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/ =Rob
Re: How to speed up SELECT * query in Cassandra
Hi, here are some snippets of code in Scala which should get you started. Jirka H.

loop { lastRow =>
  val query = lastRow match {
    case Some(row) => nextPageQuery(row, upperLimit)
    case None => initialQuery(lowerLimit)
  }
  session.execute(query).all
}

private def nextPageQuery(row: Row, upperLimit: String): String = {
  val tokenPart = "token(%s) > token(0x%s) and token(%s) < %s"
    .format(rowKeyName, hex(row.getBytes(rowKeyName)), rowKeyName, upperLimit)
  basicQuery.format(tokenPart)
}

private def initialQuery(lowerLimit: String): String = {
  val tokenPart = "token(%s) >= %s".format(rowKeyName, lowerLimit)
  basicQuery.format(tokenPart)
}

private def calculateRanges: (BigDecimal, BigDecimal, IndexedSeq[(BigDecimal, BigDecimal)]) = {
  tokenRange match {
    case Some((start, end)) =>
      Logger.info("Token range given: {}", "(" + start.underlying.toPlainString + ", " + end.underlying.toPlainString + ")")
      val tokenSpaceSize = end - start
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency) yield (start + (i * rangeSize), start + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
    case None =>
      val tokenSpaceSize = partitioner.max - partitioner.min
      val rangeSize = tokenSpaceSize / concurrency
      val ranges = for (i <- 0 until concurrency) yield (partitioner.min + (i * rangeSize), partitioner.min + ((i + 1) * rangeSize))
      (tokenSpaceSize, rangeSize, ranges)
  }
}

private val basicQuery = {
  "select %s, %s, %s, writetime(%s) from %s where %s%s limit %d%s".format(
    rowKeyName,
    columnKeyName,
    columnValueName,
    columnValueName,
    columnFamily,
    "%s", // template
    whereCondition,
    pageSize,
    if (cqlAllowFiltering) " allow filtering" else "")
}

case object Murmur3 extends Partitioner {
  override val min = BigDecimal(-2).pow(63)
  override val max = BigDecimal(2).pow(63) - 1
}

case object Random extends Partitioner {
  override val min = BigDecimal(0)
  override val max = BigDecimal(2).pow(127) - 1
}

On 02/11/2015 02:21 PM, Ja Sam wrote: Your answer looks very promising. How do you calculate start and stop? 
On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky ho...@avast.com wrote: The fastest way I am aware of is to do the queries in parallel to multiple cassandra nodes and make sure that you only ask them for keys they are responsible for. Otherwise, the node needs to resend your query, which is much slower and creates unnecessary objects (and thus GC pressure). You can manually take advantage of the token range information, if the driver does not take this into account for you. Then, you can play with the concurrency and batch size of a single query against one node. Basically, what you/the driver should do is transform the query into a series of SELECT * FROM TABLE WHERE TOKEN IN (start, stop). I will need to look up the actual code, but the idea should be clear :) Jirka H. On 02/11/2015 11:26 AM, Ja Sam wrote: Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the tables and create one for each day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, but I don't know if it is possible
Re: High GC activity on node with 4TB on data
Hi Chris, On 02/09/2015 04:22 PM, Chris Lohfink wrote: - number of tombstones - how can I reliably find it out? https://github.com/spotify/cassandra-opstools https://github.com/cloudian/support-tools thanks. If you're not getting much compression, it may be worth trying to disable it; it may contribute, but it's very unlikely that it's the cause of the GC pressure itself. 7000 sstables but STCS? Sounds like compactions couldn't keep up. Do you have a lot of pending compactions (nodetool)? You may want to increase your compaction throughput (nodetool) to see if you can catch up a little; it would cause a lot of heap overhead to do reads with that many. You may even need to take more drastic measures if it can't catch back up. I am sorry, I was wrong. We actually do use LCS (the switch was done recently). There are almost no pending compactions. We have increased the sstable size to 768M, so it should help as well. It may also be good to check `nodetool cfstats` for very wide partitions. There are basically none, this is fine. It seems that the problem really comes from having so much data in so many sstables, so the org.apache.cassandra.io.compress.CompressedRandomAccessReader classes consume more memory than 0.75*HEAP_SIZE, which triggers the CMS over and over. We have turned off compression and so far the situation seems to be fine. Cheers Jirka H. There's a good chance that if you're under load and have over an 8gb heap, your GCs could use tuning. The bigger the nodes, the more manual tweaking it will require to get the most out of them. https://issues.apache.org/jira/browse/CASSANDRA-8150 also has some ideas. Chris On Mon, Feb 9, 2015 at 2:00 AM, Jiri Horky ho...@avast.com wrote: Hi all, thank you all for the info. To answer the questions: - we have 2 DCs with 5 nodes in each, each node has 256G of memory, 24x1T drives, 2x Xeon CPU - there are multiple cassandra instances running for different projects. The node itself is powerful enough. 
- there are 2 keyspaces, one with 3 replicas per DC, one with 1 replica per DC (because of the amount of data and because it serves more or less like a cache) - there are about 4k/s Request-response, 3k/s Read and 2k/s Mutation requests - the numbers are a sum over all nodes - we use STCS (LCS would be quite IO-heavy for this amount of data) - number of tombstones - how can I reliably find it out? - the biggest CF (3.6T per node) has 7000 sstables Now, I understand that the best practice for Cassandra is to run with the minimum heap size that is enough, which for this case we thought is about 12G - there is always 8G consumed by the SSTable readers. Also, I thought that a high number of tombstones creates pressure in the new space (which can then cause pressure in the old space as well), but this is not what we are seeing. We see continuous GC activity in the Old generation only. Also, I noticed that the biggest CF has a compression factor of 0.99, which basically means that the data come in compressed already. Do you think that turning off the compression should help with memory consumption? Also, I think that tuning CMSInitiatingOccupancyFraction=75 might help here, as it seems that 8G is something that Cassandra needs for bookkeeping this amount of data, and that this was slightly above the 75% limit which triggered the CMS again and again. I will definitely have a look at the presentation. Regards Jiri Horky On 02/08/2015 10:32 PM, Mark Reddy wrote: Hey Jiri, While I don't have any experience running 4TB nodes (yet), I would recommend taking a look at a presentation by Aaron Morton on large nodes: http://planetcassandra.org/blog/cassandra-community-webinar-videoslides-large-nodes-with-cassandra-by-aaron-morton/ to see if you can glean anything from that. I would note that at the start of his talk he mentions that in version 1.2 we can now talk about nodes around 1 - 3 TB in size, so if you are storing anything more than that you are getting into very specialised use cases. 
If you could provide us with some more information about your cluster setup (No. of CFs, read/write patterns, do you delete / update often, etc.) that may help in getting you to a better place. Regards, Mark On 8 February 2015 at 21:10, Kevin Burton bur...@spinn3r.com mailto:bur...@spinn3r.com wrote: Do you have a lot of individual tables? Or lots of small compactions? I think the general consensus is that (at least for Cassandra), 8GB heaps are ideal. If you have lots of small tables it’s a known anti-pattern (I believe) because the Cassandra internals could do a better job on handling the in memory metadata representation. I think this has
How to speed up SELECT * query in Cassandra
Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the tables and create one for each day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, but I don't know if it is possible
Re: How to speed up SELECT * query in Cassandra
Look for the message "Re: Fastest way to map/parallel read all values in a table?" in the mailing list; it was recently discussed. You can have several parallel processes, each one reading a slice of the data, by splitting min/max murmur3 hash ranges. In the company I used to work for, we developed a system to run custom python processes on demand to process Cassandra data, among other things to be able to do that. I hope it will be released as open source soon; it seems there are a lot of people who always have this same problem. If you use Cassandra enterprise, you can use hive, AFAIK. A good idea would be running a hadoop or spark process over your cluster and doing the processing you want, but sometimes I think it might be a bit hard to achieve good results with that, mainly because these tools work fine but are automagic. It's hard to control where intermediate data will be stored, for example. From: user@cassandra.apache.org Subject: Re: How to speed up SELECT * query in Cassandra Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the tables and create one for each day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, but I don't know if it is possible
Re: How to speed up SELECT * query in Cassandra
I use spark with cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of cassandra?), and my question is always, why aren't you updating both places at once? For example, we use hadoop and cassandra in conjunction with each other; we use a message bus to store every event in both, aggregate in both, but only keep current data in cassandra (cassandra makes a very poor data warehouse or long-term time series store) and then use services to process queries that merge data from hadoop and cassandra. Also, spark on hdfs gives more flexibility in terms of large datasets and performance. The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra -- Colin Clark +1 612 859 6129 Skype colin.p.clark On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote: On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: If you use Cassandra enterprise, you can use hive, AFAIK. Even better, you can use Spark/Shark with DSE. Cheers, Jens -- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook Linkedin Twitter
Re: Two problems with Cassandra
Hi Carlos, I tried on a single node and a 4-node cluster. On the 4-node cluster I set up the tables with replication factor = 2. I usually iterate over a subset, but it can be about ~40% right now. Some of my column values could be quite big… I remember I was exporting to csv and I had to change the default csv max column length. If I just update, there are no problems; it's reading and updating that kills everything (could it have something to do with the driver?) I'm using the 2.0.8 release right now. I was trying to tweak memory sizes. If I give Cassandra too much memory (8 or 16 GB) it dies much faster due to GC not being able to keep up. But it consistently dies on a specific row in the single-instance case… Is this enough info to point me somewhere? Thank you, Pavel On Feb 11, 2015, at 1:48 PM, Carlos Rolo r...@pythian.com wrote: Hello Pavel, What is the size of the Cluster (# of nodes)? And do you need to iterate over the full 1TB every time you do the update? Or just parts of it? IMO the information is too short to make any kind of assessment of the problem you are having. I can suggest trying a 2.0.x (or 2.1.1) release to see if you get the same problem. Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com On Wed, Feb 11, 2015 at 11:22 AM, Pavel Velikhov pavel.velik...@gmail.com wrote: Hi, I'm using Cassandra to store NLP data; the dataset is not that huge (about 1TB), but I need to iterate over it quite frequently, updating the full dataset (each record, but not necessarily each column). I've run into two problems (I'm using the latest Cassandra): 1. I was trying to copy from one Cassandra cluster to another via the python driver, however the driver confused the two instances 2. 
While trying to update the full dataset with a simple transformation (again via the python driver), single-node and clustered Cassandra run out of memory no matter what settings I try, even if I put a lot of sleeps into the mix. However, simpler transformations (updating just one column, especially when there is a lot of processing overhead) work just fine. I'm really concerned about #2, since we're moving all heavy processing to a Spark cluster and will expand it, and I would expect much heavier traffic to/from Cassandra. Any hints, war stories, etc. very appreciated! Thank you, Pavel Velikhov
Re: How to speed up SELECT * query in Cassandra
On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: If you use Cassandra enterprise, you can use hive, AFAIK. Even better, you can use Spark/Shark with DSE. Cheers, Jens -- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook https://www.facebook.com/#!/tink.se Linkedin http://www.linkedin.com/company/2735919?trk=vsrp_companies_res_phototrkInfo=VSRPsearchId%3A1057023381369207406670%2CVSRPtargetId%3A2735919%2CVSRPcmpt%3Aprimary Twitter https://twitter.com/tink
Re: How to speed up SELECT * query in Cassandra
The fastest way I am aware of is to do the queries in parallel to multiple cassandra nodes and make sure that you only ask them for keys they are responsible for. Otherwise, the node needs to resend your query, which is much slower and creates unnecessary objects (and thus GC pressure). You can manually take advantage of the token range information, if the driver does not take this into account for you. Then, you can play with the concurrency and batch size of a single query against one node. Basically, what you/the driver should do is transform the query into a series of SELECT * FROM TABLE WHERE TOKEN IN (start, stop). I will need to look up the actual code, but the idea should be clear :) Jirka H. On 02/11/2015 11:26 AM, Ja Sam wrote: Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the tables and create one for each day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, but I don't know if it is possible
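To make the (start, stop) tokens concrete: the partitioner's token space can be split into equal, contiguous sub-ranges, one per parallel worker. A sketch of that calculation (the Murmur3 min/max constants are the partitioner's documented token bounds; the function name and query template are illustrative):

```python
# Split the Murmur3 token space (-2**63 .. 2**63 - 1) into `concurrency`
# contiguous sub-ranges; each worker then scans its own range with e.g.
#   SELECT * FROM t WHERE token(pk) > start AND token(pk) <= stop
MURMUR3_MIN = -2**63
MURMUR3_MAX = 2**63 - 1

def token_ranges(concurrency):
    span = MURMUR3_MAX - MURMUR3_MIN
    step = span // concurrency
    ranges = []
    for i in range(concurrency):
        start = MURMUR3_MIN + i * step
        # The last range absorbs the rounding remainder so the whole
        # token space is covered.
        stop = MURMUR3_MAX if i == concurrency - 1 else start + step
        ranges.append((start, stop))
    return ranges

ranges = token_ranges(4)
```

For the RandomPartitioner the bounds would be 0 and 2**127 - 1 instead, as in the Scala snippet posted later in this thread.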
Re: Two problems with Cassandra
Hello Pavel, What is the size of the Cluster (# of nodes)? And do you need to iterate over the full 1TB every time you do the update? Or just parts of it? IMO the information is too short to make any kind of assessment of the problem you are having. I can suggest trying a 2.0.x (or 2.1.1) release to see if you get the same problem. Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com On Wed, Feb 11, 2015 at 11:22 AM, Pavel Velikhov pavel.velik...@gmail.com wrote: Hi, I'm using Cassandra to store NLP data; the dataset is not that huge (about 1TB), but I need to iterate over it quite frequently, updating the full dataset (each record, but not necessarily each column). I've run into two problems (I'm using the latest Cassandra): 1. I was trying to copy from one Cassandra cluster to another via the python driver, however the driver confused the two instances 2. While trying to update the full dataset with a simple transformation (again via the python driver), single-node and clustered Cassandra run out of memory no matter what settings I try, even if I put a lot of sleeps into the mix. However, simpler transformations (updating just one column, especially when there is a lot of processing overhead) work just fine. I'm really concerned about #2, since we're moving all heavy processing to a Spark cluster and will expand it, and I would expect much heavier traffic to/from Cassandra. Any hints, war stories, etc. very appreciated! Thank you, Pavel Velikhov
Re: How to speed up SELECT * query in Cassandra
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra" Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved before throwing out such a statement? On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: "cassandra makes a very poor data warehouse or long-term time series store" Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it's only not good for storing temporary data. Netflix has a large amount of data and it's all stored in Cassandra, AFAIK. "The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra." I am not sure about the current state of Spark support for Cassandra, but I guess if you create a map reduce job, the intermediate map results will still be stored in HDFS, as happens with hadoop, is this right? I think the problem with Spark + Cassandra or with Hadoop + Cassandra is that the hard part spark or hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map / reduce job used a temporary CF in Cassandra to store intermediate results? From: user@cassandra.apache.org Subject: Re: How to speed up SELECT * query in Cassandra I use spark with cassandra, and you don't need DSE. I see a lot of people ask this same question below (how do I get a lot of data out of cassandra?), and my question is always, why aren't you updating both places at once? 
For example, we use hadoop and cassandra in conjunction with each other; we use a message bus to store every event in both, aggregate in both, but only keep current data in cassandra (cassandra makes a very poor data warehouse or long-term time series store) and then use services to process queries that merge data from hadoop and cassandra. Also, spark on hdfs gives more flexibility in terms of large datasets and performance. The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra -- Colin Clark +1 612 859 6129 Skype colin.p.clark On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote: On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: If you use Cassandra enterprise, you can use hive, AFAIK. Even better, you can use Spark/Shark with DSE. Cheers, Jens -- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook Linkedin Twitter
Re: road map for Cassandra 3.0
Look at the JIRA, filtering by 3.0, but it's not very accurate. There are a lot of new features scheduled for 3.0. Some of them will make it in time for 3.0.0, like User Defined Functions, I guess. Other features will be shipped with future 3.x minor versions. On Wed, Feb 11, 2015 at 1:25 PM, Ernesto Reinaldo Barreiro reier...@gmail.com wrote: Hi, Is there a public road map for Cassandra 3.0? Are there any estimates for the release date of 3.0? -- Regards - Ernesto Reinaldo Barreiro
Re: road map for Cassandra 3.0
Thanks for your answer! On Wed, Feb 11, 2015 at 1:03 PM, DuyHai Doan doanduy...@gmail.com wrote: Look at the JIRA, filtering by 3.0, but it's not very accurate. There are a lot of new features scheduled for 3.0. Some of them will make it in time for 3.0.0, like User Defined Functions, I guess. Other features will be shipped with future 3.x minor versions. On Wed, Feb 11, 2015 at 1:25 PM, Ernesto Reinaldo Barreiro reier...@gmail.com wrote: Hi, Is there a public road map for Cassandra 3.0? Are there any estimates for the release date of 3.0? -- Regards - Ernesto Reinaldo Barreiro
Re: How to speed up SELECT * query in Cassandra
Your answer looks very promising. How do you calculate start and stop? On Wed, Feb 11, 2015 at 12:09 PM, Jiri Horky ho...@avast.com wrote: The fastest way I am aware of is to do the queries in parallel to multiple cassandra nodes and make sure that you only ask them for keys they are responsible for. Otherwise, the node needs to resend your query, which is much slower and creates unnecessary objects (and thus GC pressure). You can manually take advantage of the token range information, if the driver does not take this into account for you. Then, you can play with the concurrency and batch size of a single query against one node. Basically, what you/the driver should do is transform the query into a series of SELECT * FROM TABLE WHERE TOKEN IN (start, stop). I will need to look up the actual code, but the idea should be clear :) Jirka H. On 02/11/2015 11:26 AM, Ja Sam wrote: Is there a simple way (or even a complicated one) to speed up a SELECT * FROM [table] query? I need to get all rows from one table every day. I split the tables and create one for each day, but the query is still quite slow (200 million records). I was thinking about running this query in parallel, but I don't know if it is possible
Re: Two problems with Cassandra
Update should not be a problem because no read is done, so there is no need to pull the data out. Is that row bigger than your memory capacity (or HEAP size)? For dealing with large heaps you can refer to this ticket: CASSANDRA-8150. It provides some nice tips. If someone else can share their experience, that would be good. Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com On Wed, Feb 11, 2015 at 12:05 PM, Pavel Velikhov pavel.velik...@gmail.com wrote: Hi Carlos, I tried on a single node and a 4-node cluster. On the 4-node cluster I set up the tables with replication factor = 2. I usually iterate over a subset, but it can be about ~40% right now. Some of my column values can be quite big… I remember I was exporting to csv and I had to change the default csv max column length. If I just update, there are no problems; it's reading and updating that kills everything (could it have something to do with the driver?). I'm using the 2.0.8 release right now. I was trying to tweak memory sizes. If I give Cassandra too much memory (8 or 16 GB) it dies much faster due to GC not being able to keep up. But it consistently dies on a specific row in the single-instance case… Is this enough info to point me somewhere? Thank you, Pavel On Feb 11, 2015, at 1:48 PM, Carlos Rolo r...@pythian.com wrote: Hello Pavel, What is the size of the cluster (# of nodes)? And do you need to iterate over the full 1TB every time you do the update? Or just parts of it? IMO the information is too short to make any kind of assessment of the problem you are having. I can suggest trying a 2.0.x (or 2.1.1) release to see if you get the same problem. 
On Wed, Feb 11, 2015 at 11:22 AM, Pavel Velikhov pavel.velik...@gmail.com wrote: Hi, I'm using Cassandra to store NLP data; the dataset is not that huge (about 1TB), but I need to iterate over it quite frequently, updating the full dataset (each record, but not necessarily each column). I've run into two problems (I'm using the latest Cassandra): 1. I was trying to copy from one Cassandra cluster to another via the Python driver, however the driver confused the two instances. 2. While trying to update the full dataset with a simple transformation (again via the Python driver), single-node and clustered Cassandra run out of memory no matter what settings I try, even if I put a lot of sleeps into the mix. However, simpler transformations (updating just one column, especially when there is a lot of processing overhead) work just fine. I'm really concerned about #2, since we're moving all heavy processing to a Spark cluster and will expand it, and I would expect much heavier traffic to/from Cassandra. Any hints, war stories, etc. very appreciated! Thank you, Pavel Velikhov
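A common cause of the out-of-memory behaviour Pavel describes is materializing the whole result set (and all pending updates) at once instead of streaming it. The real fix in the Python driver would be result paging via fetch_size; the stdlib-only sketch below only illustrates the page-at-a-time pattern — the fake row source and the transform are invented for illustration:

```python
import itertools

def fetch_pages(rows, page_size):
    """Yield rows one page at a time so only page_size rows are resident,
    mimicking the driver's fetch_size-based result paging."""
    it = iter(rows)
    while True:
        page = list(itertools.islice(it, page_size))
        if not page:
            return
        yield page

def transform_in_pages(rows, page_size, transform):
    """Apply `transform` page by page, holding only one page's worth of
    pending updates in memory instead of the full dataset."""
    updated = 0
    for page in fetch_pages(rows, page_size):
        pending = [transform(row) for row in page]
        # Real code would execute the UPDATE statements for `pending` here
        # (and could throttle between pages instead of ad-hoc sleeps).
        updated += len(pending)
    return updated
```

With this shape, memory use is bounded by the page size rather than the table size, which is usually what keeps a full-table read+update loop from blowing up the client or pressuring the server.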
best supported spark connector for Cassandra
Taking the opportunity that Spark was being discussed in another thread, I decided to start a new one, as I am interested in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use Hadoop to process Cassandra data. My experience was horrible and we reached the conclusion it was faster to develop an internal tool than to insist on Hadoop _for our specific case_. From what I can see, Spark is starting to be known as a better Hadoop, and it seems the market is going this way now. I can also see I have many more options for deciding how to integrate Cassandra using the Spark RDD concept than using the ColumnFamilyInputFormat. I have found this Java driver made by Datastax: https://github.com/datastax/spark-cassandra-connector I have also found Python Cassandra support in Spark's repo, but it seems still experimental: https://github.com/apache/spark/tree/master/examples/src/main/python Finally I have found Stratio Deep: https://github.com/Stratio/deep-spark It seems the Stratio guys have forked Cassandra as well; I am still a little confused about that. Question: which driver should I use if I want to use Java? And which if I want to use Python? From my past experience, I think the way Spark integrates with Cassandra makes all the difference in the world, so I would like to know more about it, but I don't even know which source code I should start looking at... I would like to integrate using Python and/or C++, but I wonder whether it wouldn't pay off to use the Java driver instead. Thanks in advance
Re: best supported spark connector for Cassandra
Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336 Start digging from there all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark. Take some time to dig into their code to understand the logic. On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: Taking the opportunity Spark was being discussed in another thread, I decided to start a new one as I have interest in using Spark + Cassandra in the future. [...]
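The data-locality idea DuyHai points at can be illustrated with a toy model. This is only a rough analogue of what the connector's preferred-locations logic does — the ring layout, tokens, and IPs below are entirely invented: each token range has an owning replica, and the scheduler can place the Spark task for a range on a node that holds a replica of it, so the read stays local instead of crossing the network:

```python
from bisect import bisect_left

# Hypothetical single-replica ring: (end_token, owning_node).  Each node
# owns the tokens from the previous entry's end (exclusive) up to its own
# end token (inclusive).
RING = [(-4611686018427387904, "10.0.0.1"),
        (0,                    "10.0.0.2"),
        (4611686018427387904,  "10.0.0.3"),
        (9223372036854775807,  "10.0.0.4")]

def preferred_host(token):
    """Return the node owning `token`: the first ring entry whose end
    token is >= the queried token."""
    end_tokens = [end for end, _ in RING]
    idx = bisect_left(end_tokens, token)
    return RING[idx][1]

def preferred_locations(ranges):
    """Map each (start, end] token range to the node owning its end token,
    i.e. where the Spark partition for that range should preferably run."""
    return {(start, end): preferred_host(end) for start, end in ranges}
```

The real connector asks the cluster metadata for replica addresses per token range and reports them to Spark as preferred locations, so this sketch only captures the shape of the mapping, not the actual API.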