Re: query by column size
There is no automatic indexing in Cassandra. There are secondary indexes, but not for cases like this. You could use a solution like DSE to get data automatically indexed in Solr on each node as soon as it arrives; then you could run such a query against Solr. If the query can be slow, you could run an MR job over all rows, filtering out the ones you want. []s

From: user@cassandra.apache.org
Subject: Re: query by column size

Greetings, I have one column family with 10 columns; in one of the columns we store xml/json. Is there a way I can query that column where the size is greater than 50kb, assuming I have an index on that column? thanks CV.
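Since there is no server-side way to index on value size, both suggestions above boil down to scanning and filtering on the client. A minimal sketch of that filter step in plain Python (the row list stands in for whatever a token-range scan, Spark job, or MR job would feed you; the field names are made up for illustration):

```python
# Client-side filter for "rows whose xml/json column is over 50 KB".
# In a real job the rows would come from a full scan (driver paging,
# Spark, or MapReduce); a small in-memory list stands in for them here.
THRESHOLD = 50 * 1024  # 50 KB

rows = [
    {"id": 1, "payload": "x" * 1024},         # 1 KB  -> filtered out
    {"id": 2, "payload": "y" * (60 * 1024)},  # 60 KB -> kept
]

def rows_over_threshold(rows, threshold=THRESHOLD):
    # len() of the encoded text approximates the stored column size
    return [r["id"] for r in rows
            if len(r["payload"].encode("utf-8")) > threshold]

print(rows_over_threshold(rows))  # -> [2]
```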
sstables remain after compaction
Hello, Pre-1.0, after sstables were compacted, the old sstables would remain until the first GC kicked in. As of Cassandra 1.0, the sstables are removed as soon as compaction is done. Is it possible for old sstables to remain for other reasons (e.g. an in-flight read still referencing them)? Thank you. Jason
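Jason's "read referencing" guess is plausible: an obsolete sstable's files can only be removed once no in-flight read still holds a reference to them. A toy reference-counting sketch of that lifecycle (illustrative only, not Cassandra's actual code):

```python
class SSTableRef:
    """Toy model: a file is deleted only when compaction has obsoleted it
    AND the last in-flight read has released its reference."""

    def __init__(self, name):
        self.name = name
        self.refs = 0          # in-flight reads using this sstable
        self.obsolete = False  # compaction replaced it
        self.deleted = False

    def acquire(self):
        self.refs += 1

    def release(self):
        self.refs -= 1
        self._maybe_delete()

    def mark_obsolete(self):
        self.obsolete = True
        self._maybe_delete()

    def _maybe_delete(self):
        if self.obsolete and self.refs == 0 and not self.deleted:
            self.deleted = True  # a real system would unlink the file here

s = SSTableRef("old-sstable")
s.acquire()        # a long-running read starts
s.mark_obsolete()  # compaction finishes; the file must linger
assert not s.deleted
s.release()        # the read completes; now the file can go
assert s.deleted
```

Under this model, an old sstable lingering briefly after compaction is expected whenever reads overlap the compaction; it should disappear once those reads finish.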
Re: best supported spark connector for Cassandra
Not for sure ;) If you need Cassandra support I can forward you to someone to talk to at Pythian.

Regards,
Carlos Juzarte Rolo
Cassandra Consultant
Pythian - Love your data
rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo
Tel: 1649 www.pythian.com

On Fri, Feb 13, 2015 at 3:05 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:

Actually, I am not the one looking for support, but I thank you a lot anyway. From your message I guess the answer is yes: Datastax is not the only Cassandra vendor offering support and changing the official Cassandra source at this moment, is this right?

From: user@cassandra.apache.org
Subject: Re: best supported spark connector for Cassandra

Of course, Stratio Deep and Stratio Cassandra are licensed Apache 2.0. Regarding the Cassandra support, I can introduce you to someone in Stratio that can help you.

2015-02-12 15:05 GMT+01:00 Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net:

Thanks for the hint Gaspar. Do you know if Stratio Deep / Stratio Cassandra are also licensed Apache 2.0? I had an interest in knowing more about Stratio when I was working at a start-up. Now, at a blue chip, it seems one of the hardest obstacles to using Cassandra in a project is the need for a team supporting it, and people seem especially concerned about how many vendors an open source solution has available to provide support. This seems to be an advantage of HBase, as there are many vendors supporting it, but I wonder if Stratio can be considered an alternative to Datastax regarding Cassandra support? It's not my call here to decide anything, but as part of the community it helps to have this business scenario clear. I could say Cassandra could be the best-fit technical solution for some projects, but sometimes non-technical factors are in the game, like this need for having more than one vendor available...
From: gmu...@stratio.com
Subject: Re: best supported spark connector for Cassandra

My suggestion is to use Java or Scala instead of Python. For Java/Scala, both the Datastax and Stratio drivers are valid and similar options. As far as I know, they both take care of data locality and are not based on the Hadoop interface. The advantage of Stratio Deep is that it allows you to integrate Spark not only with Cassandra but with MongoDB, Elasticsearch, Aerospike and others as well. Stratio has forked Cassandra to include some additional features, such as Lucene-based secondary indexes, so the Stratio driver works fine with Apache Cassandra and also with their fork. You can find some examples of using Deep here: https://github.com/Stratio/deep-examples Please do not hesitate to contact us if you need some help with Stratio Deep.

2015-02-11 17:18 GMT+01:00 shahab shahab.mok...@gmail.com:

I am using the Calliope cassandra-spark connector (http://tuplejump.github.io/calliope/), which is quite handy and easy to use! The only problem is that it is a bit outdated; it works with Spark 1.1.0. Hopefully a new version comes soon. best, /Shahab

On Wed, Feb 11, 2015 at 2:51 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:

I just finished a Scala course; nice exercise to check what I learned :D Thanks for the answer!

From: user@cassandra.apache.org
Subject: Re: best supported spark connector for Cassandra

Start looking at the Spark/Cassandra connector here (in Scala): https://github.com/datastax/spark-cassandra-connector/tree/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector Data locality is provided by this method: https://github.com/datastax/spark-cassandra-connector/blob/master/spark-cassandra-connector/src/main/scala/com/datastax/spark/connector/rdd/CassandraRDD.scala#L329-L336 Start digging from this all the way down the code. As for Stratio Deep, I can't tell how they did the integration with Spark.
Take some time to dig down into their code to understand the logic.

On Wed, Feb 11, 2015 at 2:25 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:

Taking the opportunity, since Spark was being discussed in another thread, I decided to start a new one, as I have an interest in using Spark + Cassandra in the future. About 3 years ago, Spark was not an existing option and we tried to use Hadoop to process Cassandra data. My experience was horrible and we reached the conclusion it was faster to develop an internal tool than insist on Hadoop _for our specific case_. From what I can see, Spark is starting to be known as a better Hadoop, and it seems the market is going this way now. I can also see I have many more options for deciding how to integrate Cassandra using the Spark RDD concept than using the ColumnFamilyInputFormat. I have found this Java driver made by Datastax: https://github.com/datastax/spark-cassandra-connector I also have found Python Cassandra support in Spark's repo, but it seems experimental yet: https://github.com/apache/spark/tree/master/examples/src/main/python Finally, I have found Stratio Deep: https://github.com/Stratio/deep-spark It seems the Stratio guys have also forked Cassandra; I am still a little confused about it. Question: which driver should I use if I want to use Java? And which if I want to use Python? I think the way Spark can integrate with Cassandra makes all the difference in the world, from my past experience, so I would like to know more about it, but I don't even know which source code I should start looking at... I would like to integrate using Python and/or C++, but I wonder if it doesn't pay to use the Java driver instead. Thanks in advance
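To make the data-locality point above concrete: the connector builds each Spark partition from a set of token ranges and reports the Cassandra replicas that own those ranges as the partition's preferred locations, so Spark tries to schedule the task on a node that already holds the data. A simplified sketch of that idea in Python (the replica map and addresses are made up; the real logic lives in the CassandraRDD code linked above):

```python
# Simplified model of "preferred locations" for a Spark partition.
# replicas: token range -> hosts that own a replica of that range.
replicas = {
    (0, 100):   ["10.0.0.1", "10.0.0.2"],
    (100, 200): ["10.0.0.2", "10.0.0.3"],
    (200, 300): ["10.0.0.3", "10.0.0.1"],
}

def preferred_locations(token_ranges):
    """Hosts that hold ALL of the partition's token ranges, i.e. nodes
    where the task can run without fetching data over the network."""
    hosts = set(replicas[token_ranges[0]])
    for r in token_ranges[1:]:
        hosts &= set(replicas[r])
    return hosts

# Only 10.0.0.2 replicates both ranges, so that's where the task
# would ideally be scheduled.
print(sorted(preferred_locations([(0, 100), (100, 200)])))  # -> ['10.0.0.2']
```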
Re: Added new nodes to cluster but no streams
Hi Batranut, A few minutes between each node will do. Cheers, Jens

On Fri, Feb 13, 2015 at 1:12 PM, Batranut Bogdan batra...@yahoo.com wrote:

Hello, when adding a new node to the cluster, do I need to wait for each node to receive all the data from the other nodes in the cluster, or just wait a few minutes before I start each node?

On Thursday, February 12, 2015 7:21 PM, Robert Coli rc...@eventbrite.com wrote:

On Thu, Feb 12, 2015 at 3:20 AM, Batranut Bogdan batra...@yahoo.com wrote: I have added new nodes to the existing cluster. In OpsCenter I do not see any streams... I presume that the new nodes get the data from the rest of the cluster via streams. The existing cluster is of TB magnitude, and the space used on the new nodes is ~90 GB. I must admit that I have restarted the new nodes several times after adding them. Does this affect bootstrap? AFAIK the new nodes should start loading a part of all the data in the existing cluster.

If it stays like this for a while, it sounds like your bootstraps have hung. Note that in general you should add nodes one at a time; especially if you are on a version without the fix for CASSANDRA-2434, adding multiple nodes at once might in theory contribute to their bootstraps hanging. Stop cassandra on the joining nodes, wipe/move aside their data directories, and try again one at a time. =Rob

-- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook https://www.facebook.com/#!/tink.se Linkedin http://www.linkedin.com/company/2735919 Twitter https://twitter.com/tink
Re: best supported spark connector for Cassandra
Of course, Stratio Deep and Stratio Cassandra are licensed Apache 2.0. Regarding the Cassandra support, I can introduce you to someone in Stratio that can help you.

-- Gaspar Muñoz
Re: Recommissioned a node
I created an issue for this: https://issues.apache.org/jira/browse/CASSANDRA-8801

On Thu, Feb 12, 2015 at 10:18 AM, Robert Coli rc...@eventbrite.com wrote:

On Thu, Feb 12, 2015 at 7:04 AM, Eric Stevens migh...@gmail.com wrote: IMO, especially with the threat of unrecoverable consistency violations, this should be a critical bug.

You should file a JIRA, and let the list know what it is? :D If I'm honest, I was never sure whether it was just me being unreasonably literal in presuming that decommission made the node forget its prior state. It is nice to hear from other operators that this matches their expectations. But yes, the current behavior seems to have risks that forgetting doesn't, and I don't understand what benefits (if any) it has. As a brief aside, this is Yet Another Reason why you probably don't ever want a Cassandra node to automatically start on boot, or restart: if you don't know its configuration, it could join a cluster, which might be Meaningfully Bad in some circumstances. =Rob
Re: best supported spark connector for Cassandra
Actually, I am not the one looking for support, but I thank you a lot anyway. But from your message I guess the answer is yes, Datastax is not the only Cassandra vendor offering support and changing official Cassandra source at this moment, is this right?
Re: How to speed up SELECT * query in Cassandra
If you are using Spark you need to be _really_ careful about your tombstones. In our experience a single partition with too many tombstones can take down a whole batch job (until something like https://issues.apache.org/jira/browse/CASSANDRA-8574 is fixed). This was a major obstacle for us to overcome when using Spark. Cheers, Jens

On Wed, Feb 11, 2015 at 5:12 PM, Jiri Horky ho...@avast.com wrote:

Well, I always wondered how Cassandra can be used in a Hadoop-like environment where you basically need to do a full table scan. I have to say that our experience is that Cassandra is perfect for writing, and for reading specific values by key, but definitely not for reading all of the data out of it. Some of our projects found out that doing that on a non-trivial dataset in a timely manner is close to impossible in many situations. We are slowly moving to storing the data in HDFS and possibly reprocessing it on a daily basis for such use cases (statistics). This is nothing against Cassandra; it cannot be perfect for everything. But I am really interested in how it can work well with Spark/Hadoop, where you basically need to read all the data as well (as far as I understand it). Jirka H.

On 02/11/2015 01:51 PM, DuyHai Doan wrote:

"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra" Prove it. Did you ever have a look into the source code of the Spark/Cassandra connector to see how data locality is achieved, before throwing out such a statement?

On Wed, Feb 11, 2015 at 12:42 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote:

"cassandra makes a very poor datawarehouse or long term time series store" Really? This is not the impression I have... I think Cassandra is good for storing large amounts of data and historical information; it's only not good for storing temporary data. Netflix has a large amount of data and it's all stored in Cassandra, AFAIK.
"The very nature of cassandra's distributed nature vs partitioning data on hadoop makes spark on hdfs actually faster than on cassandra" I am not sure about the current state of Spark support for Cassandra, but I guess if you create a map-reduce job, the intermediate map results will still be stored in HDFS, as happens with Hadoop, is this right? I think the problem with Spark + Cassandra or Hadoop + Cassandra is that the hard part Spark or Hadoop does, the shuffling, could be done out of the box with Cassandra, but no one takes advantage of that. What if a map/reduce job used a temporary CF in Cassandra to store intermediate results?

From: user@cassandra.apache.org
Subject: Re: How to speed up SELECT * query in Cassandra

I use Spark with Cassandra, and you don't need DSE. I see a lot of people ask this same question (how do I get a lot of data out of Cassandra?), and my question is always: why aren't you updating both places at once? For example, we use Hadoop and Cassandra in conjunction with each other: we use a message bus to store every event in both, and aggregate in both, but only keep current data in Cassandra (Cassandra makes a very poor data warehouse or long-term time series store), and then use services to process queries that merge data from Hadoop and Cassandra. Also, Spark on HDFS gives more flexibility in terms of large datasets and performance. The very nature of Cassandra's distributed nature vs partitioning data on Hadoop makes Spark on HDFS actually faster than on Cassandra. -- *Colin Clark* +1 612 859 6129 Skype colin.p.clark

On Feb 11, 2015, at 4:49 AM, Jens Rantil jens.ran...@tink.se wrote:

On Wed, Feb 11, 2015 at 11:40 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: If you use Cassandra Enterprise, you can use Hive, AFAIK.

Even better, you can use Spark/Shark with DSE.
Cheers, Jens

-- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se Facebook https://www.facebook.com/#!/tink.se Linkedin http://www.linkedin.com/company/2735919 Twitter https://twitter.com/tink
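Jens's tombstone warning at the top of this thread is easy to model: a scan has to step over every tombstone in a partition before it can return the live cells, so the cost of a full-table read scales with the total cells written, not the cells surviving. A toy illustration in Python (not Cassandra's actual read path):

```python
# Toy model of why tombstone-heavy partitions hurt full scans: the reader
# touches every cell, dead or alive, to find the few live ones.
def scan_partition(cells):
    """cells: list of (value, is_tombstone). Returns the live values and
    how many cells the reader had to touch to find them."""
    live, touched = [], 0
    for value, is_tombstone in cells:
        touched += 1
        if not is_tombstone:
            live.append(value)
    return live, touched

# 3 live cells buried under 100,000 tombstones
cells = [(None, True)] * 100_000 + [(1, False), (2, False), (3, False)]
live, touched = scan_partition(cells)
print(len(live), touched)  # 3 live cells, 100003 cells touched
```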
Re: best supported spark connector for Cassandra
For SQL queries on Cassandra I used to use Presto: https://prestodb.io/ It's a nice tool from FB and seems to work well with Cassandra. You can use their JDBC driver with your favourite java SQL tool. Inside my apps, I never needed to use SQL queries. []s
Re: best supported spark connector for Cassandra
I used to use Calliope, which was really awesome before DataStax's native integration with Spark. Now I'm quite happy with the official DataStax spark connector; it's very straightforward to use. I never tried to use these drivers with Java, though; I'd suggest you use them with Scala, which is the best option for writing Spark jobs.
Re: best supported spark connector for Cassandra
Hi Marcelo, Were you able to use the Spark SQL features of the Cassandra connector? I couldn't make a .jar that wouldn't conflict with the Spark SQL native .jar… So I ended up using only the basic features, and cannot use SQL queries.

On Feb 13, 2015, at 7:49 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: I used to use calliope, which was really awesome before DataStax native integration with Spark. Now I'm quite happy with the official DataStax spark connector; it's very straightforward to use. I never tried to use these drivers with Java though; I'd suggest you use them with Scala, which is the best option to write spark jobs.

On Fri, Feb 13, 2015 at 12:12 PM, Carlos Rolo r...@pythian.com wrote: Not for sure ;) If you need Cassandra support I can forward you to someone to talk to at Pythian. Regards, Carlos Juzarte Rolo Cassandra Consultant Pythian - Love your data rolo@pythian | Twitter: cjrolo | Linkedin: linkedin.com/in/carlosjuzarterolo Tel: 1649 www.pythian.com

On Fri, Feb 13, 2015 at 3:05 PM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: Actually, I am not the one looking for support, but I thank you a lot anyway. But from your message I guess the answer is yes, Datastax is not the only Cassandra vendor offering support and changing official Cassandra source at this moment, is this right?

From: user@cassandra.apache.org Subject: Re: best supported spark connector for Cassandra

Of course, Stratio Deep and Stratio Cassandra are licensed Apache 2.0. Regarding the Cassandra support, I can introduce you to someone in Stratio that can help you.

2015-02-12 15:05 GMT+01:00 Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net: Thanks for the hint Gaspar.
Re: query by column size
I already have a secondary index on that column, but how do I query that column by size? thanks chandra

On Fri, Feb 13, 2015 at 3:30 AM, Marcelo Valle (BLOOMBERG/ LONDON) mvallemil...@bloomberg.net wrote: There is no automatic indexing in Cassandra. There are secondary indexes, but not for these cases. You could use a solution like DSE, to get data automatically indexed on Solr, in each node, as soon as data comes in. Then you could do such a query on Solr. If the query can be slow, you could run an MR job over all rows, filtering the ones you want. []s

From: user@cassandra.apache.org Subject: Re: query by column size

Greetings, I have one column family with 10 columns; in one of the columns we store xml/json. Is there a way I can query that column where size > 50kb, assuming I have an index on that column? thanks CV.
Re: query by column size
On Fri, Feb 13, 2015 at 11:18 AM, chandra Varahala hadoopandcassan...@gmail.com wrote: I already have a secondary index on that column, but how do I query that column by size?

You can't. If this is a query that you want to do regularly and efficiently, I suggest creating a second table to act as an index (or materialized view of sorts). Whenever your application writes a row to the original table with a column > 50kb, it should also update the second table. -- Tyler Hobbs DataStax http://datastax.com/
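A minimal sketch of the dual-write pattern Tyler describes, in Python pseudologic rather than a real driver call. The table names (documents, large_documents) and the single 'large' bucket are hypothetical; the 50 KB threshold comes from the question in this thread.

```python
# Sketch of the "second table as index" pattern: the application decides at
# write time whether the row also belongs in the index table.

SIZE_THRESHOLD = 50 * 1024  # 50 KB, per the original question

def payload_size(payload: str) -> int:
    """Size of the stored text in bytes (UTF-8, as it would be stored)."""
    return len(payload.encode("utf-8"))

def statements_for_write(doc_id: str, payload: str):
    """Return the CQL statements the application should run for one write.

    The first statement always updates the main table; the second is added
    only when the payload crosses the threshold, so "all large docs" becomes
    a plain partition lookup on large_documents instead of an impossible
    size-based secondary-index query.
    """
    stmts = [
        ("INSERT INTO documents (doc_id, payload) VALUES (?, ?)",
         (doc_id, payload)),
    ]
    if payload_size(payload) > SIZE_THRESHOLD:
        # A single synthetic partition is fine for a modest number of large
        # docs; shard the partition key if this set grows unbounded.
        stmts.append(
            ("INSERT INTO large_documents (bucket, doc_id, size) "
             "VALUES ('large', ?, ?)",
             (doc_id, payload_size(payload))),
        )
    return stmts
```

Note the trade-off: the application owns the consistency between the two tables, so deletes and overwrites must update both.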
Re: Added new nodes to cluster but no streams
Got it, thank you very much.

On Friday, February 13, 2015 4:04 PM, Jens Rantil jens.ran...@tink.se wrote: Hi Batranut, A few minutes between each node will do. Cheers, Jens

On Fri, Feb 13, 2015 at 1:12 PM, Batranut Bogdan batra...@yahoo.com wrote: Hello, When adding a new node to the cluster, do I need to wait for each node to receive all the data from the other nodes in the cluster, or just wait a few minutes before I start each node?

On Thursday, February 12, 2015 7:21 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Feb 12, 2015 at 3:20 AM, Batranut Bogdan batra...@yahoo.com wrote: I have added new nodes to the existing cluster. In Opscenter I do not see any streams... I presume that the new nodes get the data from the rest of the cluster via streams. The existing cluster has TB magnitude, and space used in the new nodes is ~90 GB. I must admit that I have restarted the new nodes several times after adding them. Does this affect bootstrap? AFAIK the new nodes should start loading a part of all the data in the existing cluster.

If it stays like this for a while, it sounds like your bootstraps have hung. Note that in general you should add nodes one at a time; especially if you are on a version without the fix for CASSANDRA-2434, in theory adding multiple nodes at once might contribute to their bootstraps hanging. Stop cassandra on the joining nodes, wipe/move aside their data directories, and try again one at a time. =Rob

-- Jens Rantil Backend engineer Tink AB Email: jens.rantil@tink.se Phone: +46 708 84 18 32 Web: www.tink.se
Re: Pagination support on Java Driver Query API
The syntax suggested by Ondrej is not working in some cases in 2.0.11, and I logged an issue for the same: https://issues.apache.org/jira/browse/CASSANDRA-8797 Thanks Ajay

On Feb 12, 2015 11:01 PM, Bulat Shakirzyanov bulat.shakirzya...@datastax.com wrote: Fixed my Mail.app settings so you can see my actual name, sorry.

On Feb 12, 2015, at 8:55 AM, DataStax bulat.shakirzya...@datastax.com wrote: Hello, As was mentioned earlier, the Java driver doesn't actually perform pagination. Instead, it uses the Cassandra native protocol to set the page size of the result set (https://github.com/apache/cassandra/blob/trunk/doc/native_protocol_v2.spec#L699-L730). When Cassandra sends the result back to the Java driver, it includes a binary token. This token represents the paging state. To fetch the next page, the driver re-executes the same statement with the original page size and the paging state attached. If there is another page available, Cassandra responds with a new paging state that can be used to fetch it. You could also try reporting this issue on the Cassandra user mailing list.

On Feb 12, 2015, at 8:35 AM, Eric Stevens migh...@gmail.com wrote: I don't know what the shape of the page state data is deep inside the JavaDriver; I've actually tried to dig into that in the past to see if I could reproduce it as a general-purpose any-query kind of thing. I gave up before I fully understood it, but I think it's actually a handle to an in-memory state maintained by the coordinator, which is only maintained for the lifetime of the statement (i.e. it's not stateless paging). That would make it a bad candidate for stateless paging scenarios such as REST requests, where a typical setup would load balance across HTTP hosts, never mind across coordinators.
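The protocol flow Bulat describes can be sketched as a toy model: the client keeps only an opaque token between requests, and each page is an independent round trip. FakeServer below stands in for Cassandra; real drivers expose the same idea through their result-set paging APIs, not this exact interface.

```python
# Toy model of native-protocol paging: execute(page_size, paging_state)
# returns one page plus the token needed to fetch the next.

class FakeServer:
    def __init__(self, rows):
        self.rows = rows

    def execute(self, page_size, paging_state=None):
        """Return (rows, next_paging_state). The state is just an offset
        here; in Cassandra it is an opaque blob the client must not parse."""
        start = int(paging_state) if paging_state is not None else 0
        page = self.rows[start:start + page_size]
        nxt = start + page_size
        next_state = str(nxt) if nxt < len(self.rows) else None
        return page, next_state

def fetch_all(server, page_size):
    """Drive the paging loop exactly as a client would: re-execute with the
    returned state until the server stops handing one back."""
    out, state = [], None
    while True:
        page, state = server.execute(page_size, state)
        out.extend(page)
        if state is None:
            return out
```

This also makes Eric's objection concrete: if the real paging state were a handle to coordinator-local memory rather than a self-contained offset like this one, the token would be useless once a load balancer routed the next HTTP request to a different host.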
It shouldn't be too much work to abstract this basic idea for manual paging into a general-purpose class that takes List[ClusteringKeyDef[T, O:Ordering]], and can produce a connection-agnostic PageState from a ResultSet or Row, or accepts a PageState to produce a WHERE CQL fragment. Also RE: possibly multiple queries to satisfy a page - yes, that's unfortunate. Since you're on 2.0.11, see Ondřej's answer to avoid it.

On Thu, Feb 12, 2015 at 8:13 AM, Ajay ajay.ga...@gmail.com wrote: Thanks Eric. I figured out the same but didn't get time to put it on the mail. Thanks. But it is highly tied to how data is stored internally in Cassandra: basically how partition keys are used to distribute data (less likely to change, and we are not directly dependent on the partitioning algorithm) and how clustering keys are used to sort the data within a partition (multi-level sorting, and hence the restrictions on the ORDER BY clause), which I think could change down the lane in Cassandra 3.x or 4.x in a different way for some better storage or retrieval. That said, I am hesitant to implement this client-side logic for pagination because a) pages 2+ might need more than one query to Cassandra; b) the implementation is tied to Cassandra internal storage details which can change (though not often); c) in our case, we are building REST APIs which will be deployed on Tomcat clusters, hence whatever we cache to support pagination needs to be cached in a distributed way for failover support. It (pagination support) is best done at the server side, like ROWNUM in SQL, or better done in the Java driver to hide the internal details, where it can be optimized better as the server sends the paging state to the driver. Thanks Ajay

On Feb 12, 2015 8:22 PM, Eric Stevens migh...@gmail.com wrote: Your page state then needs to track the last ck1 and last ck2 you saw. Pages 2+ will end up needing up to two queries if the first query doesn't fill the page size.
CREATE TABLE foo (
  partitionkey int,
  ck1 int,
  ck2 int,
  col1 int,
  col2 int,
  PRIMARY KEY ((partitionkey), ck1, ck2)
) WITH CLUSTERING ORDER BY (ck1 asc, ck2 desc);

INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,1,1,1);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,2,2,2);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,1,3,3,3);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,1,4,4);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,2,5,5);
INSERT INTO foo (partitionkey, ck1, ck2, col1, col2) VALUES (1,2,3,6,6);

If you're pulling the whole of partition 1 and your page size is 2, your first page looks like:

*PAGE 1*

SELECT * FROM foo WHERE partitionkey = 1 LIMIT 2;

 partitionkey | ck1 | ck2 | col1 | col2
--------------+-----+-----+------+------
            1 |   1 |   3 |    3 |    3
            1 |   1 |   2 |    2 |    2

You got enough rows to satisfy the page. Your page state is taken from the last row: (ck1=1, ck2=2)

*PAGE 2*

Notice that you have a page state, and add some limiting clauses on the statement:

SELECT * FROM foo WHERE partitionkey = 1 AND ck1 = 1
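Eric's two-query scheme for the table above can be sketched as pure logic, with no driver involved. The clause builder and the in-memory stand-in below assume the example schema's (ck1 asc, ck2 desc) clustering order; nothing here talks to a real cluster.

```python
# Manual paging state for PRIMARY KEY ((partitionkey), ck1, ck2)
# WITH CLUSTERING ORDER BY (ck1 asc, ck2 desc).

def next_page_clauses(last_ck1, last_ck2):
    """Build the WHERE fragments for the up-to-two queries of the next page.

    ck2 is DESC, so "later" rows within the same ck1 group have *smaller*
    ck2; ck1 is ASC, so later groups have larger ck1. Run the first query;
    if it returns fewer rows than the page size, top up with the second.
    """
    same_ck1 = f"ck1 = {last_ck1} AND ck2 < {last_ck2}"
    next_ck1 = f"ck1 > {last_ck1}"
    return same_ck1, next_ck1

def fetch_page(rows, last, size):
    """Pure-Python stand-in for running those two queries against the
    partition; rows are (ck1, ck2) tuples already in table order."""
    last_ck1, last_ck2 = last
    page = [r for r in rows if r[0] == last_ck1 and r[1] < last_ck2][:size]
    if len(page) < size:
        page += [r for r in rows if r[0] > last_ck1][:size - len(page)]
    return page
```

Applied to the six sample rows with page size 2 and the page state (ck1=1, ck2=2), this yields (1,1) from the first query and tops up with (2,3) from the second, which is exactly the "up to two queries per page" cost Ajay objects to.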
Re: sstables remain after compaction
On Fri, Feb 13, 2015 at 1:35 AM, Jason Wee peich...@gmail.com wrote: Pre cassandra 1.0, after sstables are compacted, the old sstables will remain until the first gc kicks in. For cassandra 1.0, the sstables will be removed after compaction is done. Will it be possible the old sstables remain due to whatever reasons (e.g. read referencing)?

If I understand your question properly, the answer is "no", or "not for longer than the duration of a running thread". If compaction is working properly in a post-needs-the-java-GC-to-delete-files version of Cassandra, the input files should be deleted ASAP. If a thread is actively accessing a file, I would imagine deletion blocks for that long, but that's not likely to be very long. =Rob
Re: Added new nodes to cluster but no streams
Hello, When adding a new node to the cluster, do I need to wait for each node to receive all the data from the other nodes in the cluster, or just wait a few minutes before I start each node?

On Thursday, February 12, 2015 7:21 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Feb 12, 2015 at 3:20 AM, Batranut Bogdan batra...@yahoo.com wrote: I have added new nodes to the existing cluster. In Opscenter I do not see any streams... I presume that the new nodes get the data from the rest of the cluster via streams. The existing cluster has TB magnitude, and space used in the new nodes is ~90 GB. I must admit that I have restarted the new nodes several times after adding them. Does this affect bootstrap? AFAIK the new nodes should start loading a part of all the data in the existing cluster.

If it stays like this for a while, it sounds like your bootstraps have hung. Note that in general you should add nodes one at a time; especially if you are on a version without the fix for CASSANDRA-2434, in theory adding multiple nodes at once might contribute to their bootstraps hanging. Stop cassandra on the joining nodes, wipe/move aside their data directories, and try again one at a time. =Rob
Re: sstables remain after compaction
Thanks Rob. I trigger user-defined compaction on big sstables (big as in the size per sstable reaches more than 50 GB, some 100 GB). Occasionally, after user-defined compaction, I see some sstables remain, even after 12 hours have elapsed. You mentioned a thread; could you tell which threads those are, or perhaps highlight them in the code? Jason

On Sat, Feb 14, 2015 at 3:58 AM, Robert Coli rc...@eventbrite.com wrote: On Fri, Feb 13, 2015 at 1:35 AM, Jason Wee peich...@gmail.com wrote: Pre cassandra 1.0, after sstables are compacted, the old sstables will remain until the first gc kicks in. For cassandra 1.0, the sstables will be removed after compaction is done. Will it be possible the old sstables remain due to whatever reasons (e.g. read referencing)? If I understand your question properly, the answer is "no", or "not for longer than the duration of a running thread". If compaction is working properly in a post-needs-the-java-GC-to-delete-files version of Cassandra, the input files should be deleted ASAP. If a thread is actively accessing a file, I would imagine deletion blocks for that long, but that's not likely to be very long. =Rob
Storing bi-temporal data in Cassandra
Has anyone designed a bi-temporal table in Cassandra? It doesn't look like I can do this using CQL for now. Taking the time series example from well-known modeling tutorials in Cassandra:

CREATE TABLE temperatures (
  weatherstation_id text,
  event_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id, event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);

If I add another column transaction_time:

CREATE TABLE temperatures (
  weatherstation_id text,
  event_time timestamp,
  transaction_time timestamp,
  temperature text,
  PRIMARY KEY (weatherstation_id, event_time, transaction_time)
) WITH CLUSTERING ORDER BY (event_time DESC, transaction_time DESC);

If I try to run a query using the following CQL, it throws an error:

select * from temperatures where weatherstation_id = 'foo' and event_time >= '2015-01-01 00:00:00' and event_time < '2015-01-02 00:00:00' and transaction_time < '2015-01-02 00:00:00'

It works if I use an equals clause for the event_time. I am trying to get the state as of a particular transaction_time. -Raj
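The query fails because CQL only allows a non-EQ relation on the last restricted clustering column: with a range on event_time, no restriction on transaction_time is permitted. One workaround (a sketch under that assumption, not the only possible design) is to query the event_time range alone and resolve the "as of transaction_time" part client-side:

```python
# Client-side bi-temporal resolution: for each event_time returned by the
# range query, keep the version with the greatest transaction_time that is
# <= the requested as-of point.

def as_of(rows, tx_time):
    """rows: iterable of (event_time, transaction_time, temperature)
    tuples from the event_time range query; tx_time: the as-of point.
    Returns {event_time: temperature} for the visible versions."""
    latest = {}
    for event_time, transaction_time, temperature in rows:
        if transaction_time > tx_time:
            continue  # recorded after the as-of point; invisible
        best = latest.get(event_time)
        if best is None or transaction_time > best[0]:
            latest[event_time] = (transaction_time, temperature)
    return {ev: temp for ev, (tx, temp) in latest.items()}
```

The cost is reading every recorded version in the event_time window; if versions per event are few, that is usually acceptable, and otherwise a second table keyed for the transaction-time access pattern is the more Cassandra-idiomatic answer.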