[ANN] SparkSQL support for Cassandra with Calliope
Hi All,

A year ago we started this journey and laid the path for the Spark + Cassandra stack. We established the groundwork and direction for Spark Cassandra connectors, and we have been happy seeing the results. With the release of Spark 1.1.0 and SparkSQL, we think it's time to take Calliope (http://tuplejump.github.io/calliope/) to the logical next level, also paving the way for much more advanced functionality to come.

Yesterday we released the Calliope 1.1.0 Community Tech Preview (https://twitter.com/tuplejump/status/517739186124627968), which brings native SparkSQL support for Cassandra. Further details are available here: http://tuplejump.github.io/calliope/tech-preview.html

This release showcases support for core spark-sql (http://tuplejump.github.io/calliope/start-with-sql.html), hiveql (http://tuplejump.github.io/calliope/start-with-hive.html) and HiveThriftServer (http://tuplejump.github.io/calliope/calliope-server.html).

I differentiate it as native spark-sql integration because it doesn't rely on Cassandra's Hive connectors (like Cash or DSE) and so saves a level of indirection through Hive. It also allows us, in future, to harness Spark's analyzer and optimizer to work out the best execution plan, balancing Cassandra's querying restrictions against Spark's in-memory processing.

As far as we know, this is the first and only third-party data store connector for SparkSQL. This is a CTP release because it relies on Spark internals that do not yet have a stable developer API. We will work with the Spark community on documenting the requirements and working towards a standard, stable API for third-party data store integration.

On another note, we no longer require you to sign up to access the early access code repository.

We invite all of you to try it and give us your valuable feedback.

Regards,
Rohit
Founder & CEO, Tuplejump, Inc.
www.tuplejump.com
The Data Engineering Platform
Re: cassandra + spark / pyspark
Hi Oleg,

I am the creator of Calliope. Calliope doesn't force any deployment model: you can run it with Mesos, Hadoop or Standalone. To be fair, I don't think the other libs mentioned here force one either. Spark cluster HA can be provided using ZooKeeper even in the standalone deployment mode.

Can you explain what you mean by in-memory aggregations not being possible? With Calliope able to use secondary indexes and also our Stargate indexes (distributed Lucene indexing for C*), I am sure we can handle any scenario. Calliope is used in production at many large organizations over very big data.

Feel free to mail me directly, and we can work with you to get you started.

Regards,
Rohit
Founder & CEO, Tuplejump, Inc.
www.tuplejump.com
The Data Engineering Platform

On Thu, Sep 11, 2014 at 8:09 PM, Oleg Ruchovets oruchov...@gmail.com wrote:

OK. DataStax and Stratio require Mesos, Hadoop YARN or another third party to get Spark cluster HA. What about Calliope? Is it sufficient to have Cassandra + Calliope + Spark to be able to process aggregations? In my case we have quite a lot of data, so doing aggregation only in memory is impossible. Does Calliope support a non-in-memory mode for Spark?

Thanks
Oleg.

On Thu, Sep 11, 2014 at 9:23 PM, abhinav chowdary abhinav.chowd...@gmail.com wrote:

Adding to the conversation... there are 3 great open source options available:

1. Calliope http://tuplejump.github.io/calliope/ This is the first library that was out, some time late last year (as I recall), and I have been using it for a while. It is mostly very stable and uses Hadoop I/O in Cassandra (note that it doesn't require Hadoop).
2. DataStax Spark Cassandra connector https://github.com/datastax/spark-cassandra-connector: The main difference is that this uses CQL3. Again a great library, but it has a few issues. It is also by far the most actively developed, and it still uses Thrift for minor stuff but does all the heavy lifting in CQL3.
3. Stratio Deep https://github.com/Stratio/stratio-deep: Has a lot more to offer if you use the whole Stratio stack. Deep is for Spark, Stratio Streaming is built on top of Spark Streaming, Stratio Meta is something similar to Shark or SparkSQL, and finally Stratio Cassandra is a fork of Cassandra with advanced Lucene-based indexing.
Re: Data locality with cash
Hi Jens,

Cash builds on the Cassandra Hadoop handlers and thus supports data locality.

Regards,
Rohit
Founder & CEO, Tuplejump, Inc.
www.tuplejump.com
The Data Engineering Platform

On Wed, May 21, 2014 at 9:22 PM, Jens Rantil jens.ran...@tink.se wrote:

Hi,

I've had a look at the Hive plugin for Cassandra [1]. Does anyone know if it supports data locality if I install task trackers and job trackers on my Cassandra instances?

[1] https://github.com/tuplejump/cash

Thanks,
Jens