Re: cassandra + spark / pyspark
Adding to the conversation... there are 3 great open source options available:

1. Calliope http://tuplejump.github.io/calliope/ — This was the first library out, some time late last year (as I recall), and I have been using it for a while; mostly very stable. It uses Hadoop I/O with Cassandra (note that it doesn't require Hadoop).

2. DataStax spark-cassandra-connector https://github.com/datastax/spark-cassandra-connector — The main difference is that this one uses CQL3. Again a great library, though it has a few issues. It is by far the most actively developed; it still uses Thrift for minor things, but all the heavy lifting is done in CQL3.

3. Stratio Deep https://github.com/Stratio/stratio-deep — Has a lot more to offer if you use the whole Stratio stack: Deep is for Spark, Stratio Streaming is built on top of Spark Streaming, Stratio META is something similar to Shark or Spark SQL, and finally Stratio Cassandra, which is a fork of Cassandra with advanced Lucene-based indexing.
Re: cassandra + spark / pyspark
Re "2. still uses thrift for minor stuff" -- I think the only call still using Thrift is describe_ring, to get an estimate of the ratio of partition keys within each token range.

Re "3.": Stratio has a talk today at the SF Summit, presenting Stratio META. For the folks not attending the conference, the video should be available within about a month afterwards.

On Thu, Sep 11, 2014 at 6:23 AM, abhinav chowdary abhinav.chowd...@gmail.com wrote: ...
Re: cassandra + spark / pyspark
Ok. DataStax and Stratio require Mesos, Hadoop YARN, or some other third party to get Spark cluster HA. What about Calliope? Is it sufficient to have Cassandra + Calliope + Spark to be able to process aggregations? In my case we have quite a lot of data, so doing aggregation only in memory is impossible. Does Calliope support a not-in-memory mode for Spark? Thanks, Oleg. On Thu, Sep 11, 2014 at 9:23 PM, abhinav chowdary abhinav.chowd...@gmail.com wrote: ...
Re: cassandra + spark / pyspark
Hi Oleg, I am the creator of Calliope. Calliope doesn't force any deployment model... that means you can run it with Mesos or Hadoop or standalone. To be fair, the other libs mentioned here should work that way too. Spark cluster HA can be provided using ZooKeeper even in the standalone deployment mode. Can you explain what you mean by in-memory aggregations not being possible? With Calliope being able to utilize secondary indexes and also our Stargate indexes (distributed Lucene indexing for C*), I am sure we can handle any scenario. Calliope is used in production at many large organizations over very, very big data. Feel free to mail me directly, and we can work with you to get you started. Regards, Rohit *Founder CEO, **Tuplejump, Inc.* www.tuplejump.com *The Data Engineering Platform* On Thu, Sep 11, 2014 at 8:09 PM, Oleg Ruchovets oruchov...@gmail.com wrote: ...
[RELEASE] Apache Cassandra 2.1.0
The Cassandra team is pleased to announce the release of the final version of Apache Cassandra 2.1.0.

Cassandra 2.1.0 brings a number of new features and improvements, including (but not limited to):
- Improved support for Windows
- A new incremental repair option[4, 5]
- A better row cache that can cache only the head of partitions[6]
- Off-heap memtables[7]
- Numerous performance improvements[8, 9]
- CQL improvements and additions: user-defined types, tuple types, 2ndary indexing of collections, ...[10]
- An improved stress tool[11]

Please refer to the release notes[1] and changelog[2] for details.

Both source and binary distributions of Cassandra 2.1.0 can be downloaded at: http://cassandra.apache.org/download/

As usual, a debian package is available from the project APT repository[3] (you will need to use the 21x series).

The Cassandra team

[1]: http://goo.gl/k4eM39 (CHANGES.txt)
[2]: http://goo.gl/npCsro (NEWS.txt)
[3]: http://wiki.apache.org/cassandra/DebianPackaging
[4]: http://goo.gl/MjohJp
[5]: http://goo.gl/f8jSme
[6]: http://goo.gl/6TJPH6
[7]: http://goo.gl/YT7znJ
[8]: http://goo.gl/Rg3tdA
[9]: http://goo.gl/JfDBGW
[10]: http://goo.gl/kQl7GW
[11]: http://goo.gl/OTNqiQ
Re: Mutation Stage does not finish
Hello, the jstack output can be seen at http://pastebin.com/LXnNyY3U. I ran tpstats today and always get the same output:

Pool Name              Active  Pending   Completed  Blocked  All time blocked
ReadStage              0       0         0          0        0
RequestResponseStage   0       0         0          0        0
MutationStage          32      58        32690042   0        0
ReadRepairStage        0       0         0          0        0
ReplicateOnWriteStage  0       0         0          0        0
GossipStage            0       0         0          0        0
AntiEntropyStage       0       0         0          0        0
MigrationStage         0       0         0          0        0
MemoryMeter            0       0         98         0        0
MemtablePostFlusher    0       0         7          0        0
FlushWriter            0       0         5          0        0
MiscStage              0       0         0          0        0
commitlog_archiver     0       0         0          0        0
InternalResponseStage  0       0         0          0        0

OpsCenter shows the following status:
Status: Active - Starting
Gossip: Down
Thrift: Down
Native Transport: Down
Pending Tasks: 0

Thanks, Eduardo

On Wed, Sep 10, 2014 at 10:30 PM, Benedict Elliott Smith belliottsm...@datastax.com wrote: Could you post the results of jstack on the process somewhere? On Thu, Sep 11, 2014 at 7:07 AM, Robert Coli rc...@eventbrite.com wrote: On Wed, Sep 10, 2014 at 1:53 PM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: No, it is still running the Mutation Stage. If you're sure that it is not receiving Hinted Handoff, then the only mutations in question can be from the replay of the commit log. The commit log should take less than forever to replay. =Rob
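To check whether a backlog like this is actually draining, it can help to poll `nodetool tpstats` periodically and track the Pending column for MutationStage over time. A minimal parsing sketch (plain Python; the column positions are assumed from the output format above and may differ between Cassandra versions):

```python
def pending_for(stage: str, tpstats_output: str) -> int:
    # Each pool line is: name, Active, Pending, Completed, Blocked,
    # All time blocked. Return the Pending count for the given stage.
    for line in tpstats_output.splitlines():
        parts = line.split()
        if parts and parts[0] == stage:
            return int(parts[2])
    raise ValueError(f"stage {stage!r} not found in tpstats output")

# Hypothetical sample, shaped like the tpstats output in this thread:
sample = """\
Pool Name      Active  Pending  Completed  Blocked  All time blocked
ReadStage      0       0        0          0        0
MutationStage  32      58       32690042   0        0
"""
assert pending_for("MutationStage", sample) == 58
assert pending_for("ReadStage", sample) == 0
```

Polling this every few seconds (e.g. feeding it the output of `nodetool tpstats`) distinguishes a slowly draining commitlog replay from a genuinely stuck stage: in the former case, Pending falls and Completed rises between samples.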
Re: Quickly loading C* dataset into memory (row cache)
What are you referring to when you say memory store? RAM disk? memcached? Thanks, Danny On Wed, Sep 10, 2014 at 1:11 AM, DuyHai Doan doanduy...@gmail.com wrote: Rob Coli strikes again, you're Doing It Wrong, and he's right :D Using Cassandra as a distributed cache is a bad idea, seriously. Putting 6GB into the row cache is another one. On Tue, Sep 9, 2014 at 9:21 PM, Robert Coli rc...@eventbrite.com wrote: On Tue, Sep 9, 2014 at 12:10 PM, Danny Chan tofuda...@gmail.com wrote: Is there a method to quickly load a large dataset into the row cache? I use row caching as I want the entire dataset to be in memory. You're doing it wrong. Use a memory store. =Rob
Re: [RELEASE] Apache Cassandra 2.1.0
Thanks for this new version, which seems to bring a lot of interesting new features and improvements! Definitely interested in trying the new counters and incremental repairs. Congrats. PS: I am also quite curious to know what is still inside the heap :D. Maybe the key cache? So what is the recommended heap size when running 2.1 (with memtables off-heap)? 2014-09-11 17:05 GMT+02:00 Sylvain Lebresne sylv...@datastax.com: ...
Re: [RELEASE] Apache Cassandra 2.1.0
Congrats team, I know you worked hard on it!! One question: where can users get a DataStax Java driver that supports this version? If so, is it released? Best Regards, -Tony Anecito Founder/President MyUniPortal LLC http://www.myuniportal.com On Thursday, September 11, 2014 9:05 AM, Sylvain Lebresne sylv...@datastax.com wrote: ...
Re: [RELEASE] Apache Cassandra 2.1.0
Yes, it was released: Java driver 2.1. On Sep 11, 2014 8:33 AM, Tony Anecito adanec...@yahoo.com wrote: ...
Re: cassandra + spark / pyspark
Thank you, Rohit. I sent the email to you. Thanks, Oleg. On Thu, Sep 11, 2014 at 10:51 PM, Rohit Rai ro...@tuplejump.com wrote: ...
Re: Quickly loading C* dataset into memory (row cache)
On Thu, Sep 11, 2014 at 8:30 AM, Danny Chan tofuda...@gmail.com wrote: What are you referring to when you say memory store? RAM disk? memcached? In 2014, probably Redis? =Rob
Detecting bitrot with incremental repair
jbellis talked about incremental repair, which is great, but as I understood, repair was also somewhat responsible for detecting and repairing bitrot on long-lived sstables. If repair doesn't do it, what will? Thanks, John...
Re: Detecting bitrot with incremental repair
On Thu, Sep 11, 2014 at 9:44 AM, John Sumsion sumsio...@familysearch.org wrote: jbellis talked about incremental repair, which is great, but as I understood, repair was also somewhat responsible for detecting and repairing bitrot on long-lived sstables. SSTable checksums, and the checksums on individual compressed (and only compressed) partitions, provide some of this functionality, at the very least giving some visibility into bitrot-style corruption. If repair doesn't do it, what will? Read repair will help, but only repair is capable of providing the guarantee you need. Cassandra probably needs partition checksums on uncompressed partitions as well, and then to mark an sstable un-repaired when it detects a corrupt read. =Rob
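To illustrate the general idea (this is a toy sketch, not Cassandra's actual checksum implementation; the function names are hypothetical), a checksum stored alongside a data block lets a reader detect bitrot at read time:

```python
import zlib

def store_block(data: bytes) -> tuple[bytes, int]:
    # Persist the block together with its CRC32, the way an sstable
    # might store a per-chunk checksum at write time.
    return data, zlib.crc32(data)

def read_block(data: bytes, checksum: int) -> bytes:
    # On read, recompute the checksum; a mismatch means the bytes
    # changed (bitrot or other corruption) since they were written.
    if zlib.crc32(data) != checksum:
        raise IOError("checksum mismatch: possible bitrot")
    return data

block, crc = store_block(b"some partition bytes")
assert read_block(block, crc) == b"some partition bytes"

# Flip one bit to simulate bitrot: the read now fails loudly
# instead of silently returning corrupt data.
rotted = bytes([block[0] ^ 0x01]) + block[1:]
try:
    read_block(rotted, crc)
except IOError:
    pass  # corruption detected
```

The point made above is exactly this gap: a checksum can only detect corruption when the block is actually read and a checksum exists for it, which is why rarely-read, uncompressed data still needs repair (or additional checksums) to get the same guarantee.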
Re: Mutation Stage does not finish
Robert/Elliott, I deleted the commit logs, restarted Cassandra, and finally the node is up. Thanks for the help! Regards, Eduardo On Thu, Sep 11, 2014 at 12:08 PM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: ...
Re: Mutation Stage does not finish
On Thu, Sep 11, 2014 at 10:34 AM, Eduardo Cusa eduardo.c...@usmediaconsulting.com wrote: I deleted commit logs, restarted cassandra and finally the node is up. Do you have some crazy workload where you do a huge amount of deletes or something? Replaying a commitlog should not take longer than a few tens of minutes in the worst-case scenario. =Rob
Is it possible to bootstrap the 1st node of a new DC?
When setting up a new (additional) data center, the documentation tells us to use nodetool rebuild -- old dc to fill up the node(s) in the new DC, and to disable auto_bootstrap. I'm wondering if it is possible to fill the node with auto_bootstrap=true instead of a nodetool rebuild command. If so, how will Cassandra decide from where to stream the data? The reason I'm asking is that when using rebuild, I've learned from experience that the node immediately joins the cluster and starts accepting reads (from other DCs) for the range it owns. But since its data is not complete yet, it can't return anything. This seems to be a dangerous side effect of this procedure, and therefore it can't be used. Thanks, Tom
Re: Is it possible to bootstrap the 1st node of a new DC?
Thanks, Rob. I actually tried using LOCAL_ONE instead of ONE, but I still saw this problem. Maybe I missed some queries when updating to LOCAL_ONE. Anyway, it's good to know that this is supposed to work. Tom On Thu, Sep 11, 2014 at 10:28 PM, Robert Coli rc...@eventbrite.com wrote: On Thu, Sep 11, 2014 at 1:18 PM, Tom van den Berge t...@drillster.com wrote: When setting up a new (additional) data center, the documentation tells us to use nodetool rebuild -- old dc to fill up the node(s) in the new dc, and to disable auto_bootstrap. I'm wondering if it is possible to fill the node with auto_bootstrap=true instead of a nodetool rebuild command. If so, how will Cassandra decide from where to stream the data? Yes, if that node can hold 100% of the replicas for the new DC. Cassandra will decide from where to stream the data in the same way it normally does, by picking one replica per range and streaming from it. But you probably don't generally want to do this, rebuild exists for this use case. The reason I'm asking is that when using rebuild, I've learned from experience that the node immediately joins the cluster, and starts accepting reads (from other DCs) for the range it owns. But since the data is not complete yet, it can't return anything. This seems to be a dangerous side effect of this procedure, and therefore can't be used. Yes, that's why LOCAL_ONE ConsistencyLevel was created. Use it, and LOCAL_QUORUM, instead of ONE and QUORUM. =Rob -- Drillster BV Middenburcht 136 3452MT Vleuten Netherlands +31 30 755 5330 Open your free account at www.drillster.com
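The reason LOCAL_QUORUM sidesteps the problem described above is just arithmetic: a quorum is a strict majority of replicas, and LOCAL_QUORUM computes it over the local datacenter's replicas only, so a still-empty remote DC never participates. A quick sketch of the counting (plain Python, not driver code; the RF values are illustrative):

```python
def quorum(replicas: int) -> int:
    # A quorum is a strict majority: floor(replicas / 2) + 1.
    return replicas // 2 + 1

# Example: RF=3 in each of two DCs.
rf_dc1, rf_dc2 = 3, 3

# Plain QUORUM counts all replicas cluster-wide: 4 of 6 must answer,
# so the coordinator may route reads to the new, still-empty DC.
assert quorum(rf_dc1 + rf_dc2) == 4

# LOCAL_QUORUM counts only the local DC's replicas: 2 of 3,
# all in a DC that actually has the data.
assert quorum(rf_dc1) == 2
```

The same reasoning applies to ONE vs. LOCAL_ONE: both need a single reply, but LOCAL_ONE restricts the candidate replicas to the local DC, which is why any lingering ONE/QUORUM queries would still occasionally hit the rebuilding DC.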
Re: Mutation Stage does not finish
Yes, we have a huge amount of inserts that can be repeated; now we are working on a new data model. On Thu, Sep 11, 2014 at 2:54 PM, Robert Coli rc...@eventbrite.com wrote: ...