Re: Machine Learning With Cassandra
If you want distributed machine learning, you can use either Mahout (runs on Hadoop) or Spark (MLlib). If you choose the Hadoop route, Datastax provides a connector (CFS) to interact with data stored in Cassandra. Otherwise you can try the Cassandra InputFormat (not as simple, but plenty of people use it). A quick search for “map reduce cassandra” on this list brings up a recent conversation: http://mail-archives.apache.org/mod_mbox/cassandra-user/201407.mbox/%3CCAAX2xq6UhsGfq_gtfjogOV7%3DMi8q%3D5SmRfNM1%2BKFEXXVk%2Bp8iw%40mail.gmail.com%3E

If you prefer to use Spark, you can try the Datastax Cassandra connector: https://github.com/datastax/spark-cassandra-connector. This should let your Spark jobs read data from, and write results back to, Cassandra.

Cheers,
James
Web: http://ferry.opencore.io
Twitter: @open_core_io

On Aug 30, 2014, at 4:02 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote:

Yes, I remember this conversation. That was when I was first stepping into this stuff. My current understanding is:

Storm = stream and micro-batch
Spark = batch and micro-batch

Micro-batching is what gets you to exactly-once processing semantics. I’m clear on that. What I’m not clear on is how and where processing takes place. I also get that Spark is a faster execution engine than MapReduce. But we have Tez now... except, as far as I know, that’s not useful here because my data isn’t in HDFS. People seem to be talking quite a bit about Mahout and Spark Shell, but I’d really like to get this done with a minimum amount of software: either Storm or Spark, but not both. Trident ML isn’t distributed, which is fine because I’m not trying to do learning on the stream. For now, I’m just trying to do learning in batch and then update parameters as suggested earlier.

Let me simplify the question: how do I do distributed machine learning when my data is in Cassandra and not HDFS?
I haven’t totally explored Mahout yet, but a lot of the algorithms run on MapReduce, which is fine for now. As I understand it though, MapReduce works on data in HDFS, correct?

Adaryl Bob Wakefield, MBA
Principal
Mass Street Analytics
913.938.6685
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData

From: Shahab Yunus
Sent: Saturday, August 30, 2014 11:23 AM
To: user@cassandra.apache.org
Subject: Re: Machine Learning With Cassandra

Spark is not storage; rather, it is a processing framework meant to run on big data in a distributed architecture (a very high-level intro/definition). It provides batched versions of in-memory map/reduce-like jobs. It is not completely streaming like Storm; rather, it batches collections of tuples, and thus you can run complex ML algorithms relatively fast. I think we discussed this a short while ago, when a similar question (Storm vs. Spark, I think) was raised by you earlier. Here is the link for that discussion: http://markmail.org/message/lc4icuw4hobul6oh

Regards,
Shahab

On Sat, Aug 30, 2014 at 12:16 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote:

Isn’t it a bit overkill to use both Storm and Spark in the architecture? You say load it “into” Spark. Is Spark separate storage?

B.

From: Alex Kamil
Sent: Friday, August 29, 2014 10:46 PM
To: user@cassandra.apache.org
Subject: Re: Machine Learning With Cassandra

Adaryl, most ML algorithms are based on some form of numerical optimization, using something like online gradient descent or conjugate gradient (e.g. in SVM classifiers). In its simplest form, it is a nested FOR loop where on each iteration you update the weights or parameters of the model until reaching some convergence threshold that minimizes the prediction error (usually the goal is to minimize a loss function, as in the popular least-squares technique).
You could parallelize this loop using a brute-force divide-and-conquer approach: map a chunk of data to each node and compute a partial sum there, then aggregate the results from each node into a global sum in a 'reduce' stage, and repeat this map-reduce cycle until convergence. You can look up distributed gradient descent, or check out Mahout or Spark MLlib for examples. Alternatively, you can use something like GraphLab.

Cassandra can serve as a data store from which you load the training data, e.g. into Spark using this connector, and then train the model using MLlib or Mahout (it has Spark bindings, I believe). Once you've trained the model, you can save the parameters back into Cassandra. The next stage is using the model to classify new data, e.g. recommending similar items based on a log of new purchases; there you could once again use Spark, or Storm with something like this.

Alex

On Fri, Aug 29, 2014 at 10:24 PM, Adaryl Bob Wakefield, MBA adaryl.wakefi...@hotmail.com wrote:

I’m planning to speak at a local meet-up and I need to know if what I
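The divide-and-conquer loop Alex describes can be sketched in a few lines of plain Python. This is a single-machine simulation of the map/reduce stages (not Mahout or MLlib code), fitting a one-parameter least-squares model y = w*x; the data, weights, and names are all illustrative:

```python
# Distributed gradient descent, simulated: each "node" computes a
# partial gradient over its chunk (the map stage), the partials are
# summed (the reduce stage), and the weight is updated until the
# change falls below a convergence threshold.

def partial_gradient(chunk, w):
    # Gradient of the squared error sum((w*x - y)^2) over one chunk.
    return sum(2 * (w * x - y) * x for x, y in chunk)

def train(chunks, w=0.0, lr=0.01, tol=1e-9, max_iters=1000):
    n = sum(len(c) for c in chunks)
    for _ in range(max_iters):
        grad = sum(partial_gradient(c, w) for c in chunks) / n  # reduce
        new_w = w - lr * grad
        if abs(new_w - w) < tol:  # convergence threshold reached
            return new_w
        w = new_w
    return w

# Data generated from y = 3x, split across two simulated "nodes".
data = [(x, 3.0 * x) for x in range(1, 9)]
chunks = [data[:4], data[4:]]
w = train(chunks)
print(round(w, 4))  # converges to ~3.0
```

In a real deployment, the chunks live on different nodes (e.g. in Cassandra token ranges), the map stage runs where the data lives, and only the small partial sums travel over the network on each iteration.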
Re: Cassandra use cases/Strengths/Weakness
I’ve supported a variety of different “big data” systems, and most have their own particular set of use cases that make sense. Having said that, I believe that Cassandra uniquely excels at the following:

* Low write latency for small to medium write sizes (logs, sensor data, etc.)
* Linear write scalability
* Fault tolerance across geographic locations

The first two points make it an excellent candidate for high-throughput “transactional” systems. Other systems that play in this space tend to be HBase and Riak (there may be others, but I’m most familiar with those two). However, the last point is pretty unique to Cassandra. So if you’re looking for a high scale-out, high-throughput transactional system, then Cassandra may make sense for you. If you’re looking for something more geared towards analytics (so few bulk writes, many reads), then something in the Hadoop space may make sense.

Cheers,
James

On Jul 4, 2014, at 3:31 PM, Prem Yadav ipremya...@gmail.com wrote:

Thanks Manoj. Great post for those who already have Cassandra in production. However, it brings me back to my original post. All the points you have mentioned apply to any big data technology:

Storage: all of them.
Query: all of them; in fact, a lot of them perform better. Agreed that the CQL structure is better, but Hive, Mongo, etc. are all good.
Availability: many of them.

So my question is basically to Cassandra support people (e.g. Datastax) or the developers: what makes Cassandra special? If I have to convince my CTO to spend a million dollars on a cluster and support, his first question would be: why Cassandra? Why not this or that? So I am still not sure what special things Cassandra brings to the table.

Sorry about the rant, but in the enterprise world, decisions are made taking into account stability, convincing managers, and what not. The chosen technology has to be stable for years. People should be convinced that the engineers are not going to do a lot of firefighting.

Any inputs appreciated.
On Fri, Jul 4, 2014 at 7:07 PM, Manoj Khangaonkar khangaon...@gmail.com wrote:

These are my personal opinions based on a few months using Cassandra; others may have different opinions. http://khangaonkar.blogspot.com/2014/06/apache-cassandra-things-to-consider.html

regards

On Fri, Jul 4, 2014 at 7:37 AM, Prem Yadav ipremya...@gmail.com wrote:

Hi,

I have seen it said in a lot of replies that Cassandra is not designed for this or that. I don't want to sound rude; I just need some info about this so that I can compare it to technologies like HBase, Mongo, Elasticsearch, Solr, etc.

1) What is Cassandra designed for? Heavy writes, yes, but so are HBase and Elasticsearch. What are the use cases that suit Cassandra?
2) What kinds of queries are best suited for Cassandra? I ask this because I have seen people asking about queries and getting replies that they are not suited for Cassandra, for example queries where a large number of rows are requested and a timeout happens, or range queries, or aggregate queries.
3) Where does Cassandra excel compared to other technologies?

I have been working on Cassandra for some time. I know how it works and I like it very much. We are moving towards building a big cluster, but at this point I am not sure if it's the right decision. A lot of people in my company, including me, like Cassandra, but it has more to do with CQL than the internals or the use cases. Until now, there have been small PoCs and people enjoyed them. But for a large-scale project, we are not so sure. Please guide us.

Please note that the drawbacks of other technologies do not interest me; it's the strengths/weaknesses of Cassandra I am interested in.

Thanks

--
http://khangaonkar.blogspot.com/
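The geographic fault-tolerance point James raises comes from Cassandra's per-keyspace replication settings: with NetworkTopologyStrategy you declare a replica count per datacenter. As a small illustration of the shape of the CQL involved, here is a plain-Python helper that renders the statement (the keyspace and datacenter names are made up; in practice you would just type the CQL into cqlsh):

```python
# Build a CREATE KEYSPACE statement using NetworkTopologyStrategy,
# which places a configurable number of replicas in each datacenter.
# Keyspace and datacenter names below are illustrative.

def create_keyspace_cql(name, dc_replicas):
    opts = ", ".join(f"'{dc}': {n}" for dc, n in dc_replicas.items())
    return (f"CREATE KEYSPACE {name} WITH replication = "
            f"{{'class': 'NetworkTopologyStrategy', {opts}}};")

cql = create_keyspace_cql("sensor_data", {"us_east": 3, "eu_west": 2})
print(cql)
# CREATE KEYSPACE sensor_data WITH replication =
#   {'class': 'NetworkTopologyStrategy', 'us_east': 3, 'eu_west': 2};
```

With a layout like this, a whole datacenter can go offline and the other still holds full replicas, which is the cross-region story the thread is comparing against other systems.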
Re: autoscaling cassandra cluster
If you're interested and/or need some Cassandra docker images, let me know and I'll shoot you a link.

James

Sent from my iPhone

On May 21, 2014, at 10:19 AM, Jabbar Azam aja...@gmail.com wrote:

That sounds interesting. I was thinking of using CoreOS with Docker containers for the business logic, frontend, and Cassandra. I'll also have a look at cassandra-mesos.

Thanks

Jabbar Azam

On 21 May 2014 14:04, Panagiotis Garefalakis panga...@gmail.com wrote:

I agree with Prem, but recently someone sent a promising project called cassandra-mesos to this list: https://github.com/mesosphere/cassandra-mesos. One of its goals is to make scaling easier. I don’t have any personal opinion yet, but maybe you could give it a try.

Regards,
Panagiotis

On Wed, May 21, 2014 at 3:49 PM, Jabbar Azam aja...@gmail.com wrote:

Hello Prem,

I'm trying to find out whether people are autoscaling up and down automatically, not manually. I'm also interested in whether they are using a cloud-based solution and creating and destroying instances. I've found the following regarding GCE and how instances can be created and destroyed: https://cloud.google.com/developers/articles/auto-scaling-on-the-google-cloud-platform

Thanks

Jabbar Azam

On 21 May 2014 13:09, Prem Yadav ipremya...@gmail.com wrote:

Hi Jabbar,

With vnodes, scaling up should not be a problem. You could just add a machine with the cluster/seed/datacenter conf and it should join the cluster. Scaling down has to be manual: you drain the node and then decommission it.

thanks,
Prem

On Wed, May 21, 2014 at 12:35 PM, Jabbar Azam aja...@gmail.com wrote:

Hello,

Has anybody got a Cassandra cluster which autoscales depending on load or times of the day? I've seen the documentation on the Datastax website, and that only mentions adding and removing nodes, unless I've missed something. I want to know how to do this for Google Compute Engine. This isn't for a production system but a test system (multiple nodes) where I want to learn.
I'm not sure how to check the performance of the cluster: whether to use one performance metric or a mix of performance metrics, and then invoke a script to add or remove nodes from the cluster. I'd be interested to know whether people out there are autoscaling Cassandra on demand.

Thanks

Jabbar Azam
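One way to frame the "one metric or a mix" question is as a simple threshold policy: combine a few normalized metrics into a score and act only when the score leaves a band. A toy sketch in plain Python follows; the metric names, weights, and thresholds are all made up for illustration, and the actual add/remove actions would be the GCE instance scripts plus nodetool drain/decommission discussed above:

```python
# Toy autoscaling decision: a weighted combination of metrics (each
# normalized to [0, 1]) compared against an upper and lower threshold,
# with a floor on cluster size. Weights and thresholds are invented.

WEIGHTS = {"cpu": 0.5, "write_latency": 0.3, "disk": 0.2}

def scale_decision(metrics, nodes, low=0.3, high=0.7, min_nodes=3):
    score = sum(WEIGHTS[k] * metrics[k] for k in WEIGHTS)
    if score > high:
        return "add_node"        # e.g. create a GCE instance, let it bootstrap
    if score < low and nodes > min_nodes:
        return "remove_node"     # e.g. nodetool drain, then decommission
    return "no_change"

print(scale_decision({"cpu": 0.9, "write_latency": 0.8, "disk": 0.6}, nodes=5))  # add_node
print(scale_decision({"cpu": 0.1, "write_latency": 0.2, "disk": 0.1}, nodes=5))  # remove_node
print(scale_decision({"cpu": 0.5, "write_latency": 0.5, "disk": 0.5}, nodes=5))  # no_change
```

The band between the two thresholds prevents flapping: a cluster hovering near one threshold does not alternately add and remove nodes, which matters for Cassandra since each topology change triggers streaming.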
Re: autoscaling cassandra cluster
You normally don't (Ferry auto-generates the IP addresses). Let's move this conversation to the ferry-user Google group so that we don't pollute this mailing list...

James

Sent from my iPhone

On May 21, 2014, at 3:15 PM, Jabbar Azam aja...@gmail.com wrote:

Hello James,

How do you alter your cassandra.yaml file with each node's IP address? I want to use the scaling software (which I've not got yet) to create and destroy the GCE instances. I want to use fleet to deploy and undeploy the Cassandra nodes inside the Docker instances. I do realise I will have to run nodetool to add and remove the nodes from the cluster and also do the node cleanup.

Disclaimer: this is not a production system but something I'm experimenting with in my own time.

Thanks

Jabbar Azam

On 21 May 2014 15:51, James Horey j...@opencore.io wrote:

If you're interested and/or need some Cassandra docker images, let me know and I'll shoot you a link.
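For the cassandra.yaml question, the usual approach when orchestrating by hand is to render a per-node config from a template, injecting each node's IP into listen_address/rpc_address and a shared seed list into the seed provider. A minimal sketch in plain Python (the template mirrors real cassandra.yaml keys, but the IPs, cluster name, and helper function are invented; tools like Ferry do this substitution for you):

```python
# Render per-node cassandra.yaml fragments from a template.
# listen_address/rpc_address get the node's own IP; every node shares
# the same seed list (here, just the first node).

TEMPLATE = """cluster_name: 'test_cluster'
listen_address: {ip}
rpc_address: {ip}
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "{seeds}"
"""

def render_config(ip, seed_ips):
    return TEMPLATE.format(ip=ip, seeds=",".join(seed_ips))

nodes = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
configs = {ip: render_config(ip, nodes[:1]) for ip in nodes}
print(configs["10.0.0.2"])
```

Each rendered fragment would be written to the node's container before starting Cassandra; the new node then contacts the seed and bootstraps into the ring.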
Re: How to clear all data using CQL?
If you’re running unit tests and repeatedly clearing the Cassandra keyspaces, you may want to check out Ferry (ferry.opencore.io). It lets you stand up and destroy multiple Cassandra stacks locally on your machine and is useful for the use case you described. I’m the author of Ferry and would be glad to help out. (Sorry for the plug.)

James

On Apr 16, 2014, at 5:29 AM, Sebastian Schmidt isib...@gmail.com wrote:

Thank you, that worked!

On 16.04.2014 10:46, Mark Reddy wrote:

select keyspace_name from system.schema_keyspaces;
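Building on the query Mark quotes: for clearing test data without dropping the schema, a common pattern is to enumerate a keyspace's tables and TRUNCATE each. A small sketch in plain Python that just generates the CQL strings (the keyspace and table names are hypothetical; in the Cassandra 2.x era of this thread the table list would come from a query against the system schema tables, analogous to the keyspace query above):

```python
# Generate TRUNCATE statements for every table in a keyspace.
# In a real test harness the table list would be queried from the
# system schema rather than hard-coded as it is here.

def truncate_statements(keyspace, tables):
    return [f"TRUNCATE {keyspace}.{table};" for table in tables]

stmts = truncate_statements("my_ks", ["users", "events"])
for s in stmts:
    print(s)
# TRUNCATE my_ks.users;
# TRUNCATE my_ks.events;
```

TRUNCATE keeps the table definitions, so each test run starts from empty tables without re-creating the schema.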
Help collecting Cassandra examples
Hello all,

I’m trying to collect and organize Cassandra applications for educational purposes. I’m hoping that by collating these applications in a single place, new users will be able to get up to speed more easily. If you know of a great application (it should be open source and preferably up to date), please shoot me an email or send a pull request using the GitHub page below.

https://github.com/opencore/cassandra-examples

Thanks!
James