Re: Machine Learning With Cassandra

2014-08-30 Thread James Horey
If you want distributed machine learning, you can use either Mahout (runs on 
Hadoop) or Spark (MLlib). If you choose the Hadoop route, DataStax provides an 
HDFS-compatible layer (CFS) for interacting with data stored in Cassandra. 
Otherwise you can try to use the Cassandra InputFormat (not as simple, but 
plenty of people use it). 

A quick search for “map reduce cassandra” on this list brings up a recent 
conversation: 
http://mail-archives.apache.org/mod_mbox/cassandra-user/201407.mbox/%3CCAAX2xq6UhsGfq_gtfjogOV7%3DMi8q%3D5SmRfNM1%2BKFEXXVk%2Bp8iw%40mail.gmail.com%3E
 

If you prefer to use Spark, you can try the DataStax Cassandra connector: 
https://github.com/datastax/spark-cassandra-connector. It lets you run Spark 
jobs that read data from, and write results back to, Cassandra. 

Cheers, 
James

Web: http://ferry.opencore.io
Twitter: @open_core_io

On Aug 30, 2014, at 4:02 PM, Adaryl Bob Wakefield, MBA 
adaryl.wakefi...@hotmail.com wrote:

 Yes, I remember this conversation. That was when I was first stepping 
 into this stuff. My current understanding is:
 Storm = stream and micro-batch
 Spark = batch and micro-batch
  
 Micro batching is what gets you to exactly-once processing semantics. I’m 
 clear on that. What I’m not clear on is how and where processing takes place.
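 The exactly-once idea behind micro batching can be sketched in plain Python (a toy sketch of my own, not Storm/Trident or Spark code; the batch size and replay scenario are invented for illustration): each batch gets a stable id, and state updates are committed keyed by that id, so replaying a failed batch overwrites its slot instead of double-counting.

```python
from itertools import islice

def micro_batches(stream, size):
    """Group a stream into fixed-size batches, tagging each with an id."""
    it = iter(stream)
    batch_id = 0
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch_id, batch
        batch_id += 1

def commit(store, batch_id, value):
    """Idempotent commit: a replayed batch overwrites the same slot
    rather than double-counting, the (greatly simplified) trick behind
    exactly-once state updates in micro-batching frameworks."""
    store[batch_id] = value

store = {}
for bid, batch in micro_batches(range(10), 4):
    commit(store, bid, sum(batch))

# Simulate a replay of batch 0 after a failure:
commit(store, 0, sum([0, 1, 2, 3]))
total = sum(store.values())  # still 45, not 51: the replay did not double-count
```

 In a purely record-at-a-time system, the same failure would require deduplicating individual tuples to avoid counting twice.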
  
 I also get the fact that Spark is a faster execution engine than MapReduce. 
 But we have Tez now; except, as far as I know, that’s not useful here because 
 my data isn’t in HDFS. People seem to be talking quite a bit about Mahout and 
 the Spark shell, but I’d really like to get this done with a minimum amount of 
 software: either Storm or Spark, but not both. 
  
 Trident ML isn’t distributed which is fine because I’m not trying to do 
 learning on the stream. For now, I’m just trying to do learning in batch and 
 then update parameters as suggested earlier.
  
 Let me simplify the question: how do I do distributed machine learning when my 
 data is in Cassandra and not HDFS? I haven’t totally explored Mahout yet, but 
 a lot of the algorithms run on MapReduce, which is fine for now. As I 
 understand it though, MapReduce works on data in HDFS, correct?
  
 Adaryl Bob Wakefield, MBA
 Principal
 Mass Street Analytics
 913.938.6685
 www.linkedin.com/in/bobwakefieldmba
 Twitter: @BobLovesData
  
 From: Shahab Yunus
 Sent: Saturday, August 30, 2014 11:23 AM
 To: user@cassandra.apache.org
 Subject: Re: Machine Learning With Cassandra
  
 Spark is not storage; rather, it is a processing framework meant to run on big 
 data, distributed architectures (a very high-level intro/definition). 
 It provides a batched version of in-memory map/reduce-like jobs. It is not 
 completely streaming like Storm; instead it batches collections of tuples, and 
 thus you can run complex ML algorithms relatively fast. 
  
 I think we just discussed this a short while ago, when a similar question 
 (Storm vs. Spark, I think) was raised by you. Here is the link for that 
 discussion:
 http://markmail.org/message/lc4icuw4hobul6oh
  
  
 Regards,
 Shahab
 
 
 On Sat, Aug 30, 2014 at 12:16 PM, Adaryl Bob Wakefield, MBA 
 adaryl.wakefi...@hotmail.com wrote:
 Isn’t it a bit overkill to use both Storm and Spark in the architecture? You say 
 load it “into” Spark. Is Spark separate storage?
  
 B.
  
 From: Alex Kamil
 Sent: Friday, August 29, 2014 10:46 PM
 To: user@cassandra.apache.org
 Subject: Re: Machine Learning With Cassandra
  
 Adaryl,
  
 most ML algorithms are based on some form of numerical optimization, using 
 something like online gradient descent or conjugate gradient (e.g. in SVM 
 classifiers). In its simplest form it is a nested FOR loop, where on each 
 iteration you update the weights or parameters of the model until reaching 
 some convergence threshold that minimizes the prediction error (usually the 
 goal is to minimize a loss function, as in the popular least-squares 
 technique). You could parallelize this loop using a brute-force 
 divide-and-conquer approach: map a chunk of data to each node, compute a 
 partial sum there, aggregate the results from each node into a global sum in a 
 'reduce' stage, and repeat this map-reduce cycle until convergence. You can 
 look up distributed gradient descent or check out Mahout or Spark MLlib for 
 examples. Alternatively you can use something like GraphLab.
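 The nested loop above can be sketched in plain Python (a toy single-process simulation of the map-reduce cycle, not Mahout or MLlib code; the data, learning rate, and convergence threshold are made up for illustration): each "node" maps over its chunk and emits a partial gradient, the reduce step sums them into a global gradient, and the loop repeats until the update falls below the threshold.

```python
# Toy simulation of distributed gradient descent for least squares.
# Each "node" maps over its chunk of (x, y) pairs and emits a partial
# gradient; the "reduce" step sums the partials into a global gradient.

def partial_gradient(w, chunk):
    """Map step: gradient of sum((w*x - y)^2) over one node's chunk."""
    return sum(2 * x * (w * x - y) for x, y in chunk)

def distributed_gd(chunks, lr=0.01, tol=1e-9, max_iter=1000):
    w = 0.0
    for _ in range(max_iter):
        # "Map": each node computes its partial sum independently.
        partials = [partial_gradient(w, c) for c in chunks]
        # "Reduce": aggregate the partial gradients into a global one.
        grad = sum(partials)
        step = lr * grad
        w -= step
        if abs(step) < tol:  # convergence threshold
            break
    return w

# Data generated from y = 3*x, split across two "nodes".
chunks = [[(1, 3), (2, 6)], [(3, 9), (4, 12)]]
w = distributed_gd(chunks)
```

 Here w converges to 3 (the slope the data was generated from); in a real deployment the chunks would live on separate nodes and the sum would happen in an actual reduce stage, which is what Mahout and MLlib do at scale.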
  
 Cassandra can serve as the data store from which you load the training data, 
 e.g. into Spark using this connector, and then train the model using MLlib or 
 Mahout (it has Spark bindings, I believe). Once you have trained the model, you 
 could save the parameters back in Cassandra. The next stage is using the 
 model to classify new data, e.g. recommend similar items based on a log of 
 new purchases; there you could once again use Spark, or Storm with something 
 like this.
  
 Alex
  
  
 
 
 On Fri, Aug 29, 2014 at 10:24 PM, Adaryl Bob Wakefield, MBA 
 adaryl.wakefi...@hotmail.com wrote:
 I’m planning to speak at a local meet-up and I need to know if what I 

Re: Cassandra use cases/Strengths/Weakness

2014-07-04 Thread James Horey
I’ve supported a variety of different “big data” systems and most have their 
own particular set of use cases that make sense. Having said that, I believe 
that Cassandra uniquely excels at the following:

* Low write latency with respect to small to medium write sizes (logs, sensor 
data, etc.)
* Linear write scalability
* Fault-tolerance across geographic locations

The first two points make it an excellent candidate for high-throughput 
“transactional” systems. Other systems that play in this space tend to be HBase 
and Riak (there may be others, but I’m most familiar with those two). The last 
point, however, is pretty unique to Cassandra. 

So if you’re looking for a high-throughput transactional system with large 
scale-out, then Cassandra may make sense for you. If you’re looking for 
something geared more towards analytics (few bulk writes, many reads), then 
something in the Hadoop space may make sense.

Cheers
James

On Jul 4, 2014, at 3:31 PM, Prem Yadav ipremya...@gmail.com wrote:

 Thanks Manoj. Great post for those who already have Cassandra in production.
 However, it brings me back to my original post.
 All the points you have mentioned apply to any big data technology.
 Storage - all of them.
 Query - all of them; in fact, a lot of them perform better. I agree that the 
 CQL structure is better, but Hive, Mongo, etc. are all good.
 Availability - many of them.
 
 So my question is basically to the Cassandra support people (e.g. DataStax) or 
 the developers: 
 what makes Cassandra special? 
 If I have to convince my CTO to spend a million dollars on a cluster and 
 support, his first question would be: why Cassandra? Why not this or that?
 
 So I am still not sure what special value Cassandra brings to the table.
 
 Sorry about the rant, but in the enterprise world decisions are made taking 
 into account stability, convincing managers, and what not. The chosen 
 technology has to be stable for years, and people should be convinced that 
 the engineers are not going to spend a lot of time firefighting.
 
 Any inputs appreciated.
 
 
 
 On Fri, Jul 4, 2014 at 7:07 PM, Manoj Khangaonkar khangaon...@gmail.com 
 wrote:
 These are my personal opinions based on a few months of using Cassandra; 
 others may have different views.
 
 
 http://khangaonkar.blogspot.com/2014/06/apache-cassandra-things-to-consider.html
 
 regards
 
 
 
 On Fri, Jul 4, 2014 at 7:37 AM, Prem Yadav ipremya...@gmail.com wrote:
 Hi,
 I have seen in a lot of replies that Cassandra is not designed for this and 
 that. I don't want to sound rude; I just need some info so that I can compare 
 it to technologies like HBase, Mongo, Elasticsearch, Solr, etc.
 
 1) What is Cassandra designed for? Heavy writes, yes, but so are HBase and 
 Elasticsearch. What are the use cases that suit Cassandra?
 
 2) What kind of queries are best suited for Cassandra?
 I ask because I have seen people asking about queries and getting replies 
 that they are not suited to Cassandra. For example: queries where a large 
 number of rows is requested and a timeout happens, or range queries, or 
 aggregate queries.
 
 3) Where does Cassandra excel compared to other technologies?
 
 I have been working with Cassandra for some time. I know how it works and I 
 like it very much. 
 We are moving towards building a big cluster, but at this point I am not sure 
 if it's the right decision. 
 
 A lot of people in my company, including me, like Cassandra, but that has more 
 to do with CQL than with the internals or the use cases. Until now there have 
 been small PoCs, and people enjoyed them. But for a large-scale project, we 
 are not so sure.
 
 Please guide us.
 Please note that the drawbacks of other technologies do not interest me; it's 
 the strengths/weaknesses of Cassandra I am interested in.
 Thanks
 
  
 
 
 
 
 
 
 
 -- 
 http://khangaonkar.blogspot.com/
 



Re: autoscaling cassandra cluster

2014-05-21 Thread James Horey
If you're interested and/or need some Cassandra Docker images, let me know and 
I'll shoot you a link.

James

Sent from my iPhone

 On May 21, 2014, at 10:19 AM, Jabbar Azam aja...@gmail.com wrote:
 
 That sounds interesting. I was thinking of using CoreOS with Docker 
 containers for the business logic, frontend, and Cassandra. I'll also have a 
 look at cassandra-mesos.
 
 Thanks
 
 Jabbar Azam
 
 On 21 May 2014 14:04, Panagiotis Garefalakis panga...@gmail.com wrote:
 I agree with Prem, but recently someone sent this promising project, Cassandra 
 on Mesos, to this list: 
 https://github.com/mesosphere/cassandra-mesos
 One of its goals is to make scaling easier. 
 I don’t have a personal opinion yet, but maybe you could give it a try.
 
 Regards,
 Panagiotis
 
 
 
 On Wed, May 21, 2014 at 3:49 PM, Jabbar Azam aja...@gmail.com wrote:
 Hello Prem,
 
 I'm trying to find out whether people are autoscaling up and down 
 automatically, not manually. I'm also interested in whether they are using 
 a cloud-based solution and creating and destroying instances. 
 
 I've found the following regarding GCE and how instances can be created and 
 destroyed: 
 https://cloud.google.com/developers/articles/auto-scaling-on-the-google-cloud-platform
 
 
 Thanks
 
 Jabbar Azam
 
 
 On 21 May 2014 13:09, Prem Yadav ipremya...@gmail.com wrote:
 Hi Jabbar,
 with vnodes, scaling up should not be a problem. You could just add machines 
 with the cluster/seed/datacenter conf and they should join the cluster.
 Scaling down has to be manual: you drain the node and decommission it.
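 For reference, the handful of cassandra.yaml settings a new node needs in order to join an existing vnode-enabled cluster looks something like this (the cluster name and addresses are placeholders):

```yaml
cluster_name: 'MyCluster'            # must match the existing cluster
num_tokens: 256                      # vnodes: tokens are assigned automatically
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "10.0.0.1,10.0.0.2"   # addresses of existing seed nodes
listen_address: 10.0.0.5             # this node's own address
```

 With these set, the new node bootstraps into the ring on startup.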
 
 thanks,
 Prem
 
 
 
 On Wed, May 21, 2014 at 12:35 PM, Jabbar Azam aja...@gmail.com wrote:
 Hello,
 
 Has anybody got a cassandra cluster which autoscales depending on load or 
 times of the day?
 
 I've seen the documentation on the DataStax website, and that only 
 mentions adding and removing nodes, unless I've missed something.
 
 I want to know how to do this for Google Compute Engine. This isn't 
 for a production system but a test system (multiple nodes) where I want to 
 learn. I'm not sure how to measure the performance of the cluster, whether 
 to use one performance metric or a mix of performance metrics, and then 
 invoke a script to add or remove nodes from the cluster.
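 The metric-driven part can be sketched as a small policy function (a sketch of my own: the metric names and thresholds are invented for illustration, and the actual values would come from nodetool/JMX, with the add/remove actions done via the GCE API and nodetool, as discussed in this thread):

```python
def scale_decision(metrics, up_cpu=0.75, down_cpu=0.25, max_latency_ms=50.0):
    """Decide whether to add or remove a node from cluster-wide averages.

    metrics: dict with 'cpu' (0..1) and 'write_latency_ms', averaged
    across the cluster. Thresholds are illustrative, not recommendations.
    """
    if metrics["cpu"] > up_cpu or metrics["write_latency_ms"] > max_latency_ms:
        return "add_node"      # e.g. boot a GCE instance; vnodes let it join
    if metrics["cpu"] < down_cpu and metrics["write_latency_ms"] < max_latency_ms:
        return "remove_node"   # e.g. drain + decommission, then delete the VM
    return "hold"

decision = scale_decision({"cpu": 0.9, "write_latency_ms": 12.0})  # "add_node"
```

 A cron job could poll the metrics, call something like this, and trigger the corresponding add or remove script.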
 
 I'd be interested to know whether people out there are autoscaling 
 cassandra on demand.
 
 Thanks
 
 Jabbar Azam


Re: autoscaling cassandra cluster

2014-05-21 Thread James Horey
You normally don't (Ferry auto-generates the IP addresses). Let's move this 
conversation to the ferry-user Google group so that we don't pollute this 
mailing list...

James

Sent from my iPhone

 On May 21, 2014, at 3:15 PM, Jabbar Azam aja...@gmail.com wrote:
 
 Hello James,
 
 How do you alter your cassandra.yaml file with each node's IP address?
 
 I want to use the scaling software (which I've not got yet) to create and 
 destroy the GCE instances, and fleet to deploy and undeploy the Cassandra 
 nodes inside the Docker instances. I do realise I will have to run nodetool 
 to add and remove the nodes from the cluster, and also run the node cleanup.
 
 Disclaimer: this is not a production system but something I'm experimenting 
 with in my own time.
 
 
 Thanks
 
 Jabbar Azam
 
 
 


Re: How to clear all data using CQL?

2014-04-16 Thread James Horey
If you’re running unit tests and repeatedly clearing the Cassandra keyspaces, 
you may want to check out Ferry (ferry.opencore.io). It lets you stand up and 
destroy multiple Cassandra stacks locally on your machine, and is useful 
for the use case you described. I’m the author of Ferry and would be glad to 
help out. 

(Sorry for the plug)
James

On Apr 16, 2014, at 5:29 AM, Sebastian Schmidt isib...@gmail.com wrote:

 Thank you, that worked!
 
 Am 16.04.2014 10:46, schrieb Mark Reddy:
 select keyspace_name from system.schema_keyspaces;
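 Building on Mark's query, dropping everything non-system between test runs can be scripted with the DataStax Python driver. A hedged sketch (the function name and keep-list are my own, and `system.schema_keyspaces` is the Cassandra 2.x-era system table):
 
```python
def clear_keyspaces(session, keep=("system", "system_traces")):
    """Drop every user keyspace, leaving the system keyspaces alone.

    session: a connected cassandra.cluster.Session (or anything with an
    execute() method whose rows carry a .keyspace_name attribute).
    Returns the names of the keyspaces that were dropped.
    """
    rows = session.execute("SELECT keyspace_name FROM system.schema_keyspaces")
    dropped = []
    for row in rows:
        if row.keyspace_name not in keep:
            session.execute("DROP KEYSPACE %s" % row.keyspace_name)
            dropped.append(row.keyspace_name)
    return dropped
```
 
 e.g. `clear_keyspaces(Cluster(['127.0.0.1']).connect())` in a test teardown.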
 
 



Help collecting Cassandra examples

2014-03-31 Thread James Horey
Hello all,

I’m trying to collect and organize Cassandra applications for educational 
purposes. I’m hoping that by collating these applications in a single place, 
new users will be able to get up to speed a bit more easily. If you know of a 
great application (it should be open source and preferably up to date), please 
shoot me an email or send a pull request using the GitHub page below. 

https://github.com/opencore/cassandra-examples

Thanks!
James