Hello, 

I saw this nice link from an event:

http://www.datastax.com/dev/blog/zen-art-spark-maintenance?mkt_tok=3RkMMJWWfF9wsRogvqzIZKXonjHpfsX56%2B8uX6GylMI%2F0ER3fOvrPUfGjI4GTcdmI%2BSLDwEYGJlv6SgFSrXMMblswLgIXBY%3D

I would like to test using Spark to perform some operations on a column family; 
my objective is to read from CF A and write the output of my M/R job to CF B. 
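
Roughly, what I have in mind is something like the sketch below (using the 
DataStax spark-cassandra-connector; the keyspace, table and column names, the 
master URL and the connection host are just placeholders I made up):

import org.apache.spark.{SparkConf, SparkContext}
import com.datastax.spark.connector._

object CfAToCfB {
  def main(args: Array[String]): Unit = {
    // Point Spark at the Cassandra cluster; host and master URL are placeholders.
    val conf = new SparkConf()
      .setAppName("cf-a-to-cf-b")
      .setMaster("spark://spark-master:7077")               // standalone master
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // Read CF A as an RDD of CassandraRow.
    val rows = sc.cassandraTable("my_keyspace", "cf_a")

    // Example map/reduce step: count rows per value of a "category" column.
    val counts = rows
      .map(row => (row.getString("category"), 1L))
      .reduceByKey(_ + _)

    // Write the (category, count) pairs into CF B.
    counts.saveToCassandra("my_keyspace", "cf_b", SomeColumns("category", "count"))

    sc.stop()
  }
}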

That said, I've read this from Spark's FAQ (http://spark.apache.org/faq.html):

"Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system 
(for example, NFS mounted at the same path on each node). If you have this type 
of filesystem, you can just deploy Spark in standalone mode."

My question is: if I don't want to have an HDFS installation just to run 
Spark on Cassandra, is my only option to have this NFS mounted over the network? 
It doesn't seem smart to me to use something like NFS to store Spark files, as 
it would probably hurt performance, and at the same time I wouldn't want to 
maintain an additional HDFS cluster just to run jobs on Cassandra. 
Is there a way of using Cassandra itself as this "some form of shared file 
system"?

-Marcelo


<< ideas don&#39;t deserve respect >>
