Re: Spark and intermediate results

2015-10-09 Thread Jonathan Haddad
You can run Spark against your Cassandra data directly, without using a
shared filesystem.

https://github.com/datastax/spark-cassandra-connector
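
For a rough idea, here is a minimal Scala sketch (connection host, keyspace,
and table names are placeholders, assuming the connector jar is on the
classpath):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    // Point Spark straight at the Cassandra cluster -- no HDFS/NFS needed.
    val conf = new SparkConf()
      .setAppName("cassandra-direct")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    // cassandraTable comes from the connector's implicit conversions
    // and reads the table as an RDD of CassandraRow.
    val rows = sc.cassandraTable("my_keyspace", "my_table")
    println(rows.count())

The connector also tries to schedule tasks on the nodes that own each token
range, which is how read locality is handled.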




Re: Spark and intermediate results

2015-10-09 Thread Marcelo Valle (BLOOMBERG/ LONDON)
I know the connector, but the connector alone only means Spark will take its
*input* data from Cassandra, right? What about intermediate results?
If it stores intermediate results in Cassandra, could you please clarify how
data locality is handled? Will it store them in another keyspace?
I could not find any documentation about this...


<< ideas don't deserve respect >>

Spark and intermediate results

2015-10-09 Thread Marcelo Valle (BLOOMBERG/ LONDON)
Hello, 

I saw this nice link from an event:

http://www.datastax.com/dev/blog/zen-art-spark-maintenance?mkt_tok=3RkMMJWWfF9wsRogvqzIZKXonjHpfsX56%2B8uX6GylMI%2F0ER3fOvrPUfGjI4GTcdmI%2BSLDwEYGJlv6SgFSrXMMblswLgIXBY%3D

I would like to test using Spark to perform some operations on a column family;
my objective is to read from CF A and write the output of my M/R job to CF B.

That said, I've read this in Spark's FAQ (http://spark.apache.org/faq.html):

"Do I need Hadoop to run Spark?
No, but if you run on a cluster, you will need some form of shared file system 
(for example, NFS mounted at the same path on each node). If you have this type 
of filesystem, you can just deploy Spark in standalone mode."

The question I ask is: if I don't want to have an HDFS installation just to run
Spark on Cassandra, is my only option to have this NFS mounted over the network?
It doesn't seem smart to me to use something like NFS to store Spark files, as
it would probably hurt performance, and at the same time I wouldn't want to
maintain an additional HDFS cluster just to run jobs on Cassandra.
Is there a way of using Cassandra itself as this "some form of shared file
system"?

-Marcelo


<< ideas don't deserve respect >>

Re: Spark and intermediate results

2015-10-09 Thread karthik prasad
Spark's core module uses this connector to read data from Cassandra and
create RDDs or DataFrames in its workspace (in memory or on disk, depending
on the Spark configuration). Transformations or queries are then applied to
the RDDs or DataFrames, and the end results are stored back into Cassandra
using the connector.
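
As a rough sketch of that read/transform/write flow (keyspace, table, and
column names below are made up for illustration; intermediate data stays in
Spark's memory, spilling to the workers' local disks when needed, rather than
in Cassandra):

    import com.datastax.spark.connector._
    import org.apache.spark.{SparkConf, SparkContext}

    // Read CF A as an RDD, transform it, and write the result to CF B.
    val conf = new SparkConf()
      .setAppName("cf-a-to-cf-b")
      .set("spark.cassandra.connection.host", "127.0.0.1")
    val sc = new SparkContext(conf)

    sc.cassandraTable("my_keyspace", "cf_a")                      // read CF A
      .map(row => (row.getString("id"), row.getInt("value") * 2)) // transform
      .saveToCassandra("my_keyspace", "cf_b", SomeColumns("id", "value")) // write CF B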

Note: If you just want to read/write Cassandra from Spark, you can also try
Kundera's Spark-Cassandra Module
<https://github.com/impetus-opensource/Kundera/wiki/Spark-Cassandra-Module>.
Kundera exposes the operations in a JPA style and helps with quick development.

-Karthik
