Re: Regarding rdd.collect()

2015-08-18 Thread ayan guha
I think you are mixing up the notion of a job from the Hadoop MapReduce
world with Spark. In Spark, RDDs are immutable and transformations are lazy.
So the first time an RDD actually fills up memory is when you run the first
action. After that, it stays in memory until either the application
is stopped or new RDDs are generated, causing old RDDs to get pushed out to
disk.
Remember, Spark does not provide fault tolerance through replication but
through lineage. So it is important to keep old RDDs around in case of any
failure in downstream transformations.
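
To make the lazy-evaluation and lineage points concrete, here is a small pure-Python toy model. It is only a sketch: ToyRDD, lose_data() and the computations counter are made-up names for illustration and are not Spark's API or implementation. It shows that a transformation only records lineage, that an action triggers the computation, and that data which was kept in memory and then lost can be rebuilt from the lineage.

```python
# Toy model only: ToyRDD is a hypothetical class, not part of Spark.
class ToyRDD:
    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # base data (root RDD only)
        self.parent = parent      # lineage: the RDD this one derives from
        self.fn = fn              # function applied to each parent element
        self.keep = False         # set by cache()
        self.stored = None        # materialized data, if kept in memory
        self.computations = 0     # how many times this RDD was computed

    def map(self, fn):
        # Transformation: lazy, it only records the lineage.
        return ToyRDD(parent=self, fn=fn)

    def cache(self):
        # Ask for the data to be kept once it has been computed.
        self.keep = True
        return self

    def _materialize(self):
        if self.stored is not None:
            return self.stored
        self.computations += 1
        if self.parent is None:
            data = list(self.source)
        else:
            data = [self.fn(x) for x in self.parent._materialize()]
        if self.keep:
            self.stored = data
        return data

    def count(self):
        # Action: this is what actually triggers the computation.
        return len(self._materialize())

    def lose_data(self):
        # Simulate losing the in-memory copy (e.g. a failed node).
        self.stored = None


base = ToyRDD(source=range(5))
doubled = base.map(lambda x: x * 2).cache()
assert doubled.computations == 0   # nothing has run yet

first = doubled.count()    # computes base and doubled
second = doubled.count()   # served from the kept copy
doubled.lose_data()
third = doubled.count()    # rebuilt from lineage
print(first, second, third, doubled.computations)
```

In this toy, the second count() does no work because the data was kept, while the third one recomputes everything by walking back through the parent chain.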

On Tue, Aug 18, 2015 at 5:46 PM, Dawid Wysakowicz 
wysakowicz.da...@gmail.com wrote:

 No, the data is not stored between two jobs. But it is stored for the
 lifetime of a job. A job can run multiple actions.
 For sharing an RDD between jobs, you can have a look at Spark
 Job Server (spark-jobserver, https://github.com/ooyala/spark-jobserver)
 or some in-memory stores: Tachyon (http://tachyon-project.org/) or
 Ignite (https://ignite.incubator.apache.org/).

-- 
Best Regards,
Ayan Guha


Re: Regarding rdd.collect()

2015-08-18 Thread Hemant Bhanawat
On Tue, Aug 18, 2015 at 1:16 PM, Dawid Wysakowicz 
wysakowicz.da...@gmail.com wrote:

 No, the data is not stored between two jobs. But it is stored for the
 lifetime of a job. A job can run multiple actions.

I too thought so but wanted to confirm. Thanks.


Re: Re: Regarding rdd.collect()

2015-08-18 Thread Todd
One Spark application can have many jobs, e.g., first call rdd.count(), then
call rdd.collect().
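
As a rough illustration of "one application, many jobs", here is a pure-Python toy. ToyApplication and ToyDataset are hypothetical names invented for this sketch and do not mirror Spark's classes; the only point is that each action call is recorded as a separate job inside the same application.

```python
# Toy model only: these classes are made up and are not Spark's API.
class ToyApplication:
    def __init__(self, name):
        self.name = name
        self.jobs = []            # one entry per action that was run

    def parallelize(self, data):
        return ToyDataset(self, list(data))


class ToyDataset:
    def __init__(self, app, data):
        self.app = app
        self._data = data

    def _run_job(self, action_name):
        # Every action registers a new job with the owning application.
        job_id = len(self.app.jobs)
        self.app.jobs.append((job_id, action_name))

    def count(self):
        self._run_job("count")
        return len(self._data)

    def collect(self):
        self._run_job("collect")
        return list(self._data)


app = ToyApplication("demo")
rdd = app.parallelize([10, 20, 30])
n = rdd.count()        # first job
items = rdd.collect()  # second job, same application
print(app.jobs)        # [(0, 'count'), (1, 'collect')]
```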






At 2015-08-18 15:37:14, Hemant Bhanawat hemant9...@gmail.com wrote:

It is still in memory for future rdd transformations and actions.


This is interesting. Do you mean Spark holds the data in memory between two job
executions? How does the second job get a handle on the data in memory? I am
interested in knowing more about it. Can you forward me a Spark article or JIRA
that talks about it?


Regarding rdd.collect()

2015-08-18 Thread praveen S
When I do an rdd.collect(), does the data move back to the driver, or is it
still held in memory across the executors?


Re: Regarding rdd.collect()

2015-08-18 Thread Dawid Wysakowicz
No, the data is not stored between two jobs. But it is stored for the
lifetime of a job. A job can run multiple actions.
For sharing an RDD between jobs, you can have a look at Spark
Job Server (spark-jobserver, https://github.com/ooyala/spark-jobserver) or
some in-memory stores: Tachyon (http://tachyon-project.org/) or Ignite
(https://ignite.incubator.apache.org/).

2015-08-18 9:37 GMT+02:00 Hemant Bhanawat hemant9...@gmail.com:

 It is still in memory for future rdd transformations and actions.

 This is interesting. Do you mean Spark holds the data in memory between two
 job executions? How does the second job get a handle on the data in
 memory? I am interested in knowing more about it. Can you forward me a
 Spark article or JIRA that talks about it?


Re: Regarding rdd.collect()

2015-08-18 Thread Hemant Bhanawat
It is still in memory for future rdd transformations and actions.

This is interesting. Do you mean Spark holds the data in memory between two
job executions? How does the second job get a handle on the data in
memory? I am interested in knowing more about it. Can you forward me a
Spark article or JIRA that talks about it?

On Tue, Aug 18, 2015 at 12:49 PM, Sabarish Sasidharan 
sabarish.sasidha...@manthan.com wrote:

 It is still in memory for future RDD transformations and actions. What you
 get in the driver is a copy of the data.


Re: Regarding rdd.collect()

2015-08-18 Thread Sabarish Sasidharan
It is still in memory for future RDD transformations and actions. What you
get in the driver is a copy of the data.

Regards
Sab
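
The "copy" point can be sketched with a pure-Python toy. ToyExecutor and this collect() function are hypothetical names for illustration only, not Spark's API. In real Spark, as far as I know, the executor-side data only stays around after the action if the RDD was persisted (for example with rdd.cache()); otherwise it is recomputed from lineage when needed. The toy simply assumes the partitions are kept.

```python
import copy


# Toy model only: ToyExecutor and collect() are made up for illustration.
class ToyExecutor:
    def __init__(self, partition):
        self.partition = partition   # data held on this "executor"


def collect(executors):
    # Gather a copy of every partition into one list on the "driver".
    result = []
    for ex in executors:
        result.extend(copy.deepcopy(ex.partition))
    return result


executors = [ToyExecutor([1, 2]), ToyExecutor([3, 4])]
driver_copy = collect(executors)

# Changing the driver-side list does not touch the executors' data.
driver_copy.append(99)
print(driver_copy)
print([ex.partition for ex in executors])
```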

On Tue, Aug 18, 2015 at 12:02 PM, praveen S mylogi...@gmail.com wrote:

 When I do an rdd.collect(), does the data move back to the driver, or is it
 still held in memory across the executors?




-- 

Architect - Big Data
Ph: +91 99805 99458

Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan
India ICT)*
+++