Re: Regarding rdd.collect()
I think you are mixing up the notion of a job from the Hadoop MapReduce world with Spark's. In Spark, RDDs are immutable and transformations are lazy, so the first time an RDD actually fills up memory is when you run the first action on it. After that, a cached RDD stays in memory until either the application is stopped or new RDDs are generated, causing old ones to be pushed out of memory (to disk, if the storage level allows it). Remember that Spark provides fault tolerance not through replication but through lineage, so it is important to keep old RDDs around in case a downstream transformation fails.

On Tue, Aug 18, 2015 at 5:46 PM, Dawid Wysakowicz wysakowicz.da...@gmail.com wrote:

> No, the data is not stored between two jobs. But it is stored for a lifetime of a job. Job can have multiple actions run.
>
> For a matter of sharing an rdd between jobs you can have a look at Spark Job Server(spark-jobserver https://github.com/ooyala/spark-jobserver) or some In-Memory storages: Tachyon(http://tachyon-project.org/) or Ignite(https://ignite.incubator.apache.org/)

--
Best Regards,
Ayan Guha
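The points above about laziness, caching, and lineage can be sketched with a small pure-Python toy. This is not Spark code: `ToyRDD`, `evict`, and the `computations` counter are hypothetical names invented for illustration, and the model ignores partitions, executors, and storage levels entirely.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: it records how to build data, not the data."""

    def __init__(self, source=None, parent=None, fn=None):
        self.source = source      # root data, only set on the first RDD of a lineage
        self.parent = parent      # lineage: the RDD this one was derived from
        self.fn = fn              # function applied to each element of the parent
        self.want_cache = False   # set by cache(); nothing is stored yet
        self.cached = None        # materialized data, filled by the first action after cache()
        self.computations = 0     # how many times this RDD's data was actually computed

    def map(self, fn):
        # Transformation: lazy. Returns a new ToyRDD and computes nothing.
        return ToyRDD(parent=self, fn=fn)

    def cache(self):
        # Only marks the RDD; data is kept once an action materializes it.
        self.want_cache = True
        return self

    def _compute(self):
        if self.cached is not None:
            return self.cached
        self.computations += 1
        if self.parent is None:
            data = list(self.source)
        else:
            data = [self.fn(x) for x in self.parent._compute()]
        if self.want_cache:
            self.cached = data
        return data

    def count(self):
        # Action: triggers computation.
        return len(self._compute())

    def collect(self):
        # Action: triggers computation and hands back a copy.
        return list(self._compute())

    def evict(self):
        # Simulates losing the cached data, e.g. through memory pressure or a failure.
        self.cached = None


base = ToyRDD(source=range(5))

doubled = base.map(lambda x: x * 2)
assert doubled.computations == 0      # the transformation alone ran nothing
doubled.count()
doubled.collect()                     # not cached: each action recomputes

cached = base.map(lambda x: x + 1).cache()
cached.count()
cached.collect()                      # second action is served from the cache
computations_while_cached = cached.computations

cached.evict()
recovered = cached.collect()          # rebuilt from lineage after the loss
```

In this toy, `doubled` is computed once per action, while `cached` is computed once and then again only after `evict()`, which is the role lineage plays when cached data is lost.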
Re: Regarding rdd.collect()
On Tue, Aug 18, 2015 at 1:16 PM, Dawid Wysakowicz wysakowicz.da...@gmail.com wrote:

> No, the data is not stored between two jobs. But it is stored for a lifetime of a job. Job can have multiple actions run.

I too thought so but wanted to confirm. Thanks.

> For a matter of sharing an rdd between jobs you can have a look at Spark Job Server(spark-jobserver https://github.com/ooyala/spark-jobserver) or some In-Memory storages: Tachyon(http://tachyon-project.org/) or Ignite(https://ignite.incubator.apache.org/)
Re:Re: Regarding rdd.collect()
One Spark application can have many jobs, e.g., first call rdd.count, then call rdd.collect.

At 2015-08-18 15:37:14, Hemant Bhanawat hemant9...@gmail.com wrote:

> > It is still in memory for future rdd transformations and actions.
>
> This is interesting. You mean Spark holds the data in memory between two job executions. How does the second job get the handle of the data in memory? I am interested in knowing more about it. Can you forward me a spark article or JIRA that talks about it?
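As a rough illustration of "one application, many jobs", here is a pure-Python toy in which every action called on a dataset is logged as a separate job of the same application. `ToyApp` and `ToyDataset` are hypothetical names, not Spark classes, and the sketch assumes nothing about how Spark schedules jobs internally.

```python
class ToyApp:
    """Hypothetical stand-in for one running application (one driver program)."""

    def __init__(self):
        self.jobs = []  # one entry per action that was run

    def parallelize(self, data):
        return ToyDataset(self, list(data))


class ToyDataset:
    """Hypothetical dataset handle that lives as long as the application does."""

    def __init__(self, app, data):
        self.app = app
        self.data = data

    def count(self):
        self.app.jobs.append("count")    # each action is logged as its own job
        return len(self.data)

    def collect(self):
        self.app.jobs.append("collect")
        return list(self.data)


app = ToyApp()
rdd = app.parallelize([10, 20, 30])

n = rdd.count()        # first job
rows = rdd.collect()   # second job, same application, same dataset handle
```

Both jobs reach the data through the same `rdd` variable in the driver program, which is how a later job "gets the handle" within a single application.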
Regarding rdd.collect()
When I do an rdd.collect(), does the data move back to the driver, or is it still held in memory across the executors?
Re: Regarding rdd.collect()
No, the data is not stored between two jobs, but it is stored for the lifetime of a job. A job can run multiple actions.

To share an RDD between jobs, you can have a look at Spark Job Server (spark-jobserver, https://github.com/ooyala/spark-jobserver) or an in-memory store such as Tachyon (http://tachyon-project.org/) or Ignite (https://ignite.incubator.apache.org/).

2015-08-18 9:37 GMT+02:00 Hemant Bhanawat hemant9...@gmail.com:

> > It is still in memory for future rdd transformations and actions.
>
> This is interesting. You mean Spark holds the data in memory between two job executions. How does the second job get the handle of the data in memory? I am interested in knowing more about it. Can you forward me a spark article or JIRA that talks about it?
Re: Regarding rdd.collect()
> It is still in memory for future rdd transformations and actions.

This is interesting. You mean Spark holds the data in memory between two job executions? How does the second job get a handle on the data in memory? I am interested in knowing more about it. Can you forward me a Spark article or JIRA that talks about it?

On Tue, Aug 18, 2015 at 12:49 PM, Sabarish Sasidharan sabarish.sasidha...@manthan.com wrote:

> It is still in memory for future rdd transformations and actions. What you get in driver is a copy of the data.
>
> Regards
> Sab
Re: Regarding rdd.collect()
It is still in memory for future RDD transformations and actions. What you get in the driver is a copy of the data.

Regards
Sab

On Tue, Aug 18, 2015 at 12:02 PM, praveen S mylogi...@gmail.com wrote:

> When I do an rdd.collect().. The data moves back to driver Or is still held in memory across the executors?

--
Architect - Big Data
Ph: +91 99805 99458
Manthan Systems | *Company of the year - Analytics (2014 Frost and Sullivan India ICT)*
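The "copy" point can be pictured with a tiny pure-Python sketch. It is only a toy: the `executors` dictionary and the `collect` function below are hypothetical stand-ins, and it assumes the partitions are being kept on the executors at all, which in real Spark generally requires the RDD to be persisted.

```python
# Hypothetical layout: each executor holds one partition of the dataset.
executors = {
    "executor-1": [1, 2, 3],
    "executor-2": [4, 5, 6],
}


def collect(partitions):
    """Toy collect: gather a copy of every partition into one driver-side list."""
    result = []
    for name in sorted(partitions):
        result.extend(partitions[name])  # copies the elements, leaves the partition alone
    return result


driver_copy = collect(executors)

# Changing the driver-side list does not touch what the executors hold.
driver_copy.append(99)
```

After this runs, `driver_copy` has seven elements while both partitions in `executors` still hold their original three each.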