Re: Spark RDD and Memory

2016-09-22 Thread Aditya

Hi Datta,

Thanks for the reply.

If I haven't cached any RDD and the data being loaded into memory
after performing some operations exceeds the available memory, how is it
handled by Spark?
Are previously loaded RDDs removed from memory to free it for
subsequent steps in the DAG?


I am running into an issue where my DAG is very long, all the data
does not fit into memory, and at some point all my executors get lost.


Re: Spark RDD and Memory

2016-09-22 Thread Datta Khot
Hi Aditya,

If you cache the RDDs - like textFile.cache() and textFile1.cache() - then
Spark will not load the data again from the file system.

Once you are done with the related operations, it is recommended to unpersist the
RDDs to manage memory efficiently and avoid exhausting it.

Note that cache() keeps data in main memory (it is shorthand for
persist(StorageLevel.MEMORY_ONLY)), whereas persist() lets you choose other
storage levels, including disk-backed ones.
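
A minimal sketch of that pattern, applied to the example from your earlier mail
(the split-on-comma keying is an illustrative assumption, since join() is only
defined on key-value RDDs):

import org.apache.spark.storage.StorageLevel

// Keying by the first comma-separated field is an assumption for illustration;
// join() is only available on RDDs of key-value pairs.
val emp  = sc.textFile("/user/emp.txt").map(line => (line.split(",")(0), line))
val emp1 = sc.textFile("/user/emp1.xt").map(line => (line.split(",")(0), line))

// Cache the join result so the second action reuses it instead of
// re-reading both inputs and recomputing the join.
val joined = emp.join(emp1).persist(StorageLevel.MEMORY_ONLY)   // equivalent to .cache()

joined.saveAsTextFile("/home/output")   // first action: computes and caches the join
val count = joined.count()              // second action: served from the cached blocks

// Release the cached blocks once no further actions need them.
joined.unpersist()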

Datta
https://in.linkedin.com/in/datta-khot-240b544
http://www.datasherpa.io/


Re: Spark RDD and Memory

2016-09-22 Thread Aditya

Thanks for the reply.

One more question.
How does Spark handle data if it does not fit in memory? The answer I
got is that it spills the data to disk to handle the memory issue.

Also, in the example below:
val textFile = sc.textFile("/user/emp.txt")
val textFile1 = sc.textFile("/user/emp1.xt")
val join = textFile.join(textFile1)
join.saveAsTextFile("/home/output")
val count = join.count()

When the first action is performed, it loads textFile and textFile1 into
memory, performs the join, and saves the result.
But when the second action (count) is called, does it load textFile
and textFile1 into memory again and perform the join again?
If it does, what is the correct way to prevent it from loading the same
data again?



Re: Spark RDD and Memory

2016-09-22 Thread Mich Talebzadeh
Hi,

unpersist works on storage memory, not execution memory. So I do not think
you can flush an RDD out of memory if you have not cached it using cache() or
something like the below in the first place.

s.persist(org.apache.spark.storage.StorageLevel.MEMORY_ONLY)

s.unpersist()

I believe recent versions of Spark use a Least Recently Used (LRU)
mechanism to evict unused cached data from memory, much like RDBMS cache
management. I know LLAP does that.
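
For reference, a minimal sketch of that persist/unpersist cycle (variable and
path names are illustrative; MEMORY_AND_DISK is shown because it spills cached
partitions to local disk rather than dropping them when memory runs short):

import org.apache.spark.storage.StorageLevel

val s = sc.textFile("/user/emp.txt")

// Pick the storage level explicitly; MEMORY_AND_DISK spills cached
// partitions to local disk instead of evicting them outright.
s.persist(StorageLevel.MEMORY_AND_DISK)

s.count()        // first action materialises and caches the RDD
s.count()        // later actions read the cached blocks

// Drop the cached blocks from memory (and disk) when done.
s.unpersist()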

HTH



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.





Re: Spark RDD and Memory

2016-09-22 Thread Hanumath Rao Maduri
Hello Aditya,

After an intermediate action has been applied, you might want to call
rdd.unpersist() to let Spark know that the RDD is no longer required.
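
A minimal sketch of that idea, with illustrative names (keyBy on the first
comma-separated field is an assumption so that join() applies):

val emp  = sc.textFile("/user/emp.txt").keyBy(_.split(",")(0)).cache()
val emp1 = sc.textFile("/user/emp1.xt").keyBy(_.split(",")(0)).cache()

val joined = emp.join(emp1)
joined.saveAsTextFile("/home/output")   // action that materialises the join

// The cached inputs are not needed by later stages, so free their blocks.
emp.unpersist()
emp1.unpersist()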

Thanks,
-Hanu



Spark RDD and Memory

2016-09-22 Thread Aditya

Hi,

Suppose I have two RDDs
val textFile = sc.textFile("/user/emp.txt")
val textFile1 = sc.textFile("/user/emp1.xt")

Later I perform a join operation on the above two RDDs
val join = textFile.join(textFile1)

There are subsequent transformations that do not involve textFile and
textFile1 further, and then an action to start the execution.


When the action is called, textFile and textFile1 will be loaded into memory
first. Later the join will be performed and kept in memory.
My question is: once the join is in memory and used for subsequent
execution, what happens to the textFile and textFile1 RDDs? Are they still
kept in memory until the full lineage graph is completed, or are they
destroyed once their use is over? If they are kept in memory, is there any
way I can explicitly remove them from memory to free it?






-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org