Re: Merging multiple Pandas dataframes

2017-06-22 Thread Saatvik Shah
Hi Assaf,

Thanks for your suggestion.

I also found one other improvement: iteratively convert the Pandas DFs to
RDDs and take a union of those (similar to what I was doing with the
DataFrames). Calling createDataFrame for every file is heavy, and
checkpointing of DataFrames is a brand new feature. Instead, I build one big
union of RDDs and apply createDataFrame only once at the end.
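
For reference, here is a rough sketch of what I mean (assuming a fixed
two-column schema; file_paths and load_pandas_df are just placeholders for
however the files actually arrive):

    from pyspark.sql import SparkSession
    from pyspark.sql.types import StructType, StructField, DoubleType

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    # Assumed: every incoming file shares this schema.
    schema = StructType([StructField("x", DoubleType()),
                         StructField("y", DoubleType())])

    rdds = []
    for path in file_paths:              # placeholder list of input files
        pdf = load_pandas_df(path)       # placeholder Pandas loader
        # Convert the Pandas rows to an RDD of plain tuples instead of
        # calling createDataFrame once per file.
        rdds.append(sc.parallelize(list(pdf.itertuples(index=False, name=None))))

    # Union all the per-file RDDs, then build the Spark dataframe a single time.
    big_df = spark.createDataFrame(sc.union(rdds), schema=schema)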

Thanks and Regards,
Saatvik

On Wed, Jun 21, 2017 at 2:03 AM, Mendelson, Assaf <assaf.mendel...@rsa.com>
wrote:

> If you do an action, most intermediate calculations would be gone for the
> next iteration.
>
> What I would do is persist on every iteration, then after some number of
> iterations (say 5) write to disk and reload. At that point you should call
> unpersist on the earlier dataframes to free the memory, as they are no
> longer relevant.
>
>
>
> Thanks,
>
>   Assaf.
>
>
>
> From: Saatvik Shah [mailto:saatvikshah1...@gmail.com]
> Sent: Tuesday, June 20, 2017 8:50 PM
> To: Mendelson, Assaf
> Cc: user@spark.apache.org
> Subject: Re: Merging multiple Pandas dataframes
>
>
>
> Hi Assaf,
>
> Thanks for the suggestion on checkpointing - I'll need to read up more on
> that.
>
> My current implementation seems to be crashing with a GC memory limit
> exceeded error if I'm keeping multiple persist calls for a large number of
> files.
>
>
>
> Thus, I was also thinking about the constant calls to persist. Since all
> my operations are Spark transformations (unions of a large number of Spark
> DataFrames created from Pandas dataframes), this entire process of building
> a large Spark dataframe is essentially one huge transformation. Is it
> necessary to call persist between unions? Shouldn't I instead wait for all
> the unions to complete and call persist once at the end?
>
>
>
>
>
> On Tue, Jun 20, 2017 at 2:52 AM, Mendelson, Assaf <assaf.mendel...@rsa.com>
> wrote:
>
> Note that depending on the number of iterations, the query plan for the
> dataframe can become long and this can cause slowdowns (or even crashes).
> A possible solution would be to checkpoint (or simply save and reload the
> dataframe) every once in a while. When reloading from disk, the newly
> loaded dataframe's lineage is just the disk...
>
> Thanks,
>   Assaf.
>
>
> -Original Message-
> From: saatvikshah1994 [mailto:saatvikshah1...@gmail.com]
> Sent: Tuesday, June 20, 2017 2:22 AM
> To: user@spark.apache.org
> Subject: Merging multiple Pandas dataframes
>
> Hi,
>
> I am iteratively receiving files which can only be opened as Pandas
> dataframes. For the first such file I receive, I convert it to a Spark
> dataframe using the createDataFrame utility function. From the next file
> onward, I convert each one and union it into the first Spark dataframe
> (the schema always stays the same). After each union, I persist the result
> in memory (MEMORY_AND_DISK_ONLY level). After I have converted all such
> files into a single Spark dataframe, I coalesce it, following some tips
> from this Stack Overflow post
> (https://stackoverflow.com/questions/39381183/managing-spark-partitions-after-dataframe-unions).
>
> Any suggestions for optimizing this process further?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Merging-multiple-Pandas-dataframes-tp28770.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>
>
> --
>
> Saatvik Shah,
> 1st Year,
> Masters in the School of Computer Science,
> Carnegie Mellon University
> https://saatvikshah1994.github.io/
>



-- 
Saatvik Shah,
1st Year,
Masters in the School of Computer Science,
Carnegie Mellon University
https://saatvikshah1994.github.io/


RE: Merging multiple Pandas dataframes

2017-06-21 Thread Mendelson, Assaf
If you do an action, most intermediate calculations would be gone for the next 
iteration.
What I would do is persist on every iteration, then after some number of
iterations (say 5) write to disk and reload. At that point you should call
unpersist on the earlier dataframes to free the memory, as they are no longer
relevant.
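
Roughly something like the following (an untested sketch; pandas_dfs below is
just a placeholder for however you produce each new chunk):

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.getOrCreate()

    CHECKPOINT_EVERY = 5
    checkpoint_dir = "/tmp/merged_checkpoints"   # assumed scratch location

    df = None
    persisted = []                               # dataframes we still hold in memory
    for i, pdf in enumerate(pandas_dfs):         # placeholder iterable of Pandas dataframes
        batch = spark.createDataFrame(pdf)
        df = batch if df is None else df.union(batch)
        df.persist(StorageLevel.MEMORY_AND_DISK)
        df.count()                               # action, so the persist actually materializes
        persisted.append(df)

        if (i + 1) % CHECKPOINT_EVERY == 0:
            # Write to disk and reload: the reloaded dataframe's lineage is just the files.
            path = "%s/step_%d" % (checkpoint_dir, i)
            df.write.mode("overwrite").parquet(path)
            df = spark.read.parquet(path)
            # The in-memory copies are no longer relevant, so free them.
            for old in persisted:
                old.unpersist()
            persisted = []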

Thanks,
  Assaf.

From: Saatvik Shah [mailto:saatvikshah1...@gmail.com]
Sent: Tuesday, June 20, 2017 8:50 PM
To: Mendelson, Assaf
Cc: user@spark.apache.org
Subject: Re: Merging multiple Pandas dataframes

Hi Assaf,
Thanks for the suggestion on checkpointing - I'll need to read up more on that.
My current implementation seems to be crashing with a GC memory limit exceeded
error if I'm keeping multiple persist calls for a large number of files.

Thus, I was also thinking about the constant calls to persist. Since all my
operations are Spark transformations (unions of a large number of Spark
DataFrames created from Pandas dataframes), this entire process of building a
large Spark dataframe is essentially one huge transformation. Is it necessary
to call persist between unions? Shouldn't I instead wait for all the unions to
complete and call persist once at the end?


On Tue, Jun 20, 2017 at 2:52 AM, Mendelson, Assaf 
<assaf.mendel...@rsa.com> wrote:
Note that depending on the number of iterations, the query plan for the 
dataframe can become long and this can cause slowdowns (or even crashes).
A possible solution would be to checkpoint (or simply save and reload the 
dataframe) every once in a while. When reloading from disk, the newly loaded 
dataframe's lineage is just the disk...

Thanks,
  Assaf.

-Original Message-
From: saatvikshah1994 [mailto:saatvikshah1...@gmail.com]
Sent: Tuesday, June 20, 2017 2:22 AM
To: user@spark.apache.org
Subject: Merging multiple Pandas dataframes

Hi,

I am iteratively receiving files which can only be opened as Pandas dataframes.
For the first such file I receive, I convert it to a Spark dataframe using the
createDataFrame utility function. From the next file onward, I convert each one
and union it into the first Spark dataframe (the schema always stays the same).
After each union, I persist the result in memory (MEMORY_AND_DISK_ONLY level).
After I have converted all such files into a single Spark dataframe, I coalesce
it, following some tips from this Stack Overflow post
(https://stackoverflow.com/questions/39381183/managing-spark-partitions-after-dataframe-unions).

Any suggestions for optimizing this process further?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Merging-multiple-Pandas-dataframes-tp28770.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



--
Saatvik Shah,
1st  Year,
Masters in the School of Computer Science,
Carnegie Mellon University
https://saatvikshah1994.github.io/


Re: Merging multiple Pandas dataframes

2017-06-20 Thread Saatvik Shah
Hi Assaf,

Thanks for the suggestion on checkpointing - I'll need to read up more on
that.

My current implementation seems to be crashing with a GC memory limit
exceeded error if I'm keeping multiple persist calls for a large number of
files.

Thus, I was also thinking about the constant calls to persist. Since all my
operations are Spark transformations (unions of a large number of Spark
DataFrames created from Pandas dataframes), this entire process of building a
large Spark dataframe is essentially one huge transformation. Is it necessary
to call persist between unions? Shouldn't I instead wait for all the unions to
complete and call persist once at the end?
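
That is, roughly (just a sketch; spark_dfs stands in for the per-file Spark
dataframes, and 64 is an assumed target partition count):

    from functools import reduce

    # placeholder: spark_dfs is the list of per-file Spark dataframes, all
    # sharing the same schema
    merged = reduce(lambda a, b: a.union(b), spark_dfs)
    merged = merged.coalesce(64)
    merged.persist()
    merged.count()   # a single action once everything has been unioned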




On Tue, Jun 20, 2017 at 2:52 AM, Mendelson, Assaf 
wrote:

> Note that depending on the number of iterations, the query plan for the
> dataframe can become long and this can cause slowdowns (or even crashes).
> A possible solution would be to checkpoint (or simply save and reload the
> dataframe) every once in a while. When reloading from disk, the newly
> loaded dataframe's lineage is just the disk...
>
> Thanks,
>   Assaf.
>
> -Original Message-
> From: saatvikshah1994 [mailto:saatvikshah1...@gmail.com]
> Sent: Tuesday, June 20, 2017 2:22 AM
> To: user@spark.apache.org
> Subject: Merging multiple Pandas dataframes
>
> Hi,
>
> I am iteratively receiving files which can only be opened as Pandas
> dataframes. For the first such file I receive, I convert it to a Spark
> dataframe using the createDataFrame utility function. From the next file
> onward, I convert each one and union it into the first Spark dataframe
> (the schema always stays the same). After each union, I persist the result
> in memory (MEMORY_AND_DISK_ONLY level). After I have converted all such
> files into a single Spark dataframe, I coalesce it, following some tips
> from this Stack Overflow post
> (https://stackoverflow.com/questions/39381183/managing-spark-partitions-after-dataframe-unions).
>
> Any suggestions for optimizing this process further?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Merging-multiple-Pandas-dataframes-tp28770.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


-- 
Saatvik Shah,
1st Year,
Masters in the School of Computer Science,
Carnegie Mellon University
https://saatvikshah1994.github.io/


RE: Merging multiple Pandas dataframes

2017-06-20 Thread Mendelson, Assaf
Note that depending on the number of iterations, the query plan for the 
dataframe can become long and this can cause slowdowns (or even crashes).
A possible solution would be to checkpoint (or simply save and reload the 
dataframe) every once in a while. When reloading from disk, the newly loaded 
dataframe's lineage is just the disk...
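
For example (an untested sketch, assuming a scratch directory you can write to;
the df here is just a stand-in for the dataframe you have built so far):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # assumed scratch dir

    df = spark.range(10)   # placeholder for the dataframe built up so far

    # Option 1: DataFrame checkpointing (Spark 2.1+) - cuts the lineage here.
    df = df.checkpoint()

    # Option 2: simply save and reload every once in a while; the reloaded
    # dataframe's lineage is just the files on disk. Use a fresh path each
    # time so you never overwrite a path you are still reading from.
    df.write.mode("overwrite").parquet("/tmp/merged_so_far")       # assumed path
    df = spark.read.parquet("/tmp/merged_so_far")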

Thanks,
  Assaf.

-Original Message-
From: saatvikshah1994 [mailto:saatvikshah1...@gmail.com] 
Sent: Tuesday, June 20, 2017 2:22 AM
To: user@spark.apache.org
Subject: Merging multiple Pandas dataframes

Hi, 

I am iteratively receiving files which can only be opened as Pandas dataframes.
For the first such file I receive, I convert it to a Spark dataframe using the
createDataFrame utility function. From the next file onward, I convert each one
and union it into the first Spark dataframe (the schema always stays the same).
After each union, I persist the result in memory (MEMORY_AND_DISK_ONLY level).
After I have converted all such files into a single Spark dataframe, I coalesce
it, following some tips from this Stack Overflow post
(https://stackoverflow.com/questions/39381183/managing-spark-partitions-after-dataframe-unions).
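
Roughly, my loop looks like this (a simplified sketch; file_paths and
load_pandas_df are placeholders for how the files actually arrive, and 64 is
an assumed coalesce target):

    from pyspark.sql import SparkSession
    from pyspark import StorageLevel

    spark = SparkSession.builder.getOrCreate()

    merged = None
    for path in file_paths:                  # placeholder list of input files
        pdf = load_pandas_df(path)           # placeholder Pandas loader
        batch = spark.createDataFrame(pdf)   # same schema every time
        merged = batch if merged is None else merged.union(batch)
        merged.persist(StorageLevel.MEMORY_AND_DISK)

    merged = merged.coalesce(64)             # partition count chosen per the SO post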
   

Any suggestions for optimizing this process further?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Merging-multiple-Pandas-dataframes-tp28770.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org

