Re: Merging multiple Pandas dataframes
Hi Assaf,

Thanks for your suggestion. I also found one other improvement, which is to iteratively convert the Pandas DFs to RDDs and take a union of those (similar to the DataFrame approach). Calling createDataFrame is heavy, and checkpointing of DataFrames is a brand-new feature. Instead, build one huge union of RDDs and apply createDataFrame once at the end.

Thanks and Regards,
Saatvik

--
Saatvik Shah
1st Year, Masters in the School of Computer Science
Carnegie Mellon University
https://saatvikshah1994.github.io/
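P.S. A rough sketch of what I mean, assuming a hypothetical load_pandas_dfs() generator that yields each incoming file as a Pandas DataFrame, and a fixed `schema` shared by all files (both names are placeholders, not actual code):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    rdds = []
    for pdf in load_pandas_dfs():
        # Converting to an RDD of plain tuples avoids the per-file
        # createDataFrame cost (type conversion, schema handling).
        rdds.append(sc.parallelize(list(pdf.itertuples(index=False, name=None))))

    # SparkContext.union takes the whole list at once, keeping the
    # lineage flat instead of nesting pairwise unions.
    merged = sc.union(rdds)

    # Pay the createDataFrame cost exactly once, at the end.
    df = spark.createDataFrame(merged, schema=schema)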
RE: Merging multiple Pandas dataframes
If you do an action, most intermediate calculations would be gone by the next iteration.

What I would do is persist every iteration, then after some number of iterations (say 5) write to disk and reload. At that point you should call unpersist to free the memory, as it is no longer relevant.

Thanks,
Assaf.
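P.S. Roughly, the cycle I have in mind looks like this (a sketch only; pandas_dfs and the scratch path are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = None
    for i, pdf in enumerate(pandas_dfs):
        batch = spark.createDataFrame(pdf)
        merged = batch if df is None else df.union(batch)
        merged.persist()
        if df is not None:
            df.unpersist()  # the previous intermediate is no longer needed
        df = merged

        # Every 5 iterations, cut the lineage: write out, reload, unpersist.
        if (i + 1) % 5 == 0:
            path = "/tmp/merge_stage_{}".format(i)
            df.write.parquet(path)
            df.unpersist()
            df = spark.read.parquet(path)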
Re: Merging multiple Pandas dataframes
Hi Assaf,

Thanks for the suggestion on checkpointing - I'll need to read up more on that. My current implementation seems to be crashing with a GC overhead limit exceeded error if I'm keeping multiple persist calls for a large number of files.

This also got me thinking about the constant calls to persist. Since all my operations are Spark transformations (a union of a large number of Spark DataFrames created from Pandas dataframes), this entire process of building a large Spark dataframe is essentially one huge transformation. Is it necessary to call persist between unions? Shouldn't I instead wait for all the unions to complete and call persist once at the end?

--
Saatvik Shah
1st Year, Masters in the School of Computer Science
Carnegie Mellon University
https://saatvikshah1994.github.io/
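P.S. In other words, would the deferred-persist shape below be enough? (A sketch; pandas_dfs is a placeholder for my file loop.)

    from functools import reduce
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Build the whole union lazily, persist once at the end.
    spark_dfs = [spark.createDataFrame(pdf) for pdf in pandas_dfs]
    df = reduce(lambda a, b: a.union(b), spark_dfs)
    df.persist()  # nothing is cached until the first action
    df.count()    # force one materialization up front

Though I realize this builds exactly the kind of long lineage you warned about.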
RE: Merging multiple Pandas dataframes
Note that depending on the number of iterations, the query plan for the dataframe can become long, and this can cause slowdowns (or even crashes). A possible solution would be to checkpoint (or simply save and reload the dataframe) every once in a while. When reloading from disk, the newly loaded dataframe's lineage is just the disk...

Thanks,
Assaf.
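P.S. Concretely, either of these truncates the lineage (a sketch, assuming an existing dataframe df; the paths are placeholders):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Option 1: built-in checkpointing.
    spark.sparkContext.setCheckpointDir("/tmp/spark_checkpoints")
    df = df.checkpoint()  # eager by default: materializes df and returns a
                          # dataframe whose plan starts from the checkpoint files

    # Option 2: save and reload by hand.
    df.write.parquet("/tmp/merge_stage")
    df = spark.read.parquet("/tmp/merge_stage")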
Merging multiple Pandas dataframes
Hi,

I am iteratively receiving a file which can only be opened as a Pandas dataframe. For the first such file I receive, I convert it to a Spark dataframe using the createDataFrame utility function. From the next file onward, I convert each file and union it into the first Spark dataframe (the schema always stays the same). After each union, I persist the result at the MEMORY_AND_DISK storage level. After I have converted all such files into a single Spark dataframe, I coalesce it, following some tips from this Stack Overflow post (https://stackoverflow.com/questions/39381183/managing-spark-partitions-after-dataframe-unions).

Any suggestions for optimizing this process further?
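For reference, the current loop looks roughly like this (incoming_files and read_as_pandas are placeholders for my actual I/O):

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    df = None
    for path in incoming_files():      # placeholder for the file source
        pdf = read_as_pandas(path)     # placeholder for the pandas-only reader
        batch = spark.createDataFrame(pdf)
        # Schemas are identical, so a positional union is safe.
        df = batch if df is None else df.union(batch)
        df.persist(StorageLevel.MEMORY_AND_DISK)

    # Reduce the partition count once everything is merged.
    df = df.coalesce(64)  # target count is a placeholder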