Hi, it is the same thing whether you use the SQL or the DataFrame API: both are optimized by Catalyst and then compiled down to RDD operations. In this case there are many iterations (16,000), so joining the dataframe again and again builds a very large execution plan; I think that is why you got the OOM and the AnalysisException. My suggestion is to checkpoint the dataframe after roughly every 200 joins. That truncates the lineage, so the optimizer only has to analyze about 200 dataframes at a time, which reduces the pressure on the Spark engine.
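A minimal sketch of the periodic-checkpoint idea, not Andrew's actual pipeline: it assumes a hypothetical list of dataframes `dfs` that all share a join key column named `gene_id`, and a batch size of 200 joins between checkpoints; adjust the names and paths to your data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-with-checkpoints").getOrCreate()
# checkpoint() writes to reliable storage, so a checkpoint directory is required
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

def join_all(dfs, key="gene_id", batch_size=200):
    """Join a list of dataframes on `key`, checkpointing every `batch_size` joins."""
    result = dfs[0]
    for i, df in enumerate(dfs[1:], start=1):
        result = result.join(df, on=key, how="inner")
        # Every `batch_size` joins, materialize the result and cut the lineage
        # so the plan Catalyst has to analyze stays small.
        if i % batch_size == 0:
            result = result.checkpoint(eager=True)
    return result
```

Checkpointing is eager by default, so each checkpoint triggers a job; the trade-off is extra I/O to the checkpoint directory in exchange for a bounded plan size.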
Hollis

---- Replied mail ----
From: Gourav Sengupta <gourav.sengu...@gmail.com>
Date: 12/25/2021 03:46
To: Sean Owen <sro...@gmail.com>
Cc: Andrew Davidson <aedav...@ucsc.edu>, Nicholas Gustafson <njgustaf...@gmail.com>, User <user@spark.apache.org>
Subject: Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

Hi,

maybe I am getting confused as always :), but the requirement looked pretty simple to me to implement in SQL, or perhaps it is just the euphoria of Christmas Eve. Anyway, in case the above can be implemented in SQL, I can have a look at it. Yes, indeed there are bespoke scenarios where dataframes apply and RDDs are used, but for UDFs I prefer SQL as well, though that may be a personal idiosyncrasy. The O'Reilly book on data algorithms using Spark/PySpark uses the DataFrame and RDD APIs :)

Regards,
Gourav Sengupta

On Fri, Dec 24, 2021 at 6:11 PM Sean Owen <sro...@gmail.com> wrote:

This is simply not generally true, no, and not in this case. The programmatic and SQL APIs overlap a lot, and where they do, they're essentially aliases. Use whatever is more natural. What I wouldn't recommend is emulating SQL-like behavior in custom code, UDFs, etc.; the native operators will be faster. Sometimes you have to go outside SQL where necessary, as in UDFs or complex aggregation logic; then you can't use SQL.

On Fri, Dec 24, 2021 at 12:05 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi,

yeah, I think that in practice you will always find that dataframes can give issues regarding a lot of things, and then you can argue. At the Spark conference, I think last year, it was shown that more than 92% or 95% of users use the Spark SQL API, if I am not mistaken. I think that you can do the entire processing in one single go. Can you please write down the end-to-end SQL and share it, without the 16,000 iterations?

Regards,
Gourav Sengupta

On Fri, Dec 24, 2021 at 5:16 PM Andrew Davidson <aedav...@ucsc.edu> wrote:

Hi Sean and Gourav,

Thanks for the suggestions. I thought that both the SQL and DataFrame APIs are wrappers around the same framework, i.e. Catalyst? I tend to mix and match my code: sometimes I find it easier to write using SQL, sometimes dataframes. What is considered best practice? Here is an example