Hi, it is the same thing whether you use the SQL or the DataFrame API: both are optimized by Catalyst and then compiled down to RDD operations. In this case there are many iterations (16,000), so joining the dataframe again and again builds a very large execution plan; I think that is why you got the OOM and the AnalysisException. My suggestion is to checkpoint the dataframe after roughly every 200 joins. That truncates the lineage, so the optimizer only has to analyze about 200 dataframes at a time, which reduces the pressure on the Spark engine.
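A minimal sketch of the periodic-checkpoint idea, not Andrew's actual pipeline: it assumes a hypothetical list of dataframes `dfs` that all share a join key column named `gene_id`, and a batch size of 200 joins between checkpoints; adjust the names and paths to your data.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("join-with-checkpoints").getOrCreate()
# checkpoint() writes to reliable storage, so a checkpoint directory is required
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")

def join_all(dfs, key="gene_id", batch_size=200):
    """Join a list of dataframes on `key`, checkpointing every `batch_size` joins."""
    result = dfs[0]
    for i, df in enumerate(dfs[1:], start=1):
        result = result.join(df, on=key, how="inner")
        # Every `batch_size` joins, materialize the result and cut the lineage
        # so the plan Catalyst has to analyze stays small.
        if i % batch_size == 0:
            result = result.checkpoint(eager=True)
    return result
```

Checkpointing is eager by default, so each checkpoint triggers a job; the trade-off is extra I/O to the checkpoint directory in exchange for a bounded plan size.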
Hollis

---- Replied mail ----
From: Gourav Sengupta <gourav.sengu...@gmail.com>
Date: 12/25/2021 03:46
To: Sean Owen <sro...@gmail.com>
Cc: Andrew Davidson <aedav...@ucsc.edu>, Nicholas Gustafson <njgustaf...@gmail.com>, User <user@spark.apache.org>
Subject: Re: OOM Joining thousands of dataframes Was: AnalysisException: Trouble using select() to append multiple columns

Hi,

maybe I am getting confused as always :), but the requirement looked pretty simple to me to implement in SQL, or perhaps it is just the euphoria of Christmas Eve. Anyway, in case the above can be implemented in SQL, I can have a look at it. Yes, indeed there are bespoke scenarios where dataframes apply and RDDs are used, but for UDFs I prefer SQL as well, though that may be a personal idiosyncrasy. The O'Reilly book on data algorithms using Spark/PySpark uses the DataFrame and RDD APIs :)

Regards,
Gourav Sengupta

On Fri, Dec 24, 2021 at 6:11 PM Sean Owen <sro...@gmail.com> wrote:

This is simply not generally true, no, and not in this case. The programmatic and SQL APIs overlap a lot, and where they do, they're essentially aliases. Use whatever is more natural. What I wouldn't recommend is emulating SQL-like behavior in custom code, UDFs, etc.; the native operators will be faster. Sometimes you have to go outside SQL where necessary, as in UDFs or complex aggregation logic; then you can't use SQL.

On Fri, Dec 24, 2021 at 12:05 PM Gourav Sengupta <gourav.sengu...@gmail.com> wrote:

Hi,

yeah, I think that in practice you will always find that dataframes can give issues regarding a lot of things, and then you can argue. At the Spark conference, I think last year, it was shown that more than 92% or 95% of users use the Spark SQL API, if I am not mistaken. I think that you can do the entire processing in one single go. Can you please write down the end-to-end SQL and share it, without the 16,000 iterations?

Regards,
Gourav Sengupta

On Fri, Dec 24, 2021 at 5:16 PM Andrew Davidson <aedav...@ucsc.edu> wrote:

Hi Sean and Gourav,

Thanks for the suggestions. I thought that both the SQL and DataFrame APIs are wrappers around the same framework, i.e. Catalyst? I tend to mix and match my code: sometimes I find it easier to write using SQL, sometimes dataframes. What is considered best practice? Here is an example