Please include the output of running explain() when reporting performance issues with DataFrames.

On Fri, Feb 19, 2016 at 9:31 AM, Tamara Mendt <t...@hellofresh.com> wrote:

> Hi all,
>
> I am running a Spark job that gets stuck attempting to join two
> dataframes. The dataframes are not very large, one is about 2 M rows, and
> the other a couple of thousand rows and the resulting joined dataframe
> should be about the same size as the smaller dataframe. I have tried
> triggering execution of the join using the 'first' operator, which as far
> as I understand would not require processing the entire resulting dataframe
> (maybe I am mistaken though). The Spark UI is not telling me anything, just
> showing the task to be stuck.
>
> When I run the exact same job on a slightly smaller dataset it works
> without hanging.
>
> I have used the same environment to run joins on much larger dataframes,
> so I am confused as to why in this particular case my Spark job is just
> hanging. I have also tried running the same join operation using pyspark on
> two 2 Million row dataframes (exactly like the one I am trying to join in
> the job that gets stuck) and it runs successfully.
>
> I have tried caching the joined dataframe to see how much memory it is
> requiring but the job gets stuck on this action too. I have also tried
> using persist to memory and disk on the join, and the job seems to be stuck
> all the same.
>
> Any help as to where to look for the source of the problem would be much
> appreciated.
>
> Cheers,
>
> Tamara