Please include the output of running explain() when reporting performance
issues with DataFrames.
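
For example, a minimal pyspark sketch (the paths, dataframe names, and
join key below are placeholders, shown with the Spark 1.x SQLContext
entry point):

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="join-debug")
sqlContext = SQLContext(sc)

# Hypothetical inputs standing in for the two dataframes being joined.
df_large = sqlContext.read.parquet("/path/to/large_table")  # ~2M rows
df_small = sqlContext.read.parquet("/path/to/small_table")  # a few thousand rows

joined = df_large.join(df_small, df_large["id"] == df_small["id"])

# explain(True) prints the parsed, analyzed, optimized, and physical
# plans without executing the job, so it works even when the join hangs.
joined.explain(True)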

On Fri, Feb 19, 2016 at 9:31 AM, Tamara Mendt <t...@hellofresh.com> wrote:

> Hi all,
>
> I am running a Spark job that gets stuck attempting to join two
> dataframes. The dataframes are not very large: one is about 2 million
> rows, the other a couple of thousand rows, and the resulting joined
> dataframe should be about the same size as the smaller dataframe. I have
> tried triggering execution of the join using the 'first' operator, which
> as far as I understand should not require processing the entire resulting
> dataframe (maybe I am mistaken though). The Spark UI is not telling me
> anything; it just shows the task as stuck.
>
> When I run the exact same job on a slightly smaller dataset it works
> without hanging.
>
> I have used the same environment to run joins on much larger dataframes,
> so I am confused as to why in this particular case my Spark job is just
> hanging. I have also tried running the same join operation in pyspark on
> two 2-million-row dataframes (exactly like the one I am trying to join in
> the job that gets stuck) and it runs successfully.
>
> I have tried caching the joined dataframe to see how much memory it
> requires, but the job gets stuck on this action too. I have also tried
> persisting the join to memory and disk, and the job seems to be stuck all
> the same.
>
> Any help as to where to look for the source of the problem would be much
> appreciated.
>
> Cheers,
>
> Tamara
>
>
