It's not obvious from what you pasted, but perhaps the Jupyter notebook
is already connected to a running Spark context, while spark-submit needs
to get a new spot in the (YARN?) queue.

I would check the cluster job IDs for both to ensure you're getting new
cluster tasks for each.
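
Something along these lines, run once in the notebook and once in the
spark-submit script, would make the comparison concrete. It is only a rough
sketch: `describe_context` is a helper name I made up, and I am assuming `sc`
is the live SparkContext (e.g. `spark.sparkContext`) in both places.

```python
def describe_context(sc):
    """Collect identifying details from a SparkContext (hypothetical helper)."""
    # SparkConf.getAll() returns a list of (key, value) pairs.
    conf = dict(sc.getConf().getAll())
    return {
        # On YARN this is the application ID shown in the ResourceManager UI.
        "application_id": sc.applicationId,
        "master": sc.master,
        # Epoch milliseconds at which this context was started.
        "start_time_ms": sc.startTime,
        # These two come back as None if they were never set explicitly.
        "executor_instances": conf.get("spark.executor.instances"),
        "dynamic_allocation": conf.get("spark.dynamicAllocation.enabled"),
    }

# Usage, in both the notebook and the script:
#   print(describe_context(spark.sparkContext))
```

If the notebook reports an application ID and start time that predate the
cell you ran, it is reusing a context that already holds its executors, and
the two timings are not measuring the same thing.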

On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati wrote:

> Hi,
> I am facing a weird behaviour while running a python script. Here is what
> the code looks like mostly:
> def fn1(ip):
>    some code...
>     ...
> def fn2(row):
>     ...
>     some operations
>     ...
>     return row1
> udf_fn1 = udf(fn1)
> cdf = spark.read.table("xxxx") //hive table is of size > 500 Gigs with
> ~4500 partitions
> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>     .drop("colz") \
>     .withColumnRenamed("colz", "coly")
> edf = ddf \
>     .filter(ddf.colp == 'some_value') \
>     .rdd.map(lambda row: fn2(row)) \
>     .toDF()
> print edf.count() // simple way for the performance test in both platforms
> Now when I run the same code in a brand new Jupyter notebook it runs 6x
> faster than when I run this Python script using spark-submit. The
> configurations were printed and compared on both platforms and they
> are exactly the same. I even tried running this script in a single cell of a
> Jupyter notebook and still got the same performance. I need to understand if
> I am missing something in spark-submit which is causing the issue. I tried
> to minimise the script to reproduce the same problem without much code.
> Both are run in client mode on a YARN-based Spark cluster. The machines
> from which both are executed are also the same, and so is the user.
> What I found is that the quantile value for the median for the one run with
> Jupyter was 1.3 mins and for the one run with spark-submit was ~8.5 mins. I
> am not able to figure out why this is happening.
> Has anyone faced this kind of issue before, or know how to resolve it?
> *Regards,*
> *Dhrub*


*Patrick McCarthy*

Senior Data Scientist, Machine Learning Engineering


470 Park Ave South, 17th Floor, NYC 10016
