As mentioned in the very first mail:

* Both are submitted to the same cluster.
* Both are submitted from the same machine and by the same user.
* Each of them has 128 executors with 2 cores and 8 GB of memory per executor, and both of them actually get those resources while running.

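One way to double-check the points above is to dump the effective settings from each run and diff them mechanically rather than by eye. Below is a minimal sketch, assuming the conf is available as a list of (key, value) pairs, which is what `spark.sparkContext.getConf().getAll()` returns in PySpark. The helper names `snapshot_environment` and `diff_snapshots` are made up for illustration, and the driver-side Python details are included only as a guess at what might differ between a notebook kernel and spark-submit.

```python
import os
import sys


def snapshot_environment(conf_pairs, env_keys=("PYSPARK_PYTHON", "PYSPARK_DRIVER_PYTHON")):
    """Collect Spark conf plus driver-side Python details into one flat dict.

    conf_pairs is assumed to be a list of (key, value) tuples.
    """
    snapshot = {"conf:" + key: value for key, value in conf_pairs}
    snapshot["python:version"] = sys.version
    snapshot["python:executable"] = sys.executable
    for key in env_keys:
        snapshot["env:" + key] = os.environ.get(key, "<unset>")
    return snapshot


def diff_snapshots(left, right):
    """Return {key: (left_value, right_value)} for every key whose values differ."""
    missing = "<missing>"
    return {
        key: (left.get(key, missing), right.get(key, missing))
        for key in sorted(set(left) | set(right))
        if left.get(key, missing) != right.get(key, missing)
    }


# In each run (notebook and spark-submit script), with an active SparkSession `spark`:
#   snap = snapshot_environment(spark.sparkContext.getConf().getAll())
# Save `snap` from both runs (for example as JSON) and compare them with diff_snapshots.
```

An empty result from `diff_snapshots` would only show that the driver-side settings match; it says nothing about which Python interpreter the executors use.
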
To clarify further, let me quote what I mentioned above. *These data are taken from the Spark UI when the jobs are almost finished in both cases.*

"What I found is that the quantile value for the median for the one run with jupyter was 1.3 mins and for the one run with spark-submit was ~8.5 mins."

This means the time taken per task is much higher in the spark-submit script than in the jupyter script. This is where I am really puzzled, because they are the exact same code. Why does running them in two different ways change the execution time so much?

*Regards,
Dhrubajyoti Hati.
Mob No: 9886428028/9652029028*

On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch <java...@gmail.com> wrote:

> Sounds like you have done your homework to properly compare. I'm
> guessing the answer to the following is yes, but in any case: are they
> both running against the same spark cluster with the same configuration
> parameters, especially executor memory and number of workers?
>
> On Tue, Sep 10, 2019 at 8:05 PM Dhrubajyoti Hati <
> dhruba.w...@gmail.com> wrote:
>
>> No, I checked for that, hence I wrote "brand new" jupyter notebook. Also,
>> the times taken by the two are 30 mins and ~3 hrs, as I am reading 500 gigs
>> of compressed, base64-encoded text data from a hive table and decompressing
>> and decoding it in one of the udfs. Also, the time compared is from the
>> Spark UI, not how long the job actually takes after submission. It's just
>> the running time I am comparing/mentioning.
>>
>> As mentioned earlier, all the spark conf params even match in the two
>> scripts, and that's why I am puzzled about what is going on.
>>
>> On Wed, 11 Sep, 2019, 12:44 AM Patrick McCarthy, <pmccar...@dstillery.com>
>> wrote:
>>
>>> It's not obvious from what you pasted, but perhaps the jupyter notebook
>>> already is connected to a running spark context, while spark-submit needs
>>> to get a new spot in the (YARN?) queue.
>>>
>>> I would check the cluster job IDs for both to ensure you're getting new
>>> cluster tasks for each.
>>>
>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <dhruba.w...@gmail.com>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am facing a weird behaviour while running a python script. Here is
>>>> what the code mostly looks like:
>>>>
>>>> def fn1(ip):
>>>>     some code...
>>>>     ...
>>>>
>>>> def fn2(row):
>>>>     ...
>>>>     some operations
>>>>     ...
>>>>     return row1
>>>>
>>>>
>>>> udf_fn1 = udf(fn1)
>>>> # hive table is of size > 500 Gigs with ~4500 partitions
>>>> cdf = spark.read.table("xxxx")
>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>     .drop("colz") \
>>>>     .withColumnRenamed("colz", "coly")
>>>>
>>>> edf = ddf \
>>>>     .filter(ddf.colp == 'some_value') \
>>>>     .rdd.map(lambda row: fn2(row)) \
>>>>     .toDF()
>>>>
>>>> # simple way for the performance test in both platforms
>>>> print(edf.count())
>>>>
>>>> Now when I run the same code in a brand new jupyter notebook, it runs 6x
>>>> faster than when I run this python script using spark-submit. The
>>>> configurations are printed and compared from both platforms and they
>>>> are exactly the same. I even tried to run this script in a single cell of
>>>> a jupyter notebook and still got the same performance. I need to
>>>> understand if I am missing something in spark-submit which is causing the
>>>> issue. I tried to minimise the script to reproduce the same behaviour
>>>> without much code.
>>>>
>>>> Both are run in client mode on a yarn based spark cluster. The machines
>>>> from which both are executed are also the same, and so is the user.
>>>>
>>>> What I found is that the quantile value for the median for the one run
>>>> with jupyter was 1.3 mins and for the one run with spark-submit was
>>>> ~8.5 mins. I am not able to figure out why this is happening.
>>>>
>>>> Has anyone faced this kind of issue before, or does anyone know how to
>>>> resolve it?
>>>>
>>>> *Regards,*
>>>> *Dhrub*
>>>>
>>>
>>>
>>> --
>>>
>>> *Patrick McCarthy*
>>>
>>> Senior Data Scientist, Machine Learning Engineering
>>>
>>> Dstillery
>>>
>>> 470 Park Ave South, 17th Floor, NYC 10016
>>>
>>