Ok. Can't think of why that would happen.
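Since the conf params you set explicitly already match, the only other quick check I can think of is to dump the *full* effective configuration from each run and diff the two dumps, in case one of them is silently picking up extra defaults (e.g. from a different spark-defaults.conf). A rough sketch, assuming the active SparkSession is called spark in both the notebook and the submitted script:

    import json

    # Dump every conf key/value the running context actually sees, sorted so the
    # dumps from the two runs can be diffed directly.
    conf_dump = dict(spark.sparkContext.getConf().getAll())
    with open("/tmp/effective_spark_conf.json", "w") as f:
        json.dump(conf_dump, f, indent=2, sort_keys=True)

Then a plain diff of the two files should show anything that differs between the notebook run and the spark-submit run.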
On Tue, Sep 10, 2019 at 8:26 PM Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote:

> As mentioned in the very first mail:
> * it is submitted on the same cluster.
> * both are submitted from the same machine and by the same user.
> * each of them has 128 executors and 2 cores per executor with 8 GB of
> memory each, and both of them are getting that while running.
>
> To clarify further, let me quote what I mentioned above. *This data is
> taken from the Spark UI when the jobs are almost finished in both.*
> "What I found is that the quantile value for the median for the one run
> with Jupyter was 1.3 mins and for the one run with spark-submit was ~8.5
> mins," which means the time taken per task is much higher in the
> spark-submit script than in the Jupyter script. This is where I am really
> puzzled, because they are the exact same code; why does running them in
> two different ways vary so much in execution time?
>
> *Regards,*
> *Dhrubajyoti Hati.*
> *Mob No: 9886428028/9652029028*
>
> On Wed, Sep 11, 2019 at 8:42 AM Stephen Boesch <java...@gmail.com> wrote:
>
>> Sounds like you have done your homework to properly compare. I'm
>> guessing the answer to the following is yes, but in any case: are they
>> both running against the same Spark cluster with the same configuration
>> parameters, especially executor memory and number of workers?
>>
>> On Tue, Sep 10, 2019 at 8:05 PM Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote:
>>
>>> No, I checked for that; hence I wrote "brand new" Jupyter notebook.
>>> Also, the times taken by the two are 30 mins and ~3 hrs, as I am reading
>>> 500 gigs of compressed, base64-encoded text data from a Hive table and
>>> decompressing and decoding it in one of the UDFs. Also, the time
>>> compared is from the Spark UI, not how long the job actually takes after
>>> submission. It's just the running time I am comparing/mentioning.
>>>
>>> As mentioned earlier, all the Spark conf params even match in the two
>>> scripts, and that's why I am puzzled about what is going on.
>>>
>>> On Wed, Sep 11, 2019 at 12:44 AM Patrick McCarthy <pmccar...@dstillery.com> wrote:
>>>
>>>> It's not obvious from what you pasted, but perhaps the Jupyter notebook
>>>> is already connected to a running Spark context, while spark-submit
>>>> needs to get a new spot in the (YARN?) queue.
>>>>
>>>> I would check the cluster job IDs for both to ensure you're getting new
>>>> cluster tasks for each.
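>>>>
>>>> Something like the following in both environments would make it obvious
>>>> whether each run really becomes its own YARN application (just a rough
>>>> sketch; it assumes the active SparkSession is called spark):
>>>>
>>>> print(spark.sparkContext.applicationId)  # e.g. application_<timestamp>_<seq>
>>>> print(spark.sparkContext.master)         # should be "yarn" in both cases
>>>> print(spark.sparkContext.uiWebUrl)       # per-application Spark UI
>>>>
>>>> On the cluster side, "yarn application -list" should then show a
>>>> separate entry for each run if they really are independent applications.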
>>>>
>>>> On Tue, Sep 10, 2019 at 2:33 PM Dhrubajyoti Hati <dhruba.w...@gmail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I am facing a weird behaviour while running a python script. Here is
>>>>> roughly what the code looks like:
>>>>>
>>>>> def fn1(ip):
>>>>>     some code...
>>>>>     ...
>>>>>
>>>>> def fn2(row):
>>>>>     ...
>>>>>     some operations
>>>>>     ...
>>>>>     return row1
>>>>>
>>>>> from pyspark.sql.functions import udf
>>>>>
>>>>> udf_fn1 = udf(fn1)
>>>>> # hive table is of size > 500 Gigs with ~4500 partitions
>>>>> cdf = spark.read.table("xxxx")
>>>>> ddf = cdf.withColumn("coly", udf_fn1(cdf.colz)) \
>>>>>     .drop("colz") \
>>>>>     .withColumnRenamed("colz", "coly")
>>>>>
>>>>> edf = ddf \
>>>>>     .filter(ddf.colp == 'some_value') \
>>>>>     .rdd.map(lambda row: fn2(row)) \
>>>>>     .toDF()
>>>>>
>>>>> # simple way to do the performance test on both platforms
>>>>> print(edf.count())
>>>>>
>>>>> Now when I run the same code in a brand new Jupyter notebook it runs
>>>>> 6x faster than when I run this python script using spark-submit. The
>>>>> configurations are printed and compared from both platforms and they
>>>>> are exactly the same. I even tried to run this script in a single cell
>>>>> of a Jupyter notebook and still got the same performance.
>>>>>
>>>>> I need to understand if I am missing something in the spark-submit
>>>>> which is causing the issue. I tried to minimise the script so as to
>>>>> reproduce the same issue without much code.
>>>>>
>>>>> Both are run in client mode on a YARN-based Spark cluster. The machines
>>>>> from which both are executed are also the same, and they run under the
>>>>> same user.
>>>>>
>>>>> What I found is that the quantile value for the median for the one run
>>>>> with Jupyter was 1.3 mins and for the one run with spark-submit was
>>>>> ~8.5 mins. I am not able to figure out why this is happening.
>>>>>
>>>>> Has anyone faced this kind of issue before, or does anyone know how to
>>>>> resolve it?
>>>>>
>>>>> *Regards,*
>>>>> *Dhrub*
>>>>
>>>>
>>>> --
>>>>
>>>> *Patrick McCarthy*
>>>>
>>>> Senior Data Scientist, Machine Learning Engineering
>>>>
>>>> Dstillery
>>>>
>>>> 470 Park Ave South, 17th Floor, NYC 10016