It's not weird behavior. Did you run the job in cluster mode? I suspect your driver died / finished / stopped after 12 hours but your job continued. That is possible, since you didn't output anything to the console on the driver node.
Quite a long time ago, when I first tried Spark Streaming, I launched PySpark Streaming jobs in PyCharm and the pyspark console and "killed" them via Ctrl+Z. The drivers were gone, but the YARN containers (where the computations on the slaves were performed) remained. Nevertheless, I believe the final result in "some table" is corrupted.

On Fri, Sep 30, 2016 at 9:33 AM, ayan guha <[email protected]> wrote:
> Hi
>
> I just observed a little weird behavior:
>
> I ran a pyspark job, a very simple one.
>
> conf = SparkConf()
> conf.setAppName("Historical Meter Load")
> conf.set("spark.yarn.queue", "root.Applications")
> conf.set("spark.executor.instances", "50")
> conf.set("spark.executor.memory", "10g")
> conf.set("spark.yarn.executor.memoryOverhead", "2048")
> conf.set("spark.sql.shuffle.partitions", 1000)
> conf.set("spark.executor.cores", "4")
> sc = SparkContext(conf=conf)
> sqlContext = HiveContext(sc)
>
> df = sqlContext.sql("some sql")
>
> c = df.count()
>
> df.filter(df["RNK"] == 1).saveAsTable("some table").mode("overwrite")
>
> sc.stop()
>
> It is running on a CDH 5.7 cluster, Spark 1.6.0.
>
> Behavior observed: after a few hours of running (definitely over 12H, but
> not sure exactly when), YARN reported the job as Completed, finished
> successfully, whereas the job kept running (I can see from the Application
> Master link) for 22H. The timing of the job is expected. The behavior of
> YARN is not.
>
> Is it a known issue? Is it a pyspark-specific issue, or the same with Scala
> as well?
>
>
> --
> Best Regards,
> Ayan Guha
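A side note on the quoted snippet: `df.filter(...).saveAsTable("some table").mode("overwrite")` chains the calls in the wrong order. In Spark 1.6, `saveAsTable` is a terminal action that returns `None`, so `.mode(...)` must be set on the `DataFrameWriter` *before* the save; the intended line was presumably something like `df.filter(df["RNK"] == 1).write.mode("overwrite").saveAsTable("some table")`. The stub below (a hypothetical stand-in, not the real pyspark classes) illustrates why the original order fails at runtime:

```python
# FakeWriter is a minimal stand-in for Spark's DataFrameWriter chaining pattern
# (hypothetical stub for illustration only, not the real pyspark API).
class FakeWriter:
    def __init__(self):
        self.save_mode = "error"  # Spark's default save mode

    def mode(self, m):
        self.save_mode = m
        return self  # mode() returns the writer itself, so it can be chained

    def saveAsTable(self, name):
        return None  # terminal action: returns None, nothing left to chain


# Correct order: configure the mode first, then save.
w = FakeWriter()
result = w.mode("overwrite").saveAsTable("some_table")
assert w.save_mode == "overwrite"
assert result is None

# Original order: saveAsTable() returns None, so .mode() raises AttributeError.
failed = False
try:
    FakeWriter().saveAsTable("some_table").mode("overwrite")
except AttributeError:
    failed = True
assert failed
```

If the real job hit this `AttributeError` only after the long `count()` and filter completed, that could also explain a mismatch between what YARN reported and what the Application Master UI showed.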
