It's not weird behavior. Did you run the job in cluster mode? I suspect your driver died / finished / stopped after 12 hours but your job continued. That is possible, since you didn't output anything to the console on the driver node.
Quite a long time ago, when I first tried Spark Streaming, I launched PySpark Streaming jobs in PyCharm and the pyspark console and "killed" them via Ctrl+Z. The drivers were gone, but the YARN containers (where the computations on the slaves were performed) remained. Nevertheless, I believe the final result in "some table" is corrupted.

On Fri, Sep 30, 2016 at 9:33 AM, ayan guha <[email protected]> wrote:
> Hi
>
> I just observed a little weird behavior:
>
> I ran a pyspark job, a very simple one.
>
> conf = SparkConf()
> conf.setAppName("Historical Meter Load")
> conf.set("spark.yarn.queue", "root.Applications")
> conf.set("spark.executor.instances", "50")
> conf.set("spark.executor.memory", "10g")
> conf.set("spark.yarn.executor.memoryOverhead", "2048")
> conf.set("spark.sql.shuffle.partitions", 1000)
> conf.set("spark.executor.cores", "4")
> sc = SparkContext(conf=conf)
> sqlContext = HiveContext(sc)
>
> df = sqlContext.sql("some sql")
>
> c = df.count()
>
> df.filter(df["RNK"] == 1).saveAsTable("some table").mode("overwrite")
>
> sc.stop()
>
> It is running on a CDH 5.7 cluster, Spark 1.6.0.
>
> Behavior observed: after a few hours of running (definitely over 12H, but
> not sure exactly when), YARN reported the job as Completed, finished
> successfully, whereas the job kept running (I can see from the Application
> Master link) for 22H. The timing of the job is expected. The behavior of
> YARN is not.
>
> Is it a known issue? Is it a pyspark-specific issue, or the same with Scala
> as well?
>
>
> --
> Best Regards,
> Ayan Guha
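A side note on the quoted snippet: `df.filter(...).saveAsTable("some table").mode("overwrite")` chains the calls in the wrong order. In Spark 1.6, `saveAsTable` is a terminal action that returns `None`, so `.mode(...)` must be set on the `DataFrameWriter` *before* the save; the intended line was presumably something like `df.filter(df["RNK"] == 1).write.mode("overwrite").saveAsTable("some table")`. The stub below (a hypothetical stand-in, not the real pyspark classes) illustrates why the original order fails at runtime:

```python
# FakeWriter is a minimal stand-in for Spark's DataFrameWriter chaining pattern
# (hypothetical stub for illustration only, not the real pyspark API).
class FakeWriter:
    def __init__(self):
        self.save_mode = "error"  # Spark's default save mode

    def mode(self, m):
        self.save_mode = m
        return self  # mode() returns the writer itself, so it can be chained

    def saveAsTable(self, name):
        return None  # terminal action: returns None, nothing left to chain


# Correct order: configure the mode first, then save.
w = FakeWriter()
result = w.mode("overwrite").saveAsTable("some_table")
assert w.save_mode == "overwrite"
assert result is None

# Original order: saveAsTable() returns None, so .mode() raises AttributeError.
failed = False
try:
    FakeWriter().saveAsTable("some_table").mode("overwrite")
except AttributeError:
    failed = True
assert failed
```

If the real job hit this `AttributeError` only after the long `count()` and filter completed, that could also explain a mismatch between what YARN reported and what the Application Master UI showed.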
