Re: YARN - Pyspark

2016-09-30 Thread ayan guha
I understand, thank you for the explanation. However, I ran in yarn-client
mode, submitted with nohup, and I could see the logs going into the log file
throughout the life of the job. Everything worked well on the Spark side;
YARN just reported success long before the job actually completed. I would
love to understand if I am missing anything here.
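
For reference, this is roughly how I have been cross-checking what YARN has
recorded for the application against the driver log. It is only a sketch: the
ResourceManager host/port and the application id below are placeholders for my
cluster, and it assumes Python 2 as used with Spark 1.6.

import json
import urllib2

# Placeholders -- substitute the real ResourceManager address and app id.
RM_URL = "http://resourcemanager-host:8088"
APP_ID = "application_XXXXXXXXXXXXX_XXXX"

# YARN ResourceManager REST API: per-application info.
resp = urllib2.urlopen("%s/ws/v1/cluster/apps/%s" % (RM_URL, APP_ID))
app = json.load(resp)["app"]

# 'state' is YARN's own view (RUNNING / FINISHED / ...); 'finalStatus' is what
# the ApplicationMaster reported back (SUCCEEDED / FAILED / UNDEFINED).
print app["state"], app["finalStatus"], app["elapsedTime"]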

On Fri, Sep 30, 2016 at 8:32 PM, Timur Shenkao  wrote:

> It's not weird behavior. Did you run the job in cluster mode?
> I suspect your driver died / finished / stopped after 12 hours but your
> job continued. That is possible because you didn't output anything to the
> console on the driver node.
>
> Quite a long time ago, when I first tried Spark Streaming, I launched PySpark
> Streaming jobs in PyCharm and the pyspark console and "killed" them via
> Ctrl+Z. The drivers were gone, but the YARN containers (where the
> computations on the slaves were performed) remained.
> Nevertheless, I believe the final result in "some table" is corrupted.
>
> On Fri, Sep 30, 2016 at 9:33 AM, ayan guha  wrote:
>
>> Hi
>>
>> I just observed some slightly weird behavior:
>>
>> I ran a very simple PySpark job.
>>
>> conf = SparkConf()
>> conf.setAppName("Historical Meter Load")
>> conf.set("spark.yarn.queue","root.Applications")
>> conf.set("spark.executor.instances","50")
>> conf.set("spark.executor.memory","10g")
>> conf.set("spark.yarn.executor.memoryOverhead","2048")
>> conf.set("spark.sql.shuffle.partitions",1000)
>> conf.set("spark.executor.cores","4")
>> sc = SparkContext(conf = conf)
>> sqlContext = HiveContext(sc)
>>
>> df = sqlContext.sql("some sql")
>>
>> c = df.count()
>>
>> df.filter(df["RNK"] == 1).write.mode("overwrite").saveAsTable("some table")
>>
>> sc.stop()
>>
>> This is running on a CDH 5.7 cluster with Spark 1.6.0.
>>
>> Behavior observed: after a few hours of running (definitely over 12 hours,
>> but I am not sure exactly when), YARN reported the job as Completed and
>> finished successfully, whereas the job kept running (I can see this from the
>> Application Master link) for 22 hours. The timing of the job is expected;
>> the behavior of YARN is not.
>>
>> Is this a known issue? Is it PySpark-specific, or does it happen with Scala
>> as well?
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>


-- 
Best Regards,
Ayan Guha


Re: YARN - Pyspark

2016-09-30 Thread Timur Shenkao
It's not weird behavior. Did you run the job in cluster mode?
I suspect your driver died / finished / stopped after 12 hours but your job
continued. That is possible because you didn't output anything to the console
on the driver node.

Quite a long time ago, when I first tried Spark Streaming, I launched PySpark
Streaming jobs in PyCharm and the pyspark console and "killed" them via Ctrl+Z.
The drivers were gone, but the YARN containers (where the computations on the
slaves were performed) remained.
Nevertheless, I believe the final result in "some table" is corrupted.
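
A rough way to check whether what landed in the table is complete, reusing the
df / sqlContext names and the "some table" placeholder from your snippet (a
sketch only, not something I have run against your data):

# Sanity check: compare what should have been written with what is in Hive.
expected = df.filter(df["RNK"] == 1).count()
actual = sqlContext.table("some table").count()

if actual != expected:
    print "Table looks incomplete: %d rows written, %d expected" % (actual, expected)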

On Fri, Sep 30, 2016 at 9:33 AM, ayan guha  wrote:

> Hi
>
> I just observed some slightly weird behavior:
>
> I ran a very simple PySpark job.
>
> conf = SparkConf()
> conf.setAppName("Historical Meter Load")
> conf.set("spark.yarn.queue","root.Applications")
> conf.set("spark.executor.instances","50")
> conf.set("spark.executor.memory","10g")
> conf.set("spark.yarn.executor.memoryOverhead","2048")
> conf.set("spark.sql.shuffle.partitions",1000)
> conf.set("spark.executor.cores","4")
> sc = SparkContext(conf = conf)
> sqlContext = HiveContext(sc)
>
> df = sqlContext.sql("some sql")
>
> c = df.count()
>
> df.filter(df["RNK"] == 1).write.mode("overwrite").saveAsTable("some table")
>
> sc.stop()
>
> This is running on a CDH 5.7 cluster with Spark 1.6.0.
>
> Behavior observed: after a few hours of running (definitely over 12 hours,
> but I am not sure exactly when), YARN reported the job as Completed and
> finished successfully, whereas the job kept running (I can see this from the
> Application Master link) for 22 hours. The timing of the job is expected;
> the behavior of YARN is not.
>
> Is this a known issue? Is it PySpark-specific, or does it happen with Scala
> as well?
>
>
> --
> Best Regards,
> Ayan Guha
>


YARN - Pyspark

2016-09-30 Thread ayan guha
Hi

I just observed some slightly weird behavior:

I ran a very simple PySpark job.

conf = SparkConf()
conf.setAppName("Historical Meter Load")
conf.set("spark.yarn.queue","root.Applications")
conf.set("spark.executor.instances","50")
conf.set("spark.executor.memory","10g")
conf.set("spark.yarn.executor.memoryOverhead","2048")
conf.set("spark.sql.shuffle.partitions",1000)
conf.set("spark.executor.cores","4")
sc = SparkContext(conf = conf)
sqlContext = HiveContext(sc)

df = sqlContext.sql("some sql")

c = df.count()

df.filter(df["RNK"] == 1).write.mode("overwrite").saveAsTable("some table")

sc.stop()

This is running on a CDH 5.7 cluster with Spark 1.6.0.

Behavior observed: after a few hours of running (definitely over 12 hours, but
I am not sure exactly when), YARN reported the job as Completed and finished
successfully, whereas the job kept running (I can see this from the Application
Master link) for 22 hours. The timing of the job is expected; the behavior of
YARN is not.

Is this a known issue? Is it PySpark-specific, or does it happen with Scala as
well?
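
In case it helps, this is roughly how I could instrument the job next time to
correlate driver-side progress with the moment YARN flips the application to
Completed. It is only a sketch: the statusTracker calls are from the PySpark
API, while the timing and print messages are just illustrative, and df / sc /
"some table" are the placeholders from the snippet above.

import time

# Log driver-side activity around each action so it can be lined up with the
# timestamps in the YARN ResourceManager UI.
start = time.time()
c = df.count()
print "count finished after %.1f min, %d rows" % ((time.time() - start) / 60.0, c)

tracker = sc.statusTracker()
print "active jobs:", tracker.getActiveJobsIds()
print "active stages:", tracker.getActiveStageIds()

start = time.time()
df.filter(df["RNK"] == 1).write.mode("overwrite").saveAsTable("some table")
print "write finished after %.1f min" % ((time.time() - start) / 60.0)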


-- 
Best Regards,
Ayan Guha