RE: PySpark Write File Container exited with a non-zero exit code 143

2021-05-20 Thread Clay McDonald
Still get the same error with “pyspark --conf queue=default --conf 
executor-memory=24G”
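
Even with the extra "y" removed, queue and executor-memory are still not Spark
property names, so --conf ignores them (that is what the "Ignoring non-Spark
config property" warnings in Mich's log further down are saying), and the
executor memory never actually changes. As a rough sketch, with the queue name
and size purely illustrative, the forms pyspark accepts on YARN look more like:

pyspark --master yarn --queue default --executor-memory 24G

or, with fully qualified property names:

pyspark --master yarn --conf spark.yarn.queue=default --conf spark.executor.memory=24g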

From: ayan guha 
Sent: Thursday, May 20, 2021 12:23 AM
To: Clay McDonald 
Cc: Mich Talebzadeh ; user@spark.apache.org
Subject: Re: PySpark Write File Container exited with a non-zero exit code 143

Hi -- Notice the additional "y" in red (as Mich mentioned)

pyspark --conf queue=default --conf executory-memory=24G


Re: PySpark Write File Container exited with a non-zero exit code 143

2021-05-19 Thread ayan guha
Hi -- Notice the additional "y" in red (as Mich mentioned)

pyspark --conf queue=default --conf executory-memory=24G


-- 
Best Regards,
Ayan Guha


RE: PySpark Write File Container exited with a non-zero exit code 143

2021-05-19 Thread Clay McDonald
How so?

From: Mich Talebzadeh 
Sent: Wednesday, May 19, 2021 5:45 PM
To: Clay McDonald 
Cc: user@spark.apache.org
Subject: Re: PySpark Write File Container exited with a non-zero exit code 143

Hi Clay,

Those parameters you are passing are not valid

pyspark --conf queue=default --conf executory-memory=24G




Re: PySpark Write File Container exited with a non-zero exit code 143

2021-05-19 Thread Mich Talebzadeh
Hi Clay,

Those parameters you are passing are not valid

pyspark --conf queue=default --conf executory-memory=24G

Python 3.7.3 (default, Apr  3 2021, 20:42:31)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-39)] on linux
Type "help", "copyright", "credits" or "license" for more information.
Warning: Ignoring non-Spark config property: executory-memory
Warning: Ignoring non-Spark config property: queue
2021-05-19 22:28:20,521 WARN util.NativeCodeLoader: Unable to load
native-hadoop library for your platform... using builtin-java classes where
applicable
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use
setLogLevel(newLevel).
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 3.1.1
      /_/

Using Python version 3.7.3 (default, Apr  3 2021 20:42:31)
Spark context Web UI available at http://rhes75:4040
Spark context available as 'sc' (master = local[*], app id =
local-1621459701490).
SparkSession available as 'spark'.

Also

pyspark dynamic_ARRAY_generator_parquet.py

Running python applications through 'pyspark' is not supported as of Spark
2.0.
Use ./bin/spark-submit 


This works

$SPARK_HOME/bin/spark-submit --master local[4]
dynamic_ARRAY_generator_parquet.py


See

 https://spark.apache.org/docs/latest/submitting-applications.html
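
For a cluster run the same idea extends to YARN. As an illustrative sketch only
(the queue, memory size and executor count are placeholders, and
dynamic_ARRAY_generator_parquet.py is the example script above):

$SPARK_HOME/bin/spark-submit --master yarn --deploy-mode client \
  --queue default --executor-memory 24G --num-executors 4 \
  dynamic_ARRAY_generator_parquet.py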

HTH



   view my Linkedin profile




Disclaimer: Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.






PySpark Write File Container exited with a non-zero exit code 143

2021-05-19 Thread Clay McDonald
Hello all,

I'm hoping someone can give me some direction for troubleshooting this issue.
I'm trying to write from Spark on a Hortonworks (Cloudera) HDP cluster. I ssh
directly to the first datanode and run PySpark with the following command;
however, it always fails no matter what memory sizes I set for the YARN
containers and YARN queues. Any suggestions?



pyspark --conf queue=default --conf executory-memory=24G

--

from pyspark.sql.functions import regexp_replace, col

# Source and destination directories on HDFS, and the source file encoding.
HDFS_RAW = "/HDFS/Data/Test/Original/MyData_data/"
# HDFS_OUT = "/ HDFS/Data/Test/Processed/Convert_parquet/Output"
HDFS_OUT = "/tmp"
ENCODING = "utf-16"

fileList1 = [
    'Test _2003.txt'
]

for f in fileList1:
    fname = f
    fname_noext = fname.split('.')[0]

    # Pipe-delimited, UTF-16, possibly multi-line records.
    df = (spark.read
          .option("delimiter", "|")
          .option("encoding", ENCODING)
          .option("multiLine", True)
          .option("wholeFile", "true")
          .csv('{}/{}'.format(HDFS_RAW, fname), header=True))

    lastcol = df.columns[-1]
    print('showing {}'.format(fname))

    # Strip a trailing carriage return from the last column's name and values.
    if '\r' in lastcol:
        lastcol = lastcol.replace('\r', '')
        df = df.withColumn(lastcol,
                           regexp_replace(col("{}\r".format(lastcol)), "[\r]", "")
                           ).drop('{}\r'.format(lastcol))

    df.write.format('parquet').mode('overwrite').save("{}/{}".format(HDFS_OUT, fname_noext))



Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: 
Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 
1.0 (TID 4, DataNode01.mydomain.com, executor 5): ExecutorLostFailure (executor 
5 exited caused by one of the running tasks) Reason: Container marked as 
failed: container_e331_1621375512548_0021_01_06 on host: 
DataNode01.mydomain.com. Exit status: 143. Diagnostics: [2021-05-19 
18:09:06.392]Container killed on request. Exit code is 143
[2021-05-19 18:09:06.413]Container exited with a non-zero exit code 143.
[2021-05-19 18:09:06.414]Killed by external signal


THANKS! CLAY
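
A note on the failure itself: exit code 143 is 128 + 15, i.e. the container
received SIGTERM, and in a YARN trace like the one above that usually means the
NodeManager killed the container, most often for exceeding its memory
allocation (sometimes for preemption). Once the executor memory setting is
actually being applied, the usual knobs to experiment with are
spark.executor.memory and spark.executor.memoryOverhead; it is also worth
remembering that CSV reads with multiLine=True are not splittable, so each
input file is handled by a single task no matter how many executors are
available. A purely illustrative invocation, with all sizes made up:

pyspark --master yarn --queue default \
  --executor-memory 24G \
  --conf spark.executor.memoryOverhead=4g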