Hi,

I have an issue with my PySpark running in Kubernetes (testing on
minikube).

The project is zipped as DSBQ.zip and passed to spark-submit via --py-files,
with the zip file on HDFS (the pod can read it). DSBQ.zip is the root zip
file for the application project.

The zip was created at the project root and has the following structure:

ls DSBQ

__init__.py  assembly  conf  data  deployment  lib  linux  othermisc  sparkutils  src  tests

Under the conf folder I have a file called config.yml that is read in
src/configure.py as follows:


import os
import sys

import yaml

# absolute path on the host where the project lives
with open("/home/hduser/dba/bin/python/DSBQ/conf/config.yml", 'r') as file:
    config: dict = yaml.safe_load(file)


That absolute path, /home/hduser/dba/bin/python/DSBQ/conf/config.yml, is not
recognised in the pod.


This relative path is not recognised either:

    with open("DSBQ/conf/config.yml", 'r') as file:
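
Since DSBQ.zip arrives via --py-files and is put on sys.path, I have been
wondering whether reading the file straight out of the zip would work,
something like this (an untested sketch; it assumes the DSBQ package at the
zip root, with its __init__.py, is importable on the driver):

    import pkgutil

    import yaml

    # when DSBQ is imported from the zip, pkgutil.get_data() delegates to
    # the zipimporter and returns the resource bytes from inside DSBQ.zip
    raw = pkgutil.get_data("DSBQ", "conf/config.yml")
    if raw is not None:
        config: dict = yaml.safe_load(raw)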

PySpark can read from HDFS, so this is the spark-submit command used:

        spark-submit --verbose \
           --master k8s://$K8S_SERVER \
           --deploy-mode cluster \
           --name pytest \
           --py-files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/DSBQ.zip,hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/dependencies_short.zip \
           --conf spark.kubernetes.namespace=spark \
           --conf spark.executor.instances=1 \
           --conf spark.kubernetes.driver.limit.cores=1 \
           --conf spark.executor.cores=1 \
           --conf spark.executor.memory=500m \
           --conf spark.kubernetes.container.image=${IMAGE} \
           --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark-serviceaccount \
           --conf spark.kubernetes.file.upload.path=$SOURCE_DIR \
           --conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
           --conf spark.kubernetes.driver.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
           --conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.mount.path=$MOUNT_PATH \
           --conf spark.kubernetes.executor.volumes.$VOLUME_TYPE.$VOLUME_NAME.options.path=$MOUNT_PATH \
           hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/${APPLICATION}

I have not managed to read the yaml file from external mounts or other
methods. The only way I can read it is through sc.wholeTextFiles() from HDFS:

    lines = sc.wholeTextFiles("hdfs://$HOST:$PORT/minikube/codes/config.yml")
    rdd = lines.map(lambda x: x[1])  # keep the file content, drop the path
    l = rdd.collect()
    print(l)

l returns the value as a one-element list, like the sample below from the yaml file:



['common:\n  appName: \'md\'\n  newtopic: \'newtopic\'\nplot_fonts:\n  font:\n    \'family\': \'serif\'\n    \'color\': \'darkred\'\n    \'weight\': \'normal\'\n    \'size\': 10\n\n  # define font dictionary\n  font_small:\n    \'family\': \'serif\'\n    \'color\': \'darkred\'\n    \'weight\': \'normal\'\n    \'size\': 7\n\n

Which corresponds to these lines in the yaml file:


common:
  appName: 'md'
  newtopic: 'newtopic'
plot_fonts:
  font:
    'family': 'serif'
    'color': 'darkred'
    'weight': 'normal'
    'size': 10

  # define font dictionary
  font_small:
    'family': 'serif'
    'color': 'darkred'
    'weight': 'normal'
    'size': 7

Now I need to read that list and create a dict out of it:

    config: dict = yaml.safe_load(l)

which throws this error:

  File "/tmp/spark-34d56d02-ce8a-442f-9c84-f265f1c279e2/testpackages.py",
line 71, in <module>
    main()
  File "/tmp/spark-34d56d02-ce8a-442f-9c84-f265f1c279e2/testpackages.py",
line 42, in main
    config: dict = yaml.safe_load(l)
  File
"/tmp/spark-34d56d02-ce8a-442f-9c84-f265f1c279e2/dependencies_short.zip/yaml/__init__.py",
line 162, in safe_load
  File
"/tmp/spark-34d56d02-ce8a-442f-9c84-f265f1c279e2/dependencies_short.zip/yaml/__init__.py",
line 112, in load
  File
"/tmp/spark-34d56d02-ce8a-442f-9c84-f265f1c279e2/dependencies_short.zip/yaml/loader.py",
line 34, in __init__
  File
"/tmp/spark-34d56d02-ce8a-442f-9c84-f265f1c279e2/dependencies_short.zip/yaml/reader.py",
line 85, in __init__
  File
"/tmp/spark-34d56d02-ce8a-442f-9c84-f265f1c279e2/dependencies_short.zip/yaml/reader.py",
line 124, in determine_encoding
  File
"/tmp/spark-34d56d02-ce8a-442f-9c84-f265f1c279e2/dependencies_short.zip/yaml/reader.py",
line 178, in update_raw
AttributeError: 'list' object has no attribute 'read'

The error makes sense on reflection: yaml.safe_load() expects a string or a
file-like object with a read() method, not a list.
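
I suspect the fix is to hand it the element rather than the list, something
like this (untested):

    # l is the one-element list returned by rdd.collect(); pass the yaml
    # text itself rather than the list wrapping it
    config: dict = yaml.safe_load(l[0])
    print(config["common"]["appName"])  # should print 'md'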


Is there a cleaner way to read this yaml file inside the pod?
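
For example, would shipping the file itself with --files and resolving it
through SparkFiles work here? A sketch of what I have in mind (it assumes
config.yml is added as --files hdfs://$HDFS_HOST:$HDFS_PORT/minikube/codes/config.yml
on spark-submit):

    from pyspark import SparkFiles

    import yaml

    # --files downloads config.yml into each container's working directory;
    # SparkFiles.get() resolves the local path to it
    with open(SparkFiles.get("config.yml"), 'r') as f:
        config: dict = yaml.safe_load(f)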


Thanks


