I have two Python modules/scripts, task.py and runner.py. The first one
(task.py) is a small Spark job and works perfectly well on its own.
However, when called from runner.py with exactly the same arguments, it
fails with nothing but this unhelpful message (both in the terminal and in the worker logs):
org.apache.spark.SparkException: Python worker exited unexpectedly
(crashed)
Below is the code for both task.py and runner.py:
task.py
-----------
#!/usr/bin/env pyspark
from __future__ import print_function
from pyspark import SparkContext
def process(line):
    return line.strip()

def main(spark_master, path):
    sc = SparkContext(spark_master, 'My Job')
    rdd = sc.textFile(path)
    rdd = rdd.map(process)  # this line causes trouble when called from runner.py
    count = rdd.count()
    print(count)

if __name__ == '__main__':
    main('spark://spark-master-host:7077',
         'hdfs://hdfs-namenode-host:8020/path/to/file.log')
runner.py
-------------
#!/usr/bin/env pyspark
import task
if __name__ == '__main__':
    task.main('spark://spark-master-host:7077',
              'hdfs://hdfs-namenode-host:8020/path/to/file.log')
-------------------------------------------------------------------------------------------
So, what's the difference between running a PySpark-enabled script directly
and importing it as a Python module? What are good rules for writing
multi-module Python programs with Spark?
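
In case the structure matters, here is an untested sketch of what I was
considering instead: create the SparkContext in the top-level script and pass
it into the task module (the run() function name is just my own placeholder,
not something from any docs):

task.py (sketch)
-----------
#!/usr/bin/env pyspark
from __future__ import print_function

def process(line):
    return line.strip()

def run(sc, path):
    # same logic as main(), but the SparkContext is created by the caller
    rdd = sc.textFile(path)
    count = rdd.map(process).count()
    print(count)

runner.py (sketch)
-------------
#!/usr/bin/env pyspark
from pyspark import SparkContext
import task

if __name__ == '__main__':
    sc = SparkContext('spark://spark-master-host:7077', 'My Job')
    task.run(sc, 'hdfs://hdfs-namenode-host:8020/path/to/file.log')

Would that be closer to the recommended pattern, or does it not really matter
where the SparkContext gets created?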
Thanks,
Andrei