I have two Python modules/scripts, task.py and runner.py. The first one
(task.py) is a little Spark job and works perfectly well on its own.
However, when it is called from runner.py with exactly the same arguments,
it fails with only a useless message (both in the terminal and in the
worker logs):

    org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
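
Both scripts are launched in exactly the same way, just via their pyspark
shebang (as far as I can tell, nothing else differs between the two runs):

    ./task.py      # works fine, prints the line count
    ./runner.py    # crashes in the map()/count() stage with the exception above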

Below is the code for both task.py and runner.py:

task.py
-----------

#!/usr/bin/env pyspark
from __future__ import print_function
from pyspark import SparkContext

def process(line):
    return line.strip()

def main(spark_master, path):
    sc = SparkContext(spark_master, 'My Job')
    rdd = sc.textFile(path)
    rdd = rdd.map(process)     # this line causes trouble when called from runner.py
    count = rdd.count()
    print(count)

if __name__ == '__main__':
    main('spark://spark-master-host:7077',
         'hdfs://hdfs-namenode-host:8020/path/to/file.log')


runner.py
-------------

#!/usr/bin/env pyspark

import task

if __name__ == '__main__':
    task.main('spark://spark-master-host:7077',
              'hdfs://hdfs-namenode-host:8020/path/to/file.log')

-------------------------------------------------------------------------------------------

So, what's the difference between calling a PySpark-enabled script directly
and importing it as a Python module? And what are good rules for writing
multi-module Python programs with Spark?
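
In case it's relevant, one Python-level difference I can see is where
process() ends up being defined; a trivial check, just for illustration:

    import task
    print(task.process.__module__)   # -> 'task' when task.py is imported (the runner.py case)

    # inside task.py executed directly, process.__module__ is '__main__'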

Thanks,
Andrei
