I'll take a look at this tomorrow, but my initial hunch is that this problem is serialization/pickling-related: the UDF may be serialized differently when it's defined in a module other than __main__. To confirm this, check the logs on the worker that ran the failed task, since they should explain why the Python worker crashed. Assuming the problem is serialization-related, you could try shipping task.py to the cluster, e.g. by calling addPyFile(__file__) on your SparkContext (or by passing the file via the pyFiles argument when creating the context), and see if that fixes things. If this is a serialization problem, it would be nice to document it (or fix it by modifying cloudpickle).
- Josh

On Fri, Nov 15, 2013 at 12:51 AM, Andrei <[email protected]> wrote:

> I have 2 Python modules/scripts - task.py and runner.py. First one
> (task.py) is a little Spark job and works perfectly well by itself.
> However, when called from runner.py with exactly the same arguments, it
> fails with only useless message (both - in terminal and worker logs).
>
> org.apache.spark.SparkException: Python worker exited unexpectedly
> (crashed)
>
> Below there's code for both - task.py and runner.py:
>
> task.py
> -----------
>
> #!/usr/bin/env pyspark
> from __future__ import print_function
> from pyspark import SparkContext
>
> def process(line):
>     return line.strip()
>
> def main(spark_master, path):
>     sc = SparkContext(spark_master, 'My Job')
>     rdd = sc.textFile(path)
>     rdd = rdd.map(process)  # this line causes troubles when called from runner.py
>     count = rdd.count()
>     print(count)
>
> if __name__ == '__main__':
>     main('spark://spark-master-host:7077',
>          'hdfs://hdfs-namenode-host:8020/path/to/file.log')
>
>
> runner.py
> -------------
>
> #!/usr/bin/env pyspark
>
> import task
>
> if __name__ == '__main__':
>     task.main('spark://spark-master-host:7077',
>               'hdfs://hdfs-namenode-host:8020/path/to/file.log')
>
>
> -------------------------------------------------------------------------------------------
>
> So, what's the difference between calling PySpark-enabled script directly
> and as Python module? What are good rules for writing multi-module Python
> programs with Spark?
>
> Thanks,
> Andrei
