How to import custom Python module in MapReduce job?

Andrei Sun, 11 Aug 2013 23:51:50 -0700

(cross-posted from StackOverflow
<http://stackoverflow.com/questions/18150208/how-to-import-custom-module-in-mapreduce-job?noredirect=1#comment26584564_18150208>)


I have a MapReduce job defined in the file *main.py*, which imports the module
lib from the file *lib.py*. I use Hadoop Streaming to submit this job to a
Hadoop cluster as follows:

hadoop jar /usr/lib/hadoop-mapreduce/hadoop-streaming.jar \
    -files lib.py,main.py \
    -mapper "./main.py map" -reducer "./main.py reduce" \
    -input input -output output

As I understand it, this should put both main.py and lib.py into the
*distributed cache folder* on each compute node and thus make the module lib
available to main. But that doesn't happen: the log file shows that the files
*really are copied* to the same directory, yet main can't import lib and
throws an *ImportError*.
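
One way to narrow this down is to log what the interpreter itself sees at the top of main.py, before the failing import. The following is only a minimal diagnostic sketch: the helper names are hypothetical, and it assumes that whatever the script writes to stderr ends up in the task logs.

```python
import os
import sys


def describe_import_environment(script_file):
    """Collect the facts that decide whether `import lib` can succeed."""
    cwd = os.getcwd()
    return {
        "cwd": cwd,                                    # where the task was started
        "cwd_listing": sorted(os.listdir(cwd)),        # is lib.py really here?
        "script_as_invoked": script_file,              # e.g. ./main.py
        "script_resolved": os.path.realpath(script_file),  # symlinks followed
        "script_is_symlink": os.path.islink(script_file),
        "sys_path": list(sys.path),                    # directories searched by import
    }


def log_import_environment(script_file, stream=sys.stderr):
    """Write one 'key = value' line per fact to the given stream."""
    for key, value in describe_import_environment(script_file).items():
        stream.write("%s = %r\n" % (key, value))
```

Calling `log_import_environment(__file__)` as the first statement of main.py would show whether the directory that contains lib.py is actually one of the entries in `sys.path`.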

Adding the current directory to the path didn't work:

import os
import sys
sys.path.append(os.path.realpath(__file__))
import lib  # ImportError

Loading the module manually, though, did the trick:

import imp
lib = imp.load_source('lib', 'lib.py')
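
For reference, the imp module is deprecated in Python 3 and was removed in Python 3.12. The same manual load can be sketched with importlib; the wrapper name `load_source` below is only chosen to mirror the call above and is not a standard library function.

```python
import importlib.util
import sys


def load_source(module_name, path):
    """Load a module from an explicit file path, bypassing the sys.path search."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    if spec is None or spec.loader is None:
        raise ImportError("cannot load %r from %r" % (module_name, path))
    module = importlib.util.module_from_spec(spec)
    # Register the module first so that later `import <module_name>` reuses it.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module
```

With this helper, `lib = load_source('lib', 'lib.py')` plays the same role as the imp call.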

But that's not what I want. So why can the Python interpreter see the other
.py files in the same directory but not import them? Note that I have already
tried adding an empty __init__.py file to the same directory, without effect.
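
One detail of the attempt above is worth checking: entries in sys.path have to be directories (or zip archives), whereas os.path.realpath(__file__) is the path of main.py itself, with symlinks resolved. If the files in the task's working directory are symlinks into the cache (an assumption here, not something confirmed by the logs quoted above), the resolved location may also be a different directory from the one where lib.py appears. A minimal sketch that adds directories instead, using the hypothetical helper name `add_import_dirs`:

```python
import os
import sys


def add_import_dirs(script_file):
    """Put candidate directories (not file paths) at the front of sys.path."""
    candidates = [
        os.getcwd(),                                    # task working directory
        os.path.dirname(os.path.abspath(script_file)),  # script's directory, symlinks not resolved
    ]
    added = []
    for directory in candidates:
        if directory not in sys.path:
            sys.path.insert(0, directory)
            added.append(directory)
    return added


# Intended use at the top of main.py (kept as a comment so the sketch runs
# even where no lib.py exists):
#     add_import_dirs(__file__)
#     import lib
```

Whether this is enough on a given cluster depends on where the task is started and how the cached files are laid out, so it is best tried together with some logging of sys.path.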
