I'm trying to use subprocess in a Python script that I call from an Oozie shell action. The subprocess call is supposed to read a file stored in Hadoop's HDFS.

I'm using hadoop-1.2.1 in pseudo-distributed mode and oozie-3.3.2.

Here is the Python script, named connected_subprocess.py:


    #!/usr/bin/python
    import subprocess
    import networkx as nx
    liste=subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",shell=True).split('\n')
    G=nx.DiGraph()
    f=open("/home/rlk/liste_strongly_connected.txt","wb")
    for item in liste:
        try:
            app1,app2=item.split('\t')
            G.add_edge(app1,app2)
        except:
            pass
    liste_connected=nx.strongly_connected_components(G)
    for item in liste_connected:
        if len(item)>1:
            f.write('{}\n'.format('\t'.join(item)))
    f.close()

The corresponding shell action in Oozie's workflow.xml is the following:


    <action name="final">
        <shell xmlns="uri:oozie:shell-action:0.1">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>connected_subprocess.py</exec>
            <file>connected_subprocess.py</file>
        </shell>
        <ok to="end" />
        <error to="kill" />
    </action>

When I run the Oozie job, the tasktracker log shows these errors:

    Error: Could not find or load main class org.apache.hadoop.fs.FsShell
    Traceback (most recent call last):
      File "./connected_subprocess.py", line 6, in <module>
        liste=subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",shell=True).split('\n')
      File "/usr/lib64/python2.7/subprocess.py", line 575, in check_output
        raise CalledProcessError(retcode, cmd, output=output)
    subprocess.CalledProcessError: Command 'hadoop fs -cat /user/root/output-data/calcul-proba/final.txt' returned non-zero exit status 1
    Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]


It seems that I cannot run a shell command from my Python script when the script runs inside an Oozie action, since everything works fine when I run the same script from my interactive shell. The log also says that the main class org.apache.hadoop.fs.FsShell cannot be found, even though I copied hadoop-core-1.2.1.jar into a lib folder next to my workflow.xml and job.properties files.
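Since the same script behaves differently in the two settings, one way to narrow this down is to compare the environment the script sees under Oozie with the one in the interactive shell. The following is only a minimal diagnostic sketch: the helper names `describe_environment` and `run_and_capture` are hypothetical, and the list of variables is my assumption about what the `hadoop` launcher script might depend on, not something confirmed for this setup.

```python
import os
import subprocess

# Assumption: variables that might influence how the `hadoop` command
# builds its classpath. Adjust the list for the actual installation.
ENV_KEYS = ("PATH", "JAVA_HOME", "HADOOP_HOME", "HADOOP_CONF_DIR",
            "HADOOP_CLASSPATH", "CLASSPATH")


def describe_environment(environ=None, keys=ENV_KEYS):
    """Return 'KEY=value' lines for the given keys, marking unset ones."""
    environ = os.environ if environ is None else environ
    return ["{}={}".format(key, environ.get(key, "<unset>")) for key in keys]


def run_and_capture(cmd):
    """Run a shell command and return (exit_code, stdout and stderr text).

    Unlike check_output, this does not raise on a non-zero exit status,
    so the error text of a failing command is still available.
    """
    proc = subprocess.Popen(cmd, shell=True,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    out, _ = proc.communicate()
    return proc.returncode, out.decode("utf-8", "replace")


if __name__ == "__main__":
    # Print the environment, then check which `hadoop` the shell resolves.
    for line in describe_environment():
        print(line)
    code, text = run_and_capture("which hadoop")
    print("which hadoop -> exit {}: {}".format(code, text.strip()))
```

Running this once from the interactive shell and once as the `<exec>` of the shell action, then comparing the two outputs, should show whether the action runs with a different PATH or with missing Hadoop variables.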

Is there any way I can work around this limitation?
