I'm trying to use subprocess in a Python script that I call from an Oozie shell action. The subprocess call is supposed to read a file stored in Hadoop's HDFS.
I'm using hadoop-1.2.1 in pseudo-distributed mode and oozie-3.3.2.
Here is the Python script, named connected_subprocess.py:
#!/usr/bin/python
import subprocess
import networkx as nx

# Read the edge list produced by the previous job directly from HDFS.
liste = subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt", shell=True).split('\n')

G = nx.DiGraph()
f = open("/home/rlk/liste_strongly_connected.txt", "wb")
for item in liste:
    try:
        # Each line is "app1<TAB>app2"; skip malformed lines.
        app1, app2 = item.split('\t')
        G.add_edge(app1, app2)
    except ValueError:
        pass

# Keep only the non-trivial strongly connected components.
liste_connected = nx.strongly_connected_components(G)
for item in liste_connected:
    if len(item) > 1:
        f.write('{}\n'.format('\t'.join(item)))
f.close()
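For debugging, I suppose I could capture the command's stderr and exit status instead of letting check_output raise, so the launcher log would show why the hadoop command actually fails. A minimal sketch using the same path as above:

#!/usr/bin/python
import subprocess

# Run the same hadoop command but capture stderr as well,
# so the Oozie launcher log shows the real failure reason.
proc = subprocess.Popen(
    "hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",
    shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate()
print "exit status:", proc.returncode
print "stderr:", err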
The corresponding shell action in Oozie's workflow.xml is the following:
<action name="final">
    <shell xmlns="uri:oozie:shell-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <configuration>
            <property>
                <name>mapred.job.queue.name</name>
                <value>${queueName}</value>
            </property>
        </configuration>
        <exec>connected_subprocess.py</exec>
        <file>connected_subprocess.py</file>
    </shell>
    <ok to="end" />
    <error to="kill" />
</action>
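Since the error below suggests that the hadoop classpath is not set up inside the launcher, one thing I could try is calling the hadoop binary by its absolute path with an explicit environment in the subprocess call. A minimal sketch, assuming Hadoop is installed under /usr/local/hadoop (a hypothetical location, to be replaced with the real one):

#!/usr/bin/python
import os
import subprocess

# Hypothetical install location -- replace with the actual HADOOP_HOME.
hadoop_home = "/usr/local/hadoop"

env = os.environ.copy()
env["HADOOP_HOME"] = hadoop_home
env["PATH"] = hadoop_home + "/bin:" + env.get("PATH", "")

# Call the hadoop binary by absolute path with the adjusted environment.
liste = subprocess.check_output(
    hadoop_home + "/bin/hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",
    shell=True, env=env).split('\n')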
When I run the Oozie job, the TaskTracker log shows these errors:
Error: Could not find or load main class org.apache.hadoop.fs.FsShell
Traceback (most recent call last):
  File "./connected_subprocess.py", line 6, in <module>
    liste=subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",shell=True).split('\n')
  File "/usr/lib64/python2.7/subprocess.py", line 575, in check_output
    raise CalledProcessError(retcode, cmd, output=output)
subprocess.CalledProcessError: Command 'hadoop fs -cat /user/root/output-data/calcul-proba/final.txt' returned non-zero exit status 1
Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
It seems that I cannot run a shell command from within my Python script when the script is launched by an Oozie shell action, even though everything works fine when I run the same script from an interactive shell.
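Since it works interactively, I suspect the environment seen by the Oozie launcher differs from my login shell. I could dump the Hadoop-related variables from inside the script to compare the two; a minimal sketch:

#!/usr/bin/python
import os

# Print the Hadoop-related environment variables the launcher sees,
# to compare with the interactive shell where the script works.
for key in sorted(os.environ):
    if "HADOOP" in key or key in ("PATH", "CLASSPATH", "JAVA_HOME"):
        print key, "=", os.environ[key]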
The log also says that the main class org.apache.hadoop.fs.FsShell cannot be found, even though I copied hadoop-core-1.2.1.jar into a lib folder next to my workflow.xml and job.properties files.
Is there any way to get around this limitation?