Hello,

When you run a "hadoop fs" command on a slave node, it needs a proper classpath as well. Perhaps try setting a HADOOP_CLASSPATH env-var to "./*" and/or "./lib/*" before you launch the subprocess, so the child process knows where to find the required libraries. Note that the FsShell may need more than just hadoop-core.jar to run.
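For example, here is a minimal sketch of passing that variable to the child process from within the Python script itself. The "./*:./lib/*" value is an assumption about where the action materializes its jars in the task's working directory, so adjust it to your layout:

#!/usr/bin/python
# Hedged sketch: export HADOOP_CLASSPATH only for the child "hadoop fs"
# process. The paths below assume the jars shipped in the workflow's
# lib/ folder land in the task's current working directory.
import os
import subprocess

env = os.environ.copy()
env["HADOOP_CLASSPATH"] = "./*:./lib/*"
liste = subprocess.check_output(
    "hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",
    shell=True, env=env).split('\n')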
An easier way would be to use a Java action instead, as Hadoop provides no first-class Python support yet. Alternatively, you can use WebHDFS/HTTPFS, which exposes HDFS listing and modification operations over a REST API, so you can call it from Python directly or via curl (a sketch follows at the end of this message).

On Tue, Sep 3, 2013 at 4:07 PM, RLK <[email protected]> wrote:
> I'm trying to use subprocess in a Python script which I call from an Oozie
> shell action. The subprocess is supposed to read a file stored in HDFS.
>
> I'm using hadoop-1.2.1 in pseudo-distributed mode and oozie-3.3.2.
>
> Here is the Python script, named connected_subprocess.py:
>
> #!/usr/bin/python
> import subprocess
> import networkx as nx
> liste=subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",shell=True).split('\n')
> G=nx.DiGraph()
> f=open("/home/rlk/liste_strongly_connected.txt","wb")
> for item in liste:
>     try:
>         app1,app2=item.split('\t')
>         G.add_edge(app1,app2)
>     except:
>         pass
> liste_connected=nx.strongly_connected_components(G)
> for item in liste_connected:
>     if len(item)>1:
>         f.write('{}\n'.format('\t'.join(item)))
> f.close()
>
> The corresponding shell action in Oozie's workflow.xml is the following:
>
> <action name="final">
>     <shell xmlns="uri:oozie:shell-action:0.1">
>         <job-tracker>${jobTracker}</job-tracker>
>         <name-node>${nameNode}</name-node>
>         <configuration>
>             <property>
>                 <name>mapred.job.queue.name</name>
>                 <value>${queueName}</value>
>             </property>
>         </configuration>
>         <exec>connected_subprocess.py</exec>
>         <file>connected_subprocess.py</file>
>     </shell>
>     <ok to="end" />
>     <error to="kill" />
> </action>
>
> When I run the Oozie job, the tasktracker log reads these errors:
>
> Error: Could not find or load main class org.apache.hadoop.fs.FsShell
> Traceback (most recent call last):
>   File "./connected_subprocess.py", line 6, in <module>
>     liste=subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",shell=True).split('\n')
>   File "/usr/lib64/python2.7/subprocess.py", line 575, in check_output
>     raise CalledProcessError(retcode, cmd, output=output)
> subprocess.CalledProcessError: Command 'hadoop fs -cat /user/root/output-data/calcul-proba/final.txt' returned non-zero exit status 1
> Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
>
> It seems that I cannot run a shell command from my Python script when the
> script is embedded in an Oozie action, since everything works fine when I
> run the script from my interactive shell.
> The log also says that the main class org.apache.hadoop.fs.FsShell is
> missing, whereas I copied hadoop-core-1.2.1.jar into a lib folder next to
> my workflow.xml and job.properties files.
>
> Is there any way I can bypass this limitation?

--
Harsh J
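P.S. For reference, a minimal sketch of the WebHDFS approach in Python 2 (matching your traceback's interpreter). The namenode host/port and the requirement that dfs.webhdfs.enabled be set to true are assumptions about your pseudo-distributed setup, not verified facts about it:

#!/usr/bin/python
# Hedged sketch: read the HDFS file over WebHDFS instead of shelling out,
# so the shell action needs no Hadoop classpath at all.
# Assumes dfs.webhdfs.enabled=true in hdfs-site.xml and the namenode web
# port 50070 (the hadoop-1.x default); adjust both for your cluster.
import urllib2

url = ("http://localhost:50070/webhdfs/v1"
       "/user/root/output-data/calcul-proba/final.txt"
       "?op=OPEN&user.name=root")
# urllib2 follows the 307 redirect to a datanode automatically.
liste = urllib2.urlopen(url).read().split('\n')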
