Hello,

When you run a "hadoop fs" command on a slave node, it needs a proper
classpath as well. Try setting a HADOOP_CLASSPATH env-var to "./*"
and/or "./lib/*" before you launch the subprocess, so the command knows
where to find the libraries it should run with. Note that FsShell may
need more than just hadoop-core.jar to run.
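
For example, something along these lines in the Python script (a rough
sketch; the wildcard entries are assumptions about where the action's
jars land in the task's working directory):

    import os
    import subprocess

    # Assumed layout: jars shipped with the workflow end up in the task's
    # current working directory and its lib/ subdirectory.
    env = os.environ.copy()
    env["HADOOP_CLASSPATH"] = "./*:./lib/*"

    liste = subprocess.check_output(
        "hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",
        shell=True, env=env).split('\n')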

An easier way would be to use a Java action instead, as Hadoop provides
no first-class Python support yet. Alternatively, you can use
WebHDFS/HttpFS, which exposes HDFS listing and modification operations
over a REST API, so you can call it directly from Python or via curl.
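
For example, reading the same file over WebHDFS from Python (a rough
sketch; it assumes dfs.webhdfs.enabled is true and that the NameNode's
HTTP port is the default 50070; replace namenode-host and the user as
appropriate):

    import urllib2

    url = ("http://namenode-host:50070/webhdfs/v1"
           "/user/root/output-data/calcul-proba/final.txt"
           "?op=OPEN&user.name=root")

    # The NameNode replies with a redirect to a DataNode; urllib2 follows it.
    liste = urllib2.urlopen(url).read().split('\n')

The rough curl equivalent would be:
curl -i -L "http://namenode-host:50070/webhdfs/v1/user/root/output-data/calcul-proba/final.txt?op=OPEN&user.name=root"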

On Tue, Sep 3, 2013 at 4:07 PM, RLK
<[email protected]> wrote:
> I'm trying to use subprocess in a Python script that I call from an Oozie
> shell action. The subprocess is supposed to read a file stored in Hadoop's
> HDFS.
>
> I'm using hadoop-1.2.1 in pseudo-distributed mode and oozie-3.3.2.
>
> Here is the Python script, named connected_subprocess.py:
>
>
>     #!/usr/bin/python
>     import subprocess
>     import networkx as nx
>     liste=subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",shell=True).split('\n')
>     G=nx.DiGraph()
>     f=open("/home/rlk/liste_strongly_connected.txt","wb")
>     for item in liste:
>         try:
>             app1,app2=item.split('\t')
>             G.add_edge(app1,app2)
>         except:
>             pass
>     liste_connected=nx.strongly_connected_components(G)
>     for item in liste_connected:
>         if len(item)>1:
>             f.write('{}\n'.format('\t'.join(item)))
>     f.close()
>
> The corresponding shell action in Oozie's workflow.xml is the following:
>
>
>      <action name="final">
>             <shell xmlns="uri:oozie:shell-action:0.1">
>                 <job-tracker>${jobTracker}</job-tracker>
>                 <name-node>${nameNode}</name-node>
>                 <configuration>
>                     <property>
>                         <name>mapred.job.queue.name</name>
>                         <value>${queueName}</value>
>                     </property>
>                 </configuration>
>                 <exec>connected_subprocess.py</exec>
>                 <file>connected_subprocess.py</file>
>              </shell>
>              <ok to="end" />
>              <error to="kill" />
>         </action>
>
> When I run the Oozie job, the TaskTracker log shows these errors:
>
>     Error: Could not find or load main class org.apache.hadoop.fs.FsShell
>     Traceback (most recent call last):
>       File "./connected_subprocess.py", line 6, in <module>
>         liste=subprocess.check_output("hadoop fs -cat /user/root/output-data/calcul-proba/final.txt",shell=True).split('\n')
>       File "/usr/lib64/python2.7/subprocess.py", line 575, in check_output
>         raise CalledProcessError(retcode, cmd, output=output)
>     subprocess.CalledProcessError: Command 'hadoop fs -cat /user/root/output-data/calcul-proba/final.txt' returned non-zero exit status 1
>     Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.ShellMain], exit code [1]
>
>
> It seems that I cannot run a shell command from my Python script when the
> script is embedded in an Oozie action, even though everything works fine
> when I run the script from my interactive shell.
> The log also says that the main class org.apache.hadoop.fs.FsShell is
> missing, even though I copied hadoop-core-1.2.1.jar into a lib folder next
> to my workflow.xml and job.properties files.
>
> Is there any way I can bypass this limitation?



-- 
Harsh J
