Yes, that's correct; data locality could get lost with that approach. If you need that, and also need a specialized input format, then it's probably also better to go the Java + JNI route completely.
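
[For the Java + JNI route, a minimal sketch of what the native side of such a call could look like. The com.example.ImageConverter class, the convert() signature, and the pass-through body are illustrative assumptions, not details from the thread:]

    /*
     * ImageConverter.c -- hypothetical native half of a JNI call.
     * The Java side would declare something like:
     *     package com.example;
     *     public class ImageConverter {
     *         static { System.loadLibrary("imageconverter"); }
     *         public static native byte[] convert(byte[] input);
     *     }
     * Build on Linux with:
     *     gcc -shared -fPIC -I$JAVA_HOME/include \
     *         -I$JAVA_HOME/include/linux ImageConverter.c \
     *         -o libimageconverter.so
     * and ship the .so via the distributed cache to the task nodes.
     */
    #include <jni.h>

    JNIEXPORT jbyteArray JNICALL
    Java_com_example_ImageConverter_convert(JNIEnv *env, jclass cls,
                                            jbyteArray input)
    {
        jsize len = (*env)->GetArrayLength(env, input);
        jbyte *in = (*env)->GetByteArrayElements(env, input, NULL);

        /* Call into the image-format library (.a/.so) here; as a
         * placeholder, this just copies the input bytes through. */
        jbyteArray out = (*env)->NewByteArray(env, len);
        (*env)->SetByteArrayRegion(env, out, 0, len, in);

        (*env)->ReleaseByteArrayElements(env, input, in, JNI_ABORT);
        return out;
    }
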
On Mon, Mar 18, 2013 at 2:05 AM, Julian Bui <[email protected]> wrote:
> Ah, thanks for clarifying, Harsh.
>
> One thing that concerns me is that you wrote "input can be mere HDFS
> file path strings", which implies that the tasks will not be
> guaranteed to run colocated with the data. Is that correct?
>
> -Julian
>
> On Sun, Mar 17, 2013 at 6:28 AM, Harsh J <[email protected]> wrote:
>>
>> Hi,
>>
>> Yes, streaming lets you send in arbitrary programs, such as a shell
>> script or a Python script, execute them upon the input, and collect
>> the output (it is agnostic to what you send, as long as you also
>> send the full launching environment and command instructions). The
>> same mechanism can be leveraged to have your MR environment start a
>> number of map tasks whose input is mere HDFS file path strings (but
>> not the file data); your program can then use libhdfs (shipped
>> along, or installed on all nodes) to read those paths and process
>> them in any way you see fit. Essentially, you can tweak the
>> streaming functionality to make it do what you need, achieve
>> different forms of parallelism, and so on.
>>
>> Yes, what I am suggesting is perhaps similar to Hadoop Pipes, but
>> Pipes has generally fallen out of use over the past few major
>> releases, and Streaming is recommended in its place.
>>
>> On Sun, Mar 17, 2013 at 4:20 PM, Julian Bui <[email protected]> wrote:
>> > Hello Harsh,
>> >
>> > Thanks for the reply. I just want to verify that I understand your
>> > comments.
>> >
>> > It sounds like you're saying I should write a C/C++ application
>> > and get access to HDFS using libhdfs. What I'm a little confused
>> > about is what you mean by "use a streaming program". Do you mean I
>> > should use the Hadoop streaming interface to call some native
>> > binary that I wrote? I was not even aware that the streaming
>> > interface could execute native binaries. I thought that anything
>> > using the Hadoop streaming interface only interacts with stdin and
>> > stdout and cannot make modifications to HDFS. Or did you mean that
>> > I should use Hadoop Pipes to write a C/C++ application?
>> >
>> > Anyway, I hope that you can help me clear things up in my head.
>> >
>> > Thanks,
>> > -Julian
>> >
>> > On Sun, Mar 17, 2013 at 2:50 AM, Harsh J <[email protected]> wrote:
>> >>
>> >> You're confusing two things here. HDFS is a data storage
>> >> filesystem; MR does not have anything to do with HDFS (generally
>> >> speaking).
>> >>
>> >> A reducer runs as a regular JVM on a provided node, and can
>> >> execute any program you'd like it to by downloading it onto its
>> >> configured local filesystem and executing it.
>> >>
>> >> If your goal is merely to run a regular program over data that is
>> >> sitting in HDFS, that can be achieved. If your library is in C,
>> >> then simply use a streaming program to run it and use the libhdfs
>> >> HDFS API (C/C++) to read data into your functions from HDFS
>> >> files. Would this not suffice?
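
[A minimal sketch of the streaming-plus-libhdfs approach described
above: a map task that reads one HDFS path string per stdin line and
pulls each file's bytes through the libhdfs C API. The process_image()
hook, the buffer size, and the single-read simplification are
illustrative assumptions, not details from the thread:]

    /*
     * streaming_mapper.c -- sketch of a Hadoop Streaming map task that
     * receives HDFS file path strings on stdin (one per line) and
     * reads each file through the libhdfs C API. Link against libhdfs:
     *     gcc streaming_mapper.c -lhdfs -o streaming_mapper
     * (the task environment must also provide a JVM for libhdfs).
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include "hdfs.h"

    /* Hypothetical stand-in for the real image-conversion routine. */
    static void process_image(const char *path, const char *data,
                              int len)
    {
        /* Emit a key/tab/value line, per streaming conventions. */
        printf("%s\t%d bytes processed\n", path, len);
    }

    static char buf[1 << 20];   /* 1 MiB read buffer */

    int main(void)
    {
        /* "default" picks up fs.default.name from the Hadoop config. */
        hdfsFS fs = hdfsConnect("default", 0);
        if (!fs) return 1;

        char line[4096];
        while (fgets(line, sizeof(line), stdin)) {
            line[strcspn(line, "\r\n")] = '\0';   /* strip newline */
            if (!line[0]) continue;

            hdfsFile f = hdfsOpenFile(fs, line, O_RDONLY, 0, 0, 0);
            if (!f) {
                fprintf(stderr, "cannot open %s\n", line);
                continue;
            }
            /* Simplified: reads only the first buffer's worth; a real
             * task would loop until hdfsRead() returns 0. */
            tSize n = hdfsRead(fs, f, buf, sizeof(buf));
            if (n > 0) process_image(line, buf, (int)n);
            hdfsCloseFile(fs, f);
        }
        hdfsDisconnect(fs);
        return 0;
    }
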
>> >> On Sun, Mar 17, 2013 at 3:09 PM, Julian Bui <[email protected]>
>> >> wrote:
>> >> > Hi Hadoop users,
>> >> >
>> >> > I just want to verify that there is no way to put a binary on
>> >> > HDFS and execute it using the Hadoop Java API. If not, I would
>> >> > appreciate advice on creating an implementation that uses
>> >> > native libraries.
>> >> >
>> >> > "In contrast to the POSIX model, there are no sticky, setuid or
>> >> > setgid bits for files as there is no notion of executable
>> >> > files." Is there no workaround?
>> >> >
>> >> > A little bit more about what I'm trying to do: I have a binary
>> >> > that converts my image to another image format. I currently
>> >> > want to put it in the distributed cache and tell the reducer to
>> >> > execute the binary on the data on HDFS. However, since I can't
>> >> > set the execute permission bit on that file, it seems that I
>> >> > cannot do that.
>> >> >
>> >> > Since I cannot use the binary, it seems like I have to use my
>> >> > own implementation to do this. The challenge is that the
>> >> > libraries I can use to do this are .a and .so files. Would I
>> >> > have to use JNI, package the libraries in the distributed
>> >> > cache, and then have the reducer find and use those libraries
>> >> > on the task nodes? Actually, I wouldn't want to use JNI; I'd
>> >> > probably want to use Java Native Access (JNA) instead. Has
>> >> > anyone used JNA with Hadoop and been successful? Are there
>> >> > problems I'll encounter?
>> >> >
>> >> > Please let me know.
>> >> >
>> >> > Thanks,
>> >> > -Julian
>> >>
>> >> --
>> >> Harsh J
>>
>> --
>> Harsh J

--
Harsh J
