Yes, that's correct; data locality could get lost with that approach. If you need that, and also need a specialized input format, then it's probably also better to go the Java + JNI route completely.
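
[For the Java + JNI route, a minimal sketch of what the native side of such a call could look like. The com.example.ImageConverter class, the convert() signature, and the pass-through body are illustrative assumptions, not details from the thread:]

    /*
     * ImageConverter.c -- hypothetical native half of a JNI call.
     * The Java side would declare something like:
     *     package com.example;
     *     public class ImageConverter {
     *         static { System.loadLibrary("imageconverter"); }
     *         public static native byte[] convert(byte[] input);
     *     }
     * Build on Linux with:
     *     gcc -shared -fPIC -I$JAVA_HOME/include \
     *         -I$JAVA_HOME/include/linux ImageConverter.c \
     *         -o libimageconverter.so
     * and ship the .so via the distributed cache to the task nodes.
     */
    #include <jni.h>

    JNIEXPORT jbyteArray JNICALL
    Java_com_example_ImageConverter_convert(JNIEnv *env, jclass cls,
                                            jbyteArray input)
    {
        jsize len = (*env)->GetArrayLength(env, input);
        jbyte *in = (*env)->GetByteArrayElements(env, input, NULL);

        /* Call into the image-format library (.a/.so) here; as a
         * placeholder, this just copies the input bytes through. */
        jbyteArray out = (*env)->NewByteArray(env, len);
        (*env)->SetByteArrayRegion(env, out, 0, len, in);

        (*env)->ReleaseByteArrayElements(env, input, in, JNI_ABORT);
        return out;
    }
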
On Mon, Mar 18, 2013 at 2:05 AM, Julian Bui <[email protected]> wrote:
> Ah, thanks for clarifying, Harsh.
>
> One thing that concerns me is that you wrote "input can be mere HDFS
> file path strings", which implies that the tasks will not be
> guaranteed to run colocated with the data. Is that correct?
>
> -Julian
>
> On Sun, Mar 17, 2013 at 6:28 AM, Harsh J <[email protected]> wrote:
>>
>> Hi,
>>
>> Yes, streaming lets you send in arbitrary programs, such as a shell
>> script or a Python script, execute them upon the input, and collect
>> the output (it is agnostic to what you send, as long as you also
>> send the full launching environment and command instructions). The
>> same mechanism can be leveraged to have your MR environment start a
>> number of map tasks whose input is mere HDFS file path strings (but
>> not the file data); your program can then use libhdfs (shipped
>> along, or installed on all nodes) to read those paths and process
>> them in any way you see fit. Essentially, you can tweak the
>> streaming functionality to make it do what you need, achieve
>> different forms of parallelism, and so on.
>>
>> Yes, what I am suggesting is perhaps similar to Hadoop Pipes, but
>> Pipes has generally fallen out of use over the past few major
>> releases, and Streaming is recommended in its place.
>>
>> On Sun, Mar 17, 2013 at 4:20 PM, Julian Bui <[email protected]> wrote:
>> > Hello Harsh,
>> >
>> > Thanks for the reply. I just want to verify that I understand your
>> > comments.
>> >
>> > It sounds like you're saying I should write a C/C++ application
>> > and get access to HDFS using libhdfs. What I'm a little confused
>> > about is what you mean by "use a streaming program". Do you mean I
>> > should use the Hadoop streaming interface to call some native
>> > binary that I wrote? I was not even aware that the streaming
>> > interface could execute native binaries. I thought that anything
>> > using the Hadoop streaming interface only interacts with stdin and
>> > stdout and cannot make modifications to HDFS. Or did you mean that
>> > I should use Hadoop Pipes to write a C/C++ application?
>> >
>> > Anyway, I hope that you can help me clear things up in my head.
>> >
>> > Thanks,
>> > -Julian
>> >
>> > On Sun, Mar 17, 2013 at 2:50 AM, Harsh J <[email protected]> wrote:
>> >>
>> >> You're confusing two things here. HDFS is a data storage
>> >> filesystem; MR does not have anything to do with HDFS (generally
>> >> speaking).
>> >>
>> >> A reducer runs as a regular JVM on a provided node, and can
>> >> execute any program you'd like it to by downloading it onto its
>> >> configured local filesystem and executing it.
>> >>
>> >> If your goal is merely to run a regular program over data that is
>> >> sitting in HDFS, that can be achieved. If your library is in C,
>> >> then simply use a streaming program to run it and use the libhdfs
>> >> HDFS API (C/C++) to read data into your functions from HDFS
>> >> files. Would this not suffice?
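
[A minimal sketch of the streaming-plus-libhdfs approach described
above: a map task that reads one HDFS path string per stdin line and
pulls each file's bytes through the libhdfs C API. The process_image()
hook, the buffer size, and the single-read simplification are
illustrative assumptions, not details from the thread:]

    /*
     * streaming_mapper.c -- sketch of a Hadoop Streaming map task that
     * receives HDFS file path strings on stdin (one per line) and
     * reads each file through the libhdfs C API. Link against libhdfs:
     *     gcc streaming_mapper.c -lhdfs -o streaming_mapper
     * (the task environment must also provide a JVM for libhdfs).
     */
    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include "hdfs.h"

    /* Hypothetical stand-in for the real image-conversion routine. */
    static void process_image(const char *path, const char *data,
                              int len)
    {
        /* Emit a key/tab/value line, per streaming conventions. */
        printf("%s\t%d bytes processed\n", path, len);
    }

    static char buf[1 << 20];   /* 1 MiB read buffer */

    int main(void)
    {
        /* "default" picks up fs.default.name from the Hadoop config. */
        hdfsFS fs = hdfsConnect("default", 0);
        if (!fs) return 1;

        char line[4096];
        while (fgets(line, sizeof(line), stdin)) {
            line[strcspn(line, "\r\n")] = '\0';   /* strip newline */
            if (!line[0]) continue;

            hdfsFile f = hdfsOpenFile(fs, line, O_RDONLY, 0, 0, 0);
            if (!f) {
                fprintf(stderr, "cannot open %s\n", line);
                continue;
            }
            /* Simplified: reads only the first buffer's worth; a real
             * task would loop until hdfsRead() returns 0. */
            tSize n = hdfsRead(fs, f, buf, sizeof(buf));
            if (n > 0) process_image(line, buf, (int)n);
            hdfsCloseFile(fs, f);
        }
        hdfsDisconnect(fs);
        return 0;
    }
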
>> >> On Sun, Mar 17, 2013 at 3:09 PM, Julian Bui <[email protected]>
>> >> wrote:
>> >> > Hi Hadoop users,
>> >> >
>> >> > I just want to verify that there is no way to put a binary on
>> >> > HDFS and execute it using the Hadoop Java API. If not, I would
>> >> > appreciate advice on creating an implementation that uses
>> >> > native libraries.
>> >> >
>> >> > "In contrast to the POSIX model, there are no sticky, setuid or
>> >> > setgid bits for files as there is no notion of executable
>> >> > files." Is there no workaround?
>> >> >
>> >> > A little bit more about what I'm trying to do: I have a binary
>> >> > that converts my image to another image format. I currently
>> >> > want to put it in the distributed cache and tell the reducer to
>> >> > execute the binary on the data on HDFS. However, since I can't
>> >> > set the execute permission bit on that file, it seems that I
>> >> > cannot do that.
>> >> >
>> >> > Since I cannot use the binary, it seems like I have to use my
>> >> > own implementation to do this. The challenge is that the
>> >> > libraries I can use to do this are .a and .so files. Would I
>> >> > have to use JNI, package the libraries in the distributed
>> >> > cache, and then have the reducer find and use those libraries
>> >> > on the task nodes? Actually, I wouldn't want to use JNI; I'd
>> >> > probably want to use Java Native Access (JNA) instead. Has
>> >> > anyone used JNA with Hadoop and been successful? Are there
>> >> > problems I'll encounter?
>> >> >
>> >> > Please let me know.
>> >> >
>> >> > Thanks,
>> >> > -Julian
>> >>
>> >> --
>> >> Harsh J
>>
>> --
>> Harsh J

--
Harsh J
