Hello Hadoop users,

I have a library that takes a string as input, finds the corresponding file on HDFS, and performs operations on it. At the moment, though, it doesn't take advantage of node awareness: the code may or may not run on the node that actually holds the data. I'd like to fix this.
***Background***

A little more about the library I mentioned: I modified an imagery library so that it takes a string URL and fetches the corresponding input stream from HDFS instead of from a regular file system. So it takes the string "hdfs://blahblahblah/image.bmp", and the library then holds a reference to that file's input stream and can operate on the image.

***Problem***

I pass the MapReduce application a list of these image URLs, which get handed to the HDFS-ified library. But the files behind those URLs may or may not live on the task node, so I don't get locality: the computation isn't running where the data is. What are my best options for taking advantage of node awareness in this situation? I've been thinking through what my options are...

***Brainstorming solutions***

One possibility (though I'm not sure about it) is WholeFileInputFormat, as described in O'Reilly's book. But that takes a file and hands you the byte array representing the entire file, and I don't think I want that: some of my files can be a couple of gigabytes in size (typically ~200 MB), which could exhaust memory.

Another option would be to take better advantage of the HDFS-ified library: have my task interpret the string URL, determine which node that file lives on, and then execute ON that node. I have no clue how I'd actually go about doing this; I've put a rough sketch of what I imagine in the P.S. below.

I'm sure there are other options as well, but I'm just not familiar enough with Hadoop to know them, and I was hoping someone out there might be able to help me out.

Thanks in advance,
-Julian
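P.S. While poking around the FileSystem API, I found fs.getFileBlockLocations(), which seems to answer the "which nodes hold this file?" part. Here's a minimal sketch of what I mean (the class name WhereIsMyImage is made up, and the URL is just the placeholder from above):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WhereIsMyImage {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder URL from the message above
        Path image = new Path("hdfs://blahblahblah/image.bmp");
        FileSystem fs = image.getFileSystem(conf);

        FileStatus status = fs.getFileStatus(image);
        // One BlockLocation per block; each one knows which datanodes hold a replica
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            for (String host : block.getHosts()) {
                System.out.println("block @ " + block.getOffset() + " -> " + host);
            }
        }
    }
}
```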
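Building on that, I wonder whether the real fix is an InputFormat that emits file *paths* rather than file *bytes*, but still advertises the blocks' hosts as locality hints, so the scheduler tries to run each map task where the data is and the memory problem of WholeFileInputFormat goes away. This is only a sketch of what I imagine, not something I've tested: ImagePathInputFormat is a made-up name, and it naively uses only the first block's hosts, which is probably dubious for multi-block files.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// One split per image file; each split advertises the hosts that store the
// file's first block, so the scheduler can place the map task on (or near) the data.
public class ImagePathInputFormat extends FileInputFormat<Text, NullWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false; // the library needs the whole image, so never split it
    }

    @Override
    public List<InputSplit> getSplits(JobContext job) throws IOException {
        List<InputSplit> splits = new ArrayList<InputSplit>();
        for (FileStatus status : listStatus(job)) {
            FileSystem fs = status.getPath().getFileSystem(job.getConfiguration());
            BlockLocation[] blocks =
                fs.getFileBlockLocations(status, 0, status.getLen());
            // Naive: use only the first block's replica hosts as locality hints
            String[] hosts = blocks.length > 0 ? blocks[0].getHosts() : new String[0];
            splits.add(new FileSplit(status.getPath(), 0, status.getLen(), hosts));
        }
        return splits;
    }

    @Override
    public RecordReader<Text, NullWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Emits exactly one record per split: the file's URL as the key
        return new RecordReader<Text, NullWritable>() {
            private FileSplit fileSplit;
            private boolean done = false;

            @Override
            public void initialize(InputSplit split, TaskAttemptContext context) {
                fileSplit = (FileSplit) split;
            }

            @Override
            public boolean nextKeyValue() {
                if (done) return false;
                done = true;
                return true;
            }

            @Override
            public Text getCurrentKey() {
                return new Text(fileSplit.getPath().toString());
            }

            @Override
            public NullWritable getCurrentValue() {
                return NullWritable.get();
            }

            @Override
            public float getProgress() {
                return done ? 1.0f : 0.0f;
            }

            @Override
            public void close() {}
        };
    }
}
```

In the driver I'd then set job.setInputFormatClass(ImagePathInputFormat.class) and add the image directory via FileInputFormat.addInputPath(...), and each mapper would just pass key.toString() to the HDFS-ified library. As I understand it, Hadoop prefers (but doesn't guarantee) running each task on one of the hosts the split reports. Does that sound like a sane approach?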
