Spark: Using node-local files within functions?

2015-04-14 Thread Horsmann, Tobias
Hi,

I am trying to use Spark on YARN together with 3rd-party code that is unaware of 
distributed file systems. Passing HDFS file references to it therefore does not 
work.

My idea to resolve this issue was the following:

Within a function I take the HDFS file reference I receive as a parameter, copy the 
file into the local file system, and hand the 3rd-party components what they 
expect:
textFolder.map(new Function<String, List<String>>()
{
    public List<String> call(String inputFile)
            throws Exception
    {
        // resolve and copy the HDFS file to the local file system

        // get a local file pointer
        // (this function is executed on a node, so there is a local file system)

        // call the 3rd party library with the 'local file' reference

        // do other stuff and return the resulting List<String>
    }
});

This seems to work, but I am not sure whether it causes other problems at 
production file sizes. For example, the files I copy to the local file system might 
be large. Would this affect YARN somehow? Are there more advisable ways to befriend 
HDFS-unaware libraries with HDFS file pointers?
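For reference, here is a minimal sketch of the copy-to-local step above, using the 
Hadoop FileSystem API. It assumes the Hadoop configuration on the executor's 
classpath points at the cluster's HDFS; the class and method names are made up for 
illustration:

import java.io.File;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsLocalCopy
{
    // Copies one HDFS file onto the executor's local disk and returns the local file.
    public static File copyToLocal(String hdfsFile) throws Exception
    {
        File localFile = File.createTempFile("hdfs-copy-", ".dat");
        localFile.delete(); // let copyToLocalFile create the file itself
        FileSystem fs = FileSystem.get(new Configuration()); // assumes fs.defaultFS is HDFS
        fs.copyToLocalFile(false, new Path(hdfsFile),
                new Path(localFile.getAbsolutePath()), true); // raw local FS, no .crc sidecar
        return localFile;
    }
}

Inside call(String inputFile) one would invoke copyToLocal(inputFile), hand the 
returned path to the 3rd-party library, and delete the file afterwards so large 
copies do not pile up on the node's local disk.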

Regards,



Re: Spark: Using node-local files within functions?

2015-04-14 Thread Sandy Ryza
Hi Tobias,

It should be possible to get an InputStream from an HDFS file.  However, if
your libraries only work directly on files, then maybe that wouldn't work?
If that's the case and different tasks need different files, your way is
probably the best way.  If all tasks need the same file, a better option
would be to pass the file in with the --files option when you spark-submit,
which will cache the file between executors on the same node.
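A minimal sketch of both options, assuming the Hadoop configuration on the 
executors points at HDFS; the file name "model.bin" is only a placeholder:

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.spark.SparkFiles;

public class HdfsAccessSketch
{
    // Option 1: stream the HDFS file directly, if the library accepts an InputStream.
    public static InputStream openHdfsStream(String hdfsFile) throws Exception
    {
        FileSystem fs = FileSystem.get(new Configuration());
        return fs.open(new Path(hdfsFile));
    }

    // Option 2: a file shipped via `spark-submit --files /path/to/model.bin`
    // is distributed to every executor; SparkFiles.get resolves its local path
    // by file name.
    public static String localPathOfShippedFile()
    {
        return SparkFiles.get("model.bin");
    }
}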

-Sandy

On Tue, Apr 14, 2015 at 1:39 AM, Horsmann, Tobias 
tobias.horsm...@uni-due.de wrote:
