Hi Tobias,

It should be possible to get an InputStream from an HDFS file.  However, if
your libraries only work directly on local files, that may not help.  If
that's the case and different tasks need different files, your approach is
probably the best one.  If all tasks need the same file, a better option
would be to pass the file in with the --files option when you spark-submit,
which caches the file so it is shared by executors on the same node.
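To sketch both options (untested, and the paths and file names below are just
placeholders): the Hadoop FileSystem API can give you a plain InputStream on
an HDFS path, roughly like

    import java.io.InputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Path hdfsPath = new Path("hdfs:///user/tobias/input.txt");
    FileSystem fs = hdfsPath.getFileSystem(new Configuration());
    InputStream in = fs.open(hdfsPath);  // FSDataInputStream is an InputStream
    // ... pass 'in' to the library if it accepts streams, then close it ...
    in.close();

For the second option, you would submit with something like
spark-submit --files /local/path/model.bin ... and then look up the cached
local copy inside your tasks:

    import org.apache.spark.SparkFiles;

    // local path of the copy that --files (or SparkContext.addFile) shipped
    String localModelPath = SparkFiles.get("model.bin");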

-Sandy

On Tue, Apr 14, 2015 at 1:39 AM, Horsmann, Tobias <
tobias.horsm...@uni-due.de> wrote:

>  Hi,
>
>  I am trying to use Spark in combination with Yarn with 3rd party code
> which is unaware of distributed file systems. Providing hdfs file
> references thus does not work.
>
>  My idea to resolve this issue was the following:
>
>  Within a function I take the HDFS file reference I get as a parameter,
> copy it to the local file system, and hand the 3rd party components the
> local file they expect:
> textFolder.map(new Function<String, List<...>>()
>         {
>             public List<...> call(String inputFile)
>                 throws Exception
>             {
>                 // resolve the HDFS reference and copy the file to the
>                 // local file system (this function is executed on a
>                 // worker node, so there is a local file system)
>
>                 // get a pointer to the local copy
>
>                 // call the 3rd party library with the 'local file' reference
>
>                 // do other stuff
>             }
>         });
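>
> Concretely, I imagined the copy step inside call() roughly like this
> (untested; the temp-file naming is just a placeholder):
>
>     import java.io.File;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
>
>     Path src = new Path(inputFile);
>     FileSystem fs = src.getFileSystem(new Configuration());
>     // copy the HDFS file to a temporary file on the node's local disk
>     File localFile = File.createTempFile("spark-input", ".tmp");
>     fs.copyToLocalFile(src, new Path(localFile.getAbsolutePath()));
>     // hand localFile.getAbsolutePath() to the 3rd party library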
>
> This seems to work, but I am not sure whether it might cause other
> problems at production file sizes. E.g. the files I copy to the local
> file system might be large. Would this affect Yarn somehow? Are there
> more advisable ways to make HDFS-unaware libraries work with HDFS file
> references?
>
>  Regards,
>
>
