Hi Mark,

On Wed, Jul 22, 2020 at 4:49 PM Mark Bidewell <mbide...@gmail.com> wrote:
>
> Sorry if this is the wrong place for this. I am trying to debug an issue
> with this library:
> https://github.com/springml/spark-sftp
>
> When I attempt to create a DataFrame:
>
> spark.read.
>   format("com.springml.spark.sftp").
>   option("host", "...").
>   option("username", "...").
>   option("password", "...").
>   option("fileType", "csv").
>   option("inferSchema", "true").
>   option("tempLocation", "/srv/spark/tmp").
>   option("hdfsTempLocation", "/srv/spark/tmp").
>   load("...")
>
> What I am seeing is that the download occurs on the Spark driver, not on a
> Spark worker. This leads to a failure when Spark tries to create the
> DataFrame on the worker.
>
> I'm confused by this behavior. My understanding was that load() was lazily
> executed on the Spark workers. Why would some elements be executing on the
> driver?
Looking at the code, it appears that your SFTP plugin downloads the file to
a local location and opens it from there:
https://github.com/springml/spark-sftp/blob/090917547001574afa93cddaf2a022151a3f4260/src/main/scala/com/springml/spark/sftp/DefaultSource.scala#L38

You may have more luck with an SFTP Hadoop filesystem plugin that can read
sftp:// URLs directly.

Cheers,
Andrew

> Thanks for your help
> --
> Mark Bidewell
> http://www.linkedin.com/in/markbidewell

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
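For anyone finding this thread later: a minimal sketch of the direct-read approach Andrew suggests, assuming a Hadoop build that ships org.apache.hadoop.fs.sftp.SFTPFileSystem (Hadoop 2.8+). The host name, credentials, and file path below are placeholders, and the exact fs.sftp.* property names should be checked against your Hadoop version's SFTPFileSystem docs:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sftp-read-sketch")
  .getOrCreate()

// Register Hadoop's SFTP filesystem for sftp:// URLs and supply
// per-host credentials. All values here are illustrative.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.sftp.impl", "org.apache.hadoop.fs.sftp.SFTPFileSystem")
hadoopConf.set("fs.sftp.user.sftp.example.com", "myuser")
hadoopConf.set("fs.sftp.password.sftp.example.com.myuser", "mypassword")

// Read the CSV through the Hadoop FileSystem API, like any other
// Hadoop-compatible path, instead of downloading it to the driver first.
val df = spark.read
  .format("csv")
  .option("inferSchema", "true")
  .load("sftp://sftp.example.com/path/to/file.csv")
```

With this approach the file is opened through the Hadoop FileSystem layer wherever Spark schedules the read (note that schema inference still requires a pass over the data at load() time), rather than being copied to the driver's local disk by the connector.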