Hi Mark,

On Wed, Jul 22, 2020 at 4:49 PM Mark Bidewell <mbide...@gmail.com> wrote:
>
> Sorry if this is the wrong place for this. I am trying to debug an issue
> with this library:
> https://github.com/springml/spark-sftp
>
> When I attempt to create a DataFrame:
>
> spark.read.
>   format("com.springml.spark.sftp").
>   option("host", "...").
>   option("username", "...").
>   option("password", "...").
>   option("fileType", "csv").
>   option("inferSchema", "true").
>   option("tempLocation", "/srv/spark/tmp").
>   option("hdfsTempLocation", "/srv/spark/tmp").
>   load("...")
>
> What I am seeing is that the download occurs on the Spark driver, not on a
> Spark worker. This leads to a failure when Spark tries to create the
> DataFrame on the worker.
>
> I'm confused by this behavior. My understanding was that load() was lazily
> executed on the Spark workers. Why would some elements be executing on the
> driver?
Looking at the code, it appears that your SFTP plugin downloads the file to
a local location and opens it from there:
https://github.com/springml/spark-sftp/blob/090917547001574afa93cddaf2a022151a3f4260/src/main/scala/com/springml/spark/sftp/DefaultSource.scala#L38

You may have more luck with an SFTP Hadoop filesystem plugin that can read
sftp:// URLs directly.

Cheers,
Andrew

> Thanks for your help
> --
> Mark Bidewell
> http://www.linkedin.com/in/markbidewell

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
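For anyone finding this thread later: a minimal sketch of the direct-read approach Andrew suggests, assuming a Hadoop build that ships org.apache.hadoop.fs.sftp.SFTPFileSystem (Hadoop 2.8+). The host name, credentials, and file path below are placeholders, and the exact fs.sftp.* property names should be checked against your Hadoop version's SFTPFileSystem docs:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("sftp-read-sketch")
  .getOrCreate()

// Register Hadoop's SFTP filesystem for sftp:// URLs and supply
// per-host credentials. All values here are illustrative.
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.sftp.impl", "org.apache.hadoop.fs.sftp.SFTPFileSystem")
hadoopConf.set("fs.sftp.user.sftp.example.com", "myuser")
hadoopConf.set("fs.sftp.password.sftp.example.com.myuser", "mypassword")

// Read the CSV through the Hadoop FileSystem API, like any other
// Hadoop-compatible path, instead of downloading it to the driver first.
val df = spark.read
  .format("csv")
  .option("inferSchema", "true")
  .load("sftp://sftp.example.com/path/to/file.csv")
```

With this approach the file is opened through the Hadoop FileSystem layer wherever Spark schedules the read (note that schema inference still requires a pass over the data at load() time), rather than being copied to the driver's local disk by the connector.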