Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

2020-08-25 Thread Sean Owen
That looks roughly right, though you will want to mark the Spark dependencies as provided. Do you need netlib directly? PySpark won't matter here if you're working in Scala; what's installed with pip would not matter in any event.
On Tue, Aug 25, 2020 at 3:30 AM Aviad Klein wrote:
> > Hey Chris and Sean,
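As a rough illustration of the "provided" advice above, a build.sbt fragment might look like the sketch below. The module names (spark-sql, spark-mllib) are assumptions made for the example, since the dependency list in the thread is cut off; 2.4.4 is the version mentioned for the PySpark install and is assumed here to apply to the Scala side as well.

```scala
// build.sbt (sketch only; the module list is assumed, not taken from the poster's build)
val sparkVersion = "2.4.4" // version mentioned in the thread for PySpark

libraryDependencies ++= Seq(
  // Provided: on the compile classpath, but expected to be supplied by the Spark runtime
  "org.apache.spark" %% "spark-sql"   % sparkVersion % Provided,
  "org.apache.spark" %% "spark-mllib" % sparkVersion % Provided
)
```

With this scope the Spark jars are used for compilation but are not treated as something the application has to ship, which is usually what you want when the resulting jar is loaded into an existing Spark or PySpark session.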

Re: Referencing a scala/java PipelineStage from pyspark - constructor issues with HasInputCol

2020-08-25 Thread Aviad Klein
Hey Chris and Sean, thanks for taking the time to answer. Perhaps my installation of PySpark is off, although I did use version 2.4.4. When developing in Scala and PySpark, how do you set up your environment? I used sbt for Scala Spark: libraryDependencies ++= Seq( "org.apache.spark" %%

RE: Spark 3.0 using S3 taking long time for some set of TPC DS Queries

2020-08-25 Thread Rao, Abhishek (Nokia - IN/Bangalore)
Hi Luca, Thanks for sharing the feedback. We'll include these recommendations in our tests. However, we feel the issue that we're seeing right now is due to the difference in the size of the data downloaded from storage by the executors. In the case of S3, executors are downloading almost 50 GB of data