Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-30 Thread Lars Francke
+1

On Fri, Aug 31, 2018 at 8:11 AM, Reynold Xin wrote:
> I actually had a similar use case a while ago, but not entirely the same.
> In my use case, Spark is already up, but I want to make sure all existing
> (and new) executors run some specific code. Can we update the API to
> support that? I

Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-30 Thread Reynold Xin
I actually had a similar use case a while ago, but not entirely the same. In my use case, Spark is already up, but I want to make sure all existing (and new) executors run some specific code. Can we update the API to support that? I think that's doable if we split the design into two: one is the ab
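The behavior Reynold asks for — plugins registered after startup still reaching executors that are already running — can be sketched in a few lines. This is a toy pure-Python illustration only; the class names `ExecutorPlugin` and `PluginRegistry` are hypothetical stand-ins, not the API proposed in SPARK-24918:

```python
class ExecutorPlugin:
    """Hypothetical plugin hook: user code run once per executor."""
    def init(self, executor_id):
        pass
    def shutdown(self, executor_id):
        pass

class PluginRegistry:
    """Toy registry: a plugin registered after startup is initialized on
    all existing executors, and any executor that starts later picks up
    every plugin registered so far."""
    def __init__(self):
        self.plugins = []
        self.executors = []
        self.initialized = []  # (executor_id, plugin) pairs, for inspection

    def register(self, plugin):
        self.plugins.append(plugin)
        # Run on all executors that are already up.
        for ex in self.executors:
            plugin.init(ex)
            self.initialized.append((ex, plugin))

    def executor_started(self, executor_id):
        self.executors.append(executor_id)
        # A new executor runs every previously registered plugin.
        for p in self.plugins:
            p.init(executor_id)
            self.initialized.append((executor_id, p))
```

This is the "split the design into two" shape: the plugin contract (init/shutdown) is separate from the registration/dispatch machinery that decides when each hook fires.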

Re: [DISCUSS] move away from python doctests

2018-08-30 Thread Felix Cheung
+1 on what Li said. And +1 on getting more coverage in unit tests; however, oftentimes we omit Python unit tests deliberately when the Python “wrapper” is trivial. This is what I’ve learned over the years from the previous PySpark maintainers. Admittedly, gaps are there.

Re: SPIP: Executor Plugin (SPARK-24918)

2018-08-30 Thread Felix Cheung
+1

From: Mridul Muralidharan
Sent: Wednesday, August 29, 2018 1:27:27 PM
To: dev@spark.apache.org
Subject: Re: SPIP: Executor Plugin (SPARK-24918)

+1 I left a couple of comments in NiharS's PR, but this is very useful to have in Spark! Regards, Mridul On Fri, Au

data source api v2 refactoring

2018-08-30 Thread Reynold Xin
I spent some time last week looking at the current data source v2 apis, and I thought we should be a bit more buttoned up in terms of the abstractions and the guarantees Spark provides. In particular, I feel we need the following levels of "abstractions", to fit the use cases in Spark, from batch,
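The preview cuts off before Reynold lists the levels, but the general layering idea — a format locates tables, a table hands out scans, and a scan is one consistent read of the data — can be sketched abstractly. All names here (`Format`, `Table`, `Scan`) are illustrative stand-ins, not the actual DataSourceV2 interfaces:

```python
# Toy layering sketch (not the real DataSourceV2 API): Format -> Table -> Scan.

class Scan:
    """One scan == one consistent view of the data at scan-creation time."""
    def __init__(self, rows):
        self._rows = rows
    def read(self):
        return list(self._rows)

class Table:
    """A logical table; each reader asks it for a fresh scan."""
    def __init__(self, rows):
        self._rows = rows
    def new_scan(self):
        # Snapshot now, so concurrent batch/streaming readers each get a
        # stable view even if the table keeps changing underneath.
        return Scan(tuple(self._rows))

class Format:
    """The entry point: resolves a name to a Table."""
    def __init__(self):
        self._tables = {}
    def load(self, name):
        return self._tables.setdefault(name, Table([]))
```

Separating these levels is what lets one abstraction serve batch and streaming alike: the guarantees attach to the scan, not to the table or the format.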

Re: Spark Streaming : Multiple sources found for csv : Error

2018-08-30 Thread Srabasti Banerjee
Hi Jörn, Do you have suggestions as to how to do that? The conflicting packages are being picked up by default from pom.xml. I am not invoking any additional packages while running spark-submit on the thin jar. Thanks, Srabasti Banerjee On Thursday, 30 August, 2018, 9:45:36 PM GMT-7, Jörn Fran

Re: Spark Streaming : Multiple sources found for csv : Error

2018-08-30 Thread Srabasti Banerjee
Great that we are already discussing/working on fixing the issue. Happy to help if I can :-) Any workarounds that we can use for now? Please note I am not invoking any additional packages while running spark-submit on the thin jar. Thanks, Srabasti Banerjee On Thursday, 30 August, 2018, 9:02:11 PM

Re: Spark Streaming : Multiple sources found for csv : Error

2018-08-30 Thread Jörn Franke
Can’t you remove the dependency on the Databricks CSV data source? Spark has had CSV support integrated for several versions now, so it is not needed. > On 31. Aug 2018, at 05:52, Srabasti Banerjee > wrote: > > Hi, > > I am trying to run below code to read file as a dataframe onto a Stream (for > Spark S
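One way to follow this suggestion in a Maven build is to exclude the old external CSV artifact wherever it comes in transitively (a sketch only — `org.example:some-library` is a hypothetical placeholder for whichever dependency in your pom.xml pulls it in; the `com.databricks:spark-csv_2.11` coordinates assume a Scala 2.11 build):

```xml
<dependency>
    <!-- hypothetical dependency that drags spark-csv in transitively -->
    <groupId>org.example</groupId>
    <artifactId>some-library</artifactId>
    <version>1.0</version>
    <exclusions>
        <exclusion>
            <groupId>com.databricks</groupId>
            <artifactId>spark-csv_2.11</artifactId>
        </exclusion>
    </exclusions>
</dependency>
```

With the external source off the classpath, only Spark's built-in CSV source remains, so the "Multiple sources found for csv" ambiguity goes away.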

Re: Spark Streaming : Multiple sources found for csv : Error

2018-08-30 Thread Hyukjin Kwon
Yea, this is exactly what I have been worried about with the recent changes (discussed in https://issues.apache.org/jira/browse/SPARK-24924). See https://github.com/apache/spark/pull/17916. This should be fine in later Spark versions. FYI, +Wenchen and Dongjoon. I want to add Thomas Graves and Gengliang Wang

Spark Streaming : Multiple sources found for csv : Error

2018-08-30 Thread Srabasti Banerjee
Hi, I am trying to run the code below to read a file as a dataframe onto a Stream (for Spark Streaming), developed via the Eclipse IDE, defining schemas appropriately, by running a thin jar on the server, and am getting the error below. Tried out suggestions from researching on the internet based on "spark.read.option.sc

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread Anton Kulaga
I think beta support would be good. I am OK with trading stability for being able to use 2.12-only libraries in my code, and if there is something mission-critical, nobody is blocked from using stable 2.11. Sincerely, Anton Kulaga Bioinformatician at Computational Biology of Aging Group 296 Splaiul Indep

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread shane knapp
+1 on beta support for scala 2.12 On Thu, Aug 30, 2018 at 2:33 PM, Stavros Kontopoulos < stavros.kontopou...@lightbend.com> wrote: > +1 that would be great Sean, also you put a lot of effort in there, would > make sense to wait a bit. > > Stavros > > On Fri, Aug 31, 2018 at 12:00 AM, Sean Owen w

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread Stavros Kontopoulos
+1, that would be great Sean; also, you put a lot of effort in there, so it would make sense to wait a bit. Stavros On Fri, Aug 31, 2018 at 12:00 AM, Sean Owen wrote: > I know it's famous last words, but we really might be down to the last > fix: https://github.com/apache/spark/pull/22264 More a questi

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread Sean Owen
I know it's famous last words, but we really might be down to the last fix: https://github.com/apache/spark/pull/22264 More a question of making tests happy at this point I think than fundamental problems. My goal is to make sure we can release a usable, but beta-quality, 2.12 release of Spark in 2

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread Reynold Xin
Let's see how they go. At some point we do need to cut the release. That argument can be made on every feature, and different people place different value / importance on different features, so we could just end up never making a release. On Thu, Aug 30, 2018 at 1:56 PM antonkulaga wrote: > >T

Re: code freeze and branch cut for Apache Spark 2.4

2018-08-30 Thread antonkulaga
>There are a few PRs to fix Scala 2.12 issues. I think they will keep coming up and we don't need to block Spark 2.4 on this. I think it would be better to wait a bit for Scala 2.12 support in 2.4 than to suffer for many months until a Spark 2.5 with 2.12 support is released. Scala 2.12 is not only a
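For library authors caught between the two Scala lines, the usual answer is cross-building: publishing the same code for both binary versions so downstream users are not forced onto one. A hedged sbt fragment (version numbers here are illustrative, not what Spark itself pins):

```scala
// build.sbt -- publish for both Scala lines (versions illustrative)
crossScalaVersions := Seq("2.11.12", "2.12.6")
// `sbt +compile` / `sbt +publishLocal` then run for every listed version
```

This is why a beta-quality 2.12 build of Spark 2.4 matters even before it is the default: it lets the surrounding library ecosystem start cross-publishing.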

Update to Kryo 4 for Spark 2.4?

2018-08-30 Thread Sean Owen
I wanted to call any interested eyes to this discussion: https://github.com/apache/spark/pull/22179

Re: mllib + SQL

2018-08-30 Thread William Benton
What are you interested in accomplishing? The spark.ml package has provided a machine learning API based on DataFrames for quite some time. If you are interested in mixing query processing and machine learning, this is certainly the best place to start. See here: https://spark.apache.org/docs/l
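The DataFrame-based spark.ml API is built around the Estimator/Transformer pipeline pattern: an estimator's `fit` learns from data and returns a transformer ("model"), and a pipeline chains stages, fitting each on the output of the previous. A toy pure-Python sketch of that pattern — the class names below are stand-ins for illustration, not the pyspark.ml API, and the data is a bare list rather than a DataFrame:

```python
class Transformer:
    def transform(self, rows):
        raise NotImplementedError

class Estimator:
    def fit(self, rows):
        raise NotImplementedError  # returns a Transformer ("model")

class MeanCenter(Estimator):
    """Toy estimator: learns the mean, produces a centering model."""
    def fit(self, rows):
        mean = sum(rows) / len(rows)
        class Model(Transformer):
            def transform(self, rows):
                return [x - mean for x in rows]
        return Model()

class Pipeline(Estimator):
    """Chains estimators: each stage is fit on the previous stage's output."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, rows):
        models = []
        for stage in self.stages:
            model = stage.fit(rows)
            rows = model.transform(rows)
            models.append(model)
        class PipelineModel(Transformer):
            def transform(self, rows):
                for m in models:
                    rows = m.transform(rows)
                return rows
        return PipelineModel()
```

The fitted pipeline is itself a transformer, which is what makes mixing query processing and learning convenient: the whole fitted pipeline can be applied to new data in one call.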