[DISCUSS] remove the incomplete code path on aggregation for continuous mode

2020-05-18 Thread Jungtaek Lim
Hi devs, while experimenting with complete mode I realized we left some incomplete code paths for supporting aggregation in continuous mode (shuffle & coalesce). The work took place around the first half of 2018 and then stopped, and nothing has been done for around 2 years (so I don't expect anyone

[DISCUSS] "complete" streaming output mode

2020-05-18 Thread Jungtaek Lim
Hi devs, while dealing with SPARK-31706 [1] we found that the streaming output mode is only effective for stateful aggregation and is not guaranteed on the sink, which could expose a data loss issue. SPARK-31724 [2] is filed to track the efforts on improving the streaming output mode. Before we revisit

K8s with apache spark

2020-05-18 Thread Jeffrey Orihuela
Hi, I have a minikube cluster running locally and I am trying to use Apache Spark in client mode with Kubernetes and read a file. I expect to read the file, but when I try to read it with textfile = sc.textFile("README.md") and then execute textfile.count() the output is

Re: [VOTE] Apache Spark 3.0 RC2

2020-05-18 Thread Jungtaek Lim
Looks like the priority of SPARK-31706 [1] is incorrectly marked - it sounds like a blocker, as SPARK-26785 [2] / SPARK-26956 [3] dropped the feature of "update" on streaming output mode (as a result) and SPARK-31706 restores it. SPARK-31706 is not yet resolved, which may be a valid reason to roll a

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-18 Thread Holden Karau
That seems like an important concern. I'm going to go ahead and vote -1 on this RC and I'll roll a new RC once the IndyLambda support is backported into the 2.4 branch. On Mon, May 18, 2020 at 2:58 PM DB Tsai wrote: > I am changing my vote from +1 to +0. > > Since Spark 3.0 is Scala 2.12 only,

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-18 Thread DB Tsai
I am changing my vote from +1 to +0. Since Spark 3.0 is Scala 2.12 only, having a transitional 2.4.x release with great support of Scala 2.12 is very important. I would like to have [SPARK-31399][CORE] Support indylambda Scala closure in ClosureCleaner backported. Without it, it might break

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-18 Thread Holden Karau
Another two candidates for backporting that have come up since this RC are SPARK-31692 & SPARK-31399. What are folks' thoughts, should we roll an RC4? On Mon, May 18, 2020 at 2:13 PM Sean Owen wrote: > Ah OK, I assumed from the timing that this was cut to include that commit. > I should have

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-18 Thread Sean Owen
Ah OK, I assumed from the timing that this was cut to include that commit. I should have looked. Yes, it is not strictly a regression so does not have to block the release and this can pass. We can release 2.4.7 in a few months, too. How important is the fix? If it's pretty important, it may still

[VOTE] Apache Spark 3.0 RC2

2020-05-18 Thread Reynold Xin
Please vote on releasing the following candidate as Apache Spark version 3.0.0. The vote is open until Thu May 21 11:59pm Pacific time and passes if a majority +1 PMC votes are cast, with a minimum of 3 +1 votes. [ ] +1 Release this package as Apache Spark 3.0.0 [ ] -1 Do not release this

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-18 Thread Holden Karau
That is correct. I asked on the PR if that was ok with folks before I moved forward with the RC and was told that it was ok. I believe that particular bug is not a regression and is a long standing issue so we wouldn’t normally block the release on it. On Mon, May 18, 2020 at 7:40 AM Xiao Li

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-18 Thread Xiao Li
This RC does not include the correctness bug fix https://github.com/apache/spark/commit/a4885f3654899bcb852183af70cc0a82e7dd81d0 which landed just after the RC3 cut. On Mon, May 18, 2020 at 7:21 AM Tom Graves wrote: > +1. > > Tom > > On Monday, May 18, 2020, 08:05:24 AM CDT, Wenchen Fan > wrote: > > >

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-18 Thread Tom Graves
+1. Tom On Monday, May 18, 2020, 08:05:24 AM CDT, Wenchen Fan wrote: +1, no known blockers. On Mon, May 18, 2020 at 12:49 AM DB Tsai wrote: +1 as well. Thanks. On Sun, May 17, 2020 at 7:39 AM Sean Owen wrote: +1, same response as to the last RC. This looks like it includes the

Re: [VOTE] Release Spark 2.4.6 (RC3)

2020-05-18 Thread Wenchen Fan
+1, no known blockers. On Mon, May 18, 2020 at 12:49 AM DB Tsai wrote: > +1 as well. Thanks. > > On Sun, May 17, 2020 at 7:39 AM Sean Owen wrote: > >> +1, same response as to the last RC. >> This looks like it includes the fix discussed last time, as well as a >> few more small good fixes. >>

Re: Applying schema dynamically in dataframe

2020-05-18 Thread ZHANG Wei
May I get a sample scenario to understand the requirement? -- Cheers, -z On Sat, 16 May 2020 11:45:03 +0530 rahul c wrote: > Hi dev, > > Currently I have a scenario where I am reading the data from Kafka using > spark dataframe. > > Multiple data sources ingest the data into kafka same