Re: [VOTE] SPIP: Spark API for Table Metadata

2019-02-28 Thread John Zhuge
+1 (non-binding) On Thu, Feb 28, 2019 at 9:11 AM Matt Cheah wrote: > +1 (non-binding)

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Matt Cheah
I want to specifically highlight and +1 a point that Ryan brought up: A commitment binds us to do this and make a reasonable attempt at finishing on time. If we choose not to commit, or if we choose to commit and don’t make a reasonable attempt, then we need to ask, “what happened?” Is Spark

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mridul Muralidharan
I am -1 on this vote for pretty much all the reasons that Mark mentioned. A major version change gives us an opportunity to remove deprecated interfaces, stabilize experimental/developer api, drop support for outdated functionality/platforms and evolve the project with a vision for foreseeable

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Joseph Torres
I'm not worried about rushing. I worry that, without clear parameters for the amount or types of DSv2 delays that are acceptable, we might end up holding back 3.0 indefinitely to meet the deadline when we wouldn't have made that decision de novo. (Or even worse, the PMC eventually feels they must

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Ryan Blue
The question is, what does it bind? I’m not pushing for a binding statement to do this or delay the 3.0 release because I don’t think that’s a very reasonable thing to do. It may well be that there is a good reason for missing the goal. So “what does it bind?” is an apt question. A commitment

Structured Streaming - compare previous row value with current

2019-02-28 Thread Raphael Hirsiger
Hi there, Would you be able to give advice on how to best compare a previous row value in a structured streaming DF with the current one? Kind regards, Raphael
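In batch Spark this is typically done with the `lag` window function over a partitioned, ordered window, but ordinary window functions are not supported on streaming DataFrames; the usual streaming answer is to keep the last seen value per key as state, e.g. with `flatMapGroupsWithState`. As a minimal sketch of the per-key logic itself (plain Scala, Spark-specific wiring deliberately left out; the function name is illustrative):

```scala
// Pair every row with the previous row's value, the way lag(col, 1) would.
// In Structured Streaming the same effect needs per-key state, e.g.
// ds.groupByKey(...).flatMapGroupsWithState(...) holding the last value seen.
def withPrevious[A](rows: Seq[A]): Seq[(Option[A], A)] =
  rows.foldLeft((Option.empty[A], Vector.empty[(Option[A], A)])) {
    case ((prev, acc), cur) => (Some(cur), acc :+ ((prev, cur)))
  }._2
```

`withPrevious(Seq(10, 12, 9))` pairs each element with its predecessor (the first row gets `None`), so the comparison itself is then an ordinary map over the pairs.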

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Sean Owen
This is a fine thing to VOTE on. Committers (and community, non-binding) can VOTE on what we like; we just don't do it often where not required because it's a) overkill overhead over simple lazy consensus, and b) it can be hard to say what the binding VOTE binds if it's not a discrete commit or

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
I agree that adding new features in a major release is not forbidden, but that is just not the primary goal of a major release. If we reach the point where we are happy with the new public API before some new features are in a satisfactory state to be merged, then I don't want there to be a prior

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Ryan Blue
Mark, I disagree. Setting common goals is a critical part of getting things done. This doesn't commit the community to push out the release if the goals aren't met, but does mean that we will, as a community, seriously consider it. This is also an acknowledgement that this is the most important

Re: CombinePerKey and GroupByKey

2019-02-28 Thread Reynold Xin
This should be fine. Dataset.groupByKey is a logical operation, not a physical one (as in Spark wouldn’t always materialize all the groups in memory).
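The distinction can be illustrated in plain Scala: a per-key reduce can be maintained incrementally with one running map, which is the shape of the partial (map-side) aggregation Spark can use for `groupByKey(...).reduceGroups(...)`, rather than first materializing each group. A minimal sketch (names are illustrative, not Spark API):

```scala
// Combine values per key with a single running map: each row is merged into
// the accumulator as it arrives, so no group is ever held in full.
def combinePerKey[K, V](rows: Seq[(K, V)])(merge: (V, V) => V): Map[K, V] =
  rows.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
    acc.updated(k, acc.get(k).fold(v)(merge(_, v)))
  }
```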

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Ryan Blue
Mark, if this goal is adopted, "we" is the Apache Spark community.

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
Then I'm -1. Setting new features as blockers of major releases is not proper project management, IMO.

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Mark Hamstra
Who is "we" in these statements, such as "we should consider a functional DSv2 implementation a blocker for Spark 3.0"? If it means those contributing to the DSv2 effort want to set their own goals, milestones, etc., then that is fine with me. If you mean that the Apache Spark project should

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-02-28 Thread Matt Cheah
+1 (non-binding)

Re: [VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Matt Cheah
+1 (non-binding) Are identifiers and namespaces going to be rolled under one of those six points?

Re: [DISCUSS] Spark 3.0 and DataSourceV2

2019-02-28 Thread Ryan Blue
Thanks for the discussion, everyone. Since there aren't many objections to the scope and we are aligned on what this commitment would mean, I've started a vote thread for it. rb On Wed, Feb 27, 2019 at 5:32 PM Wenchen Fan wrote: > I'm good with the list from Ryan, thanks! > > On Thu, Feb 28,

[VOTE] Functional DataSourceV2 in Spark 3.0

2019-02-28 Thread Ryan Blue
I’d like to call a vote for committing to getting DataSourceV2 in a functional state for Spark 3.0. For more context, please see the discussion thread, but here is a quick summary about what this commitment means: - We think that a “functional DSv2” is an achievable goal for the Spark 3.0

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-02-28 Thread Jamison Bennett
+1 (non-binding) Jamison Bennett Cloudera Software Engineer jamison.benn...@cloudera.com 515 Congress Ave, Suite 1212 | Austin, TX | 78701

Re: [VOTE] SPIP: Spark API for Table Metadata

2019-02-28 Thread Ryan Blue
+1 (non-binding) On Wed, Feb 27, 2019 at 8:34 PM Russell Spitzer wrote: > +1 (non-binding) > > On Wed, Feb 27, 2019, 6:28 PM Ryan Blue wrote: > >> Hi everyone, >> >> In the last DSv2 sync, the consensus was that the table metadata SPIP was >> ready to bring up for a vote. Now that the

CombinePerKey and GroupByKey

2019-02-28 Thread Etienne Chauchot
Hi all, I'm migrating RDD pipelines to Dataset and I saw that Combine.PerKey is no longer there in the Dataset API. So, I translated it to: KeyValueGroupedDataset> groupedDataset = keyedDataset.groupByKey(KVHelpers.extractKey(), EncoderHelpers.genericEncoder()); Dataset> combinedDataset =
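The generic parameters in the snippet above were lost in archiving, so this is only a sketch of the shape of the translation, assuming a simple key/value pair and using collection semantics in place of Spark's: group rows by key, then reduce each group, which is what `groupByKey` followed by `reduceGroups` does on a `Dataset` (the `KVHelpers`/`EncoderHelpers` names are specific to the original poster's code):

```scala
// Combine.PerKey expressed as group-by-key followed by a reduce of each
// group; on a Spark Dataset the corresponding shape is
//   ds.groupByKey(keyFn).reduceGroups(mergeFn)
def groupThenReduce[K, V](rows: Seq[(K, V)])(merge: (V, V) => V): Map[K, V] =
  rows.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(merge) }
```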