Re: PSA: JIRA resolutions and meanings

2016-10-09 Thread Sean Owen
I added a variant on this text to
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingtoJIRAMaintenance

On Sat, Oct 8, 2016 at 10:09 AM Sean Owen  wrote:

> That flood of emails means several people (Xiao, Holden mostly AFAICT)
> have been updating the status of old JIRAs. Thank you, I think that really
> does help.
>
> I have a suggested set of conventions I've been using, just to bring some
> order to the resolutions. It helps because JIRA functions as a huge archive
> of decisions and the more accurately we can record that the better. What do
> people think of this?
>
> - Resolve as Fixed if there's a change you can point to that resolved the
> issue
> - If the issue is a proper subset of another issue, mark it a Duplicate of
> that issue (rather than the other way around)
> - If it's probably resolved, but not obvious what fixed it or when, then
> Cannot Reproduce or Not a Problem
> - Obsolete issue? Not a Problem
> - If it's a coherent issue but does not seem like there is support or
> interest in acting on it, then Won't Fix
> - If the issue doesn't make sense (non-Spark issue, etc) then Invalid
> - I tend to mark Umbrellas as "Done" when done if they're just containers
> - Try to set Fix version
> - Try to set Assignee to the person who most contributed to the
> resolution. Usually the person who opened the PR. Strong preference for
> ties going to the more 'junior' contributor
>
> The only ones I think are sort of important are getting the Duplicate
> pointers right, and possibly making sure that Fixed issues have a clear
> path to finding what change fixed it and when. The rest doesn't matter much.
>
>


Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
Yeah, I've looked at KIPs and Scala SIPs.

I'm reluctant to use the Kafka structured streaming as an example
because of the pre-existing conflict around it.  If Michael or another
committer wanted to put it forth as an example, I'd participate in
good faith though.

On Sun, Oct 9, 2016 at 5:07 PM, Ofir Manor  wrote:
> This is a great discussion!
> Maybe you could have a look at Kafka's process - it also uses Rejected
> Alternatives and I personally find it very clear actually (the link also
> leads to all KIPs):
>
> https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
> Cody - maybe you could take one of the open issues and write a sample
> proposal? A concrete example might make it clearer for those who see this
> for the first time. Maybe the Kafka offset discussion or some other
> Kafka/Structured Streaming open issue? Will that be helpful?
>
> Ofir Manor
>
> Co-Founder & CTO | Equalum
>
> Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io
>
>
> On Mon, Oct 10, 2016 at 12:36 AM, Matei Zaharia 
> wrote:
>>
>> Yup, this is the stuff that I found unclear. Thanks for clarifying here,
>> but we should also clarify it in the writeup. In particular:
>>
>> - Goals needs to be about user-facing behavior ("people" is broad)
>>
>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up
>> one of these and say "Spark's developers have officially rejected X, which
>> our awesome system has".
>>
>> - For user-facing stuff, I think you need a section on API. Virtually all
>> other *IPs I've seen have that.
>>
>> - I'm still not sure why the strategy section is needed if the purpose is
>> to define user-facing behavior -- unless this is the strategy for setting
>> the goals or for defining the API. That sounds squarely like a design doc
>> issue. In some sense, who cares whether the proposal is technically feasible
>> right now? If it's infeasible, that will be discovered later during design
>> and implementation. Same thing with rejected strategies -- listing some of
>> those is definitely useful sometimes, but if you make this a *required*
>> section, people are just going to fill it in with bogus stuff (I've seen
>> this happen before).
>>
>> Matei
>>
>> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger  wrote:
>> >
>> > So to focus the discussion on the specific strategy I'm suggesting,
>> > documented at
>> >
>> >
>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >
>> > "Goals: What must this allow people to do, that they can't currently?"
>> >
>> > Is it unclear that this is focusing specifically on people-visible
>> > behavior?
>> >
>> > Rejected goals -  are important because otherwise people keep trying
>> > to argue about scope.  Of course you can change things later with a
>> > different SIP and different vote, the point is to focus.
>> >
>> > Use cases - are something that people are going to bring up in
>> > discussion.  If they aren't clearly documented as a goal ("This must
>> > allow me to connect using SSL"), they should be added.
>> >
>> > Internal architecture - if the people who need specific behavior are
>> > implementers of other parts of the system, that's fine.
>> >
>> > Rejected strategies - If you have none of these, you have no evidence
>> > that the proponent didn't just go with the first thing they had in
>> > mind (or have already implemented), which is a big problem currently.
>> > Approval isn't binding as to specifics of implementation, so these
>> > aren't handcuffs.  The goals are the contract, the strategy is
>> > evidence that contract can actually be met.
>> >
>> > Design docs - I'm not touching design docs.  The markdown file I
>> > linked specifically says of the strategy section "This is not a full
>> > design document."  Is this unclear?  Design docs can be worked on
>> > obviously, but that's not what I'm concerned with here.
>> >
>> >
>> >
>> >
>> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia 
>> > wrote:
>> >> Hi Cody,
>> >>
>> >> I think this would be a lot more concrete if we had a more detailed
>> >> template
>> >> for SIPs. Right now, it's not super clear what's in scope -- e.g. are
>> >> they
>> >> a way to solicit feedback on the user-facing behavior or on the
>> >> internals?
>> >> "Goals" can cover both things. I've been thinking of SIPs more as
>> >> Product
>> >> Requirements Docs (PRDs), which focus on *what* a code change should do
>> >> as
>> >> opposed to how.
>> >>
>> >> In particular, here are some things that you may or may not consider in
>> >> scope for SIPs:
>> >>
>> >> - Goals and non-goals: This is definitely in scope, and IMO should
>> >> focus on
>> >> user-visible behavior (e.g. "system supports SQL window functions" or
>> >> "system continues working if one node fails"). BTW I wouldn't say
>> >> "rejected
>> >> goals" because some of them might become goals later, so we're not
>> >> definitively rejecting them.

Re: Spark Improvement Proposals

2016-10-09 Thread Matei Zaharia
Well, I think there are a few things here that don't make sense. First, why 
should only committers submit SIPs? Development in the project should be open 
to all contributors, whether they're committers or not. Second, I think 
unrealistic goals can be found just by inspecting the goals, and I'm not super 
worried that we'll accept a lot of SIPs that are then infeasible -- we can then 
submit new ones. But this depends on whether you want this process to be a 
"design doc lite", where people also agree on implementation strategy, or just 
a way to agree on goals. This is what I asked earlier about PRDs vs design docs 
(and I'm open to either one but I'd just like clarity). Finally, both as a user 
and designer of software, I always want to give feedback on APIs, so I'd really 
like a culture of having those early. People don't argue about prettiness when 
they discuss APIs, they argue about the core concepts to expose in order to 
meet various goals, and then they're stuck maintaining those for a long time.

Matei

> On Oct 9, 2016, at 3:10 PM, Cody Koeninger  wrote:
> 
> Users instead of people, sure.  Committers and contributors are (or at least
> should be) a subset of users.
> 
> Non goals, sure. I don't care what the name is, but we need to clearly say 
> e.g. 'no we are not maintaining compatibility with XYZ right now'.
> 
> API, what I care most about is whether it allows me to accomplish the goals. 
> Arguing about how ugly or pretty it is can be saved for design/ 
> implementation imho.
> 
> Strategy, this is necessary because otherwise goals can be out of line with 
> reality.  Don't propose goals you don't have at least some idea of how to 
> implement.
> 
> Rejected strategies, given that committers are the only ones I'm saying should
> formally submit SPARKLIs or SIPs, if they put junk in a required section then 
> slap them down for it and tell them to fix it.
> 
> 
> On Oct 9, 2016 4:36 PM, "Matei Zaharia" wrote:
> Yup, this is the stuff that I found unclear. Thanks for clarifying here, but 
> we should also clarify it in the writeup. In particular:
> 
> - Goals needs to be about user-facing behavior ("people" is broad)
> 
> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up one 
> of these and say "Spark's developers have officially rejected X, which our 
> awesome system has".
> 
> - For user-facing stuff, I think you need a section on API. Virtually all 
> other *IPs I've seen have that.
> 
> - I'm still not sure why the strategy section is needed if the purpose is to 
> define user-facing behavior -- unless this is the strategy for setting the 
> goals or for defining the API. That sounds squarely like a design doc issue. 
> In some sense, who cares whether the proposal is technically feasible right 
> now? If it's infeasible, that will be discovered later during design and 
> implementation. Same thing with rejected strategies -- listing some of those 
> is definitely useful sometimes, but if you make this a *required* section, 
> people are just going to fill it in with bogus stuff (I've seen this happen 
> before).
> 
> Matei
> 
> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger wrote:
> >
> > So to focus the discussion on the specific strategy I'm suggesting,
> > documented at
> >
> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> >
> > "Goals: What must this allow people to do, that they can't currently?"
> >
> > Is it unclear that this is focusing specifically on people-visible behavior?
> >
> > Rejected goals -  are important because otherwise people keep trying
> > to argue about scope.  Of course you can change things later with a
> > different SIP and different vote, the point is to focus.
> >
> > Use cases - are something that people are going to bring up in
> > discussion.  If they aren't clearly documented as a goal ("This must
> > allow me to connect using SSL"), they should be added.
> >
> > Internal architecture - if the people who need specific behavior are
> > implementers of other parts of the system, that's fine.
> >
> > Rejected strategies - If you have none of these, you have no evidence
> > that the proponent didn't just go with the first thing they had in
> > mind (or have already implemented), which is a big problem currently.
> > Approval isn't binding as to specifics of implementation, so these
> > aren't handcuffs.  The goals are the contract, the strategy is
> > evidence that contract can actually be met.
> >
> > Design docs - I'm not touching design docs.  The markdown file I
> > linked specifically says of the strategy section "This is not a full
> > design document."  Is this unclear?  Design docs can be worked on
> > obviously, but that's not what I'm concerned with 

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
Only committers should formally submit SIPs, because in an Apache
project only committers have explicit political power.  If a user can't
find a committer willing to sponsor an SIP idea, they have no way to
get the idea passed in any case.  If I can't find a committer to
sponsor this meta-SIP idea, I'm out of luck.

I do not believe unrealistic goals can be found solely by inspection.
We've managed to ignore unrealistic goals even after implementation!
Focusing on APIs can allow people to think they've solved something,
when there's really no way of implementing that API while meeting the
goals.  Rapid iteration is clearly the best way to address this, but
we've already talked about why that hasn't really worked.  If adding a
non-binding API section to the template is important to you, I'm not
against it, but I don't think it's sufficient.

On your PRD vs design doc spectrum, I'm saying this is closer to a
PRD.  Clear agreement on goals is the most important thing and that's
why it's the thing I want binding agreement on.  But I cannot agree to
goals unless I have enough minimal technical info to judge whether the
goals are likely to actually be accomplished.



On Sun, Oct 9, 2016 at 5:35 PM, Matei Zaharia  wrote:
> Well, I think there are a few things here that don't make sense. First, why
> should only committers submit SIPs? Development in the project should be
> open to all contributors, whether they're committers or not. Second, I think
> unrealistic goals can be found just by inspecting the goals, and I'm not
> super worried that we'll accept a lot of SIPs that are then infeasible -- we
> can then submit new ones. But this depends on whether you want this process
> to be a "design doc lite", where people also agree on implementation
> strategy, or just a way to agree on goals. This is what I asked earlier
> about PRDs vs design docs (and I'm open to either one but I'd just like
> clarity). Finally, both as a user and designer of software, I always want to
> give feedback on APIs, so I'd really like a culture of having those early.
> People don't argue about prettiness when they discuss APIs, they argue about
> the core concepts to expose in order to meet various goals, and then they're
> stuck maintaining those for a long time.
>
> Matei
>
> On Oct 9, 2016, at 3:10 PM, Cody Koeninger  wrote:
>
> Users instead of people, sure.  Committers and contributors are (or at least
> should be) a subset of users.
>
> Non goals, sure. I don't care what the name is, but we need to clearly say
> e.g. 'no we are not maintaining compatibility with XYZ right now'.
>
> API, what I care most about is whether it allows me to accomplish the goals.
> Arguing about how ugly or pretty it is can be saved for design/
> implementation imho.
>
> Strategy, this is necessary because otherwise goals can be out of line with
> reality.  Don't propose goals you don't have at least some idea of how to
> implement.
>
> Rejected strategies, given that committers are the only ones I'm saying
> should formally submit SPARKLIs or SIPs, if they put junk in a required
> section then slap them down for it and tell them to fix it.
>
>
> On Oct 9, 2016 4:36 PM, "Matei Zaharia"  wrote:
>>
>> Yup, this is the stuff that I found unclear. Thanks for clarifying here,
>> but we should also clarify it in the writeup. In particular:
>>
>> - Goals needs to be about user-facing behavior ("people" is broad)
>>
>> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up
>> one of these and say "Spark's developers have officially rejected X, which
>> our awesome system has".
>>
>> - For user-facing stuff, I think you need a section on API. Virtually all
>> other *IPs I've seen have that.
>>
>> - I'm still not sure why the strategy section is needed if the purpose is
>> to define user-facing behavior -- unless this is the strategy for setting
>> the goals or for defining the API. That sounds squarely like a design doc
>> issue. In some sense, who cares whether the proposal is technically feasible
>> right now? If it's infeasible, that will be discovered later during design
>> and implementation. Same thing with rejected strategies -- listing some of
>> those is definitely useful sometimes, but if you make this a *required*
>> section, people are just going to fill it in with bogus stuff (I've seen
>> this happen before).
>>
>> Matei
>>
>> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger  wrote:
>> >
>> > So to focus the discussion on the specific strategy I'm suggesting,
>> > documented at
>> >
>> >
>> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>> >
>> > "Goals: What must this allow people to do, that they can't currently?"
>> >
>> > Is it unclear that this is focusing specifically on people-visible
>> > behavior?
>> >
>> > Rejected goals -  are important because otherwise people keep trying
>> > to argue about scope.  

Re: Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Felix Cheung
Should we just link to

https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark




On Sun, Oct 9, 2016 at 10:09 AM -0700, "Hyukjin Kwon" wrote:

Thanks for confirming this, Sean. I filed this in 
https://issues.apache.org/jira/browse/SPARK-17840

I would appreciate it if anyone with better writing skills than mine tried to
fix this.

I don't want reviewers to have to spend effort correcting the grammar.


On 10 Oct 2016 1:34 a.m., "Sean Owen" wrote:
Yes, it's really CONTRIBUTING.md that's more relevant, because GitHub displays
a link to it when opening pull requests. 
https://github.com/apache/spark/blob/master/CONTRIBUTING.md  There is also the 
pull request template: 
https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE

I wouldn't want to duplicate info too much, but more pointers to a single 
source of information seems OK. Although I don't know if it will help much, 
sure, pointers from README.md are OK.

On Sun, Oct 9, 2016 at 3:47 PM Hyukjin Kwon 
> wrote:
Hi all,


I just noticed that the README.md (https://github.com/apache/spark) does not
directly describe the steps or links to follow for creating a PR or JIRA. I
know it is probably sensible to search Google for the contribution guides
before trying to make a PR/JIRA, but that does not seem to be enough, since I
see inappropriate PRs/JIRAs from time to time.

I guess the flood of JIRAs and PRs is problematic (judging from the emails on
the dev mailing list), and I think we should explicitly mention and describe
this in the README.md and pull request template[1].

(I know we have CONTRIBUTING.md[2] and the wiki[3], but it seems pretty clear
that we still have some PRs or JIRAs not following the documentation.)

So, my suggestions are as follows:

- Create a section, maybe "Contributing To Apache Spark", in the README.md,
describing the wiki and CONTRIBUTING.md[2].

- Add an explicit warning to the pull request template[1], for example:
"Please double check whether your pull request is from a branch to a branch.
In most cases, this change is not appropriate. Please ask on the mailing list
(http://spark.apache.org/community.html) if you are not sure."

[1]https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
[2]https://github.com/apache/spark/blob/master/CONTRIBUTING.md
[3]https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage


Thank you all.


Re: Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Hyukjin Kwon
Thanks for confirming this, Sean. I filed this in
https://issues.apache.org/jira/browse/SPARK-17840

I would appreciate it if anyone with better writing skills than mine tried
to fix this.

I don't want reviewers to have to spend effort correcting the grammar.


On 10 Oct 2016 1:34 a.m., "Sean Owen"  wrote:

> Yes, it's really CONTRIBUTING.md that's more relevant, because GitHub
> displays a link to it when opening pull requests:
> https://github.com/apache/spark/blob/master/CONTRIBUTING.md
> There is also the pull request template:
> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>
> I wouldn't want to duplicate info too much, but more pointers to a single
> source of information seems OK. Although I don't know if it will help much,
> sure, pointers from README.md are OK.
>
> On Sun, Oct 9, 2016 at 3:47 PM Hyukjin Kwon  wrote:
>
>> Hi all,
>>
>>
>> I just noticed that the README.md (https://github.com/apache/spark) does not
>> directly describe the steps or links to follow for creating a PR or JIRA. I
>> know it is probably sensible to search Google for the contribution guides
>> before trying to make a PR/JIRA, but that does not seem to be enough, since
>> I see inappropriate PRs/JIRAs from time to time.
>>
>> I guess the flood of JIRAs and PRs is problematic (judging from the emails
>> on the dev mailing list), and I think we should explicitly mention and
>> describe this in the README.md and pull request template[1].
>>
>> (I know we have CONTRIBUTING.md[2] and the wiki[3], but it seems pretty
>> clear that we still have some PRs or JIRAs not following the documentation.)
>>
>> So, my suggestions are as follows:
>>
>> - Create a section, maybe "Contributing To Apache Spark", in the README.md,
>> describing the wiki and CONTRIBUTING.md[2].
>>
>> - Add an explicit warning to the pull request template[1], for example:
>> "Please double check whether your pull request is from a branch to a branch.
>> In most cases, this change is not appropriate. Please ask on the mailing
>> list (http://spark.apache.org/community.html) if you are not sure."
>>
>> [1]https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>> [2]https://github.com/apache/spark/blob/master/CONTRIBUTING.md
>> [3]https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
>>
>>
>> Thank you all.
>>
>


Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
Here's my specific proposal (meta-proposal?)

Spark Improvement Proposals (SIP)


Background:

The current problem is that design and implementation of large features are
often done in private, before soliciting user feedback.

When feedback is solicited, it is often as to detailed design specifics,
not focused on goals.

When implementation does take place after design, there is often
disagreement as to what goals are or are not in scope.

This results in commits that don't fully meet user needs.


Goals:

- Ensure user, contributor, and committer goals are clearly identified and
agreed upon, before implementation takes place.

- Ensure that a technically feasible strategy is chosen that is likely to
meet the goals.


Rejected Goals:

- SIPs are not for detailed design.  Design by committee doesn't work.

- SIPs are not for every change.  We don't need that much process.


Strategy:

My suggestion is outlined as a Spark Improvement Proposal process
documented at

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

Specifics of Jira manipulation are an implementation detail we can figure
out.

I'm suggesting voting; the need here is for a _clear_ outcome.


Rejected Strategies:

Having someone who understands the problem implement it first works, but
only if significant iteration after user feedback is allowed.

Historically this has been problematic due to pressure to limit public api
changes.
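
For concreteness, a bare skeleton of a proposal following this process might
look as follows (section names are taken from this thread; the markdown file
linked above remains the authoritative template):

SIP-N: <one-line summary of the proposed change>

Background: the problem being addressed, and why now.

Goals: what this must allow users to do that they can't currently.

Rejected Goals: what is explicitly out of scope for this SIP.

Strategy: a minimal technical sketch showing the goals can actually be met
(explicitly not a full design document).

Rejected Strategies: alternatives considered, and why they were set aside.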

On Fri, Oct 7, 2016 at 5:16 PM, Reynold Xin  wrote:

> Alright, looks like there is quite a bit of support. We should wait to
> hear from more people too.
>
> To push this forward, Cody and I will be working together in the next
> couple of weeks to come up with a concrete, detailed proposal on what this
> entails, and then we can discuss the specific proposal as well.
>
>
> On Fri, Oct 7, 2016 at 2:29 PM, Cody Koeninger  wrote:
>
>> Yeah, in case it wasn't clear, I was talking about SIPs for major
>> user-facing or cross-cutting changes, not minor feature adds.
>>
>> On Fri, Oct 7, 2016 at 3:58 PM, Stavros Kontopoulos <
>> stavros.kontopou...@lightbend.com> wrote:
>>
>>> +1 to the SIP label as long as it does not slow things down and it
>>> targets optimizing effort, coordination, etc. For example, really small
>>> features (assuming they don't touch public interfaces) or refactorings
>>> should not need to go through this process, and I hope it will be kept
>>> this way. So a guideline doc should be provided, as in the KIP case.
>>>
>>> IMHO, so far, aside from tagging things and linking them elsewhere, simply
>>> having design docs and prototype implementations in PRs is not something
>>> that has failed us. What is really a pain in many projects out there is
>>> discontinuity in the progress of PRs, missing features, and slow reviews,
>>> which is understandable to some extent... it is not only about Spark, but
>>> things can certainly be improved for this project in particular, as
>>> already stated.
>>>
>>> On Fri, Oct 7, 2016 at 11:14 PM, Cody Koeninger 
>>> wrote:
>>>
 +1 to adding an SIP label and linking it from the website.  I think it
 needs

 - template that focuses it towards soliciting user goals / non goals
 - clear resolution as to which strategy was chosen to pursue.  I'd
 recommend a vote.

 Matei asked me to clarify what I meant by changing interfaces; I think
 it's directly relevant to the SIP idea, so I'll clarify here and split
 a thread for the other discussion per Nicholas' request.

 I meant changing public user interfaces.  I think the first design is
 unlikely to be right, because it's done at a time when you have the
 least information.  As a user, I find it considerably more frustrating
 to be unable to use a tool to get my job done, than I do having to
 make minor changes to my code in order to take advantage of features.
 I've seen committers be seriously reluctant to allow changes to
 @experimental code that are needed in order for it to really work
 right.  You need to be able to iterate, and if people on both sides of
 the fence aren't going to respect that some newer apis are subject to
 change, then why even mark them as such?

 Ideally a finished SIP should give me a checklist of things that an
 implementation must do, and things that it doesn't need to do.
 Contributors/committers should be seriously discouraged from putting
 out a version 0.1 that doesn't have at least a prototype
 implementation of all those things, especially if they're then going
 to argue against interface changes necessary to get the rest of
 the things done in the 0.2 version.


 On Fri, Oct 7, 2016 at 2:18 PM, Reynold Xin 
 wrote:
 > I like the lightweight proposal to add a SIP label.
 >
 > During Spark 2.0 development, Tom (Graves) and I 

Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Hyukjin Kwon
Hi all,


I just noticed that the README.md (https://github.com/apache/spark) does not
directly describe the steps or links to follow for creating a PR or JIRA. I
know it is probably sensible to search Google for the contribution guides
before trying to make a PR/JIRA, but that does not seem to be enough, since I
see inappropriate PRs/JIRAs from time to time.

I guess the flood of JIRAs and PRs is problematic (judging from the emails on
the dev mailing list), and I think we should explicitly mention and describe
this in the README.md and pull request template[1].

(I know we have CONTRIBUTING.md[2] and the wiki[3], but it seems pretty clear
that we still have some PRs or JIRAs not following the documentation.)

So, my suggestions are as follows:

- Create a section, maybe "Contributing To Apache Spark", in the README.md,
describing the wiki and CONTRIBUTING.md[2].

- Add an explicit warning to the pull request template[1], for example:
"Please double check whether your pull request is from a branch to a branch.
In most cases, this change is not appropriate. Please ask on the mailing list
(http://spark.apache.org/community.html) if you are not sure."

[1]https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
[2]https://github.com/apache/spark/blob/master/CONTRIBUTING.md
[3]https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage


Thank you all.


Re: PSA: JIRA resolutions and meanings

2016-10-09 Thread Cody Koeninger
That's awesome Sean, very clear.

One minor thing: non-committers can't change the Assignee field, as far as I know.

On Oct 9, 2016 3:40 AM, "Sean Owen"  wrote:

I added a variant on this text to
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark#ContributingtoSpark-ContributingtoJIRAMaintenance


On Sat, Oct 8, 2016 at 10:09 AM Sean Owen  wrote:

> That flood of emails means several people (Xiao, Holden mostly AFAICT)
> have been updating the status of old JIRAs. Thank you, I think that really
> does help.
>
> I have a suggested set of conventions I've been using, just to bring some
> order to the resolutions. It helps because JIRA functions as a huge archive
> of decisions and the more accurately we can record that the better. What do
> people think of this?
>
> - Resolve as Fixed if there's a change you can point to that resolved the
> issue
> - If the issue is a proper subset of another issue, mark it a Duplicate of
> that issue (rather than the other way around)
> - If it's probably resolved, but not obvious what fixed it or when, then
> Cannot Reproduce or Not a Problem
> - Obsolete issue? Not a Problem
> - If it's a coherent issue but does not seem like there is support or
> interest in acting on it, then Won't Fix
> - If the issue doesn't make sense (non-Spark issue, etc) then Invalid
> - I tend to mark Umbrellas as "Done" when done if they're just containers
> - Try to set Fix version
> - Try to set Assignee to the person who most contributed to the
> resolution. Usually the person who opened the PR. Strong preference for
> ties going to the more 'junior' contributor
>
> The only ones I think are sort of important are getting the Duplicate
> pointers right, and possibly making sure that Fixed issues have a clear
> path to finding what change fixed it and when. The rest doesn't matter much.
>
>


Re: Spark Improvement Proposals

2016-10-09 Thread Ofir Manor
This is a great discussion!
Maybe you could have a look at Kafka's process - it also uses Rejected
Alternatives and I personally find it very clear actually (the link also
leads to all KIPs):

https://cwiki.apache.org/confluence/display/KAFKA/Kafka+Improvement+Proposals
Cody - maybe you could take one of the open issues and write a sample
proposal? A concrete example might make it clearer for those who see this
for the first time. Maybe the Kafka offset discussion or some other
Kafka/Structured Streaming open issue? Will that be helpful?

Ofir Manor

Co-Founder & CTO | Equalum

Mobile: +972-54-7801286 | Email: ofir.ma...@equalum.io

On Mon, Oct 10, 2016 at 12:36 AM, Matei Zaharia 
wrote:

> Yup, this is the stuff that I found unclear. Thanks for clarifying here,
> but we should also clarify it in the writeup. In particular:
>
> - Goals needs to be about user-facing behavior ("people" is broad)
>
> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up
> one of these and say "Spark's developers have officially rejected X, which
> our awesome system has".
>
> - For user-facing stuff, I think you need a section on API. Virtually all
> other *IPs I've seen have that.
>
> - I'm still not sure why the strategy section is needed if the purpose is
> to define user-facing behavior -- unless this is the strategy for setting
> the goals or for defining the API. That sounds squarely like a design doc
> issue. In some sense, who cares whether the proposal is technically
> feasible right now? If it's infeasible, that will be discovered later
> during design and implementation. Same thing with rejected strategies --
> listing some of those is definitely useful sometimes, but if you make this
> a *required* section, people are just going to fill it in with bogus stuff
> (I've seen this happen before).
>
> Matei
>
> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger  wrote:
> >
> > So to focus the discussion on the specific strategy I'm suggesting,
> > documented at
> >
> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> >
> > "Goals: What must this allow people to do, that they can't currently?"
> >
> > Is it unclear that this is focusing specifically on people-visible
> behavior?
> >
> > Rejected goals -  are important because otherwise people keep trying
> > to argue about scope.  Of course you can change things later with a
> > different SIP and different vote, the point is to focus.
> >
> > Use cases - are something that people are going to bring up in
> > discussion.  If they aren't clearly documented as a goal ("This must
> > allow me to connect using SSL"), they should be added.
> >
> > Internal architecture - if the people who need specific behavior are
> > implementers of other parts of the system, that's fine.
> >
> > Rejected strategies - If you have none of these, you have no evidence
> > that the proponent didn't just go with the first thing they had in
> > mind (or have already implemented), which is a big problem currently.
> > Approval isn't binding as to specifics of implementation, so these
> > aren't handcuffs.  The goals are the contract, the strategy is
> > evidence that contract can actually be met.
> >
> > Design docs - I'm not touching design docs.  The markdown file I
> > linked specifically says of the strategy section "This is not a full
> > design document."  Is this unclear?  Design docs can be worked on
> > obviously, but that's not what I'm concerned with here.
> >
> >
> >
> >
> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia 
> wrote:
> >> Hi Cody,
> >>
> >> I think this would be a lot more concrete if we had a more detailed
> template
> >> for SIPs. Right now, it's not super clear what's in scope -- e.g. are
> they
> >> a way to solicit feedback on the user-facing behavior or on the
> internals?
> >> "Goals" can cover both things. I've been thinking of SIPs more as
> Product
> >> Requirements Docs (PRDs), which focus on *what* a code change should do
> as
> >> opposed to how.
> >>
> >> In particular, here are some things that you may or may not consider in
> >> scope for SIPs:
> >>
> >> - Goals and non-goals: This is definitely in scope, and IMO should
> focus on
> >> user-visible behavior (e.g. "system supports SQL window functions" or
> >> "system continues working if one node fails"). BTW I wouldn't say
> "rejected
> >> goals" because some of them might become goals later, so we're not
> >> definitively rejecting them.
> >>
> >> - Public API: Probably should be included in most SIPs unless it's too
> large
> >> to fully specify then (e.g. "let's add an ML library").
> >>
> >> - Use cases: I usually find this very useful in PRDs to better
> communicate
> >> the goals.
> >>
> >> - Internal architecture: This is usually *not* a thing users can easily
> >> comment on and it sounds more like a design doc item. Of course it's
> >> important to show that the SIP is 

Re: Spark Improvement Proposals

2016-10-09 Thread Nicholas Chammas
On Sun, Oct 9, 2016 at 5:19 PM Cody Koeninger  wrote:

> Regarding name, if the SIP overlap is a concern, we can pick a different
> name.
>
> My tongue-in-cheek suggestion would be
>
> Spark Lightweight Improvement process (SPARKLI)
>

If others share my minor concern about the SIP name, I propose Spark
Enhancement Proposal (SEP), taking inspiration from the Python Enhancement
Proposal name.

So if we're going to number proposals like other projects do, they'd be
numbered SEP-1, SEP-2, etc. This avoids the naming conflict with Scala SIPs.

Another way to avoid a conflict is to stick with "Spark Improvement
Proposal" but use SPIP as the acronym. So SPIP-1, SPIP-2, etc.

Anyway, it's not a big deal. I just wanted to raise this point.

Nick


Re: Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Reynold Xin
GitHub already links to CONTRIBUTING.md -- of course, a lot of people
ignore that. One thing we can do is add an explicit link to the wiki
contributing page in the template (but note that even that introduces some
overhead for every pull request).

Aside from that, I am not sure if the other suggestions in the JIRA ticket
are necessary. For example, the issue with creating a pull request from one
branch to another is a problem, but it happens perhaps less than once a
week and is trivially closeable. Adding an explicit warning there will fix
some cases, but won't entirely eliminate the problem (because I'm sure a
lot of people still don't read the template), and will introduce another
overhead for everybody who submits the proper way.


On Sun, Oct 9, 2016 at 10:14 AM, Felix Cheung 
wrote:

> Should we just link to
>
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>
>
>
>
> On Sun, Oct 9, 2016 at 10:09 AM -0700, "Hyukjin Kwon" wrote:
>
> Thanks for confirming this, Sean. I filed this in
> https://issues.apache.org/jira/browse/SPARK-17840
>
> I would appreciate it if anyone with better writing skills than mine tried
> to fix this.
>
> I don't want reviewers to have to spend effort correcting the grammar.
>
>
> On 10 Oct 2016 1:34 a.m., "Sean Owen"  wrote:
>
>> Yes, it's really CONTRIBUTING.md that's more relevant, because GitHub
>> displays a link to it when opening pull requests:
>> https://github.com/apache/spark/blob/master/CONTRIBUTING.md
>> There is also the pull request template:
>> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>>
>> I wouldn't want to duplicate info too much, but more pointers to a single
>> source of information seems OK. Although I don't know if it will help much,
>> sure, pointers from README.md are OK.
>>
>> On Sun, Oct 9, 2016 at 3:47 PM Hyukjin Kwon  wrote:
>>
>>> Hi all,
>>>
>>>
>>> I just noticed that the README.md (https://github.com/apache/spark) does
>>> not directly describe the steps or links to follow for creating a PR or
>>> JIRA. I know it is probably sensible to search Google for the contribution
>>> guides before trying to make a PR/JIRA, but that does not seem to be
>>> enough, since I see inappropriate PRs/JIRAs from time to time.
>>>
>>> I guess the flood of JIRAs and PRs is problematic (judging from the emails
>>> on the dev mailing list), and I think we should explicitly mention and
>>> describe this in the README.md and pull request template[1].
>>>
>>> (I know we have CONTRIBUTING.md[2] and the wiki[3], but it seems pretty
>>> clear that we still have some PRs or JIRAs not following the documentation.)
>>>
>>> So, my suggestions are as follows:
>>>
>>> - Create a section, maybe "Contributing To Apache Spark", in the README.md,
>>> describing the wiki and CONTRIBUTING.md[2].
>>>
>>> - Add an explicit warning to the pull request template[1], for example:
>>> "Please double check whether your pull request is from a branch to a
>>> branch. In most cases, this change is not appropriate. Please ask on the
>>> mailing list (http://spark.apache.org/community.html) if you are not
>>> sure."
>>>
>>> [1]https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>>> [2]https://github.com/apache/spark/blob/master/CONTRIBUTING.md
>>> [3]https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage
>>>
>>>
>>> Thank you all.
>>>
>>


Re: Suggestion in README.md for guiding pull requests/JIRAs (probably about linking CONTRIBUTING.md or wiki)

2016-10-09 Thread Reynold Xin
Actually let's move the discussion to the JIRA ticket, given there is a
ticket.


On Sun, Oct 9, 2016 at 5:36 PM, Reynold Xin  wrote:

> GitHub already links to CONTRIBUTING.md -- of course, a lot of people
> ignore that. One thing we can do is add an explicit link to the wiki
> contributing page in the template (but note that even that introduces some
> overhead for every pull request).
>
> Aside from that, I am not sure if the other suggestions in the JIRA ticket
> are necessary. For example, the issue with creating a pull request from one
> branch to another is a problem, but it happens perhaps less than once a
> week and is trivially closeable. Adding an explicit warning there will fix
> some cases, but won't entirely eliminate the problem (because I'm sure a
> lot of people still don't read the template), and will introduce another
> overhead for everybody who submits the proper way.
>
>
> On Sun, Oct 9, 2016 at 10:14 AM, Felix Cheung 
> wrote:
>
>> Should we just link to
>>
>> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
>>
>>
>>
>>
>> On Sun, Oct 9, 2016 at 10:09 AM -0700, "Hyukjin Kwon" <
>> gurwls...@gmail.com> wrote:
>>
>> Thanks for confirming this, Sean. I filed this in
>> https://issues.apache.org/jira/browse/SPARK-17840
>>
>> I would appreciate it if anyone with better writing skills than mine tried
>> to fix this.
>>
>> I don't want reviewers to have to spend effort correcting the grammar.
>>
>>
>> On 10 Oct 2016 1:34 a.m., "Sean Owen"  wrote:
>>
>>> Yes, it's really CONTRIBUTING.md that's more relevant, because GitHub
>>> displays a link to it when opening pull requests:
>>> https://github.com/apache/spark/blob/master/CONTRIBUTING.md
>>> There is also the pull request template:
>>> https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
>>>
>>> I wouldn't want to duplicate info too much, but more pointers to a
>>> single source of information seems OK. Although I don't know if it will
>>> help much, sure, pointers from README.md are OK.
>>>
>>> On Sun, Oct 9, 2016 at 3:47 PM Hyukjin Kwon  wrote:
>>>
 Hi all,


 I just noticed that the README.md (https://github.com/apache/spark) does
 not directly describe the steps or links to follow for creating a PR or
 JIRA. I know it is probably sensible to search Google for the contribution
 guides before trying to make a PR/JIRA, but that does not seem to be
 enough, since I see inappropriate PRs/JIRAs from time to time.

 I guess the flood of JIRAs and PRs is problematic (judging from the
 emails on the dev mailing list), and I think we should explicitly mention
 and describe this in the README.md and pull request template[1].

 (I know we have CONTRIBUTING.md[2] and the wiki[3], but it seems pretty
 clear that we still have some PRs or JIRAs not following the documentation.)

 So, my suggestions are as follows:

 - Create a section, maybe "Contributing To Apache Spark", in the README.md,
 describing the wiki and CONTRIBUTING.md[2].

 - Add an explicit warning to the pull request template[1], for example:
 "Please double check whether your pull request is from a branch to a
 branch. In most cases, this change is not appropriate. Please ask on the
 mailing list (http://spark.apache.org/community.html) if you are not
 sure."

 [1]https://github.com/apache/spark/blob/master/.github/PULL_REQUEST_TEMPLATE
 [2]https://github.com/apache/spark/blob/master/CONTRIBUTING.md
 [3]https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage


 Thank you all.

>>>
>


SPARK-17845 - window function frame boundary API

2016-10-09 Thread Reynold Xin
Hi all,

I tried to use the window function DataFrame API this weekend and found it
awkward to use, especially with respect to specifying frame boundaries. I
wrote down some options here and am curious your thoughts. If you have
suggestions on the API beyond what's already listed in the JIRA ticket, do
bring them up too.

Please comment on the JIRA ticket directly:
https://issues.apache.org/jira/browse/SPARK-17845


I've attached the content of the JIRA ticket here to save you a click:


ANSI SQL uses the following to specify the frame boundaries for window
functions:

ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING

In Spark's DataFrame API, we use integer values to indicate relative
position:

   - 0 means "CURRENT ROW"
   - -1 means "1 PRECEDING"
   - Long.MinValue means "UNBOUNDED PRECEDING"
   - Long.MaxValue to indicate "UNBOUNDED FOLLOWING"

// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
Window.rowsBetween(-3, +3)
// ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
Window.rowsBetween(Long.MinValue, -3)
// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Window.rowsBetween(Long.MinValue, 0)
// ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
Window.rowsBetween(0, Long.MaxValue)
// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Window.rowsBetween(Long.MinValue, Long.MaxValue)

I think using numeric values to indicate relative positions is actually a
good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate
unbounded ends is pretty confusing:

1. The API is not self-evident. There is no way for a new user to figure
out how to indicate an unbounded frame by looking at just the API. The user
has to read the doc to figure this out.
2. It is weird that Long.MinValue or Long.MaxValue has some special meaning.
3. Different languages have different min/max values, e.g. in Python we use
-sys.maxsize and +sys.maxsize.

To make this API less confusing, we have a few options:

Option 1. Add the following (additional) methods:

// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
Window.rowsBetween(-3, +3)  // this one exists already
// ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
Window.rowsBetweenUnboundedPrecedingAnd(-3)
// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
// ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
Window.rowsBetweenCurrentRowAndUnboundedFollowing()
// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()

This is obviously very verbose, but is very similar to how these functions
are done in SQL, and is perhaps the most obvious to end users, especially
if they come from a SQL background.

Option 2. Decouple the specification for frame begin and frame end into two
functions. Assume the boundary is unlimited unless specified.

// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
Window.rowsFrom(-3).rowsTo(3)
// ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
Window.rowsTo(-3)
// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Window.rowsToCurrent() or Window.rowsTo(0)
// ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
Window.rowsFromCurrent() or Window.rowsFrom(0)
// ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
// no need to specify

If we go with option 2, we should throw exceptions if users specify
multiple from's or to's. A variant of option 2 is to require explicit
specification of begin/end even in the case of an unbounded boundary, e.g.:

Window.rowsFromBeginning().rowsTo(-3)
or
Window.rowsFromUnboundedPreceding().rowsTo(-3)
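
For concreteness, here is how the status quo reads in a complete snippet,
using the existing Spark 2.0 Scala API (a DataFrame df with "dept" and
"salary" columns is assumed purely for illustration):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

// UNBOUNDED PRECEDING .. CURRENT ROW, expressed with the magic sentinel
val w = Window.partitionBy("dept").orderBy("salary")
  .rowsBetween(Long.MinValue, 0)

// Running average per department; nothing in the call above hints that
// Long.MinValue means "unbounded".
val result = df.withColumn("running_avg", avg("salary").over(w))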


Re: SPARK-17845 - window function frame boundary API

2016-10-09 Thread ayan guha
Hi Reynold

Thanks for asking. I am from the SQL world and use Spark SQL with analytical
functions pretty heavily.

IMHO, Window.rowsBetween() as a function name looks fine. What I would
propose would be:

Window.rowsBetween(startFrom=UNBOUNDED, endTo=CURRENT_ROW, preceding=0, following=0)


startFrom, endTo: determine the range.

preceding, following: anchor on the current row, thus altering the range.


Calls:


//ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
Window.rowsBetween(preceding=3, following=3)

//ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
Window.rowsBetween(startFrom=UNBOUNDED, preceding=3)

//ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
Window.rowsBetween(startFrom=UNBOUNDED, endTo=CURRENT_ROW)

//ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
Window.rowsBetween(startFrom=CURRENT_ROW, endTo=UNBOUNDED)

//ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
Window.rowsBetween(startFrom=UNBOUNDED, endTo=UNBOUNDED)


One missing scenario (I think that is a valid one)


//ROWS BETWEEN 3 FOLLOWING AND UNBOUNDED FOLLOWING (the counterpart of (2))
Window.rowsBetween(endTo=UNBOUNDED, following=3)


This will be closer to SQL options, IMHO.


Thoughts?
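
To make the signature concrete, here is a minimal Scala sketch of the idea;
the Anchor sentinels, names, and defaults are illustrative only, not an
actual Spark API:

// Hypothetical sketch -- illustrative names, not Spark code.
sealed trait Anchor
case object UNBOUNDED extends Anchor
case object CURRENT_ROW extends Anchor

// A frame described by two anchors plus offsets relative to the current row.
case class RowFrame(startFrom: Anchor, endTo: Anchor,
                    preceding: Int, following: Int)

def rowsBetween(startFrom: Anchor = UNBOUNDED,
                endTo: Anchor = CURRENT_ROW,
                preceding: Int = 0,
                following: Int = 0): RowFrame =
  RowFrame(startFrom, endTo, preceding, following)

// ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
val between3And3 = rowsBetween(preceding = 3, following = 3)
// ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
val upToCurrent = rowsBetween(startFrom = UNBOUNDED, endTo = CURRENT_ROW)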


On Mon, Oct 10, 2016 at 3:50 PM, Reynold Xin  wrote:

> Hi all,
>
> I tried to use the window function DataFrame API this weekend and found it
> awkward to use, especially with respect to specifying frame boundaries. I
> wrote down some options here and am curious your thoughts. If you have
> suggestions on the API beyond what's already listed in the JIRA ticket, do
> bring them up too.
>
> Please comment on the JIRA ticket directly:
> https://issues.apache.org/jira/browse/SPARK-17845
>
>
> I've attached the content of the JIRA ticket here to save you a click:
>
>
> ANSI SQL uses the following to specify the frame boundaries for window
> functions:
>
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
>
> In Spark's DataFrame API, we use integer values to indicate relative
> position:
>
>- 0 means "CURRENT ROW"
>- -1 means "1 PRECEDING"
>- Long.MinValue means "UNBOUNDED PRECEDING"
>- Long.MaxValue to indicate "UNBOUNDED FOLLOWING"
>
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
>
> I think using numeric values to indicate relative positions is actually a
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate
> unbounded ends is pretty confusing:
>
> 1. The API is not self-evident. There is no way for a new user to figure
> out how to indicate an unbounded frame by looking at just the API. The user
> has to read the doc to figure this out.
> 2. It is weird that Long.MinValue or Long.MaxValue has some special meaning.
> 3. Different languages have different min/max values, e.g. in Python we
> use -sys.maxsize and +sys.maxsize.
>
> To make this API less confusing, we have a few options:
>
> Option 1. Add the following (additional) methods:
>
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
>
> This is obviously very verbose, but is very similar to how these functions
> are done in SQL, and is perhaps the most obvious to end users, especially
> if they come from a SQL background.
>
> Option 2. Decouple the specification for frame begin and frame end into
> two functions. Assume the boundary is unlimited unless specified.
>
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
>
> If we go with option 2, we should throw exceptions if users specify
> multiple from's or to's. A variant of option 2 is to require explicit
> specification of begin/end even in the case of an unbounded boundary.

Re: This Exception has been really hard to trace

2016-10-09 Thread kant kodali
Hi Reynold,
Actually, I did that well before posting my question here.

Thanks,
kant
On Sun, Oct 9, 2016 8:48 PM, Reynold Xin r...@databricks.com
wrote:
You should probably check with DataStax, who build the Cassandra connector for
Spark.


On Sun, Oct 9, 2016 at 8:13 PM, kant kodali   wrote:

I tried spanBy, but it looks like there is a strange error happening no matter
which way I try it, like the one described here for the Java solution:

http://qaoverflow.com/question/how-to-use-spanby-in-java/

java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to field
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type
scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

JavaPairRDD<ByteBuffer, Iterable<CassandraRow>> cassandraRowsRDD =
    javaFunctions(sc).cassandraTable("test", "hello")
        .select("col1", "col2", "col3")
        .spanBy(new Function<CassandraRow, ByteBuffer>() {
            @Override
            public ByteBuffer call(CassandraRow v1) {
                return v1.getBytes("rowkey");
            }
        }, ByteBuffer.class);

And then when I do this, here is where the problem occurs:

List<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
    cassandraRowsRDD.collect(); // ERROR OCCURS HERE
Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple =
    listOftuples.iterator().next();
ByteBuffer partitionKey = tuple._1();
for (CassandraRow cassandraRow : tuple._2()) {
    System.out.println(cassandraRow.getLong("col1"));
}
So I tried this, and the same error occurs:

Iterable<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
    cassandraRowsRDD.collect(); // ERROR OCCURS HERE
Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple =
    listOftuples.iterator().next();
ByteBuffer partitionKey = tuple._1();
for (CassandraRow cassandraRow : tuple._2()) {
    System.out.println(cassandraRow.getLong("col1"));
}
Although I understand that ByteBuffers aren't serializable, I didn't get any
not-serializable exception; still, I went ahead and changed everything to
byte[], so there are no more ByteBuffers in the code.
I have also tried cassandraRowsRDD.collect().forEach() and
cassandraRowsRDD.stream().forEachPartition() and the same exact error occurs.
I am running everything locally in standalone mode, so my Spark cluster is
just running on localhost.
Scala code runner version 2.11.8  // when I run scala -version or even
./spark-shell

compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.0.0'
compile group: 'com.datastax.spark', name: 'spark-cassandra-connector_2.11',
version: '2.0.0-M3'

So I don't see anything wrong with these versions.
2) I am bundling everything into one jar, and so far it has worked out well
except for this error.
I am using Java 8 and Gradle.

any ideas on how I can fix this?

This Exception has been really hard to trace

2016-10-09 Thread kant kodali
I tried spanBy, but it looks like there is a strange error happening no matter
which way I try it, like the one described here for the Java solution:

http://qaoverflow.com/question/how-to-use-spanby-in-java/

java.lang.ClassCastException: cannot assign instance of
scala.collection.immutable.List$SerializationProxy to field
org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_ of type
scala.collection.Seq in instance of org.apache.spark.rdd.MapPartitionsRDD

JavaPairRDD<ByteBuffer, Iterable<CassandraRow>> cassandraRowsRDD =
    javaFunctions(sc).cassandraTable("test", "hello")
        .select("col1", "col2", "col3")
        .spanBy(new Function<CassandraRow, ByteBuffer>() {
            @Override
            public ByteBuffer call(CassandraRow v1) {
                return v1.getBytes("rowkey");
            }
        }, ByteBuffer.class);

And then when I do this, here is where the problem occurs:

List<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
    cassandraRowsRDD.collect(); // ERROR OCCURS HERE
Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple =
    listOftuples.iterator().next();
ByteBuffer partitionKey = tuple._1();
for (CassandraRow cassandraRow : tuple._2()) {
    System.out.println(cassandraRow.getLong("col1"));
}
So I tried this, and the same error occurs:

Iterable<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
    cassandraRowsRDD.collect(); // ERROR OCCURS HERE
Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple =
    listOftuples.iterator().next();
ByteBuffer partitionKey = tuple._1();
for (CassandraRow cassandraRow : tuple._2()) {
    System.out.println(cassandraRow.getLong("col1"));
}
Although I understand that ByteBuffers aren't serializable, I didn't get any
not-serializable exception; still, I went ahead and changed everything to
byte[], so there are no more ByteBuffers in the code.
I have also tried cassandraRowsRDD.collect().forEach() and
cassandraRowsRDD.stream().forEachPartition() and the same exact error occurs.
I am running everything locally in standalone mode, so my Spark cluster is
just running on localhost.
Scala code runner version 2.11.8  // when I run scala -version or even
./spark-shell

compile group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-streaming_2.11', version: '2.0.0'
compile group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.0.0'
compile group: 'com.datastax.spark', name: 'spark-cassandra-connector_2.11',
version: '2.0.0-M3'

So I don't see anything wrong with these versions.
2) I am bundling everything into one jar, and so far it has worked out well
except for this error.
I am using Java 8 and Gradle.

any ideas on how I can fix this?
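
For reference, here is a minimal Scala sketch of the same query using the
connector's Scala API (assuming spark-cassandra-connector 2.0.0-M3, an
existing SparkContext sc, and the same test/hello table; untested here):

import com.datastax.spark.connector._

// Group rows sharing the same partition-key bytes, as in the Java snippet
// above; spanBy comes from the connector's RDD implicits.
val grouped = sc.cassandraTable("test", "hello")
  .select("col1", "col2", "col3")
  .spanBy(row => row.getBytes("rowkey"))

grouped.collect().foreach { case (partitionKey, rows) =>
  rows.foreach(row => println(row.getLong("col1")))
}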

Re: This Exception has been really hard to trace

2016-10-09 Thread Reynold Xin
You should probably check with DataStax, who build the Cassandra connector
for Spark.


On Sun, Oct 9, 2016 at 8:13 PM, kant kodali  wrote:

>
> I tried spanBy, but it looks like a strange error happens no matter
> which way I try it, like the one described here for the Java solution.
>
> http://qaoverflow.com/question/how-to-use-spanby-in-java/
>
>
> *java.lang.ClassCastException: cannot assign instance of
> scala.collection.immutable.List$SerializationProxy to field
> org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$dependencies_
> of type scala.collection.Seq in instance of
> org.apache.spark.rdd.MapPartitionsRDD*
>
>
> JavaPairRDD<ByteBuffer, Iterable<CassandraRow>> cassandraRowsRDD =
> javaFunctions(sc).cassandraTable("test", "hello")
>         .select("col1", "col2", "col3")
>         .spanBy(new Function<CassandraRow, ByteBuffer>() {
>             @Override
>             public ByteBuffer call(CassandraRow v1) {
>                 return v1.getBytes("rowkey");
>             }
>         }, ByteBuffer.class);
>
>
> And then when I do this, here is where the problem occurs:
>
> List<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
> cassandraRowsRDD.collect(); // ERROR OCCURS HERE
> Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple =
> listOftuples.iterator().next();
> ByteBuffer partitionKey = tuple._1();
> for (CassandraRow cassandraRow : tuple._2()) {
>     System.out.println(cassandraRow.getLong("col1"));
> }
>
> So I tried this, and the same error occurs:
>
> Iterable<Tuple2<ByteBuffer, Iterable<CassandraRow>>> listOftuples =
> cassandraRowsRDD.collect(); // ERROR OCCURS HERE
> Tuple2<ByteBuffer, Iterable<CassandraRow>> tuple =
> listOftuples.iterator().next();
> ByteBuffer partitionKey = tuple._1();
> for (CassandraRow cassandraRow : tuple._2()) {
>     System.out.println(cassandraRow.getLong("col1"));
> }
>
> Although I understand that ByteBuffers aren't serializable, I didn't get
> any NotSerializableException; still, I went ahead and *changed
> everything to byte[] so no more ByteBuffers in the code.*
>
> I have also tried cassandraRowsRDD.collect().forEach() and
> cassandraRowsRDD.stream().forEachPartition() and the same exact error
> occurs.
>
> I am running everything locally and in standalone mode, so my Spark
> cluster is just running on localhost.
>
> Scala code runner version 2.11.8  // when I run scala -version or even
> ./spark-shell
>
>
> compile group: 'org.apache.spark' name: 'spark-core_2.11' version: '2.0.0'
> compile group: 'org.apache.spark' name: 'spark-streaming_2.11' version:
> '2.0.0'
> compile group: 'org.apache.spark' name: 'spark-sql_2.11' version: '2.0.0'
> compile group: 'com.datastax.spark' name: 'spark-cassandra-connector_2.11'
> version: '2.0.0-M3'
>
>
> So I don't see anything wrong with these versions.
>
> 2) I am bundling everything into one jar, and so far it has worked out well
> except for this error.
> I am using Java 8 and Gradle.
>
>
> Any ideas on how I can fix this?
>


Re: Spark Improvement Proposals

2016-10-09 Thread Nicholas Chammas
   - Rejected strategies: I personally wouldn’t put this, because what’s
   the point of voting to reject a strategy before you’ve really begun
   designing and implementing something? What if you discover that the
   strategy is actually better when you start doing stuff?

I would guess the point is to document alternatives that were discussed and
rejected, so that later on people can be pointed to that discussion and the
devs don’t have to repeat themselves unnecessarily every time someone comes
along and asks “Why didn’t you do this other thing?” That doesn’t mean a
rejected proposal can’t later be revisited and the SIP can’t be updated.

For reference from the Python community, PEP 492, a Python Enhancement
Proposal for adding async and await syntax and “first-class” coroutines to
Python, has a section on rejected ideas for the new syntax. It captures a
summary of what the devs discussed, but it doesn’t mean the PEP can’t be
updated and a previously rejected proposal can’t be revived.

At least in the Python community, a PEP serves not just as formal starting
point for a proposal (the “real” starting point is usually a discussion on
python-ideas or python-dev), but also as documentation of what was agreed
on and a living “spec” of sorts. So PEPs sometimes get updated years after
they are approved when revisions are agreed upon. PEPs are also intended
for wide consumption, vs. bug tracker issues which the broader Python dev
community are not expected to follow closely.

Dunno if we want to follow a similar pattern for Spark, since the project’s
needs are different. But the Python community has used PEPs to help
organize and steer development since 2000; there are plenty of examples
there we can probably take inspiration from.

By the way, can we call these things something other than Spark Improvement
Proposals? The acronym, SIP, conflicts with Scala SIPs. Since the Scala and
Spark communities have a lot of overlap, we don’t want, for example, names
like “SIP-10” to have an ambiguous meaning.

Nick
​

On Sun, Oct 9, 2016 at 3:34 PM Matei Zaharia 
wrote:

> Hi Cody,
>
> I think this would be a lot more concrete if we had a more detailed
> template for SIPs. Right now, it's not super clear what's in scope -- e.g.
> are they a way to solicit feedback on the user-facing behavior or on the
> internals? "Goals" can cover both things. I've been thinking of SIPs more
> as Product Requirements Docs (PRDs), which focus on *what* a code change
> should do as opposed to how.
>
> In particular, here are some things that you may or may not consider in
> scope for SIPs:
>
> - Goals and non-goals: This is definitely in scope, and IMO should focus
> on user-visible behavior (e.g. "system supports SQL window functions" or
> "system continues working if one node fails"). BTW I wouldn't say "rejected
> goals" because some of them might become goals later, so we're not
> definitively rejecting them.
>
> - Public API: Probably should be included in most SIPs unless it's too
> large to fully specify then (e.g. "let's add an ML library").
>
> - Use cases: I usually find this very useful in PRDs to better communicate
> the goals.
>
> - Internal architecture: This is usually *not* a thing users can easily
> comment on and it sounds more like a design doc item. Of course it's
> important to show that the SIP is feasible to implement. One exception,
> however, is that I think we'll have some SIPs primarily on internals (e.g.
> if somebody wants to refactor Spark's query optimizer or something).
>
> - Rejected strategies: I personally wouldn't put this, because what's the
> point of voting to reject a strategy before you've really begun designing
> and implementing something? What if you discover that the strategy is
> actually better when you start doing stuff?
>
> At a super high level, it depends on whether you want the SIPs to be PRDs
> for getting some quick feedback on the goals of a feature before it is
> designed, or something more like full-fledged design docs (just a more
> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
> actually seem to be more like design docs. This can work too but it does
> require more work from the proposer and it can lead to the same problems
> you mentioned with people already having a design and implementation in
> mind.
>
> Basically, the question is, are you trying to iterate faster on design by
> adding a step for user feedback earlier? Or are you just trying to make
> design docs for key features more visible (and their approval more formal)?
>
> BTW note that in either case, I'd like to have a template for design docs
> too, which should also include goals. I think that would've avoided some of
> the issues you brought up.
>
> Matei
>
> On Oct 9, 2016, at 10:40 AM, Cody Koeninger 

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
If there's confusion there, the document is specifically what I'm
proposing.  The email is just by way of introduction.

On Sun, Oct 9, 2016 at 3:47 PM, Nicholas Chammas  wrote:

> Oh, hmm… I guess I’m a little confused on the relation between Cody’s
> email and the document he linked to, which says:
>
> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>
> SIPs should be used for significant user-facing or cross-cutting changes,
> not day-to-day improvements. When in doubt, if a committer thinks a change
> needs an SIP, it does.
>
> Nick
> ​
>
> On Sun, Oct 9, 2016 at 4:40 PM Matei Zaharia 
> wrote:
>
>> Yup, but the example you gave is for alternatives about *user-facing
>> behavior*, not implementation. The current SIP doc describes "strategy"
>> more as implementation strategy. I'm just saying there are different
>> possible goals for these types of docs.
>>
>> BTW, PEPs and Scala SIPs focus primarily on user-facing behavior, but
>> also require a reference implementation. This is a bit different from what
>> Cody had in mind, I think.
>>
>>
>> Matei
>>
>> On Oct 9, 2016, at 1:25 PM, Nicholas Chammas 
>> wrote:
>>
>>
>>- Rejected strategies: I personally wouldn’t put this, because what’s
>>the point of voting to reject a strategy before you’ve really begun
>>designing and implementing something? What if you discover that the
>>strategy is actually better when you start doing stuff?
>>
>> I would guess the point is to document alternatives that were discussed
>> and rejected, so that later on people can be pointed to that discussion and
>> the devs don’t have to repeat themselves unnecessarily every time someone
>> comes along and asks “Why didn’t you do this other thing?” That doesn’t
>> mean a rejected proposal can’t later be revisited and the SIP can’t be
>> updated.
>>
>> For reference from the Python community, PEP 492, a Python Enhancement
>> Proposal for adding async and await syntax and “first-class” coroutines
>> to Python, has a section on rejected ideas for the new syntax. It
>> captures a summary of what the devs discussed, but it doesn’t mean the
>> PEP can’t be updated and a previously rejected proposal can’t be revived.
>>
>> At least in the Python community, a PEP serves not just as formal
>> starting point for a proposal (the “real” starting point is usually a
>> discussion on python-ideas or python-dev), but also as documentation of
>> what was agreed on and a living “spec” of sorts. So PEPs sometimes get
>> updated years after they are approved when revisions are agreed upon. PEPs
>> are also intended for wide consumption, vs. bug tracker issues which the
>> broader Python dev community are not expected to follow closely.
>>
>> Dunno if we want to follow a similar pattern for Spark, since the
>> project’s needs are different. But the Python community has used PEPs to
>> help organize and steer development since 2000; there are plenty of
>> examples there we can probably take inspiration from.
>>
>> By the way, can we call these things something other than Spark
>> Improvement Proposals? The acronym, SIP, conflicts with Scala SIPs. Since
>> the Scala and Spark communities have a lot of overlap, we don’t want, for
>> example, names like “SIP-10” to have an ambiguous meaning.
>>
>> Nick
>> ​
>>
>> On Sun, Oct 9, 2016 at 3:34 PM Matei Zaharia 
>> wrote:
>>
>>> Hi Cody,
>>>
>>> I think this would be a lot more concrete if we had a more detailed
>>> template for SIPs. Right now, it's not super clear what's in scope -- e.g.
>>> are they a way to solicit feedback on the user-facing behavior or on the
>>> internals? "Goals" can cover both things. I've been thinking of SIPs more
>>> as Product Requirements Docs (PRDs), which focus on *what* a code change
>>> should do as opposed to how.
>>>
>>> In particular, here are some things that you may or may not consider in
>>> scope for SIPs:
>>>
>>> - Goals and non-goals: This is definitely in scope, and IMO should focus
>>> on user-visible behavior (e.g. "system supports SQL window functions" or
>>> "system continues working if one node fails"). BTW I wouldn't say "rejected
>>> goals" because some of them might become goals later, so we're not
>>> definitively rejecting them.
>>>
>>> - Public API: Probably should be included in most SIPs unless it's too
>>> large to fully specify then (e.g. "let's add an ML library").
>>>
>>> - Use cases: I usually find this very useful in PRDs to better
>>> communicate the goals.
>>>
>>> - Internal architecture: This is usually *not* a thing users can easily
>>> comment on and it sounds more like a design doc item. Of course it's
>>> important to show that the SIP is feasible to implement. One 

Re: Spark Improvement Proposals

2016-10-09 Thread Matei Zaharia
Yup, this is the stuff that I found unclear. Thanks for clarifying here, but we 
should also clarify it in the writeup. In particular:

- Goals needs to be about user-facing behavior ("people" is broad)

- I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up one of 
these and say "Spark's developers have officially rejected X, which our awesome 
system has".

- For user-facing stuff, I think you need a section on API. Virtually all other 
*IPs I've seen have that.

- I'm still not sure why the strategy section is needed if the purpose is to 
define user-facing behavior -- unless this is the strategy for setting the 
goals or for defining the API. That sounds squarely like a design doc issue. In 
some sense, who cares whether the proposal is technically feasible right now? 
If it's infeasible, that will be discovered later during design and 
implementation. Same thing with rejected strategies -- listing some of those is 
definitely useful sometimes, but if you make this a *required* section, people 
are just going to fill it in with bogus stuff (I've seen this happen before).

Matei

> On Oct 9, 2016, at 2:14 PM, Cody Koeninger  wrote:
> 
> So to focus the discussion on the specific strategy I'm suggesting,
> documented at
> 
> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> 
> "Goals: What must this allow people to do, that they can't currently?"
> 
> Is it unclear that this is focusing specifically on people-visible behavior?
> 
> Rejected goals -  are important because otherwise people keep trying
> to argue about scope.  Of course you can change things later with a
> different SIP and different vote, the point is to focus.
> 
> Use cases - are something that people are going to bring up in
> discussion.  If they aren't clearly documented as a goal ("This must
> allow me to connect using SSL"), they should be added.
> 
> Internal architecture - if the people who need specific behavior are
> implementers of other parts of the system, that's fine.
> 
> Rejected strategies - If you have none of these, you have no evidence
> that the proponent didn't just go with the first thing they had in
> mind (or have already implemented), which is a big problem currently.
> Approval isn't binding as to specifics of implementation, so these
> aren't handcuffs.  The goals are the contract, the strategy is
> evidence that contract can actually be met.
> 
> Design docs - I'm not touching design docs.  The markdown file I
> linked specifically says of the strategy section "This is not a full
> design document."  Is this unclear?  Design docs can be worked on
> obviously, but that's not what I'm concerned with here.
> 
> 
> 
> 
> On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia  wrote:
>> Hi Cody,
>> 
>> I think this would be a lot more concrete if we had a more detailed template
>> for SIPs. Right now, it's not super clear what's in scope -- e.g. are they
>> a way to solicit feedback on the user-facing behavior or on the internals?
>> "Goals" can cover both things. I've been thinking of SIPs more as Product
>> Requirements Docs (PRDs), which focus on *what* a code change should do as
>> opposed to how.
>> 
>> In particular, here are some things that you may or may not consider in
>> scope for SIPs:
>> 
>> - Goals and non-goals: This is definitely in scope, and IMO should focus on
>> user-visible behavior (e.g. "system supports SQL window functions" or
>> "system continues working if one node fails"). BTW I wouldn't say "rejected
>> goals" because some of them might become goals later, so we're not
>> definitively rejecting them.
>> 
>> - Public API: Probably should be included in most SIPs unless it's too large
>> to fully specify then (e.g. "let's add an ML library").
>> 
>> - Use cases: I usually find this very useful in PRDs to better communicate
>> the goals.
>> 
>> - Internal architecture: This is usually *not* a thing users can easily
>> comment on and it sounds more like a design doc item. Of course it's
>> important to show that the SIP is feasible to implement. One exception,
>> however, is that I think we'll have some SIPs primarily on internals (e.g.
>> if somebody wants to refactor Spark's query optimizer or something).
>> 
>> - Rejected strategies: I personally wouldn't put this, because what's the
>> point of voting to reject a strategy before you've really begun designing
>> and implementing something? What if you discover that the strategy is
>> actually better when you start doing stuff?
>> 
>> At a super high level, it depends on whether you want the SIPs to be PRDs
>> for getting some quick feedback on the goals of a feature before it is
>> designed, or something more like full-fledged design docs (just a more
>> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
>> actually seem to be more like design docs. This can work too but it does
>> require more work from the proposer and it 

Re: Spark Improvement Proposals

2016-10-09 Thread Nicholas Chammas
Oh, hmm… I guess I’m a little confused on the relation between Cody’s email
and the document he linked to, which says:

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md#when

SIPs should be used for significant user-facing or cross-cutting changes,
not day-to-day improvements. When in doubt, if a committer thinks a change
needs an SIP, it does.

Nick
​

On Sun, Oct 9, 2016 at 4:40 PM Matei Zaharia 
wrote:

> Yup, but the example you gave is for alternatives about *user-facing
> behavior*, not implementation. The current SIP doc describes "strategy"
> more as implementation strategy. I'm just saying there are different
> possible goals for these types of docs.
>
> BTW, PEPs and Scala SIPs focus primarily on user-facing behavior, but also
> require a reference implementation. This is a bit different from what Cody
> had in mind, I think.
>
>
> Matei
>
> On Oct 9, 2016, at 1:25 PM, Nicholas Chammas 
> wrote:
>
>
>- Rejected strategies: I personally wouldn’t put this, because what’s
>the point of voting to reject a strategy before you’ve really begun
>designing and implementing something? What if you discover that the
>strategy is actually better when you start doing stuff?
>
> I would guess the point is to document alternatives that were discussed
> and rejected, so that later on people can be pointed to that discussion and
> the devs don’t have to repeat themselves unnecessarily every time someone
> comes along and asks “Why didn’t you do this other thing?” That doesn’t
> mean a rejected proposal can’t later be revisited and the SIP can’t be
> updated.
>
> For reference from the Python community, PEP 492, a Python Enhancement
> Proposal for adding async and await syntax and “first-class” coroutines
> to Python, has a section on rejected ideas for the new syntax. It
> captures a summary of what the devs discussed, but it doesn’t mean the
> PEP can’t be updated and a previously rejected proposal can’t be revived.
>
> At least in the Python community, a PEP serves not just as formal starting
> point for a proposal (the “real” starting point is usually a discussion on
> python-ideas or python-dev), but also as documentation of what was agreed
> on and a living “spec” of sorts. So PEPs sometimes get updated years after
> they are approved when revisions are agreed upon. PEPs are also intended
> for wide consumption, vs. bug tracker issues which the broader Python dev
> community are not expected to follow closely.
>
> Dunno if we want to follow a similar pattern for Spark, since the
> project’s needs are different. But the Python community has used PEPs to
> help organize and steer development since 2000; there are plenty of
> examples there we can probably take inspiration from.
>
> By the way, can we call these things something other than Spark
> Improvement Proposals? The acronym, SIP, conflicts with Scala SIPs. Since
> the Scala and Spark communities have a lot of overlap, we don’t want, for
> example, names like “SIP-10” to have an ambiguous meaning.
>
> Nick
> ​
>
> On Sun, Oct 9, 2016 at 3:34 PM Matei Zaharia 
> wrote:
>
> Hi Cody,
>
> I think this would be a lot more concrete if we had a more detailed
> template for SIPs. Right now, it's not super clear what's in scope -- e.g.
> are they a way to solicit feedback on the user-facing behavior or on the
> internals? "Goals" can cover both things. I've been thinking of SIPs more
> as Product Requirements Docs (PRDs), which focus on *what* a code change
> should do as opposed to how.
>
> In particular, here are some things that you may or may not consider in
> scope for SIPs:
>
> - Goals and non-goals: This is definitely in scope, and IMO should focus
> on user-visible behavior (e.g. "system supports SQL window functions" or
> "system continues working if one node fails"). BTW I wouldn't say "rejected
> goals" because some of them might become goals later, so we're not
> definitively rejecting them.
>
> - Public API: Probably should be included in most SIPs unless it's too
> large to fully specify then (e.g. "let's add an ML library").
>
> - Use cases: I usually find this very useful in PRDs to better communicate
> the goals.
>
> - Internal architecture: This is usually *not* a thing users can easily
> comment on and it sounds more like a design doc item. Of course it's
> important to show that the SIP is feasible to implement. One exception,
> however, is that I think we'll have some SIPs primarily on internals (e.g.
> if somebody wants to refactor Spark's query optimizer or something).
>
> - Rejected strategies: I personally wouldn't put this, because what's the
> point of voting to reject a strategy before you've really begun designing
> and implementing something? What if you 

Re: Spark Improvement Proposals

2016-10-09 Thread Matei Zaharia
Yup, but the example you gave is for alternatives about *user-facing behavior*, 
not implementation. The current SIP doc describes "strategy" more as 
implementation strategy. I'm just saying there are different possible goals for 
these types of docs.

BTW, PEPs and Scala SIPs focus primarily on user-facing behavior, but also 
require a reference implementation. This is a bit different from what Cody had 
in mind, I think.

Matei

> On Oct 9, 2016, at 1:25 PM, Nicholas Chammas  
> wrote:
> 
> Rejected strategies: I personally wouldn’t put this, because what’s the point 
> of voting to reject a strategy before you’ve really begun designing and 
> implementing something? What if you discover that the strategy is actually 
> better when you start doing stuff?
> I would guess the point is to document alternatives that were discussed and 
> rejected, so that later on people can be pointed to that discussion and the 
> devs don’t have to repeat themselves unnecessarily every time someone comes 
> along and asks “Why didn’t you do this other thing?” That doesn’t mean a 
> rejected proposal can’t later be revisited and the SIP can’t be updated.
> 
> For reference from the Python community, PEP 492, a Python Enhancement
> Proposal for adding async and await syntax and “first-class” coroutines to
> Python, has a section on rejected ideas for the new syntax. It captures a
> summary of what the devs discussed, but it doesn’t mean the PEP can’t be
> updated and a previously rejected proposal can’t be revived.
> 
> At least in the Python community, a PEP serves not just as formal starting 
> point for a proposal (the “real” starting point is usually a discussion on 
> python-ideas or python-dev), but also as documentation of what was agreed on 
> and a living “spec” of sorts. So PEPs sometimes get updated years after they 
> are approved when revisions are agreed upon. PEPs are also intended for wide 
> consumption, vs. bug tracker issues which the broader Python dev community 
> are not expected to follow closely.
> 
> Dunno if we want to follow a similar pattern for Spark, since the project’s 
> needs are different. But the Python community has used PEPs to help organize 
> and steer development since 2000; there are plenty of examples there we can 
> probably take inspiration from.
> 
> By the way, can we call these things something other than Spark Improvement
> Proposals? The acronym, SIP, conflicts with Scala SIPs. Since the Scala and
> Spark communities have a lot of overlap, we don’t want, for example, names
> like “SIP-10” to have an ambiguous meaning.
> 
> Nick
> 
> 
On Sun, Oct 9, 2016 at 3:34 PM Matei Zaharia  wrote:
> Hi Cody,
> 
> I think this would be a lot more concrete if we had a more detailed template 
> for SIPs. Right now, it's not super clear what's in scope -- e.g. are they a
> way to solicit feedback on the user-facing behavior or on the internals? 
> "Goals" can cover both things. I've been thinking of SIPs more as Product 
> Requirements Docs (PRDs), which focus on *what* a code change should do as 
> opposed to how.
> 
> In particular, here are some things that you may or may not consider in scope 
> for SIPs:
> 
> - Goals and non-goals: This is definitely in scope, and IMO should focus on 
> user-visible behavior (e.g. "system supports SQL window functions" or "system 
> continues working if one node fails"). BTW I wouldn't say "rejected goals" 
> because some of them might become goals later, so we're not definitively 
> rejecting them.
> 
> - Public API: Probably should be included in most SIPs unless it's too large 
> to fully specify then (e.g. "let's add an ML library").
> 
> - Use cases: I usually find this very useful in PRDs to better communicate 
> the goals.
> 
> - Internal architecture: This is usually *not* a thing users can easily 
> comment on and it sounds more like a design doc item. Of course it's 
> important to show that the SIP is feasible to implement. One exception, 
> however, is that I think we'll have some SIPs primarily on internals (e.g. if 
> somebody wants to refactor Spark's query optimizer or something).
> 
> - Rejected strategies: I personally wouldn't put this, because what's the 
> point of voting to reject a strategy before you've really begun designing and 
> implementing something? What if you discover that the strategy is actually 
> better when you start doing stuff?
> 
> At a super high level, it depends on whether you want the SIPs to be PRDs for 
> getting some quick feedback on the goals of a feature before it is designed, 
> or something more like full-fledged design docs (just a more visible design 
> doc for bigger changes). I looked at Kafka's KIPs, and they actually seem to 
> be more like 

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
Regarding the name: if the SIP overlap is a concern, we can pick a different
name. My tongue-in-cheek suggestion would be Spark Lightweight Improvement
process (SPARKLI).

On Sun, Oct 9, 2016 at 4:14 PM, Cody Koeninger  wrote:
> So to focus the discussion on the specific strategy I'm suggesting,
> documented at
>
> https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
>
> "Goals: What must this allow people to do, that they can't currently?"
>
> Is it unclear that this is focusing specifically on people-visible behavior?
>
> Rejected goals -  are important because otherwise people keep trying
> to argue about scope.  Of course you can change things later with a
> different SIP and different vote, the point is to focus.
>
> Use cases - are something that people are going to bring up in
> discussion.  If they aren't clearly documented as a goal ("This must
> allow me to connect using SSL"), they should be added.
>
> Internal architecture - if the people who need specific behavior are
> implementers of other parts of the system, that's fine.
>
> Rejected strategies - If you have none of these, you have no evidence
> that the proponent didn't just go with the first thing they had in
> mind (or have already implemented), which is a big problem currently.
> Approval isn't binding as to specifics of implementation, so these
> aren't handcuffs.  The goals are the contract, the strategy is
> evidence that contract can actually be met.
>
> Design docs - I'm not touching design docs.  The markdown file I
> linked specifically says of the strategy section "This is not a full
> design document."  Is this unclear?  Design docs can be worked on
> obviously, but that's not what I'm concerned with here.
>
>
>
>
> On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia  wrote:
>> Hi Cody,
>>
>> I think this would be a lot more concrete if we had a more detailed template
>> for SIPs. Right now, it's not super clear what's in scope -- e.g. are they
>> a way to solicit feedback on the user-facing behavior or on the internals?
>> "Goals" can cover both things. I've been thinking of SIPs more as Product
>> Requirements Docs (PRDs), which focus on *what* a code change should do as
>> opposed to how.
>>
>> In particular, here are some things that you may or may not consider in
>> scope for SIPs:
>>
>> - Goals and non-goals: This is definitely in scope, and IMO should focus on
>> user-visible behavior (e.g. "system supports SQL window functions" or
>> "system continues working if one node fails"). BTW I wouldn't say "rejected
>> goals" because some of them might become goals later, so we're not
>> definitively rejecting them.
>>
>> - Public API: Probably should be included in most SIPs unless it's too large
>> to fully specify then (e.g. "let's add an ML library").
>>
>> - Use cases: I usually find this very useful in PRDs to better communicate
>> the goals.
>>
>> - Internal architecture: This is usually *not* a thing users can easily
>> comment on and it sounds more like a design doc item. Of course it's
>> important to show that the SIP is feasible to implement. One exception,
>> however, is that I think we'll have some SIPs primarily on internals (e.g.
>> if somebody wants to refactor Spark's query optimizer or something).
>>
>> - Rejected strategies: I personally wouldn't put this, because what's the
>> point of voting to reject a strategy before you've really begun designing
>> and implementing something? What if you discover that the strategy is
>> actually better when you start doing stuff?
>>
>> At a super high level, it depends on whether you want the SIPs to be PRDs
>> for getting some quick feedback on the goals of a feature before it is
>> designed, or something more like full-fledged design docs (just a more
>> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
>> actually seem to be more like design docs. This can work too but it does
>> require more work from the proposer and it can lead to the same problems you
>> mentioned with people already having a design and implementation in mind.
>>
>> Basically, the question is, are you trying to iterate faster on design by
>> adding a step for user feedback earlier? Or are you just trying to make
>> design docs for key features more visible (and their approval more formal)?
>>
>> BTW note that in either case, I'd like to have a template for design docs
>> too, which should also include goals. I think that would've avoided some of
>> the issues you brought up.
>>
>> Matei
>>
>> On Oct 9, 2016, at 10:40 AM, Cody Koeninger  wrote:
>>
>> Here's my specific proposal (meta-proposal?)
>>
>> Spark Improvement Proposals (SIP)
>>
>>
>> Background:
>>
>> The current problem is that design and implementation of large features are
>> often done in private, before soliciting user feedback.
>>
>> When feedback is solicited, it is often as to detailed design specifics, not
>> focused 

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
Users instead of people, sure. Committers and contributors are (or at least
should be) a subset of users.

Non-goals, sure. I don't care what the name is, but we need to clearly say,
e.g., 'no, we are not maintaining compatibility with XYZ right now'.

API, what I care most about is whether it allows me to accomplish the goals.
Arguing about how ugly or pretty it is can be saved for design/implementation,
IMHO.

Strategy, this is necessary because otherwise goals can be out of line with
reality.  Don't propose goals you don't have at least some idea of how to
implement.

Rejected strategies, given that committers are the only ones I'm saying
should formally submit SPARKLIs or SIPs, if they put junk in a required
section then slap them down for it and tell them to fix it.

On Oct 9, 2016 4:36 PM, "Matei Zaharia"  wrote:

> Yup, this is the stuff that I found unclear. Thanks for clarifying here,
> but we should also clarify it in the writeup. In particular:
>
> - Goals needs to be about user-facing behavior ("people" is broad)
>
> - I'd rename Rejected Goals to Non-Goals. Otherwise someone will dig up
> one of these and say "Spark's developers have officially rejected X, which
> our awesome system has".
>
> - For user-facing stuff, I think you need a section on API. Virtually all
> other *IPs I've seen have that.
>
> - I'm still not sure why the strategy section is needed if the purpose is
> to define user-facing behavior -- unless this is the strategy for setting
> the goals or for defining the API. That sounds squarely like a design doc
> issue. In some sense, who cares whether the proposal is technically
> feasible right now? If it's infeasible, that will be discovered later
> during design and implementation. Same thing with rejected strategies --
> listing some of those is definitely useful sometimes, but if you make this
> a *required* section, people are just going to fill it in with bogus stuff
> (I've seen this happen before).
>
> Matei
>
> > On Oct 9, 2016, at 2:14 PM, Cody Koeninger  wrote:
> >
> > So to focus the discussion on the specific strategy I'm suggesting,
> > documented at
> >
> > https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md
> >
> > "Goals: What must this allow people to do, that they can't currently?"
> >
> > Is it unclear that this is focusing specifically on people-visible
> > behavior?
> >
> > Rejected goals -  are important because otherwise people keep trying
> > to argue about scope.  Of course you can change things later with a
> > different SIP and different vote, the point is to focus.
> >
> > Use cases - are something that people are going to bring up in
> > discussion.  If they aren't clearly documented as a goal ("This must
> > allow me to connect using SSL"), they should be added.
> >
> > Internal architecture - if the people who need specific behavior are
> > implementers of other parts of the system, that's fine.
> >
> > Rejected strategies - If you have none of these, you have no evidence
> > that the proponent didn't just go with the first thing they had in
> > mind (or have already implemented), which is a big problem currently.
> > Approval isn't binding as to specifics of implementation, so these
> > aren't handcuffs.  The goals are the contract, the strategy is
> > evidence that contract can actually be met.
> >
> > Design docs - I'm not touching design docs.  The markdown file I
> > linked specifically says of the strategy section "This is not a full
> > design document."  Is this unclear?  Design docs can be worked on
> > obviously, but that's not what I'm concerned with here.
> >
> >
> >
> >
> > On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia  wrote:
> >> Hi Cody,
> >>
> >> I think this would be a lot more concrete if we had a more detailed
> >> template for SIPs. Right now, it's not super clear what's in scope --
> >> e.g. are they a way to solicit feedback on the user-facing behavior or
> >> on the internals? "Goals" can cover both things. I've been thinking of
> >> SIPs more as Product Requirements Docs (PRDs), which focus on *what* a
> >> code change should do as opposed to how.
> >>
> >> In particular, here are some things that you may or may not consider in
> >> scope for SIPs:
> >>
> >> - Goals and non-goals: This is definitely in scope, and IMO should focus
> >> on user-visible behavior (e.g. "system supports SQL window functions" or
> >> "system continues working if one node fails"). BTW I wouldn't say
> >> "rejected goals" because some of them might become goals later, so we're
> >> not definitively rejecting them.
> >>
> >> - Public API: Probably should be included in most SIPs unless it's too
> >> large to fully specify then (e.g. "let's add an ML library").
> >>
> >> - Use cases: I usually find this very useful in PRDs to better
> >> communicate the goals.
> >>
> >> - Internal architecture: This is usually *not* a thing users 

Re: Spark Improvement Proposals

2016-10-09 Thread Cody Koeninger
So to focus the discussion on the specific strategy I'm suggesting,
documented at

https://github.com/koeninger/spark-1/blob/SIP-0/docs/spark-improvement-proposals.md

"Goals: What must this allow people to do, that they can't currently?"

Is it unclear that this is focusing specifically on people-visible behavior?

Rejected goals -  are important because otherwise people keep trying
to argue about scope.  Of course you can change things later with a
different SIP and different vote, the point is to focus.

Use cases - are something that people are going to bring up in
discussion.  If they aren't clearly documented as a goal ("This must
allow me to connect using SSL"), they should be added.

Internal architecture - if the people who need specific behavior are
implementers of other parts of the system, that's fine.

Rejected strategies - If you have none of these, you have no evidence
that the proponent didn't just go with the first thing they had in
mind (or have already implemented), which is a big problem currently.
Approval isn't binding as to specifics of implementation, so these
aren't handcuffs.  The goals are the contract, the strategy is
evidence that contract can actually be met.

Design docs - I'm not touching design docs.  The markdown file I
linked specifically says of the strategy section "This is not a full
design document."  Is this unclear?  Design docs can be worked on
obviously, but that's not what I'm concerned with here.




On Sun, Oct 9, 2016 at 2:34 PM, Matei Zaharia  wrote:
> Hi Cody,
>
> I think this would be a lot more concrete if we had a more detailed template
> for SIPs. Right now, it's not super clear what's in scope -- e.g. are  they
> a way to solicit feedback on the user-facing behavior or on the internals?
> "Goals" can cover both things. I've been thinking of SIPs more as Product
> Requirements Docs (PRDs), which focus on *what* a code change should do as
> opposed to how.
>
> In particular, here are some things that you may or may not consider in
> scope for SIPs:
>
> - Goals and non-goals: This is definitely in scope, and IMO should focus on
> user-visible behavior (e.g. "system supports SQL window functions" or
> "system continues working if one node fails"). BTW I wouldn't say "rejected
> goals" because some of them might become goals later, so we're not
> definitively rejecting them.
>
> - Public API: Probably should be included in most SIPs unless it's too large
> to fully specify then (e.g. "let's add an ML library").
>
> - Use cases: I usually find this very useful in PRDs to better communicate
> the goals.
>
> - Internal architecture: This is usually *not* a thing users can easily
> comment on and it sounds more like a design doc item. Of course it's
> important to show that the SIP is feasible to implement. One exception,
> however, is that I think we'll have some SIPs primarily on internals (e.g.
> if somebody wants to refactor Spark's query optimizer or something).
>
> - Rejected strategies: I personally wouldn't put this, because what's the
> point of voting to reject a strategy before you've really begun designing
> and implementing something? What if you discover that the strategy is
> actually better when you start doing stuff?
>
> At a super high level, it depends on whether you want the SIPs to be PRDs
> for getting some quick feedback on the goals of a feature before it is
> designed, or something more like full-fledged design docs (just a more
> visible design doc for bigger changes). I looked at Kafka's KIPs, and they
> actually seem to be more like design docs. This can work too but it does
> require more work from the proposer and it can lead to the same problems you
> mentioned with people already having a design and implementation in mind.
>
> Basically, the question is, are you trying to iterate faster on design by
> adding a step for user feedback earlier? Or are you just trying to make
> design docs for key features more visible (and their approval more formal)?
>
> BTW note that in either case, I'd like to have a template for design docs
> too, which should also include goals. I think that would've avoided some of
> the issues you brought up.
>
> Matei
>
> On Oct 9, 2016, at 10:40 AM, Cody Koeninger  wrote:
>
> Here's my specific proposal (meta-proposal?)
>
> Spark Improvement Proposals (SIP)
>
>
> Background:
>
> The current problem is that design and implementation of large features are
> often done in private, before soliciting user feedback.
>
When feedback is solicited, it is often about detailed design specifics, not
focused on goals.
>
> When implementation does take place after design, there is often
> disagreement as to what goals are or are not in scope.
>
> This results in commits that don't fully meet user needs.
>
>
> Goals:
>
> - Ensure user, contributor, and committer goals are clearly identified and
> agreed upon, before implementation takes place.
>
> - Ensure that a