Re: Feedback on MLlib roadmap process proposal

2017-02-24 Thread Nick Pentreath
FYI I've started going through a few of the top Watched JIRAs and tried to
identify those that are obviously stale and can probably be closed, to try
to clean things up a bit.


Re: Feedback on MLlib roadmap process proposal

2017-02-23 Thread Tim Hunter
As Sean wrote very nicely above, the changes made to Spark are decided in
an organic fashion based on the interests and motivations of the committers
and contributors. The case of deep learning is a good example. There is a
lot of interest, and the core algorithms could be implemented without too
much problem in a few thousand lines of Scala code. However, the
performance of such a simple implementation would be one to two orders of
magnitude slower than what you would get from the popular frameworks out there.

At this point, there are probably more man-hours invested in TensorFlow (as
an example) than in MLlib, so I think we need to be realistic about what we
can expect to achieve inside Spark. Unlike BLAS for linear algebra, there
is no agreed-upon interface for deep learning, and each of the XOnSpark
flavors explores a slightly different design. It will be interesting to see
what works well in practice. In the meantime, though, there are plenty of
things that we could do to help developers of other libraries to have a
great experience with Spark. Matei alluded to that in his Spark Summit
keynote when he mentioned better integration with low-level libraries.

Tim



Re: Feedback on MLlib roadmap process proposal

2017-02-23 Thread Nick Pentreath
Sorry for being late to the discussion. I think Joseph, Sean and others
have covered the issues well.

Overall I like the proposed cleaned-up roadmap & process (thanks Joseph!).
As for the actual critical roadmap items mentioned on SPARK-18813, I think
they make sense and I will comment a bit further on that JIRA.

I would like to encourage votes & watching for issues to give a sense of
what the community wants (I guess Vote is more explicit yet passive, while
actually Watching an issue is more informative as it may indicate a real
use case dependent on the issue?!).

I think if used well this is valuable information for contributors. Of
course not everything on that list can get done. But if I look through the
top votes or watch list, while not all of those are likely to go in, a
great many of the issues are fairly non-contentious in terms of being good
additions to the project.

Things like these are good examples IMO (I just sampled a few of them; the
list is not exhaustive):
- sample weights for RF / DT
- multi-model and/or parallel model selection
- make sharedParams public?
- multi-column support for various transformers
- incremental model training
- tree algorithm enhancements

Now, whether these can be prioritised in terms of bandwidth available to
reviewers and committers is a totally different thing. But as Sean mentions
there is some process there for trying to find the balance of the issue
being a "good thing to add", a shepherd with bandwidth & interest in the
issue to review, and the maintenance burden imposed.

Let's take Deep Learning / NN for example. Here's a good example of
something that has a lot of votes/watchers and as Sean mentions it is
something that "everyone wants someone else to implement". In this case,
much of the interest may in fact be "stale" - 2 years ago it would have
been very interesting to have a strong DL impl in Spark. Now, because there
are a plethora of very good DL libraries out there, how many of those Votes
would be "deleted"? Granted, few are well integrated with Spark, but that can
change and is changing (DL4J, BigDL, the "XonSpark" flavours etc).

So this is something that I dare say will not be in Spark any time in the
foreseeable future or perhaps ever given the current status. Perhaps it's
worth seriously thinking about just closing these kinds of issues?




Re: Feedback on MLlib roadmap process proposal

2017-01-26 Thread Joseph Bradley
Sean has given a great explanation.  A few more comments:

Roadmap: I have been creating roadmap JIRAs, but the goal really is to have
all committers working on MLlib help to set that roadmap, based on either
their knowledge of current maintenance/internal needs of the project or the
feedback given from the rest of the community.
@Committers - I see people actively shepherding PRs for MLlib, but I don't
see many major initiatives linked to the roadmap.  If there are ones large
enough to merit adding to the roadmap, please do.

In general, there are many process improvements we could make.  A few in my
mind are:
* Visibility: Let the community know what committers are focusing on.  This
was the primary purpose of the "MLlib roadmap proposal."
* Community initiatives: This is currently very organic.  Some of the
organic process could be improved, such as encouraging Votes/Watchers
(though I agree with Sean about these being one-sided metrics).  Cody's SIP
work is a great step towards adding more clarity and structure for major
initiatives.
* JIRA hygiene: Always a challenge, and always requires some manual
prodding.  But it's great to push for efforts on this.



Re: Feedback on MLlib roadmap process proposal

2017-01-25 Thread Sean Owen
On Wed, Jan 25, 2017 at 6:01 AM Ilya Matiach  wrote:

> My confusion was that the ML 2.2 roadmap critical features (
> https://issues.apache.org/jira/browse/SPARK-18813) did not line up with
> the top ML/MLLIB JIRAs by Votes or Watchers.
>
> Your explanation that they do not have to and there is a more complex
> process to choosing the changes that will make it into the next release
> makes sense to me.
>

For Spark ML, Joseph is the de facto leader and does publish a tentative
roadmap. (We could also use JIRA mechanisms for this but any scheme is
better than none.) Yes, not based on Votes -- nothing here is. Votes are a
noisy signal because they usually measure: what would you like done if
you didn't have to do it and there were no downsides for you?



> My only humble recommendation would be to clean up the top JIRAs by closing
> the ones which have spark packages for them (eg the NN one which already
> has several packages as you explained), noting or somehow marking on some
> that they will not be resolved, and changing the component on the ones not
> related to ML/MLLIB (eg https://issues.apache.org/jira/browse/SPARK-12965
> ).
>

We do that. It occasionally generates protests, so, I find myself erring on
the side of ignoring. You can comment on any JIRA you think should be
closed. That's helpful.

That particular JIRA seems potentially legitimate. I wouldn't close it. It
also won't get fixed until someone proposes a resolution. I'd strongly
encourage people saying "I have this problem too" to try to fix it. I tend
to ignore these otherwise, myself, in favor of reviewing ones where someone
has gone to the trouble of proposing a working fix.



> Also, I would love to do this if I had the permissions, but it would be
> great to change the JIRAs that are marked as “in progress” but where the
> corresponding pull request was closed/cancelled, for example
> https://issues.apache.org/jira/browse/SPARK-4638.  That JIRA is
>

Yes, flag these. I or others can close them if appropriate. Anyone who
consistently does this well, we could give JIRA permissions to.

Opening a PR automatically makes it "In Progress" but there's no
complementary process to un-mark it. You can ignore the Open / In Progress
distinction really.

This one is interesting because it does seem like a plausible feature to
add. The original PR was abandoned by the author and nobody else submitted
one -- despite the Votes. I hesitate to signal that no PRs would be
considered, but it doesn't seem like it's in demand enough for someone to
work on?


I think one of my messages is that, de facto, here, like in many Apache
projects, committers do not take requests. They pursue the work they
believe needs doing, and shepherd work initiated by others (a clear bug
report, a PR) to a resolution. Things get done by doing them, or by
building influence by doing other things the project needs doing. It isn't
a mechanical, objective process, and can't be. But it does work in a
recognizable way.



RE: Feedback on MLlib roadmap process proposal

2017-01-24 Thread Ilya Matiach
Thanks Sean, this is a really helpful overview, and contains good guidance for 
new contributors to ML/MLLIB.
My confusion was that the ML 2.2 roadmap critical features 
(https://issues.apache.org/jira/browse/SPARK-18813) did not line up with the 
top ML/MLLIB JIRAs by Votes 
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC>
 or Watchers 
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC>.
Your explanation that they do not have to and there is a more complex process 
to choosing the changes that will make it into the next release makes sense to 
me.
My only humble recommendation would be to clean up the top JIRAs by closing the 
ones which have spark packages for them (eg the NN one which already has 
several packages as you explained), noting or somehow marking on some that they 
will not be resolved, and changing the component on the ones not related to 
ML/MLLIB (eg https://issues.apache.org/jira/browse/SPARK-12965).
Also, I would love to do this if I had the permissions, but it would be great 
to change the JIRAs that are marked as “in progress” but where the 
corresponding pull request was closed/cancelled, for example 
https://issues.apache.org/jira/browse/SPARK-4638.  That JIRA is actually one of 
the top ones by number of watchers (adding kernels like Radial Basis Function to 
SVM, and I can imagine why it’s one of the top ones), and seeing it marked as 
in progress with a pull request is somewhat confusing.  I’ve seen several other 
JIRAs similar to this one, where the pull request was closed but the JIRA 
status was not updated – and if the pull request was closed for a good reason, 
the corresponding JIRA should probably be closed as well.
Thank you, Ilya



Re: Feedback on MLlib roadmap process proposal

2017-01-24 Thread Cody Koeninger
Totally agree with most of what Sean said, just wanted to give an
alternate take on the "maintainers" thing.

On Tue, Jan 24, 2017 at 10:23 AM, Sean Owen  wrote:
> There is no such list because there's no formal notion of ownership or
> access to subsets of the project. Tracking an informal notion would be
> process mostly for its own sake, and probably just go out of date. We sort
> of tried this with 'maintainers' and it didn't actually do anything.
>

My perception of that situation is that the Apache process is actively
antagonistic towards factoring out responsibility for particular parts
of the code into a hierarchy.  I think if Spark were under a different
open source model, with otherwise exactly the same committers, that
attempt at identifying maintainers would have worked out differently.




Re: Feedback on MLlib roadmap process proposal

2017-01-24 Thread Sean Owen
On Tue, Jan 24, 2017 at 3:58 PM Ilya Matiach  wrote:

> Just a few questions with regards to the MLLIB process:
>
>
>
>1. Is there a list of committers who can/are shepherds and what code
>they own?  I’ve seen this page: http://spark.apache.org/committers.html
>but I’m not sure if it is up to date and it doesn’t mention what code the
>committers own.  It would be useful to know who owns ML or MLLIB.  From my
>limited personal experience this seems to be Joseph K. Bradley, Yanbo Liang
>and Sean Owen.
>

There is no such list because there's no formal notion of ownership or
access to subsets of the project. Tracking an informal notion would be
process mostly for its own sake, and probably just go out of date. We sort
of tried this with 'maintainers' and it didn't actually do anything.

I am not active much in ML, but will occasionally help commit simple
changes. What you see organically is pretty much what is, at any given
time. People you see responding are the active ones, and influencers,
commit bit or no.



>
>2. Based on both user votes and watchers, the top issue currently is
>“SPARK-5575: Artificial neural networks for MLlib deep learning”.  However,
>it looks like it has been opened for almost 2 years and not a lot of
>progress is being made.  There seem to be other top issues which aren’t
>getting addressed as well on these pages mentioned in the roadmap: MLlib,
>sorted by: Votes or Watchers.
>Is my perception incorrect, or is there a very good reason for not
>addressing the top issues voted for by the community?  If there is a good
>reason, is there a way to filter such JIRAs out from the sorted lists, to
>know which JIRAs really should be taken/worked on?
>

JIRA votes and watchers don't mean anything, formally. This isn't a
product company where one group might give another group a list of top
priorities to work on. There's a general statement about this at
http://spark.apache.org/contributing.html under "Code Review Criteria". In
practice, it's a soft process of convincing other people that change X does
more good than harm, is worth taking the burden of supporting, matters to
users, etc. I ignore the 80% of issues that don't seem to fit these criteria,
and choose to help with the 20% that do, which are usually simple and/or
important bug fixes.

ANNs? That's a tangent, but my snap reactions are:
- It's something Everybody wants Somebody Else to create, which may explain
  the votes vs activity?
- There is one basic ANN implementation in Spark actually.
- There are others outside Spark, so it may be something people get elsewhere,
  like dl4j or BigDL, or by strapping TF to Spark in various ways.
- DL is also not an obviously-great fit for the data-parallel computation
  model here.
- It's not a goal to implement everything in Spark. It could be a good idea,
  but, no need to tether it to the core project, to the exclusion of
  "unblessed" third-party packages.



>
>3. Also, this might be a newbie question, but for new contributors to
>spark, is there a process to convince a committer to be assigned to a JIRA
>that we are working on? It would be useful if there was a clear threshold
>for whether a committer can decline to work on a JIRA ahead of time, so
>contributors won’t waste time working on issues that aren’t important to
>spark and can focus on making progress on the issues that the spark committers
>would like us to fix.
>
>
No, there's no concept of being tasked to work on something by someone else
here. I can't imagine we could establish a clear objective threshold for
such a subjective thing.

It's not a satisfying answer but it is the most realistic one. All of these
OSS projects work on soft power, persuasion and cooperation. I think the
good news is that all the intuitive ways to gain soft power do work: give
time to others' problems if you want time on your own, help review, make
thoughtful careful changes, etc.

My general guidance is: don't bother doing significant feature work unless
you have some clear buy-in from someone who can commit.

I completely agree that issues should be closed more aggressively for the
reason you give. On the flip-side this often ruffles feathers. We are still
overrun with issues but it's gotten a lot better culture-wise about
honestly rejecting lots of inbound stuff quickly.


RE: Feedback on MLlib roadmap process proposal

2017-01-24 Thread Ilya Matiach
Just a few questions with regard to the MLLIB process:


  1.  Is there a list of committers who can/are shepherds and what code they 
own?  I’ve seen this page: http://spark.apache.org/committers.html but I’m not 
sure if it is up to date and it doesn’t mention what code the committers own.  
It would be useful to know who owns ML or MLLIB.  From my limited personal 
experience this seems to be Joseph K. Bradley, Yanbo Liang and Sean Owen.
  2.  Based on both user votes and watchers, the top issue currently is 
“SPARK-5575: Artificial neural networks for MLlib deep learning”.  However, it 
looks like it has been open for almost 2 years and not a lot of progress is 
being made.  There seem to be other top issues which aren’t getting addressed 
as well on these pages mentioned in the roadmap: MLlib, sorted by: Votes 
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20votes%20DESC>
 or Watchers 
<https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20ORDER%20BY%20Watchers%20DESC>
 .  Is my perception incorrect, or is there a very good reason for not 
addressing the top issues voted for by the community?  If there is a good 
reason, is there a way to filter such JIRAs out from the sorted lists, to know 
which JIRAs really should be taken/worked on?
  3.  Also, this might be a newbie question, but for new contributors to Spark,
is there a process to convince a committer to be assigned to a JIRA that we are
working on?  It would be useful if there were a clear threshold for whether a
committer can decline to work on a JIRA ahead of time, so that contributors
won’t waste time working on issues that aren’t important to Spark and can focus
on making progress on the issues that the Spark committers would like us to fix.
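
For reference, the two sorted lists linked in item 2 are ordinary JQL searches.
Here is a minimal sketch of how such a link can be put together, assuming the
same filter as the links above (open ML/MLlib issues in the SPARK project). The
helper name jira_search_url is hypothetical and not part of any JIRA client; it
only uses Python's standard urllib.

```python
from urllib.parse import quote

# Base of the JIRA issue search page, as used by the links above.
JIRA_SEARCH = "https://issues.apache.org/jira/issues/?jql="


def jira_search_url(order_by: str) -> str:
    """Build a search link for open ML/MLlib issues, sorted descending.

    `order_by` is the JQL field to sort on, e.g. "votes" or "Watchers".
    This helper is illustrative only (a hypothetical name, not a JIRA API).
    """
    jql = (
        "project = SPARK "
        'AND status in (Open, "In Progress", Reopened) '
        "AND component in (ML, MLlib) "
        f"ORDER BY {order_by} DESC"
    )
    # Percent-encode everything except parentheses, matching the links above.
    return JIRA_SEARCH + quote(jql, safe="()")


print(jira_search_url("votes"))
print(jira_search_url("Watchers"))
```

Changing the status or component clauses in the JQL string is one way to
narrow the list further, for example to hide issues that are known to be stale.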

Thank you, Ilya

From: Joseph Bradley [mailto:jos...@databricks.com]
Sent: Monday, January 23, 2017 8:04 PM
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: Mingjie Tang <tangr...@gmail.com>; Seth Hendrickson 
<seth.hendrickso...@gmail.com>; dev@spark.apache.org
Subject: Re: Feedback on MLlib roadmap process proposal

Hi Seth,

The proposal is geared towards exactly the issue you're describing: providing 
more visibility into the capacity and intentions of committers.  If there are 
things you'd add to it or change to improve further, it would be great to hear 
ideas!  The past roadmap JIRA has some more background discussion which is 
worth looking at too.

Let's break off the MLlib mission discussion into another thread.  I'll start 
one now.

Thanks,
Joseph

On Thu, Jan 19, 2017 at 1:51 PM, Felix Cheung 
<felixcheun...@hotmail.com> wrote:
Hi Seth

Re: "The most important thing we can do, given that MLlib currently has a very 
limited committer review bandwidth, is to make clear issues that, if worked on, 
will definitely get reviewed. "

We are adopting a Shepherd model, as described in the JIRA Joseph has, in 
which, when assigned, the Shepherd will see it through with the contributor to 
make sure it lands with the target release.

I'm sure Joseph can explain it better than I do ;)

_
From: Mingjie Tang <tangr...@gmail.com>
Sent: Thursday, January 19, 2017 10:30 AM
Subject: Re: Feedback on MLlib roadmap process proposal
To: Seth Hendrickson 
<seth.hendrickso...@gmail.com>
Cc: Joseph Bradley <jos...@databricks.com>, 
<dev@spark.apache.org>


+1 general abstractions like distributed linear algebra.

On Thu, Jan 19, 2017 at 8:54 AM, Seth Hendrickson 
<seth.hendrickso...@gmail.com> wrote:
I think the proposal laid out in SPARK-18813 is well done, and I do think it is 
going to improve the process going forward. I also really like the idea of 
getting the community to vote on JIRAs to give some of them priority - provided 
that we listen to those votes, of course. The biggest problem I see is that we 
do have several active contributors and those who want to help implement these 
changes, but PRs are reviewed rather sporadically and I imagine it is very 
difficult for contributors to understand why some get reviewed and some do not. 
The most important thing we can do, given that MLlib currently has a very 
limited committer review bandwidth, is to make clear issues that, if worked on, 
will definitely get reviewed. A hard thing to do in open source, no doubt, but 
even if we have to limit the scope of such issues to a very small subset, it's 
a gain for all I think.

On a related note, I would love to hear some d

Re: Feedback on MLlib roadmap process proposal

2017-01-23 Thread Joseph Bradley
Hi Seth,

The proposal is geared towards exactly the issue you're describing:
providing more visibility into the capacity and intentions of committers.
If there are things you'd add to it or change to improve further, it would
be great to hear ideas!  The past roadmap JIRA has some more background
discussion which is worth looking at too.

Let's break off the MLlib mission discussion into another thread.  I'll
start one now.

Thanks,
Joseph

On Thu, Jan 19, 2017 at 1:51 PM, Felix Cheung <felixcheun...@hotmail.com>
wrote:

> Hi Seth
>
> Re: "The most important thing we can do, given that MLlib currently has a
> very limited committer review bandwidth, is to make clear issues that, if
> worked on, will definitely get reviewed. "
>
> We are adopting a Shepherd model, as described in the JIRA Joseph has, in
> which, when assigned, the Shepherd will see it through with the contributor
> to make sure it lands with the target release.
>
> I'm sure Joseph can explain it better than I do ;)
>
>
> _
> From: Mingjie Tang <tangr...@gmail.com>
> Sent: Thursday, January 19, 2017 10:30 AM
> Subject: Re: Feedback on MLlib roadmap process proposal
> To: Seth Hendrickson <seth.hendrickso...@gmail.com>
> Cc: Joseph Bradley <jos...@databricks.com>, <dev@spark.apache.org>
>
>
>
> +1 general abstractions like distributed linear algebra.
>
> On Thu, Jan 19, 2017 at 8:54 AM, Seth Hendrickson <
> seth.hendrickso...@gmail.com> wrote:
>
>> I think the proposal laid out in SPARK-18813 is well done, and I do think
>> it is going to improve the process going forward. I also really like the
>> idea of getting the community to vote on JIRAs to give some of them
>> priority - provided that we listen to those votes, of course. The biggest
>> problem I see is that we do have several active contributors and those who
>> want to help implement these changes, but PRs are reviewed rather
>> sporadically and I imagine it is very difficult for contributors to
>> understand why some get reviewed and some do not. The most important thing
>> we can do, given that MLlib currently has a very limited committer review
>> bandwidth, is to make clear issues that, if worked on, will definitely get
>> reviewed. A hard thing to do in open source, no doubt, but even if we have
>> to limit the scope of such issues to a very small subset, it's a gain for
>> all I think.
>>
>> On a related note, I would love to hear some discussion on the higher
>> level goal of Spark MLlib (if this derails the original discussion, please
>> let me know and we can discuss in another thread). The roadmap does contain
>> specific items that help to convey some of this (ML parity with MLlib,
>> model persistence, etc...), but I'm interested in what the "mission" of
>> Spark MLlib is. We often see PRs for brand new algorithms which are
>> sometimes rejected and sometimes not. Do we aim to keep implementing more
>> and more algorithms? Or is our focus really, now that we have a reasonable
>> library of algorithms, to simply make the existing ones faster/better/more
>> robust? Should we aim to make interfaces that are easily extended for
>> developers to easily implement their own custom code (e.g. custom
>> optimization libraries), or do we want to restrict things to out-of-the-box
>> algorithms? Should we focus on more flexible, general abstractions like
>> distributed linear algebra?
>>
>> I was not involved in the project in the early days of MLlib when this
>> discussion may have happened, but I think it would be useful to either
>> revisit it or restate it here for some of the newer developers.
>>
>> On Tue, Jan 17, 2017 at 3:38 PM, Joseph Bradley <jos...@databricks.com>
>> wrote:
>>
>>> Hi all,
>>>
>>> This is a general call for thoughts about the process for the MLlib
>>> roadmap proposed in SPARK-18813.  See the section called "Roadmap process."
>>>
>>> Summary:
>>> * This process is about committers indicating intention to shepherd and
>>> review.
>>> * The goal is to improve visibility and communication.
>>> * This is fairly orthogonal to the SIP discussion since this proposal is
>>> more about setting release targets than about proposing future plans.
>>>
>>> Thanks!
>>> Joseph
>>>
>>> --
>>>
>>> Joseph Bradley
>>>
>>> Software Engineer - Machine Learning
>>>
>>> Databricks, Inc.
>>>
>>>
>>
>>
>
>
>


-- 

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.



Re: Feedback on MLlib roadmap process proposal

2017-01-19 Thread Felix Cheung
Hi Seth

Re: "The most important thing we can do, given that MLlib currently has a very 
limited committer review bandwidth, is to make clear issues that, if worked on, 
will definitely get reviewed. "

We are adopting a Shepherd model, as described in the JIRA Joseph has, in 
which, when assigned, the Shepherd will see it through with the contributor to 
make sure it lands with the target release.

I'm sure Joseph can explain it better than I do ;)


_
From: Mingjie Tang <tangr...@gmail.com>
Sent: Thursday, January 19, 2017 10:30 AM
Subject: Re: Feedback on MLlib roadmap process proposal
To: Seth Hendrickson 
<seth.hendrickso...@gmail.com>
Cc: Joseph Bradley <jos...@databricks.com>, 
<dev@spark.apache.org>


+1 general abstractions like distributed linear algebra.

On Thu, Jan 19, 2017 at 8:54 AM, Seth Hendrickson 
<seth.hendrickso...@gmail.com> wrote:
I think the proposal laid out in SPARK-18813 is well done, and I do think it is 
going to improve the process going forward. I also really like the idea of 
getting the community to vote on JIRAs to give some of them priority - provided 
that we listen to those votes, of course. The biggest problem I see is that we 
do have several active contributors and those who want to help implement these 
changes, but PRs are reviewed rather sporadically and I imagine it is very 
difficult for contributors to understand why some get reviewed and some do not. 
The most important thing we can do, given that MLlib currently has a very 
limited committer review bandwidth, is to make clear issues that, if worked on, 
will definitely get reviewed. A hard thing to do in open source, no doubt, but 
even if we have to limit the scope of such issues to a very small subset, it's 
a gain for all I think.

On a related note, I would love to hear some discussion on the higher level 
goal of Spark MLlib (if this derails the original discussion, please let me 
know and we can discuss in another thread). The roadmap does contain specific 
items that help to convey some of this (ML parity with MLlib, model 
persistence, etc...), but I'm interested in what the "mission" of Spark MLlib 
is. We often see PRs for brand new algorithms which are sometimes rejected and 
sometimes not. Do we aim to keep implementing more and more algorithms? Or is 
our focus really, now that we have a reasonable library of algorithms, to 
simply make the existing ones faster/better/more robust? Should we aim to make 
interfaces that are easily extended for developers to easily implement their 
own custom code (e.g. custom optimization libraries), or do we want to restrict 
things to out-of-the-box algorithms? Should we focus on more flexible, general 
abstractions like distributed linear algebra?

I was not involved in the project in the early days of MLlib when this 
discussion may have happened, but I think it would be useful to either revisit 
it or restate it here for some of the newer developers.

On Tue, Jan 17, 2017 at 3:38 PM, Joseph Bradley 
<jos...@databricks.com> wrote:
Hi all,

This is a general call for thoughts about the process for the MLlib roadmap 
proposed in SPARK-18813.  See the section called "Roadmap process."

Summary:
* This process is about committers indicating intention to shepherd and review.
* The goal is to improve visibility and communication.
* This is fairly orthogonal to the SIP discussion since this proposal is more 
about setting release targets than about proposing future plans.

Thanks!
Joseph

--

Joseph Bradley

Software Engineer - Machine Learning

Databricks, Inc.







Re: Feedback on MLlib roadmap process proposal

2017-01-19 Thread Mingjie Tang
+1 general abstractions like distributed linear algebra.

On Thu, Jan 19, 2017 at 8:54 AM, Seth Hendrickson <
seth.hendrickso...@gmail.com> wrote:

> I think the proposal laid out in SPARK-18813 is well done, and I do think
> it is going to improve the process going forward. I also really like the
> idea of getting the community to vote on JIRAs to give some of them
> priority - provided that we listen to those votes, of course. The biggest
> problem I see is that we do have several active contributors and those who
> want to help implement these changes, but PRs are reviewed rather
> sporadically and I imagine it is very difficult for contributors to
> understand why some get reviewed and some do not. The most important thing
> we can do, given that MLlib currently has a very limited committer review
> bandwidth, is to make clear issues that, if worked on, will definitely get
> reviewed. A hard thing to do in open source, no doubt, but even if we have
> to limit the scope of such issues to a very small subset, it's a gain for
> all I think.
>
> On a related note, I would love to hear some discussion on the higher
> level goal of Spark MLlib (if this derails the original discussion, please
> let me know and we can discuss in another thread). The roadmap does contain
> specific items that help to convey some of this (ML parity with MLlib,
> model persistence, etc...), but I'm interested in what the "mission" of
> Spark MLlib is. We often see PRs for brand new algorithms which are
> sometimes rejected and sometimes not. Do we aim to keep implementing more
> and more algorithms? Or is our focus really, now that we have a reasonable
> library of algorithms, to simply make the existing ones faster/better/more
> robust? Should we aim to make interfaces that are easily extended for
> developers to easily implement their own custom code (e.g. custom
> optimization libraries), or do we want to restrict things to out-of-the-box
> algorithms? Should we focus on more flexible, general abstractions like
> distributed linear algebra?
>
> I was not involved in the project in the early days of MLlib when this
> discussion may have happened, but I think it would be useful to either
> revisit it or restate it here for some of the newer developers.
>
> On Tue, Jan 17, 2017 at 3:38 PM, Joseph Bradley 
> wrote:
>
>> Hi all,
>>
>> This is a general call for thoughts about the process for the MLlib
>> roadmap proposed in SPARK-18813.  See the section called "Roadmap process."
>>
>> Summary:
>> * This process is about committers indicating intention to shepherd and
>> review.
>> * The goal is to improve visibility and communication.
>> * This is fairly orthogonal to the SIP discussion since this proposal is
>> more about setting release targets than about proposing future plans.
>>
>> Thanks!
>> Joseph
>>
>> --
>>
>> Joseph Bradley
>>
>> Software Engineer - Machine Learning
>>
>> Databricks, Inc.
>>
>>
>
>


Re: Feedback on MLlib roadmap process proposal

2017-01-19 Thread Seth Hendrickson
I think the proposal laid out in SPARK-18813 is well done, and I do think
it is going to improve the process going forward. I also really like the
idea of getting the community to vote on JIRAs to give some of them
priority - provided that we listen to those votes, of course. The biggest
problem I see is that we do have several active contributors and those who
want to help implement these changes, but PRs are reviewed rather
sporadically and I imagine it is very difficult for contributors to
understand why some get reviewed and some do not. The most important thing
we can do, given that MLlib currently has a very limited committer review
bandwidth, is to make clear issues that, if worked on, will definitely get
reviewed. A hard thing to do in open source, no doubt, but even if we have
to limit the scope of such issues to a very small subset, it's a gain for
all I think.

On a related note, I would love to hear some discussion on the higher level
goal of Spark MLlib (if this derails the original discussion, please let me
know and we can discuss in another thread). The roadmap does contain
specific items that help to convey some of this (ML parity with MLlib,
model persistence, etc...), but I'm interested in what the "mission" of
Spark MLlib is. We often see PRs for brand new algorithms which are
sometimes rejected and sometimes not. Do we aim to keep implementing more
and more algorithms? Or is our focus really, now that we have a reasonable
library of algorithms, to simply make the existing ones faster/better/more
robust? Should we aim to make interfaces that are easily extended for
developers to easily implement their own custom code (e.g. custom
optimization libraries), or do we want to restrict things to out-of-the-box
algorithms? Should we focus on more flexible, general abstractions like
distributed linear algebra?

I was not involved in the project in the early days of MLlib when this
discussion may have happened, but I think it would be useful to either
revisit it or restate it here for some of the newer developers.

On Tue, Jan 17, 2017 at 3:38 PM, Joseph Bradley 
wrote:

> Hi all,
>
> This is a general call for thoughts about the process for the MLlib
> roadmap proposed in SPARK-18813.  See the section called "Roadmap process."
>
> Summary:
> * This process is about committers indicating intention to shepherd and
> review.
> * The goal is to improve visibility and communication.
> * This is fairly orthogonal to the SIP discussion since this proposal is
> more about setting release targets than about proposing future plans.
>
> Thanks!
> Joseph
>
> --
>
> Joseph Bradley
>
> Software Engineer - Machine Learning
>
> Databricks, Inc.
>
>