[jira] [Comment Edited] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-12 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15741292#comment-15741292
 ] 

Yanbo Liang edited comment on SPARK-18813 at 12/12/16 2:10 PM:
---

[~josephkb] [~yuhaoyan] [~felixcheung] This is a really useful discussion, and 
this JIRA looks like a good framework to help us move forward efficiently. The 
current organization, categories and priority settings make things clearer for 
users, contributors and committers alike. Thanks for the great work.
I totally agree with [~yuhaoyan]'s comment that we should consider feedback 
from Spark users. Besides voting, watching and commenting in JIRA, I think the 
mailing list is also an attractive Q&A tool for Spark users, since it does not 
require registering or logging in. I collected some feature requests from the 
dev and user mailing lists over the past several months:
* SPARK-10413 Models should support prediction on a single instance. I think 
this is the most frequently requested feature on the mailing list; it was 
mentioned multiple times. This is one of the most important steps to move Spark 
MLlib into production. Furthermore, we can provide a local model implementation 
in the mllib-local package, since lots of users want to make predictions 
locally (without depending on the whole Spark package). There is already some 
discussion at SPARK-16365.
* GBT improvement: [dmlc/xgboost|https://github.com/dmlc/xgboost] is another 
popular gradient boosting library. Some users compared Spark GBT with xgboost 
and found that Spark GBT has room for improvement. There are discussions about 
this on SPARK-8547, SPARK-4240 and others; we can link them together to help us 
move forward more efficiently. I also talked with some Spark machine learning 
users offline and found that GBT-related algorithms play an important role in 
their daily work to meet business requirements, so I think we should give high 
priority to improving MLlib GBT.
* SPARK-8418 ML estimators and transformers should support multiple columns as 
input and output. This is also very important for making MLlib practical. For 
example, when many string columns need to be encoded by {{StringIndexer}}, it 
is better to transform them in a single pass rather than in multiple passes, 
which can greatly accelerate training and transforming.
* SPARK-11136 Support setting an initial model, which can lead to a better 
solution and save a lot of training time.
* And some other feature parity issues, such as providing DataFrame-based APIs 
for SVM, statistical functions, distributed linear algebra, etc. I saw that you 
have linked the corresponding SPARK-4591 here.
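
To make the single-pass idea in the SPARK-8418 item concrete, here is a minimal sketch in plain Python. It is not Spark code and does not use the {{StringIndexer}} API; the helper names {{fit_string_indexers}} and {{transform}}, the sample rows, and the alphabetical tie-breaking are all assumptions made for illustration. It only shows that label counts for several string columns can be collected in one scan of the data instead of one scan per column.

```python
from collections import Counter


def fit_string_indexers(rows, columns):
    """Build one label -> index map per column using a single scan of rows.

    Hypothetical helper for illustration; not part of Spark.
    """
    counts = {c: Counter() for c in columns}
    for row in rows:  # the only pass over the data
        for c in columns:
            counts[c][row[c]] += 1
    # The most frequent label gets index 0. Ties are broken alphabetically
    # here, which is a choice made for this sketch only.
    return {
        c: {
            label: i
            for i, (label, _) in enumerate(
                sorted(cnt.items(), key=lambda kv: (-kv[1], kv[0]))
            )
        }
        for c, cnt in counts.items()
    }


def transform(rows, indexers):
    """Replace each indexed column value with its integer index."""
    return [
        {c: (indexers[c][v] if c in indexers else v) for c, v in row.items()}
        for row in rows
    ]


rows = [
    {"city": "NYC", "device": "ios"},
    {"city": "SF", "device": "android"},
    {"city": "NYC", "device": "android"},
]
indexers = fit_string_indexers(rows, ["city", "device"])
encoded = transform(rows, indexers)
```

With one indexer per column fitted separately, the data would be scanned once per column; the sketch above scans it once regardless of how many columns are encoded.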



[jira] [Comment Edited] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-12 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15741292#comment-15741292
 ] 

Yanbo Liang edited comment on SPARK-18813 at 12/12/16 2:07 PM:
---

[~josephkb] [~yuhaoyan] [~felixcheung] This is a really useful discussion, and 
this JIRA looks like a good framework to help us move forward efficiently. The 
current organization, categories and priority settings make things clearer for 
users, contributors and committers alike. Thanks for the great work.
I totally agree with [~yuhaoyan]'s comment that we should consider feedback 
from Spark users. Besides voting, watching and commenting in JIRA, I think the 
mailing list is also an attractive Q&A tool for Spark users, since it does not 
require registering or logging in. I collected some feature requests from the 
dev and user mailing lists over the past several months:
* SPARK-10413 Models should support prediction on a single instance. I think 
this is the most frequently requested feature on the mailing list; it was 
mentioned multiple times. This is one of the most important steps to move Spark 
MLlib into production. Furthermore, we should consider whether to provide a 
local model implementation in the mllib-local package, since lots of users will 
score models locally. There is already some discussion at SPARK-16365.
* GBT improvement: [dmlc/xgboost|https://github.com/dmlc/xgboost] is another 
popular gradient boosting library. Some users compared Spark GBT with xgboost 
and found that Spark GBT has room for improvement. There are discussions about 
this on SPARK-8547, SPARK-4240 and others; we can link them together to help us 
move forward more efficiently. I also talked with some Spark machine learning 
users offline and found that GBT-related algorithms play an important role in 
their daily work to meet business requirements, so I think we should give high 
priority to improving MLlib GBT.
* SPARK-8418 ML estimators and transformers should support multiple columns as 
input and output. This is also very important for making MLlib practical. For 
example, when many string columns need to be encoded by {{StringIndexer}}, it 
is better to transform them in a single pass rather than in multiple passes, 
which can greatly accelerate training and transforming.
* SPARK-11136 Support setting an initial model, which can lead to a better 
solution and save a lot of training time.
* And some other feature parity issues, such as providing DataFrame-based APIs 
for SVM, statistical functions, distributed linear algebra, etc. I saw that you 
have linked the corresponding SPARK-4591 here.



[jira] [Comment Edited] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-12 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15741292#comment-15741292
 ] 

Yanbo Liang edited comment on SPARK-18813 at 12/12/16 2:04 PM:
---

[~josephkb] [~yuhaoyan] [~felixcheung] This is a really useful discussion, and 
this JIRA looks like a good framework to help us move forward efficiently. The 
current organization, categories and priority settings make things clearer for 
users, contributors and committers alike. Thanks for the great work.
I totally agree with [~yuhaoyan]'s comment that we should consider feedback 
from Spark users. Besides voting, watching and commenting in JIRA, I think the 
mailing list is also an attractive Q&A tool for Spark users, since it does not 
require registering or logging in. I collected some feature requests from the 
dev and user mailing lists over the past several months:
* SPARK-10413 Models should support prediction on a single instance. I think 
this is the most frequently requested feature on the mailing list; it was 
mentioned multiple times. This is one of the most important steps to move Spark 
MLlib into production. Furthermore, we should consider whether to provide a 
local model implementation in the mllib-local package, since lots of users will 
score models locally. There is already some discussion at SPARK-16365.
* GBT improvement: [dmlc/xgboost|https://github.com/dmlc/xgboost] is another 
popular gradient boosting library. Some users compared Spark GBT with xgboost 
and found that Spark GBT has room for improvement. There are discussions about 
this on SPARK-8547, SPARK-4240 and others; we can link them together to help us 
move forward more efficiently. I also talked with some Spark machine learning 
users offline and found that GBT-related algorithms play an important role in 
their daily work to meet their business requirements, so I think we should give 
high priority to improving MLlib GBT.
* SPARK-8418 ML estimators and transformers should support multiple columns as 
input and output. This is also very important for making MLlib practical. For 
example, when many string columns need to be encoded by {{StringIndexer}}, it 
is better to transform them in a single pass rather than in multiple passes, 
which can greatly accelerate training and transforming.
* SPARK-11136 Support setting an initial model, which can lead to a better 
solution and save a lot of training time.
* And some other feature parity issues, such as providing DataFrame-based APIs 
for SVM, statistical functions, distributed linear algebra, etc. I saw that you 
have linked the corresponding SPARK-4591 here.




[jira] [Comment Edited] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-12 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15741292#comment-15741292
 ] 

Yanbo Liang edited comment on SPARK-18813 at 12/12/16 1:55 PM:
---

[~josephkb] [~yuhaoyan] [~felixcheung] This is a really useful discussion, and 
this JIRA looks like a good framework to help us move forward efficiently. The 
current organization, categories and priority settings make things clearer for 
users, contributors and committers alike. Thanks for the great work.
I totally agree with [~yuhaoyan]'s comment that we should consider feedback 
from Spark users. Besides voting, watching and commenting in JIRA, I think the 
mailing list is also an attractive Q&A tool for Spark users, since it does not 
require registering or logging in. I collected some feature requests from the 
dev and user mailing lists over the past several months:
* SPARK-10413 Models should support prediction on a single instance. I think 
this is the most frequently requested feature on the mailing list; it was 
mentioned multiple times. This is one of the most important steps to move Spark 
MLlib into production. Furthermore, we should consider whether to provide a 
local model implementation in the mllib-local package, since lots of users will 
score models locally. There is already some discussion at SPARK-16365.
* GBT improvement: [dmlc/xgboost|https://github.com/dmlc/xgboost] is another 
popular gradient boosting library. Some users compared Spark GBT with xgboost 
and found that Spark GBT has room for improvement. There are discussions about 
this on SPARK-8547, SPARK-4240 and others; I think we can put them together to 
help us move forward more efficiently. I also talked with some Spark machine 
learning users offline and found that GBT-related algorithms play an important 
role in their daily work to meet their business requirements, and that some 
users choose xgboost.
* SPARK-8418 ML estimators and transformers should support multiple columns as 
input and output. This is also very important for making MLlib practical. For 
example, when many string columns need to be encoded by {{StringIndexer}}, it 
is better to transform them in a single pass rather than in multiple passes, 
which can greatly accelerate training and transforming.
* SPARK-11136 Support setting an initial model, which can lead to a better 
solution and save a lot of training time.
* And some other feature parity issues, such as providing DataFrame-based APIs 
for SVM, statistical functions, distributed linear algebra, etc. I saw that you 
have linked the corresponding SPARK-4591 here.



> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
> 

[jira] [Comment Edited] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-12 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15741292#comment-15741292
 ] 

Yanbo Liang edited comment on SPARK-18813 at 12/12/16 8:34 AM:
---

[~josephkb] [~yuhaoyan] [~felixcheung] This is a really useful discussion, and 
this JIRA looks like a good framework to help us move forward efficiently. The 
current organization, categories and priority settings make things clearer for 
users, contributors and committers alike. Thanks for the great work.
I totally agree with [~yuhaoyan]'s comment that we should consider feedback 
from Spark users. Besides voting, watching and commenting in JIRA, I think the 
mailing list is also an attractive Q&A tool for Spark users, since it does not 
require registering or logging in. I collected some feature requests from the 
dev and user mailing lists over the past several months:
* SPARK-10413 Models should support prediction on a single instance. I think 
this is the most frequently requested feature on the mailing list; it was 
mentioned multiple times. This is one of the most important steps to move Spark 
MLlib into production. Furthermore, we should consider whether to provide a 
local model implementation in the mllib-local package, since lots of users will 
score models locally. There is already some discussion at SPARK-16365.
* GBT improvement: [dmlc/xgboost|https://github.com/dmlc/xgboost] is another 
popular gradient boosting library. Some users compared Spark GBT with xgboost 
and found that Spark GBT has room for improvement. I also talked with some 
xgboost users offline and collected the reasons why they chose it. I will 
summarize them and paste them here soon.
* SPARK-8418 ML estimators and transformers should support multiple columns as 
input and output. This is also very important for making MLlib practical. For 
example, when many string columns need to be encoded by {{StringIndexer}}, it 
is better to transform them in a single pass rather than in multiple passes, 
which can greatly accelerate training and transforming.
* SPARK-11136 Support setting an initial model, which can lead to a better 
solution and save a lot of training time.
* And some other feature parity issues, such as providing DataFrame-based APIs 
for SVM, statistical functions, distributed linear algebra, etc. I saw that you 
have linked the corresponding SPARK-4591 here.




[jira] [Comment Edited] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-12 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15741292#comment-15741292
 ] 

Yanbo Liang edited comment on SPARK-18813 at 12/12/16 8:22 AM:
---

[~josephkb] [~yuhaoyan] [~felixcheung] This is a really useful discussion and 
a great framework to help us move forward efficiently. I totally agree with 
[~yuhaoyan]'s comment that we should consider feedback from Spark users. 
Besides voting, watching and commenting in JIRA, I think the mailing list is 
also an attractive Q&A tool for Spark users, since it does not require 
registering or logging in. I collected some feature requests from the dev and 
user mailing lists over the past several months:
* SPARK-10413 Models should support prediction on a single instance. I think 
this is the most frequently requested feature on the mailing list; it was 
mentioned multiple times. This is one of the most important steps to move Spark 
MLlib into production. Furthermore, we should consider whether to provide a 
local model implementation in the mllib-local package, since lots of users will 
score models locally. There is already some discussion at SPARK-16365.
* GBT improvement: [dmlc/xgboost|https://github.com/dmlc/xgboost] is another 
popular gradient boosting library. Some users compared Spark GBT with xgboost 
and found that Spark GBT has room for improvement. I also talked with some 
xgboost users offline and collected the reasons why they chose it. I will 
summarize them and paste them here soon.
* SPARK-8418 ML estimators and transformers should support multiple columns as 
input and output. This is also very important for making MLlib practical. For 
example, when many string columns need to be encoded by {{StringIndexer}}, it 
is better to transform them in a single pass rather than in multiple passes, 
which can greatly accelerate training and transforming.
* SPARK-11136 Support setting an initial model, which can lead to a better 
solution and save a lot of training time.
* And some other feature parity issues, such as providing DataFrame-based APIs 
for SVM, statistical functions, distributed linear algebra, etc. I saw that you 
have linked the corresponding SPARK-4591 here.


was (Author: yanboliang):
[~josephkb] [~yuhaoyan] [~felixcheung] This is a really useful discussion and 
a great framework to help us move forward efficiently. I totally agree with 
[~yuhaoyan]'s comment that we should consider feedback from Spark users. 
Besides voting, watching and commenting in JIRA, I think the mailing list is 
also an attractive Q&A tool for Spark users, since it does not require 
registering or logging in. I collected some feature requests from the dev and 
user mailing lists over the past several months:
* SPARK-10413 Models should support prediction on a single instance. I think 
this is the most frequently requested feature on the mailing list; it was 
mentioned multiple times. This is one of the most important steps to move Spark 
MLlib into production. Furthermore, we should consider whether to provide a 
local model implementation in the mllib-local package, since lots of users will 
score models locally. There is some discussion at SPARK-16365.
* GBT improvement: [dmlc/xgboost|https://github.com/dmlc/xgboost] is another 
popular gradient boosting library. Some users compared Spark GBT with xgboost 
and found that Spark GBT has room for improvement. I also talked with some 
xgboost users offline and collected the reasons why they chose it. I will 
summarize them and paste them here soon.
* SPARK-8418 ML estimators and transformers should support multiple columns as 
input and output. This is also very important for making MLlib practical. For 
example, when many string columns need to be encoded by {{StringIndexer}}, it 
is better to transform them in a single pass rather than in multiple passes, 
which can greatly accelerate training and transforming.
* SPARK-11136 Support setting an initial model, which can lead to a better 
solution and save a lot of training time.
* And some other feature parity issues, such as providing DataFrame-based APIs 
for SVM, statistical functions, distributed linear algebra, etc. I saw that you 
have linked the corresponding SPARK-4591 here.


[jira] [Comment Edited] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-11 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15740184#comment-15740184
 ] 

Felix Cheung edited comment on SPARK-18813 at 12/11/16 7:11 PM:


I added a couple of JIRAs for R that can be found with [this 
query|https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)%20AND%20component%20in%20(ML%2C%20MLlib)%20AND%20component%20in%20(SparkR)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.2.0%20ORDER%20BY%20priority%20DESC]

We could turn them into subtasks if we are using an umbrella issue.



> MLlib 2.2 Roadmap
> -
>
> Key: SPARK-18813
> URL: https://issues.apache.org/jira/browse/SPARK-18813
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> *PROPOSAL: This includes a proposal for the 2.2 roadmap process for MLlib.*
> The roadmap process described below is significantly updated since the 2.1 
> roadmap [SPARK-15581].  Please refer to [SPARK-15581] for more discussion on 
> the basis for this proposal, and comment in this JIRA if you have suggestions 
> for improvements.
> h1. Roadmap process
> This roadmap is a master list for MLlib improvements we are working on during 
> this release.  This includes ML-related changes in PySpark and SparkR.
> *What is planned for the next release?*
> * This roadmap lists issues which at least one Committer has prioritized.  
> See details below in "Instructions for committers."
> * This roadmap only lists larger or more critical issues.
> *How can contributors influence this roadmap?*
> * If you believe an issue should be in this roadmap, please discuss the issue 
> on JIRA and/or the dev mailing list.  Make sure to ping Committers since at 
> least one must agree to shepherd the issue.
> * For general discussions, use this JIRA or the dev mailing list.  For 
> specific issues, please comment on those issues or the mailing list.
> h2. Target Version and Priority
> This section describes the meaning of Target Version and Priority.  _These 
> meanings have been updated in this proposal for the 2.2 process._
> || Category || Target Version || Priority || Shepherd || Put on roadmap? || In next release? ||
> | 1 | next release | Blocker | *must* | *must* | *must* |
> | 2 | next release | Critical | *must* | yes, unless small | *best effort* |
> | 3 | next release | Major | *must* | optional | *best effort* |
> | 4 | next release | Minor | optional | no | maybe |
> | 5 | next release | Trivial | optional | no | maybe |
> | 6 | (empty) | (any) | yes | no | maybe |
> | 7 | (empty) | (any) | no | no | maybe |
> The *Category* in the table above has the following meaning:
> 1. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.
> 2-3. A committer has promised to see this issue to completion for the next 
> release.  Contributions *will* receive attention.  The issue may slip to a 
> later release if development is slower than expected.
> 4-5. A committer has promised interest in this issue.  Contributions *will* 
> receive attention.  The issue may slip to another release.
> 6. A committer has promised interest in this issue and should respond, but no 
> promises are made about priorities or releases.
> 7. This issue is open for discussion, but it needs a committer to promise 
> interest to proceed.
> h1. Instructions
> h2. For contributors
> Getting started
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time contributor, please always start with a small 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a larger feature.
> Coordinating on JIRA
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start work. This is to avoid duplicate work. For small patches, you do 
> not need to get the JIRA assigned to you to begin work.
> * For medium/large features or features with dependencies, please get 
> assigned first before coding and keep the ETA updated on the JIRA. If there 
> is no activity on the JIRA page for a certain amount of time, the JIRA should 
> be released for other contributors.
> * Do not claim multiple (>3) JIRAs at 
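
As a rough illustration of the Target Version and Priority table in the roadmap description above, the category of an issue can be derived from three fields. This is a minimal sketch under stated assumptions, not tooling used by the Spark project: the function name `roadmap_category`, its parameters, and the choice to raise an error when a required shepherd is missing are all hypothetical.

```python
from typing import Optional

# Priority -> category for issues whose Target Version is the next release
# (rows 1-5 of the table).
PRIORITY_TO_CATEGORY = {
    "Blocker": 1,
    "Critical": 2,
    "Major": 3,
    "Minor": 4,
    "Trivial": 5,
}


def roadmap_category(target_version: Optional[str],
                     priority: str,
                     has_shepherd: bool) -> int:
    """Map an issue to a roadmap category (1-7) following the table.

    Assumption: any non-empty target_version means "next release".
    """
    if not target_version:
        # Rows 6 and 7: no Target Version set, any priority.
        return 6 if has_shepherd else 7
    category = PRIORITY_TO_CATEGORY[priority]
    # Categories 1-3 list the shepherd as *must*; 4-5 list it as optional.
    if category <= 3 and not has_shepherd:
        raise ValueError(f"{priority} issues targeted at a release need a shepherd")
    return category


if __name__ == "__main__":
    print(roadmap_category("2.2.0", "Blocker", True))   # category 1
    print(roadmap_category(None, "Major", False))       # category 7
```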

[jira] [Comment Edited] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-09 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736953#comment-15736953
 ] 

yuhao yang edited comment on SPARK-18813 at 12/10/16 5:06 AM:
--

The plan is definitely solid and practical. I understand that, for efficiency and 
operability, we need to rely on committers for release management and feature 
review.

The only thing I would add is that we should, however, find a way to *take in the 
suggestions and feedback from real-world Spark users*, who will ultimately 
decide the popularity of Apache Spark. In the long term, we should find a 
mechanism to collect and respond to users' requirements and complaints. One 
idea is to have a voting website as a wish list for Spark users. Users can 
create or vote for the features or improvements they need in Spark. This helps 
committers collect the requirements and also gives everybody a channel to 
express their priorities. I'd like to hear other ideas. The main goal is to 
improve transparency and diversity in the community and make everyone feel 
more involved rather than isolated.


was (Author: yuhaoyan):
The plan is definitely solid and practical. I understand that, for efficiency and 
operability, we need to rely on committers for release management and feature 
review.

The only thing I would add is that we should, however, find a way to *take in the 
suggestions and feedback from real-world Spark users*, who will ultimately 
decide the popularity of Apache Spark. In the long term, we should find a 
mechanism to collect and respond to users' requirements and complaints. One 
idea is to have a voting website as a wish list for Spark users. Users can 
create or vote for the features or improvements they need in Spark. This helps 
committers collect the requirements and also gives everybody a channel to 
express their priorities. Hopefully it will improve the transparency and 
diversity in the community and make everyone feel more involved rather than 
isolated.


[jira] [Comment Edited] (SPARK-18813) MLlib 2.2 Roadmap

2016-12-09 Thread yuhao yang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-18813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15736953#comment-15736953
 ] 

yuhao yang edited comment on SPARK-18813 at 12/10/16 1:55 AM:
--

The plan is definitely solid and practical. I understand that, for efficiency and 
operability, we need to rely on committers for release management and feature 
review.

The only thing I would add is that we should, however, find a way to *take in the 
suggestions and feedback from real-world Spark users*, who will ultimately 
decide the popularity of Apache Spark. In the long term, we should find a 
mechanism to collect and respond to users' requirements and complaints. One 
idea is to have a voting website as a wish list for Spark users. Users can 
create or vote for the features or improvements they need in Spark. This helps 
committers collect the requirements and also gives everybody a channel to 
express their priorities. Hopefully it will improve the transparency and 
diversity in the community and make everyone feel more involved rather than 
isolated.


was (Author: yuhaoyan):
The plan is definitely solid and practical. I understand that, for efficiency and 
operability, we need to rely on committers for release management and feature 
review.

The only thing I would add is that we should, however, find a way to *take in the 
suggestions and feedback from real-world Spark users*, who will ultimately 
decide the popularity of Apache Spark. We should find a mechanism to collect 
and respond to users' requirements and complaints. One idea is to have a voting 
website as a wish list for Spark users. Users can create or vote for the 
features or improvements they need in Spark. This helps committers collect the 
requirements and also gives everybody a channel to express their priorities. 
Hopefully it will improve the transparency and diversity in the community and 
make everyone feel more involved rather than isolated.
