[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-11-01 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15625108#comment-15625108
 ] 

Vincent commented on SPARK-17055:
-

[~sowen] Okay, hmm, I guess we have some misunderstanding here. 
[~remi.delas...@gmail.com] reviewed the code and gave some feedback that we 
should be prepared to accommodate other folding methods, if any. But my opinion 
was that we should align with the current design in MLlib, because: 1. we don't 
have that many folding methods in MLlib so far; 2. changing the API as Rémi 
proposed would impact current cross-validation usage, which I think should 
first get the nod from committers.
So I suggest we reopen this ticket, because as we know this folding method is 
useful in practice, and, from what I have learned, some people in the community 
actually need/use this PR when they use Spark ML. :)

> add labelKFold to CrossValidator
> 
>
> Key: SPARK-17055
> URL: https://issues.apache.org/jira/browse/SPARK-17055
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 2.0.0
>Reporter: Vincent
>Priority: Minor
>
> Current CrossValidator only supports k-fold, which randomly divides all the 
> samples into k groups. But when data is gathered from different subjects 
> and we want to avoid over-fitting, we want to hold out all samples with 
> certain labels from the training data and put them into the validation 
> fold, i.e. we want to ensure that the same label is not in both the 
> training and testing sets.
> Mainstream packages like scikit-learn already support such a 
> cross-validation method. 
> (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold)
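For illustration, here is a minimal sketch of the intended semantics in plain 
Python, not the scikit-learn source: every distinct label (group) is assigned 
to exactly one fold, so no label straddles training and validation. The 
function name and the balancing heuristic are illustrative only.

{code:python}
from collections import Counter

def label_k_fold(labels, k=3):
    """Assign each distinct label to exactly one fold, balancing fold sizes."""
    counts = Counter(labels)
    fold_sizes = [0] * k
    fold_of = {}
    # Place the largest groups first so folds stay roughly balanced.
    for grp, n in counts.most_common():
        f = fold_sizes.index(min(fold_sizes))
        fold_of[grp] = f
        fold_sizes[f] += n
    folds = [[] for _ in range(k)]
    for i, grp in enumerate(labels):
        folds[fold_of[grp]].append(i)
    return folds  # folds[f] = row indices held out in round f

# Samples from 4 subjects, 3 folds; no subject ever appears in two folds.
print(label_k_fold(["a", "a", "b", "b", "c", "d", "d", "d"]))
{code}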






[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-11-01 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15625033#comment-15625033
 ] 

Rémi Delassus commented on SPARK-17055:
---

That discussion is where you said you were closing it, not why you were doing 
so. Reading it, I thought you were merging the PR.




[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-11-01 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624928#comment-15624928
 ] 

Sean Owen commented on SPARK-17055:
---

I did so following the discussion on the PR, which you were part of yesterday: 
https://github.com/apache/spark/pull/14640#issuecomment-257205012




[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-10-31 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624114#comment-15624114
 ] 

Vincent commented on SPARK-17055:
-

[~srowen] May I ask why this issue was closed? It would help us understand the 
current guidelines if we are to implement more features in ML/MLlib, thanks.




[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-10-28 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614576#comment-15614576
 ] 

Rémi Delassus commented on SPARK-17055:
---

I had an issue that could be solved by this kind of technique: each sample has 
a timestamp, and the error can only be computed on a full month of data. Thus 
I need months of data to be distributed across folds, not each sample 
individually.

But in my opinion this is not the right way to solve it. Since there is an 
infinite number of ways to split the data, I think we should be able to pass 
the split method as an argument to the cross-validator. The method described 
here could be implemented, as well as any other; see the sketch below.
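To make the suggestion concrete, a hedged sketch of what a pluggable fold 
strategy could look like (all names here are hypothetical, not the 
CrossValidator API): the cross-validator takes a function mapping the data to 
held-out index sets, so random k-fold, label k-fold, or a by-month split all 
plug in unchanged.

{code:python}
def cross_validate(rows, fit, score, make_folds):
    """Average a score over folds produced by an arbitrary split strategy."""
    scores = []
    for held_out in make_folds(rows):
        held = set(held_out)
        train = [r for i, r in enumerate(rows) if i not in held]
        test = [rows[i] for i in held_out]
        scores.append(score(fit(train), test))
    return sum(scores) / len(scores)

def by_month_folds(rows):
    """One fold per calendar month, so a month is never split across folds."""
    months = {}
    for i, r in enumerate(rows):
        months.setdefault(r["timestamp"][:7], []).append(i)  # "YYYY-MM" prefix
    return list(months.values())
{code}

The design point is that only the fold-construction step varies; the 
train/evaluate loop stays identical for every strategy.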




[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-08-23 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432463#comment-15432463
 ] 

Vincent commented on SPARK-17055:
-

Sorry for the late reply. Yes, I just learned they intend to rename it to 
GroupKFold, or something like that. Personally I think it's fine to keep it 
the way it is, though it can still be confusing when someone first uses it 
before understanding the idea behind it.

As for applications, take face recognition as an example. The features are, 
say, eyes, nose, lips, etc., and the training data are gathered from a number 
of different people. This method can create subject-independent folds, so we 
can train the model on features from one group of people and take the data 
from the remaining people for validation. That improves the model's ability to 
generalize and avoids over-fitting.

It's a useful method, available in scikit-learn, and caret is currently on the 
way to adding this feature.
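One way such subject-independent folds could be produced on a DataFrame, as a 
hedged sketch (the column names and the hash-based fold assignment are 
illustrative, not the PR's implementation): every row for a subject hashes to 
the same fold, so validation subjects never appear in training.

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame(
    [("alice", 1.0), ("alice", 2.0), ("bob", 3.0), ("carol", 4.0)],
    ["subject_id", "feature"])

k = 3
# All rows of a subject land in the same fold; fold sizes are less even than
# random k-fold, which is the price of subject independence.
df = df.withColumn("fold", F.abs(F.hash("subject_id")) % k)
train_0 = df.filter(F.col("fold") != 0)  # round 0: train on folds 1..k-1
test_0 = df.filter(F.col("fold") == 0)   # round 0: validate on fold 0
{code}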




[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-08-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432427#comment-15432427
 ] 

Sean Owen commented on SPARK-17055:
---

From a comment in the PR, I get it. This is not actually about labels, but 
about some arbitrary attribute or function of each example. The purpose is to 
group examples into train/test such that examples with the same attribute 
value always go into the same data set. So maybe you want all examples for one 
customer ID to go into train, or all into test, but not split across both.

This needs a different name, I think, because 'label' has a specific and 
different meaning, and even scikit-learn says they want to rename it. It's 
coherent, but I still don't know how useful it is. It would need to be 
reconstructed for Spark ML.




[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-08-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430386#comment-15430386
 ] 

Sean Owen commented on SPARK-17055:
---

The model will always have 0% accuracy on CV / test data whose label was not 
in the training data. I don't think your statement is true; can you give me an 
example that clarifies what you have in mind?




[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-08-22 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430381#comment-15430381
 ] 

Vincent commented on SPARK-17055:
-

Well, a better model will have better CV performance on data with unseen 
labels, so the selected model will be relatively better at predicting samples 
with unseen categories/labels in real use.




[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-08-22 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430359#comment-15430359
 ] 

Sean Owen commented on SPARK-17055:
---

Yes, I understand how model fitting works. If a label is present in test but 
not train data, the model will never predict it. This doesn't require any code 
at all; it's always true. What would you be evaluating by holding out all data 
with a certain label from training?




[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-08-15 Thread Vincent (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15420867#comment-15420867
 ] 

Vincent commented on SPARK-17055:
-

One of the most common tasks is to fit a model to a set of training data so as 
to make reliable predictions on general, unseen data. labelKFold can be used 
to test the model's ability to generalize by evaluating its performance on a 
class of data not used for training, which is assumed to approximate the 
typical unseen data that a model will encounter.




[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-08-15 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15420748#comment-15420748
 ] 

Sean Owen commented on SPARK-17055:
---

Hm, when would you want a label to be absent from training but present in 
test, or vice versa? They should, in general, have the same distribution.




[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator

2016-08-14 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15420569#comment-15420569
 ] 

Apache Spark commented on SPARK-17055:
--

User 'VinceShieh' has created a pull request for this issue:
https://github.com/apache/spark/pull/14640
