[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15625108#comment-15625108 ] Vincent commented on SPARK-17055: - [~sowen] Okay. hmm, I guess we have some misunderstanding here. [~remi.delas...@gmail.com] reviewed the code and gave some feedbacks that we should be prepared to accommodate to other folding methods if there is any. But my opinion was that we align with current design in MLLIB, becoz: 1. we dont have that many folding methods so far in MLLIB; 2. Changing the API as Remi proposed will have impact on current cross-validation usage, which I think it'd be better get the nod from committers. So, I suggest reopen this ticket coz as we know this folding method is useful in practice, and ,as from what I have learned, some ppl in the community actually need/use this PR when they use Spark ML. :) > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15625033#comment-15625033 ] Rémi Delassus commented on SPARK-17055: --- Discussion in which you said you were closing it, not why you were doing so. Reading it I thought you were merging the PR. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624928#comment-15624928 ] Sean Owen commented on SPARK-17055: --- I did so following the discussion on the PR, that you were a part of yesterday: https://github.com/apache/spark/pull/14640#issuecomment-257205012 > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15624114#comment-15624114 ] Vincent commented on SPARK-17055: - [~srowen] May I ask the reason why we close this issue? It'd be helpful for us to understand current guideline if we are to implement more features in ML/MLLIB, thanks. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15614576#comment-15614576 ] Rémi Delassus commented on SPARK-17055: --- I had an issue that could be solved by this kind of technique : Each sample got a timesamp and the error can only be computed on a full month of data. Thus I need months of data to be distributed in folds, no each sample individually. But in my opinion this is not the good way to solve that. Since there is an infinite number of ways to split the data, I think we should be able to pass the split method as an argument to the crossvalidator. The method described here could be implemented, as well as any other. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432463#comment-15432463 ] Vincent commented on SPARK-17055: - sorry for late reply. Yes, I just knew they intend to rename it to GroupKFold, or something like that. Though personally I think it's fine to keep the way it is, though, it could be still kinda confusing when someone first uses it before understanding the idea behinds it. As for application, take face recognition as an example. features are, say, eyes, nose, lips etc. training data are obtained from a number of different person, this method can create subject independent folds, so we can train the model with features from certain group of people and take the data from the rest of group of people for validation. it will enhance the generic ability of the model and avoid over-fitting. it's a useful method, seen in sklearn, and currently caret is on the way add this feature. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15432427#comment-15432427 ] Sean Owen commented on SPARK-17055: --- >From a comment in the PR, I get it. This is not actually about labels, but >about some arbitrary attribute or function of each example. The purpose is to >group examples into train/test such that examples with the same attribute >value always go into the same data set. So maybe you want all examples for one >customer ID to go into train, or all into test, but not split across both. This needs a different name I think because 'label' has a specific and different meaning, and even scikit says they want to rename it. It's coherent, but I still don't know how useful it is. It would need to be reconstruted for Spark ML. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430386#comment-15430386 ] Sean Owen commented on SPARK-17055: --- The model will always have 0% accuracy on CV / test data whose label was not in the training data. Can you give me an example that clarifies what you have in mind? I don't think this statement is true. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430381#comment-15430381 ] Vincent commented on SPARK-17055: - well, a better model will have a better cv performance on data with unseen labels, so the final selected model will have a relatively better capability on predicting samples with unseen categories/labels in real case. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15430359#comment-15430359 ] Sean Owen commented on SPARK-17055: --- Yes, I understand how model fitting works. If a label is present in test but not train data, the model will never predict it. This doesn't require any code at all; it's always true. What would you be evaluating by holding out all data with a certain label from training? > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15420867#comment-15420867 ] Vincent commented on SPARK-17055: - one of the most common tasks is to fit a "model" to a set of training data, so as to be able to make reliable predictions on general untrained data. labelKFold can be used to test the model's ability to generalize by evaluating its performance on a class of data not used for training, which is assumed to approximate the typical unseen data that a model will encounter. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15420748#comment-15420748 ] Sean Owen commented on SPARK-17055: --- Hm, when would you want a label to not be present in training but present in test or vice versa? they should, in general, have the same distribution. > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream packages like Sklearn already supports such cross validation > method. > (http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.LabelKFold.html#sklearn.cross_validation.LabelKFold) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-17055) add labelKFold to CrossValidator
[ https://issues.apache.org/jira/browse/SPARK-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15420569#comment-15420569 ] Apache Spark commented on SPARK-17055: -- User 'VinceShieh' has created a pull request for this issue: https://github.com/apache/spark/pull/14640 > add labelKFold to CrossValidator > > > Key: SPARK-17055 > URL: https://issues.apache.org/jira/browse/SPARK-17055 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 2.0.0 >Reporter: Vincent >Priority: Minor > > Current CrossValidator only supports k-fold, which randomly divides all the > samples in k groups of samples. But in cases when data is gathered from > different subjects and we want to avoid over-fitting, we want to hold out > samples with certain labels from training data and put them into validation > fold, i.e. we want to ensure that the same label is not in both testing and > training sets. > Mainstream package like Sklearn already supports such cross validation > method. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org