[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196765#comment-15196765 ]

Xusen Yin commented on SPARK-13641:
-----------------------------------

[~muralidh] I am going to close this JIRA, since I found that this behavior is intended: the one-hot encoder produces these names when it indexes discrete features.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-13641
>                 URL: https://issues.apache.org/jira/browse/SPARK-13641
>             Project: Spark
>          Issue Type: Bug
>          Components: ML, SparkR
>            Reporter: Xusen Yin
>            Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
>     m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets the features' names from the features column. Usually, the
> features column is generated by RFormula. The latter has a VectorAssembler in
> it, so the output attribute names are not equal to the original ones.
> E.g., we want to learn the HouseVotes84 features' names "V1, V2, ..., V16".
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the
> transform function of
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
> salts the attribute names with the column names.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181872#comment-15181872 ]

Xusen Yin commented on SPARK-13641:
-----------------------------------

You can check out the code from https://github.com/apache/spark/pull/11486, then run ./bin/sparkR with this [test example|https://github.com/yinxusen/spark/blob/SPARK-13449/R/pkg/inst/tests/testthat/test_mllib.R#L145]. Calling summary(model) shows that the column names are not the original ones.
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181819#comment-15181819 ]

Gayathri Murali commented on SPARK-13641:
-----------------------------------------

[~xusen] Can you list the steps to reproduce the bug?
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181369#comment-15181369 ]

Gayathri Murali commented on SPARK-13641:
-----------------------------------------

Yes, it looks intentional, in order to carry metadata. There are multiple ways in which it can be stripped out. I will let [~mengxr] answer whether it's OK to strip the "c_" from feature names in the SparkRWrapper.
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181294#comment-15181294 ]

Xusen Yin commented on SPARK-13641:
-----------------------------------

Or maybe we can try to find another way to extract the feature names.
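
Editor's sketch: one possible "other way" to recover names such as "V1" from assembled names such as "V1_n" is to post-process the attribute names against the known input columns. The Scala below is only an illustration under stated assumptions, not code from Spark or from the linked pull request: FeatureNames, originalColumn, originalColumns and inputCols are hypothetical names, and the caller is assumed to already know the original input column names (for example from the formula terms). It also assumes each assembled name is either the column name itself or the column name followed by "_" and a category label.

```scala
// Hypothetical helper, not part of Spark: maps assembled attribute names
// back to the input columns they were derived from.
object FeatureNames {

  /** Returns the input column that `attrName` appears to come from.
    *
    * Assumption: an assembled name is either the column name itself or the
    * column name followed by "_" and a category label (e.g. "V1_n").
    * Names that match no known column are returned unchanged.
    */
  def originalColumn(attrName: String, inputCols: Seq[String]): String =
    inputCols
      .filter(c => attrName == c || attrName.startsWith(c + "_"))
      .sortBy(c => -c.length) // prefer the longest match, e.g. "a_b" over "a" for "a_b_x"
      .headOption
      .getOrElse(attrName)

  /** Maps every attribute name and drops repeats, keeping first-seen order. */
  def originalColumns(attrNames: Seq[String], inputCols: Seq[String]): Seq[String] =
    attrNames.map(originalColumn(_, inputCols)).distinct
}
```

For instance, FeatureNames.originalColumns(Seq("V1_n", "V2_y"), Seq("V1", "V2")) yields Seq("V1", "V2"). The prefix match is a heuristic: if one column is named "a" with a category "b_x" and another column is named "a_b", the name "a_b_x" is ambiguous, and this sketch resolves it to the longer column name.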
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181130#comment-15181130 ]

Xusen Yin commented on SPARK-13641:
-----------------------------------

I am not very familiar with it, but I think it's intentional. See the JIRA [SPARK-7198|https://issues.apache.org/jira/browse/SPARK-7198]. I am also bringing [~mengxr] in here.
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179241#comment-15179241 ]

Gayathri Murali commented on SPARK-13641:
-----------------------------------------

[~xusen] I can remove c_feature from the VectorAssembler code, but is the naming convention intentional?
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178674#comment-15178674 ]

Xusen Yin commented on SPARK-13641:
-----------------------------------

Gayathri, you can just go ahead and work on it.

--
Cheers,
Xusen Yin
[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names
[ https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178517#comment-15178517 ]

Gayathri Murali commented on SPARK-13641:
-----------------------------------------

I can work on this. Can you please assign it to me?