[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-15 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15196765#comment-15196765
 ] 

Xusen Yin commented on SPARK-13641:
---

[~muralidh] I gonna close this JIRA since I find that it is intended to do so 
by One-hot encoder to index discrete feature. 

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which leads the output attributes not equal with the original ones.
> E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts of the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-05 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181872#comment-15181872
 ] 

Xusen Yin commented on SPARK-13641:
---

You can checkout code from https://github.com/apache/spark/pull/11486.

Run ./bin/sparkR with this [test 
example|https://github.com/yinxusen/spark/blob/SPARK-13449/R/pkg/inst/tests/testthat/test_mllib.R#L145].
 With summary(model) you can see the column names are not the original.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which leads the output attributes not equal with the original ones.
> E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts of the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-05 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181819#comment-15181819
 ] 

Gayathri Murali commented on SPARK-13641:
-

[~xusen] Can you list the steps to reproduce the bug? 

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which leads the output attributes not equal with the original ones.
> E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts of the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-04 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181369#comment-15181369
 ] 

Gayathri Murali commented on SPARK-13641:
-

Yes, it looks intentional to carry metadata. There are multiple ways in which 
it can stripped out. I will let [~mengxr] to answer if its ok to strip out the 
"c_" from feature names in the SparkRWrapper

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which leads the output attributes not equal with the original ones.
> E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts of the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-04 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181294#comment-15181294
 ] 

Xusen Yin commented on SPARK-13641:
---

Or maybe we try to find another way to extract the feature names.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which leads the output attributes not equal with the original ones.
> E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts of the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-04 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181130#comment-15181130
 ] 

Xusen Yin commented on SPARK-13641:
---

I am not quite familiar with it but I think it's intentional. Refer to the JIRA 
[SPARK 7198|https://issues.apache.org/jira/browse/SPARK-7198] and bring 
[~mengxr] here.

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which leads the output attributes not equal with the original ones.
> E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts of the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-03 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179241#comment-15179241
 ] 

Gayathri Murali commented on SPARK-13641:
-

[~xusen] I can remove c_feature from Vector Assembler code, but is the naming 
convention intentional?  

> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which leads the output attributes not equal with the original ones.
> E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts of the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-03 Thread Xusen Yin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178674#comment-15178674
 ] 

Xusen Yin commented on SPARK-13641:
---

Gayathri, you can just go ahead and work on it.




-- 
Cheers,
Xusen Yin


> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which leads the output attributes not equal with the original ones.
> E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts of the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13641) getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the original column names

2016-03-03 Thread Gayathri Murali (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178517#comment-15178517
 ] 

Gayathri Murali commented on SPARK-13641:
-

I can work on this. Can you please assign it to me


> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names
> ---
>
> Key: SPARK-13641
> URL: https://issues.apache.org/jira/browse/SPARK-13641
> Project: Spark
>  Issue Type: Bug
>  Components: ML, SparkR
>Reporter: Xusen Yin
>Priority: Minor
>
> getModelFeatures of ml.api.r.SparkRWrapper cannot (always) reveal the 
> original column names. Let's take the HouseVotes84 data set as an example:
> {code}
> case m: XXXModel =>
>   val attrs = AttributeGroup.fromStructField(
> m.summary.predictions.schema(m.summary.featuresCol))
>   attrs.attributes.get.map(_.name.get)
> {code}
> The code above gets features' names from the features column. Usually, the 
> features column is generated by RFormula. The latter has a VectorAssembler in 
> it, which leads the output attributes not equal with the original ones.
> E.g., we want to learn the HouseVotes84's features' name "V1, V2, ..., V16". 
> But with RFormula, we can only get "V1_n, V2_y, ..., V16_y" because [the 
> transform function of 
> VectorAssembler|https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L75]
>  adds salts of the column names.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org