[jira] [Commented] (SPARK-11478) ML StringIndexer return inconsistent schema

2016-09-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15468525#comment-15468525
 ] 

Joseph K. Bradley commented on SPARK-11478:
---

I'm not sure if it was on purpose.  I could see two arguments:
* Should not be nullable: MLlib algorithms all assume data are complete, with 
no missing fields.
* Should be nullable: Algorithms could (should?) be modified to support 
nullable values.

Is this issue a blocker for any workloads, or is it just an oddity?  I'll 
downgrade it to minor unless someone protests.

> ML StringIndexer return inconsistent schema
> ---
>
> Key: SPARK-11478
> URL: https://issues.apache.org/jira/browse/SPARK-11478
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
>
> ML StringIndexer transform and transformSchema return inconsistent schemas.
> {code}
> import org.apache.spark.ml.feature.StringIndexer
>
> val data = sc.parallelize(
>   Seq((0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")), 2)
> val df = sqlContext.createDataFrame(data).toDF("id", "label")
> val indexer = new StringIndexer()
>   .setInputCol("label")
>   .setOutputCol("labelIndex")
>   .fit(df)
> val transformed = indexer.transform(df)
> // Schema of the DataFrame actually produced by transform
> println(transformed.schema.toString())
> // Schema predicted by transformSchema
> println(indexer.transformSchema(df.schema))
> {code}
> The "nullable" flag of "labelIndex" differs between the two outputs:
> {code}
> StructType(StructField(id,IntegerType,false), StructField(label,StringType,true), StructField(labelIndex,DoubleType,true))
> StructType(StructField(id,IntegerType,false), StructField(label,StringType,true), StructField(labelIndex,DoubleType,false))
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11478) ML StringIndexer return inconsistent schema

2015-12-16 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15061371#comment-15061371
 ] 

Yanbo Liang commented on SPARK-11478:
-

[~wjur] I'm not working on this. You can work on it if you have time, thanks!




[jira] [Commented] (SPARK-11478) ML StringIndexer return inconsistent schema

2015-12-16 Thread Wojciech Jurczyk (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15060210#comment-15060210
 ] 

Wojciech Jurczyk commented on SPARK-11478:
--

Any progress on this, [~yanboliang]? I faced the same issue and I'm wondering 
if you're still working on this.




[jira] [Commented] (SPARK-11478) ML StringIndexer return inconsistent schema

2015-11-09 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14996286#comment-14996286
 ] 

Yanbo Liang commented on SPARK-11478:
-

Because the "nullable" value is determined by the DataFrame execution 
workflow, ML code can only obtain it after calling "transform". (Maybe Spark 
SQL could expose an API to get "nullable" ahead of time?)
{quote}
For now, does it work to change toStructField to set nullable to true? All of 
the UDFs which create Double fields apparently set nullable = true by default 
(because of how ScalaReflection works).
{quote}
Yes, the Double fields set nullable = true by default. If we change 
toStructField to set nullable to true, we can pass the regression test for 
[this|https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/ml/attribute/AttributeSuite.scala#L68]
 test case. I would like to know whether toStructField setting nullable to 
false is on purpose.
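
Until toStructField itself changes, the same effect can be approximated on the caller side. The following is a minimal sketch, assuming spark-mllib is on the classpath (for example in spark-shell); `outputField` and `relaxedField` are illustrative names, the literal "labelIndex" stands in for $(outputCol), and this is a workaround sketch rather than the fix adopted for this issue.

```scala
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.sql.types.{DoubleType, StructField}

// Same call chain that transformSchema uses, with a literal column name
// standing in for $(outputCol).
val outputField: StructField =
  NominalAttribute.defaultAttr.withName("labelIndex").toStructField()

// StructField is a case class, so copy() can relax nullability while
// keeping the name, the data type and the ML attribute metadata.
val relaxedField: StructField = outputField.copy(nullable = true)
```

Only the "nullable" flag is changed; the attribute metadata carried by the field is left as toStructField produced it.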




[jira] [Commented] (SPARK-11478) ML StringIndexer return inconsistent schema

2015-11-06 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14994007#comment-14994007
 ] 

Joseph K. Bradley commented on SPARK-11478:
---

{quote}it is difficult to get the "nullable" value of a specific column before 
it is generated{quote}
--> Is this true?  What is an example?  I could imagine this happening in the 
future but cannot think of an example at this time.

For now, does it work to change toStructField to set nullable to true?  All of 
the UDFs which create Double fields apparently set nullable = true by default 
(because of how ScalaReflection works).

In the long term, it'd be nice to have everything be an Option (allowing an 
unknown state).
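
To make the "Option" idea concrete, here is a small sketch in plain Scala. `FieldSpec` and `compatible` are hypothetical names invented for illustration; Spark's own StructField stores "nullable" as a plain Boolean, so nothing below is existing Spark API.

```scala
// Hypothetical model for illustration only; not part of Spark.
final case class FieldSpec(name: String, dataType: String, nullable: Option[Boolean])

// Two fields agree when names and types match and nullability does not
// conflict. None means "unknown" and is compatible with either value.
def compatible(declared: FieldSpec, actual: FieldSpec): Boolean =
  declared.name == actual.name &&
    declared.dataType == actual.dataType &&
    ((declared.nullable, actual.nullable) match {
      case (Some(a), Some(b)) => a == b
      case _                  => true
    })

// transformSchema could declare "unknown" while transform reports a value.
val declared = FieldSpec("labelIndex", "double", nullable = None)
val actual = FieldSpec("labelIndex", "double", nullable = Some(true))
```

With an unknown state available, a schema check would only fail when both sides make a definite and contradictory claim.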




[jira] [Commented] (SPARK-11478) ML StringIndexer return inconsistent schema

2015-11-04 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14989091#comment-14989091
 ] 

Yanbo Liang commented on SPARK-11478:
-

I have found the cause of this bug but have not figured out how to resolve it. 
In "transformSchema" we use the following code snippet to produce the output 
column schema:
{code}
val attr = NominalAttribute.defaultAttr.withName($(outputCol))
attr.toStructField()
{code}
And Attribute.toStructField() sets "nullable" to "false" unconditionally.

But in "transform" we use DataFrame operations to produce the new column. We 
cannot control the "nullable" of that column, which is decided by the 
DataFrame's internal implementation.

Most other feature transformers have the same problem. I propose not checking 
"nullable", because it is difficult to get the "nullable" value of a specific 
column before it is generated. So "transformSchema" cannot report the right 
"nullable" value, and if it sets a specific value, "transform" cannot be made 
to follow it. [~mengxr]

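The proposal not to check "nullable" could look roughly like the following. This is a minimal sketch, assuming spark-sql is on the classpath (for example in spark-shell); `ignoreNullable` is a hypothetical helper, not an existing Spark method, and the two schemas are rebuilt by hand from the output printed in the issue description, without the ML attribute metadata a real "transform" attaches.

```scala
import org.apache.spark.sql.types._

// Hypothetical helper: mark every top-level field as nullable so that two
// schemas can be compared on names and data types only. Nested fields are
// not touched.
def ignoreNullable(schema: StructType): StructType =
  StructType(schema.fields.map(_.copy(nullable = true)))

// Schema as printed for transform(df)
val fromTransform = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("label", StringType, nullable = true),
  StructField("labelIndex", DoubleType, nullable = true)))

// Schema as printed for transformSchema(df.schema)
val fromTransformSchema = StructType(Seq(
  StructField("id", IntegerType, nullable = false),
  StructField("label", StringType, nullable = true),
  StructField("labelIndex", DoubleType, nullable = false)))

// A strict comparison sees the "nullable" mismatch on labelIndex ...
val strictlyEqual = fromTransform == fromTransformSchema
// ... while the relaxed comparison only looks at names and types.
val equalIgnoringNullable =
  ignoreNullable(fromTransform) == ignoreNullable(fromTransformSchema)
```

The trade-off is that a genuine nullability difference in an input column would also go unnoticed by such a check.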



[jira] [Commented] (SPARK-11478) ML StringIndexer return inconsistent schema

2015-11-03 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987807#comment-14987807
 ] 

Apache Spark commented on SPARK-11478:
--

User 'rekhajoshm' has created a pull request for this issue:
https://github.com/apache/spark/pull/9440




[jira] [Commented] (SPARK-11478) ML StringIndexer return inconsistent schema

2015-11-03 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14987379#comment-14987379
 ] 

Yanbo Liang commented on SPARK-11478:
-

I will try to find some clues.
