[jira] [Commented] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables

2016-11-04 Thread Apache Spark (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15636515#comment-15636515 ]

Apache Spark commented on SPARK-17983:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/14750

> Can't filter over mixed case parquet columns of converted Hive tables
> -
>
> Key: SPARK-17983
> URL: https://issues.apache.org/jira/browse/SPARK-17983
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.1.0
>Reporter: Eric Liang
>Priority: Critical
>
> We should probably revive https://github.com/apache/spark/pull/14750 in order 
> to fix this issue and related classes of issues.
> The only other alternatives are (1) reconciling on-disk schemas with 
> metastore schema at planning time, which seems pretty messy, and (2) fixing 
> all the datasources to support case-insensitive matching, which also has 
> issues.
> Reproduction:
> {code}
>   private def setupPartitionedTable(tableName: String, dir: File): Unit = {
>     spark.range(5).selectExpr("id as normalCol", "id as partCol1", "id as partCol2").write
>       .partitionBy("partCol1", "partCol2")
>       .mode("overwrite")
>       .parquet(dir.getAbsolutePath)
>     spark.sql(s"""
>       |create external table $tableName (normalCol long)
>       |partitioned by (partCol1 int, partCol2 int)
>       |stored as parquet
>       |location "${dir.getAbsolutePath}"""".stripMargin)
>     spark.sql(s"msck repair table $tableName")
>   }
>
>   test("filter by mixed case col") {
>     withTable("test") {
>       withTempDir { dir =>
>         setupPartitionedTable("test", dir)
>         val df = spark.sql("select * from test where normalCol = 3")
>         assert(df.count() == 1)
>       }
>     }
>   }
> {code}
> cc [~cloud_fan]






[jira] [Commented] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables

2016-10-18 Thread Eric Liang (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586476#comment-15586476 ]

Eric Liang commented on SPARK-17983:


Since we already store the original (case-sensitive) schema of datasource tables in the metastore as a table property, I think it also makes sense to do this for Hive tables created through Spark. That would resolve this issue for both types of tables.

The one issue I can see with this is that Hive tables created through previous versions of Spark will stop working in 2.1 if their files have mixed-case column names. We would need a workaround for that situation.
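As a rough sketch of the idea (the property key and helper names below are hypothetical, not what Spark actually uses), storing and restoring the case-preserving schema could look like this:

{code}
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical property key, for illustration only.
val SCHEMA_PROP = "example.sql.sources.schema"

// On CREATE TABLE: serialize the exact (mixed-case) schema to JSON
// and stash it in the table properties.
def schemaToProperty(schema: StructType): (String, String) =
  SCHEMA_PROP -> schema.json

// On table resolution: prefer the stored schema over the lower-cased
// metastore columns, falling back when the property is absent.
def restoreSchema(props: Map[String, String], metastoreSchema: StructType): StructType =
  props.get(SCHEMA_PROP)
    .map(DataType.fromJson(_).asInstanceOf[StructType])
    .getOrElse(metastoreSchema)
{code}

The fallback branch is exactly the compatibility gap noted above: tables written by older Spark versions carry no such property, so they would still resolve to the lower-cased metastore schema.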




[jira] [Commented] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables

2016-10-18 Thread Michael Allman (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586133#comment-15586133 ]

Michael Allman commented on SPARK-17983:


Hmmm... I'm not sure what you mean. Are you talking about changing the Parquet spec itself? And what about other case-sensitive file formats?




[jira] [Commented] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables

2016-10-18 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586121#comment-15586121 ]

Reynold Xin commented on SPARK-17983:
-

We can update Parquet to make it case insensitive too.





[jira] [Commented] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables

2016-10-18 Thread Michael Allman (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586116#comment-15586116 ]

Michael Allman commented on SPARK-17983:


Speaking strictly from the POV of Parquet predicate pushdown, I don't see how we can get away from doing that in a case-sensitive manner, at least not if it's part of planning (optimization). Pushing down a filter with the wrong-case column name just doesn't work. The same can be said of projection pushdown, though I believe that happens as part of execution.
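To make that concrete, here is a minimal sketch (the helper names are mine, and a real implementation would cover every Filter subclass) of rewriting a pushed-down filter to the exact case found in the file footer before handing it to the Parquet reader:

{code}
import org.apache.spark.sql.sources.{And, EqualTo, Filter, GreaterThan}
import org.apache.spark.sql.types.StructType

// Map lower-cased name -> physical (on-disk) name from the Parquet footer.
def physicalNames(fileSchema: StructType): Map[String, String] =
  fileSchema.fields.map(f => f.name.toLowerCase -> f.name).toMap

// Rewrite a source filter to on-disk case; skip pushdown entirely rather
// than push a wrong-case name, which, as noted above, just doesn't work.
def rewrite(filter: Filter, names: Map[String, String]): Option[Filter] = filter match {
  case EqualTo(attr, v)     => names.get(attr.toLowerCase).map(EqualTo(_, v))
  case GreaterThan(attr, v) => names.get(attr.toLowerCase).map(GreaterThan(_, v))
  case And(l, r)            =>
    for (lf <- rewrite(l, names); rf <- rewrite(r, names)) yield And(lf, rf)
  case _                    => None
}
{code}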




[jira] [Commented] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables

2016-10-18 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586090#comment-15586090 ]

Reynold Xin commented on SPARK-17983:
-

I'm really tempted to say that, for the time being, everything in Spark SQL has to be case insensitive, because I don't see a way to provide a good, reliable solution otherwise.
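For reference, the analyzer side already hangs off a single flag, spark.sql.caseSensitive (false by default); a quick illustration against the repro table:

{code}
// Controls analysis-time name resolution only; on its own it does not
// change what gets pushed down to the Parquet files.
spark.conf.set("spark.sql.caseSensitive", "false") // default
spark.sql("select NORMALCOL from test")            // resolves to normalCol

spark.conf.set("spark.sql.caseSensitive", "true")
spark.sql("select NORMALCOL from test")            // analysis error
{code}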





[jira] [Commented] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables

2016-10-18 Thread Reynold Xin (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586069#comment-15586069 ]

Reynold Xin commented on SPARK-17983:
-

[~michael] [~ekhliang] can we gather all the tickets related to pushing partition handling into the metastore under the umbrella ticket SPARK-17861?




[jira] [Commented] (SPARK-17983) Can't filter over mixed case parquet columns of converted Hive tables

2016-10-18 Thread Michael Allman (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-17983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15586018#comment-15586018 ]

Michael Allman commented on SPARK-17983:


cc [~rxin]

I had a feeling there might be some fallout like this. It seems we really need to reconcile Hive metastore column names with on-disk column names as part of planning.

I think I mentioned this before, and I have actually implemented this kind of reconciliation. It occurs after partition pruning in optimization, so it only involves the partitions in the query plan. Obviously, this is a big improvement over the original behavior, which scanned every data file in the table.

Even though this adds some cost to query planning, I believe it can be restricted to the first access of a partition in a given Spark session. The straightforward solution would be to cache the table metadata incrementally as partitions are scanned. Subsequent requests for partition schema and metadata would come from the cache. The cache would be invalidated through the usual methods.

This follows along the lines of the "re-add partition caching" task I mentioned at the beginning of https://github.com/apache/spark/pull/14690.
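For what it's worth, the incremental cache could be as simple as a concurrent map keyed by partition location; the sketch below uses hypothetical names and elides the catalog plumbing:

{code}
import scala.collection.concurrent.TrieMap
import org.apache.spark.sql.types.StructType

class PartitionSchemaCache {
  // partition location -> schema read from that partition's files
  private val cache = TrieMap.empty[String, StructType]

  // First access in a session pays the file-footer read; subsequent
  // requests for the same partition are served from the cache.
  def schemaFor(path: String, readFooter: String => StructType): StructType =
    cache.getOrElseUpdate(path, readFooter(path))

  // Invalidated through the usual methods (REFRESH TABLE, DDL, ...).
  def invalidate(): Unit = cache.clear()
}
{code}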

Thoughts?
