spark git commit: [SPARK-17213][SQL] Disable Parquet filter push-down for string and binary columns due to PARQUET-686

2016-12-01 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 0f0903d17 -> a7f8ebb86


[SPARK-17213][SQL] Disable Parquet filter push-down for string and binary columns due to PARQUET-686

This PR targets both master and branch-2.1.

## What changes were proposed in this pull request?

Due to PARQUET-686, Parquet doesn't compare strings correctly when evaluating
pushed-down filters on string columns. This PR disables filter push-down for
both string and binary columns to work around the issue. Binary columns are
affected as well because some Parquet data models (such as Hive) may store
string columns as plain Parquet `binary` rather than `binary (UTF8)`.
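
For reference, a minimal, self-contained Scala sketch (illustrative only, not part of this patch) of the root cause: the affected parquet-mr versions compared binary values byte-by-byte as signed bytes, which disagrees with UTF-8 lexicographic order whenever a byte is >= 0x80, so min/max row-group statistics could wrongly prune row groups that contain matching rows.

```scala
object SignedVsUnsignedCompare {
  // Byte-wise comparison using signed bytes, as in the affected
  // parquet-mr versions (PARQUET-686): the byte 0xC3 reads as -61.
  def signedCompare(a: Array[Byte], b: Array[Byte]): Int = {
    var i = 0
    while (i < math.min(a.length, b.length)) {
      val cmp = java.lang.Byte.compare(a(i), b(i))
      if (cmp != 0) return cmp
      i += 1
    }
    java.lang.Integer.compare(a.length, b.length)
  }

  // Byte-wise comparison using unsigned bytes, which matches
  // UTF-8 lexicographic string order.
  def unsignedCompare(a: Array[Byte], b: Array[Byte]): Int = {
    var i = 0
    while (i < math.min(a.length, b.length)) {
      val cmp = java.lang.Integer.compare(a(i) & 0xFF, b(i) & 0xFF)
      if (cmp != 0) return cmp
      i += 1
    }
    java.lang.Integer.compare(a.length, b.length)
  }

  def main(args: Array[String]): Unit = {
    val a = "a".getBytes("UTF-8") // 0x61
    val e = "é".getBytes("UTF-8") // 0xC3 0xA9
    println(signedCompare(a, e) > 0)   // true: claims "a" > "é" -- wrong
    println(unsignedCompare(a, e) < 0) // true: "a" < "é" -- correct order
  }
}
```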

## How was this patch tested?

New test case added in `ParquetFilterSuite`.
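
For illustration, here is a sketch of the kind of regression test this implies (hypothetical, not the exact test added; it assumes the suite's `withTempPath` helper, `testImplicits`, `checkAnswer`, and `Row` are in scope, as they are in `ParquetFilterSuite`):

```scala
test("SPARK-17213: string predicates must not drop matching rows") {
  withTempPath { dir =>
    import testImplicits._
    val path = dir.getCanonicalPath
    // "é" encodes to UTF-8 bytes >= 0x80, the case PARQUET-686 mishandles.
    Seq("a", "é").toDF("name").write.parquet(path)
    // With broken push-down, the row group could be pruned and "é" lost.
    checkAnswer(spark.read.parquet(path).filter("name > 'a'"), Row("é"))
  }
}
```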

Author: Cheng Lian 

Closes #16106 from liancheng/spark-17213-bad-string-ppd.

(cherry picked from commit ca6391637212814b7c0bd14c434a6737da17b258)
Signed-off-by: Reynold Xin 


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a7f8ebb8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a7f8ebb8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a7f8ebb8

Branch: refs/heads/branch-2.1
Commit: a7f8ebb8629706c54c286b7aca658838e718e804
Parents: 0f0903d
Author: Cheng Lian 
Authored: Thu Dec 1 22:02:45 2016 -0800
Committer: Reynold Xin 
Committed: Thu Dec 1 22:03:01 2016 -0800

--
 .../datasources/parquet/ParquetFilters.scala | 24 ++
 .../parquet/ParquetFilterSuite.scala         | 26 +---
 2 files changed, 47 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/a7f8ebb8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
index a6e9788..7730d1f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
@@ -40,6 +40,9 @@ private[parquet] object ParquetFilters {
   (n: String, v: Any) => FilterApi.eq(floatColumn(n), v.asInstanceOf[java.lang.Float])
 case DoubleType =>
   (n: String, v: Any) => FilterApi.eq(doubleColumn(n), v.asInstanceOf[java.lang.Double])
+
+// See SPARK-17213: https://issues.apache.org/jira/browse/SPARK-17213
+/*
 // Binary.fromString and Binary.fromByteArray don't accept null values
 case StringType =>
   (n: String, v: Any) => FilterApi.eq(
@@ -49,6 +52,7 @@ private[parquet] object ParquetFilters {
   (n: String, v: Any) => FilterApi.eq(
 binaryColumn(n),
 Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull)
+ */
   }

   private val makeNotEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = {
@@ -62,6 +66,9 @@ private[parquet] object ParquetFilters {
   (n: String, v: Any) => FilterApi.notEq(floatColumn(n), v.asInstanceOf[java.lang.Float])
 case DoubleType =>
   (n: String, v: Any) => FilterApi.notEq(doubleColumn(n), v.asInstanceOf[java.lang.Double])
+
+// See SPARK-17213: https://issues.apache.org/jira/browse/SPARK-17213
+/*
 case StringType =>
   (n: String, v: Any) => FilterApi.notEq(
 binaryColumn(n),
@@ -70,6 +77,7 @@ private[parquet] object ParquetFilters {
   (n: String, v: Any) => FilterApi.notEq(
 binaryColumn(n),
 Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull)
+ */
   }

   private val makeLt: PartialFunction[DataType, (String, Any) => FilterPredicate] = {
@@ -81,6 +89,9 @@ private[parquet] object ParquetFilters {
   (n: String, v: Any) => FilterApi.lt(floatColumn(n), v.asInstanceOf[java.lang.Float])
 case DoubleType =>
   (n: String, v: Any) => FilterApi.lt(doubleColumn(n), v.asInstanceOf[java.lang.Double])
+
+// See SPARK-17213: https://issues.apache.org/jira/browse/SPARK-17213
+/*
 case StringType =>
   (n: String, v: Any) =>
 FilterApi.lt(binaryColumn(n),
@@ -88,6 +99,7 @@ private[parquet] object ParquetFilters {
 case BinaryType =>
   (n: String, v: Any) =>
 FilterApi.lt(binaryColumn(n), Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]]))
+ */
   }

   private val makeLtEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = {
@@ -99,6 +111,9 @@ private[parquet] object ParquetFilters {
   (n: String, v: Any) => FilterApi.ltEq(floatColumn(n), v.asInstanceOf[java.lang.Float])
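
For context, the commented-out branches above were building parquet-mr predicates for string and binary columns. A standalone sketch (not from the patch) of that construction, using the real `FilterApi` and `Binary` entry points:

```scala
import org.apache.parquet.filter2.predicate.FilterApi
import org.apache.parquet.filter2.predicate.FilterApi.binaryColumn
import org.apache.parquet.io.api.Binary

// A Catalyst predicate like EqualTo("name", "foo") used to be translated
// into a parquet-mr predicate on a binary column. Binary.fromString does
// not accept null, hence the Option(...).orNull wrapping in the code above.
val v: Any = "foo"
val pred = FilterApi.eq(
  binaryColumn("name"),
  Option(v).map(s => Binary.fromString(s.asInstanceOf[String])).orNull)
```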

spark git commit: [SPARK-17213][SQL] Disable Parquet filter push-down for string and binary columns due to PARQUET-686

2016-12-01 Thread rxin
Repository: spark
Updated Branches:
  refs/heads/master c82f16c15 -> ca6391637


[SPARK-17213][SQL] Disable Parquet filter push-down for string and binary columns due to PARQUET-686

This PR targets both master and branch-2.1.

## What changes were proposed in this pull request?

Due to PARQUET-686, Parquet doesn't compare strings correctly when evaluating
pushed-down filters on string columns. This PR disables filter push-down for
both string and binary columns to work around the issue. Binary columns are
affected as well because some Parquet data models (such as Hive) may store
string columns as plain Parquet `binary` rather than `binary (UTF8)`.
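
Until a build with this patch is available, a coarser session-level workaround is to disable Parquet filter push-down altogether with the existing `spark.sql.parquet.filterPushdown` conf; Spark still evaluates the filters itself after the scan, so results remain correct and only scan performance is affected. For example:

```scala
// Disable Parquet push-down for the whole session; predicates are then
// evaluated by Spark on the rows returned by the Parquet reader.
spark.conf.set("spark.sql.parquet.filterPushdown", "false")
```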

## How was this patch tested?

New test case added in `ParquetFilterSuite`.
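
To check what actually reaches the Parquet reader, one can inspect the physical plan; the scan node lists the predicates handed to parquet-mr (the example below is illustrative, with a made-up path):

```scala
// After this patch, string/binary predicates should no longer appear
// under "PushedFilters" in the FileScan parquet node of the plan output.
spark.read.parquet("/tmp/data").filter("name = 'foo'").explain()
```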

Author: Cheng Lian 

Closes #16106 from liancheng/spark-17213-bad-string-ppd.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ca639163
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ca639163
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ca639163

Branch: refs/heads/master
Commit: ca6391637212814b7c0bd14c434a6737da17b258
Parents: c82f16c
Author: Cheng Lian 
Authored: Thu Dec 1 22:02:45 2016 -0800
Committer: Reynold Xin 
Committed: Thu Dec 1 22:02:45 2016 -0800

--
 .../datasources/parquet/ParquetFilters.scala | 24 ++
 .../parquet/ParquetFilterSuite.scala         | 26 +---
 2 files changed, 47 insertions(+), 3 deletions(-)
--


http://git-wip-us.apache.org/repos/asf/spark/blob/ca639163/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
index a6e9788..7730d1f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
@@ -40,6 +40,9 @@ private[parquet] object ParquetFilters {
   (n: String, v: Any) => FilterApi.eq(floatColumn(n), v.asInstanceOf[java.lang.Float])
 case DoubleType =>
   (n: String, v: Any) => FilterApi.eq(doubleColumn(n), v.asInstanceOf[java.lang.Double])
+
+// See SPARK-17213: https://issues.apache.org/jira/browse/SPARK-17213
+/*
 // Binary.fromString and Binary.fromByteArray don't accept null values
 case StringType =>
   (n: String, v: Any) => FilterApi.eq(
@@ -49,6 +52,7 @@ private[parquet] object ParquetFilters {
   (n: String, v: Any) => FilterApi.eq(
 binaryColumn(n),
 Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull)
+ */
   }

   private val makeNotEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = {
@@ -62,6 +66,9 @@ private[parquet] object ParquetFilters {
   (n: String, v: Any) => FilterApi.notEq(floatColumn(n), v.asInstanceOf[java.lang.Float])
 case DoubleType =>
   (n: String, v: Any) => FilterApi.notEq(doubleColumn(n), v.asInstanceOf[java.lang.Double])
+
+// See SPARK-17213: https://issues.apache.org/jira/browse/SPARK-17213
+/*
 case StringType =>
   (n: String, v: Any) => FilterApi.notEq(
 binaryColumn(n),
@@ -70,6 +77,7 @@ private[parquet] object ParquetFilters {
   (n: String, v: Any) => FilterApi.notEq(
 binaryColumn(n),
 Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull)
+ */
   }

   private val makeLt: PartialFunction[DataType, (String, Any) => FilterPredicate] = {
@@ -81,6 +89,9 @@ private[parquet] object ParquetFilters {
   (n: String, v: Any) => FilterApi.lt(floatColumn(n), v.asInstanceOf[java.lang.Float])
 case DoubleType =>
   (n: String, v: Any) => FilterApi.lt(doubleColumn(n), v.asInstanceOf[java.lang.Double])
+
+// See SPARK-17213: https://issues.apache.org/jira/browse/SPARK-17213
+/*
 case StringType =>
   (n: String, v: Any) =>
 FilterApi.lt(binaryColumn(n),
@@ -88,6 +99,7 @@ private[parquet] object ParquetFilters {
 case BinaryType =>
   (n: String, v: Any) =>
 FilterApi.lt(binaryColumn(n), Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]]))
+ */
   }

   private val makeLtEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = {
@@ -99,6 +111,9 @@ private[parquet] object ParquetFilters {
   (n: String, v: Any) => FilterApi.ltEq(floatColumn(n), v.asInstanceOf[java.lang.Float])
 case DoubleType =>
   (n: String, v: Any) => FilterApi.ltEq(doubleColumn(n), v.asInstanceOf[java.lang.Double])