spark git commit: [SPARK-17213][SQL] Disable Parquet filter push-down for string and binary columns due to PARQUET-686
Repository: spark
Updated Branches:
  refs/heads/branch-2.1 0f0903d17 -> a7f8ebb86


[SPARK-17213][SQL] Disable Parquet filter push-down for string and binary columns due to PARQUET-686

This PR targets both master and branch-2.1.

## What changes were proposed in this pull request?

Due to PARQUET-686, Parquet doesn't compare strings correctly while doing filter push-down for string columns. This PR disables filter push-down for both string and binary columns to work around the issue. Binary columns are also affected because some Parquet data models (like Hive) may store string columns as plain Parquet `binary` rather than `binary (UTF8)`.

## How was this patch tested?

New test case added in `ParquetFilterSuite`.

Author: Cheng Lian

Closes #16106 from liancheng/spark-17213-bad-string-ppd.

(cherry picked from commit ca6391637212814b7c0bd14c434a6737da17b258)
Signed-off-by: Reynold Xin


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/a7f8ebb8
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/a7f8ebb8
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/a7f8ebb8

Branch: refs/heads/branch-2.1
Commit: a7f8ebb8629706c54c286b7aca658838e718e804
Parents: 0f0903d
Author: Cheng Lian
Authored: Thu Dec 1 22:02:45 2016 -0800
Committer: Reynold Xin
Committed: Thu Dec 1 22:03:01 2016 -0800

--
 .../datasources/parquet/ParquetFilters.scala | 24 ++
 .../parquet/ParquetFilterSuite.scala         | 26 +---
 2 files changed, 47 insertions(+), 3 deletions(-)
--

http://git-wip-us.apache.org/repos/asf/spark/blob/a7f8ebb8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
--
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
index a6e9788..7730d1f 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala
@@ -40,6 +40,9 @@ private[parquet] object ParquetFilters {
         (n: String, v: Any) => FilterApi.eq(floatColumn(n), v.asInstanceOf[java.lang.Float])
       case DoubleType =>
         (n: String, v: Any) => FilterApi.eq(doubleColumn(n), v.asInstanceOf[java.lang.Double])
+
+    // See SPARK-17213: https://issues.apache.org/jira/browse/SPARK-17213
+    /*
       // Binary.fromString and Binary.fromByteArray don't accept null values
       case StringType =>
         (n: String, v: Any) => FilterApi.eq(
@@ -49,6 +52,7 @@ private[parquet] object ParquetFilters {
         (n: String, v: Any) => FilterApi.eq(
           binaryColumn(n),
           Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull)
+     */
   }

   private val makeNotEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = {
@@ -62,6 +66,9 @@ private[parquet] object ParquetFilters {
         (n: String, v: Any) => FilterApi.notEq(floatColumn(n), v.asInstanceOf[java.lang.Float])
       case DoubleType =>
         (n: String, v: Any) => FilterApi.notEq(doubleColumn(n), v.asInstanceOf[java.lang.Double])
+
+    // See SPARK-17213: https://issues.apache.org/jira/browse/SPARK-17213
+    /*
       case StringType =>
         (n: String, v: Any) => FilterApi.notEq(
           binaryColumn(n),
@@ -70,6 +77,7 @@ private[parquet] object ParquetFilters {
         (n: String, v: Any) => FilterApi.notEq(
           binaryColumn(n),
           Option(v).map(b => Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]])).orNull)
+     */
   }

   private val makeLt: PartialFunction[DataType, (String, Any) => FilterPredicate] = {
@@ -81,6 +89,9 @@ private[parquet] object ParquetFilters {
         (n: String, v: Any) => FilterApi.lt(floatColumn(n), v.asInstanceOf[java.lang.Float])
       case DoubleType =>
         (n: String, v: Any) => FilterApi.lt(doubleColumn(n), v.asInstanceOf[java.lang.Double])
+
+    // See SPARK-17213: https://issues.apache.org/jira/browse/SPARK-17213
+    /*
       case StringType =>
         (n: String, v: Any) => FilterApi.lt(binaryColumn(n),
@@ -88,6 +99,7 @@ private[parquet] object ParquetFilters {
       case BinaryType =>
         (n: String, v: Any) => FilterApi.lt(binaryColumn(n),
           Binary.fromReusedByteArray(v.asInstanceOf[Array[Byte]]))
+     */
   }

   private val makeLtEq: PartialFunction[DataType, (String, Any) => FilterPredicate] = {
@@ -99,6 +111,9 @@ private[parquet] object ParquetFilters {
         (n: String, v: Any) =>
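The patch works because `ParquetFilters` builds predicates through Scala `PartialFunction`s keyed on the Catalyst data type: once the `StringType` and `BinaryType` cases are commented out, `lift` yields `None` for those types and the filter simply stays in Spark instead of being pushed to Parquet. A minimal, self-contained sketch of that dispatch pattern, using hypothetical stand-in types (`DataType`, `Predicate`) rather than the real Catalyst and Parquet `FilterApi` classes:

```scala
// Hypothetical stand-ins for Catalyst data types and Parquet FilterPredicate.
sealed trait DataType
case object IntegerType extends DataType
case object DoubleType extends DataType
case object StringType extends DataType

final case class Predicate(column: String, op: String, value: Any)

object FilterSketch {
  // Like makeEq: defined only for types whose push-down is supported.
  // Leaving StringType out has the same effect as commenting its case out.
  val makeEq: PartialFunction[DataType, (String, Any) => Predicate] = {
    case IntegerType => (n, v) => Predicate(n, "eq", v)
    case DoubleType  => (n, v) => Predicate(n, "eq", v)
  }

  // lift turns the partial function into DataType => Option[(String, Any) => Predicate]:
  // unsupported types yield None, so no Parquet filter is produced for them.
  def createFilter(dt: DataType, name: String, value: Any): Option[Predicate] =
    makeEq.lift(dt).map(_(name, value))
}

println(FilterSketch.createFilter(IntegerType, "a", 1))  // Some(Predicate(a,eq,1))
println(FilterSketch.createFilter(StringType, "s", "x")) // None
```

This is why the fix can be a pure deletion: no explicit "unsupported type" branch is needed, because falling outside the partial function's domain already means "don't push down".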
spark git commit: [SPARK-17213][SQL] Disable Parquet filter push-down for string and binary columns due to PARQUET-686
Repository: spark
Updated Branches:
  refs/heads/master c82f16c15 -> ca6391637


[SPARK-17213][SQL] Disable Parquet filter push-down for string and binary columns due to PARQUET-686

This PR targets both master and branch-2.1.

## What changes were proposed in this pull request?

Due to PARQUET-686, Parquet doesn't compare strings correctly while doing filter push-down for string columns. This PR disables filter push-down for both string and binary columns to work around the issue. Binary columns are also affected because some Parquet data models (like Hive) may store string columns as plain Parquet `binary` rather than `binary (UTF8)`.

## How was this patch tested?

New test case added in `ParquetFilterSuite`.

Author: Cheng Lian

Closes #16106 from liancheng/spark-17213-bad-string-ppd.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/ca639163
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/ca639163
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/ca639163

Branch: refs/heads/master
Commit: ca6391637212814b7c0bd14c434a6737da17b258
Parents: c82f16c
Author: Cheng Lian
Authored: Thu Dec 1 22:02:45 2016 -0800
Committer: Reynold Xin
Committed: Thu Dec 1 22:02:45 2016 -0800

--
 .../datasources/parquet/ParquetFilters.scala | 24 ++
 .../parquet/ParquetFilterSuite.scala         | 26 +---
 2 files changed, 47 insertions(+), 3 deletions(-)
--
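For background on why push-down had to be disabled: PARQUET-686 reports that Parquet compared binary values (including UTF-8 string columns) using signed-byte ordering, which disagrees with the unsigned lexicographic order of UTF-8 bytes whenever a byte >= 0x80 is involved, so statistics-based filtering could discard row groups that actually contain matches. The sketch below is only an illustration of that ordering mismatch, not Parquet's actual code:

```scala
import java.nio.charset.StandardCharsets

object ByteOrderDemo {
  // Lexicographic comparison treating bytes as SIGNED -- the problematic
  // ordering described by PARQUET-686.
  def compareSigned(a: Array[Byte], b: Array[Byte]): Int = {
    var i = 0
    while (i < math.min(a.length, b.length)) {
      if (a(i) != b(i)) return java.lang.Byte.compare(a(i), b(i))
      i += 1
    }
    a.length - b.length
  }

  // The same comparison treating bytes as UNSIGNED, which agrees with the
  // lexicographic order of UTF-8 encoded strings.
  def compareUnsigned(a: Array[Byte], b: Array[Byte]): Int = {
    var i = 0
    while (i < math.min(a.length, b.length)) {
      val c = (a(i) & 0xff) - (b(i) & 0xff)
      if (c != 0) return c
      i += 1
    }
    a.length - b.length
  }

  def utf8(s: String): Array[Byte] = s.getBytes(StandardCharsets.UTF_8)
}

// "é" encodes as 0xC3 0xA9; as signed bytes those are negative, so the signed
// order puts "é" before "z" even though "é" sorts after "z" as a string.
println(ByteOrderDemo.compareSigned(ByteOrderDemo.utf8("é"), ByteOrderDemo.utf8("z")) < 0)   // true: signed order is wrong
println(ByteOrderDemo.compareUnsigned(ByteOrderDemo.utf8("é"), ByteOrderDemo.utf8("z")) > 0) // true: unsigned order is right
```

Any non-ASCII string can therefore straddle the wrong side of a min/max statistic under signed ordering, which is why the safe workaround was to stop pushing string and binary predicates down entirely.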