[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/21623


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-29 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r199116993
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -660,6 +688,62 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
       assert(df.where("col > 0").count() === 2)
     }
   }
+
+  test("filter pushdown - StringStartsWith") {
+    withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>
+      // Test canDrop()
--- End diff --

Both methods are executed, but it can't be confirmed which of them actually took effect.
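
One way to make that observable in a test (a sketch, not part of this PR; the helper name is hypothetical) is to wrap the predicate and count which methods Parquet invokes, then assert on the counters:

```scala
import org.apache.parquet.filter2.predicate.{Statistics, UserDefinedPredicate}
import org.apache.parquet.io.api.Binary

// Hypothetical test helper: delegate to the real predicate, but count which
// method Parquet actually calls, so a test can assert on the path taken.
class CountingPredicate(underlying: UserDefinedPredicate[Binary] with Serializable)
  extends UserDefinedPredicate[Binary] with Serializable {

  @volatile var canDropCalls = 0
  @volatile var inverseCanDropCalls = 0
  @volatile var keepCalls = 0

  override def canDrop(stats: Statistics[Binary]): Boolean = {
    canDropCalls += 1
    underlying.canDrop(stats)
  }

  override def inverseCanDrop(stats: Statistics[Binary]): Boolean = {
    inverseCanDropCalls += 1
    underlying.inverseCanDrop(stats)
  }

  override def keep(value: Binary): Boolean = {
    keepCalls += 1
    underlying.keep(value)
  }
}
```

A test could pass a `CountingPredicate` to `FilterApi.userDefined` and then assert, e.g., `canDropCalls > 0`.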


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r199081727
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -660,6 +688,62 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
       assert(df.where("col > 0").count() === 2)
     }
   }
+
+  test("filter pushdown - StringStartsWith") {
+    withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>
+      // Test canDrop()
--- End diff --

to confirm, do they test `canDrop` or `keep`?


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r199080962
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -140,6 +142,32 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
     checkBinaryFilterPredicate(predicate, filterClass, Seq(Row(expected)))(df)
   }
 
+  // This function tests that queries go through the `canDrop` and `inverseCanDrop` paths.
+  private def testStringStartsWith(dataFrame: DataFrame, filter: String): Unit = {
+    withTempPath { dir =>
+      val path = dir.getCanonicalPath
+      dataFrame.write.option("parquet.block.size", 512).parquet(path)
+      Seq(true, false).foreach { pushDown =>
+        withSQLConf(
+          SQLConf.PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED.key -> pushDown.toString) {
+          val accu = new NumRowGroupsAcc
+          sparkContext.register(accu)
+
+          val df = spark.read.parquet(path).filter(filter)
+          df.foreachPartition((it: Iterator[Row]) => it.foreach(v => accu.add(0)))
+          df.collect
--- End diff --

what does this `collect` do? `foreachPartition` is already an action


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r199078944
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -270,6 +273,36 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
       case sources.Not(pred) =>
         createFilter(schema, pred).map(FilterApi.not)
 
+      case sources.StringStartsWith(name, prefix) if pushDownStartWith && canMakeFilterOn(name) =>
+        Option(prefix).map { v =>
+          FilterApi.userDefined(binaryColumn(name),
+            new UserDefinedPredicate[Binary] with Serializable {
+              private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
+              private val size = strToBinary.length
+
+              override def canDrop(statistics: Statistics[Binary]): Boolean = {
+                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
+                val max = statistics.getMax
+                val min = statistics.getMin
+                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
+                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
+              }
+
+              override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = {
+                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
+                val max = statistics.getMax
+                val min = statistics.getMin
+                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) == 0 &&
+                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) == 0
+              }
+
+              override def keep(value: Binary): Boolean = {
+                UTF8String.fromBytes(value.getBytes).startsWith(UTF8String.fromString(v))
--- End diff --

`UTF8String.fromString(v)` -> `UTF8String.fromBytes(strToBinary.getBytes)`?
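
A sketch of that suggestion in context (the `prefixUtf8` field name is illustrative); it converts the prefix once instead of re-encoding `v` on every `keep` call:

```scala
// Convert the prefix to UTF8String once, from the bytes already materialized
// in strToBinary, rather than calling UTF8String.fromString(v) for every row.
private val prefixUtf8 = UTF8String.fromBytes(strToBinary.getBytes)

override def keep(value: Binary): Boolean =
  UTF8String.fromBytes(value.getBytes).startsWith(prefixUtf8)
```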


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r199078056
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,14 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED =
+    buildConf("spark.sql.parquet.filterPushdown.string.startsWith")
+      .doc("If true, enables Parquet filter push-down optimization for string starts with. " +
--- End diff --

`for string startsWith function`


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-29 Thread stanzhai
Github user stanzhai commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r199062132
  
--- Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala ---
@@ -378,6 +378,14 @@ object SQLConf {
     .booleanConf
     .createWithDefault(true)
 
+  val PARQUET_FILTER_PUSHDOWN_STRING_STARTSWITH_ENABLED =
+    buildConf("spark.sql.parquet.filterPushdown.string.startsWith")
--- End diff --

It would be better if we added an `.enabled` suffix.


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-28 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r199043411
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -660,6 +661,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
       assert(df.where("col > 0").count() === 2)
     }
   }
+
+  test("filter pushdown - StringStartsWith") {
+    withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>
--- End diff --

Added `testStringStartsWith` to verify that queries actually go through `canDrop` and `inverseCanDrop`.


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-28 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r199043210
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -22,16 +22,23 @@ import java.sql.Date
 import org.apache.parquet.filter2.predicate._
 import org.apache.parquet.filter2.predicate.FilterApi._
 import org.apache.parquet.io.api.Binary
+import org.apache.parquet.schema.PrimitiveComparator
 
 import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLDate
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.sources
 import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
 
 /**
  * Some utility function to convert Spark data source filters to Parquet filters.
  */
-private[parquet] class ParquetFilters(pushDownDate: Boolean) {
+private[parquet] class ParquetFilters() {
+
+  val sqlConf: SQLConf = SQLConf.get
--- End diff --

You are right. I hit a bug here.


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-27 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r198553569
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -22,16 +22,23 @@ import java.sql.Date
 import org.apache.parquet.filter2.predicate._
 import org.apache.parquet.filter2.predicate.FilterApi._
 import org.apache.parquet.io.api.Binary
+import org.apache.parquet.schema.PrimitiveComparator
 
 import org.apache.spark.sql.catalyst.util.DateTimeUtils
 import org.apache.spark.sql.catalyst.util.DateTimeUtils.SQLDate
+import org.apache.spark.sql.internal.SQLConf
 import org.apache.spark.sql.sources
 import org.apache.spark.sql.types._
+import org.apache.spark.unsafe.types.UTF8String
 
 /**
  * Some utility function to convert Spark data source filters to Parquet filters.
 */
-private[parquet] class ParquetFilters(pushDownDate: Boolean) {
+private[parquet] class ParquetFilters() {
+
+  val sqlConf: SQLConf = SQLConf.get
--- End diff --

This should pass in `pushDownDate` and `pushDownStartWith` like the 
previous version did with just the date setting.

The SQLConf is already available in ParquetFileFormat and it *would* be 
better to pass it in. The problem is that this class is instantiated in the 
function (`(file: PartitionedFile) => { ... }`) that gets serialized and sent 
to executors. That means we don't want SQLConf and its references in the 
function's closure. The way we got around this before was to put boolean config 
vals in the closure instead. I think you should go with that approach.

I'm not sure what `SQLConf.get` is for or what a correct use would be. 
@gatorsmile, can you comment on use of `SQLConf.get`?


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-27 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r198551889
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -660,6 +661,56 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
       assert(df.where("col > 0").count() === 2)
     }
   }
+
+  test("filter pushdown - StringStartsWith") {
+    withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>
--- End diff --

I think all of these tests go through the `keep` method instead of `canDrop` and `inverseCanDrop`, and those methods need to be tested too. You can do that by constructing a Parquet file with row groups that have predictable statistics, but that would be difficult. An easier way is to define the predicate class elsewhere and create a unit test for it that passes in different statistics values.
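
A minimal sketch of that unit-test idea (`Stats` is a hypothetical stand-in for Parquet's `Statistics[Binary]`, which is awkward to construct in a plain unit test; the function body mirrors the `canDrop` logic from this PR):

```scala
import org.apache.parquet.io.api.Binary
import org.apache.parquet.schema.PrimitiveComparator

// Hypothetical stand-in for Statistics[Binary], holding just min/max.
case class Stats(min: Binary, max: Binary)

// Mirrors the canDrop logic under review, lifted out so it is unit-testable.
def canDrop(prefix: String, stats: Stats): Boolean = {
  val strToBinary = Binary.fromReusedByteArray(prefix.getBytes)
  val size = strToBinary.length
  val cmp = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
  val (min, max) = (stats.min, stats.max)
  cmp.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
    cmp.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
}

val stats = Stats(Binary.fromString("BBB"), Binary.fromString("DDD"))
assert(canDrop("E", stats))   // prefix sorts after max: no value can match
assert(canDrop("A", stats))   // prefix sorts before min: no value can match
assert(!canDrop("C", stats))  // prefix falls inside [min, max]: must scan
```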


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-26 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r198269733
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -270,6 +277,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
       case sources.Not(pred) =>
         createFilter(schema, pred).map(FilterApi.not)
 
+      case sources.StringStartsWith(name, prefix) if pushDownStartWith && canMakeFilterOn(name) =>
+        Option(prefix).map { v =>
+          FilterApi.userDefined(binaryColumn(name),
+            new UserDefinedPredicate[Binary] with Serializable {
+              private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
+              private val size = strToBinary.length
+
+              override def canDrop(statistics: Statistics[Binary]): Boolean = {
+                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
+                val max = statistics.getMax
+                val min = statistics.getMin
+                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
+                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
+              }
+
+              override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
--- End diff --

@rdblue oh, sorry, I hadn't seen your reply. Yes, in that case we can, and you're right that it's worth doing.


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-26 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r198245076
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -270,6 +277,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
       case sources.Not(pred) =>
         createFilter(schema, pred).map(FilterApi.not)
 
+      case sources.StringStartsWith(name, prefix) if pushDownStartWith && canMakeFilterOn(name) =>
+        Option(prefix).map { v =>
+          FilterApi.userDefined(binaryColumn(name),
+            new UserDefinedPredicate[Binary] with Serializable {
+              private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
+              private val size = strToBinary.length
+
+              override def canDrop(statistics: Statistics[Binary]): Boolean = {
+                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
+                val max = statistics.getMax
+                val min = statistics.getMin
+                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
+                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
+              }
+
+              override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
--- End diff --

There is one rare case where the inverse can drop the row group: when both min and max start with the prefix.


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-26 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r198244664
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -270,6 +277,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
       case sources.Not(pred) =>
         createFilter(schema, pred).map(FilterApi.not)
 
+      case sources.StringStartsWith(name, prefix) if pushDownStartWith && canMakeFilterOn(name) =>
+        Option(prefix).map { v =>
+          FilterApi.userDefined(binaryColumn(name),
+            new UserDefinedPredicate[Binary] with Serializable {
+              private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
+              private val size = strToBinary.length
+
+              override def canDrop(statistics: Statistics[Binary]): Boolean = {
+                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
+                val max = statistics.getMax
+                val min = statistics.getMin
+                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
+                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
+              }
+
+              override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
--- End diff --

Sorry, I meant if the min and max both *include* the prefix, then we should 
be able to drop the range. The situation is where both min and max match, so 
all values must also match the filter. If we are looking for values that do not 
match the filter, then we can eliminate the row group.

The example is prefix=CCC and values between min=CCCA and max=CCCz: all values start with CCC, so the entire row group can be skipped.
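
A quick check of that example (a sketch; it reuses the `inverseCanDrop` logic from the diff under discussion):

```scala
import org.apache.parquet.io.api.Binary
import org.apache.parquet.schema.PrimitiveComparator

val prefix = Binary.fromString("CCC")
val size = prefix.length
val cmp = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR

// True when both min and max start with the prefix, i.e. every value matches
// startsWith, so the negated filter can drop the whole row group.
def inverseCanDrop(min: Binary, max: Binary): Boolean =
  cmp.compare(max.slice(0, math.min(size, max.length)), prefix) == 0 &&
    cmp.compare(min.slice(0, math.min(size, min.length)), prefix) == 0

assert(inverseCanDrop(Binary.fromString("CCCA"), Binary.fromString("CCCz")))  // all start with CCC: drop
assert(!inverseCanDrop(Binary.fromString("BBB"), Binary.fromString("CCCz")))  // min does not: keep
```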


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-26 Thread attilapiros
Github user attilapiros commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r198241792
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -270,6 +277,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
       case sources.Not(pred) =>
         createFilter(schema, pred).map(FilterApi.not)
 
+      case sources.StringStartsWith(name, prefix) if pushDownStartWith && canMakeFilterOn(name) =>
+        Option(prefix).map { v =>
+          FilterApi.userDefined(binaryColumn(name),
+            new UserDefinedPredicate[Binary] with Serializable {
+              private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
+              private val size = strToBinary.length
+
+              override def canDrop(statistics: Statistics[Binary]): Boolean = {
+                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
+                val max = statistics.getMax
+                val min = statistics.getMin
+                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
+                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
+              }
+
+              override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
--- End diff --

No. 

Let me illustrate this with an example: assume min="BBB" and max="DDD". `canDrop()` means that if your prefix sorts before "BBB" (like "A"), we can stop, as there is no reason to search within this range. The same is true for prefixes after "DDD" (like "E").

Now suppose the operator is negated. What can you say when your prefix is "C" and the range is ["BBB", "DDD"]? Can you drop it? No. And if the prefix is "A" or "E"? Still no. Either way, you have to check the values in the range.


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-26 Thread rdblue
Github user rdblue commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r198230713
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -270,6 +277,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
       case sources.Not(pred) =>
         createFilter(schema, pred).map(FilterApi.not)
 
+      case sources.StringStartsWith(name, prefix) if pushDownStartWith && canMakeFilterOn(name) =>
+        Option(prefix).map { v =>
+          FilterApi.userDefined(binaryColumn(name),
+            new UserDefinedPredicate[Binary] with Serializable {
+              private val strToBinary = Binary.fromReusedByteArray(v.getBytes)
+              private val size = strToBinary.length
+
+              override def canDrop(statistics: Statistics[Binary]): Boolean = {
+                val comparator = PrimitiveComparator.UNSIGNED_LEXICOGRAPHICAL_BINARY_COMPARATOR
+                val max = statistics.getMax
+                val min = statistics.getMin
+                comparator.compare(max.slice(0, math.min(size, max.length)), strToBinary) < 0 ||
+                  comparator.compare(min.slice(0, math.min(size, min.length)), strToBinary) > 0
+              }
+
+              override def inverseCanDrop(statistics: Statistics[Binary]): Boolean = false
--- End diff --

Why can't this evaluate the inverse of `StartsWith`? If the min and max 
values exclude the prefix, then this should be able to filter.


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-25 Thread wangyum
Github user wangyum commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r197992151
  
--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala ---
@@ -660,6 +660,30 @@ class ParquetFilterSuite extends QueryTest with ParquetTest with SharedSQLContex
       assert(df.where("col > 0").count() === 2)
     }
   }
+
+  test("filter pushdown - StringStartsWith") {
+    withParquetDataFrame((1 to 4).map(i => Tuple1(i + "str" + i))) { implicit df =>
+      Seq("2", "2s", "2st", "2str", "2str2").foreach { prefix =>
+        checkFilterPredicate(
+          '_1.startsWith(prefix).asInstanceOf[Predicate],
+          classOf[UserDefinedByInstance[_, _]],
+          "2str2")
+      }
+
+      Seq("2S", "null", "2str22").foreach { prefix =>
+        checkFilterPredicate(
+          '_1.startsWith(prefix).asInstanceOf[Predicate],
+          classOf[UserDefinedByInstance[_, _]],
+          Seq.empty[Row])
+      }
+
+      assertResult(None) {
+        parquetFilters.createFilter(
+          df.schema,
+          sources.StringStartsWith("_1", null))
--- End diff --

Thanks @attilapiros, `sources.StringStartsWith("_1", null)` will not match them, same as before.


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-25 Thread HyukjinKwon
Github user HyukjinKwon commented on a diff in the pull request:

https://github.com/apache/spark/pull/21623#discussion_r197988392
  
--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala ---
@@ -270,6 +272,29 @@ private[parquet] class ParquetFilters(pushDownDate: Boolean) {
       case sources.Not(pred) =>
         createFilter(schema, pred).map(FilterApi.not)
 
+      case sources.StringStartsWith(name, prefix) if canMakeFilterOn(name) =>
--- End diff --

What do you think about adding a configuration to control this, set to true by default? It basically depends on a user-defined predicate that we wrote manually here.


---




[GitHub] spark pull request #21623: [SPARK-24638][SQL] StringStartsWith support push ...

2018-06-23 Thread wangyum
GitHub user wangyum opened a pull request:

https://github.com/apache/spark/pull/21623

[SPARK-24638][SQL] StringStartsWith support push down

## What changes were proposed in this pull request?

Add push down support for `StringStartsWith`, giving about 50% savings in compute time in the benchmark below.

## How was this patch tested?
Unit tests and manual tests.
Performance test:
```scala
cat <<EOF > SPARK-24638.scala
spark.range(1000).selectExpr("concat(id, 'str', id) as id").coalesce(1).write.option("parquet.block.size", 1048576).parquet("/tmp/spark/parquet/string")
val df = spark.read.parquet("/tmp/spark/parquet/string/")
spark.sql("set spark.sql.parquet.filterPushdown=true")
val pushdownEnableStart = System.currentTimeMillis()
for(i <- 0 until 100) {
  df.where("id like '98%'").count()
}
val pushdownEnable = System.currentTimeMillis() - pushdownEnableStart

spark.sql("set spark.sql.parquet.filterPushdown=false")
val pushdownDisableStart = System.currentTimeMillis()
for(i <- 0 until 100) {
  df.where("id like '98%'").count()
}
val pushdownDisable = System.currentTimeMillis() - pushdownDisableStart

val improvements = pushdownDisable.toDouble - pushdownEnable.toDouble

println(s"improvements: ${improvements}")

EOF

bin/spark-shell -i SPARK-24638.scala
```


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/wangyum/spark SPARK-24638

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/21623.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #21623


commit 5b52ace44c8a41631535c883b7a5c8545959e5e5
Author: Yuming Wang 
Date:   2018-06-23T13:27:30Z

StringStartsWith support push down




---
