[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-08-07 Thread GitBox


dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466838962



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
##
@@ -37,12 +40,45 @@ trait OrcFiltersBase {
   }
 
   /**
-   * Return true if this is a searchable type in ORC.
-   * Both CharType and VarcharType are cleaned at AstBuilder.
+   * This method returns a map which contains ORC field name and data type. Each key
+   * represents a column; `dots` are used as separators for nested columns. If any part
+   * of the names contains `dots`, it is quoted to avoid confusion. See
+   * `org.apache.spark.sql.connector.catalog.quoted` for implementation details.
+   *
+   * BinaryType, UserDefinedType, ArrayType and MapType are ignored.
    */
-  protected[sql] def isSearchableType(dataType: DataType) = dataType match {
-    case BinaryType => false
-    case _: AtomicType => true
-    case _ => false
+  protected[sql] def isSearchableType(
Review comment:
   Oh, in my comment below, I didn't mean that we should reuse the existing name. `isXXX` is conventionally used for Boolean functions. I had something like `getSearchableTypeMap` in mind.
   > To be clearer, we had better have `Searchable` in the function name, like the previous one (`isSearchableType`).
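
   For illustration, a minimal, self-contained sketch of the map-returning shape the reviewer seems to have in mind. The name `getSearchableTypeMap` is only a suggestion from this thread, and the body approximates the `getPrimitiveFields` logic from the diff above rather than the merged code (it matches on public types instead of the sql-internal `AtomicType`, and joins name parts with plain dots instead of `.quoted`):

import org.apache.spark.sql.types._

object SearchableTypeSketch {
  // A function returning a map should not carry an `isXXX` name, since that
  // conventionally denotes a Boolean predicate; hence the proposed rename.
  def getSearchableTypeMap(schema: StructType): Map[String, DataType] = {
    def collect(
        fields: Seq[StructField],
        prefix: Seq[String]): Seq[(String, DataType)] = {
      fields.flatMap { f =>
        f.dataType match {
          case st: StructType =>
            collect(st.fields, prefix :+ f.name)   // recurse into nested structs
          case BinaryType | _: ArrayType | _: MapType =>
            None                                   // not searchable in ORC
          case other =>
            // NOTE: unlike the PR, this branch also admits UDTs; the real code
            // matches on `_: AtomicType`, which is package-private to sql.
            Some(((prefix :+ f.name).mkString("."), other))
        }
      }
    }
    collect(schema.fields, Seq.empty).toMap
  }
}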





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-08-06 Thread GitBox


dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466831167



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
##
@@ -37,12 +40,44 @@ trait OrcFiltersBase {
   }
 
   /**
-   * Return true if this is a searchable type in ORC.
-   * Both CharType and VarcharType are cleaned at AstBuilder.
+   * This method returns a map which contains ORC field name and data type. Each key
+   * represents a column; `dots` are used as separators for nested columns. If any part
+   * of the names contains `dots`, it is quoted to avoid confusion. See
+   * `org.apache.spark.sql.connector.catalog.quote` for implementation details.
    */
-  protected[sql] def isSearchableType(dataType: DataType) = dataType match {
-    case BinaryType => false
-    case _: AtomicType => true
-    case _ => false
+  protected[sql] def getNameToOrcFieldMap(
+      schema: StructType,
+      caseSensitive: Boolean): Map[String, DataType] = {
+    import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper
+
+    def getPrimitiveFields(
+        fields: Seq[StructField],
+        parentFieldNames: Seq[String] = Seq.empty): Seq[(String, DataType)] = {
+      fields.flatMap { f =>
+        f.dataType match {
+          case st: StructType =>
+            getPrimitiveFields(st.fields, parentFieldNames :+ f.name)
+          case BinaryType => None
+          case _: AtomicType =>
+            Some(((parentFieldNames :+ f.name).quoted, f.dataType))
+          case _ => None
+        }
+      }
+    }
+
+    val primitiveFields = getPrimitiveFields(schema.fields)
+    if (caseSensitive) {
+      primitiveFields.toMap

Review comment:
   Thanks!








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-08-06 Thread GitBox


dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466828786



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
##
@@ -37,12 +40,44 @@ trait OrcFiltersBase {
   }
 
   /**
-   * Return true if this is a searchable type in ORC.
-   * Both CharType and VarcharType are cleaned at AstBuilder.
+   * This method returns a map which contains ORC field name and data type. Each key
+   * represents a column; `dots` are used as separators for nested columns. If any part
+   * of the names contains `dots`, it is quoted to avoid confusion. See
+   * `org.apache.spark.sql.connector.catalog.quote` for implementation details.
    */
-  protected[sql] def isSearchableType(dataType: DataType) = dataType match {
-    case BinaryType => false
-    case _: AtomicType => true
-    case _ => false
+  protected[sql] def getNameToOrcFieldMap(
+      schema: StructType,
+      caseSensitive: Boolean): Map[String, DataType] = {
+    import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper
+
+    def getPrimitiveFields(
+        fields: Seq[StructField],
+        parentFieldNames: Seq[String] = Seq.empty): Seq[(String, DataType)] = {
+      fields.flatMap { f =>
+        f.dataType match {
+          case st: StructType =>
+            getPrimitiveFields(st.fields, parentFieldNames :+ f.name)
+          case BinaryType => None
+          case _: AtomicType =>
+            Some(((parentFieldNames :+ f.name).quoted, f.dataType))
+          case _ => None
+        }
+      }
+    }
+
+    val primitiveFields = getPrimitiveFields(schema.fields)
+    if (caseSensitive) {
+      primitiveFields.toMap

Review comment:
   For this one, let's just leave it as it is in this PR.








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-08-06 Thread GitBox


dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466828593



##
File path: sql/core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
##
@@ -93,155 +92,200 @@ class OrcFilterSuite extends OrcTest with SharedSparkSession {
   }
 
   test("filter pushdown - integer") {
-    withOrcDataFrame((1 to 4).map(i => Tuple1(Option(i)))) { implicit df =>
-      checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL)
-
-      checkFilterPredicate($"_1" === 1, PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate($"_1" <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-
-      checkFilterPredicate($"_1" < 2, PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate($"_1" > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" >= 4, PredicateLeaf.Operator.LESS_THAN)
-
-      checkFilterPredicate(Literal(1) === $"_1", PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate(Literal(1) <=> $"_1", PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-      checkFilterPredicate(Literal(2) > $"_1", PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate(Literal(3) < $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(1) >= $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(4) <= $"_1", PredicateLeaf.Operator.LESS_THAN)
+    withNestedOrcDataFrame((1 to 4).map(i => Tuple1(Option(i)))) { case (inputDF, colName, _) =>
+      implicit val df: DataFrame = inputDF
+
+      val intAttr = df(colName).expr
+      assert(df(colName).expr.dataType === IntegerType)
+
+      checkFilterPredicate(intAttr.isNull, PredicateLeaf.Operator.IS_NULL)
+
+      checkFilterPredicate(intAttr === 1, PredicateLeaf.Operator.EQUALS)
+      checkFilterPredicate(intAttr <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
+
+      checkFilterPredicate(intAttr < 2, PredicateLeaf.Operator.LESS_THAN)
+      checkFilterPredicate(intAttr > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(intAttr <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(intAttr >= 4, PredicateLeaf.Operator.LESS_THAN)
+
+      checkFilterPredicate(Literal(1) === intAttr, PredicateLeaf.Operator.EQUALS)
+      checkFilterPredicate(Literal(1) <=> intAttr, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
+      checkFilterPredicate(Literal(2) > intAttr, PredicateLeaf.Operator.LESS_THAN)
+      checkFilterPredicate(Literal(3) < intAttr, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(Literal(1) >= intAttr, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(Literal(4) <= intAttr, PredicateLeaf.Operator.LESS_THAN)
     }
   }
 
   test("filter pushdown - long") {
-    withOrcDataFrame((1 to 4).map(i => Tuple1(Option(i.toLong)))) { implicit df =>
-      checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL)
-
-      checkFilterPredicate($"_1" === 1, PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate($"_1" <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-
-      checkFilterPredicate($"_1" < 2, PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate($"_1" > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" >= 4, PredicateLeaf.Operator.LESS_THAN)
-
-      checkFilterPredicate(Literal(1) === $"_1", PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate(Literal(1) <=> $"_1", PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-      checkFilterPredicate(Literal(2) > $"_1", PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate(Literal(3) < $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(1) >= $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(4) <= $"_1", PredicateLeaf.Operator.LESS_THAN)
+    withNestedOrcDataFrame(
+      (1 to 4).map(i => Tuple1(Option(i.toLong)))) { case (inputDF, colName, _) =>
+      implicit val df: DataFrame = inputDF
+
+      val longAttr = df(colName).expr
+      assert(df(colName).expr.dataType === LongType)
+
+      checkFilterPredicate(longAttr.isNull, PredicateLeaf.Operator.IS_NULL)
+
+      checkFilterPredicate(longAttr === 1, PredicateLeaf.Operator.EQUALS)
+      checkFilterPredicate(longAttr <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
+
+      checkFilterPredicate(longAttr < 2, PredicateLeaf.Operator.LESS_THAN)
+      checkFilterPredicate(longAttr > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(longAttr <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(longAttr >= 4, PredicateLeaf.Operator.LESS_THAN)
+
+      checkFilterPredicate(Literal(1) === longAttr, PredicateLeaf.Operator.EQUALS)
+
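
For context on `withNestedOrcDataFrame`, which replaces `withOrcDataFrame` above: a minimal sketch of the idea, assuming the helper re-runs one test body against a flat column and struct-nested variants of the same data (the variant names and the tuple the real helper passes are assumptions; the actual implementation in the PR may differ):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.struct

// Hypothetical helper: run `check(df, colName)` for a flat column and for the
// same value nested one and two struct levels deep, so predicate pushdown is
// exercised on nested column names such as "a._1" and "a.b._1".
def withNestedVariants(df: DataFrame)(check: (DataFrame, String) => Unit): Unit = {
  check(df, "_1")
  check(df.select(struct("_1").as("a")), "a._1")
  check(df.select(struct(struct("_1").as("b")).as("a")), "a.b._1")
}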

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-08-06 Thread GitBox


dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466710704



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
##
@@ -37,12 +40,44 @@ trait OrcFiltersBase {
   }
 
   /**
-   * Return true if this is a searchable type in ORC.
-   * Both CharType and VarcharType are cleaned at AstBuilder.
+   * This method returns a map which contains ORC field name and data type. Each key
+   * represents a column; `dots` are used as separators for nested columns. If any part
+   * of the names contains `dots`, it is quoted to avoid confusion. See
+   * `org.apache.spark.sql.connector.catalog.quote` for implementation details.
    */
-  protected[sql] def isSearchableType(dataType: DataType) = dataType match {
-    case BinaryType => false
-    case _: AtomicType => true
-    case _ => false
+  protected[sql] def getNameToOrcFieldMap(
+      schema: StructType,
+      caseSensitive: Boolean): Map[String, DataType] = {
+    import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper
+
+    def getPrimitiveFields(
+        fields: Seq[StructField],
+        parentFieldNames: Seq[String] = Seq.empty): Seq[(String, DataType)] = {
+      fields.flatMap { f =>
+        f.dataType match {
+          case st: StructType =>
+            getPrimitiveFields(st.fields, parentFieldNames :+ f.name)
+          case BinaryType => None
+          case _: AtomicType =>
+            Some(((parentFieldNames :+ f.name).quoted, f.dataType))
+          case _ => None
+        }
+      }
+    }
+
+    val primitiveFields = getPrimitiveFields(schema.fields)
+    if (caseSensitive) {
+      primitiveFields.toMap

Review comment:
   Just a question: do we have test coverage for this code path, @viirya?
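
   For reference, the case-insensitive branch (cut off in the diff above) presumably wraps the result so lookups ignore case; `CaseInsensitiveMap` is the standard Catalyst utility for that. A sketch of the semantics such a test would need to cover, not the PR's actual test:

import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap

// Lookups succeed regardless of the case used in the predicate's column name.
val fields = CaseInsensitiveMap(Map("a.B" -> "int", "c" -> "long"))
assert(fields("A.b") == "int")
assert(fields("C") == "long")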








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-08-06 Thread GitBox


dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466708125



##
File path: sql/core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
##
@@ -93,155 +92,200 @@ class OrcFilterSuite extends OrcTest with SharedSparkSession {
   }
 
   test("filter pushdown - integer") {
-    withOrcDataFrame((1 to 4).map(i => Tuple1(Option(i)))) { implicit df =>
-      checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL)
-
-      checkFilterPredicate($"_1" === 1, PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate($"_1" <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-
-      checkFilterPredicate($"_1" < 2, PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate($"_1" > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" >= 4, PredicateLeaf.Operator.LESS_THAN)
-
-      checkFilterPredicate(Literal(1) === $"_1", PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate(Literal(1) <=> $"_1", PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-      checkFilterPredicate(Literal(2) > $"_1", PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate(Literal(3) < $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(1) >= $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(4) <= $"_1", PredicateLeaf.Operator.LESS_THAN)
+    withNestedOrcDataFrame((1 to 4).map(i => Tuple1(Option(i)))) { case (inputDF, colName, _) =>
+      implicit val df: DataFrame = inputDF
+
+      val intAttr = df(colName).expr
+      assert(df(colName).expr.dataType === IntegerType)
+
+      checkFilterPredicate(intAttr.isNull, PredicateLeaf.Operator.IS_NULL)
+
+      checkFilterPredicate(intAttr === 1, PredicateLeaf.Operator.EQUALS)
+      checkFilterPredicate(intAttr <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
+
+      checkFilterPredicate(intAttr < 2, PredicateLeaf.Operator.LESS_THAN)
+      checkFilterPredicate(intAttr > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(intAttr <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(intAttr >= 4, PredicateLeaf.Operator.LESS_THAN)
+
+      checkFilterPredicate(Literal(1) === intAttr, PredicateLeaf.Operator.EQUALS)
+      checkFilterPredicate(Literal(1) <=> intAttr, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
+      checkFilterPredicate(Literal(2) > intAttr, PredicateLeaf.Operator.LESS_THAN)
+      checkFilterPredicate(Literal(3) < intAttr, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(Literal(1) >= intAttr, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(Literal(4) <= intAttr, PredicateLeaf.Operator.LESS_THAN)
     }
   }
 
   test("filter pushdown - long") {
-    withOrcDataFrame((1 to 4).map(i => Tuple1(Option(i.toLong)))) { implicit df =>
-      checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL)
-
-      checkFilterPredicate($"_1" === 1, PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate($"_1" <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-
-      checkFilterPredicate($"_1" < 2, PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate($"_1" > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" >= 4, PredicateLeaf.Operator.LESS_THAN)
-
-      checkFilterPredicate(Literal(1) === $"_1", PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate(Literal(1) <=> $"_1", PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-      checkFilterPredicate(Literal(2) > $"_1", PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate(Literal(3) < $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(1) >= $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(4) <= $"_1", PredicateLeaf.Operator.LESS_THAN)
+    withNestedOrcDataFrame(
+      (1 to 4).map(i => Tuple1(Option(i.toLong)))) { case (inputDF, colName, _) =>
+      implicit val df: DataFrame = inputDF
+
+      val longAttr = df(colName).expr
+      assert(df(colName).expr.dataType === LongType)
+
+      checkFilterPredicate(longAttr.isNull, PredicateLeaf.Operator.IS_NULL)
+
+      checkFilterPredicate(longAttr === 1, PredicateLeaf.Operator.EQUALS)
+      checkFilterPredicate(longAttr <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
+
+      checkFilterPredicate(longAttr < 2, PredicateLeaf.Operator.LESS_THAN)
+      checkFilterPredicate(longAttr > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(longAttr <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(longAttr >= 4, PredicateLeaf.Operator.LESS_THAN)
+
+      checkFilterPredicate(Literal(1) === longAttr, PredicateLeaf.Operator.EQUALS)
+

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-08-06 Thread GitBox


dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466707364



##
File path: sql/core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
##
@@ -93,155 +92,200 @@ class OrcFilterSuite extends OrcTest with SharedSparkSession {
   }
 
   test("filter pushdown - integer") {
-    withOrcDataFrame((1 to 4).map(i => Tuple1(Option(i)))) { implicit df =>
-      checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL)
-
-      checkFilterPredicate($"_1" === 1, PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate($"_1" <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-
-      checkFilterPredicate($"_1" < 2, PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate($"_1" > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" >= 4, PredicateLeaf.Operator.LESS_THAN)
-
-      checkFilterPredicate(Literal(1) === $"_1", PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate(Literal(1) <=> $"_1", PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-      checkFilterPredicate(Literal(2) > $"_1", PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate(Literal(3) < $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(1) >= $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(4) <= $"_1", PredicateLeaf.Operator.LESS_THAN)
+    withNestedOrcDataFrame((1 to 4).map(i => Tuple1(Option(i)))) { case (inputDF, colName, _) =>
+      implicit val df: DataFrame = inputDF
+
+      val intAttr = df(colName).expr
+      assert(df(colName).expr.dataType === IntegerType)
+
+      checkFilterPredicate(intAttr.isNull, PredicateLeaf.Operator.IS_NULL)
+
+      checkFilterPredicate(intAttr === 1, PredicateLeaf.Operator.EQUALS)
+      checkFilterPredicate(intAttr <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
+
+      checkFilterPredicate(intAttr < 2, PredicateLeaf.Operator.LESS_THAN)
+      checkFilterPredicate(intAttr > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(intAttr <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(intAttr >= 4, PredicateLeaf.Operator.LESS_THAN)
+
+      checkFilterPredicate(Literal(1) === intAttr, PredicateLeaf.Operator.EQUALS)
+      checkFilterPredicate(Literal(1) <=> intAttr, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
+      checkFilterPredicate(Literal(2) > intAttr, PredicateLeaf.Operator.LESS_THAN)
+      checkFilterPredicate(Literal(3) < intAttr, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(Literal(1) >= intAttr, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(Literal(4) <= intAttr, PredicateLeaf.Operator.LESS_THAN)
     }
   }
 
   test("filter pushdown - long") {
-    withOrcDataFrame((1 to 4).map(i => Tuple1(Option(i.toLong)))) { implicit df =>
-      checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL)
-
-      checkFilterPredicate($"_1" === 1, PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate($"_1" <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-
-      checkFilterPredicate($"_1" < 2, PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate($"_1" > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" >= 4, PredicateLeaf.Operator.LESS_THAN)
-
-      checkFilterPredicate(Literal(1) === $"_1", PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate(Literal(1) <=> $"_1", PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-      checkFilterPredicate(Literal(2) > $"_1", PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate(Literal(3) < $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(1) >= $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(4) <= $"_1", PredicateLeaf.Operator.LESS_THAN)
+    withNestedOrcDataFrame(
+      (1 to 4).map(i => Tuple1(Option(i.toLong)))) { case (inputDF, colName, _) =>
+      implicit val df: DataFrame = inputDF
+
+      val longAttr = df(colName).expr
+      assert(df(colName).expr.dataType === LongType)
+
+      checkFilterPredicate(longAttr.isNull, PredicateLeaf.Operator.IS_NULL)
+
+      checkFilterPredicate(longAttr === 1, PredicateLeaf.Operator.EQUALS)
+      checkFilterPredicate(longAttr <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
+
+      checkFilterPredicate(longAttr < 2, PredicateLeaf.Operator.LESS_THAN)
+      checkFilterPredicate(longAttr > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(longAttr <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(longAttr >= 4, PredicateLeaf.Operator.LESS_THAN)
+
+      checkFilterPredicate(Literal(1) === longAttr, PredicateLeaf.Operator.EQUALS)
+

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-08-06 Thread GitBox


dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466706196



##
File path: sql/core/v1.2/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala
##
@@ -231,37 +229,37 @@ private[sql] object OrcFilters extends OrcFiltersBase {
     // Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters
     // in order to distinguish predicate pushdown for nested columns.

Review comment:
   Since we removed `quoteIfNeeded` from this file completely, I believe we can remove this old comment (lines 231~232) as well, in both the v1.2 (here) and v2.3 files.
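
   For context, the removed helper did roughly the following (reproduced from memory of the Spark codebase, so the exact pre-PR version in these files may differ):

// Quote a single name part only when needed: a part containing a dot or a
// backtick is wrapped in backticks, with embedded backticks doubled.
def quoteIfNeeded(part: String): String = {
  if (part.contains(".") || part.contains("`")) {
    s"`${part.replace("`", "``")}`"
  } else {
    part
  }
}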








[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-08-06 Thread GitBox


dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466695536



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
##
@@ -37,12 +40,44 @@ trait OrcFiltersBase {
   }
 
   /**
-   * Return true if this is a searchable type in ORC.
-   * Both CharType and VarcharType are cleaned at AstBuilder.
+   * This method returns a map which contains ORC field name and data type. Each key
+   * represents a column; `dots` are used as separators for nested columns. If any part
+   * of the names contains `dots`, it is quoted to avoid confusion. See
+   * `org.apache.spark.sql.connector.catalog.quote` for implementation details.
    */
-  protected[sql] def isSearchableType(dataType: DataType) = dataType match {
-    case BinaryType => false
-    case _: AtomicType => true
-    case _ => false
+  protected[sql] def getNameToOrcFieldMap(

Review comment:
   1. `OrcField` looks a little mismatched, because this function returns a `DataType` instead of a field. As written, the name reads like `ToOrcField`.
   2. Judging by its behavior, this function ignores BinaryType, complex types, and UserDefinedType, yet the function description doesn't mention that limitation at all. To be clearer, we had better have `Searchable` in the function name, like the previous one (`isSearchableType`).
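
   To make point 2 concrete, a small worked example (the schema is illustrative, not from the PR's tests). With the logic in the diff above, only atomic, non-binary leaves survive into the map, recursing through structs:

import org.apache.spark.sql.types._

// Of these six fields, only id, name, and addr.city are searchable;
// raw (binary), tags (array), and props (map) are silently dropped.
val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("raw", BinaryType),
  StructField("tags", ArrayType(StringType)),
  StructField("props", MapType(StringType, StringType)),
  StructField("name", StringType),
  StructField("addr", StructType(Seq(StructField("city", StringType))))
))
// Expected keys: "id" -> IntegerType, "name" -> StringType, "addr.city" -> StringType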











[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC

2020-08-06 Thread GitBox


dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466632459



##
File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
##
@@ -37,12 +40,44 @@ trait OrcFiltersBase {
   }
 
   /**
-   * Return true if this is a searchable type in ORC.
-   * Both CharType and VarcharType are cleaned at AstBuilder.
+   * This method returns a map which contains ORC field name and data type. Each key
+   * represents a column; `dots` are used as separators for nested columns. If any part
+   * of the names contains `dots`, it is quoted to avoid confusion. See
+   * `org.apache.spark.sql.connector.catalog.quote` for implementation details.

Review comment:
   `quote` -> `quoted`.
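
   A quick, self-contained illustration of the join rule the corrected reference points at (an approximation of `CatalogV2Implicits.MultipartIdentifierHelper.quoted`, not the implicit itself):

// Join name parts with dots, quoting any part that itself contains a dot
// (or a backtick) so Seq("a", "b.c") stays distinguishable from Seq("a", "b", "c").
def quoted(parts: Seq[String]): String =
  parts.map { p =>
    if (p.contains(".") || p.contains("`")) s"`${p.replace("`", "``")}`" else p
  }.mkString(".")

assert(quoted(Seq("a", "b")) == "a.b")
assert(quoted(Seq("a", "b.c")) == "a.`b.c`")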




