[GitHub] [spark] dongjoon-hyun commented on a change in pull request #28761: [SPARK-25557][SQL] Nested column predicate pushdown for ORC
dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466838962

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala

@@ -37,12 +40,45 @@ trait OrcFiltersBase {
   }

   /**
-   * Return true if this is a searchable type in ORC.
-   * Both CharType and VarcharType are cleaned at AstBuilder.
+   * This method returns a map which contains ORC field name and data type. Each key
+   * represents a column; `dots` are used as separators for nested columns. If any part
+   * of the names contains `dots`, it is quoted to avoid confusion. See
+   * `org.apache.spark.sql.connector.catalog.quoted` for implementation details.
+   *
+   * BinaryType, UserDefinedType, ArrayType and MapType are ignored.
    */
-  protected[sql] def isSearchableType(dataType: DataType) = dataType match {
-    case BinaryType => false
-    case _: AtomicType => true
-    case _ => false
+  protected[sql] def isSearchableType(

Review comment:
Oh, in my comment below, I didn't mean to reuse the existing name. `isXXX` is used for Boolean functions. I was thinking of something like `getSearchableTypeMap`.

> In order to be more clear, we had better have `Searchable` in the function name like the previous one (isSearchableType).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
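The function under review flattens a nested schema into a map from dot-separated (and, where needed, backtick-quoted) column names to their data types, skipping non-searchable leaves. As a rough illustration of that shape (a Python sketch, not Spark's actual Scala code; the quoting rule, doubling embedded backticks, is an assumption modeled on Spark's quoting helpers):

```python
def quote_if_needed(part):
    # Quote a single name part with backticks when it contains a dot or a
    # backtick; embedded backticks are escaped by doubling (assumed convention).
    if "." in part or "`" in part:
        return "`" + part.replace("`", "``") + "`"
    return part

def searchable_type_map(schema, parents=()):
    # Flatten a toy schema (dicts model structs, strings model leaf types)
    # into {dotted-name: type}; binary/array/map leaves are not searchable.
    out = {}
    for name, dtype in schema.items():
        path = parents + (name,)
        if isinstance(dtype, dict):                # nested struct: recurse
            out.update(searchable_type_map(dtype, path))
        elif dtype in ("binary", "array", "map"):  # skipped, like in the PR
            continue
        else:
            out[".".join(quote_if_needed(p) for p in path)] = dtype
    return out

schema = {"a": "int", "b": {"c": "long", "d.e": "string"}, "f": "binary"}
print(searchable_type_map(schema))
# -> {'a': 'int', 'b.c': 'long', 'b.`d.e`': 'string'}
```

Note how the leaf literally named `d.e` becomes ``b.`d.e` ``, so it cannot be confused with a struct `d` containing a field `e`.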
dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466831167

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala

@@ -37,12 +40,44 @@ trait OrcFiltersBase {
   }

   /**
-   * Return true if this is a searchable type in ORC.
-   * Both CharType and VarcharType are cleaned at AstBuilder.
+   * This method returns a map which contains ORC field name and data type. Each key
+   * represents a column; `dots` are used as separators for nested columns. If any part
+   * of the names contains `dots`, it is quoted to avoid confusion. See
+   * `org.apache.spark.sql.connector.catalog.quote` for implementation details.
    */
-  protected[sql] def isSearchableType(dataType: DataType) = dataType match {
-    case BinaryType => false
-    case _: AtomicType => true
-    case _ => false
+  protected[sql] def getNameToOrcFieldMap(
+      schema: StructType,
+      caseSensitive: Boolean): Map[String, DataType] = {
+    import org.apache.spark.sql.connector.catalog.CatalogV2Implicits.MultipartIdentifierHelper
+
+    def getPrimitiveFields(
+        fields: Seq[StructField],
+        parentFieldNames: Seq[String] = Seq.empty): Seq[(String, DataType)] = {
+      fields.flatMap { f =>
+        f.dataType match {
+          case st: StructType =>
+            getPrimitiveFields(st.fields, parentFieldNames :+ f.name)
+          case BinaryType => None
+          case _: AtomicType =>
+            Some(((parentFieldNames :+ f.name).quoted, f.dataType))
+          case _ => None
+        }
+      }
+    }
+
+    val primitiveFields = getPrimitiveFields(schema.fields)
+    if (caseSensitive) {
+      primitiveFields.toMap

Review comment (on `primitiveFields.toMap`):
Thanks!
dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466828786

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
(same `getNameToOrcFieldMap` hunk as above, comment anchored at `primitiveFields.toMap`)

Review comment:
For this one, just let it be in this PR.
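The hunk above ends at `primitiveFields.toMap`, the case-sensitive branch; the case-insensitive else-branch is cut off in this excerpt. One plausible strategy for that branch (an assumption for illustration, not necessarily the exact code the PR uses) is to group the flattened names by their lowercased form and drop any name that becomes ambiguous, sketched here in Python:

```python
from collections import defaultdict

def dedup_case_insensitive(pairs):
    # Build a case-insensitive lookup from (name, dtype) pairs, dropping
    # names that collide once lowercased: pushing a filter down for an
    # ambiguous column could match the wrong field.
    groups = defaultdict(list)
    for name, dtype in pairs:
        groups[name.lower()].append(dtype)
    return {k: v[0] for k, v in groups.items() if len(v) == 1}

pairs = [("Col", "int"), ("col", "long"), ("other", "string")]
print(dedup_case_insensitive(pairs))  # {'other': 'string'}
```

Dropping ambiguous fields is conservative: it only disables pushdown for them, which is always safe, since the filter is still evaluated by Spark after the scan.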
dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466828593

File path: sql/core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala

@@ -93,155 +92,200 @@ class OrcFilterSuite extends OrcTest with SharedSparkSession {
   }

   test("filter pushdown - integer") {
-    withOrcDataFrame((1 to 4).map(i => Tuple1(Option(i)))) { implicit df =>
-      checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL)
-
-      checkFilterPredicate($"_1" === 1, PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate($"_1" <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-
-      checkFilterPredicate($"_1" < 2, PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate($"_1" > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" >= 4, PredicateLeaf.Operator.LESS_THAN)
-
-      checkFilterPredicate(Literal(1) === $"_1", PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate(Literal(1) <=> $"_1", PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-      checkFilterPredicate(Literal(2) > $"_1", PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate(Literal(3) < $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(1) >= $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(4) <= $"_1", PredicateLeaf.Operator.LESS_THAN)
+    withNestedOrcDataFrame((1 to 4).map(i => Tuple1(Option(i)))) { case (inputDF, colName, _) =>
+      implicit val df: DataFrame = inputDF
+
+      val intAttr = df(colName).expr
+      assert(df(colName).expr.dataType === IntegerType)
+
+      checkFilterPredicate(intAttr.isNull, PredicateLeaf.Operator.IS_NULL)
+
+      checkFilterPredicate(intAttr === 1, PredicateLeaf.Operator.EQUALS)
+      checkFilterPredicate(intAttr <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
+
+      checkFilterPredicate(intAttr < 2, PredicateLeaf.Operator.LESS_THAN)
+      checkFilterPredicate(intAttr > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(intAttr <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(intAttr >= 4, PredicateLeaf.Operator.LESS_THAN)
+
+      checkFilterPredicate(Literal(1) === intAttr, PredicateLeaf.Operator.EQUALS)
+      checkFilterPredicate(Literal(1) <=> intAttr, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
+      checkFilterPredicate(Literal(2) > intAttr, PredicateLeaf.Operator.LESS_THAN)
+      checkFilterPredicate(Literal(3) < intAttr, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(Literal(1) >= intAttr, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(Literal(4) <= intAttr, PredicateLeaf.Operator.LESS_THAN)
     }
   }

   test("filter pushdown - long") {
-    withOrcDataFrame((1 to 4).map(i => Tuple1(Option(i.toLong)))) { implicit df =>
-      checkFilterPredicate($"_1".isNull, PredicateLeaf.Operator.IS_NULL)
-
-      checkFilterPredicate($"_1" === 1, PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate($"_1" <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-
-      checkFilterPredicate($"_1" < 2, PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate($"_1" > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate($"_1" >= 4, PredicateLeaf.Operator.LESS_THAN)
-
-      checkFilterPredicate(Literal(1) === $"_1", PredicateLeaf.Operator.EQUALS)
-      checkFilterPredicate(Literal(1) <=> $"_1", PredicateLeaf.Operator.NULL_SAFE_EQUALS)
-      checkFilterPredicate(Literal(2) > $"_1", PredicateLeaf.Operator.LESS_THAN)
-      checkFilterPredicate(Literal(3) < $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(1) >= $"_1", PredicateLeaf.Operator.LESS_THAN_EQUALS)
-      checkFilterPredicate(Literal(4) <= $"_1", PredicateLeaf.Operator.LESS_THAN)
+    withNestedOrcDataFrame(
+        (1 to 4).map(i => Tuple1(Option(i.toLong)))) { case (inputDF, colName, _) =>
+      implicit val df: DataFrame = inputDF
+
+      val longAttr = df(colName).expr
+      assert(df(colName).expr.dataType === LongType)
+
+      checkFilterPredicate(longAttr.isNull, PredicateLeaf.Operator.IS_NULL)
+
+      checkFilterPredicate(longAttr === 1, PredicateLeaf.Operator.EQUALS)
+      checkFilterPredicate(longAttr <=> 1, PredicateLeaf.Operator.NULL_SAFE_EQUALS)
+
+      checkFilterPredicate(longAttr < 2, PredicateLeaf.Operator.LESS_THAN)
+      checkFilterPredicate(longAttr > 3, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(longAttr <= 1, PredicateLeaf.Operator.LESS_THAN_EQUALS)
+      checkFilterPredicate(longAttr >= 4, PredicateLeaf.Operator.LESS_THAN)
+
+      checkFilterPredicate(Literal(1) === longAttr, PredicateLeaf.Operator.EQUALS)
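The expected operators in the tests above look flipped at first glance: `attr > 3` is checked against `LESS_THAN_EQUALS` and `attr >= 4` against `LESS_THAN`. That is because ORC's `PredicateLeaf.Operator` has no greater-than operators, so a greater-than predicate is pushed down as the negation of the flipped leaf. A small table capturing the mapping implied by these test expectations (operator names from ORC; the tuple layout is my own notation):

```python
# Mapping from a Spark comparison on an attribute to the ORC leaf the
# tests assert on: (PredicateLeaf.Operator name, whether the leaf is
# wrapped in a NOT when the SearchArgument is built).
LEAF_FOR_PREDICATE = {
    "=":   ("EQUALS", False),
    "<=>": ("NULL_SAFE_EQUALS", False),
    "<":   ("LESS_THAN", False),
    "<=":  ("LESS_THAN_EQUALS", False),
    ">":   ("LESS_THAN_EQUALS", True),   # attr > v  ==  NOT(attr <= v)
    ">=":  ("LESS_THAN", True),          # attr >= v ==  NOT(attr < v)
}

print(LEAF_FOR_PREDICATE[">"])   # ('LESS_THAN_EQUALS', True)
```

The same flipping explains the `Literal(2) > attr` cases: the literal is moved to the right-hand side first, so it becomes `attr < 2` and maps to a plain `LESS_THAN` leaf.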
dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466710704

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
(same `getNameToOrcFieldMap` hunk as above, comment anchored at `primitiveFields.toMap`)

Review comment:
Just a question. Do we have test coverage for this code path, @viirya?
dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466708125

File path: sql/core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
(same "filter pushdown - integer"/"filter pushdown - long" hunk as in r466828593 above)
dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466707364

File path: sql/core/v2.3/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala
(same "filter pushdown - integer"/"filter pushdown - long" hunk as in r466828593 above)
dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466706196

File path: sql/core/v1.2/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala

@@ -231,37 +229,37 @@ private[sql] object OrcFilters extends OrcFiltersBase {
     // Since ORC 1.5.0 (ORC-323), we need to quote for column names with `.` characters
     // in order to distinguish predicate pushdown for nested columns.

Review comment:
Since we removed `quoteIfNeeded` from this file completely, I believe we can remove this old comment (lines 231-232) as well, in both files: v1.2 (here) and v2.3.
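The comment being removed refers to the ambiguity that quoting resolves: without it, a top-level column literally named `a.b` and a nested field `b` inside a struct `a` flatten to the same key. A tiny illustration (plain Python string handling, just to show the collision the ORC-323 quoting rule avoids):

```python
# A top-level column whose name literally contains a dot...
top_level = "a.b"
# ...collides with the flattened name of struct a's nested field b.
nested = ".".join(["a", "b"])
print(top_level == nested)  # True: ambiguous without quoting

# Backtick-quoting the literal name keeps the two keys distinct.
quoted = "`" + top_level + "`"
print(quoted == nested)     # False: quoting disambiguates
```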
dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466695536

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala
(same `getNameToOrcFieldMap` hunk as above, comment anchored at the `getNameToOrcFieldMap(` signature)

Review comment:
1. `OrcField` looks a little mismatched, because this function returns a `DataType` instead of a field. Currently, the name reads like `ToOrcField`.
2. Judging by its behavior, this function ignores BinaryType, complex types, and UserDefinedType, but the function description doesn't mention that limitation at all.

In order to be more clear, we had better have `Searchable` in the function name like the previous one (isSearchableType).
dongjoon-hyun commented on a change in pull request #28761:
URL: https://github.com/apache/spark/pull/28761#discussion_r466632459

File path: sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFiltersBase.scala

+   * of the names contains `dots`, it is quoted to avoid confusion. See
+   * `org.apache.spark.sql.connector.catalog.quote` for implementation details.

Review comment:
`quote` -> `quoted`.