[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate
cloud-fan commented on a change in pull request #34701: URL: https://github.com/apache/spark/pull/34701#discussion_r760866369 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala ## @@ -882,4 +882,32 @@ abstract class SchemaPruningSuite Nil) } } + + test("SPARK-37450: Prunes unnecessary fields from Explode for count aggregation") { +import testImplicits._ + +withTempView("table") { + withTempPath { dir => +val path = dir.getCanonicalPath + +val jsonStr = + """{ +"items": [ + {"itemId": 1, "itemData": "a"}, + {"itemId": 2, "itemData": "b"} +]}""".stripMargin +val df = spark.read.json(Seq(jsonStr).toDS) +makeDataSourceFile(df, new File(path + "/table")) Review comment: can't we use `path` directly? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate
cloud-fan commented on a change in pull request #34701: URL: https://github.com/apache/spark/pull/34701#discussion_r760865681 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala ## @@ -882,4 +882,32 @@ abstract class SchemaPruningSuite Nil) } } + + test("SPARK-37450: Prunes unnecessary fields from Explode for count aggregation") { +import testImplicits._ + +withTempView("table") { + withTempPath { dir => +val path = dir.getCanonicalPath + +val jsonStr = + """{ Review comment: nit: can we use the multi-line format? ``` """ |xxx |""".stripMargin ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate
cloud-fan commented on a change in pull request #34701: URL: https://github.com/apache/spark/pull/34701#discussion_r760865681 ## File path: sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala ## @@ -882,4 +882,32 @@ abstract class SchemaPruningSuite Nil) } } + + test("SPARK-37450: Prunes unnecessary fields from Explode for count aggregation") { +import testImplicits._ + +withTempView("table") { + withTempPath { dir => +val path = dir.getCanonicalPath + +val jsonStr = + """{ Review comment: nit: can we use the multi-line format? ``` """ |xxx |""".stripMargin ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate
cloud-fan commented on a change in pull request #34701: URL: https://github.com/apache/spark/pull/34701#discussion_r760864036 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -2162,6 +2163,42 @@ object RemoveLiteralFromGroupExpressions extends Rule[LogicalPlan] { } } +/** + * Prunes unnecessary fields from a [[Generate]] if it is under a project which does not refer + * any generated attributes, .e.g., count-like aggregation on an exploded array. + */ +object GenerateOptimization extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan.transformDownWithPruning( + _.containsAllPatterns(PROJECT, GENERATE), ruleId) { + case p @ Project(_, g: Generate) if p.references.isEmpty + && g.generator.isInstanceOf[ExplodeBase] => +g.generator.children.head.dataType match { + case ArrayType(StructType(fields), containsNull) => Review comment: ```suggestion case ArrayType(StructType(fields), containsNull) if fields.length > 1 => ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate
cloud-fan commented on a change in pull request #34701: URL: https://github.com/apache/spark/pull/34701#discussion_r760776756 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -2162,6 +2163,48 @@ object RemoveLiteralFromGroupExpressions extends Rule[LogicalPlan] { } } +/** + * Prunes unnecessary fields from a [[Generate]] if it is under a project which does not refer + * any generated attributes, .e.g., count-like aggregation on an exploded array. + */ +object GenerateOptimization extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan.transformDownWithPruning( + _.containsAllPatterns(PROJECT, GENERATE), ruleId) { + case p @ Project(_, g: Generate) if p.references.isEmpty + && g.generator.isInstanceOf[ExplodeBase] => +g.generator.children.head.dataType match { + case ArrayType(StructType(fields), _) => +val atomicFields = fields.collect { + case f: StructField if f.dataType.isInstanceOf[AtomicType] => f +} +val extractor = if (atomicFields.size > 0) { + // Pick an arbitrary atomic field, if any Review comment: shall we pick the smallest one? e.g. prefer int over string -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate
cloud-fan commented on a change in pull request #34701: URL: https://github.com/apache/spark/pull/34701#discussion_r760776756 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -2162,6 +2163,48 @@ object RemoveLiteralFromGroupExpressions extends Rule[LogicalPlan] { } } +/** + * Prunes unnecessary fields from a [[Generate]] if it is under a project which does not refer + * any generated attributes, .e.g., count-like aggregation on an exploded array. + */ +object GenerateOptimization extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan.transformDownWithPruning( + _.containsAllPatterns(PROJECT, GENERATE), ruleId) { + case p @ Project(_, g: Generate) if p.references.isEmpty + && g.generator.isInstanceOf[ExplodeBase] => +g.generator.children.head.dataType match { + case ArrayType(StructType(fields), _) => +val atomicFields = fields.collect { + case f: StructField if f.dataType.isInstanceOf[AtomicType] => f +} +val extractor = if (atomicFields.size > 0) { + // Pick an arbitrary atomic field, if any Review comment: shall we pick the smaller list one? e.g. prefer int over string -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate
cloud-fan commented on a change in pull request #34701: URL: https://github.com/apache/spark/pull/34701#discussion_r760776370 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -2162,6 +2163,48 @@ object RemoveLiteralFromGroupExpressions extends Rule[LogicalPlan] { } } +/** + * Prunes unnecessary fields from a [[Generate]] if it is under a project which does not refer + * any generated attributes, .e.g., count-like aggregation on an exploded array. + */ +object GenerateOptimization extends Rule[LogicalPlan] { + def apply(plan: LogicalPlan): LogicalPlan = plan.transformDownWithPruning( + _.containsAllPatterns(PROJECT, GENERATE), ruleId) { + case p @ Project(_, g: Generate) if p.references.isEmpty + && g.generator.isInstanceOf[ExplodeBase] => +g.generator.children.head.dataType match { + case ArrayType(StructType(fields), _) => +val atomicFields = fields.collect { + case f: StructField if f.dataType.isInstanceOf[AtomicType] => f +} +val extractor = if (atomicFields.size > 0) { + // Pick an arbitrary atomic field, if any + ExtractValue(g.generator.children.head, Review comment: nit: I feel it's safer to create `GetStructField` instead of doing name lookup again. It's possible that some dataframe-generated query plan has name conflicts in the struct, and `GetStructField` allows us to put the ordinal directly to avoid a name lookup. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate
cloud-fan commented on a change in pull request #34701: URL: https://github.com/apache/spark/pull/34701#discussion_r760752530 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -2162,6 +2163,74 @@ object RemoveLiteralFromGroupExpressions extends Rule[LogicalPlan] { } } +/** + * Prunes unnecessary fields from a [[Generate]] if it is under a count aggregation + * query. + */ +object CountAggregateOptimization extends Rule[LogicalPlan] { + private def isLiteralCountAggFun(func: AggregateFunction): Boolean = func match { +case c: Count if c.children.size == 1 && c.children(0).isInstanceOf[Literal] => true +case _ => false + } + + private def isCandidate(agg: Aggregate): Boolean = { +if (agg.aggregateExpressions.size == 1) { + agg.aggregateExpressions(0) match { +case Alias(ae: AggregateExpression, _) if isLiteralCountAggFun(ae.aggregateFunction) => + true +case _ => + false + } +} else { + false +} + } + + private def pruningFields(agg: Aggregate): LogicalPlan = { +agg.transformDownWithPruning(_.containsPattern(GENERATE), ruleId) { + case p @ Project(_, g: Generate) if p.references.intersect(g.outputSet).isEmpty Review comment: is `p.references.isEmpty` clearer here? -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate under count-only Aggregate
cloud-fan commented on a change in pull request #34701: URL: https://github.com/apache/spark/pull/34701#discussion_r759945029 ## File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala ## @@ -2162,6 +2163,74 @@ object RemoveLiteralFromGroupExpressions extends Rule[LogicalPlan] { } } +/** + * Prunes unnecessary fields from a [[Generate]] if it is under a count aggregation + * query. + */ +object CountAggregateOptimization extends Rule[LogicalPlan] { Review comment: > also generally empty project cases. Yea, and not only empty, e.g. `Project(1, child)` can also trigger the optimization, as long as the Project has no references. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org