[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate

2021-12-02 Thread GitBox


cloud-fan commented on a change in pull request #34701:
URL: https://github.com/apache/spark/pull/34701#discussion_r760866369



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##
@@ -882,4 +882,32 @@ abstract class SchemaPruningSuite
   Nil)
 }
   }
+
+  test("SPARK-37450: Prunes unnecessary fields from Explode for count 
aggregation") {
+import testImplicits._
+
+withTempView("table") {
+  withTempPath { dir =>
+val path = dir.getCanonicalPath
+
+val jsonStr =
+  """{
+"items": [
+  {"itemId": 1, "itemData": "a"},
+  {"itemId": 2, "itemData": "b"}
+]}""".stripMargin
+val df = spark.read.json(Seq(jsonStr).toDS)
+makeDataSourceFile(df, new File(path + "/table"))

Review comment:
   can't we use `path` directly?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate

2021-12-02 Thread GitBox


cloud-fan commented on a change in pull request #34701:
URL: https://github.com/apache/spark/pull/34701#discussion_r760865681



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##
@@ -882,4 +882,32 @@ abstract class SchemaPruningSuite
   Nil)
 }
   }
+
+  test("SPARK-37450: Prunes unnecessary fields from Explode for count 
aggregation") {
+import testImplicits._
+
+withTempView("table") {
+  withTempPath { dir =>
+val path = dir.getCanonicalPath
+
+val jsonStr =
+  """{

Review comment:
   nit: can we use the multi-line format?
   ```
   """
   |xxx
   |""".stripMargin
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate

2021-12-02 Thread GitBox


cloud-fan commented on a change in pull request #34701:
URL: https://github.com/apache/spark/pull/34701#discussion_r760865681



##
File path: 
sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/SchemaPruningSuite.scala
##
@@ -882,4 +882,32 @@ abstract class SchemaPruningSuite
   Nil)
 }
   }
+
+  test("SPARK-37450: Prunes unnecessary fields from Explode for count 
aggregation") {
+import testImplicits._
+
+withTempView("table") {
+  withTempPath { dir =>
+val path = dir.getCanonicalPath
+
+val jsonStr =
+  """{

Review comment:
   nit: can we use the multi-line format?
   ```
 """
   |xxx
   |""".stripMargin
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate

2021-12-02 Thread GitBox


cloud-fan commented on a change in pull request #34701:
URL: https://github.com/apache/spark/pull/34701#discussion_r760864036



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##
@@ -2162,6 +2163,42 @@ object RemoveLiteralFromGroupExpressions extends 
Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Prunes unnecessary fields from a [[Generate]] if it is under a project 
which does not refer
+ * any generated attributes, .e.g., count-like aggregation on an exploded 
array.
+ */
+object GenerateOptimization extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan.transformDownWithPruning(
+  _.containsAllPatterns(PROJECT, GENERATE), ruleId) {
+  case p @ Project(_, g: Generate) if p.references.isEmpty
+  && g.generator.isInstanceOf[ExplodeBase] =>
+g.generator.children.head.dataType match {
+  case ArrayType(StructType(fields), containsNull) =>

Review comment:
   ```suggestion
 case ArrayType(StructType(fields), containsNull) if fields.length 
> 1 =>
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate

2021-12-01 Thread GitBox


cloud-fan commented on a change in pull request #34701:
URL: https://github.com/apache/spark/pull/34701#discussion_r760776756



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##
@@ -2162,6 +2163,48 @@ object RemoveLiteralFromGroupExpressions extends 
Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Prunes unnecessary fields from a [[Generate]] if it is under a project 
which does not refer
+ * any generated attributes, .e.g., count-like aggregation on an exploded 
array.
+ */
+object GenerateOptimization extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan.transformDownWithPruning(
+  _.containsAllPatterns(PROJECT, GENERATE), ruleId) {
+  case p @ Project(_, g: Generate) if p.references.isEmpty
+  && g.generator.isInstanceOf[ExplodeBase] =>
+g.generator.children.head.dataType match {
+  case ArrayType(StructType(fields), _) =>
+val atomicFields = fields.collect {
+  case f: StructField if f.dataType.isInstanceOf[AtomicType] => f
+}
+val extractor = if (atomicFields.size > 0) {
+  // Pick an arbitrary atomic field, if any

Review comment:
   shall we pick the smallest one? e.g. prefer int over string




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate

2021-12-01 Thread GitBox


cloud-fan commented on a change in pull request #34701:
URL: https://github.com/apache/spark/pull/34701#discussion_r760776756



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##
@@ -2162,6 +2163,48 @@ object RemoveLiteralFromGroupExpressions extends 
Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Prunes unnecessary fields from a [[Generate]] if it is under a project 
which does not refer
+ * any generated attributes, .e.g., count-like aggregation on an exploded 
array.
+ */
+object GenerateOptimization extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan.transformDownWithPruning(
+  _.containsAllPatterns(PROJECT, GENERATE), ruleId) {
+  case p @ Project(_, g: Generate) if p.references.isEmpty
+  && g.generator.isInstanceOf[ExplodeBase] =>
+g.generator.children.head.dataType match {
+  case ArrayType(StructType(fields), _) =>
+val atomicFields = fields.collect {
+  case f: StructField if f.dataType.isInstanceOf[AtomicType] => f
+}
+val extractor = if (atomicFields.size > 0) {
+  // Pick an arbitrary atomic field, if any

Review comment:
   shall we pick the smaller list one? e.g. prefer int over string




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate

2021-12-01 Thread GitBox


cloud-fan commented on a change in pull request #34701:
URL: https://github.com/apache/spark/pull/34701#discussion_r760776370



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##
@@ -2162,6 +2163,48 @@ object RemoveLiteralFromGroupExpressions extends 
Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Prunes unnecessary fields from a [[Generate]] if it is under a project 
which does not refer
+ * any generated attributes, .e.g., count-like aggregation on an exploded 
array.
+ */
+object GenerateOptimization extends Rule[LogicalPlan] {
+  def apply(plan: LogicalPlan): LogicalPlan = plan.transformDownWithPruning(
+  _.containsAllPatterns(PROJECT, GENERATE), ruleId) {
+  case p @ Project(_, g: Generate) if p.references.isEmpty
+  && g.generator.isInstanceOf[ExplodeBase] =>
+g.generator.children.head.dataType match {
+  case ArrayType(StructType(fields), _) =>
+val atomicFields = fields.collect {
+  case f: StructField if f.dataType.isInstanceOf[AtomicType] => f
+}
+val extractor = if (atomicFields.size > 0) {
+  // Pick an arbitrary atomic field, if any
+  ExtractValue(g.generator.children.head,

Review comment:
   nit: I feel it's safer to create `GetStructField` instead of doing name 
lookup again. It's possible that some dataframe-generated query plan has name 
conflicts in the struct, and `GetStructField` allows us to put the ordinal 
directly to avoid a name lookup.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate

2021-12-01 Thread GitBox


cloud-fan commented on a change in pull request #34701:
URL: https://github.com/apache/spark/pull/34701#discussion_r760752530



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##
@@ -2162,6 +2163,74 @@ object RemoveLiteralFromGroupExpressions extends 
Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Prunes unnecessary fields from a [[Generate]] if it is under a count 
aggregation
+ * query.
+ */
+object CountAggregateOptimization extends Rule[LogicalPlan] {
+  private def isLiteralCountAggFun(func: AggregateFunction): Boolean = func 
match {
+case c: Count if c.children.size == 1 && 
c.children(0).isInstanceOf[Literal] => true
+case _ => false
+  }
+
+  private def isCandidate(agg: Aggregate): Boolean = {
+if (agg.aggregateExpressions.size == 1) {
+  agg.aggregateExpressions(0) match {
+case Alias(ae: AggregateExpression, _) if 
isLiteralCountAggFun(ae.aggregateFunction) =>
+  true
+case _ =>
+  false
+  }
+} else {
+  false
+}
+  }
+
+  private def pruningFields(agg: Aggregate): LogicalPlan = {
+agg.transformDownWithPruning(_.containsPattern(GENERATE), ruleId) {
+  case p @ Project(_, g: Generate) if 
p.references.intersect(g.outputSet).isEmpty

Review comment:
   is `p.references.isEmpty` clearer here?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] cloud-fan commented on a change in pull request #34701: [SPARK-37450][SQL] Prune unnecessary fields from Generate under count-only Aggregate

2021-12-01 Thread GitBox


cloud-fan commented on a change in pull request #34701:
URL: https://github.com/apache/spark/pull/34701#discussion_r759945029



##
File path: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala
##
@@ -2162,6 +2163,74 @@ object RemoveLiteralFromGroupExpressions extends 
Rule[LogicalPlan] {
   }
 }
 
+/**
+ * Prunes unnecessary fields from a [[Generate]] if it is under a count 
aggregation
+ * query.
+ */
+object CountAggregateOptimization extends Rule[LogicalPlan] {

Review comment:
   > also generally empty project cases.
   
   Yea, and not only empty, e.g. `Project(1, child)` can also trigger the 
optimization, as long as the Project has no references.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org