[GitHub] [spark] karuppayya commented on pull request #28662: [SPARK-31850][SQL]Prevent DetermineTableStats from computing stats multiple times for same table

2020-07-06 Thread GitBox


karuppayya commented on pull request #28662:
URL: https://github.com/apache/spark/pull/28662#issuecomment-654417312


   https://github.com/apache/spark/pull/28686 should handle most cases. Closing this PR.


[GitHub] [spark] karuppayya commented on pull request #28662: [SPARK-31850][SQL]Prevent DetermineTableStats from computing stats multiple times for same table

2020-06-17 Thread GitBox


karuppayya commented on pull request #28662:
URL: https://github.com/apache/spark/pull/28662#issuecomment-645572180


   @viirya @maropu Can you please help review this PR?


[GitHub] [spark] karuppayya commented on pull request #28662: [SPARK-31850][SQL]Prevent DetermineTableStats from computing stats multiple times for same table

2020-06-10 Thread GitBox


karuppayya commented on pull request #28662:
URL: https://github.com/apache/spark/pull/28662#issuecomment-641746170


   @viirya @maropu Can you please help review this PR?


[GitHub] [spark] karuppayya commented on pull request #28662: [SPARK-31850][SQL]Prevent DetermineTableStats from computing stats multiple times for same table

2020-06-04 Thread GitBox


karuppayya commented on pull request #28662:
URL: https://github.com/apache/spark/pull/28662#issuecomment-639264681


   The above condition is already present. But we return a **copy** of the relation (code: https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveStrategies.scala#L137) with the updated table stats at the end of the method.
   - When the ResolvedAggregateFunction rule runs again (to reach a fixed point), it will not be aware of the updated relation; `executeWithSameContext` will rerun the stats collection as part of the DetermineTableStats rule.
   - When the DetermineTableStats rule actually runs as part of the analysis phase, it will not be aware of the updated relation.
   @viirya
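   For readers following along, here is a minimal Scala sketch of the pattern being described: a rule that fills in missing stats and returns a copy of the relation. None of the names below come from the Spark source; `TableId`, `Relation`, `computeSizeInBytes`, and `statsCache` are illustrative. Memoizing per table is one way, under these assumptions, to avoid recomputing when the rule is applied again before the earlier copy is visible.

```scala
import scala.collection.concurrent.TrieMap

// Illustrative stand-ins for a catalog table identifier and a logical relation.
case class TableId(db: String, name: String)
case class Relation(id: TableId, stats: Option[BigInt])

object DetermineTableStatsSketch {
  // Cache keyed by table id, so applying the rule again on a fresh copy of the
  // plan does not trigger another expensive size computation.
  private val statsCache = TrieMap.empty[TableId, BigInt]

  // Stand-in for the expensive work (e.g. listing files to sum their sizes).
  def computeSizeInBytes(id: TableId): BigInt = {
    println(s"scanning storage for $id ...")
    BigInt(42)
  }

  def apply(rel: Relation): Relation = rel match {
    case r @ Relation(id, None) =>
      // Without the cache, every pass that still sees `stats = None`
      // recomputes the size, which is the behaviour discussed above.
      val size = statsCache.getOrElseUpdate(id, computeSizeInBytes(id))
      r.copy(stats = Some(size))
    case other => other
  }
}

// Example: the second call hits the cache instead of rescanning storage.
// DetermineTableStatsSketch(Relation(TableId("default", "t"), None))
// DetermineTableStatsSketch(Relation(TableId("default", "t"), None))
```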


[GitHub] [spark] karuppayya commented on pull request #28662: [SPARK-31850][SQL]Prevent DetermineTableStats from computing stats multiple times for same table

2020-06-04 Thread GitBox


karuppayya commented on pull request #28662:
URL: https://github.com/apache/spark/pull/28662#issuecomment-638629109


   > Can you explain why DetermineTableStats will calculate the statistics multiple times?
   ```
   at org.apache.spark.sql.hive.DetermineTableStats.hiveTableWithStats(HiveStrategies.scala:121)
   at org.apache.spark.sql.hive.DetermineTableStats$$anonfun$apply$2.applyOrElse(HiveStrategies.scala:150)
   at org.apache.spark.sql.hive.DetermineTableStats$$anonfun$apply$2.applyOrElse(HiveStrategies.scala:147)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$Lambda$870.808816071.apply(Unknown Source:-1)
   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$Lambda$869.1113025977.apply(Unknown Source:-1)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$Lambda$872.1354725727.apply(Unknown Source:-1)
   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
   at org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$1144.1492742163.apply(Unknown Source:-1)
   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$Lambda$869.1113025977.apply(Unknown Source:-1)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$Lambda$872.1354725727.apply(Unknown Source:-1)
   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
   at org.apache.spark.sql.catalyst.trees.TreeNode$$Lambda$1144.1492742163.apply(Unknown Source:-1)
   at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:397)
   at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:350)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$$Lambda$869.1113025977.apply(Unknown Source:-1)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
   at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
   at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
   at org.apache.spark.sql.hive.DetermineTableStats.apply(HiveStrategies.scala:147)
   at org.apache.spark.sql.hive.DetermineTableStats.apply(HiveStrategies.scala:114)
   at
   ```

[GitHub] [spark] karuppayya commented on pull request #28662: [SPARK-31850][SQL]Prevent DetermineTableStats from computing stats multiple times for same table

2020-06-01 Thread GitBox


karuppayya commented on pull request #28662:
URL: https://github.com/apache/spark/pull/28662#issuecomment-637207991


   @maropu In my example I took Parquet as the data format. This can also happen with formats other than Parquet/ORC (like JSON, CSV, etc.), as in the sketch below.
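   A hedged reproduction sketch of that scenario (table and column names are made up; it assumes a Hive-enabled session): a Hive table in a non-Parquet/ORC format is never rewritten into a data source relation, so every query over it goes through DetermineTableStats.

```scala
import org.apache.spark.sql.SparkSession

// Requires Hive support; table and column names below are illustrative only.
val spark = SparkSession.builder()
  .appName("determine-table-stats-repro")
  .enableHiveSupport()
  .getOrCreate()

// A text-format Hive table stays a HiveTableRelation, so the analyzer's
// DetermineTableStats rule has to fill in its stats for every new query.
spark.sql("CREATE TABLE IF NOT EXISTS sales_text (id INT, amount DOUBLE) STORED AS TEXTFILE")
spark.sql("SELECT count(*) FROM sales_text").show()
```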


[GitHub] [spark] karuppayya commented on pull request #28662: [SPARK-31850][SQL]Prevent DetermineTableStats from computing stats multiple times for same table

2020-06-01 Thread GitBox


karuppayya commented on pull request #28662:
URL: https://github.com/apache/spark/pull/28662#issuecomment-637188562


   @maropu In the above comment, are you referring to the changes in https://github.com/apache/spark/pull/28686?
   Even if #28686 is fixed, the issue will still be there when the flag to convert Parquet/ORC Hive tables to datasource tables is disabled.
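   A hedged sketch of that configuration, reusing a Hive-enabled `spark` session as in the earlier sketch. The table name is made up; the two config keys are the standard conversion flags.

```scala
// With conversion to data source tables disabled, even Parquet/ORC Hive tables
// are planned as HiveTableRelation and go through DetermineTableStats.
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")
spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
spark.sql("SELECT count(*) FROM some_parquet_hive_table").show()
```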


[GitHub] [spark] karuppayya commented on pull request #28662: [SPARK-31850][SQL]Prevent DetermineTableStats from computing stats multiple times for same table

2020-05-29 Thread GitBox


karuppayya commented on pull request #28662:
URL: https://github.com/apache/spark/pull/28662#issuecomment-636105802


   Tagging a few more committers from the file's git history for review: @HeartSaVioR @holdenk @maropu
   Thank you.
