[GitHub] [spark] wangyum commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation
wangyum commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-523029568

@cloud-fan I added a test for an external partitioned table: https://github.com/apache/spark/pull/24715/files#diff-8c27508821958acbe016862c9ab2f25fR754-R779

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
wangyum commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-522441286

retest this please
wangyum commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-521943871

retest this please
wangyum commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-521867590

I switched to an idle cluster; [PartitioningAwareFileIndex.sizeInBytes](https://github.com/apache/spark/blob/b276788d57b270d455ef6a7c5ed6cf8a74885dde/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L103) is still generally faster than `CommandUtils.getSizeInBytesFallBackToHdfs`: https://issues.apache.org/jira/browse/SPARK-25474?focusedCommentId=16908660&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-16908660
wangyum commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-521683954

> @wangyum do you mean CommandUtils.getSizeInBytesFallBackToHdfs is very slow if there are many files?

`CommandUtils.getSizeInBytesFallBackToHdfs` is not very slow. I don't know why [PartitioningAwareFileIndex.sizeInBytes](https://github.com/apache/spark/blob/b276788d57b270d455ef6a7c5ed6cf8a74885dde/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala#L103) is faster than `CommandUtils.getSizeInBytesFallBackToHdfs`; it may be related to cluster load. I plan to switch to an idle cluster and test tomorrow.
wangyum commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-521611355

I did some benchmarking. Prepare the data:

```scala
spark.range(1).repartition(1).write.saveAsTable("test_non_partition_1")
spark.range(1).repartition(30).write.saveAsTable("test_non_partition_30")
spark.range(1).selectExpr("id", "id % 5000 as c2", "id as c3").repartition(org.apache.spark.sql.functions.col("c2")).write.partitionBy("c2").saveAsTable("test_partition_5000")
spark.range(1).selectExpr("id", "id % 1 as c2", "id as c3").repartition(org.apache.spark.sql.functions.col("c2")).write.partitionBy("c2").saveAsTable("test_partition_1")
```

Add these lines to [LogicalRelation.computeStats](https://github.com/apache/spark/blob/950d407f2b22f1ae088c55cba3d0081c3c1ecff9/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala#L44):

```scala
val time1 = System.currentTimeMillis()
val relationSize = relation.sizeInBytes
val time2 = System.currentTimeMillis()
val fallBackToHdfsSize = CommandUtils.getSizeInBytesFallBackToHdfs(relation.sqlContext.sparkSession, catalogTable.get)
val time3 = System.currentTimeMillis()
// scalastyle:off
println(s"Get size from relation: $relationSize, time: ${time2 - time1}")
println(s"Get size fall back to HDFS: $fallBackToHdfsSize, time: ${time3 - time2}")
// scalastyle:on
```

Non-partitioned table benchmark result (times in ms):

```
scala> spark.sql("explain cost select * from test_non_partition_1 limit 1").show
Get size from relation: 576588171, time: 22
Get size fall back to HDFS: 576588171, time: 41
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+

scala> spark.sql("explain cost select * from test_non_partition_1 limit 1").show
Get size from relation: 576588171, time: 3
Get size fall back to HDFS: 576588171, time: 28
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+

scala> spark.sql("explain cost select * from test_non_partition_30 limit 1").show
Get size from relation: 706507984, time: 135
Get size fall back to HDFS: 706507984, time: 2038
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+

scala> spark.sql("explain cost select * from test_non_partition_30 limit 1").show
Get size from relation: 706507984, time: 168
Get size fall back to HDFS: 706507984, time: 3629
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+
```

Partitioned table benchmark result:

```
scala> spark.sql("explain cost select * from test_partition_5000 limit 1").show
Get size from relation: 9223372036854775807, time: 0
Get size fall back to HDFS: 1018560794, time: 46
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+

scala> spark.sql("explain cost select * from test_partition_1 limit 1").show
Get size from relation: 9223372036854775807, time: 0
Get size fall back to HDFS: 1036799332, time: 43
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+
```

Partitioned table with `spark.sql.hive.manageFilesourcePartitions=false` (set via --conf) benchmark result:

```
scala> spark.sql("set spark.sql.hive.manageFilesourcePartitions").show
+--------------------+-----+
|                 key|value|
+--------------------+-----+
|spark.sql.hive.ma...|false|
+--------------------+-----+

scala> spark.sql("explain cost select * from test_partition_5000 limit 1").show
Get size from relation: 1018560794, time: 3
Get size fall back to HDFS: 1018560794, time: 45
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+

scala> spark.sql("explain cost select * from test_partition_1 limit 1").show
Get size from relation: 1036799332, time: 865
Get size fall back to HDFS: 1036799332, time: 69
+--------------------+
|                plan|
+--------------------+
|== Optimized Logi...|
+--------------------+
```
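Note: `9223372036854775807` is `Long.MaxValue`, Spark's placeholder for an unknown size, which is why the HDFS fallback matters for partitioned tables. To illustrate what a fallback size estimate conceptually does, here is a minimal sketch that recursively sums file lengths under a table's root directory. It is a hypothetical helper (`totalSizeInBytes` is not part of Spark) and uses the local filesystem via `java.io.File` for self-containment, whereas `CommandUtils.getSizeInBytesFallBackToHdfs` performs the analogous walk through the Hadoop `FileSystem` API:

```scala
import java.io.File

// Hypothetical sketch: recursively sum the sizes of all files under a
// table's root directory (data files plus files in partition
// subdirectories such as c2=1/). Spark's real fallback does the
// analogous traversal against HDFS via the Hadoop FileSystem API.
def totalSizeInBytes(root: File): Long =
  if (root.isFile) root.length()
  else Option(root.listFiles()).toSeq.flatten.map(totalSizeInBytes).sum
```

Because the traversal touches every file status, its cost grows with the number of files and partitions, which is consistent with the larger fallback times for `test_non_partition_30` above.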
wangyum commented on issue #24715: [SPARK-25474][SQL] Data source tables support fallback to HDFS for size estimation URL: https://github.com/apache/spark/pull/24715#issuecomment-506934789

retest this please