[GitHub] [spark] AngersZhuuuu commented on issue #26500: [SPARK-29874][SQL]Optimize Dataset.isEmpty()
AngersZh commented on issue #26500: [SPARK-29874][SQL]Optimize Dataset.isEmpty()
URL: https://github.com/apache/spark/pull/26500#issuecomment-556984517

> > will add three shuffles by limit(), groupby() and count()
>
> have you confirmed? groupby + count is one operator called Aggregate.

Updated, `count` won't trigger shuffle.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
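One way to check the operator and shuffle question discussed in this comment is to print the physical plans and look for `Exchange` nodes. The sketch below is only an illustration: it assumes Spark SQL is on the classpath and uses a local `SparkSession`; the object name and app name are placeholders, and the plans printed will differ across Spark versions and configurations.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical driver object, not part of the PR.
object ExplainIsEmptyPlans {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough to inspect the plans.
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("isEmpty-plans")
      .getOrCreate()

    val df = spark.range(1000).repartition(100)

    // Query shape using limit + global aggregate: prints its physical plan.
    df.limit(1).groupBy().count().explain()

    // Query shape that projects no columns: prints its physical plan.
    df.select().explain()

    spark.stop()
  }
}
```

Each `Exchange` node in the printed plans corresponds to a data exchange between stages, so comparing the two outputs shows how many exchanges each query shape introduces in a given environment.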
[GitHub] [spark] AngersZhuuuu commented on issue #26500: [SPARK-29874][SQL]Optimize Dataset.isEmpty()
AngersZh commented on issue #26500: [SPARK-29874][SQL]Optimize Dataset.isEmpty()
URL: https://github.com/apache/spark/pull/26500#issuecomment-556977743

> great! can you enrich the PR description? `Optimize Dataset.isEmpty()` is good in the "Why" section but we need to put more in the "What" section. e.g. we change the implementation to avoid shuffles.

Updated. Is it clear now?
[GitHub] [spark] AngersZhuuuu commented on issue #26500: [SPARK-29874][SQL]Optimize Dataset.isEmpty()
AngersZh commented on issue #26500: [SPARK-29874][SQL]Optimize Dataset.isEmpty()
URL: https://github.com/apache/spark/pull/26500#issuecomment-556952943

```
test("benchmark of empty") {
  // Before: limit(1) plus a global aggregate, then compare the count with 0.
  var start = System.currentTimeMillis()
  var isEmpty = spark.range(1000)
    .repartition(100)
    .limit(1)
    .groupBy()
    .count()
    .queryExecution.executedPlan.executeCollect().head.getLong(0) == 0
  println(isEmpty)
  var end = System.currentTimeMillis()
  // scalastyle:off
  println(s"duration = ${end - start}")

  // After: project no columns and take at most one row.
  start = System.currentTimeMillis()
  isEmpty = spark.range(1000)
    .repartition(100)
    .select()
    .queryExecution.executedPlan.executeTake(1).isEmpty
  println(isEmpty)
  end = System.currentTimeMillis()
  // scalastyle:off
  println(s"duration = ${end - start}")
}

Result
false
duration = 7248
false
duration = 1449
```

@cloud-fan @maropu @srowen The test case is simple, but it mimics the behavior before and after the API change.
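The start/end bookkeeping in the benchmark above can be factored into a small helper. The following is a minimal sketch in plain Scala with no Spark dependency; `TimingSketch` and `timed` are hypothetical names that do not appear in the PR. Note that a single run per variant, as in the test above, may include one-time start-up costs in the first measurement, so repeating each variant a few times is likely to give a fairer comparison.

```scala
object TimingSketch {
  // Hypothetical helper: evaluates `block` once and returns its result
  // together with the elapsed wall-clock time in milliseconds.
  def timed[T](block: => T): (T, Long) = {
    val start = System.nanoTime()
    val result = block
    val elapsedMs = (System.nanoTime() - start) / 1000000L
    (result, elapsedMs)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; in the benchmark this would be the Spark query.
    val (isEmpty, ms) = timed { Seq.empty[Int].isEmpty }
    println(s"isEmpty = $isEmpty, duration = $ms ms")
  }
}
```

In the test above, each of the two query expressions could be passed as the `block` argument, which keeps the timing logic identical for both variants.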
[GitHub] [spark] AngersZhuuuu commented on issue #26500: [SPARK-29874][SQL]Optimize Dataset.isEmpty()
AngersZh commented on issue #26500: [SPARK-29874][SQL]Optimize Dataset.isEmpty()
URL: https://github.com/apache/spark/pull/26500#issuecomment-556035473

> Ping @AngersZh

Thanks for the ping, and sorry for leaving this work pending. I have been a little busy these days. I am starting work on it now.
[GitHub] [spark] AngersZhuuuu commented on issue #26500: [SPARK-29874][SQL]Optimize Dataset.isEmpty()
AngersZh commented on issue #26500: [SPARK-29874][SQL]Optimize Dataset.isEmpty()
URL: https://github.com/apache/spark/pull/26500#issuecomment-553309981

cc @cloud-fan