[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/22501 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r226769745

    --- Diff: sql/core/benchmarks/WideSchemaBenchmark-results.txt ---
    @@ -1,117 +1,145 @@
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +parsing large select expressions
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     parsing large select:                    Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 select expressions                             2 /    4         0.0    2050147.0      1.0X
    -100 select expressions                           6 /    7         0.0    6123412.0      0.3X
    -2500 select expressions                        135 /  141         0.0  134623148.0      0.0X
    +1 select expressions                             2 /    4         0.0    1934953.0      1.0X
    +100 select expressions                           4 /    5         0.0    3659399.0      0.5X
    +2500 select expressions                         68 /   76         0.0   68278937.0      0.0X
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +many column field read and write
    +
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     many column field r/w:                   Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 cols x 10 rows (read in-mem)                  16 /   18         6.3        158.6      1.0X
    -1 cols x 10 rows (exec in-mem)                  17 /   19         6.0        166.7      1.0X
    -1 cols x 10 rows (read parquet)                 24 /   26         4.3        235.1      0.7X
    -1 cols x 10 rows (write parquet)                81 /   85         1.2        811.3      0.2X
    -100 cols x 1000 rows (read in-mem)              17 /   19         6.0        166.2      1.0X
    -100 cols x 1000 rows (exec in-mem)              25 /   27         4.0        249.2      0.6X
    -100 cols x 1000 rows (read parquet)             23 /   25         4.4        226.0      0.7X
    -100 cols x 1000 rows (write parquet)            83 /   87         1.2        831.0      0.2X
    -2500 cols x 40 rows (read in-mem)              132 /  137         0.8       1322.9      0.1X
    -2500 cols x 40 rows (exec in-mem)              326 /  330         0.3       3260.6      0.0X
    -2500 cols x 40 rows (read parquet)             831 /  839         0.1       8305.8      0.0X
    -2500 cols x 40 rows (write parquet)            237 /  245         0.4       2372.6      0.1X
    -
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +1 cols x 10 rows (read in-mem)                  22 /   25         4.6        219.4      1.0X
    +1 cols x 10 rows (exec in-mem)                  22 /   28         4.5        223.8      1.0X
    +1 cols x 10 rows (read parquet)                 45 /   49         2.2        449.6      0.5X
    +1 cols x 10 rows (write parquet)               204 /  223         0.5       2044.4      0.1X
    --- End diff --

    For this part, right, @rdblue. I guess so. After merging the EC2 result into @wangyum's PR, I'll compare the numbers one by one once again.
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user rdblue commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r226765772

    --- Diff: sql/core/benchmarks/WideSchemaBenchmark-results.txt ---
    @@ -1,117 +1,145 @@
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +parsing large select expressions
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     parsing large select:                    Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 select expressions                             2 /    4         0.0    2050147.0      1.0X
    -100 select expressions                           6 /    7         0.0    6123412.0      0.3X
    -2500 select expressions                        135 /  141         0.0  134623148.0      0.0X
    +1 select expressions                             2 /    4         0.0    1934953.0      1.0X
    +100 select expressions                           4 /    5         0.0    3659399.0      0.5X
    +2500 select expressions                         68 /   76         0.0   68278937.0      0.0X
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +many column field read and write
    +
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     many column field r/w:                   Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 cols x 10 rows (read in-mem)                  16 /   18         6.3        158.6      1.0X
    -1 cols x 10 rows (exec in-mem)                  17 /   19         6.0        166.7      1.0X
    -1 cols x 10 rows (read parquet)                 24 /   26         4.3        235.1      0.7X
    -1 cols x 10 rows (write parquet)                81 /   85         1.2        811.3      0.2X
    -100 cols x 1000 rows (read in-mem)              17 /   19         6.0        166.2      1.0X
    -100 cols x 1000 rows (exec in-mem)              25 /   27         4.0        249.2      0.6X
    -100 cols x 1000 rows (read parquet)             23 /   25         4.4        226.0      0.7X
    -100 cols x 1000 rows (write parquet)            83 /   87         1.2        831.0      0.2X
    -2500 cols x 40 rows (read in-mem)              132 /  137         0.8       1322.9      0.1X
    -2500 cols x 40 rows (exec in-mem)              326 /  330         0.3       3260.6      0.0X
    -2500 cols x 40 rows (read parquet)             831 /  839         0.1       8305.8      0.0X
    -2500 cols x 40 rows (write parquet)            237 /  245         0.4       2372.6      0.1X
    -
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +1 cols x 10 rows (read in-mem)                  22 /   25         4.6        219.4      1.0X
    +1 cols x 10 rows (exec in-mem)                  22 /   28         4.5        223.8      1.0X
    +1 cols x 10 rows (read parquet)                 45 /   49         2.2        449.6      0.5X
    +1 cols x 10 rows (write parquet)               204 /  223         0.5       2044.4      0.1X
    --- End diff --

    @dongjoon-hyun, so you are saying that it doesn't appear that there is a performance regression, right?
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r226742168

    --- Diff: sql/core/benchmarks/WideSchemaBenchmark-results.txt ---
    @@ -1,117 +1,145 @@
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +parsing large select expressions
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     parsing large select:                    Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 select expressions                             2 /    4         0.0    2050147.0      1.0X
    -100 select expressions                           6 /    7         0.0    6123412.0      0.3X
    -2500 select expressions                        135 /  141         0.0  134623148.0      0.0X
    +1 select expressions                             2 /    4         0.0    1934953.0      1.0X
    +100 select expressions                           4 /    5         0.0    3659399.0      0.5X
    +2500 select expressions                         68 /   76         0.0   68278937.0      0.0X
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +many column field read and write
    +
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     many column field r/w:                   Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 cols x 10 rows (read in-mem)                  16 /   18         6.3        158.6      1.0X
    -1 cols x 10 rows (exec in-mem)                  17 /   19         6.0        166.7      1.0X
    -1 cols x 10 rows (read parquet)                 24 /   26         4.3        235.1      0.7X
    -1 cols x 10 rows (write parquet)                81 /   85         1.2        811.3      0.2X
    -100 cols x 1000 rows (read in-mem)              17 /   19         6.0        166.2      1.0X
    -100 cols x 1000 rows (exec in-mem)              25 /   27         4.0        249.2      0.6X
    -100 cols x 1000 rows (read parquet)             23 /   25         4.4        226.0      0.7X
    -100 cols x 1000 rows (write parquet)            83 /   87         1.2        831.0      0.2X
    -2500 cols x 40 rows (read in-mem)              132 /  137         0.8       1322.9      0.1X
    -2500 cols x 40 rows (exec in-mem)              326 /  330         0.3       3260.6      0.0X
    -2500 cols x 40 rows (read parquet)             831 /  839         0.1       8305.8      0.0X
    -2500 cols x 40 rows (write parquet)            237 /  245         0.4       2372.6      0.1X
    -
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +1 cols x 10 rows (read in-mem)                  22 /   25         4.6        219.4      1.0X
    +1 cols x 10 rows (exec in-mem)                  22 /   28         4.5        223.8      1.0X
    +1 cols x 10 rows (read parquet)                 45 /   49         2.2        449.6      0.5X
    +1 cols x 10 rows (write parquet)               204 /  223         0.5       2044.4      0.1X
    --- End diff --

    The following [EC2 result](https://github.com/wangyum/spark/pull/19) shows a ratio consistent with Spark 2.1.0. The result on Mac seems to be unstable for some unknown reason, as in https://github.com/apache/spark/pull/22501#discussion_r226440992.
    ```scala
    1 cols x 10 rows (read parquet)                 61 /   70         1.6        610.2      0.6X
    1 cols x 10 rows (write parquet)               209 /  233         0.5       2086.1      0.2X
    ```
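The "ratio" being compared here is the `Relative` column. A minimal sketch of how it can be recomputed, assuming `Relative` is the first row's best time divided by each row's best time (the thread does not show the formula, and the printed times are rounded to whole milliseconds, so the last digit may differ from the real report). `Row` and `relative` are hypothetical names used only for this illustration:

```scala
object RelativeColumn {
  // Hypothetical container for one benchmark row (times in milliseconds).
  final case class Row(name: String, bestMs: Double, avgMs: Double)

  // Assumed formula: baseline best time / row best time, rounded to one decimal.
  def relative(baseline: Row, row: Row): Double =
    math.round(baseline.bestMs / row.bestMs * 10) / 10.0

  def main(args: Array[String]): Unit = {
    // Mac figures taken from the quoted diff; the baseline is the first row of the table.
    val baseline = Row("1 cols x 10 rows (read in-mem)", 22, 25)
    val rows = Seq(
      Row("1 cols x 10 rows (read parquet)", 45, 49),
      Row("1 cols x 10 rows (write parquet)", 204, 223))
    rows.foreach { r =>
      println(f"${r.name}%-36s ${relative(baseline, r)}%.1fX")
    }
  }
}
```

Under the same assumption, the Spark 2.1.0 rows (baseline 16 ms, read parquet 24 ms, write parquet 81 ms) give roughly 0.7X and 0.2X, which is the ratio the EC2 run is being compared against.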
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r226740901

    --- Diff: sql/core/benchmarks/WideSchemaBenchmark-results.txt ---
    @@ -1,117 +1,145 @@
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +parsing large select expressions
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     parsing large select:                    Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 select expressions                             2 /    4         0.0    2050147.0      1.0X
    -100 select expressions                           6 /    7         0.0    6123412.0      0.3X
    -2500 select expressions                        135 /  141         0.0  134623148.0      0.0X
    +1 select expressions                             2 /    4         0.0    1934953.0      1.0X
    +100 select expressions                           4 /    5         0.0    3659399.0      0.5X
    +2500 select expressions                         68 /   76         0.0   68278937.0      0.0X
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +many column field read and write
    +
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     many column field r/w:                   Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 cols x 10 rows (read in-mem)                  16 /   18         6.3        158.6      1.0X
    -1 cols x 10 rows (exec in-mem)                  17 /   19         6.0        166.7      1.0X
    -1 cols x 10 rows (read parquet)                 24 /   26         4.3        235.1      0.7X
    -1 cols x 10 rows (write parquet)                81 /   85         1.2        811.3      0.2X
    -100 cols x 1000 rows (read in-mem)              17 /   19         6.0        166.2      1.0X
    -100 cols x 1000 rows (exec in-mem)              25 /   27         4.0        249.2      0.6X
    -100 cols x 1000 rows (read parquet)             23 /   25         4.4        226.0      0.7X
    -100 cols x 1000 rows (write parquet)            83 /   87         1.2        831.0      0.2X
    -2500 cols x 40 rows (read in-mem)              132 /  137         0.8       1322.9      0.1X
    -2500 cols x 40 rows (exec in-mem)              326 /  330         0.3       3260.6      0.0X
    -2500 cols x 40 rows (read parquet)             831 /  839         0.1       8305.8      0.0X
    -2500 cols x 40 rows (write parquet)            237 /  245         0.4       2372.6      0.1X
    -
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +1 cols x 10 rows (read in-mem)                  22 /   25         4.6        219.4      1.0X
    +1 cols x 10 rows (exec in-mem)                  22 /   28         4.5        223.8      1.0X
    +1 cols x 10 rows (read parquet)                 45 /   49         2.2        449.6      0.5X
    +1 cols x 10 rows (write parquet)               204 /  223         0.5       2044.4      0.1X
    +100 cols x 1000 rows (read in-mem)              26 /   28         3.9        255.8      0.9X
    +100 cols x 1000 rows (exec in-mem)              32 /   35         3.1        319.3      0.7X
    +100 cols x 1000 rows (read parquet)             45 /   52         2.2        445.9      0.5X
    +100 cols x 1000 rows (write parquet)           275 /  536         0.4       2746.1      0.1X
    +2500 cols x 40 rows (read in-mem)              261 /  434         0.4       2607.3      0.1X
    +2500 cols x 40 rows (exec in-mem)              624 /  701         0.2       6240.5      0.0X
    +2500 cols x 40 rows (read parquet)             196 /  301         0.5       1963.4      0.1X
    +2500 cols x 40 rows (write parquet)            687 / 1049         0.1       6870.6      0.0X
    --- End diff --

    FYI, this large gap disappeared in the EC2 result.
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r226520120

    --- Diff: sql/core/benchmarks/WideSchemaBenchmark-results.txt ---
    @@ -1,117 +1,145 @@
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +parsing large select expressions
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     parsing large select:                    Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 select expressions                             2 /    4         0.0    2050147.0      1.0X
    -100 select expressions                           6 /    7         0.0    6123412.0      0.3X
    -2500 select expressions                        135 /  141         0.0  134623148.0      0.0X
    +1 select expressions                             2 /    4         0.0    1934953.0      1.0X
    +100 select expressions                           4 /    5         0.0    3659399.0      0.5X
    +2500 select expressions                         68 /   76         0.0   68278937.0      0.0X
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +many column field read and write
    +
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     many column field r/w:                   Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 cols x 10 rows (read in-mem)                  16 /   18         6.3        158.6      1.0X
    -1 cols x 10 rows (exec in-mem)                  17 /   19         6.0        166.7      1.0X
    -1 cols x 10 rows (read parquet)                 24 /   26         4.3        235.1      0.7X
    -1 cols x 10 rows (write parquet)                81 /   85         1.2        811.3      0.2X
    -100 cols x 1000 rows (read in-mem)              17 /   19         6.0        166.2      1.0X
    -100 cols x 1000 rows (exec in-mem)              25 /   27         4.0        249.2      0.6X
    -100 cols x 1000 rows (read parquet)             23 /   25         4.4        226.0      0.7X
    -100 cols x 1000 rows (write parquet)            83 /   87         1.2        831.0      0.2X
    -2500 cols x 40 rows (read in-mem)              132 /  137         0.8       1322.9      0.1X
    -2500 cols x 40 rows (exec in-mem)              326 /  330         0.3       3260.6      0.0X
    -2500 cols x 40 rows (read parquet)             831 /  839         0.1       8305.8      0.0X
    -2500 cols x 40 rows (write parquet)            237 /  245         0.4       2372.6      0.1X
    -
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +1 cols x 10 rows (read in-mem)                  22 /   25         4.6        219.4      1.0X
    +1 cols x 10 rows (exec in-mem)                  22 /   28         4.5        223.8      1.0X
    +1 cols x 10 rows (read parquet)                 45 /   49         2.2        449.6      0.5X
    +1 cols x 10 rows (write parquet)               204 /  223         0.5       2044.4      0.1X
    --- End diff --

    This may be a Parquet issue. I found that the binary write performance is a little worse after upgrading to Parquet 1.10.0: https://github.com/apache/parquet-mr/pull/505. I will verify it later.
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r226516354

    --- Diff: sql/core/benchmarks/WideSchemaBenchmark-results.txt ---
    @@ -1,117 +1,145 @@
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +parsing large select expressions
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     parsing large select:                    Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 select expressions                             2 /    4         0.0    2050147.0      1.0X
    -100 select expressions                           6 /    7         0.0    6123412.0      0.3X
    -2500 select expressions                        135 /  141         0.0  134623148.0      0.0X
    +1 select expressions                             2 /    4         0.0    1934953.0      1.0X
    +100 select expressions                           4 /    5         0.0    3659399.0      0.5X
    +2500 select expressions                         68 /   76         0.0   68278937.0      0.0X
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +many column field read and write
    +
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     many column field r/w:                   Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 cols x 10 rows (read in-mem)                  16 /   18         6.3        158.6      1.0X
    -1 cols x 10 rows (exec in-mem)                  17 /   19         6.0        166.7      1.0X
    -1 cols x 10 rows (read parquet)                 24 /   26         4.3        235.1      0.7X
    -1 cols x 10 rows (write parquet)                81 /   85         1.2        811.3      0.2X
    -100 cols x 1000 rows (read in-mem)              17 /   19         6.0        166.2      1.0X
    -100 cols x 1000 rows (exec in-mem)              25 /   27         4.0        249.2      0.6X
    -100 cols x 1000 rows (read parquet)             23 /   25         4.4        226.0      0.7X
    -100 cols x 1000 rows (write parquet)            83 /   87         1.2        831.0      0.2X
    -2500 cols x 40 rows (read in-mem)              132 /  137         0.8       1322.9      0.1X
    -2500 cols x 40 rows (exec in-mem)              326 /  330         0.3       3260.6      0.0X
    -2500 cols x 40 rows (read parquet)             831 /  839         0.1       8305.8      0.0X
    -2500 cols x 40 rows (write parquet)            237 /  245         0.4       2372.6      0.1X
    -
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +1 cols x 10 rows (read in-mem)                  22 /   25         4.6        219.4      1.0X
    +1 cols x 10 rows (exec in-mem)                  22 /   28         4.5        223.8      1.0X
    +1 cols x 10 rows (read parquet)                 45 /   49         2.2        449.6      0.5X
    +1 cols x 10 rows (write parquet)               204 /  223         0.5       2044.4      0.1X
    --- End diff --

    I have no idea how this happens. Can you create a JIRA ticket to investigate this regression?
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r226442573

    --- Diff: sql/core/benchmarks/WideSchemaBenchmark-results.txt ---
    @@ -1,117 +1,145 @@
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +parsing large select expressions
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     parsing large select:                    Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 select expressions                             2 /    4         0.0    2050147.0      1.0X
    -100 select expressions                           6 /    7         0.0    6123412.0      0.3X
    -2500 select expressions                        135 /  141         0.0  134623148.0      0.0X
    +1 select expressions                             2 /    4         0.0    1934953.0      1.0X
    +100 select expressions                           4 /    5         0.0    3659399.0      0.5X
    +2500 select expressions                         68 /   76         0.0   68278937.0      0.0X
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +many column field read and write
    +
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     many column field r/w:                   Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 cols x 10 rows (read in-mem)                  16 /   18         6.3        158.6      1.0X
    -1 cols x 10 rows (exec in-mem)                  17 /   19         6.0        166.7      1.0X
    -1 cols x 10 rows (read parquet)                 24 /   26         4.3        235.1      0.7X
    -1 cols x 10 rows (write parquet)                81 /   85         1.2        811.3      0.2X
    -100 cols x 1000 rows (read in-mem)              17 /   19         6.0        166.2      1.0X
    -100 cols x 1000 rows (exec in-mem)              25 /   27         4.0        249.2      0.6X
    -100 cols x 1000 rows (read parquet)             23 /   25         4.4        226.0      0.7X
    -100 cols x 1000 rows (write parquet)            83 /   87         1.2        831.0      0.2X
    -2500 cols x 40 rows (read in-mem)              132 /  137         0.8       1322.9      0.1X
    -2500 cols x 40 rows (exec in-mem)              326 /  330         0.3       3260.6      0.0X
    -2500 cols x 40 rows (read parquet)             831 /  839         0.1       8305.8      0.0X
    -2500 cols x 40 rows (write parquet)            237 /  245         0.4       2372.6      0.1X
    -
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +1 cols x 10 rows (read in-mem)                  22 /   25         4.6        219.4      1.0X
    +1 cols x 10 rows (exec in-mem)                  22 /   28         4.5        223.8      1.0X
    +1 cols x 10 rows (read parquet)                 45 /   49         2.2        449.6      0.5X
    +1 cols x 10 rows (write parquet)               204 /  223         0.5       2044.4      0.1X
    +100 cols x 1000 rows (read in-mem)              26 /   28         3.9        255.8      0.9X
    +100 cols x 1000 rows (exec in-mem)              32 /   35         3.1        319.3      0.7X
    +100 cols x 1000 rows (read parquet)             45 /   52         2.2        445.9      0.5X
    +100 cols x 1000 rows (write parquet)           275 /  536         0.4       2746.1      0.1X
    +2500 cols x 40 rows (read in-mem)              261 /  434         0.4       2607.3      0.1X
    +2500 cols x 40 rows (exec in-mem)              624 /  701         0.2       6240.5      0.0X
    +2500 cols x 40 rows (read parquet)             196 /  301         0.5       1963.4      0.1X
    +2500 cols x 40 rows (write parquet)            687 / 1049         0.1       6870.6      0.0X
    +
    +
    +
    +wide shallowly nested struct field read and write
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r226440992

    --- Diff: sql/core/benchmarks/WideSchemaBenchmark-results.txt ---
    @@ -1,117 +1,145 @@
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +parsing large select expressions
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     parsing large select:                    Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 select expressions                             2 /    4         0.0    2050147.0      1.0X
    -100 select expressions                           6 /    7         0.0    6123412.0      0.3X
    -2500 select expressions                        135 /  141         0.0  134623148.0      0.0X
    +1 select expressions                             2 /    4         0.0    1934953.0      1.0X
    +100 select expressions                           4 /    5         0.0    3659399.0      0.5X
    +2500 select expressions                         68 /   76         0.0   68278937.0      0.0X
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +many column field read and write
    +
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     many column field r/w:                   Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 cols x 10 rows (read in-mem)                  16 /   18         6.3        158.6      1.0X
    -1 cols x 10 rows (exec in-mem)                  17 /   19         6.0        166.7      1.0X
    -1 cols x 10 rows (read parquet)                 24 /   26         4.3        235.1      0.7X
    -1 cols x 10 rows (write parquet)                81 /   85         1.2        811.3      0.2X
    -100 cols x 1000 rows (read in-mem)              17 /   19         6.0        166.2      1.0X
    -100 cols x 1000 rows (exec in-mem)              25 /   27         4.0        249.2      0.6X
    -100 cols x 1000 rows (read parquet)             23 /   25         4.4        226.0      0.7X
    -100 cols x 1000 rows (write parquet)            83 /   87         1.2        831.0      0.2X
    -2500 cols x 40 rows (read in-mem)              132 /  137         0.8       1322.9      0.1X
    -2500 cols x 40 rows (exec in-mem)              326 /  330         0.3       3260.6      0.0X
    -2500 cols x 40 rows (read parquet)             831 /  839         0.1       8305.8      0.0X
    -2500 cols x 40 rows (write parquet)            237 /  245         0.4       2372.6      0.1X
    -
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +1 cols x 10 rows (read in-mem)                  22 /   25         4.6        219.4      1.0X
    +1 cols x 10 rows (exec in-mem)                  22 /   28         4.5        223.8      1.0X
    +1 cols x 10 rows (read parquet)                 45 /   49         2.2        449.6      0.5X
    +1 cols x 10 rows (write parquet)               204 /  223         0.5       2044.4      0.1X
    +100 cols x 1000 rows (read in-mem)              26 /   28         3.9        255.8      0.9X
    +100 cols x 1000 rows (exec in-mem)              32 /   35         3.1        319.3      0.7X
    +100 cols x 1000 rows (read parquet)             45 /   52         2.2        445.9      0.5X
    +100 cols x 1000 rows (write parquet)           275 /  536         0.4       2746.1      0.1X
    +2500 cols x 40 rows (read in-mem)              261 /  434         0.4       2607.3      0.1X
    +2500 cols x 40 rows (exec in-mem)              624 /  701         0.2       6240.5      0.0X
    +2500 cols x 40 rows (read parquet)             196 /  301         0.5       1963.4      0.1X
    +2500 cols x 40 rows (write parquet)            687 / 1049         0.1       6870.6      0.0X
    --- End diff --

    The difference between `best` and `average` is too large on line 32 and line 33. I'll try to run this on EC2, too.
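The best-versus-average gap mentioned here can be checked mechanically. The following is a rough sketch, not part of the PR: `Timing`, `avgOverBest`, `unstable`, and the 1.5 threshold are all hypothetical choices made for illustration, and the figures are copied from the Mac rows in the quoted diff.

```scala
object BestAvgGap {
  // Hypothetical container for one benchmark row (times in milliseconds).
  final case class Timing(name: String, bestMs: Double, avgMs: Double)

  // Ratio of average to best time; values well above 1.0 suggest a noisy run.
  def avgOverBest(t: Timing): Double = t.avgMs / t.bestMs

  // Rows whose ratio exceeds the (arbitrarily chosen) threshold.
  def unstable(rows: Seq[Timing], threshold: Double): Seq[Timing] =
    rows.filter(t => avgOverBest(t) > threshold)

  // Mac figures from the "+" rows of the quoted diff.
  val macRows: Seq[Timing] = Seq(
    Timing("100 cols x 1000 rows (read parquet)", 45, 52),
    Timing("100 cols x 1000 rows (write parquet)", 275, 536),
    Timing("2500 cols x 40 rows (read in-mem)", 261, 434),
    Timing("2500 cols x 40 rows (exec in-mem)", 624, 701),
    Timing("2500 cols x 40 rows (read parquet)", 196, 301),
    Timing("2500 cols x 40 rows (write parquet)", 687, 1049))

  def main(args: Array[String]): Unit =
    unstable(macRows, threshold = 1.5).foreach { t =>
      println(f"${t.name}%-38s avg/best = ${avgOverBest(t)}%.2f")
    }
}
```

With this threshold the two largest ratios belong to the `275 / 536` and `261 / 434` rows, which is the kind of spread that motivates rerunning on a quieter machine such as EC2.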
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r226439834

    --- Diff: sql/core/benchmarks/WideSchemaBenchmark-results.txt ---
    @@ -1,117 +1,145 @@
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +parsing large select expressions
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     parsing large select:                    Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 select expressions                             2 /    4         0.0    2050147.0      1.0X
    -100 select expressions                           6 /    7         0.0    6123412.0      0.3X
    -2500 select expressions                        135 /  141         0.0  134623148.0      0.0X
    +1 select expressions                             2 /    4         0.0    1934953.0      1.0X
    +100 select expressions                           4 /    5         0.0    3659399.0      0.5X
    +2500 select expressions                         68 /   76         0.0   68278937.0      0.0X
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +
    +many column field read and write
    +
    +
    +Java HotSpot(TM) 64-Bit Server VM 1.8.0_151-b12 on Mac OS X 10.12.6
    +Intel(R) Core(TM) i7-7820HQ CPU @ 2.90GHz
     many column field r/w:                   Best/Avg Time(ms)   Rate(M/s)  Per Row(ns)  Relative
    -1 cols x 10 rows (read in-mem)                  16 /   18         6.3        158.6      1.0X
    -1 cols x 10 rows (exec in-mem)                  17 /   19         6.0        166.7      1.0X
    -1 cols x 10 rows (read parquet)                 24 /   26         4.3        235.1      0.7X
    -1 cols x 10 rows (write parquet)                81 /   85         1.2        811.3      0.2X
    -100 cols x 1000 rows (read in-mem)              17 /   19         6.0        166.2      1.0X
    -100 cols x 1000 rows (exec in-mem)              25 /   27         4.0        249.2      0.6X
    -100 cols x 1000 rows (read parquet)             23 /   25         4.4        226.0      0.7X
    -100 cols x 1000 rows (write parquet)            83 /   87         1.2        831.0      0.2X
    -2500 cols x 40 rows (read in-mem)              132 /  137         0.8       1322.9      0.1X
    -2500 cols x 40 rows (exec in-mem)              326 /  330         0.3       3260.6      0.0X
    -2500 cols x 40 rows (read parquet)             831 /  839         0.1       8305.8      0.0X
    -2500 cols x 40 rows (write parquet)            237 /  245         0.4       2372.6      0.1X
    -
    -Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.11.6
    -Intel(R) Core(TM) i7-4980HQ CPU @ 2.80GHz
    +1 cols x 10 rows (read in-mem)                  22 /   25         4.6        219.4      1.0X
    +1 cols x 10 rows (exec in-mem)                  22 /   28         4.5        223.8      1.0X
    +1 cols x 10 rows (read parquet)                 45 /   49         2.2        449.6      0.5X
    +1 cols x 10 rows (write parquet)               204 /  223         0.5       2044.4      0.1X
    --- End diff --

    This might be a small regression in the Parquet writer since Spark 2.1.0 (SPARK-17335). cc @cloud-fan, @gatorsmile, and @rdblue
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user HyukjinKwon commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r224985471

    --- Diff: core/src/test/scala/org/apache/spark/benchmark/BenchmarkBase.scala ---
    @@ -48,15 +48,11 @@ abstract class BenchmarkBase {
             if (!file.exists()) {
               file.createNewFile()
             }
    -        output = Some(new FileOutputStream(file))
    +        output = Option(new FileOutputStream(file))
    --- End diff --

    My point was that, from my cursory look, there is no point in checking for `null` below. If there's no chance that it becomes `null`, we can leave it as `Some` and remove the `null` check below.
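For readers less familiar with the distinction under discussion, here is a small self-contained sketch. It only relies on standard-library behaviour: `Option(x)` turns a `null` reference into `None`, while `Some(x)` wraps whatever it is given. The object name and helper methods are hypothetical and not taken from `BenchmarkBase`.

```scala
import java.io.{File, FileOutputStream, OutputStream}

object OptionVsSome {
  // Option.apply maps a null reference to None.
  def wrapWithOption(s: OutputStream): Option[OutputStream] = Option(s)

  // Some.apply wraps the reference as is, even when it is null.
  def wrapWithSome(s: OutputStream): Option[OutputStream] = Some(s)

  def main(args: Array[String]): Unit = {
    val nullStream: OutputStream = null
    println(wrapWithOption(nullStream)) // None
    println(wrapWithSome(nullStream))   // Some(null)

    // A constructor call either returns a non-null object or throws,
    // so both wrappers give a defined Option for a freshly built stream.
    val file = File.createTempFile("benchmark-results", ".txt")
    file.deleteOnExit()
    val out = new FileOutputStream(file)
    try {
      println(wrapWithOption(out).isDefined) // true
      println(wrapWithSome(out).isDefined)   // true
    } finally {
      out.close()
    }
  }
}
```

Because `new FileOutputStream(file)` cannot evaluate to `null`, the two spellings behave the same at this call site, which is why the reviewers treat the change as unrelated to the refactoring.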
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r223220176

    --- Diff: core/src/test/scala/org/apache/spark/benchmark/BenchmarkBase.scala ---
    @@ -48,15 +48,11 @@ abstract class BenchmarkBase {
             if (!file.exists()) {
               file.createNewFile()
             }
    -        output = Some(new FileOutputStream(file))
    +        output = Option(new FileOutputStream(file))
    --- End diff --

    Why did you replace `Some` with `Option`? Are you worried that `new FileOutputStream(file)` becomes `null`?
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r223202914

    --- Diff: core/src/test/scala/org/apache/spark/benchmark/BenchmarkBase.scala ---
    @@ -48,15 +48,11 @@ abstract class BenchmarkBase {
             if (!file.exists()) {
               file.createNewFile()
             }
    -        output = Some(new FileOutputStream(file))
    +        output = Option(new FileOutputStream(file))
    --- End diff --

    I was worried that I would forget it after a long time, so I changed it this time. Should I revert it?
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r223196145

    --- Diff: core/src/test/scala/org/apache/spark/benchmark/BenchmarkBase.scala ---
    @@ -48,15 +48,11 @@ abstract class BenchmarkBase {
             if (!file.exists()) {
               file.createNewFile()
             }
    -        output = Some(new FileOutputStream(file))
    +        output = Option(new FileOutputStream(file))
    --- End diff --

    IIUC, @HyukjinKwon meant `when you need to touch this file`.
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r223195740

    --- Diff: core/src/test/scala/org/apache/spark/benchmark/BenchmarkBase.scala ---
    @@ -48,15 +48,11 @@ abstract class BenchmarkBase {
             if (!file.exists()) {
               file.createNewFile()
             }
    -        output = Some(new FileOutputStream(file))
    +        output = Option(new FileOutputStream(file))
    --- End diff --

    I changed it here because of https://github.com/apache/spark/pull/22443#discussion_r221181428
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r223195081

    --- Diff: core/src/test/scala/org/apache/spark/benchmark/BenchmarkBase.scala ---
    @@ -48,15 +48,11 @@ abstract class BenchmarkBase {
             if (!file.exists()) {
               file.createNewFile()
             }
    -        output = Some(new FileOutputStream(file))
    +        output = Option(new FileOutputStream(file))
    --- End diff --

    This looks like an irrelevant piggyback change.
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user wangyum commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r219725654

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/WideSchemaBenchmark.scala ---
    @@ -17,22 +17,19 @@
     package org.apache.spark.sql
    -import java.io.{File, FileOutputStream, OutputStream}
    +import java.io.File
    -import org.scalatest.BeforeAndAfterEach
    -
    -import org.apache.spark.SparkFunSuite
    -import org.apache.spark.sql.functions._
    -import org.apache.spark.util.{Benchmark, Utils}
    +import org.apache.spark.util.{Benchmark, BenchmarkBase => FileBenchmarkBase, Utils}
     /**
      * Benchmark for performance with very wide and nested DataFrames.
    - * To run this:
    - * build/sbt "sql/test-only *WideSchemaBenchmark"
    - *
    - * Results will be written to "sql/core/benchmarks/WideSchemaBenchmark-results.txt".
    + * To run this benchmark:
    + * 1. without sbt: bin/spark-submit --class
    + * 2. build/sbt "sql/test:runMain "
    + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
    + *Results will be written to "benchmarks/WideSchemaBenchmark-results.txt".
    --- End diff --

    Thanks @dongjoon-hyun. Actually I'm waiting for https://github.com/apache/spark/pull/22484. I want to move `withTempDir()` to `RunBenchmarkWithCodegen.scala`.
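The thread does not show the body of `withTempDir()`. As a rough idea of what such a helper usually looks like, here is a hypothetical loan-pattern sketch built only on the JDK; the object name, the directory prefix, and `deleteRecursively` are assumptions made for this illustration and are not Spark's implementation.

```scala
import java.io.File
import java.nio.file.Files

object TempDirSketch {
  // Delete a file or directory tree; listFiles() returns null for plain files.
  def deleteRecursively(f: File): Unit = {
    Option(f.listFiles()).foreach(_.foreach(deleteRecursively))
    f.delete()
  }

  // Create a temporary directory, lend it to the body, and always clean up.
  def withTempDir[T](body: File => T): T = {
    val dir = Files.createTempDirectory("wide-schema-benchmark").toFile
    try body(dir) finally deleteRecursively(dir)
  }

  def main(args: Array[String]): Unit = {
    val existedInside = withTempDir { dir =>
      new File(dir, "part-00000.parquet").createNewFile()
      dir.exists()
    }
    println(s"directory existed while the body ran: $existedInside")
  }
}
```

Keeping the helper in one shared place means each benchmark case can write its Parquet files into a scratch directory without repeating the cleanup logic.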
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
Github user dongjoon-hyun commented on a diff in the pull request:

    https://github.com/apache/spark/pull/22501#discussion_r219724989

    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/WideSchemaBenchmark.scala ---
    @@ -17,22 +17,19 @@
     package org.apache.spark.sql
    -import java.io.{File, FileOutputStream, OutputStream}
    +import java.io.File
    -import org.scalatest.BeforeAndAfterEach
    -
    -import org.apache.spark.SparkFunSuite
    -import org.apache.spark.sql.functions._
    -import org.apache.spark.util.{Benchmark, Utils}
    +import org.apache.spark.util.{Benchmark, BenchmarkBase => FileBenchmarkBase, Utils}
     /**
      * Benchmark for performance with very wide and nested DataFrames.
    - * To run this:
    - * build/sbt "sql/test-only *WideSchemaBenchmark"
    - *
    - * Results will be written to "sql/core/benchmarks/WideSchemaBenchmark-results.txt".
    + * To run this benchmark:
    + * 1. without sbt: bin/spark-submit --class
    + * 2. build/sbt "sql/test:runMain "
    + * 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain "
    + *Results will be written to "benchmarks/WideSchemaBenchmark-results.txt".
    --- End diff --

    Could you fix the doc generation failure?
[GitHub] spark pull request #22501: [SPARK-25492][TEST] Refactor WideSchemaBenchmark ...
GitHub user wangyum opened a pull request:

    https://github.com/apache/spark/pull/22501

    [SPARK-25492][TEST] Refactor WideSchemaBenchmark to use main method

    ## What changes were proposed in this pull request?

    Refactor `WideSchemaBenchmark` to use main method.
    Generate benchmark result:
    ```sh
    SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.WideSchemaBenchmark"
    ```

    ## How was this patch tested?

    manual tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wangyum/spark SPARK-25492

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/22501.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #22501

commit f56b73223fbf765e408d9aef6565a2318f4836e3
Author: Yuming Wang
Date:   2018-09-20T16:04:30Z

    Refactor WideSchemaBenchmark
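To make the "use main method" idea concrete, here is a simplified sketch of the pattern, written under stated assumptions: `MainMethodBenchmark` and `ToyBenchmark` are hypothetical names, the results path is illustrative, and the check that the environment variable equals `"1"` is an assumed convention. It is not the `BenchmarkBase` class touched by this PR.

```scala
import java.io.{File, FileOutputStream, PrintStream}

// Hypothetical base class illustrating the pattern; not Spark's BenchmarkBase.
abstract class MainMethodBenchmark {
  // Results go to stdout by default and to a file when regeneration is requested.
  protected var out: PrintStream = System.out

  // Subclasses put their benchmark cases here.
  def runBenchmarkSuite(): Unit

  // Assumed convention: the variable must be exactly "1" to enable file output.
  def shouldGenerateFiles(env: Map[String, String]): Boolean =
    env.get("SPARK_GENERATE_BENCHMARK_FILES").contains("1")

  def main(args: Array[String]): Unit = {
    if (shouldGenerateFiles(sys.env)) {
      val name = getClass.getSimpleName.stripSuffix("$")
      // Illustrative location, relative to the working directory.
      val file = new File(s"benchmarks/$name-results.txt")
      Option(file.getParentFile).foreach(_.mkdirs())
      out = new PrintStream(new FileOutputStream(file))
    }
    try runBenchmarkSuite()
    finally if (out ne System.out) out.close()
  }
}

object ToyBenchmark extends MainMethodBenchmark {
  override def runBenchmarkSuite(): Unit = {
    val start = System.nanoTime()
    val total = (1L to 1000000L).sum
    val elapsedMs = (System.nanoTime() - start) / 1e6
    out.println(f"sum(1..1000000) = $total, took $elapsedMs%.2f ms")
  }
}
```

Compared with a ScalaTest suite, a plain `main` entry point can be launched with `runMain` or `spark-submit`, and the same code path either prints to the console or regenerates the checked-in results file depending on the environment variable.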