[GitHub] spark pull request #19842: [SPARK-22643][SQL] ColumnarArray should be an immutable view
Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/19842#discussion_r154045805

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala ---
@@ -387,20 +389,108 @@ object ColumnarBatchBenchmark {
     }

     /*
-    String Read/Write:            Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-    -------------------------------------------------------------------------
-    On Heap                              457.0            35.85         1.00 X
-    Off Heap                            1206.0            13.59         0.38 X
+    Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.13.1
+    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+
+    String Read/Write:      Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
+    ------------------------------------------------------------------------------
+    On Heap                        332 /  338        49.3          20.3       1.0X
+    Off Heap                       466 /  467        35.2          28.4       0.7X
     */
     val benchmark = new Benchmark("String Read/Write", count * iters)
     benchmark.addCase("On Heap")(column(MemoryMode.ON_HEAP))
     benchmark.addCase("Off Heap")(column(MemoryMode.OFF_HEAP))
     benchmark.run
   }

+  def arrayAccess(iters: Int): Unit = {
+    val random = new Random(0)
+    val count = 4 * 1000
+
+    val onHeapVector = new OnHeapColumnVector(count, ArrayType(IntegerType))
+    val offHeapVector = new OffHeapColumnVector(count, ArrayType(IntegerType))
+
+    val minSize = 3
+    val maxSize = 32
+    var arraysCount = 0
+    var elementsCount = 0
+    while (arraysCount < count) {
+      val size = random.nextInt(maxSize - minSize) + minSize
+      val onHeapArrayData = onHeapVector.arrayData()
+      val offHeapArrayData = offHeapVector.arrayData()
+
+      var i = 0
+      while (i < size) {
+        val value = random.nextInt()
+        onHeapArrayData.appendInt(value)
+        offHeapArrayData.appendInt(value)
+        i += 1
+      }
+
+      onHeapVector.putArray(arraysCount, elementsCount, size)
+      offHeapVector.putArray(arraysCount, elementsCount, size)
+      elementsCount += size
+      arraysCount += 1
+    }
+
+    def readArrays(onHeap: Boolean): Unit = {
+      System.gc()
+      val vector = if (onHeap) onHeapVector else offHeapVector
+
+      var sum = 0L
+      for (_ <- 0 until iters) {
+        var i = 0
+        while (i < count) {
+          sum += vector.getArray(i).numElements()
+          i += 1
+        }
+      }
+    }
+
+    def readArrayElements(onHeap: Boolean): Unit = {
+      System.gc()
+      val vector = if (onHeap) onHeapVector else offHeapVector
+
+      var sum = 0L
+      for (_ <- 0 until iters) {
+        var i = 0
+        while (i < count) {
+          val array = vector.getArray(i)
+          val size = array.numElements()
+          var j = 0
+          while (j < size) {
+            sum += array.getInt(j)
+            j += 1
+          }
+          i += 1
+        }
+      }
+    }
+
+    val benchmark = new Benchmark("Array Vector Read", count * iters)
+    benchmark.addCase("On Heap Read Size Only") { _ => readArrays(true) }
+    benchmark.addCase("Off Heap Read Size Only") { _ => readArrays(false) }
+    benchmark.addCase("On Heap Read Elements") { _ => readArrayElements(true) }
+    benchmark.addCase("Off Heap Read Elements") { _ => readArrayElements(false) }
+
+    /*
+    Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.13.1
+    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+
+    Array Vector Read:      Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
+    ------------------------------------------------------------------------------
+    On Heap Read Size Only         415 /  422       394.7           2.5       1.0X
+    Off Heap Read Size Only        394 /  402       415.9           2.4       1.1X
+    On Heap Read Elements         2558 / 2593        64.0          15.6       0.2X
+    Off Heap Read Elements        3316 / 3317        49.4          20.2       0.1X
--- End diff --

I understand that reusing the object gives good performance. I am curious whether the current catalyst can generate such Java code for accessing nested array elements in SQL `selectExpr("a[1][2] +
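The two read patterns benchmarked above differ only in whether they touch the flat element buffer: "Read Size Only" looks at per-row metadata, while "Read Elements" also walks the shared data. Below is a Spark-free sketch of that offsets-plus-lengths layout; all names (`IntArrayView`, `getArray`, etc.) are illustrative, not Spark's actual API.

```scala
object FlatArrayColumnSketch {
  // A view over a slice of the shared flat buffer, like ColumnarArray.
  final case class IntArrayView(data: Array[Int], offset: Int, length: Int) {
    def numElements(): Int = length
    def getInt(i: Int): Int = data(offset + i)
  }

  // All array elements live in one flat buffer; each row stores only
  // an (offset, length) pair into it.
  val data = Array(10, 20, 30, 40, 50, 60)
  val offsets = Array(0, 2, 5) // row i starts here
  val lengths = Array(2, 3, 1) // row i has this many elements

  def getArray(rowId: Int): IntArrayView =
    IntArrayView(data, offsets(rowId), lengths(rowId))

  // "Read Size Only": touches just the per-row metadata.
  def sumSizes(): Long =
    (0 until offsets.length).map(getArray(_).numElements().toLong).sum

  // "Read Elements": additionally walks the flat buffer row by row.
  def sumElements(): Long = {
    var sum = 0L
    var row = 0
    while (row < offsets.length) {
      val a = getArray(row)
      var j = 0
      while (j < a.numElements()) { sum += a.getInt(j); j += 1 }
      row += 1
    }
    sum
  }

  def main(args: Array[String]): Unit = {
    println(sumSizes())    // 6
    println(sumElements()) // 210
  }
}
```

The element-reading loop creates one view per row, which is exactly the allocation this benchmark was written to measure.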
Github user asfgit closed the pull request at:
https://github.com/apache/spark/pull/19842

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19842#discussion_r153854878

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala ---
@@ -387,20 +389,108 @@ object ColumnarBatchBenchmark {
+    /*
+    Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.13.1
+    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+
+    Array Vector Read:      Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
+    ------------------------------------------------------------------------------
+    On Heap Read Size Only         415 /  422       394.7           2.5       1.0X
+    Off Heap Read Size Only        394 /  402       415.9           2.4       1.1X
+    On Heap Read Elements         2558 / 2593        64.0          15.6       0.2X
+    Off Heap Read Elements        3316 / 3317        49.4          20.2       0.1X
--- End diff --

the result before this PR:

```
On Heap Read Size Only           83 /   92      1970.3           0.5       1.0X
Off Heap Read Size Only          98 /  110
```
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19842#discussion_r153854357

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala ---
@@ -387,20 +389,108 @@ object ColumnarBatchBenchmark {
     }

     /*
-    String Read/Write:            Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-    -------------------------------------------------------------------------
-    On Heap                              457.0            35.85         1.00 X
-    Off Heap                            1206.0            13.59         0.38 X
+    Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.13.1
+    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+
+    String Read/Write:      Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
+    ------------------------------------------------------------------------------
+    On Heap                        332 /  338        49.3          20.3       1.0X
+    Off Heap                       466 /  467        35.2          28.4       0.7X
--- End diff --

This is due to the data copy saved in https://github.com/apache/spark/pull/19815/files#diff-f43d67d60091eab39c1310330bf7a8ffR211
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19842#discussion_r153854036

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala ---
@@ -265,20 +263,22 @@ object ColumnarBatchBenchmark {
     }

     /*
-    Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
-    Int Read/Write:               Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-    -------------------------------------------------------------------------
-    Java Array                           248.8          1317.04         1.00 X
-    ByteBuffer Unsafe                    435.6           752.25         0.57 X
-    ByteBuffer API                      1752.0           187.03         0.14 X
-    DirectByteBuffer                     595.4           550.35         0.42 X
-    Unsafe Buffer                        235.2          1393.20         1.06 X
-    Column(on heap)                      189.8          1726.45         1.31 X
-    Column(off heap)                     408.4           802.35         0.61 X
-    Column(off heap direct)              237.6          1379.12         1.05 X
-    UnsafeRow (on heap)                  414.6           790.35         0.60 X
-    UnsafeRow (off heap)                 487.2           672.58         0.51 X
-    Column On Heap Append                530.1           618.14         0.59 X
+    Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.13.1
+    Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+
+    Int Read/Write:         Best/Avg Time(ms)   Rate(M/s)   Per Row(ns)   Relative
+    ------------------------------------------------------------------------------
+    Java Array                     177 /  181      1856.4           0.5       1.0X
+    ByteBuffer Unsafe              318 /  322      1032.0           1.0       0.6X
+    ByteBuffer API                1411 / 1418       232.2           4.3       0.1X
+    DirectByteBuffer               467 /  474       701.8           1.4       0.4X
+    Unsafe Buffer                  178 /  185      1843.6           0.5       1.0X
+    Column(on heap)                178 /  184      1840.8           0.5       1.0X
--- End diff --

Previously the on-heap column vector was much faster than the Java array, which was unreasonable, and I can't reproduce it now.
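The gap between the "Java Array" and "ByteBuffer API" rows comes from loops of the following shape. This is a simplified, Spark-free sketch (names are mine, not the benchmark's): the plain-array loop is easy for the JIT to optimize, while `ByteBuffer.getInt` adds bounds checks and byte-order handling on every access.

```scala
import java.nio.ByteBuffer

object IntReadSketch {
  // "Java Array" case: straight index-based loop.
  def sumViaArray(values: Array[Int]): Long = {
    var sum = 0L
    var i = 0
    while (i < values.length) { sum += values(i); i += 1 }
    sum
  }

  // "ByteBuffer API" case: same values read back through absolute getInt,
  // which checks bounds and assembles bytes per the buffer's byte order.
  def sumViaByteBuffer(values: Array[Int]): Long = {
    val buf = ByteBuffer.allocate(values.length * 4)
    values.foreach(v => buf.putInt(v))
    var sum = 0L
    var i = 0
    while (i < values.length) { sum += buf.getInt(i * 4); i += 1 }
    sum
  }

  def main(args: Array[String]): Unit = {
    val values = Array.tabulate(1000)(identity)
    println(sumViaArray(values))      // 499500
    println(sumViaByteBuffer(values)) // 499500
  }
}
```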
Github user cloud-fan commented on a diff in the pull request:
https://github.com/apache/spark/pull/19842#discussion_r153853783

--- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala ---
@@ -265,20 +263,22 @@ object ColumnarBatchBenchmark {
     }

     /*
-    Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
--- End diff --

I reran all the benchmarks in this file and updated the results.
Github user hvanhovell commented on a diff in the pull request:
https://github.com/apache/spark/pull/19842#discussion_r153850016

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java ---
@@ -175,9 +175,7 @@ public ColumnarRow getStruct(int rowId, int size) {
    * Returns the array at rowid.
    */
   public final ColumnarArray getArray(int rowId) {
-    resultArray.length = getArrayLength(rowId);
-    resultArray.offset = getArrayOffset(rowId);
-    return resultArray;
+    return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
--- End diff --

I don't think that is a good idea. That would require us to keep an array of `ColumnarArray` around. That might mess with both GC and escape analysis. Let's just create a benchmark and check that we do not regress.
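The tradeoff in this diff can be seen in miniature below: before the change, `getArray` mutated and returned one shared object; after it, each call returns a fresh immutable instance. The bet is that the JIT's escape analysis scalar-replaces the short-lived instance so the extra allocation rarely reaches the heap. This is an illustrative sketch, not Spark's actual classes.

```scala
object ViewStyleSketch {
  // Before: one mutable instance reused across calls. Cheap, but not
  // thread-safe, and unsafe for callers to hold across calls.
  final class MutableView { var offset = 0; var length = 0 }
  private val shared = new MutableView
  def getArrayReused(offset: Int, length: Int): MutableView = {
    shared.offset = offset
    shared.length = length
    shared
  }

  // After: an immutable view; every call is independent of the others.
  final case class ImmutableView(offset: Int, length: Int)
  def getArrayImmutable(offset: Int, length: Int): ImmutableView =
    ImmutableView(offset, length)

  def main(args: Array[String]): Unit = {
    val a = getArrayReused(0, 2)
    getArrayReused(2, 4)  // second call clobbers `a` in place
    println(a.offset)     // 2, not the 0 the caller asked for

    val c = getArrayImmutable(0, 2)
    getArrayImmutable(2, 4) // independent instance; `c` is unaffected
    println(c.offset)       // 0
  }
}
```

The clobbering shown for the reused view is exactly why a mutable view cannot be part of a public API, which is the motivation stated in the PR description.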
Github user kiszk commented on a diff in the pull request:
https://github.com/apache/spark/pull/19842#discussion_r153829687

--- Diff: sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java ---
@@ -175,9 +175,7 @@ public ColumnarRow getStruct(int rowId, int size) {
    * Returns the array at rowid.
    */
   public final ColumnarArray getArray(int rowId) {
-    resultArray.length = getArrayLength(rowId);
-    resultArray.offset = getArrayOffset(rowId);
-    return resultArray;
+    return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
--- End diff --

Is it better to create `ColumnarArray` for each `rowId` only once (e.g. by caching)? I am curious whether we would see performance overhead from creating a `ColumnarArray` to access elements of a multi-dimensional array (e.g. `a[1][2] + a[1][3]`).
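The caching idea raised here can be sketched as follows (illustrative names; not code from the PR): one view per `rowId` is materialized eagerly and reused, trading `count` long-lived objects for zero per-access allocation. The counter-argument made elsewhere in this thread is that keeping all those views live can add GC pressure and defeat escape analysis.

```scala
object CachedViewSketch {
  final case class View(offset: Int, length: Int)

  val offsets = Array(0, 3, 7)
  val lengths = Array(3, 4, 2)

  // Build every view eagerly, exactly once.
  val cache: Array[View] =
    Array.tabulate(offsets.length)(i => View(offsets(i), lengths(i)))

  // No allocation per call: repeated access returns the same instance.
  def getArray(rowId: Int): View = cache(rowId)

  def main(args: Array[String]): Unit = {
    println(getArray(1) eq getArray(1)) // true: same cached instance
    println(getArray(1).length)         // 4
  }
}
```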
GitHub user cloud-fan opened a pull request:
https://github.com/apache/spark/pull/19842

[SPARK-22643][SQL] ColumnarArray should be an immutable view

## What changes were proposed in this pull request?

To make `ColumnVector` public, `ColumnarArray` needs to be public too, and we should not have mutable public fields in a public class. This PR proposes to make `ColumnarArray` an immutable view of the data, and to always create a new instance of `ColumnarArray` in `ColumnVector#getArray`.

## How was this patch tested?

Locally tested with TPC-DS; no performance regression is observed.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark column-vector

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19842.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19842

commit aaa33dde6c392ad96f4673a70480d7bbb73e0d41
Author: Wenchen Fan
Date:   2017-11-29T07:16:57Z

    ColumnarArray should be an immutable view
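A minimal model of the design the PR description proposes, using simplified stand-in types rather than Spark's real `ColumnVector`/`ColumnarArray`: the view's state is fixed at construction, and `getArray` returns a new instance on every call, so handing one view to a caller never invalidates another.

```scala
object ImmutableViewSketch {
  // All state is set at construction and never mutated afterwards.
  final class ColumnarArrayLike(data: Array[Int], offset: Int, length: Int) {
    def numElements(): Int = length
    def getInt(ordinal: Int): Int = data(offset + ordinal)
  }

  final class IntArrayVector(data: Array[Int],
                             offsets: Array[Int],
                             lengths: Array[Int]) {
    // Mirrors the proposed getArray: always a NEW view, never a shared
    // mutable one.
    def getArray(rowId: Int): ColumnarArrayLike =
      new ColumnarArrayLike(data, offsets(rowId), lengths(rowId))
  }

  def demoVector(): IntArrayVector =
    new IntArrayVector(Array(7, 8, 9, 10), Array(0, 2), Array(2, 2))

  def main(args: Array[String]): Unit = {
    val v = demoVector()
    val first = v.getArray(0)
    val second = v.getArray(1) // does not invalidate `first`
    println(first.getInt(0) + first.getInt(1))   // 15
    println(second.getInt(0) + second.getInt(1)) // 19
  }
}
```

Because the views are small and short-lived in typical row-at-a-time loops, the hope expressed in this thread is that the JIT can often elide the allocation entirely.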