[GitHub] spark pull request #19842: [SPARK-22643][SQL] ColumnarArray should be an imm...

2017-11-30 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19842#discussion_r154045805
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala
 ---
@@ -387,20 +389,108 @@ object ColumnarBatchBenchmark {
 }
 
 /*
-String Read/Write:   Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------
-On Heap                 457.0            35.85             1.00 X
-Off Heap               1206.0            13.59             0.38 X
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.13.1
+Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+
+String Read/Write:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+--------------------------------------------------------------------------------------
+On Heap                               332 /  338         49.3          20.3       1.0X
+Off Heap                              466 /  467         35.2          28.4       0.7X
 */
 val benchmark = new Benchmark("String Read/Write", count * iters)
 benchmark.addCase("On Heap")(column(MemoryMode.ON_HEAP))
 benchmark.addCase("Off Heap")(column(MemoryMode.OFF_HEAP))
 benchmark.run
   }
 
+  def arrayAccess(iters: Int): Unit = {
+val random = new Random(0)
+val count = 4 * 1000
+
+val onHeapVector = new OnHeapColumnVector(count, ArrayType(IntegerType))
+val offHeapVector = new OffHeapColumnVector(count, ArrayType(IntegerType))
+
+val minSize = 3
+val maxSize = 32
+var arraysCount = 0
+var elementsCount = 0
+while (arraysCount < count) {
+  val size = random.nextInt(maxSize - minSize) + minSize
+  val onHeapArrayData = onHeapVector.arrayData()
+  val offHeapArrayData = offHeapVector.arrayData()
+
+  var i = 0
+  while (i < size) {
+val value = random.nextInt()
+onHeapArrayData.appendInt(value)
+offHeapArrayData.appendInt(value)
+i += 1
+  }
+
+  onHeapVector.putArray(arraysCount, elementsCount, size)
+  offHeapVector.putArray(arraysCount, elementsCount, size)
+  elementsCount += size
+  arraysCount += 1
+}
+
+def readArrays(onHeap: Boolean): Unit = {
+  System.gc()
+  val vector = if (onHeap) onHeapVector else offHeapVector
+
+  var sum = 0L
+  for (_ <- 0 until iters) {
+var i = 0
+while (i < count) {
+  sum += vector.getArray(i).numElements()
+  i += 1
+}
+  }
+}
+
+def readArrayElements(onHeap: Boolean): Unit = {
+  System.gc()
+  val vector = if (onHeap) onHeapVector else offHeapVector
+
+  var sum = 0L
+  for (_ <- 0 until iters) {
+var i = 0
+while (i < count) {
+  val array = vector.getArray(i)
+  val size = array.numElements()
+  var j = 0
+  while (j < size) {
+sum += array.getInt(j)
+j += 1
+  }
+  i += 1
+}
+  }
+}
+
+val benchmark = new Benchmark("Array Vector Read", count * iters)
+benchmark.addCase("On Heap Read Size Only") { _ => readArrays(true) }
+benchmark.addCase("Off Heap Read Size Only") { _ => readArrays(false) }
+benchmark.addCase("On Heap Read Elements") { _ => readArrayElements(true) }
+benchmark.addCase("Off Heap Read Elements") { _ => readArrayElements(false) }
+
+/*
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.13.1
+Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+
+Array Vector Read:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+--------------------------------------------------------------------------------------
+On Heap Read Size Only                415 /  422        394.7           2.5       1.0X
+Off Heap Read Size Only               394 /  402        415.9           2.4       1.1X
+On Heap Read Elements                2558 / 2593         64.0          15.6       0.2X
+Off Heap Read Elements               3316 / 3317         49.4          20.2       0.1X
--- End diff --

I understand that reusing the object gives good performance.
I am curious whether the current Catalyst can generate such Java code for
accessing nested array elements in SQL:
`selectExpr("a[1][2] + 
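The nested-array access pattern being asked about can be sketched as follows. This is a simplified, hypothetical model, not Spark's actual `ColumnVector`/`ColumnarArray` API: each level of `a[1][2]` goes through a `getArray` call, and with the new design each call allocates a fresh immutable view.

```scala
// Hypothetical sketch of nested array access through per-call views.
// Classes here are illustrative; real Spark vectors are more involved.
final class IntArrayView(data: Array[Int]) {
  def getInt(ordinal: Int): Int = data(ordinal)
}

final class NestedArrayView(rows: Array[Array[Array[Int]]], rowId: Int) {
  // a[i] -> a view over the i-th inner array; a fresh object per call
  def getArray(ordinal: Int): IntArrayView =
    new IntArrayView(rows(rowId)(ordinal))
}

// a[1][2] + a[1][3] allocates two IntArrayView instances (one per
// getArray call) -- the per-access overhead the reviewer asks about.
val rows = Array(Array(Array(1, 2, 3, 4), Array(5, 6, 7, 8)))
val a = new NestedArrayView(rows, 0)
val result = a.getArray(1).getInt(2) + a.getArray(1).getInt(3)
```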

[GitHub] spark pull request #19842: [SPARK-22643][SQL] ColumnarArray should be an imm...

2017-11-30 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/19842


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19842: [SPARK-22643][SQL] ColumnarArray should be an imm...

2017-11-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19842#discussion_r153854878
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala
 ---
@@ -387,20 +389,108 @@ object ColumnarBatchBenchmark {
 }
 
 /*
-String Read/Write:   Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------
-On Heap                 457.0            35.85             1.00 X
-Off Heap               1206.0            13.59             0.38 X
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.13.1
+Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+
+String Read/Write:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+--------------------------------------------------------------------------------------
+On Heap                               332 /  338         49.3          20.3       1.0X
+Off Heap                              466 /  467         35.2          28.4       0.7X
 */
 val benchmark = new Benchmark("String Read/Write", count * iters)
 benchmark.addCase("On Heap")(column(MemoryMode.ON_HEAP))
 benchmark.addCase("Off Heap")(column(MemoryMode.OFF_HEAP))
 benchmark.run
   }
 
+  def arrayAccess(iters: Int): Unit = {
+val random = new Random(0)
+val count = 4 * 1000
+
+val onHeapVector = new OnHeapColumnVector(count, ArrayType(IntegerType))
+val offHeapVector = new OffHeapColumnVector(count, ArrayType(IntegerType))
+
+val minSize = 3
+val maxSize = 32
+var arraysCount = 0
+var elementsCount = 0
+while (arraysCount < count) {
+  val size = random.nextInt(maxSize - minSize) + minSize
+  val onHeapArrayData = onHeapVector.arrayData()
+  val offHeapArrayData = offHeapVector.arrayData()
+
+  var i = 0
+  while (i < size) {
+val value = random.nextInt()
+onHeapArrayData.appendInt(value)
+offHeapArrayData.appendInt(value)
+i += 1
+  }
+
+  onHeapVector.putArray(arraysCount, elementsCount, size)
+  offHeapVector.putArray(arraysCount, elementsCount, size)
+  elementsCount += size
+  arraysCount += 1
+}
+
+def readArrays(onHeap: Boolean): Unit = {
+  System.gc()
+  val vector = if (onHeap) onHeapVector else offHeapVector
+
+  var sum = 0L
+  for (_ <- 0 until iters) {
+var i = 0
+while (i < count) {
+  sum += vector.getArray(i).numElements()
+  i += 1
+}
+  }
+}
+
+def readArrayElements(onHeap: Boolean): Unit = {
+  System.gc()
+  val vector = if (onHeap) onHeapVector else offHeapVector
+
+  var sum = 0L
+  for (_ <- 0 until iters) {
+var i = 0
+while (i < count) {
+  val array = vector.getArray(i)
+  val size = array.numElements()
+  var j = 0
+  while (j < size) {
+sum += array.getInt(j)
+j += 1
+  }
+  i += 1
+}
+  }
+}
+
+val benchmark = new Benchmark("Array Vector Read", count * iters)
+benchmark.addCase("On Heap Read Size Only") { _ => readArrays(true) }
+benchmark.addCase("Off Heap Read Size Only") { _ => readArrays(false) }
+benchmark.addCase("On Heap Read Elements") { _ => readArrayElements(true) }
+benchmark.addCase("Off Heap Read Elements") { _ => readArrayElements(false) }
+
+/*
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.13.1
+Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+
+Array Vector Read:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+--------------------------------------------------------------------------------------
+On Heap Read Size Only                415 /  422        394.7           2.5       1.0X
+Off Heap Read Size Only               394 /  402        415.9           2.4       1.1X
+On Heap Read Elements                2558 / 2593         64.0          15.6       0.2X
+Off Heap Read Elements               3316 / 3317         49.4          20.2       0.1X
--- End diff --

The result before this PR:
```
On Heap Read Size Only                 83 /   92       1970.3           0.5       1.0X
Off Heap Read Size Only                98 /  110  

[GitHub] spark pull request #19842: [SPARK-22643][SQL] ColumnarArray should be an imm...

2017-11-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19842#discussion_r153854357
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala
 ---
@@ -387,20 +389,108 @@ object ColumnarBatchBenchmark {
 }
 
 /*
-String Read/Write:   Avg Time(ms)    Avg Rate(M/s)  Relative Rate
-------------------------------------------------------------------
-On Heap                 457.0            35.85             1.00 X
-Off Heap               1206.0            13.59             0.38 X
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.13.1
+Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+
+String Read/Write:             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+--------------------------------------------------------------------------------------
+On Heap                               332 /  338         49.3          20.3       1.0X
+Off Heap                              466 /  467         35.2          28.4       0.7X
--- End diff --

This is due to the data copy that is saved in 
https://github.com/apache/spark/pull/19815/files#diff-f43d67d60091eab39c1310330bf7a8ffR211


---




[GitHub] spark pull request #19842: [SPARK-22643][SQL] ColumnarArray should be an imm...

2017-11-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19842#discussion_r153854036
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala
 ---
@@ -265,20 +263,22 @@ object ColumnarBatchBenchmark {
 }
 
 /*
-Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
-Int Read/Write:          Avg Time(ms)    Avg Rate(M/s)  Relative Rate
----------------------------------------------------------------------
-Java Array                  248.8          1317.04          1.00 X
-ByteBuffer Unsafe           435.6           752.25          0.57 X
-ByteBuffer API             1752.0           187.03          0.14 X
-DirectByteBuffer            595.4           550.35          0.42 X
-Unsafe Buffer               235.2          1393.20          1.06 X
-Column(on heap)             189.8          1726.45          1.31 X
-Column(off heap)            408.4           802.35          0.61 X
-Column(off heap direct)     237.6          1379.12          1.05 X
-UnsafeRow (on heap)         414.6           790.35          0.60 X
-UnsafeRow (off heap)        487.2           672.58          0.51 X
-Column On Heap Append       530.1           618.14          0.59 X
+Java HotSpot(TM) 64-Bit Server VM 1.8.0_60-b27 on Mac OS X 10.13.1
+Intel(R) Core(TM) i7-4960HQ CPU @ 2.60GHz
+
+Int Read/Write:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
+--------------------------------------------------------------------------------------
+Java Array                            177 /  181       1856.4           0.5       1.0X
+ByteBuffer Unsafe                     318 /  322       1032.0           1.0       0.6X
+ByteBuffer API                       1411 / 1418        232.2           4.3       0.1X
+DirectByteBuffer                      467 /  474        701.8           1.4       0.4X
+Unsafe Buffer                         178 /  185       1843.6           0.5       1.0X
+Column(on heap)                       178 /  184       1840.8           0.5       1.0X
--- End diff --

Previously, the on-heap column vector was much faster than the Java array, 
which is unreasonable, and I can't reproduce it now.


---




[GitHub] spark pull request #19842: [SPARK-22643][SQL] ColumnarArray should be an imm...

2017-11-29 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/19842#discussion_r153853783
  
--- Diff: 
sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchBenchmark.scala
 ---
@@ -265,20 +263,22 @@ object ColumnarBatchBenchmark {
 }
 
 /*
-Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz
--- End diff --

I re-ran all the benchmarks in this file and updated the results.
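The updated tables report "Best/Avg Time(ms)" rather than the single average the old tables showed. A minimal sketch of how such a best/avg harness can be built is below; this is an illustration only, not Spark's actual `Benchmark` utility, and the name `measureBestAvg` is made up:

```scala
// Hypothetical best/avg timing loop, similar in spirit to the harness
// used by ColumnarBatchBenchmark: run the body several times, report
// the fastest run and the mean.
def measureBestAvg(runs: Int)(body: => Unit): (Double, Double) = {
  val times = (1 to runs).map { _ =>
    val start = System.nanoTime()
    body
    (System.nanoTime() - start) / 1e6 // elapsed milliseconds
  }
  (times.min, times.sum / times.size) // (best, average)
}

// Example: time a simple summation loop a few times.
var sink = 0L // read by assertions below; keeps the JIT from dropping the loop
val (best, avg) = measureBestAvg(5) {
  var i = 0
  while (i < 1000000) { sink += i; i += 1 }
}
```

Reporting the best time alongside the average helps separate steady-state performance from warm-up and GC noise, which matters when comparing results across runs as done in this thread.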


---




[GitHub] spark pull request #19842: [SPARK-22643][SQL] ColumnarArray should be an imm...

2017-11-29 Thread hvanhovell
Github user hvanhovell commented on a diff in the pull request:

https://github.com/apache/spark/pull/19842#discussion_r153850016
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
 ---
@@ -175,9 +175,7 @@ public ColumnarRow getStruct(int rowId, int size) {
* Returns the array at rowid.
*/
   public final ColumnarArray getArray(int rowId) {
-resultArray.length = getArrayLength(rowId);
-resultArray.offset = getArrayOffset(rowId);
-return resultArray;
+return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
--- End diff --

I don't think that is a good idea: it would require us to keep an array of 
`ColumnarArray` instances around, which might mess with both GC and escape 
analysis. Let's just create a benchmark and check that we don't regress.


---




[GitHub] spark pull request #19842: [SPARK-22643][SQL] ColumnarArray should be an imm...

2017-11-29 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/19842#discussion_r153829687
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java
 ---
@@ -175,9 +175,7 @@ public ColumnarRow getStruct(int rowId, int size) {
* Returns the array at rowid.
*/
   public final ColumnarArray getArray(int rowId) {
-resultArray.length = getArrayLength(rowId);
-resultArray.offset = getArrayOffset(rowId);
-return resultArray;
+return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
--- End diff --

Would it be better to create a `ColumnarArray` for each `rowId` only once 
(e.g. via caching)? I am curious whether we would see performance overhead from 
creating a `ColumnarArray` to access elements of a multi-dimensional array (e.g. 
`a[1][2] + a[1][3]`).
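The caching idea suggested here can be sketched as below. This is a hypothetical illustration (the `CachingVector` and `View` names are invented); as the follow-up review comment notes, keeping such an array of views alive can hurt GC and defeat escape analysis, which is why the PR creates views on demand instead:

```scala
// Hypothetical sketch of the caching alternative: materialize one
// immutable view per rowId up front, then hand out the cached instance
// on every access instead of allocating a new one.
final class View(val offset: Int, val length: Int)

class CachingVector(offsets: Array[Int], lengths: Array[Int]) {
  private lazy val cache: Array[View] =
    Array.tabulate(offsets.length)(i => new View(offsets(i), lengths(i)))
  def getArray(rowId: Int): View = cache(rowId)
}
```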


---




[GitHub] spark pull request #19842: [SPARK-22643][SQL] ColumnarArray should be an imm...

2017-11-28 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/19842

[SPARK-22643][SQL] ColumnarArray should be an immutable view

## What changes were proposed in this pull request?

To make `ColumnVector` public, `ColumnarArray` needs to be public too, and 
we should not have mutable public fields in a public class. This PR proposes to 
make `ColumnarArray` an immutable view of the data, and to always create a new 
instance of `ColumnarArray` in `ColumnVector#getArray`.

## How was this patch tested?

Locally tested with TPC-DS; no performance regression was observed.
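The "immutable view" idea can be illustrated with a minimal sketch. This is a simplified, hypothetical model (class names invented), not the real `ColumnarArray`: the view holds no data of its own, only a reference to the backing column plus an (offset, length) window, so reads go through the backing store and nothing is copied.

```scala
// Minimal illustration of an immutable view over columnar data.
final class IntColumnData(data: Array[Int]) {
  def getInt(i: Int): Int = data(i)
}

final class ImmutableArrayView(backing: IntColumnData, offset: Int, length: Int) {
  def numElements: Int = length
  // Reads are delegated to the backing column; no element is copied
  // and nothing in this class can be mutated after construction.
  def getInt(ordinal: Int): Int = backing.getInt(offset + ordinal)
}

val data = new IntColumnData(Array(10, 20, 30, 40))
val view = new ImmutableArrayView(data, offset = 1, length = 2)
```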

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark column-vector

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19842.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19842


commit aaa33dde6c392ad96f4673a70480d7bbb73e0d41
Author: Wenchen Fan 
Date:   2017-11-29T07:16:57Z

ColumnarArray should be an immutable view




---
