[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-22 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/20277


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-22 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162876178
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -33,18 +33,6 @@
   private final ArrowVectorAccessor accessor;
   private ArrowColumnVector[] childColumns;
 
-  private void ensureAccessible(int index) {
-ensureAccessible(index, 1);
-  }
-
-  private void ensureAccessible(int index, int count) {
--- End diff --

It is good to do it later. I agree that we do the same check at one place.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162862761
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -33,18 +33,6 @@
   private final ArrowVectorAccessor accessor;
   private ArrowColumnVector[] childColumns;
 
-  private void ensureAccessible(int index) {
-ensureAccessible(index, 1);
-  }
-
-  private void ensureAccessible(int index, int count) {
--- End diff --

How about we do it later? We need to find a central place to put this 
check, instead of doing it in every implementation.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-21 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162860736
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala
 ---
@@ -127,8 +127,10 @@ class VectorizedHashMapGenerator(
 
 def genEqualsForKeys(groupingKeys: Seq[Buffer]): String = {
   groupingKeys.zipWithIndex.map { case (key: Buffer, ordinal: Int) =>
-s"""(${ctx.genEqual(key.dataType, 
ctx.getValue(s"vectors[$ordinal]", "buckets[idx]",
-  key.dataType), key.name)})"""
+// `ColumnVector.getStruct` is different from 
`InternalRow.getStruct`, it only takes an
+// `ordinal` parameter.
+val value = ctx.getValue(s"vectors[$ordinal]", key.dataType, 
"buckets[idx]")
--- End diff --

`getValueFromVector` instead of `getValue`?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-21 Thread ueshin
Github user ueshin commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162853774
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -55,164 +43,82 @@ public void close() {
 if (childColumns != null) {
   for (int i = 0; i < childColumns.length; i++) {
 childColumns[i].close();
+childColumns[i] = null;
   }
 }
--- End diff --

We need to do `childColumns = null` after the for loop, otherwise 
`NullPointerException` will be thrown if `close()` is called twice?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-21 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162844101
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -55,164 +43,82 @@ public void close() {
 if (childColumns != null) {
   for (int i = 0; i < childColumns.length; i++) {
 childColumns[i].close();
+childColumns[i] = null;
--- End diff --

what do you mean? `ColumnVector.close` is a required interface.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-19 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162748352
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -55,164 +43,82 @@ public void close() {
 if (childColumns != null) {
   for (int i = 0; i < childColumns.length; i++) {
 childColumns[i].close();
+childColumns[i] = null;
--- End diff --

Is it OK not to call `close()` while `ColumnVector.close()` is provided?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-19 Thread kiszk
Github user kiszk commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162747998
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -33,18 +33,6 @@
   private final ArrowVectorAccessor accessor;
   private ArrowColumnVector[] childColumns;
 
-  private void ensureAccessible(int index) {
-ensureAccessible(index, 1);
-  }
-
-  private void ensureAccessible(int index, int count) {
--- End diff --

I agree with this in non-debug version. Can we add assert of this check at 
each caller site for debugging?

p.s. Sorry for slow reviews since I am on vacation this week.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162547306
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala 
---
@@ -50,7 +50,14 @@ private[sql] trait ColumnarBatchScan extends 
CodegenSupport {
   dataType: DataType,
   nullable: Boolean): ExprCode = {
 val javaType = ctx.javaType(dataType)
-val value = ctx.getValue(columnVar, dataType, ordinal)
+val value = if (dataType.isInstanceOf[StructType]) {
+  // `ColumnVector.getStruct` is different from 
`InternalRow.getStruct`, it only takes an
+  // `ordinal` parameter.
+  s"$columnVar.getStruct($ordinal)"
+} else {
+  ctx.getValue(columnVar, dataType, ordinal)
+}
--- End diff --

ah I didn't know there is such an API. I'll use it instead.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-18 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162536698
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala 
---
@@ -50,7 +50,14 @@ private[sql] trait ColumnarBatchScan extends 
CodegenSupport {
   dataType: DataType,
   nullable: Boolean): ExprCode = {
 val javaType = ctx.javaType(dataType)
-val value = ctx.getValue(columnVar, dataType, ordinal)
+val value = if (dataType.isInstanceOf[StructType]) {
+  // `ColumnVector.getStruct` is different from 
`InternalRow.getStruct`, it only takes an
+  // `ordinal` parameter.
+  s"$columnVar.getStruct($ordinal)"
+} else {
+  ctx.getValue(columnVar, dataType, ordinal)
+}
--- End diff --

Can't we use this API?
```scala
  /**
   * Returns the specialized code to access a value from a column vector 
for a given `DataType`.
   */
  def getValue(vector: String, rowId: String, dataType: DataType): String = 
{
...
  }
```


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-18 Thread viirya
Github user viirya commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162536737
  
--- Diff: 
sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala
 ---
@@ -127,8 +127,14 @@ class VectorizedHashMapGenerator(
 
 def genEqualsForKeys(groupingKeys: Seq[Buffer]): String = {
   groupingKeys.zipWithIndex.map { case (key: Buffer, ordinal: Int) =>
-s"""(${ctx.genEqual(key.dataType, 
ctx.getValue(s"vectors[$ordinal]", "buckets[idx]",
-  key.dataType), key.name)})"""
+// `ColumnVector.getStruct` is different from 
`InternalRow.getStruct`, it only takes an
+// `ordinal` parameter.
+val value = if (key.dataType.isInstanceOf[StructType]) {
+  s"vectors[$ordinal].getStruct(buckets[idx])"
+} else {
+  ctx.getValue(s"vectors[$ordinal]", "buckets[idx]", key.dataType)
--- End diff --

ditto.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-18 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162534791
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -53,166 +41,83 @@ public int numNulls() {
   @Override
   public void close() {
 if (childColumns != null) {
-  for (int i = 0; i < childColumns.length; i++) {
-childColumns[i].close();
+  for (ArrowColumnVector childColumn : childColumns) {
--- End diff --

the performance is same, it's just a more standard way to iterate an array 
in java


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-18 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162531810
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -152,19 +198,11 @@ public final ColumnarRow getStruct(int rowId) {
 return new ColumnarRow(this, rowId);
   }
 
-  /**
-   * A special version of {@link #getStruct(int)}, which is only used as 
an adapter for Spark
-   * codegen framework, the second parameter is totally ignored.
-   */
-  public final ColumnarRow getStruct(int rowId, int size) {
-return getStruct(rowId);
-  }
-
   /**
* Returns the array for rowId.
*/
   public final ColumnarArray getArray(int rowId) {
-return new ColumnarArray(arrayData(), getArrayOffset(rowId), 
getArrayLength(rowId));
+return new ColumnarArray(getChild(0), getArrayOffset(rowId), 
getArrayLength(rowId));
--- End diff --

Why change this?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-18 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162531441
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -53,166 +41,83 @@ public int numNulls() {
   @Override
   public void close() {
 if (childColumns != null) {
-  for (int i = 0; i < childColumns.length; i++) {
-childColumns[i].close();
+  for (ArrowColumnVector childColumn : childColumns) {
--- End diff --

Will this be faster?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-18 Thread jiangxb1987
Github user jiangxb1987 commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r162531459
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -53,166 +41,83 @@ public int numNulls() {
   @Override
   public void close() {
 if (childColumns != null) {
-  for (int i = 0; i < childColumns.length; i++) {
-childColumns[i].close();
+  for (ArrowColumnVector childColumn : childColumns) {
--- End diff --

Should also apply similar change to `WritableColumnVector`


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r161957896
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -33,18 +33,6 @@
   private final ArrowVectorAccessor accessor;
   private ArrowColumnVector[] childColumns;
 
-  private void ensureAccessible(int index) {
-ensureAccessible(index, 1);
-  }
-
-  private void ensureAccessible(int index, int count) {
--- End diff --

ColumnVector is a performance critical place, we don't need index checking 
here, like other column vector implementations.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread cloud-fan
Github user cloud-fan commented on a diff in the pull request:

https://github.com/apache/spark/pull/20277#discussion_r161816826
  
--- Diff: 
sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java 
---
@@ -28,21 +28,14 @@
 /**
  * A column vector backed by Apache Arrow.
  */
-public final class ArrowColumnVector extends ColumnVector {
+public final class ArrowColumnVector implements ColumnVector {
 
   private final ArrowVectorAccessor accessor;
   private ArrowColumnVector[] childColumns;
 
-  private void ensureAccessible(int index) {
--- End diff --

`ColumnVector` is a performance critical place, we don't need index 
checking here, like other column vector implementations.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector

2018-01-16 Thread cloud-fan
GitHub user cloud-fan opened a pull request:

https://github.com/apache/spark/pull/20277

[SPARK-23090][SQL] polish ColumnVector

## What changes were proposed in this pull request?

Several improvements:
* make ColumnVector an interface instead of abstract class
* provide a default implementation for the batch get methods
* remove `arrayData`, as it's always the first child column vector
* rename `getChildColumn` to `getChild`, which is more concise

## How was this patch tested?

existing tests

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/cloud-fan/spark column-vector

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/20277.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #20277


commit ba7ca0f168b35e381a9fd53ca59dd39d9dbd5920
Author: Wenchen Fan 
Date:   2018-01-16T16:38:28Z

polish ColumnVector




---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org