[GitHub] spark pull request #20277: [SPARK-23090][SQL] polish ColumnVector
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20277

---
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162876178

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -33,18 +33,6 @@
   private final ArrowVectorAccessor accessor;
   private ArrowColumnVector[] childColumns;

-  private void ensureAccessible(int index) {
-    ensureAccessible(index, 1);
-  }
-
-  private void ensureAccessible(int index, int count) {
--- End diff --

It is good to do it later. I agree that we should do the check in one place.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162862761

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -33,18 +33,6 @@
   private final ArrowVectorAccessor accessor;
   private ArrowColumnVector[] childColumns;

-  private void ensureAccessible(int index) {
-    ensureAccessible(index, 1);
-  }
-
-  private void ensureAccessible(int index, int count) {
--- End diff --

How about we do it later? We need to find a central place to put this check, instead of doing it in every implementation.
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162860736

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala ---
@@ -127,8 +127,10 @@ class VectorizedHashMapGenerator(
     def genEqualsForKeys(groupingKeys: Seq[Buffer]): String = {
       groupingKeys.zipWithIndex.map { case (key: Buffer, ordinal: Int) =>
-        s"""(${ctx.genEqual(key.dataType, ctx.getValue(s"vectors[$ordinal]", "buckets[idx]",
-          key.dataType), key.name)})"""
+        // `ColumnVector.getStruct` is different from `InternalRow.getStruct`, it only takes an
+        // `ordinal` parameter.
+        val value = ctx.getValue(s"vectors[$ordinal]", key.dataType, "buckets[idx]")
--- End diff --

`getValueFromVector` instead of `getValue`?
Github user ueshin commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162853774

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -55,164 +43,82 @@
   public void close() {
     if (childColumns != null) {
       for (int i = 0; i < childColumns.length; i++) {
         childColumns[i].close();
+        childColumns[i] = null;
       }
     }
--- End diff --

We need to do `childColumns = null` after the for loop, otherwise a `NullPointerException` will be thrown if `close()` is called twice?
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162844101

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -55,164 +43,82 @@
   public void close() {
     if (childColumns != null) {
       for (int i = 0; i < childColumns.length; i++) {
         childColumns[i].close();
+        childColumns[i] = null;
--- End diff --

What do you mean? `ColumnVector.close` is a required interface method.
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162748352

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -55,164 +43,82 @@
   public void close() {
     if (childColumns != null) {
       for (int i = 0; i < childColumns.length; i++) {
         childColumns[i].close();
+        childColumns[i] = null;
--- End diff --

Is it OK not to call `close()` while `ColumnVector.close()` is provided?
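The double-close hazard discussed above can be sketched in a toy class (illustrative names, not Spark's actual `ArrowColumnVector`): closing the children and then nulling the array after the loop makes a second `close()` a no-op instead of a `NullPointerException`.

```java
// Toy sketch of the pattern under discussion: an idempotent close() that
// nulls out the child array AFTER the loop, so a second close() skips the
// loop entirely instead of dereferencing nulled-out children.
public class Vec implements AutoCloseable {
  private Vec[] childColumns;
  private boolean closed = false;

  public Vec(Vec... children) {
    this.childColumns = children.length == 0 ? null : children;
  }

  public boolean isClosed() { return closed; }

  @Override
  public void close() {
    if (childColumns != null) {
      for (Vec child : childColumns) {
        child.close();
      }
      childColumns = null;  // guard: makes repeated close() calls safe
    }
    closed = true;
  }

  public static void main(String[] args) {
    Vec leaf = new Vec();
    Vec parent = new Vec(leaf);
    parent.close();
    parent.close();  // safe: childColumns is already null
    System.out.println(leaf.isClosed() && parent.isClosed());  // prints "true"
  }
}
```

If instead only the elements were nulled inside the loop (`childColumns[i] = null`) while `childColumns` itself stayed non-null, the second `close()` would hit `null.close()`, which is exactly ueshin's point.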
Github user kiszk commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162747998

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -33,18 +33,6 @@
   private final ArrowVectorAccessor accessor;
   private ArrowColumnVector[] childColumns;

-  private void ensureAccessible(int index) {
-    ensureAccessible(index, 1);
-  }
-
-  private void ensureAccessible(int index, int count) {
--- End diff --

I agree with this in the non-debug version. Can we add an assert for this check at each caller site for debugging?

P.S. Sorry for slow reviews since I am on vacation this week.
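The trade-off kiszk raises can be illustrated with a hypothetical vector class (not Spark's API): Java's `assert` compiles to a no-op unless the JVM runs with `-ea`, so a bounds check written this way costs nothing on the production fast path but still catches bad indices during debugging.

```java
// Hypothetical sketch: debug-only bounds checking via Java's `assert`.
// With `java -ea` the check fires; without it, the getter is unchecked,
// matching the "performance critical, no index checking" rationale.
public class IntVec {
  private final int[] data;

  public IntVec(int[] data) {
    this.data = data;
  }

  public int getInt(int rowId) {
    // Active only when assertions are enabled (-ea); free otherwise.
    assert rowId >= 0 && rowId < data.length : "row " + rowId + " out of bounds";
    return data[rowId];
  }

  public static void main(String[] args) {
    IntVec v = new IntVec(new int[] {10, 20, 30});
    System.out.println(v.getInt(2));  // prints "30"
  }
}
```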
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162547306

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala ---
@@ -50,7 +50,14 @@ private[sql] trait ColumnarBatchScan extends CodegenSupport {
       dataType: DataType,
       nullable: Boolean): ExprCode = {
     val javaType = ctx.javaType(dataType)
-    val value = ctx.getValue(columnVar, dataType, ordinal)
+    val value = if (dataType.isInstanceOf[StructType]) {
+      // `ColumnVector.getStruct` is different from `InternalRow.getStruct`, it only takes an
+      // `ordinal` parameter.
+      s"$columnVar.getStruct($ordinal)"
+    } else {
+      ctx.getValue(columnVar, dataType, ordinal)
+    }
--- End diff --

Ah, I didn't know there was such an API. I'll use it instead.
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162536698

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/ColumnarBatchScan.scala ---
@@ -50,7 +50,14 @@ private[sql] trait ColumnarBatchScan extends CodegenSupport {
       dataType: DataType,
       nullable: Boolean): ExprCode = {
     val javaType = ctx.javaType(dataType)
-    val value = ctx.getValue(columnVar, dataType, ordinal)
+    val value = if (dataType.isInstanceOf[StructType]) {
+      // `ColumnVector.getStruct` is different from `InternalRow.getStruct`, it only takes an
+      // `ordinal` parameter.
+      s"$columnVar.getStruct($ordinal)"
+    } else {
+      ctx.getValue(columnVar, dataType, ordinal)
+    }
--- End diff --

Can't we use this API?

```scala
/**
 * Returns the specialized code to access a value from a column vector for a given `DataType`.
 */
def getValue(vector: String, rowId: String, dataType: DataType): String = {
  ...
}
```
Github user viirya commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162536737

--- Diff: sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/VectorizedHashMapGenerator.scala ---
@@ -127,8 +127,14 @@ class VectorizedHashMapGenerator(
     def genEqualsForKeys(groupingKeys: Seq[Buffer]): String = {
       groupingKeys.zipWithIndex.map { case (key: Buffer, ordinal: Int) =>
-        s"""(${ctx.genEqual(key.dataType, ctx.getValue(s"vectors[$ordinal]", "buckets[idx]",
-          key.dataType), key.name)})"""
+        // `ColumnVector.getStruct` is different from `InternalRow.getStruct`, it only takes an
+        // `ordinal` parameter.
+        val value = if (key.dataType.isInstanceOf[StructType]) {
+          s"vectors[$ordinal].getStruct(buckets[idx])"
+        } else {
+          ctx.getValue(s"vectors[$ordinal]", "buckets[idx]", key.dataType)
--- End diff --

Ditto.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162534791

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -53,166 +41,83 @@ public int numNulls() {
   @Override
   public void close() {
     if (childColumns != null) {
-      for (int i = 0; i < childColumns.length; i++) {
-        childColumns[i].close();
+      for (ArrowColumnVector childColumn : childColumns) {
--- End diff --

The performance is the same; it's just a more standard way to iterate over an array in Java.
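The two loop forms being compared can be put side by side (illustrative example, not Spark code). For arrays, `javac` compiles the enhanced for statement to equivalent index-based bytecode, which is why the change is purely stylistic.

```java
// Side-by-side of the indexed loop (old code) and the enhanced for loop
// (new code). Over an array they compile to equivalent bytecode, so only
// readability differs.
public class LoopForms {
  // Indexed form, as in the code being replaced.
  static int sumIndexed(int[] a) {
    int s = 0;
    for (int i = 0; i < a.length; i++) {
      s += a[i];
    }
    return s;
  }

  // Enhanced form, as in the replacement; no explicit index variable.
  static int sumEnhanced(int[] a) {
    int s = 0;
    for (int x : a) {
      s += x;
    }
    return s;
  }

  public static void main(String[] args) {
    int[] a = {1, 2, 3, 4};
    System.out.println(sumIndexed(a) == sumEnhanced(a));  // prints "true"
  }
}
```

One caveat worth noting: the enhanced form hides the index, so it only fits loops that never need `i` — the `childColumns[i] = null` variant discussed earlier in this thread would still require the indexed form.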
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162531810

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ColumnVector.java ---
@@ -152,19 +198,11 @@
   public final ColumnarRow getStruct(int rowId) {
     return new ColumnarRow(this, rowId);
   }

-  /**
-   * A special version of {@link #getStruct(int)}, which is only used as an adapter for Spark
-   * codegen framework, the second parameter is totally ignored.
-   */
-  public final ColumnarRow getStruct(int rowId, int size) {
-    return getStruct(rowId);
-  }
-
   /**
    * Returns the array for rowId.
    */
   public final ColumnarArray getArray(int rowId) {
-    return new ColumnarArray(arrayData(), getArrayOffset(rowId), getArrayLength(rowId));
+    return new ColumnarArray(getChild(0), getArrayOffset(rowId), getArrayLength(rowId));
--- End diff --

Why change this?
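The answer follows from the PR description: for an array column the element data always lives in the first child vector, so `arrayData()` was redundant with `getChild(0)`. A toy model (illustrative names, not Spark's classes) of that layout:

```java
// Toy model of the array-column layout behind the change: rows are
// (offset, length) slices into a single flattened element vector, which
// plays the role of getChild(0) / the removed arrayData().
import java.util.Arrays;

public class ArrayVec {
  private final int[] elements;  // "child 0": flattened element data
  private final int[] offsets;   // per-row start offset into `elements`
  private final int[] lengths;   // per-row element count

  public ArrayVec(int[] elements, int[] offsets, int[] lengths) {
    this.elements = elements;
    this.offsets = offsets;
    this.lengths = lengths;
  }

  // Analogous to getChild(0): the element data vector.
  public int[] getChild0() {
    return elements;
  }

  // Analogous to getArray(rowId): a view of child 0 at this row's slice.
  public int[] getArray(int rowId) {
    int start = offsets[rowId];
    return Arrays.copyOfRange(getChild0(), start, start + lengths[rowId]);
  }

  public static void main(String[] args) {
    // Two rows: [1, 2, 3] and [4, 5]
    ArrayVec v = new ArrayVec(
        new int[] {1, 2, 3, 4, 5}, new int[] {0, 3}, new int[] {3, 2});
    System.out.println(Arrays.toString(v.getArray(1)));  // prints "[4, 5]"
  }
}
```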
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162531441

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -53,166 +41,83 @@ public int numNulls() {
   @Override
   public void close() {
     if (childColumns != null) {
-      for (int i = 0; i < childColumns.length; i++) {
-        childColumns[i].close();
+      for (ArrowColumnVector childColumn : childColumns) {
--- End diff --

Will this be faster?
Github user jiangxb1987 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r162531459

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -53,166 +41,83 @@ public int numNulls() {
   @Override
   public void close() {
     if (childColumns != null) {
-      for (int i = 0; i < childColumns.length; i++) {
-        childColumns[i].close();
+      for (ArrowColumnVector childColumn : childColumns) {
--- End diff --

We should also apply a similar change to `WritableColumnVector`.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r161957896

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -33,18 +33,6 @@
   private final ArrowVectorAccessor accessor;
   private ArrowColumnVector[] childColumns;

-  private void ensureAccessible(int index) {
-    ensureAccessible(index, 1);
-  }
-
-  private void ensureAccessible(int index, int count) {
--- End diff --

`ColumnVector` is performance critical; we don't need index checking here, just like the other column vector implementations.
Github user cloud-fan commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20277#discussion_r161816826

--- Diff: sql/core/src/main/java/org/apache/spark/sql/vectorized/ArrowColumnVector.java ---
@@ -28,21 +28,14 @@
 /**
  * A column vector backed by Apache Arrow.
  */
-public final class ArrowColumnVector extends ColumnVector {
+public final class ArrowColumnVector implements ColumnVector {

   private final ArrowVectorAccessor accessor;
   private ArrowColumnVector[] childColumns;

-  private void ensureAccessible(int index) {
--- End diff --

`ColumnVector` is performance critical; we don't need index checking here, just like the other column vector implementations.
GitHub user cloud-fan opened a pull request:

    https://github.com/apache/spark/pull/20277

    [SPARK-23090][SQL] polish ColumnVector

## What changes were proposed in this pull request?

Several improvements:
* make `ColumnVector` an interface instead of abstract class
* provide a default implementation for the batch get methods
* remove `arrayData`, as it's always the first child column vector
* rename `getChildColumn` to `getChild`, which is more concise

## How was this patch tested?

existing tests

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/cloud-fan/spark column-vector

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20277.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #20277

----

commit ba7ca0f168b35e381a9fd53ca59dd39d9dbd5920
Author: Wenchen Fan
Date:   2018-01-16T16:38:28Z

    polish ColumnVector
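The first two bullets of the PR description can be sketched together (simplified names and signatures, not Spark's real API): an interface whose batch getter is a `default` method built on the per-row getter, so concrete vectors only implement the single-row accessor.

```java
// Simplified sketch of the shape the PR moves to: ColumnVector as an
// interface with a default batch getter, instead of an abstract class.
// IntColumnVector / OnHeapIntVector are illustrative names only.
public interface IntColumnVector {
  int getInt(int rowId);

  // Default batch method: every implementation inherits it for free,
  // built on top of the single-row getter above.
  default int[] getInts(int rowId, int count) {
    int[] out = new int[count];
    for (int i = 0; i < count; i++) {
      out[i] = getInt(rowId + i);
    }
    return out;
  }
}

class OnHeapIntVector implements IntColumnVector {
  private final int[] data;

  OnHeapIntVector(int[] data) {
    this.data = data;
  }

  @Override
  public int getInt(int rowId) {
    return data[rowId];
  }

  public static void main(String[] args) {
    IntColumnVector v = new OnHeapIntVector(new int[] {7, 8, 9, 10});
    System.out.println(java.util.Arrays.toString(v.getInts(1, 2)));  // prints "[8, 9]"
  }
}
```

An interface (rather than an abstract class) leaves implementations free to extend another base class, while `default` methods still give the shared batch behavior the old abstract class provided.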