[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure
Github user asfgit closed the pull request at: https://github.com/apache/drill/pull/914 ---
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r152664121 --- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java --- @@ -882,4 +882,71 @@ public void print(StringBuilder sb, int indent, Verbosity verbosity) { } } + // The "unsafe" methods are for use ONLY by code that does its own + // bounds checking. They are called "unsafe" for a reason: they will crash + // the JVM if values are addressed out of bounds. + + /** + * Write an integer to the buffer at the given byte index, without + * bounds checks. + * + * @param index byte (not int) index of the location to write + * @param value the value to write + */ + + public void unsafePutInt(int index, int value) { --- End diff -- Fixed. ---
Github user bitblender commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r140644571 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/TupleState.java --- @@ -0,0 +1,353 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.physical.rowSet.impl; + +import java.util.ArrayList; +import java.util.List; + +import org.apache.drill.exec.expr.TypeHelper; +import org.apache.drill.exec.physical.rowSet.impl.ColumnState.BaseMapColumnState; +import org.apache.drill.exec.physical.rowSet.impl.ColumnState.MapArrayColumnState; +import org.apache.drill.exec.physical.rowSet.impl.ColumnState.MapColumnState; +import org.apache.drill.exec.record.ColumnMetadata; +import org.apache.drill.exec.record.MaterializedField; +import org.apache.drill.exec.record.TupleMetadata; +import org.apache.drill.exec.record.TupleSchema; +import org.apache.drill.exec.record.TupleSchema.AbstractColumnMetadata; +import org.apache.drill.exec.vector.ValueVector; +import org.apache.drill.exec.vector.accessor.ObjectType; +import org.apache.drill.exec.vector.accessor.ObjectWriter; +import org.apache.drill.exec.vector.accessor.TupleWriter; +import org.apache.drill.exec.vector.accessor.TupleWriter.TupleWriterListener; +import org.apache.drill.exec.vector.accessor.impl.HierarchicalFormatter; +import org.apache.drill.exec.vector.accessor.writer.AbstractObjectWriter; +import org.apache.drill.exec.vector.accessor.writer.AbstractTupleWriter; +import org.apache.drill.exec.vector.accessor.writer.ColumnWriterFactory; +import org.apache.drill.exec.vector.complex.AbstractMapVector; + +public abstract class TupleState implements TupleWriterListener { + + public static class RowState extends TupleState { + +/** + * The row-level writer for stepping through rows as they are written, + * and for accessing top-level columns. 
+ */ + +private final RowSetLoaderImpl writer; + +public RowState(ResultSetLoaderImpl rsLoader) { + super(rsLoader, rsLoader.projectionSet); + writer = new RowSetLoaderImpl(rsLoader, schema); + writer.bindListener(this); +} + +public RowSetLoaderImpl rootWriter() { return writer; } + +@Override +public AbstractTupleWriter writer() { return writer; } + +@Override +public int innerCardinality() { return resultSetLoader.targetRowCount();} + } + + public static class MapState extends TupleState { + +protected final AbstractMapVector mapVector; +protected final BaseMapColumnState mapColumnState; +protected int outerCardinality; + +public MapState(ResultSetLoaderImpl rsLoader, +BaseMapColumnState mapColumnState, +AbstractMapVector mapVector, +ProjectionSet projectionSet) { + super(rsLoader, projectionSet); + this.mapVector = mapVector; + this.mapColumnState = mapColumnState; + mapColumnState.writer().bindListener(this); +} + +@Override +protected void columnAdded(ColumnState colState) { + @SuppressWarnings("resource") + ValueVector vector = colState.vector(); + + // Can't materialize the child if the map itself is + // not materialized. + + assert mapVector != null || vector == null; + if (vector != null) { +mapVector.putChild(vector.getField().getName(), vector); + } +} + +@Override +public AbstractTupleWriter writer() { + AbstractObjectWriter objWriter = mapColumnState.writer(); + TupleWriter tupleWriter; + if (objWriter.type() == ObjectType.ARRAY) { +tupleWriter = objWriter.array().tuple(); + } else { +tupleWriter = objWriter.tuple(); + } + return (AbstractTupleWriter) tupleWriter; +} + +@Override
Github user bitblender commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r151762298 --- Diff: exec/vector/src/main/codegen/templates/ColumnAccessors.java --- @@ -275,17 +273,17 @@ public boolean isNull() { final int offset = writeIndex(len); <#else> final int writeIndex = writeIndex(); - <#assign putAddr = "bufAddr + writeIndex * VALUE_WIDTH"> + <#assign putAddr = "writeIndex * VALUE_WIDTH"> --- End diff -- putAddr is not an address but an offset, right? It should be renamed. ---
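The rename question above stems from the template change itself: the generated writers now compute a relative offset (`writeIndex * VALUE_WIDTH`) rather than an absolute address (`bufAddr + ...`). A minimal illustration of the two conventions — names here are hypothetical, not Drill's generated code:

```java
// Illustrates absolute-address vs relative-offset addressing for a
// fixed-width (4-byte) vector slot. Names are illustrative only.
class AddressingSketch {
  static final int VALUE_WIDTH = 4;

  // Old style: the writer combines a cached base address with the scaled index.
  static long absoluteAddr(long bufAddr, int writeIndex) {
    return bufAddr + (long) writeIndex * VALUE_WIDTH;
  }

  // New style: the writer computes only the offset; the buffer applies its
  // own base address (and, for checked puts, its own bounds check).
  static int offset(int writeIndex) {
    return writeIndex * VALUE_WIDTH;
  }
}
```

The offset form keeps the buffer's base address encapsulated in the buffer, which matters when the underlying memory can be reallocated on resize.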
Github user bitblender commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r151762286 --- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java --- @@ -882,4 +882,71 @@ public void print(StringBuilder sb, int indent, Verbosity verbosity) { } } + // The "unsafe" methods are for use ONLY by code that does its own + // bounds checking. They are called "unsafe" for a reason: they will crash + // the JVM if values are addressed out of bounds. + + /** + * Write an integer to the buffer at the given byte index, without + * bounds checks. + * + * @param index byte (not int) index of the location to write + * @param value the value to write + */ + + public void unsafePutInt(int index, int value) { --- End diff -- The first argument in these unsafePutXXX methods is an offset, right? Should 'index' be changed to 'offset'? ---
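The contract these comments describe — callers do their own bounds checking, then use raw puts — can be sketched roughly as below. This is a toy model over a heap `byte[]` (Drill's `DrillBuf` works on direct memory, where an out-of-bounds write really can crash the JVM); the class, method names, and fixed row layout are illustrative, not Drill's API.

```java
// Sketch of the caller-checked pattern: one bounds check per row of writes,
// then raw puts with no per-value checking.
class UnsafePutSketch {
  private final byte[] buf;

  UnsafePutSketch(int capacity) { buf = new byte[capacity]; }

  // "Unsafe": no bounds check here; the caller must have verified capacity.
  // Writes the int little-endian at the given byte offset.
  void unsafePutInt(int offset, int value) {
    buf[offset]     = (byte) value;
    buf[offset + 1] = (byte) (value >>> 8);
    buf[offset + 2] = (byte) (value >>> 16);
    buf[offset + 3] = (byte) (value >>> 24);
  }

  // Little-endian read-back, for verification.
  int getInt(int offset) {
    return (buf[offset] & 0xFF)
        | (buf[offset + 1] & 0xFF) << 8
        | (buf[offset + 2] & 0xFF) << 16
        | (buf[offset + 3] & 0xFF) << 24;
  }

  // The writer performs ONE check for the whole row, then uses raw puts.
  void writeRow(int rowIndex, int[] values) {
    int offset = rowIndex * values.length * 4;
    if (offset + values.length * 4 > buf.length) {
      throw new IndexOutOfBoundsException("row " + rowIndex);
    }
    for (int v : values) {
      unsafePutInt(offset, v);
      offset += 4;
    }
  }
}
```

Hoisting the bounds check out of the per-value path is the point of the "unsafe" variants: the writer already knows the row fits, so a per-put check would be pure overhead.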
Github user parthchandra commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r151019919 --- Diff: exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/BaseScalarWriter.java --- @@ -0,0 +1,264 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.vector.accessor.writer; + +import java.math.BigDecimal; + +import org.apache.drill.exec.vector.accessor.ColumnWriterIndex; +import org.apache.drill.exec.vector.accessor.impl.HierarchicalFormatter; +import org.joda.time.Period; + +/** + * Column writer implementation that acts as the basis for the + * generated, vector-specific implementations. All set methods + * throw an exception; subclasses simply override the supported + * method(s). + * + * The only tricky part to this class is understanding the + * state of the write indexes as the write proceeds. There are + * two pointers to consider: + * + * lastWriteIndex: The position in the vector at which the + * client last asked us to write data. This index is maintained + * in this class because it depends only on the actions of this + * class. 
+ * vectorIndex: The position in the vector at which we will + * write if the client chooses to write a value at this time. + * The vector index is shared by all columns at the same repeat + * level. It is incremented as the client steps through the write + * and is observed in this class each time a write occurs. + * + * A repeat level is defined as any of the following: + * + * The set of top-level scalar columns, or those within a + * top-level, non-repeated map, or nested to any depth within + * non-repeated maps rooted at the top level. + * The values for a single scalar array. + * The set of scalar columns within a repeated map, or + * nested within non-repeated maps within a repeated map. + * + * Items at a repeat level index together and share a vector + * index. However, the columns within a repeat level + * do not share a last write index: some can lag further + * behind than others. + * + * Let's illustrate the states. Let's focus on one column and + * illustrate the three states that can occur during write: + * + * Behind: the last write index is more than one position behind + * the vector index. Zero-filling will be needed to catch up to + * the vector index. + * Written: the last write index is the same as the vector + * index because the client wrote data at this position (and previous + * values were back-filled with nulls, empties or zeros.) + * Unwritten: the last write index is one behind the vector + * index. This occurs when the column was written, then the client + * moved to the next row or array position. + * Restarted: The current row is abandoned (perhaps filtered + * out) and is to be rewritten. The last write position moves + * back one position. Note that, the Restarted state is + * indistinguishable from the unwritten state: the only real + * difference is that the current slot (pointed to by the + * vector index) contains the previous written value that must + * be overwritten or back-filled. 
But, this is fine, because we + * assume that unwritten values are garbage anyway. + * + * To illustrate: + * Behind / Written / Unwritten / Restarted + * |X| |X| |X| |X| + * lw >|X| |X| |X| |X| + * | | |0| |0| lw > |0| + * v >| | lw, v > |X| lw > |X| v > |X| + * v > | | + * + * The illustrated state transitions are: + * + * Suppose the state starts in Behind. + * If the client writes a value, then the empty slot is + * back-filled and the state moves to Written. + * If the client does not write a value, the state stays + * at Behind, and the gap of unfilled values
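The two-pointer scheme this javadoc describes — a shared vectorIndex plus a per-column lastWriteIndex, with back-filling for columns that fall Behind — can be sketched independently of Drill's vector classes. The class below is a simplified model with illustrative names; real writers operate on direct memory and handle nulls and variable-width values as well.

```java
// Models the Behind / Written / Unwritten states for one int column.
class FillEmptiesSketch {
  private final int[] values = new int[16];
  private int vectorIndex = 0;      // shared row position, advanced per row
  private int lastWriteIndex = -1;  // last slot this column actually wrote

  void setInt(int v) {
    // "Behind" state: zero-fill the gap up to the current vector index.
    for (int i = lastWriteIndex + 1; i < vectorIndex; i++) {
      values[i] = 0;
    }
    values[vectorIndex] = v;
    lastWriteIndex = vectorIndex;   // now in the "Written" state
  }

  // Moves Written -> Unwritten: lastWriteIndex is now one behind.
  void nextRow() { vectorIndex++; }

  int get(int i) { return values[i]; }
  int lastWriteIndex() { return lastWriteIndex; }
}
```

Skipping a row leaves the column Behind; the next write catches it up, which is why columns within one repeat level can each lag by different amounts while still sharing a single vector index.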
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r150753441 --- Diff: exec/vector/src/main/codegen/templates/ColumnAccessors.java --- @@ -191,141 +180,268 @@ public void bind(RowIndex vectorIndex, ValueVector vector) { <#if accessorType=="BigDecimal"> <#assign label="Decimal"> +<#if drillType == "VarChar" || drillType == "Var16Char"> + <#assign accessorType = "byte[]"> + <#assign label = "Bytes"> + <#if ! notyet> // // ${drillType} readers and writers - public static class ${drillType}ColumnReader extends AbstractColumnReader { + public static class ${drillType}ColumnReader extends BaseScalarReader { -<@bindReader "" drillType /> +<@bindReader "" drillType false /> -<@getType label /> +<@getType drillType label /> <@get drillType accessorType label false/> } - public static class Nullable${drillType}ColumnReader extends AbstractColumnReader { + public static class Nullable${drillType}ColumnReader extends BaseScalarReader { -<@bindReader "Nullable" drillType /> +<@bindReader "Nullable" drillType false /> -<@getType label /> +<@getType drillType label /> @Override public boolean isNull() { - return accessor().isNull(vectorIndex.index()); -} - -<@get drillType accessorType label false/> - } - - public static class Repeated${drillType}ColumnReader extends AbstractArrayReader { - -<@bindReader "Repeated" drillType /> - -<@getType label /> - -@Override -public int size() { - return accessor().getInnerValueCountAt(vectorIndex.index()); + return accessor().isNull(vectorIndex.vectorIndex()); } -<@get drillType accessorType label true/> +<@get drillType accessorType label false /> } - public static class ${drillType}ColumnWriter extends AbstractColumnWriter { + public static class Repeated${drillType}ColumnReader extends BaseElementReader { -<@bindWriter "" drillType /> +<@bindReader "" drillType true /> -<@getType label /> +<@getType drillType label /> -<@set drillType accessorType label false "set" /> 
+<@get drillType accessorType label true /> } - public static class Nullable${drillType}ColumnWriter extends AbstractColumnWriter { - -<@bindWriter "Nullable" drillType /> + <#assign varWidth = drillType == "VarChar" || drillType == "Var16Char" || drillType == "VarBinary" /> + <#if varWidth> + public static class ${drillType}ColumnWriter extends BaseVarWidthWriter { + <#else> + public static class ${drillType}ColumnWriter extends BaseFixedWidthWriter { +<#if drillType = "Decimal9" || drillType == "Decimal18" || + drillType == "Decimal28Sparse" || drillType == "Decimal38Sparse"> +private MajorType type; + +private static final int VALUE_WIDTH = ${drillType}Vector.VALUE_WIDTH; + +private final ${drillType}Vector vector; + +public ${drillType}ColumnWriter(final ValueVector vector) { + <#if varWidth> + super(((${drillType}Vector) vector).getOffsetVector()); + <#else> +<#if drillType = "Decimal9" || drillType == "Decimal18" || + drillType == "Decimal28Sparse" || drillType == "Decimal38Sparse"> + type = vector.getField().getType(); + + + this.vector = (${drillType}Vector) vector; +} -<@getType label /> +@Override public ValueVector vector() { return vector; } +<#-- All change of buffer comes through this function to allow capturing + the buffer address and capacity. Only two ways to set the buffer: + by binding to a vector in bindVector(), or by resizing the vector + in writeIndex(). --> @Override -public void setNull() { - mutator.setNull(vectorIndex.index()); +protected final void setAddr() { + final DrillBuf buf = vector.getBuffer(); + bufAddr = buf.addr(); + <#if varWidth> + capacity = buf.capacity(); + <#else> + <#-- Turns out that keeping track of capacity as the count of + values simplifies the per-value code path. --> + capacity = buf.capacity() / VALUE_WIDTH; + } -<@set drillType accessorType label true "set" /> - } - - public static class
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r150755238 --- Diff: exec/vector/src/main/java/org/apache/drill/exec/record/MaterializedField.java --- @@ -168,6 +174,58 @@ public boolean equals(Object obj) { Objects.equals(this.type, other.type); } + public boolean isEquivalent(MaterializedField other) { +if (! name.equalsIgnoreCase(other.name)) { + return false; +} + +// Requires full type equality, including fields such as precision and scale. +// But, unset fields are equivalent to 0. Can't use the protobuf-provided +// isEquals(), that treats set and unset fields as different. + +if (type.getMinorType() != other.type.getMinorType()) { + return false; +} +if (type.getMode() != other.type.getMode()) { + return false; +} +if (type.getScale() != other.type.getScale()) { + return false; +} +if (type.getPrecision() != other.type.getPrecision()) { + return false; +} + +// Compare children -- but only for maps, not the internal children +// for Varchar, repeated or nullable types. + +if (type.getMinorType() != MinorType.MAP) { + return true; +} + +if (children == null || other.children == null) { + return children == other.children; +} +if (children.size() != other.children.size()) { + return false; +} + +// Maps are name-based, not position. But, for our +// purposes, we insist on identical ordering. + +Iterator thisIter = children.iterator(); +Iterator otherIter = other.children.iterator(); +while (thisIter.hasNext()) { --- End diff -- The row set & writer abstractions require identical ordering so that column indexes are well-defined. Here we are facing the age-old philosophical question of "sameness." Sameness is instrumental: sameness-for-a-purpose. Here, we want to know whether two schemas are equivalent for the purpose of referencing columns by index. We recently did a fix elsewhere where we do use the looser definition: that A and B contain the same columns, but in possibly different orderings.
Added a comment to explain this. ---
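The strict, order-sensitive equivalence paul-rogers defends here can be modeled compactly. The sketch below strips away Drill's type/mode/precision checks and keeps only the name and ordering rules: names compare case-insensitively, and map children must match position by position so that column indexes stay well-defined. The class and fields are illustrative, not Drill's `MaterializedField`.

```java
import java.util.Arrays;
import java.util.List;

// Order-sensitive schema equivalence: same name (ignoring case) and the
// same children at the same indexes, recursively.
class SchemaSketch {
  final String name;
  final List<SchemaSketch> children;

  SchemaSketch(String name, SchemaSketch... children) {
    this.name = name;
    this.children = Arrays.asList(children);
  }

  boolean isEquivalent(SchemaSketch other) {
    if (!name.equalsIgnoreCase(other.name)) {
      return false;
    }
    if (children.size() != other.children.size()) {
      return false;
    }
    // Identical ordering required: compare position by position.
    for (int i = 0; i < children.size(); i++) {
      if (!children.get(i).isEquivalent(other.children.get(i))) {
        return false;
      }
    }
    return true;
  }
}
```

Under this definition, two maps with the same columns in different orders are NOT equivalent — exactly the property the row set and writer abstractions rely on for index-based column access.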
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r150758630 --- Diff: exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/BaseScalarWriter.java --- @@ -0,0 +1,264 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.vector.accessor.writer; + +import java.math.BigDecimal; + +import org.apache.drill.exec.vector.accessor.ColumnWriterIndex; +import org.apache.drill.exec.vector.accessor.impl.HierarchicalFormatter; +import org.joda.time.Period; + +/** + * Column writer implementation that acts as the basis for the + * generated, vector-specific implementations. All set methods + * throw an exception; subclasses simply override the supported + * method(s). + * + * The only tricky part to this class is understanding the + * state of the write indexes as the write proceeds. There are + * two pointers to consider: + * + * lastWriteIndex: The position in the vector at which the + * client last asked us to write data. This index is maintained + * in this class because it depends only on the actions of this + * class. 
+ * vectorIndex: The position in the vector at which we will + * write if the client chooses to write a value at this time. + * The vector index is shared by all columns at the same repeat + * level. It is incremented as the client steps through the write + * and is observed in this class each time a write occurs. + * + * A repeat level is defined as any of the following: + * + * The set of top-level scalar columns, or those within a + * top-level, non-repeated map, or nested to any depth within + * non-repeated maps rooted at the top level. + * The values for a single scalar array. + * The set of scalar columns within a repeated map, or + * nested within non-repeated maps within a repeated map. + * + * Items at a repeat level index together and share a vector + * index. However, the columns within a repeat level + * do not share a last write index: some can lag further + * behind than others. + * + * Let's illustrate the states. Let's focus on one column and + * illustrate the three states that can occur during write: + * + * Behind: the last write index is more than one position behind + * the vector index. Zero-filling will be needed to catch up to + * the vector index. + * Written: the last write index is the same as the vector + * index because the client wrote data at this position (and previous + * values were back-filled with nulls, empties or zeros.) + * Unwritten: the last write index is one behind the vector + * index. This occurs when the column was written, then the client + * moved to the next row or array position. + * Restarted: The current row is abandoned (perhaps filtered + * out) and is to be rewritten. The last write position moves + * back one position. Note that, the Restarted state is + * indistinguishable from the unwritten state: the only real + * difference is that the current slot (pointed to by the + * vector index) contains the previous written value that must + * be overwritten or back-filled. 
But, this is fine, because we + * assume that unwritten values are garbage anyway. + * + * To illustrate: + * Behind / Written / Unwritten / Restarted + * |X| |X| |X| |X| + * lw >|X| |X| |X| |X| + * | | |0| |0| lw > |0| + * v >| | lw, v > |X| lw > |X| v > |X| + * v > | | + * + * The illustrated state transitions are: + * + * Suppose the state starts in Behind. + * If the client writes a value, then the empty slot is + * back-filled and the state moves to Written. + * If the client does not write a value, the state stays + * at Behind, and the gap of unfilled values
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r150725372 --- Diff: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/test/RowSetTest.java --- @@ -19,420 +19,648 @@ import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertFalse; -import static org.junit.Assert.assertNotNull; -import static org.junit.Assert.assertNull; import static org.junit.Assert.assertSame; import static org.junit.Assert.assertTrue; +import static org.junit.Assert.fail; + +import java.io.UnsupportedEncodingException; -import org.apache.drill.common.types.TypeProtos.DataMode; import org.apache.drill.common.types.TypeProtos.MinorType; import org.apache.drill.exec.record.BatchSchema; +import org.apache.drill.exec.record.TupleMetadata; +import org.apache.drill.exec.vector.ValueVector; import org.apache.drill.exec.vector.accessor.ArrayReader; import org.apache.drill.exec.vector.accessor.ArrayWriter; -import org.apache.drill.exec.vector.accessor.TupleAccessor.TupleSchema; +import org.apache.drill.exec.vector.accessor.ObjectType; +import org.apache.drill.exec.vector.accessor.ScalarElementReader; +import org.apache.drill.exec.vector.accessor.ScalarReader; +import org.apache.drill.exec.vector.accessor.ScalarWriter; +import org.apache.drill.exec.vector.accessor.TupleReader; +import org.apache.drill.exec.vector.accessor.TupleWriter; +import org.apache.drill.exec.vector.accessor.ValueType; +import org.apache.drill.exec.vector.complex.MapVector; +import org.apache.drill.exec.vector.complex.RepeatedMapVector; import org.apache.drill.test.SubOperatorTest; import org.apache.drill.test.rowSet.RowSet.ExtendableRowSet; -import org.apache.drill.test.rowSet.RowSet.RowSetReader; -import org.apache.drill.test.rowSet.RowSet.RowSetWriter; import org.apache.drill.test.rowSet.RowSet.SingleRowSet; import org.apache.drill.test.rowSet.RowSetComparison; -import org.apache.drill.test.rowSet.RowSetSchema; -import 
org.apache.drill.test.rowSet.RowSetSchema.FlattenedSchema; -import org.apache.drill.test.rowSet.RowSetSchema.PhysicalSchema; +import org.apache.drill.test.rowSet.RowSetReader; +import org.apache.drill.test.rowSet.RowSetWriter; import org.apache.drill.test.rowSet.SchemaBuilder; +import org.bouncycastle.util.Arrays; --- End diff -- Eclipse auto-import... Fixed. ---
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r150753305 --- Diff: exec/vector/src/main/codegen/templates/ColumnAccessors.java --- @@ -191,141 +180,268 @@ public void bind(RowIndex vectorIndex, ValueVector vector) { <#if accessorType=="BigDecimal"> <#assign label="Decimal"> +<#if drillType == "VarChar" || drillType == "Var16Char"> + <#assign accessorType = "byte[]"> + <#assign label = "Bytes"> + <#if ! notyet> // // ${drillType} readers and writers - public static class ${drillType}ColumnReader extends AbstractColumnReader { + public static class ${drillType}ColumnReader extends BaseScalarReader { -<@bindReader "" drillType /> +<@bindReader "" drillType false /> -<@getType label /> +<@getType drillType label /> <@get drillType accessorType label false/> } - public static class Nullable${drillType}ColumnReader extends AbstractColumnReader { + public static class Nullable${drillType}ColumnReader extends BaseScalarReader { -<@bindReader "Nullable" drillType /> +<@bindReader "Nullable" drillType false /> -<@getType label /> +<@getType drillType label /> @Override public boolean isNull() { - return accessor().isNull(vectorIndex.index()); -} - -<@get drillType accessorType label false/> - } - - public static class Repeated${drillType}ColumnReader extends AbstractArrayReader { - -<@bindReader "Repeated" drillType /> - -<@getType label /> - -@Override -public int size() { - return accessor().getInnerValueCountAt(vectorIndex.index()); + return accessor().isNull(vectorIndex.vectorIndex()); } -<@get drillType accessorType label true/> +<@get drillType accessorType label false /> } - public static class ${drillType}ColumnWriter extends AbstractColumnWriter { + public static class Repeated${drillType}ColumnReader extends BaseElementReader { -<@bindWriter "" drillType /> +<@bindReader "" drillType true /> -<@getType label /> +<@getType drillType label /> -<@set drillType accessorType label false "set" /> 
+<@get drillType accessorType label true /> } - public static class Nullable${drillType}ColumnWriter extends AbstractColumnWriter { - -<@bindWriter "Nullable" drillType /> + <#assign varWidth = drillType == "VarChar" || drillType == "Var16Char" || drillType == "VarBinary" /> + <#if varWidth> + public static class ${drillType}ColumnWriter extends BaseVarWidthWriter { + <#else> + public static class ${drillType}ColumnWriter extends BaseFixedWidthWriter { +<#if drillType = "Decimal9" || drillType == "Decimal18" || + drillType == "Decimal28Sparse" || drillType == "Decimal38Sparse"> +private MajorType type; + +private static final int VALUE_WIDTH = ${drillType}Vector.VALUE_WIDTH; + +private final ${drillType}Vector vector; + +public ${drillType}ColumnWriter(final ValueVector vector) { + <#if varWidth> + super(((${drillType}Vector) vector).getOffsetVector()); + <#else> +<#if drillType = "Decimal9" || drillType == "Decimal18" || + drillType == "Decimal28Sparse" || drillType == "Decimal38Sparse"> + type = vector.getField().getType(); + + + this.vector = (${drillType}Vector) vector; +} -<@getType label /> +@Override public ValueVector vector() { return vector; } +<#-- All change of buffer comes through this function to allow capturing + the buffer address and capacity. Only two ways to set the buffer: + by binding to a vector in bindVector(), or by resizing the vector + in writeIndex(). --> @Override -public void setNull() { - mutator.setNull(vectorIndex.index()); +protected final void setAddr() { + final DrillBuf buf = vector.getBuffer(); + bufAddr = buf.addr(); + <#if varWidth> + capacity = buf.capacity(); + <#else> + <#-- Turns out that keeping track of capacity as the count of + values simplifies the per-value code path. --> + capacity = buf.capacity() / VALUE_WIDTH; + } -<@set drillType accessorType label true "set" /> - } - - public static class
Github user parthchandra commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r149764036 --- Diff: exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/BaseScalarWriter.java --- @@ -0,0 +1,264 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.vector.accessor.writer; + +import java.math.BigDecimal; + +import org.apache.drill.exec.vector.accessor.ColumnWriterIndex; +import org.apache.drill.exec.vector.accessor.impl.HierarchicalFormatter; +import org.joda.time.Period; + +/** + * Column writer implementation that acts as the basis for the + * generated, vector-specific implementations. All set methods + * throw an exception; subclasses simply override the supported + * method(s). + * + * The only tricky part to this class is understanding the + * state of the write indexes as the write proceeds. There are + * two pointers to consider: + * + * lastWriteIndex: The position in the vector at which the + * client last asked us to write data. This index is maintained + * in this class because it depends only on the actions of this + * class. 
+ * vectorIndex: The position in the vector at which we will + * write if the client chooses to write a value at this time. + * The vector index is shared by all columns at the same repeat + * level. It is incremented as the client steps through the write + * and is observed in this class each time a write occurs. + * + * A repeat level is defined as any of the following: + * + * The set of top-level scalar columns, or those within a + * top-level, non-repeated map, or nested to any depth within + * non-repeated maps rooted at the top level. + * The values for a single scalar array. + * The set of scalar columns within a repeated map, or + * nested within non-repeated maps within a repeated map. + * + * Items at a repeat level index together and share a vector + * index. However, the columns within a repeat level + * do not share a last write index: some can lag further + * behind than others. + * + * Let's illustrate the states. Let's focus on one column and + * illustrate the three states that can occur during write: + * + * Behind: the last write index is more than one position behind + * the vector index. Zero-filling will be needed to catch up to + * the vector index. + * Written: the last write index is the same as the vector + * index because the client wrote data at this position (and previous + * values were back-filled with nulls, empties or zeros.) + * Unwritten: the last write index is one behind the vector + * index. This occurs when the column was written, then the client + * moved to the next row or array position. + * Restarted: The current row is abandoned (perhaps filtered + * out) and is to be rewritten. The last write position moves + * back one position. Note that, the Restarted state is + * indistinguishable from the unwritten state: the only real + * difference is that the current slot (pointed to by the + * vector index) contains the previous written value that must + * be overwritten or back-filled. 
But, this is fine, because we
+ * assume that unwritten values are garbage anyway.
+ *
+ * To illustrate (lw = last write index, v = vector index):
+ *
+ *         Behind        Written       Unwritten       Restarted
+ *         |X|           |X|           |X|             |X|
+ *    lw > |X|           |X|           |X|             |X|
+ *         | |           |0|           |0|        lw > |0|
+ *    v  > | |   lw, v > |X|      lw > |X|        v  > |X|
+ *                                v  > | |
+ *
+ * The illustrated state transitions are:
+ *
+ * - Suppose the state starts in Behind.
+ * - If the client writes a value, then the empty slot is back-filled and
+ *   the state moves to Written.
+ * - If the client does not write a value, the state stays at Behind, and
+ *   the gap of unfilled values
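The lastWriteIndex/vectorIndex bookkeeping described above can be sketched in a few lines. This is a minimal illustration over a plain int array, assuming a simplified single-column writer; the class and method names are illustrative, not Drill's actual writer API.

```java
public class ScalarWriterSketch {
  private final int[] values = new int[16]; // stand-in for the vector buffer
  private int lastWriteIndex = -1;          // last position the client wrote

  /** Write at the shared vector index, zero-filling any gap ("Behind"). */
  public void setInt(int vectorIndex, int value) {
    while (lastWriteIndex < vectorIndex - 1) {
      values[++lastWriteIndex] = 0;         // back-fill skipped slots
    }
    values[vectorIndex] = value;            // now in the "Written" state
    lastWriteIndex = vectorIndex;
  }

  /** Abandon the current row ("Restarted"): step the last write back one. */
  public void restartRow(int vectorIndex) {
    if (lastWriteIndex == vectorIndex) {
      lastWriteIndex = vectorIndex - 1;     // old value stays as garbage
    }
  }

  public int get(int index) { return values[index]; }
}
```

The key point the sketch shows is that catch-up (back-filling) happens lazily, only when the next write occurs, so columns that lag behind cost nothing until they are written again.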
Github user parthchandra commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r149765040 --- Diff: exec/vector/src/main/java/org/apache/drill/exec/record/MaterializedField.java --- @@ -168,6 +174,58 @@ public boolean equals(Object obj) { Objects.equals(this.type, other.type); } + public boolean isEquivalent(MaterializedField other) { +if (! name.equalsIgnoreCase(other.name)) { + return false; +} + +// Requires full type equality, including fields such as precision and scale. +// But, unset fields are equivalent to 0. Can't use the protobuf-provided +// isEquals(), that treats set and unset fields as different. + +if (type.getMinorType() != other.type.getMinorType()) { + return false; +} +if (type.getMode() != other.type.getMode()) { + return false; +} +if (type.getScale() != other.type.getScale()) { + return false; +} +if (type.getPrecision() != other.type.getPrecision()) { + return false; +} + +// Compare children -- but only for maps, not the internal children +// for Varchar, repeated or nullable types. + +if (type.getMinorType() != MinorType.MAP) { + return true; +} + +if (children == null || other.children == null) { + return children == other.children; +} +if (children.size() != other.children.size()) { + return false; +} + +// Maps are name-based, not position. But, for our +// purposes, we insist on identical ordering. + +Iterator thisIter = children.iterator(); +Iterator otherIter = other.children.iterator(); +while (thisIter.hasNext()) { --- End diff -- isEquivalent requires identical ordering which is a stronger requirement than the guarantee that the children list is providing. Could we use contains() to find the child and then apply isEquivalent recursively? ---
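The reviewer's suggestion, matching children by name and then recursing with isEquivalent rather than insisting on identical ordering, can be sketched as follows. "Field" is a hypothetical stand-in for MaterializedField that models only the parts needed here.

```java
import java.util.Arrays;
import java.util.List;

class Field {
  final String name;
  final List<Field> children;

  Field(String name, Field... children) {
    this.name = name;
    this.children = Arrays.asList(children);
  }

  boolean isEquivalent(Field other) {
    if (!name.equalsIgnoreCase(other.name)) {
      return false;
    }
    if (children.size() != other.children.size()) {
      return false;
    }
    for (Field child : children) {
      // Look the matching child up by name rather than by position,
      // then compare recursively.
      Field match = other.children.stream()
          .filter(c -> c.name.equalsIgnoreCase(child.name))
          .findFirst()
          .orElse(null);
      if (match == null || !child.isEquivalent(match)) {
        return false;
      }
    }
    return true;
  }
}
```

With this shape, two maps whose children appear in different orders still compare as equivalent, which matches the name-based (not positional) semantics of maps that the original comment notes.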
Github user parthchandra commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r149758695 --- Diff: exec/vector/src/main/codegen/templates/ColumnAccessors.java --- @@ -191,141 +180,268 @@ public void bind(RowIndex vectorIndex, ValueVector vector) { <#if accessorType=="BigDecimal"> <#assign label="Decimal"> +<#if drillType == "VarChar" || drillType == "Var16Char"> + <#assign accessorType = "byte[]"> + <#assign label = "Bytes"> + <#if ! notyet> // // ${drillType} readers and writers - public static class ${drillType}ColumnReader extends AbstractColumnReader { + public static class ${drillType}ColumnReader extends BaseScalarReader { -<@bindReader "" drillType /> +<@bindReader "" drillType false /> -<@getType label /> +<@getType drillType label /> <@get drillType accessorType label false/> } - public static class Nullable${drillType}ColumnReader extends AbstractColumnReader { + public static class Nullable${drillType}ColumnReader extends BaseScalarReader { -<@bindReader "Nullable" drillType /> +<@bindReader "Nullable" drillType false /> -<@getType label /> +<@getType drillType label /> @Override public boolean isNull() { - return accessor().isNull(vectorIndex.index()); -} - -<@get drillType accessorType label false/> - } - - public static class Repeated${drillType}ColumnReader extends AbstractArrayReader { - -<@bindReader "Repeated" drillType /> - -<@getType label /> - -@Override -public int size() { - return accessor().getInnerValueCountAt(vectorIndex.index()); + return accessor().isNull(vectorIndex.vectorIndex()); } -<@get drillType accessorType label true/> +<@get drillType accessorType label false /> } - public static class ${drillType}ColumnWriter extends AbstractColumnWriter { + public static class Repeated${drillType}ColumnReader extends BaseElementReader { -<@bindWriter "" drillType /> +<@bindReader "" drillType true /> -<@getType label /> +<@getType drillType label /> -<@set drillType accessorType label false "set" /> 
+<@get drillType accessorType label true /> } - public static class Nullable${drillType}ColumnWriter extends AbstractColumnWriter { - -<@bindWriter "Nullable" drillType /> + <#assign varWidth = drillType == "VarChar" || drillType == "Var16Char" || drillType == "VarBinary" /> + <#if varWidth> + public static class ${drillType}ColumnWriter extends BaseVarWidthWriter { + <#else> + public static class ${drillType}ColumnWriter extends BaseFixedWidthWriter { +<#if drillType = "Decimal9" || drillType == "Decimal18" || + drillType == "Decimal28Sparse" || drillType == "Decimal38Sparse"> +private MajorType type; + +private static final int VALUE_WIDTH = ${drillType}Vector.VALUE_WIDTH; + +private final ${drillType}Vector vector; + +public ${drillType}ColumnWriter(final ValueVector vector) { + <#if varWidth> + super(((${drillType}Vector) vector).getOffsetVector()); + <#else> +<#if drillType = "Decimal9" || drillType == "Decimal18" || + drillType == "Decimal28Sparse" || drillType == "Decimal38Sparse"> + type = vector.getField().getType(); + + + this.vector = (${drillType}Vector) vector; +} -<@getType label /> +@Override public ValueVector vector() { return vector; } +<#-- All change of buffer comes through this function to allow capturing + the buffer address and capacity. Only two ways to set the buffer: + by binding to a vector in bindVector(), or by resizing the vector + in writeIndex(). --> @Override -public void setNull() { - mutator.setNull(vectorIndex.index()); +protected final void setAddr() { + final DrillBuf buf = vector.getBuffer(); + bufAddr = buf.addr(); + <#if varWidth> + capacity = buf.capacity(); + <#else> + <#-- Turns out that keeping track of capacity as the count of + values simplifies the per-value code path. --> + capacity = buf.capacity() / VALUE_WIDTH; + } -<@set drillType accessorType label true "set" /> - } - - public static class
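The template comment above notes that tracking capacity as a count of values, rather than bytes, simplifies the per-value code path for fixed-width writers. An illustrative sketch of that idea (not the generated Drill code; a plain byte array stands in for DrillBuf, and the names are made up) shows the bounds check reduced to a single integer compare, with byte math recomputed only on resize:

```java
import java.util.Arrays;

public class FixedWidthWriterSketch {
  private static final int VALUE_WIDTH = 4; // e.g. an INT vector
  private byte[] buf = new byte[8 * VALUE_WIDTH];
  private int capacity = buf.length / VALUE_WIDTH; // in values, as in setAddr()

  public void setInt(int index, int value) {
    if (index >= capacity) {          // the entire fast-path bounds check
      resize(index + 1);
    }
    final int offset = index * VALUE_WIDTH;
    buf[offset]     = (byte) value;
    buf[offset + 1] = (byte) (value >>> 8);
    buf[offset + 2] = (byte) (value >>> 16);
    buf[offset + 3] = (byte) (value >>> 24);
  }

  public int getInt(int index) {
    final int o = index * VALUE_WIDTH;
    return (buf[o] & 0xFF)
        | (buf[o + 1] & 0xFF) << 8
        | (buf[o + 2] & 0xFF) << 16
        | (buf[o + 3] & 0xFF) << 24;
  }

  private void resize(int minValues) {
    int newCap = capacity;
    while (newCap < minValues) {
      newCap *= 2;                    // double, as vector allocation does
    }
    buf = Arrays.copyOf(buf, newCap * VALUE_WIDTH);
    capacity = newCap;                // recompute the value-count capacity
  }
}
```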
Github user parthchandra commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r149760234 --- Diff: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/test/RowSetTest.java --- @@ -19,420 +19,648 @@ import static org.junit.Assert.assertEquals; import static org.junit.Assert.assertFalse; -import static org.junit.Assert.assertNotNull; -import static org.junit.Assert.assertNull; import static org.junit.Assert.assertSame; import static org.junit.Assert.assertTrue; +import static org.junit.Assert.fail; + +import java.io.UnsupportedEncodingException; -import org.apache.drill.common.types.TypeProtos.DataMode; import org.apache.drill.common.types.TypeProtos.MinorType; import org.apache.drill.exec.record.BatchSchema; +import org.apache.drill.exec.record.TupleMetadata; +import org.apache.drill.exec.vector.ValueVector; import org.apache.drill.exec.vector.accessor.ArrayReader; import org.apache.drill.exec.vector.accessor.ArrayWriter; -import org.apache.drill.exec.vector.accessor.TupleAccessor.TupleSchema; +import org.apache.drill.exec.vector.accessor.ObjectType; +import org.apache.drill.exec.vector.accessor.ScalarElementReader; +import org.apache.drill.exec.vector.accessor.ScalarReader; +import org.apache.drill.exec.vector.accessor.ScalarWriter; +import org.apache.drill.exec.vector.accessor.TupleReader; +import org.apache.drill.exec.vector.accessor.TupleWriter; +import org.apache.drill.exec.vector.accessor.ValueType; +import org.apache.drill.exec.vector.complex.MapVector; +import org.apache.drill.exec.vector.complex.RepeatedMapVector; import org.apache.drill.test.SubOperatorTest; import org.apache.drill.test.rowSet.RowSet.ExtendableRowSet; -import org.apache.drill.test.rowSet.RowSet.RowSetReader; -import org.apache.drill.test.rowSet.RowSet.RowSetWriter; import org.apache.drill.test.rowSet.RowSet.SingleRowSet; import org.apache.drill.test.rowSet.RowSetComparison; -import org.apache.drill.test.rowSet.RowSetSchema; -import 
org.apache.drill.test.rowSet.RowSetSchema.FlattenedSchema; -import org.apache.drill.test.rowSet.RowSetSchema.PhysicalSchema; +import org.apache.drill.test.rowSet.RowSetReader; +import org.apache.drill.test.rowSet.RowSetWriter; import org.apache.drill.test.rowSet.SchemaBuilder; +import org.bouncycastle.util.Arrays; --- End diff -- BouncyCastle arrays instead of Java Arrays ? ---
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r139296614 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/package-info.java --- @@ -0,0 +1,295 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/** + * Handles the details of the result set loader implementation. + * + * The primary purpose of this loader, and the most complex to understand and + * maintain, is overflow handling. + * + * Detailed Use Cases + * + * Let's examine it by considering a number of + * use cases. + * + * Rowabcdefgh + * n-2XX-- + * n-1 -- + * n X!O O O + * + * Here: + * + * n-2, n-1, and n are rows. n is the overflow row. + * X indicates a value was written before overflow. + * Blank indicates no value was written in that row. + * ! indicates the value that triggered overflow. + * - indicates a column that did not exist prior to overflow. + * + * Column a is written before overflow occurs, b causes overflow, and all other + * columns either are not written, or written after overflow. + * + * The scenarios, identified by column names above, are: + * + * a + * a contains values for all three rows. 
+ * + * Two values were written in the "main" batch, while a third was written to + * what becomes the overflow row. + * When overflow occurs, the last write position is at n. It must be moved + * back to n-1. + * Since data was written to the overflow row, it is copied to the look- + * ahead batch. + * The last write position in the lookahead batch is 0 (since data was + * copied into the 0th row). + * When harvesting, no empty-filling is needed. + * When starting the next batch, the last write position must be set to 0 to + * reflect the presence of the value for row n. + * + * + * b + * b contains values for all three rows. The value for row n triggers + * overflow. + * + * The last write position is at n-1, which is kept for the "main" + * vector. + * A new overflow vector is created and starts empty, with the last write + * position at -1. + * Once created, b is immediately written to the overflow vector, advancing + * the last write position to 0. + * Harvesting, and starting the next for column b works the same as column + * a. + * + * + * c + * Column c has values for all rows. + * + * The value for row n is written after overflow. + * At overflow, the last write position is at n-1. + * At overflow, a new lookahead vector is created with the last write + * position at -1. + * The value of c is written to the lookahead vector, advancing the last + * write position to 0. + * Harvesting, and starting the next for column c works the same as column + * a. + * + * + * d + * Column d writes values to the last two rows before overflow, but not to + * the overflow row. + * + * The last write position for the main batch is at n-1. + * The last write position in the lookahead batch remains at -1. + * Harvesting for column d requires filling an empty value for row n-1. + * When starting the next batch, the last write position must be set to -1, + * indicating no data yet written. 
+ * + * + * f + * Column f has no data in the last position of the main batch, and no data + * in the overflow row. + * + * The last write position is at n-2. + * An empty value must be written into position n-1 during harvest. + * On start of the next batch, the last write position starts at -1. + * + * + * g + * Column g is added after overflow, and has a value written to the overflow + * row. + * + * On harvest, column g is simply skipped. + * On start of the next row, the last write position can be left unchanged + * since no "exchange" was done. + * + * +
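The per-column bookkeeping in scenarios a through d above can be summarized in a small sketch. This is a minimal illustration only, assuming integer columns backed by plain lists; the names (main, lookahead, lastWrite) and structure are made up and do not reflect the actual result set loader classes.

```java
import java.util.ArrayList;
import java.util.List;

class ColumnOverflowSketch {
  final List<Integer> main = new ArrayList<>();      // the "main" batch
  final List<Integer> lookahead = new ArrayList<>(); // the look-ahead batch
  int lastWrite = -1;                                // last write position

  void write(int row, int value) {
    while (main.size() < row) {
      main.add(0);                  // back-fill empties up to the write row
    }
    main.add(value);
    lastWrite = row;
  }

  /** Overflow at row n: a value already written at row n (scenario a) is
   *  moved to slot 0 of the look-ahead batch; the main batch ends at n-1. */
  void overflow(int n) {
    if (lastWrite == n) {
      lookahead.add(main.remove(n));
      lastWrite = n - 1;
    }
  }

  /** Harvest the main batch: empty-fill through the last row (scenario f). */
  List<Integer> harvest(int rowCount) {
    while (main.size() < rowCount) {
      main.add(0);
    }
    return main;
  }
}
```

Under these assumptions, scenario a is write-write-write followed by overflow (the row-n value migrates to the look-ahead batch), while scenario b differs only in that the trigger value is written directly to the overflow vector after the exchange.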
Github user paul-rogers commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r139296587 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/package-info.java --- @@ -0,0 +1,295 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/** + * Handles the details of the result set loader implementation. + * + * The primary purpose of this loader, and the most complex to understand and + * maintain, is overflow handling. + * + * Detailed Use Cases + * + * Let's examine it by considering a number of + * use cases. + * + * Rowabcdefgh + * n-2XX-- + * n-1 -- + * n X!O O O + * + * Here: + * + * n-2, n-1, and n are rows. n is the overflow row. + * X indicates a value was written before overflow. + * Blank indicates no value was written in that row. + * ! indicates the value that triggered overflow. + * - indicates a column that did not exist prior to overflow. --- End diff -- Values added after overflow. Fixed in latest commit. ---
Github user bitblender commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r137852118 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/package-info.java --- @@ -0,0 +1,295 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/** + * Handles the details of the result set loader implementation. + * + * The primary purpose of this loader, and the most complex to understand and + * maintain, is overflow handling. + * + * Detailed Use Cases + * + * Let's examine it by considering a number of + * use cases. + * + * Rowabcdefgh + * n-2XX-- + * n-1 -- + * n X!O O O + * + * Here: + * + * n-2, n-1, and n are rows. n is the overflow row. + * X indicates a value was written before overflow. + * Blank indicates no value was written in that row. + * ! indicates the value that triggered overflow. + * - indicates a column that did not exist prior to overflow. --- End diff -- What does an 'O' value mean in the diagram above? ---
Github user bitblender commented on a diff in the pull request: https://github.com/apache/drill/pull/914#discussion_r137851895 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/package-info.java --- @@ -0,0 +1,295 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +/** + * Handles the details of the result set loader implementation. + * + * The primary purpose of this loader, and the most complex to understand and + * maintain, is overflow handling. + * + * Detailed Use Cases + * + * Let's examine it by considering a number of + * use cases. + * + * Rowabcdefgh + * n-2XX-- + * n-1 -- + * n X!O O O + * + * Here: + * + * n-2, n-1, and n are rows. n is the overflow row. + * X indicates a value was written before overflow. + * Blank indicates no value was written in that row. + * ! indicates the value that triggered overflow. + * - indicates a column that did not exist prior to overflow. + * + * Column a is written before overflow occurs, b causes overflow, and all other + * columns either are not written, or written after overflow. + * + * The scenarios, identified by column names above, are: + * + * a + * a contains values for all three rows. 
GitHub user paul-rogers opened a pull request: https://github.com/apache/drill/pull/914 DRILL-5657: Size-aware vector writer structure

This large PR provides another two levels of foundation for size-aware vector writers in the Drill record readers. It combines code from two previous PRs:

* PR 866 - DRILL-5657: Implement size-aware result set loader
* PR 887 - DRILL-5688: Add repeated map support to column accessors

The PR then goes on to integrate the two prior PRs and provide additional functionality. Like the two previous PRs, this one is divided into commits to group the work.

1. Accessor layer
2. Row Set layer
3. Tuple and Column Model layer
4. Row Set Loader layer
5. Secondary changes

Much of the material below appears in Javadoc throughout the code. The material here is not meant to replace that documentation. Instead, it is meant to provide the "big picture": placing the bits and pieces in context and pointing out interesting functionality to explore in each layer.

## Commit 1: The Accessor Layer

The first commit provides the core of the mechanism: the writers that put data into vectors, and the readers that retrieve that data. The version here is an evolution of the version provided in an earlier PR a few months ago.

### Overview of the Drill Vector Data Model

The code will make much more sense if we start with a review of Drill's complex vector data model. Drill has 38+ data ("minor") types as defined in the [proto buf](https://github.com/apache/drill/blob/master/protocol/src/main/protobuf/Types.proto) definition. Drill also has three cardinalities ("modes") defined in the same file. The result is over 120 different vector types. Then, when you add maps, repeated maps, lists and repeated lists, you rapidly get an explosion of types that the writer code must handle. 
Vectors can be categorized along multiple dimensions:

* By data (minor) type
* By cardinality (mode)
* By fixed or variable width

A repeated map, a list, a repeated list and any array (repeated) scalar are all array-like. Nullable and required modes are identical (single values), but a nullable has an additional is-set ("bit") vector.

A key contribution of this PR is the data model used to organize vectors.

* Both the top-level row and a Drill map are "tuples" and are treated similarly in the model.
* All non-map, non-list (that is, scalar) data types are treated uniformly.
* All arrays (whether a list, a repeated list, a repeated map, or a repeated scalar) are treated uniformly.

### Accessor Data Model

The above leads to a very simple, JSON-like data model, introduced in this PR.

* A tuple reader or writer models a row (usually via a subclass). Columns are accessible by name or position.
* Every column is modeled as an object.
* The object can have an object type: scalar, tuple or array.
* An array has a single element type (but many run-time elements).
* A scalar can be nullable or not, and provides a uniform get/set interface.

This data model is similar to, but has important differences from, the prior, generated readers and writers. The object layer is new: it is the simplest way to model the three "object types." An app using this code would use just the leaf scalar readers and writers.

Although there is quite a bit of code change here to provide the new structure, the core functionality of reading and writing to vectors has not changed much. And this code has extensive unit tests, which should avoid the need to "mentally execute" each line of code.

See the classes in `org.apache.drill.exec.vector.accessor` for details. In particular, please see the `package-info.java` file in that package for more information. As before, the top package provides a set of interfaces; the inner packages provide the implementation. 
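The JSON-like object model described above can be illustrated with a toy sketch: a row is a tuple, every column is an object, and an object is a scalar, a tuple (map), or an array. These classes are hypothetical stand-ins for illustration only; the real interfaces live in `org.apache.drill.exec.vector.accessor` and differ in detail.

```java
import java.util.HashMap;
import java.util.Map;

abstract class ColumnObject { }          // every column is modeled as an object

class ScalarObject extends ColumnObject {
  private Object value;
  public void set(Object v) { value = v; }   // uniform get/set interface
  public Object get() { return value; }
}

class TupleObject extends ColumnObject { // models both a row and a map
  private final Map<String, ColumnObject> columns = new HashMap<>();

  private ColumnObject column(String name, ColumnObject ifAbsent) {
    return columns.computeIfAbsent(name, k -> ifAbsent);
  }

  public ScalarObject scalar(String name) {
    return (ScalarObject) column(name, new ScalarObject());
  }

  public TupleObject map(String name) {
    return (TupleObject) column(name, new TupleObject());
  }
}
```

Because a map is just a nested tuple, the same access pattern works at any depth, which is the uniformity the PR description emphasizes.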
The `ColumnAccessors.java` template generates the per-vector code. Warning: this template has become quite cryptic; the best bet for review is to generate the Java code and review that.

### Writer Performance

During previous review, we discussed ways to optimize writer performance. This PR has two improvements:

* Completely rework the writers to minimize code steps
* Rework the "column loaders" to eliminate them: instead of two additional method calls, the "loader" now uses the column writers directly.

Much behind-the-scenes rework was needed to accomplish the above. Column readers, however, were left with their existing structure; we can apply the same optimizations to the readers in a later PR.

While doing the performance optimization, it became clear we can adopt a major simplification. Writers evolved from having three versions