[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-12-20 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/914


---


[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-22 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r152664121
  
--- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java ---
@@ -882,4 +882,71 @@ public void print(StringBuilder sb, int indent, Verbosity verbosity) {
 }
   }
 
+  // The "unsafe" methods are for use ONLY by code that does its own
+  // bounds checking. They are called "unsafe" for a reason: they will
+  // crash the JVM if values are addressed out of bounds.
+
+  /**
+   * Write an integer to the buffer at the given byte index, without
+   * bounds checks.
+   *
+   * @param index byte (not int) index of the location to write
+   * @param value the value to write
+   */
+
+  public void unsafePutInt(int index, int value) {
--- End diff --

Fixed.


---


[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-17 Thread bitblender
Github user bitblender commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r140644571
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/TupleState.java ---
@@ -0,0 +1,353 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.rowSet.impl;
+
+import java.util.ArrayList;
+import java.util.List;
+
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.physical.rowSet.impl.ColumnState.BaseMapColumnState;
+import org.apache.drill.exec.physical.rowSet.impl.ColumnState.MapArrayColumnState;
+import org.apache.drill.exec.physical.rowSet.impl.ColumnState.MapColumnState;
+import org.apache.drill.exec.record.ColumnMetadata;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.TupleMetadata;
+import org.apache.drill.exec.record.TupleSchema;
+import org.apache.drill.exec.record.TupleSchema.AbstractColumnMetadata;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.accessor.ObjectType;
+import org.apache.drill.exec.vector.accessor.ObjectWriter;
+import org.apache.drill.exec.vector.accessor.TupleWriter;
+import org.apache.drill.exec.vector.accessor.TupleWriter.TupleWriterListener;
+import org.apache.drill.exec.vector.accessor.impl.HierarchicalFormatter;
+import org.apache.drill.exec.vector.accessor.writer.AbstractObjectWriter;
+import org.apache.drill.exec.vector.accessor.writer.AbstractTupleWriter;
+import org.apache.drill.exec.vector.accessor.writer.ColumnWriterFactory;
+import org.apache.drill.exec.vector.complex.AbstractMapVector;
+
+public abstract class TupleState implements TupleWriterListener {
+
+  public static class RowState extends TupleState {
+
+/**
+ * The row-level writer for stepping through rows as they are written,
+ * and for accessing top-level columns.
+ */
+
+private final RowSetLoaderImpl writer;
+
+public RowState(ResultSetLoaderImpl rsLoader) {
+  super(rsLoader, rsLoader.projectionSet);
+  writer = new RowSetLoaderImpl(rsLoader, schema);
+  writer.bindListener(this);
+}
+
+public RowSetLoaderImpl rootWriter() { return writer; }
+
+@Override
+public AbstractTupleWriter writer() { return writer; }
+
+@Override
+public int innerCardinality() { return resultSetLoader.targetRowCount();}
+  }
+
+  public static class MapState extends TupleState {
+
+protected final AbstractMapVector mapVector;
+protected final BaseMapColumnState mapColumnState;
+protected int outerCardinality;
+
+public MapState(ResultSetLoaderImpl rsLoader,
+BaseMapColumnState mapColumnState,
+AbstractMapVector mapVector,
+ProjectionSet projectionSet) {
+  super(rsLoader, projectionSet);
+  this.mapVector = mapVector;
+  this.mapColumnState = mapColumnState;
+  mapColumnState.writer().bindListener(this);
+}
+
+@Override
+protected void columnAdded(ColumnState colState) {
+  @SuppressWarnings("resource")
+  ValueVector vector = colState.vector();
+
+  // Can't materialize the child if the map itself is
+  // not materialized.
+
+  assert mapVector != null || vector == null;
+  if (vector != null) {
+mapVector.putChild(vector.getField().getName(), vector);
+  }
+}
+
+@Override
+public AbstractTupleWriter writer() {
+  AbstractObjectWriter objWriter = mapColumnState.writer();
+  TupleWriter tupleWriter;
+  if (objWriter.type() == ObjectType.ARRAY) {
+tupleWriter = objWriter.array().tuple();
+  } else {
+tupleWriter = objWriter.tuple();
+  }
+  return (AbstractTupleWriter) tupleWriter;
+}
+
+@Override

[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-17 Thread bitblender
Github user bitblender commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r151762298
  
--- Diff: exec/vector/src/main/codegen/templates/ColumnAccessors.java ---
@@ -275,17 +273,17 @@ public boolean isNull() {
   final int offset = writeIndex(len);
   <#else>
   final int writeIndex = writeIndex();
-  <#assign putAddr = "bufAddr + writeIndex * VALUE_WIDTH">
+  <#assign putAddr = "writeIndex * VALUE_WIDTH">
--- End diff --

putAddr is not an address but an offset, right? It should be renamed.


---


[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-17 Thread bitblender
Github user bitblender commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r151762286
  
--- Diff: exec/memory/base/src/main/java/io/netty/buffer/DrillBuf.java ---
@@ -882,4 +882,71 @@ public void print(StringBuilder sb, int indent, Verbosity verbosity) {
 }
   }
 
+  // The "unsafe" methods are for use ONLY by code that does its own
+  // bounds checking. They are called "unsafe" for a reason: they will
+  // crash the JVM if values are addressed out of bounds.
+
+  /**
+   * Write an integer to the buffer at the given byte index, without
+   * bounds checks.
+   *
+   * @param index byte (not int) index of the location to write
+   * @param value the value to write
+   */
+
+  public void unsafePutInt(int index, int value) {
--- End diff --

The first argument in these unsafePutXXX methods is an offset, right? Should
'index' be changed to 'offset'?
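
For illustration only (not part of the PR): the byte-offset semantics under
discussion can be sketched with java.nio.ByteBuffer, whose absolute putInt
likewise takes a byte offset. ByteBuffer stands in here for DrillBuf; the
per-element offset math and the caller-side bounds check that an "unsafe" put
relies on look roughly like this:

```java
import java.nio.ByteBuffer;

// Sketch (illustrative, not Drill code): the first argument of a
// putInt-style call is a byte offset, not an element index. The caller
// performs the bounds check that an "unsafe" put would skip.
public class OffsetDemo {
    static final int VALUE_WIDTH = Integer.BYTES; // 4 bytes per int

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.allocate(8 * VALUE_WIDTH);
        for (int i = 0; i < 8; i++) {
            int offset = i * VALUE_WIDTH;                // byte offset of element i
            if (offset + VALUE_WIDTH > buf.capacity()) { // caller-side bounds check
                throw new IndexOutOfBoundsException();
            }
            buf.putInt(offset, i * 10);
        }
        System.out.println(buf.getInt(3 * VALUE_WIDTH)); // prints 30
    }
}
```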


---


[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-14 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r151019919
  
--- Diff: exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/BaseScalarWriter.java ---
@@ -0,0 +1,264 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.vector.accessor.writer;
+
+import java.math.BigDecimal;
+
+import org.apache.drill.exec.vector.accessor.ColumnWriterIndex;
+import org.apache.drill.exec.vector.accessor.impl.HierarchicalFormatter;
+import org.joda.time.Period;
+
+/**
+ * Column writer implementation that acts as the basis for the
+ * generated, vector-specific implementations. All set methods
+ * throw an exception; subclasses simply override the supported
+ * method(s).
+ * 
+ * The only tricky part to this class is understanding the
+ * state of the write indexes as the write proceeds. There are
+ * two pointers to consider:
+ * 
+ * lastWriteIndex: The position in the vector at which the
+ * client last asked us to write data. This index is maintained
+ * in this class because it depends only on the actions of this
+ * class.
+ * vectorIndex: The position in the vector at which we will
+ * write if the client chooses to write a value at this time.
+ * The vector index is shared by all columns at the same repeat
+ * level. It is incremented as the client steps through the write
+ * and is observed in this class each time a write occurs.
+ * 
+ * A repeat level is defined as any of the following:
+ * 
+ * The set of top-level scalar columns, or those within a
+ * top-level, non-repeated map, or nested to any depth within
+ * non-repeated maps rooted at the top level.
+ * The values for a single scalar array.
+ * The set of scalar columns within a repeated map, or
+ * nested within non-repeated maps within a repeated map.
+ * 
+ * Items at a repeat level index together and share a vector
+ * index. However, the columns within a repeat level
+ * do not share a last write index: some can lag further
+ * behind than others.
+ * 
+ * Let's illustrate the states. Let's focus on one column and
+ * illustrate the three states that can occur during write:
+ * 
+ * Behind: the last write index is more than one position behind
+ * the vector index. Zero-filling will be needed to catch up to
+ * the vector index.
+ * Written: the last write index is the same as the vector
+ * index because the client wrote data at this position (and previous
+ * values were back-filled with nulls, empties or zeros.)
+ * Unwritten: the last write index is one behind the vector
+ * index. This occurs when the column was written, then the client
+ * moved to the next row or array position.
+ * Restarted: The current row is abandoned (perhaps filtered
+ * out) and is to be rewritten. The last write position moves
+ * back one position. Note that, the Restarted state is
+ * indistinguishable from the unwritten state: the only real
+ * difference is that the current slot (pointed to by the
+ * vector index) contains the previous written value that must
+ * be overwritten or back-filled. But, this is fine, because we
+ * assume that unwritten values are garbage anyway.
+ * 
+ * To illustrate:
+ *  Behind  WrittenUnwrittenRestarted
+ *   |X|  |X| |X|  |X|
+ *   lw >|X|  |X| |X|  |X|
+ *   | |  |0| |0| lw > |0|
+ *v >| |  lw, v > |X|lw > |X|  v > |X|
+ *v > | |
+ * 
+ * The illustrated state transitions are:
+ * 
+ * Suppose the state starts in Behind.
+ *   If the client writes a value, then the empty slot is
+ *   back-filled and the state moves to Written.
+ *   If the client does not write a value, the state stays
+ *   at Behind, and the gap of unfilled values 
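
The Behind-to-Written transition the javadoc above describes can be sketched as
follows (hypothetical names; a simplification of the generated writers, with a
plain int array standing in for the vector buffer):

```java
// Sketch (hypothetical names, not the generated writer code): the
// "Behind" -> "Written" transition. Before writing at vectorIndex, the
// column back-fills the slots it skipped since its lastWriteIndex.
public class FillSketch {
    static int[] values = new int[16]; // stands in for the fixed-width buffer
    static int lastWriteIndex = -1;    // last slot this column wrote

    static void setInt(int vectorIndex, int value) {
        for (int i = lastWriteIndex + 1; i < vectorIndex; i++) {
            values[i] = 0;             // zero-fill the gap ("Behind" state)
        }
        values[vectorIndex] = value;   // now "Written": lastWriteIndex == vectorIndex
        lastWriteIndex = vectorIndex;
    }

    public static void main(String[] args) {
        setInt(0, 7);   // Written at slot 0
        setInt(3, 9);   // was Behind: slots 1-2 zero-filled, then slot 3 written
        System.out.println(values[1] + " " + values[3]); // prints "0 9"
    }
}
```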

[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-14 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r150753441
  
--- Diff: exec/vector/src/main/codegen/templates/ColumnAccessors.java ---
@@ -191,141 +180,268 @@ public void bind(RowIndex vectorIndex, ValueVector vector) {
 <#if accessorType=="BigDecimal">
   <#assign label="Decimal">
 
+<#if drillType == "VarChar" || drillType == "Var16Char">
+  <#assign accessorType = "byte[]">
+  <#assign label = "Bytes">
+
 <#if ! notyet>
   
//
   // ${drillType} readers and writers
 
-  public static class ${drillType}ColumnReader extends 
AbstractColumnReader {
+  public static class ${drillType}ColumnReader extends BaseScalarReader {
 
-<@bindReader "" drillType />
+<@bindReader "" drillType false />
 
-<@getType label />
+<@getType drillType label />
 
 <@get drillType accessorType label false/>
   }
 
-  public static class Nullable${drillType}ColumnReader extends 
AbstractColumnReader {
+  public static class Nullable${drillType}ColumnReader extends 
BaseScalarReader {
 
-<@bindReader "Nullable" drillType />
+<@bindReader "Nullable" drillType false />
 
-<@getType label />
+<@getType drillType label />
 
 @Override
 public boolean isNull() {
-  return accessor().isNull(vectorIndex.index());
-}
-
-<@get drillType accessorType label false/>
-  }
-
-  public static class Repeated${drillType}ColumnReader extends 
AbstractArrayReader {
-
-<@bindReader "Repeated" drillType />
-
-<@getType label />
-
-@Override
-public int size() {
-  return accessor().getInnerValueCountAt(vectorIndex.index());
+  return accessor().isNull(vectorIndex.vectorIndex());
 }
 
-<@get drillType accessorType label true/>
+<@get drillType accessorType label false />
   }
 
-  public static class ${drillType}ColumnWriter extends 
AbstractColumnWriter {
+  public static class Repeated${drillType}ColumnReader extends 
BaseElementReader {
 
-<@bindWriter "" drillType />
+<@bindReader "" drillType true />
 
-<@getType label />
+<@getType drillType label />
 
-<@set drillType accessorType label false "set" />
+<@get drillType accessorType label true />
   }
 
-  public static class Nullable${drillType}ColumnWriter extends 
AbstractColumnWriter {
-
-<@bindWriter "Nullable" drillType />
+  <#assign varWidth = drillType == "VarChar" || drillType == 
"Var16Char" || drillType == "VarBinary" />
+  <#if varWidth>
+  public static class ${drillType}ColumnWriter extends BaseVarWidthWriter {
+  <#else>
+  public static class ${drillType}ColumnWriter extends 
BaseFixedWidthWriter {
+<#if drillType = "Decimal9" || drillType == "Decimal18" ||
+ drillType == "Decimal28Sparse" || drillType == 
"Decimal38Sparse">
+private MajorType type;
+
+private static final int VALUE_WIDTH = ${drillType}Vector.VALUE_WIDTH;
+  
+private final ${drillType}Vector vector;
+
+public ${drillType}ColumnWriter(final ValueVector vector) {
+  <#if varWidth>
+  super(((${drillType}Vector) vector).getOffsetVector());
+  <#else>
+<#if drillType = "Decimal9" || drillType == "Decimal18" ||
+ drillType == "Decimal28Sparse" || drillType == 
"Decimal38Sparse">
+  type = vector.getField().getType();
+
+  
+  this.vector = (${drillType}Vector) vector;
+}
 
-<@getType label />
+@Override public ValueVector vector() { return vector; }
 
+<#-- All change of buffer comes through this function to allow 
capturing
+ the buffer address and capacity. Only two ways to set the 
buffer:
+ by binding to a vector in bindVector(), or by resizing the 
vector
+ in writeIndex(). -->
 @Override
-public void setNull() {
-  mutator.setNull(vectorIndex.index());
+protected final void setAddr() {
+  final DrillBuf buf = vector.getBuffer();
+  bufAddr = buf.addr();
+  <#if varWidth>
+  capacity = buf.capacity();
+  <#else>
+  <#-- Turns out that keeping track of capacity as the count of
+   values simplifies the per-value code path. -->
+  capacity = buf.capacity() / VALUE_WIDTH;
+  
 }
 
-<@set drillType accessorType label true "set" />
-  }
-
-  public static class 

[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-14 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r150755238
  
--- Diff: exec/vector/src/main/java/org/apache/drill/exec/record/MaterializedField.java ---
@@ -168,6 +174,58 @@ public boolean equals(Object obj) {
 Objects.equals(this.type, other.type);
   }
 
+  public boolean isEquivalent(MaterializedField other) {
+if (! name.equalsIgnoreCase(other.name)) {
+  return false;
+}
+
+// Requires full type equality, including fields such as precision and 
scale.
+// But, unset fields are equivalent to 0. Can't use the 
protobuf-provided
+// isEquals(), that treats set and unset fields as different.
+
+if (type.getMinorType() != other.type.getMinorType()) {
+  return false;
+}
+if (type.getMode() != other.type.getMode()) {
+  return false;
+}
+if (type.getScale() != other.type.getScale()) {
+  return false;
+}
+if (type.getPrecision() != other.type.getPrecision()) {
+  return false;
+}
+
+// Compare children -- but only for maps, not the internal children
+// for Varchar, repeated or nullable types.
+
+if (type.getMinorType() != MinorType.MAP) {
+  return true;
+}
+
+if (children == null  ||  other.children == null) {
+  return children == other.children;
+}
+if (children.size() != other.children.size()) {
+  return false;
+}
+
+// Maps are name-based, not position. But, for our
+// purposes, we insist on identical ordering.
+
+Iterator thisIter = children.iterator();
+Iterator otherIter = other.children.iterator();
+while (thisIter.hasNext()) {
--- End diff --

The row set & writer abstractions require identical ordering so that column
indexes are well-defined. Here we are facing the age-old philosophical question
of "sameness." Sameness is instrumental: sameness-for-a-purpose. Here, we want
to know whether two schemas are equivalent for the purpose of referencing
columns by index. We recently did a fix elsewhere where we use the looser
definition: that A and B contain the same columns, but possibly in different
orders. Added a comment to explain this.
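
For illustration only: the two notions of sameness described above can be
sketched as follows (column names stand in for full MaterializedField
comparison; this is not Drill's isEquivalent() implementation):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

// Sketch (illustrative): strict sameness requires identical column
// ordering so that indexes line up; loose sameness only requires the
// same set of columns, in any order.
public class SchemaSameness {
    static boolean strict(List<String> a, List<String> b) {
        return a.equals(b);                               // order matters
    }
    static boolean loose(List<String> a, List<String> b) {
        return new HashSet<>(a).equals(new HashSet<>(b)); // order ignored
    }

    public static void main(String[] args) {
        List<String> s1 = Arrays.asList("a", "b");
        List<String> s2 = Arrays.asList("b", "a");
        System.out.println(strict(s1, s2) + " " + loose(s1, s2)); // prints "false true"
    }
}
```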


---


[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-14 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r150758630
  
--- Diff: exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/BaseScalarWriter.java ---
@@ -0,0 +1,264 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.vector.accessor.writer;
+
+import java.math.BigDecimal;
+
+import org.apache.drill.exec.vector.accessor.ColumnWriterIndex;
+import org.apache.drill.exec.vector.accessor.impl.HierarchicalFormatter;
+import org.joda.time.Period;
+
+/**
+ * Column writer implementation that acts as the basis for the
+ * generated, vector-specific implementations. All set methods
+ * throw an exception; subclasses simply override the supported
+ * method(s).
+ * 
+ * The only tricky part to this class is understanding the
+ * state of the write indexes as the write proceeds. There are
+ * two pointers to consider:
+ * 
+ * lastWriteIndex: The position in the vector at which the
+ * client last asked us to write data. This index is maintained
+ * in this class because it depends only on the actions of this
+ * class.
+ * vectorIndex: The position in the vector at which we will
+ * write if the client chooses to write a value at this time.
+ * The vector index is shared by all columns at the same repeat
+ * level. It is incremented as the client steps through the write
+ * and is observed in this class each time a write occurs.
+ * 
+ * A repeat level is defined as any of the following:
+ * 
+ * The set of top-level scalar columns, or those within a
+ * top-level, non-repeated map, or nested to any depth within
+ * non-repeated maps rooted at the top level.
+ * The values for a single scalar array.
+ * The set of scalar columns within a repeated map, or
+ * nested within non-repeated maps within a repeated map.
+ * 
+ * Items at a repeat level index together and share a vector
+ * index. However, the columns within a repeat level
+ * do not share a last write index: some can lag further
+ * behind than others.
+ * 
+ * Let's illustrate the states. Let's focus on one column and
+ * illustrate the three states that can occur during write:
+ * 
+ * Behind: the last write index is more than one position behind
+ * the vector index. Zero-filling will be needed to catch up to
+ * the vector index.
+ * Written: the last write index is the same as the vector
+ * index because the client wrote data at this position (and previous
+ * values were back-filled with nulls, empties or zeros.)
+ * Unwritten: the last write index is one behind the vector
+ * index. This occurs when the column was written, then the client
+ * moved to the next row or array position.
+ * Restarted: The current row is abandoned (perhaps filtered
+ * out) and is to be rewritten. The last write position moves
+ * back one position. Note that, the Restarted state is
+ * indistinguishable from the unwritten state: the only real
+ * difference is that the current slot (pointed to by the
+ * vector index) contains the previous written value that must
+ * be overwritten or back-filled. But, this is fine, because we
+ * assume that unwritten values are garbage anyway.
+ * 
+ * To illustrate:
+ *  Behind  WrittenUnwrittenRestarted
+ *   |X|  |X| |X|  |X|
+ *   lw >|X|  |X| |X|  |X|
+ *   | |  |0| |0| lw > |0|
+ *v >| |  lw, v > |X|lw > |X|  v > |X|
+ *v > | |
+ * 
+ * The illustrated state transitions are:
+ * 
+ * Suppose the state starts in Behind.
+ *   If the client writes a value, then the empty slot is
+ *   back-filled and the state moves to Written.
+ *   If the client does not write a value, the state stays
+ *   at Behind, and the gap of unfilled values 

[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-14 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r150725372
  
--- Diff: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/test/RowSetTest.java ---
@@ -19,420 +19,648 @@
 
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertFalse;
-import static org.junit.Assert.assertNotNull;
-import static org.junit.Assert.assertNull;
 import static org.junit.Assert.assertSame;
 import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+
+import java.io.UnsupportedEncodingException;
 
-import org.apache.drill.common.types.TypeProtos.DataMode;
 import org.apache.drill.common.types.TypeProtos.MinorType;
 import org.apache.drill.exec.record.BatchSchema;
+import org.apache.drill.exec.record.TupleMetadata;
+import org.apache.drill.exec.vector.ValueVector;
 import org.apache.drill.exec.vector.accessor.ArrayReader;
 import org.apache.drill.exec.vector.accessor.ArrayWriter;
-import org.apache.drill.exec.vector.accessor.TupleAccessor.TupleSchema;
+import org.apache.drill.exec.vector.accessor.ObjectType;
+import org.apache.drill.exec.vector.accessor.ScalarElementReader;
+import org.apache.drill.exec.vector.accessor.ScalarReader;
+import org.apache.drill.exec.vector.accessor.ScalarWriter;
+import org.apache.drill.exec.vector.accessor.TupleReader;
+import org.apache.drill.exec.vector.accessor.TupleWriter;
+import org.apache.drill.exec.vector.accessor.ValueType;
+import org.apache.drill.exec.vector.complex.MapVector;
+import org.apache.drill.exec.vector.complex.RepeatedMapVector;
 import org.apache.drill.test.SubOperatorTest;
 import org.apache.drill.test.rowSet.RowSet.ExtendableRowSet;
-import org.apache.drill.test.rowSet.RowSet.RowSetReader;
-import org.apache.drill.test.rowSet.RowSet.RowSetWriter;
 import org.apache.drill.test.rowSet.RowSet.SingleRowSet;
 import org.apache.drill.test.rowSet.RowSetComparison;
-import org.apache.drill.test.rowSet.RowSetSchema;
-import org.apache.drill.test.rowSet.RowSetSchema.FlattenedSchema;
-import org.apache.drill.test.rowSet.RowSetSchema.PhysicalSchema;
+import org.apache.drill.test.rowSet.RowSetReader;
+import org.apache.drill.test.rowSet.RowSetWriter;
 import org.apache.drill.test.rowSet.SchemaBuilder;
+import org.bouncycastle.util.Arrays;
--- End diff --

Eclipse auto-import... Fixed.


---


[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-14 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r150753305
  
--- Diff: exec/vector/src/main/codegen/templates/ColumnAccessors.java ---
@@ -191,141 +180,268 @@ public void bind(RowIndex vectorIndex, ValueVector vector) {
 <#if accessorType=="BigDecimal">
   <#assign label="Decimal">
 
+<#if drillType == "VarChar" || drillType == "Var16Char">
+  <#assign accessorType = "byte[]">
+  <#assign label = "Bytes">
+
 <#if ! notyet>
   
//
   // ${drillType} readers and writers
 
-  public static class ${drillType}ColumnReader extends 
AbstractColumnReader {
+  public static class ${drillType}ColumnReader extends BaseScalarReader {
 
-<@bindReader "" drillType />
+<@bindReader "" drillType false />
 
-<@getType label />
+<@getType drillType label />
 
 <@get drillType accessorType label false/>
   }
 
-  public static class Nullable${drillType}ColumnReader extends 
AbstractColumnReader {
+  public static class Nullable${drillType}ColumnReader extends 
BaseScalarReader {
 
-<@bindReader "Nullable" drillType />
+<@bindReader "Nullable" drillType false />
 
-<@getType label />
+<@getType drillType label />
 
 @Override
 public boolean isNull() {
-  return accessor().isNull(vectorIndex.index());
-}
-
-<@get drillType accessorType label false/>
-  }
-
-  public static class Repeated${drillType}ColumnReader extends 
AbstractArrayReader {
-
-<@bindReader "Repeated" drillType />
-
-<@getType label />
-
-@Override
-public int size() {
-  return accessor().getInnerValueCountAt(vectorIndex.index());
+  return accessor().isNull(vectorIndex.vectorIndex());
 }
 
-<@get drillType accessorType label true/>
+<@get drillType accessorType label false />
   }
 
-  public static class ${drillType}ColumnWriter extends 
AbstractColumnWriter {
+  public static class Repeated${drillType}ColumnReader extends 
BaseElementReader {
 
-<@bindWriter "" drillType />
+<@bindReader "" drillType true />
 
-<@getType label />
+<@getType drillType label />
 
-<@set drillType accessorType label false "set" />
+<@get drillType accessorType label true />
   }
 
-  public static class Nullable${drillType}ColumnWriter extends 
AbstractColumnWriter {
-
-<@bindWriter "Nullable" drillType />
+  <#assign varWidth = drillType == "VarChar" || drillType == 
"Var16Char" || drillType == "VarBinary" />
+  <#if varWidth>
+  public static class ${drillType}ColumnWriter extends BaseVarWidthWriter {
+  <#else>
+  public static class ${drillType}ColumnWriter extends 
BaseFixedWidthWriter {
+<#if drillType = "Decimal9" || drillType == "Decimal18" ||
+ drillType == "Decimal28Sparse" || drillType == 
"Decimal38Sparse">
+private MajorType type;
+
+private static final int VALUE_WIDTH = ${drillType}Vector.VALUE_WIDTH;
+  
+private final ${drillType}Vector vector;
+
+public ${drillType}ColumnWriter(final ValueVector vector) {
+  <#if varWidth>
+  super(((${drillType}Vector) vector).getOffsetVector());
+  <#else>
+<#if drillType = "Decimal9" || drillType == "Decimal18" ||
+ drillType == "Decimal28Sparse" || drillType == 
"Decimal38Sparse">
+  type = vector.getField().getType();
+
+  
+  this.vector = (${drillType}Vector) vector;
+}
 
-<@getType label />
+@Override public ValueVector vector() { return vector; }
 
+<#-- All change of buffer comes through this function to allow 
capturing
+ the buffer address and capacity. Only two ways to set the 
buffer:
+ by binding to a vector in bindVector(), or by resizing the 
vector
+ in writeIndex(). -->
 @Override
-public void setNull() {
-  mutator.setNull(vectorIndex.index());
+protected final void setAddr() {
+  final DrillBuf buf = vector.getBuffer();
+  bufAddr = buf.addr();
+  <#if varWidth>
+  capacity = buf.capacity();
+  <#else>
+  <#-- Turns out that keeping track of capacity as the count of
+   values simplifies the per-value code path. -->
+  capacity = buf.capacity() / VALUE_WIDTH;
+  
 }
 
-<@set drillType accessorType label true "set" />
-  }
-
-  public static class 

[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-10 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r149764036
  
--- Diff: exec/vector/src/main/java/org/apache/drill/exec/vector/accessor/writer/BaseScalarWriter.java ---
@@ -0,0 +1,264 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.vector.accessor.writer;
+
+import java.math.BigDecimal;
+
+import org.apache.drill.exec.vector.accessor.ColumnWriterIndex;
+import org.apache.drill.exec.vector.accessor.impl.HierarchicalFormatter;
+import org.joda.time.Period;
+
+/**
+ * Column writer implementation that acts as the basis for the
+ * generated, vector-specific implementations. All set methods
+ * throw an exception; subclasses simply override the supported
+ * method(s).
+ * 
+ * The only tricky part to this class is understanding the
+ * state of the write indexes as the write proceeds. There are
+ * two pointers to consider:
+ * 
+ * lastWriteIndex: The position in the vector at which the
+ * client last asked us to write data. This index is maintained
+ * in this class because it depends only on the actions of this
+ * class.
+ * vectorIndex: The position in the vector at which we will
+ * write if the client chooses to write a value at this time.
+ * The vector index is shared by all columns at the same repeat
+ * level. It is incremented as the client steps through the write
+ * and is observed in this class each time a write occurs.
+ * 
+ * A repeat level is defined as any of the following:
+ * 
+ * The set of top-level scalar columns, or those within a
+ * top-level, non-repeated map, or nested to any depth within
+ * non-repeated maps rooted at the top level.
+ * The values for a single scalar array.
+ * The set of scalar columns within a repeated map, or
+ * nested within non-repeated maps within a repeated map.
+ * 
+ * Items at a repeat level index together and share a vector
+ * index. However, the columns within a repeat level
+ * do not share a last write index: some can lag further
+ * behind than others.
+ * 
+ * Let's illustrate the states. Let's focus on one column and
+ * illustrate the four states that can occur during a write:
+ * 
+ * Behind: the last write index is more than one position behind
+ * the vector index. Zero-filling will be needed to catch up to
+ * the vector index.
+ * Written: the last write index is the same as the vector
+ * index because the client wrote data at this position (and previous
+ * values were back-filled with nulls, empties, or zeros).
+ * Unwritten: the last write index is one behind the vector
+ * index. This occurs when the column was written, then the client
+ * moved to the next row or array position.
+ * Restarted: The current row is abandoned (perhaps filtered
+ * out) and is to be rewritten. The last write position moves
+ * back one position. Note that the Restarted state is
+ * indistinguishable from the unwritten state: the only real
+ * difference is that the current slot (pointed to by the
+ * vector index) contains the previous written value that must
+ * be overwritten or back-filled. But, this is fine, because we
+ * assume that unwritten values are garbage anyway.
+ * 
+ * To illustrate:
+ *      Behind       Written      Unwritten    Restarted
+ *       |X|           |X|          |X|          |X|
+ *  lw > |X|           |X|          |X|          |X|
+ *       | |           |0|          |0|     lw > |0|
+ *   v > | |   lw, v > |X|    lw > |X|      v > |X|
+ *                                 v > | |
+ * 
+ * The illustrated state transitions are:
+ * 
+ * Suppose the state starts in Behind.
+ *   If the client writes a value, then the empty slot is
+ *   back-filled and the state moves to Written.
+ *   If the client does not write a value, the state stays
+ *   at Behind, and the gap of unfilled values 
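
The last-write-index bookkeeping described in this Javadoc can be sketched in plain Java. This is a standalone mimic under assumed names (WriterStateSketch, setInt, nextRow, restartRow are illustrative, not the actual Drill writer API); it shows how a write back-fills skipped slots to catch up from the "Behind" state:

```java
import java.util.Arrays;

public class WriterStateSketch {
    private final int[] values = new int[16]; // stands in for a fixed-width vector
    private int lastWriteIndex = -1;          // last slot actually written
    private int vectorIndex = 0;              // shared write position for the repeat level

    // Back-fill any skipped slots with zero ("Behind" -> catch up),
    // then write the value ("Written").
    public void setInt(int value) {
        for (int i = lastWriteIndex + 1; i < vectorIndex; i++) {
            values[i] = 0;
        }
        values[vectorIndex] = value;
        lastWriteIndex = vectorIndex;
    }

    // Moves the column to the "Unwritten" state.
    public void nextRow() { vectorIndex++; }

    // Abandon the current row: "Restarted". Unwritten values are
    // treated as garbage, so the slot is simply left to be overwritten.
    public void restartRow() {
        lastWriteIndex = Math.min(lastWriteIndex, vectorIndex - 1);
    }

    public int get(int index) { return values[index]; }

    public static void main(String[] args) {
        WriterStateSketch w = new WriterStateSketch();
        w.setInt(10); w.nextRow();
        w.nextRow();                          // row 1 skipped: writer falls "Behind"
        w.setInt(30);                         // back-fills row 1 with 0
        System.out.println(Arrays.toString(
            Arrays.copyOf(w.values, 3)));     // [10, 0, 30]
    }
}
```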

[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-10 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r149765040
  
--- Diff: 
exec/vector/src/main/java/org/apache/drill/exec/record/MaterializedField.java 
---
@@ -168,6 +174,58 @@ public boolean equals(Object obj) {
 Objects.equals(this.type, other.type);
   }
 
+  public boolean isEquivalent(MaterializedField other) {
+if (! name.equalsIgnoreCase(other.name)) {
+  return false;
+}
+
+// Requires full type equality, including fields such as precision and scale.
+// But, unset fields are equivalent to 0. Can't use the protobuf-provided
+// isEquals(), that treats set and unset fields as different.
+
+if (type.getMinorType() != other.type.getMinorType()) {
+  return false;
+}
+if (type.getMode() != other.type.getMode()) {
+  return false;
+}
+if (type.getScale() != other.type.getScale()) {
+  return false;
+}
+if (type.getPrecision() != other.type.getPrecision()) {
+  return false;
+}
+
+// Compare children -- but only for maps, not the internal children
+// for Varchar, repeated or nullable types.
+
+if (type.getMinorType() != MinorType.MAP) {
+  return true;
+}
+
+if (children == null  ||  other.children == null) {
+  return children == other.children;
+}
+if (children.size() != other.children.size()) {
+  return false;
+}
+
+// Maps are name-based, not position. But, for our
+// purposes, we insist on identical ordering.
+
+Iterator<MaterializedField> thisIter = children.iterator();
+Iterator<MaterializedField> otherIter = other.children.iterator();
+while (thisIter.hasNext()) {
--- End diff --

isEquivalent requires identical ordering, which is a stronger requirement 
than the children list guarantees. Could we use contains() to find the 
child and then apply isEquivalent recursively?
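
The order-insensitive alternative suggested here can be sketched in a few lines. This is a hypothetical standalone mimic (FieldSketch stands in for MaterializedField; only name matching is shown, not the full type checks): for each child, look up the other map's child by name, then recurse.

```java
import java.util.Arrays;
import java.util.List;

public class FieldSketch {
    final String name;
    final List<FieldSketch> children;

    FieldSketch(String name, FieldSketch... children) {
        this.name = name;
        this.children = Arrays.asList(children);
    }

    boolean isEquivalent(FieldSketch other) {
        if (!name.equalsIgnoreCase(other.name)) {
            return false;
        }
        if (children.size() != other.children.size()) {
            return false;
        }
        // Order-insensitive: find the matching child by name,
        // then compare the two children recursively.
        for (FieldSketch child : children) {
            FieldSketch match = other.children.stream()
                .filter(c -> c.name.equalsIgnoreCase(child.name))
                .findFirst().orElse(null);
            if (match == null || !child.isEquivalent(match)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Maps with the same children in a different order compare equal.
        FieldSketch m1 = new FieldSketch("m", new FieldSketch("a"), new FieldSketch("b"));
        FieldSketch m2 = new FieldSketch("m", new FieldSketch("b"), new FieldSketch("a"));
        System.out.println(m1.isEquivalent(m2)); // true
    }
}
```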



---


[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-10 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r149758695
  
--- Diff: exec/vector/src/main/codegen/templates/ColumnAccessors.java ---
@@ -191,141 +180,268 @@ public void bind(RowIndex vectorIndex, ValueVector 
vector) {
 <#if accessorType=="BigDecimal">
   <#assign label="Decimal">
 
+<#if drillType == "VarChar" || drillType == "Var16Char">
+  <#assign accessorType = "byte[]">
+  <#assign label = "Bytes">
+
 <#if ! notyet>
   
//
   // ${drillType} readers and writers
 
-  public static class ${drillType}ColumnReader extends 
AbstractColumnReader {
+  public static class ${drillType}ColumnReader extends BaseScalarReader {
 
-<@bindReader "" drillType />
+<@bindReader "" drillType false />
 
-<@getType label />
+<@getType drillType label />
 
 <@get drillType accessorType label false/>
   }
 
-  public static class Nullable${drillType}ColumnReader extends 
AbstractColumnReader {
+  public static class Nullable${drillType}ColumnReader extends 
BaseScalarReader {
 
-<@bindReader "Nullable" drillType />
+<@bindReader "Nullable" drillType false />
 
-<@getType label />
+<@getType drillType label />
 
 @Override
 public boolean isNull() {
-  return accessor().isNull(vectorIndex.index());
-}
-
-<@get drillType accessorType label false/>
-  }
-
-  public static class Repeated${drillType}ColumnReader extends 
AbstractArrayReader {
-
-<@bindReader "Repeated" drillType />
-
-<@getType label />
-
-@Override
-public int size() {
-  return accessor().getInnerValueCountAt(vectorIndex.index());
+  return accessor().isNull(vectorIndex.vectorIndex());
 }
 
-<@get drillType accessorType label true/>
+<@get drillType accessorType label false />
   }
 
-  public static class ${drillType}ColumnWriter extends 
AbstractColumnWriter {
+  public static class Repeated${drillType}ColumnReader extends 
BaseElementReader {
 
-<@bindWriter "" drillType />
+<@bindReader "" drillType true />
 
-<@getType label />
+<@getType drillType label />
 
-<@set drillType accessorType label false "set" />
+<@get drillType accessorType label true />
   }
 
-  public static class Nullable${drillType}ColumnWriter extends 
AbstractColumnWriter {
-
-<@bindWriter "Nullable" drillType />
+  <#assign varWidth = drillType == "VarChar" || drillType == 
"Var16Char" || drillType == "VarBinary" />
+  <#if varWidth>
+  public static class ${drillType}ColumnWriter extends BaseVarWidthWriter {
+  <#else>
+  public static class ${drillType}ColumnWriter extends 
BaseFixedWidthWriter {
+<#if drillType == "Decimal9" || drillType == "Decimal18" ||
+ drillType == "Decimal28Sparse" || drillType == "Decimal38Sparse">
+private MajorType type;
+
+private static final int VALUE_WIDTH = ${drillType}Vector.VALUE_WIDTH;
+  
+private final ${drillType}Vector vector;
+
+public ${drillType}ColumnWriter(final ValueVector vector) {
+  <#if varWidth>
+  super(((${drillType}Vector) vector).getOffsetVector());
+  <#else>
+<#if drillType == "Decimal9" || drillType == "Decimal18" ||
+ drillType == "Decimal28Sparse" || drillType == "Decimal38Sparse">
+  type = vector.getField().getType();
+
+  
+  this.vector = (${drillType}Vector) vector;
+}
 
-<@getType label />
+@Override public ValueVector vector() { return vector; }
 
+<#-- All change of buffer comes through this function to allow capturing
+ the buffer address and capacity. Only two ways to set the buffer:
+ by binding to a vector in bindVector(), or by resizing the vector
+ in writeIndex(). -->
 @Override
-public void setNull() {
-  mutator.setNull(vectorIndex.index());
+protected final void setAddr() {
+  final DrillBuf buf = vector.getBuffer();
+  bufAddr = buf.addr();
+  <#if varWidth>
+  capacity = buf.capacity();
+  <#else>
+  <#-- Turns out that keeping track of capacity as the count of
+   values simplifies the per-value code path. -->
+  capacity = buf.capacity() / VALUE_WIDTH;
+  
 }
 
-<@set drillType accessorType label true "set" />
-  }
-
-  public static class 

[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-10 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r149759147
  
--- Diff: exec/vector/src/main/codegen/templates/ColumnAccessors.java ---
@@ -191,141 +180,268 @@ public void bind(RowIndex vectorIndex, ValueVector 
vector) {
 <#if accessorType=="BigDecimal">
   <#assign label="Decimal">
 
+<#if drillType == "VarChar" || drillType == "Var16Char">
+  <#assign accessorType = "byte[]">
+  <#assign label = "Bytes">
+
 <#if ! notyet>
   
//
   // ${drillType} readers and writers
 
-  public static class ${drillType}ColumnReader extends 
AbstractColumnReader {
+  public static class ${drillType}ColumnReader extends BaseScalarReader {
 
-<@bindReader "" drillType />
+<@bindReader "" drillType false />
 
-<@getType label />
+<@getType drillType label />
 
 <@get drillType accessorType label false/>
   }
 
-  public static class Nullable${drillType}ColumnReader extends 
AbstractColumnReader {
+  public static class Nullable${drillType}ColumnReader extends 
BaseScalarReader {
 
-<@bindReader "Nullable" drillType />
+<@bindReader "Nullable" drillType false />
 
-<@getType label />
+<@getType drillType label />
 
 @Override
 public boolean isNull() {
-  return accessor().isNull(vectorIndex.index());
-}
-
-<@get drillType accessorType label false/>
-  }
-
-  public static class Repeated${drillType}ColumnReader extends 
AbstractArrayReader {
-
-<@bindReader "Repeated" drillType />
-
-<@getType label />
-
-@Override
-public int size() {
-  return accessor().getInnerValueCountAt(vectorIndex.index());
+  return accessor().isNull(vectorIndex.vectorIndex());
 }
 
-<@get drillType accessorType label true/>
+<@get drillType accessorType label false />
   }
 
-  public static class ${drillType}ColumnWriter extends 
AbstractColumnWriter {
+  public static class Repeated${drillType}ColumnReader extends 
BaseElementReader {
 
-<@bindWriter "" drillType />
+<@bindReader "" drillType true />
 
-<@getType label />
+<@getType drillType label />
 
-<@set drillType accessorType label false "set" />
+<@get drillType accessorType label true />
   }
 
-  public static class Nullable${drillType}ColumnWriter extends 
AbstractColumnWriter {
-
-<@bindWriter "Nullable" drillType />
+  <#assign varWidth = drillType == "VarChar" || drillType == 
"Var16Char" || drillType == "VarBinary" />
+  <#if varWidth>
+  public static class ${drillType}ColumnWriter extends BaseVarWidthWriter {
+  <#else>
+  public static class ${drillType}ColumnWriter extends 
BaseFixedWidthWriter {
+<#if drillType == "Decimal9" || drillType == "Decimal18" ||
+ drillType == "Decimal28Sparse" || drillType == "Decimal38Sparse">
+private MajorType type;
+
+private static final int VALUE_WIDTH = ${drillType}Vector.VALUE_WIDTH;
+  
+private final ${drillType}Vector vector;
+
+public ${drillType}ColumnWriter(final ValueVector vector) {
+  <#if varWidth>
+  super(((${drillType}Vector) vector).getOffsetVector());
+  <#else>
+<#if drillType == "Decimal9" || drillType == "Decimal18" ||
+ drillType == "Decimal28Sparse" || drillType == "Decimal38Sparse">
+  type = vector.getField().getType();
+
+  
+  this.vector = (${drillType}Vector) vector;
+}
 
-<@getType label />
+@Override public ValueVector vector() { return vector; }
 
+<#-- All change of buffer comes through this function to allow capturing
+ the buffer address and capacity. Only two ways to set the buffer:
+ by binding to a vector in bindVector(), or by resizing the vector
+ in writeIndex(). -->
 @Override
-public void setNull() {
-  mutator.setNull(vectorIndex.index());
+protected final void setAddr() {
+  final DrillBuf buf = vector.getBuffer();
+  bufAddr = buf.addr();
+  <#if varWidth>
+  capacity = buf.capacity();
+  <#else>
+  <#-- Turns out that keeping track of capacity as the count of
+   values simplifies the per-value code path. -->
+  capacity = buf.capacity() / VALUE_WIDTH;
+  
 }
 
-<@set drillType accessorType label true "set" />
-  }
-
-  public static class 

[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-11-10 Thread parthchandra
Github user parthchandra commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r149760234
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/rowSet/test/RowSetTest.java 
---
@@ -19,420 +19,648 @@
 
 import static org.junit.Assert.assertEquals;
 import static org.junit.Assert.assertFalse;
-import static org.junit.Assert.assertNotNull;
-import static org.junit.Assert.assertNull;
 import static org.junit.Assert.assertSame;
 import static org.junit.Assert.assertTrue;
+import static org.junit.Assert.fail;
+
+import java.io.UnsupportedEncodingException;
 
-import org.apache.drill.common.types.TypeProtos.DataMode;
 import org.apache.drill.common.types.TypeProtos.MinorType;
 import org.apache.drill.exec.record.BatchSchema;
+import org.apache.drill.exec.record.TupleMetadata;
+import org.apache.drill.exec.vector.ValueVector;
 import org.apache.drill.exec.vector.accessor.ArrayReader;
 import org.apache.drill.exec.vector.accessor.ArrayWriter;
-import org.apache.drill.exec.vector.accessor.TupleAccessor.TupleSchema;
+import org.apache.drill.exec.vector.accessor.ObjectType;
+import org.apache.drill.exec.vector.accessor.ScalarElementReader;
+import org.apache.drill.exec.vector.accessor.ScalarReader;
+import org.apache.drill.exec.vector.accessor.ScalarWriter;
+import org.apache.drill.exec.vector.accessor.TupleReader;
+import org.apache.drill.exec.vector.accessor.TupleWriter;
+import org.apache.drill.exec.vector.accessor.ValueType;
+import org.apache.drill.exec.vector.complex.MapVector;
+import org.apache.drill.exec.vector.complex.RepeatedMapVector;
 import org.apache.drill.test.SubOperatorTest;
 import org.apache.drill.test.rowSet.RowSet.ExtendableRowSet;
-import org.apache.drill.test.rowSet.RowSet.RowSetReader;
-import org.apache.drill.test.rowSet.RowSet.RowSetWriter;
 import org.apache.drill.test.rowSet.RowSet.SingleRowSet;
 import org.apache.drill.test.rowSet.RowSetComparison;
-import org.apache.drill.test.rowSet.RowSetSchema;
-import org.apache.drill.test.rowSet.RowSetSchema.FlattenedSchema;
-import org.apache.drill.test.rowSet.RowSetSchema.PhysicalSchema;
+import org.apache.drill.test.rowSet.RowSetReader;
+import org.apache.drill.test.rowSet.RowSetWriter;
 import org.apache.drill.test.rowSet.SchemaBuilder;
+import org.bouncycastle.util.Arrays;
--- End diff --

BouncyCastle Arrays instead of Java Arrays?


---


[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-09-16 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r139296614
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/package-info.java
 ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Handles the details of the result set loader implementation.
+ * 
+ * The primary purpose of this loader, and the most complex to understand 
and
+ * maintain, is overflow handling.
+ *
+ * Detailed Use Cases
+ *
+ * Let's examine it by considering a number of
+ * use cases.
+ * 
+ * 
Rowabcdefgh
+ * 
n-2XX--
+ * n-1  
--
+ * n  X!O O 
O 
+ * 
+ * Here:
+ * 
+ * n-2, n-1, and n are rows. n is the overflow row.
+ * X indicates a value was written before overflow.
+ * Blank indicates no value was written in that row.
+ * ! indicates the value that triggered overflow.
+ * - indicates a column that did not exist prior to overflow.
+ * 
+ * Column a is written before overflow occurs, b causes overflow, and all 
other
+ * columns either are not written, or written after overflow.
+ * 
+ * The scenarios, identified by column names above, are:
+ * 
+ * a
+ * a contains values for all three rows.
+ * 
+ * Two values were written in the "main" batch, while a third was 
written to
+ * what becomes the overflow row.
+ * When overflow occurs, the last write position is at n. It must be 
moved
+ * back to n-1.
+ * Since data was written to the overflow row, it is copied to the 
look-
+ * ahead batch.
+ * The last write position in the lookahead batch is 0 (since data was
+ * copied into the 0th row).
+ * When harvesting, no empty-filling is needed.
+ * When starting the next batch, the last write position must be set 
to 0 to
+ * reflect the presence of the value for row n.
+ * 
+ * 
+ * b
+ * b contains values for all three rows. The value for row n triggers
+ * overflow.
+ * 
+ * The last write position is at n-1, which is kept for the "main"
+ * vector.
+ * A new overflow vector is created and starts empty, with the last 
write
+ * position at -1.
+ * Once created, b is immediately written to the overflow vector, 
advancing
+ * the last write position to 0.
+ * Harvesting, and starting the next for column b works the same as 
column
+ * a.
+ * 
+ * 
+ * c
+ * Column c has values for all rows.
+ * 
+ * The value for row n is written after overflow.
+ * At overflow, the last write position is at n-1.
+ * At overflow, a new lookahead vector is created with the last write
+ * position at -1.
+ * The value of c is written to the lookahead vector, advancing the last
+ * write position to 0.
+ * Harvesting, and starting the next for column c works the same as 
column
+ * a.
+ * 
+ * 
+ * d
+ * Column d writes values to the last two rows before overflow, but 
not to
+ * the overflow row.
+ * 
+ * The last write position for the main batch is at n-1.
+ * The last write position in the lookahead batch remains at -1.
+ * Harvesting for column d requires filling an empty value for row 
n-1.
+ * When starting the next batch, the last write position must be set 
to -1,
+ * indicating no data yet written.
+ * 
+ * 
+ * f
+ * Column f has no data in the last position of the main batch, and no 
data
+ * in the overflow row.
+ * 
+ * The last write position is at n-2.
+ * An empty value must be written into position n-1 during 
harvest.
+ * On start of the next batch, the last write position starts at 
-1.
+ * 
+ * 
+ * g
+ * Column g is added after overflow, and has a value written to the 
overflow
+ * row.
+ * 
+ * On harvest, column g is simply skipped.
+ * On start of the next row, the last write position can be left 
unchanged
+ * since no "exchange" was done.
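
The per-column rollover behavior walked through in cases "a" and "d" above can be sketched as a tiny standalone model. The names here (rollOver, startNextBatch, lastWriteIndex) are illustrative only, not the actual ResultSetLoader API: a value already written in the overflow row moves to the look-ahead batch, and the last write position is pulled back into the main batch.

```java
import java.util.ArrayList;
import java.util.List;

public class OverflowSketch {
    List<Integer> main = new ArrayList<>();      // the "main" batch
    List<Integer> lookAhead = new ArrayList<>(); // the look-ahead (overflow) batch
    int lastWriteIndex = -1;

    void write(int value) {
        main.add(value);
        lastWriteIndex = main.size() - 1;
    }

    // Row overflowRow becomes the overflow row. A value written there is
    // copied to the look-ahead batch (case "a"); otherwise the last write
    // position simply stays within the main batch (case "d").
    void rollOver(int overflowRow) {
        if (lastWriteIndex >= overflowRow) {
            lookAhead.add(main.remove(overflowRow));
        }
        lastWriteIndex = Math.min(lastWriteIndex, overflowRow - 1);
    }

    // The look-ahead batch becomes the new main batch; the last write
    // position is 0 if a value was carried over, else -1.
    void startNextBatch() {
        main = lookAhead;
        lookAhead = new ArrayList<>();
        lastWriteIndex = main.size() - 1;
    }

    public static void main(String[] args) {
        OverflowSketch a = new OverflowSketch();
        a.write(1); a.write(2); a.write(3);   // rows 0..2; row 2 is the overflow row
        a.rollOver(2);
        System.out.println(a.main + " " + a.lastWriteIndex); // [1, 2] 1
        a.startNextBatch();
        System.out.println(a.main + " " + a.lastWriteIndex); // [3] 0
    }
}
```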
+ * 
+ * 
+ 

[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-09-16 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r139296587
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/package-info.java
 ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Handles the details of the result set loader implementation.
+ * 
+ * The primary purpose of this loader, and the most complex to understand 
and
+ * maintain, is overflow handling.
+ *
+ * Detailed Use Cases
+ *
+ * Let's examine it by considering a number of
+ * use cases.
+ * 
+ * 
Rowabcdefgh
+ * 
n-2XX--
+ * n-1  
--
+ * n  X!O O 
O 
+ * 
+ * Here:
+ * 
+ * n-2, n-1, and n are rows. n is the overflow row.
+ * X indicates a value was written before overflow.
+ * Blank indicates no value was written in that row.
+ * ! indicates the value that triggered overflow.
+ * - indicates a column that did not exist prior to overflow.
--- End diff --

Values added after overflow.
Fixed in latest commit.


---


[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-09-08 Thread bitblender
Github user bitblender commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r137852118
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/package-info.java
 ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Handles the details of the result set loader implementation.
+ * 
+ * The primary purpose of this loader, and the most complex to understand 
and
+ * maintain, is overflow handling.
+ *
+ * Detailed Use Cases
+ *
+ * Let's examine it by considering a number of
+ * use cases.
+ * 
+ * 
Rowabcdefgh
+ * 
n-2XX--
+ * n-1  
--
+ * n  X!O O 
O 
+ * 
+ * Here:
+ * 
+ * n-2, n-1, and n are rows. n is the overflow row.
+ * X indicates a value was written before overflow.
+ * Blank indicates no value was written in that row.
+ * ! indicates the value that triggered overflow.
+ * - indicates a column that did not exist prior to overflow.
--- End diff --

What does an 'O' value mean in the diagram above?


---


[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-09-08 Thread bitblender
Github user bitblender commented on a diff in the pull request:

https://github.com/apache/drill/pull/914#discussion_r137851895
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/package-info.java
 ---
@@ -0,0 +1,295 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+/**
+ * Handles the details of the result set loader implementation.
+ * 
+ * The primary purpose of this loader, and the most complex to understand 
and
+ * maintain, is overflow handling.
+ *
+ * Detailed Use Cases
+ *
+ * Let's examine it by considering a number of
+ * use cases.
+ * 
+ * 
Rowabcdefgh
+ * 
n-2XX--
+ * n-1  
--
+ * n  X!O O 
O 
+ * 
+ * Here:
+ * 
+ * n-2, n-1, and n are rows. n is the overflow row.
+ * X indicates a value was written before overflow.
+ * Blank indicates no value was written in that row.
+ * ! indicates the value that triggered overflow.
+ * - indicates a column that did not exist prior to overflow.
+ * 
+ * Column a is written before overflow occurs, b causes overflow, and all 
other
+ * columns either are not written, or written after overflow.
+ * 
+ * The scenarios, identified by column names above, are:
+ * 
+ * a
+ * a contains values for all three rows.
+ * 
+ * Two values were written in the "main" batch, while a third was 
written to
+ * what becomes the overflow row.
+ * When overflow occurs, the last write position is at n. It must be 
moved
+ * back to n-1.
+ * Since data was written to the overflow row, it is copied to the 
look-
+ * ahead batch.
+ * The last write position in the lookahead batch is 0 (since data was
+ * copied into the 0th row).
+ * When harvesting, no empty-filling is needed.
+ * When starting the next batch, the last write position must be set 
to 0 to
+ * reflect the presence of the value for row n.
+ * 
+ * 
+ * b
+ * b contains values for all three rows. The value for row n triggers
+ * overflow.
+ * 
+ * The last write position is at n-1, which is kept for the "main"
+ * vector.
+ * A new overflow vector is created and starts empty, with the last 
write
+ * position at -1.
+ * Once created, b is immediately written to the overflow vector, 
advancing
+ * the last write position to 0.
+ * Harvesting, and starting the next for column b works the same as 
column
+ * a.
+ * 
+ * 
+ * c
+ * Column c has values for all rows.
+ * 
+ * The value for row n is written after overflow.
+ * At overflow, the last write position is at n-1.
+ * At overflow, a new lookahead vector is created with the last write
+ * position at -1.
+ * The value of c is written to the lookahead vector, advancing the last
+ * write position to 0.
+ * Harvesting, and starting the next for column c works the same as 
column
+ * a.
+ * 
+ * 
+ * d
+ * Column d writes values to the last two rows before overflow, but 
not to
+ * the overflow row.
+ * 
+ * The last write position for the main batch is at n-1.
+ * The last write position in the lookahead batch remains at -1.
+ * Harvesting for column d requires filling an empty value for row 
n-1.
+ * When starting the next batch, the last write position must be set 
to -1,
+ * indicating no data yet written.
+ * 
+ * 
+ * f
+ * Column f has no data in the last position of the main batch, and no 
data
+ * in the overflow row.
+ * 
+ * The last write position is at n-2.
+ * An empty value must be written into position n-1 during 
harvest.
+ * On start of the next batch, the last write position starts at 
-1.
+ * 
+ * 
+ * g
+ * Column g is added after overflow, and has a value written to the 
overflow
+ * row.
+ * 
+ * On harvest, column g is simply skipped.
+ * On start of the next row, the last write position can be left 
unchanged
+ * since no "exchange" was done.
+ * 
+ * 
+ * 

[GitHub] drill pull request #914: DRILL-5657: Size-aware vector writer structure

2017-08-17 Thread paul-rogers
GitHub user paul-rogers opened a pull request:

https://github.com/apache/drill/pull/914

DRILL-5657: Size-aware vector writer structure

This large PR provides another two levels of foundation for size-aware 
vector writers in the Drill record readers. It combines code from two previous 
PRs:

* PR 866 - DRILL-5657: Implement size-aware result set loader
* PR 887 - DRILL-5688: Add repeated map support to column accessors

The PR then goes on to integrate the two prior PRs and provide additional 
functionality.

Like the two previous PRs, this one is divided into commits to group the 
work.

1. Accessor layer
2. Row Set layer
3. Tuple and Column Model layer
4. Row Set Loader layer
5. Secondary changes

Much of the material below appears in Javadoc throughout the code. The 
material here is not meant to replace that documentation. Instead, it is meant 
to provide the “big picture”: placing the bits and pieces in context and 
pointing out interesting functionality to explore in each layer.

## Commit 1: The Accessor Layer

The first commit provides the core of the mechanism: the writers that put 
data into vectors, and the readers that retrieve that data. The version here is 
an evolution of the version provided in an earlier PR a few months ago.

### Overview of the Drill Vector Data Model

The code will make much more sense if we start with a review of Drill’s 
complex vector data model. Drill has 38+ data (“minor”) types as defined in 
the [proto 
buf](https://github.com/apache/drill/blob/master/protocol/src/main/protobuf/Types.proto)
 definition. Drill also has three cardinalities (“modes”) defined in the 
same file. The result is over 120+ different vector types. Then, when you add 
maps, repeated maps, lists and repeated lists, you rapidly get an explosion of 
types that the writer code must handle.

Vectors can be categorized along multiple dimensions:

* By data (minor) type
* By cardinality (mode)
* By fixed or variable width

A repeated map, a list, a repeated list and any array (repeated) scalar all 
are array-like. Nullable and required modes are identical (single values), but 
a nullable has an additional is-set (“bit”) vector.

A key contribution of this PR is the data model used to organize vectors.

* Both the top-level row, and a Drill map are “tuples” and are treated 
similarly in the model.
* All non-map, non-list (that is, scalar) data types are treated uniformly.
* All arrays (whether a list, a repeated list, a repeated map, or a 
repeated scalar) are treated uniformly.

### Accessor Data Model

The above leads to a very simple, JSON-like data model, introduced in this 
PR.

* A tuple reader or writer models a row. (Usually via a subclass.) Column 
are accessible by name or position.
* Every column is modeled as an object.
* The object can have an object type: scalar, tuple or array.
* An array has a single element type (but many run-time elements).
* A scalar can be nullable or not, and provides a uniform get/set interface.

This data model is similar to, but has important differences from, the 
prior generated readers and writers. 

The object layer is new: it is the simplest way to model the three 
“object types.” An app using this code would use just the leaf scalar 
readers and writers.
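
The JSON-like object model described above can be mimicked in miniature. These tiny interfaces are a hypothetical sketch of the shape of the contracts only (they are not Drill's TupleWriter/ScalarWriter classes): a tuple exposes columns by name, and the client writes through leaf scalar writers.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class ObjectModelSketch {
    // Scalar: uniform set interface regardless of nullability.
    interface ScalarWriter { void setInt(int v); }

    // Tuple: models both a row and a Drill map; columns by name.
    static class TupleWriter {
        final Map<String, ScalarWriter> cols = new LinkedHashMap<>();
        ScalarWriter scalar(String name) { return cols.get(name); }
    }

    public static void main(String[] args) {
        TupleWriter row = new TupleWriter();
        final int[] vector = new int[1];       // stands in for a value vector
        row.cols.put("a", v -> vector[0] = v); // bind column "a" to its vector
        row.scalar("a").setInt(42);            // client writes via the leaf writer
        System.out.println(vector[0]);         // 42
    }
}
```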

Although there is quite a bit of code change here to provide the new 
structure the core functionality of reading and writing to vectors has not 
changed much. And, this code has extensive unit tests, which should avoid the 
need to "mentally execute" each line of code.

See the classes in `org.apache.drill.exec.vector.accessor` for details. In 
particular, please see the `package-info.java` file in that package for more 
information.

As before, the top package provides a set of interfaces; the inner packages 
provide the implementation. The `ColumnAccessors.java` template generates the 
per-vector code. Warning: this template has become quite cryptic: the best bet 
for review is to generate the Java code and review that.

### Writer Performance

During previous review, we discussed ways to optimize writer performance. 
This PR has two improvements:

* Completely rework the writers to minimize code steps
* Rework the “column loaders” to eliminate them: instead of two 
additional method calls, the “loader” now uses the column writers directly.

Much behind-the-scenes rework was needed to accomplish the above.

Column readers, however, were left with their existing structure; we can 
apply the same optimizations to the readers in a later PR.

While doing the performance optimization, it became clear we can adopt a 
major simplification. Writers evolved from having three versions