[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-03-07 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/drill/pull/1125


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-03-04 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r172105917
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -199,12 +422,18 @@ public String toString() {
.append(", per-array: ")
.append(estElementCountPerArray);
   }
-  buf .append(", std size: ")
-  .append(stdSize)
-  .append(", actual size: ")
-  .append(estSize)
-  .append(", data size: ")
-  .append(dataSize)
+  buf .append(", std size per entry: ")
+  .append(getStdDataSizePerEntry())
+  .append(", std net size per entry: ")
+  .append(getStdNetSizePerEntry())
+  .append(", data size per entry: ")
+  .append(getDataSizePerEntry())
+  .append(", net size per entry: ")
+  .append(getNetSizePerEntry())
+  .append(", totalDataSize: ")
+  .append(getTotalDataSize())
+  .append(", totalNetSize: ")
+  .append(getTotalNetSize())
--- End diff --

Extra info is always good. This info is printed by the sort operator when 
debug logging is enabled. (It was invaluable when investigating issues found by 
QA tests.) Suggestion: find a way to compact the labels. Maybe:
```
Per entry: std size: xx, std net: xxx; Totals: data size: xxx, net size: xxx
```

Also, `totalDataSize` and `totalNetSize` should have spaces: they are 
labels, not variable names here.


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-03-04 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r172105268
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -525,4 +763,11 @@ public VectorInitializer buildVectorInitializer() {
 }
 return initializer;
   }
+
+  public void allocateVectors(VectorContainer container, int recordCount) {
+for (VectorWrapper w : container) {
+  ColumnSize colSize = columnSizes.get(w.getField().getName());
+  colSize.allocateVector(w.getValueVector(), recordCount);
+}
+  }
--- End diff --

Very nice clean-up and enhancements. Now I don't feel so bad about this 
stuff being quick & dirty...


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-03-04 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r172104808
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -76,110 +82,327 @@
  * greater than (but unlikely) same as the row count.
  */
 
-public final int valueCount;
+private final int valueCount;
 
 /**
- * Total number of elements for a repeated type, or 1 if this is
- * a non-repeated type. That is, a batch of 100 rows may have an
- * array with 10 elements per row. In this case, the element count
- * is 1000.
+ * Total number of elements for a repeated type, or same as
+ * valueCount if this is a non-repeated type. That is, a batch
+ * of 100 rows may have an array with 10 elements per row.
+ * In this case, the element count is 1000.
  */
 
-public final int elementCount;
+private int elementCount;
 
 /**
- * Size of the top level value vector. For map and repeated list,
- * this is just size of offset vector.
+ * The estimated, average number of elements per parent value.
+ * Always 1 for a non-repeated type. For a repeated type,
+ * this is the average entries per array (per repeated element).
  */
-public int dataSize;
+
+private float estElementCountPerArray;
 
 /**
- * Total size of the column includes the sum total of memory for all
- * value vectors representing the column.
+ * Indicates if it is variable width column.
+ * For map columns, this is true if any of the children is variable
+ * width column.
  */
-public int netSize;
+
+private boolean isVariableWidth;
 
 /**
- * The estimated, average number of elements per parent value.
- * Always 1 for a non-repeated type. For a repeated type,
- * this is the average entries per array (per repeated element).
+ * Indicates if cardinality is repeated(top level only).
+ */
+
+private boolean isRepeated;
+
+/**
+ * Indicates if cardinality is optional i.e. nullable(top level only).
+ */
+private boolean isOptional;
+
+/**
+ * Child columns if this is a map column.
+ */
+private Map children = 
CaseInsensitiveMap.newHashMap();
+
+/**
+ * std pure data size per entry from Drill metadata, based on type.
+ * Does not include metadata vector overhead we add for cardinality,
+ * variable length etc.
+ * For variable-width columns, we use 50 as std size for entry width.
+ * For repeated column, we assume repetition of 10.
+ */
+public int getStdDataSizePerEntry() {
+  int stdDataSize;
+
+  try {
+stdDataSize = TypeHelper.getSize(metadata.getType());
+
+// For variable width, typeHelper includes offset vector width. 
Adjust for that.
+if (isVariableWidth) {
+  stdDataSize -= OFFSET_VECTOR_WIDTH;
+}
+
+if (isRepeated) {
+  stdDataSize = stdDataSize * STD_REPETITION_FACTOR;
+}
+  } catch (Exception e) {
+// For unsupported types, just set stdSize to 0.
+// Map, Union, List etc.
+stdDataSize = 0;
+  }
+
+  // Add sizes of children.
+  for (ColumnSize columnSize : children.values()) {
+stdDataSize += columnSize.getStdDataSizePerEntry();
+  }
+
+  if (isRepeatedList()) {
+stdDataSize = stdDataSize * STD_REPETITION_FACTOR;
+  }
+
+  return stdDataSize;
+}
+
+/**
+ * std net size per entry taking into account additional metadata 
vectors
+ * we add on top for variable length, cardinality etc.
+ * For variable-width columns, we use 50 as std data size for entry 
width.
+ * For repeated column, we assume repetition of 10.
+ */
+public int getStdNetSizePerEntry() {
+  int stdNetSize;
+  try {
+stdNetSize = TypeHelper.getSize(metadata.getType());
+  } catch (Exception e) {
+stdNetSize = 0;
+  }
+
+  if (isOptional) {
+stdNetSize += BIT_VECTOR_WIDTH;
+  }
+
+  if (isRepeated) {
+stdNetSize = (stdNetSize * STD_REPETITION_FACTOR) + 
OFFSET_VECTOR_WIDTH;
+  }
+
+  for (ColumnSize columnSize : children.values()) {
+stdNetSize += columnSize.getStdNetSizePerEntry();
+  }
+
+  if (isRepeatedList()) {
+stdNetSize = (stdNetSize * STD_REPETITION_FACTOR) + 
OFFSET_VECTOR_WIDTH;
+   

[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-03-04 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r172105435
  
--- Diff: 
exec/vector/src/main/java/org/apache/drill/exec/vector/complex/RepeatedMapVector.java
 ---
@@ -608,4 +608,22 @@ public void collectLedgers(Set ledgers) {
   public void toNullable(ValueVector nullableVector) {
 throw new UnsupportedOperationException();
   }
+
+  @Override
+  public int getPayloadByteCount(int valueCount) {
+if (valueCount == 0) {
+  return 0;
+}
+
+int count = 0;
+
+int entryCount = offsets.getAccessor().get(valueCount);
+count += offsets.getPayloadByteCount(valueCount);
--- End diff --

```
int count = offsets.get...
```


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-03-04 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r172105104
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -76,110 +82,327 @@
  * greater than (but unlikely) same as the row count.
  */
 
-public final int valueCount;
+private final int valueCount;
 
 /**
- * Total number of elements for a repeated type, or 1 if this is
- * a non-repeated type. That is, a batch of 100 rows may have an
- * array with 10 elements per row. In this case, the element count
- * is 1000.
+ * Total number of elements for a repeated type, or same as
+ * valueCount if this is a non-repeated type. That is, a batch
+ * of 100 rows may have an array with 10 elements per row.
+ * In this case, the element count is 1000.
  */
 
-public final int elementCount;
+private int elementCount;
 
 /**
- * Size of the top level value vector. For map and repeated list,
- * this is just size of offset vector.
+ * The estimated, average number of elements per parent value.
+ * Always 1 for a non-repeated type. For a repeated type,
+ * this is the average entries per array (per repeated element).
  */
-public int dataSize;
+
+private float estElementCountPerArray;
 
 /**
- * Total size of the column includes the sum total of memory for all
- * value vectors representing the column.
+ * Indicates if it is variable width column.
+ * For map columns, this is true if any of the children is variable
+ * width column.
  */
-public int netSize;
+
+private boolean isVariableWidth;
 
 /**
- * The estimated, average number of elements per parent value.
- * Always 1 for a non-repeated type. For a repeated type,
- * this is the average entries per array (per repeated element).
+ * Indicates if cardinality is repeated(top level only).
+ */
+
+private boolean isRepeated;
+
+/**
+ * Indicates if cardinality is optional i.e. nullable(top level only).
+ */
+private boolean isOptional;
+
+/**
+ * Child columns if this is a map column.
+ */
+private Map children = 
CaseInsensitiveMap.newHashMap();
+
+/**
+ * std pure data size per entry from Drill metadata, based on type.
+ * Does not include metadata vector overhead we add for cardinality,
+ * variable length etc.
+ * For variable-width columns, we use 50 as std size for entry width.
+ * For repeated column, we assume repetition of 10.
+ */
+public int getStdDataSizePerEntry() {
+  int stdDataSize;
+
+  try {
+stdDataSize = TypeHelper.getSize(metadata.getType());
+
+// For variable width, typeHelper includes offset vector width. 
Adjust for that.
+if (isVariableWidth) {
+  stdDataSize -= OFFSET_VECTOR_WIDTH;
+}
+
+if (isRepeated) {
+  stdDataSize = stdDataSize * STD_REPETITION_FACTOR;
+}
+  } catch (Exception e) {
+// For unsupported types, just set stdSize to 0.
+// Map, Union, List etc.
+stdDataSize = 0;
+  }
+
+  // Add sizes of children.
+  for (ColumnSize columnSize : children.values()) {
+stdDataSize += columnSize.getStdDataSizePerEntry();
+  }
+
+  if (isRepeatedList()) {
+stdDataSize = stdDataSize * STD_REPETITION_FACTOR;
+  }
+
+  return stdDataSize;
+}
+
+/**
+ * std net size per entry taking into account additional metadata 
vectors
+ * we add on top for variable length, cardinality etc.
+ * For variable-width columns, we use 50 as std data size for entry 
width.
+ * For repeated column, we assume repetition of 10.
+ */
+public int getStdNetSizePerEntry() {
+  int stdNetSize;
+  try {
+stdNetSize = TypeHelper.getSize(metadata.getType());
+  } catch (Exception e) {
+stdNetSize = 0;
+  }
+
+  if (isOptional) {
+stdNetSize += BIT_VECTOR_WIDTH;
+  }
+
+  if (isRepeated) {
+stdNetSize = (stdNetSize * STD_REPETITION_FACTOR) + 
OFFSET_VECTOR_WIDTH;
+  }
+
+  for (ColumnSize columnSize : children.values()) {
+stdNetSize += columnSize.getStdNetSizePerEntry();
+  }
+
+  if (isRepeatedList()) {
+stdNetSize = (stdNetSize * STD_REPETITION_FACTOR) + 
OFFSET_VECTOR_WIDTH;
+   

[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-03-04 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r172103839
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java
 ---
@@ -492,8 +504,12 @@ private void allocateBatch(boolean newSchema) {
 } else {
   container.zeroVectors();
 }
+
+// Allocate memory for the vectors.
+// This will iteratively allocate memory for all nested columns 
underneath.
 for (VectorWrapper w : container) {
-  AllocationHelper.allocateNew(w.getValueVector(), 
Character.MAX_VALUE);
+  RecordBatchSizer.ColumnSize colSize = 
mergeJoinMemoryManager.getColumnSize(w.getField().getName());
+  colSize.allocateVector(w.getValueVector(), 
mergeJoinMemoryManager.getOutputRowCount());
--- End diff --

Nice! Much simpler. Just one nit: might as well get the row count once and 
store it in a local variable to avoid getting the same number repeatedly.


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-03-04 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r172104598
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -76,110 +82,327 @@
  * greater than (but unlikely) same as the row count.
  */
 
-public final int valueCount;
+private final int valueCount;
 
 /**
- * Total number of elements for a repeated type, or 1 if this is
- * a non-repeated type. That is, a batch of 100 rows may have an
- * array with 10 elements per row. In this case, the element count
- * is 1000.
+ * Total number of elements for a repeated type, or same as
+ * valueCount if this is a non-repeated type. That is, a batch
+ * of 100 rows may have an array with 10 elements per row.
+ * In this case, the element count is 1000.
  */
 
-public final int elementCount;
+private int elementCount;
 
 /**
- * Size of the top level value vector. For map and repeated list,
- * this is just size of offset vector.
+ * The estimated, average number of elements per parent value.
+ * Always 1 for a non-repeated type. For a repeated type,
+ * this is the average entries per array (per repeated element).
  */
-public int dataSize;
+
+private float estElementCountPerArray;
 
 /**
- * Total size of the column includes the sum total of memory for all
- * value vectors representing the column.
+ * Indicates if it is variable width column.
+ * For map columns, this is true if any of the children is variable
+ * width column.
  */
-public int netSize;
+
+private boolean isVariableWidth;
 
 /**
- * The estimated, average number of elements per parent value.
- * Always 1 for a non-repeated type. For a repeated type,
- * this is the average entries per array (per repeated element).
+ * Indicates if cardinality is repeated(top level only).
+ */
+
+private boolean isRepeated;
--- End diff --

Might be fun to check out the new metadata classes added for the result set 
loader. They parse the `MajorType` to pull out this kind of information. You 
could embed an instance of the `ColumnMetadata` class here to provide this 
detailed information.


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-03-04 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r172104395
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -76,110 +82,327 @@
  * greater than (but unlikely) same as the row count.
  */
 
-public final int valueCount;
+private final int valueCount;
 
 /**
- * Total number of elements for a repeated type, or 1 if this is
- * a non-repeated type. That is, a batch of 100 rows may have an
- * array with 10 elements per row. In this case, the element count
- * is 1000.
+ * Total number of elements for a repeated type, or same as
+ * valueCount if this is a non-repeated type. That is, a batch
+ * of 100 rows may have an array with 10 elements per row.
+ * In this case, the element count is 1000.
  */
 
-public final int elementCount;
+private int elementCount;
 
 /**
- * Size of the top level value vector. For map and repeated list,
- * this is just size of offset vector.
+ * The estimated, average number of elements per parent value.
+ * Always 1 for a non-repeated type. For a repeated type,
+ * this is the average entries per array (per repeated element).
  */
-public int dataSize;
+
+private float estElementCountPerArray;
--- End diff --

Perhaps `estCardinality`? I've been using the term "cardinality" in new 
code to refer to array sizes.


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-03-02 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r171999276
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -245,16 +251,30 @@ private void buildVectorInitializer(VectorInitializer 
initializer) {
   else if (width > 0) {
 initializer.variableWidth(name, width);
   }
+
+  for (ColumnSize columnSize : childColumnSizes.values()) {
+columnSize.buildVectorInitializer(initializer);
+  }
 }
+
   }
 
   public static ColumnSize getColumn(ValueVector v, String prefix) {
 return new ColumnSize(v, prefix);
   }
 
+  public ColumnSize getColumn(String name) {
+return allColumnSizes.get(name);
+  }
+
   public static final int MAX_VECTOR_SIZE = ValueVector.MAX_BUFFER_SIZE; 
// 16 MiB
 
-  private Map columnSizes = 
CaseInsensitiveMap.newHashMap();
+  // This keeps information for all columns i.e. all top columns and 
nested columns underneath
+  private Map allColumnSizes = 
CaseInsensitiveMap.newHashMap();
--- End diff --

yes, I got rid of allColumnSizes. We will have only top level columns.


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-03-02 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r171999148
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -418,11 +438,13 @@ private void measureColumn(ValueVector v, String 
prefix) {
 netRowWidthCap50 += ! colSize.isVariableWidth ? colSize.estSize :
 8 /* offset vector */ + 
roundUpToPowerOf2(Math.min(colSize.estSize,50));
 // above change 8 to 4 after DRILL-5446 is fixed
+
+return colSize;
   }
 
-  private void expandMap(AbstractMapVector mapVector, String prefix) {
+  private void expandMap(ColumnSize colSize, AbstractMapVector mapVector, 
String prefix) {
 for (ValueVector vector : mapVector) {
-  measureColumn(vector, prefix);
+  colSize.childColumnSizes.put(prefix + vector.getField().getName(), 
measureColumn(vector, prefix));
--- End diff --

I made the change to maintain trees of columns for trees of maps. Any one 
who wants to access the lower level columns, they get top level column and have 
to walk through the tree.  



---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-02-21 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r169859674
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -245,16 +251,30 @@ private void buildVectorInitializer(VectorInitializer 
initializer) {
   else if (width > 0) {
 initializer.variableWidth(name, width);
   }
+
+  for (ColumnSize columnSize : childColumnSizes.values()) {
+columnSize.buildVectorInitializer(initializer);
+  }
 }
+
   }
 
   public static ColumnSize getColumn(ValueVector v, String prefix) {
 return new ColumnSize(v, prefix);
   }
 
+  public ColumnSize getColumn(String name) {
+return allColumnSizes.get(name);
+  }
+
   public static final int MAX_VECTOR_SIZE = ValueVector.MAX_BUFFER_SIZE; 
// 16 MiB
 
-  private Map columnSizes = 
CaseInsensitiveMap.newHashMap();
+  // This keeps information for all columns i.e. all top columns and 
nested columns underneath
+  private Map allColumnSizes = 
CaseInsensitiveMap.newHashMap();
--- End diff --

I'm a bit confused. I saw above that a `ColumnSize` has a list of children. 
Why are the children repeated here, introducing the naming issues described 
below?

Given the column alias issues, and how column index is used elsewhere, can 
this just be an index of top-level columns? Then, to size map vectors (the only 
one that has the nesting issue), use the code for recursion that already exists 
in the `VectorInitializer`?


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-02-21 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r169860846
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -266,12 +270,18 @@ public void addComplexWriter(ComplexWriter writer) {
 complexWriters.add(writer);
   }
 
-  private boolean doAlloc() {
-//Allocate vv in the allocationVectors.
+  private boolean doAlloc(int recordCount) {
+
+//Allocate v in the allocationVectors.
 for (ValueVector v : this.allocationVectors) {
-  if (!v.allocateNewSafe()) {
-return false;
-  }
+  // build vector initializer for the column.
+  // This will iteratively include all nested columns underneath.
+  RecordBatchSizer.ColumnSize colSize = 
flattenMemoryManager.getColumnSize(v.getField().getName());
+  VectorInitializer initializer = new VectorInitializer();
+  colSize.buildVectorInitializer(initializer);
+  // Allocate memory for the vector. If it is map, it will allocate 
memory
+  // for all nested child columns as well.
+  initializer.allocateVector(v, "", recordCount);
--- End diff --

While this code can be made to work, it does introduce a more complex path 
than was intended. The idea is that a `VectorInirializer` is a recursive 
structure. The top-level one holds hints for the row. Nested instances can hold 
data for maps.

Because this class was meant to be temporary, it holds "hints": the 
information is used if available, defaults used if the hints are not available.

So, a better approach would be to assemble a `VectorInitializer` for the 
output row in one step. Then, apply it to the entire row in another step.

Further if we do that, we don't have to create initializer objects for 
every vector allocation; we can reuse the same set if the output row sizes 
don't change.

Further, the code will be easier to reason about since we won't have two 
distinct paths.


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-02-21 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r169857960
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -418,11 +438,13 @@ private void measureColumn(ValueVector v, String 
prefix) {
 netRowWidthCap50 += ! colSize.isVariableWidth ? colSize.estSize :
 8 /* offset vector */ + 
roundUpToPowerOf2(Math.min(colSize.estSize,50));
 // above change 8 to 4 after DRILL-5446 is fixed
+
+return colSize;
   }
 
-  private void expandMap(AbstractMapVector mapVector, String prefix) {
+  private void expandMap(ColumnSize colSize, AbstractMapVector mapVector, 
String prefix) {
 for (ValueVector vector : mapVector) {
-  measureColumn(vector, prefix);
+  colSize.childColumnSizes.put(prefix + vector.getField().getName(), 
measureColumn(vector, prefix));
--- End diff --

This is subject to aliasing. Suppose I have two maps:

```
aa(b)
a(ab)
```
When I add the child vectors, both will produce a combined name of `aab`.

We can't use dots n names for the same reason:

```
a.b(c)
a(b.c)
```

Both will produce `a.b.c`.

In the new "result set loader" code, all places that handle trees of 
columns use actual trees of maps.

A crude-but-effecive solution is to use a non-legal name character. The 
only valid one is the back-tick since we use that in SQL to quote names. If we 
do that, we now have

```
aa`b
a`ab
a.b`c
a`b.c
```

And the names are now un-aliased.


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-02-21 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r169860310
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -245,16 +251,30 @@ private void buildVectorInitializer(VectorInitializer 
initializer) {
   else if (width > 0) {
 initializer.variableWidth(name, width);
   }
+
--- End diff --

There may be a but in the original code above this line.

```
   String name = prefix + metadata.getName();
...
   initializer.variableWidthArray(name, ...
```

First problem: this method does not recurse into the child vectors of maps. 
Second, if it did, since each initializer is supposed to hold a name relative 
to its parent tuple (row or map), the prefix is not needed.

The vector initializer code was originally created for sort and none of the 
tests for sort include maps, so I probably did not catch the fact that maps 
don't work correctly with the original code.

Suggestion: add unit tests for this stuff. This code is getting very 
complex and tests would be very helpful to catch these subtle bugs.

Disclaimer: this is all supposed to be replaced with the ResultSetLoader 
code. The vector allocation code there is much cleaner, does handle maps, and 
has been unit tested.


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-02-21 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r169858105
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -108,6 +108,12 @@
 public final float estElementCountPerArray;
 public final boolean isVariableWidth;
 
+public Map childColumnSizes = 
CaseInsensitiveMap.newHashMap();
--- End diff --

Nit: `children` might be shorter and simpler. It is clear that the children 
are sizes from the member type.


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-02-21 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r169859837
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorInitializer.java
 ---
@@ -97,7 +97,7 @@ public void allocateBatch(VectorAccessible va, int 
recordCount) {
 }
   }
 
-  private void allocateVector(ValueVector vector, String prefix, int 
recordCount) {
+  public void allocateVector(ValueVector vector, String prefix, int 
recordCount) {
--- End diff --

Note the error in the member variable in this class:

```
   private Map hints = new HashMap<>();
```

Should be a case insensitive map.


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-02-21 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/1125

 DRILL-6126: Allocate memory for value vectors upfront in flatten operator

Made changes to allocate memory upfront for flatten operator based on 
sizing calculations.
Need to do allocation of single column (can be nested) for a particular 
record count
and allocation hints. Refactored the code a bit for that.
Instead of assuming worst case fragmentation factor of 2, changed the logic 
to round down the number of rows calculated to nearest power of two. This will 
allow us to pack value vectors more densely and will help with memory 
utilization.
RepeatedMapvector and RepeatedListVector are extended from 
RepeatedFixedWidthVectorLike. This is wrong and causing problems in Allocation 
logic (allocatePrecomputedChildCount in AllocationHelper more specifically). 
Fixed that.
This PR has 2 commits. One for all of the above and second one for
DRILL-6162: Enhance record batch sizer to retain nesting information.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-6126

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1125.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1125


commit 58c6b9ad584e56c71d982feaaa43ad32b5011eef
Author: Padma Penumarthy 
Date:   2018-02-21T17:33:12Z

DRILL-6162: Enhance record batch sizer to retain nesting information for 
map columns.

commit f7c09131179b75d10ffe195785c9aef3b9c7ed97
Author: Padma Penumarthy 
Date:   2018-02-21T17:35:47Z

DRILL-6126: Allocate memory for value vectors upfront in flatten operator




---