[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-26 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1228#discussion_r184483395
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) {
*/
   private int netRowWidth;
   private int netRowWidthCap50;
+
+  /**
+   * actual row size if input is not empty. Otherwise, standard size.
+   */
+  private int rowAllocSize;
--- End diff --

@paul-rogers yes, this PR is trying to do the second thing, i.e. make a best 
guess with no knowledge at all about lengths. Also, the right side can produce no 
rows for reasons other than an empty array; for example, we might be filtering 
everything out after unnest.
Please let me know what you think about the approach.
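
Concretely, the best guess amounts to something like this (simplified sketch; 
the helper and parameter names are made up, only STD_REPETITION_FACTOR is the 
sizer's real constant):

    // Sketch: pick a per-entry allocation width with no data to measure.
    int allocWidthPerEntry(int rowCount, int measuredWidth, int stdWidth,
                           boolean isRepeated) {
      if (rowCount > 0) {
        return measuredWidth;              // we saw data: trust the measurement
      }
      // Empty input: fall back to the type's standard size, and assume
      // STD_REPETITION_FACTOR elements per array for repeated columns.
      return isRepeated ? stdWidth * STD_REPETITION_FACTOR : stdWidth;
    }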


---


[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-25 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1228#discussion_r184258865
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) {
*/
   private int netRowWidth;
   private int netRowWidthCap50;
+
+  /**
+   * actual row size if input is not empty. Otherwise, standard size.
+   */
+  private int rowAllocSize;
--- End diff --

Actually, this is a problem only for lateral join. In lateral join, the right 
side works on one row at a time from the left side. If the right side produces zero 
rows (because of an empty array or some other reason), then for left outer join we 
still have to finish the cross join for that row from the left side by emitting 
nulls for the right-side columns. Then we move on to the next row on the left side, 
continuing to fill the output batch until it is full. What that means is that we 
have to allocate vectors based on that first batch on the right side, which can be empty.
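
In outline, the flow is roughly this (pseudocode with made-up helper names, 
not the actual LateralJoinBatch code):

    // Left outer lateral: every left row must appear in the output.
    while (leftBatch.hasNextRow()) {
      Row left = leftBatch.nextRow();
      RowBatch right = unnest(left);        // can legitimately produce zero rows
      if (right.rowCount() == 0) {
        output.emit(left, Row.allNulls());  // LEFT OUTER: null-fill right columns
      } else {
        for (Row r : right) {
          output.emit(left, r);             // normal cross join for this row
        }
      }
      if (output.isFull()) {
        output.flush();                     // output vectors were allocated from
      }                                     // that (possibly empty) first right batch
    }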


---


[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-25 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1228#discussion_r184236170
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -277,18 +286,29 @@ public boolean isRepeatedList() {
     /**
      * This is the average per entry width, used for vector allocation.
      */
-    public int getEntryWidth() {
+    private int getEntryWidthForAlloc() {
       int width = 0;
       if (isVariableWidth) {
-        width = getNetSizePerEntry() - OFFSET_VECTOR_WIDTH;
+        width = getAllocSizePerEntry() - OFFSET_VECTOR_WIDTH;
 
         // Subtract out the bits (is-set) vector width
-        if (metadata.getDataMode() == DataMode.OPTIONAL) {
+        if (isOptional) {
           width -= BIT_VECTOR_WIDTH;
         }
+
+        if (isRepeated && getValueCount() == 0) {
+          return (safeDivide(width, STD_REPETITION_FACTOR));
--- End diff --

You are right. Row count can be non-zero while valueCount is still zero.  
Since this is intended for the empty batch case, I changed this to check the row 
count instead.


---


[GitHub] drill issue #1228: DRILL-6307: Handle empty batches in record batch sizer co...

2018-04-25 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1228
  
@paul-rogers Paul, I addressed the code review comments. Can you take a look 
when you get a chance?


---


[GitHub] drill issue #1227: DRILL-6236: batch sizing for hash join

2018-04-25 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1227
  
@Ben-Zvi I manually added the PR link to the JIRA. All code review comments 
are addressed. Can you look at the latest changes?


---


[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-25 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1228#discussion_r184202508
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) {
*/
   private int netRowWidth;
   private int netRowWidthCap50;
+
+  /**
+   * actual row size if input is not empty. Otherwise, standard size.
+   */
+  private int rowAllocSize;
--- End diff --

This is not just a problem with size estimation for vector memory 
allocation. Let us say one side of a join receives an empty batch as its first 
batch. If we use a row width of 0 in the outgoing row width calculation, the number 
of rows (to include in the outgoing batch) we calculate will be too high, and 
later, when we get a non-empty batch, we might exceed the memory limits.


---


[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-25 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1228#discussion_r184200281
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -277,18 +286,29 @@ public boolean isRepeatedList() {
     /**
      * This is the average per entry width, used for vector allocation.
      */
-    public int getEntryWidth() {
+    private int getEntryWidthForAlloc() {
       int width = 0;
       if (isVariableWidth) {
-        width = getNetSizePerEntry() - OFFSET_VECTOR_WIDTH;
+        width = getAllocSizePerEntry() - OFFSET_VECTOR_WIDTH;
 
         // Subtract out the bits (is-set) vector width
-        if (metadata.getDataMode() == DataMode.OPTIONAL) {
+        if (isOptional) {
           width -= BIT_VECTOR_WIDTH;
         }
+
+        if (isRepeated && getValueCount() == 0) {
+          return (safeDivide(width, STD_REPETITION_FACTOR));
+        }
       }
 
-      return (safeDivide(width, cardinality));
+      return (safeDivide(width, getEntryCardinalityForAlloc()));
+    }
+
+    /**
+     * This is the average per entry cardinality, used for vector allocation.
+     */
+    private float getEntryCardinalityForAlloc() {
+      return getCardinality() == 0 ? (isRepeated ? STD_REPETITION_FACTOR : 1) : getCardinality();
--- End diff --

This is for joins. We allocate vectors based on the sizing information of the 
first batch, and if that first batch is empty, we allocate vectors with zero 
capacity. When we read the next batch with data, we end up going through the 
realloc loop as we write values. For example, for a left outer join, if the right 
side batch is empty, we still have to include the right side columns as null in the 
outgoing batch. With the new lateral join operator, we also see the problem if the 
input has an empty array as the first record in the unnest column.
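
The realloc loop is just the usual copy-on-grow path; with zero-capacity 
vectors every early write hits it (simplified sketch):

    // Simplified view of writing into a vector that started at zero capacity.
    int capacity = 0;                         // first batch was empty
    for (int i = 0; i < incomingRowCount; i++) {
      while (i >= capacity) {
        capacity = Math.max(1, capacity * 2); // realloc: new buffer + full copy
      }
      // write value i ...
    }
    // Allocating up front for a sane row count (even a std-size guess)
    // avoids this copy-on-grow churn entirely.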


---


[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-25 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1228#discussion_r184192443
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -50,7 +50,7 @@
 public class RecordBatchSizer {
   private static final int OFFSET_VECTOR_WIDTH = UInt4Vector.VALUE_WIDTH;
   private static final int BIT_VECTOR_WIDTH = UInt1Vector.VALUE_WIDTH;
-  private static final int STD_REPETITION_FACTOR = 10;
+  public static final int STD_REPETITION_FACTOR = 10;
--- End diff --

Done. Using 5 in both places now. 


---


[GitHub] drill issue #1218: DRILL-6335: Refactor row set abstractions to prepare for ...

2018-04-23 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1218
  
LGTM. +1


---


[GitHub] drill issue #1227: Drill-6236: batch sizing for hash join

2018-04-20 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1227
  
@Ben-Zvi my bad. I updated the title, but it has not updated the JIRA. 
Trying to figure this out.


---


[GitHub] drill pull request #1227: Drill 6236: batch sizing for hash join

2018-04-20 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1227#discussion_r183171726
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
 ---
@@ -560,6 +554,40 @@ public void close() {
     super.close();
   }
 
+  @Override
+  protected void updateBatchMemoryManagerStats() {
+    stats.setLongStat(Metric.LEFT_INPUT_BATCH_COUNT,
+        batchMemoryManager.getNumIncomingBatches(LEFT_INDEX));
--- End diff --

@sohami I will create a JIRA and address that in a separate PR. For now, I 
would like to override this method. Is that OK?


---


[GitHub] drill issue #1227: Drill 6236: batch sizing for hash join

2018-04-20 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1227
  
@sohami thanks for the review. Updated with review comments addressed. 
Please take a look.


---


[GitHub] drill pull request #1227: Drill 6236: batch sizing for hash join

2018-04-20 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1227#discussion_r183112258
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
 ---
@@ -147,7 +150,19 @@
     NUM_BUCKETS,
     NUM_ENTRIES,
     NUM_RESIZING,
-    RESIZING_TIME_MS;
+    RESIZING_TIME_MS,
+    LEFT_INPUT_BATCH_COUNT,
+    LEFT_AVG_INPUT_BATCH_BYTES,
+    LEFT_AVG_INPUT_ROW_BYTES,
+    LEFT_INPUT_RECORD_COUNT,
+    RIGHT_INPUT_BATCH_COUNT,
+    RIGHT_AVG_INPUT_BATCH_BYTES,
+    RIGHT_AVG_INPUT_ROW_BYTES,
+    RIGHT_INPUT_RECORD_COUNT,
+    OUTPUT_BATCH_COUNT,
+    AVG_OUTPUT_BATCH_BYTES,
+    AVG_OUTPUT_ROW_BYTES,
+    OUTPUT_RECORD_COUNT;
--- End diff --

It is relevant in the sense that these metrics provide a high-level picture of 
the amount of data being processed, memory usage, etc. for each operator. This is 
also helpful when debugging, to figure out what is going on. 


---


[GitHub] drill issue #1227: Drill 6236: batch sizing for hash join

2018-04-20 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1227
  
@Ben-Zvi Thanks a lot for the review. Updated the PR with review comments taken 
care of. Please take a look.

Regarding spill files, here are my thoughts. 
For the build side, I am using aggregate statistics, i.e. an average across all 
batches. On the probe side, I am using stats for each batch coming in and 
adjusting the output row count. So we can skip applying sizing for batches 
spilled from the build side and continue to do what I am doing on the probe side. 
Once your code is merged in, I will refactor the code as needed.
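
For the build side, the aggregate is just a running average (sketch with 
hypothetical field and method names):

    // Running average row width across all build-side batches.
    long totalBatchBytes = 0;
    long totalRecords = 0;

    void updateBuildStats(long batchBytes, int recordCount) {
      totalBatchBytes += batchBytes;
      totalRecords += recordCount;
    }

    int avgBuildRowWidth() {
      return totalRecords == 0 ? 0 : (int) (totalBatchBytes / totalRecords);
    }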



---


[GitHub] drill pull request #1227: Drill 6236: batch sizing for hash join

2018-04-20 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1227#discussion_r183108078
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java
 ---
@@ -300,13 +322,14 @@ public void setupHashTable() throws IOException, SchemaChangeException, ClassTra
 
   public void executeBuildPhase() throws SchemaChangeException, ClassTransformationException, IOException {
     //Setup the underlying hash table
-
     // skip first batch if count is zero, as it may be an empty schema batch
     if (isFurtherProcessingRequired(rightUpstream) && right.getRecordCount() == 0) {
       for (final VectorWrapper<?> w : right) {
         w.clear();
       }
       rightUpstream = next(right);
+      // For build side, use aggregate i.e. average row width across batches
+      batchMemoryManager.update(RIGHT_INDEX, 0, true);
--- End diff --

There is a call to "next" right above the update. 


---


[GitHub] drill pull request #1218: DRILL-6335: Refactor row set abstractions to prepa...

2018-04-19 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1218#discussion_r18296
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/rowSet/RowSetWriterImpl.java 
---
@@ -158,4 +159,10 @@ public SingleRowSet done() {
   public int lastWriteIndex() {
 return writerIndex.vectorIndex();
   }
+
+  @Override
+  public ColumnMetadata schema() {
+// No column schema for the row as a whole.
--- End diff --

same comment as above


---


[GitHub] drill pull request #1218: DRILL-6335: Refactor row set abstractions to prepa...

2018-04-19 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1218#discussion_r182921553
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/RowSetLoaderImpl.java
 ---
@@ -95,4 +96,10 @@ public void endBatch() {
 
   @Override
   public int rowCount() { return rsLoader.rowCount(); }
+
+  @Override
+  public ColumnMetadata schema() {
+// No column for the row tuple
+return null;
--- End diff --

This comment is not clear to me. Is it OK to return null here? 


---


[GitHub] drill pull request #1218: DRILL-6335: Refactor row set abstractions to prepa...

2018-04-19 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1218#discussion_r182729279
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/rowSet/test/TestHyperVectorReaders.java
 ---
@@ -362,4 +363,136 @@ public void testRepeatedMap() {
 
 RowSetUtilities.verify(expected, hyperSet);
   }
+
+  @Test
+  public void testUnion() {
+    TupleMetadata schema = new SchemaBuilder()
+        .add("a", MinorType.INT)
+        .addUnion("u")
+          .addType(MinorType.INT)
+          .addType(MinorType.VARCHAR)
+          .resumeSchema()
+        .buildSchema();
+
+    SingleRowSet rowSet1 = fixture.rowSetBuilder(schema)
+        .addRow(2, 20)
+        .addRow(4, "fourth")
+        .build();
+
+    SingleRowSet rowSet2 = fixture.rowSetBuilder(schema)
+        .addRow(1, "first")
+        .addRow(3, 30)
+        .build();
+
+    // Build the hyper batch
+
+    HyperRowSet hyperSet = HyperRowSetImpl.fromRowSets(fixture.allocator(), rowSet1, rowSet2);
+    assertEquals(4, hyperSet.rowCount());
+    SelectionVector4 sv4 = hyperSet.getSv4();
+    sv4.set(0, 1, 0);
+    sv4.set(1, 0, 0);
--- End diff --

Do we have to manually map each index like this? Maybe we can do this in 
the fromRowSets method itself?
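
For example, fromRowSets could fill the SV4 in batch order and let tests 
permute afterwards (sketch, using the same sv4.set(index, batchIndex, recordIndex) 
form as the test above; rowSets is a hypothetical parameter name):

    // Default identity mapping across the supplied row sets.
    int i = 0;
    for (int batch = 0; batch < rowSets.length; batch++) {
      for (int row = 0; row < rowSets[batch].rowCount(); row++) {
        sv4.set(i++, batch, row);
      }
    }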


---


[GitHub] drill pull request #1218: DRILL-6335: Refactor row set abstractions to prepa...

2018-04-19 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1218#discussion_r182923416
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/rowSet/test/TestIndirectReaders.java
 ---
@@ -0,0 +1,151 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.test.rowSet.test;
+
+import static org.junit.Assert.*;
+
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.record.selection.SelectionVector2;
+import org.apache.drill.exec.vector.accessor.ArrayReader;
+import org.apache.drill.exec.vector.accessor.ArrayWriter;
+import org.apache.drill.exec.vector.accessor.ScalarReader;
+import org.apache.drill.exec.vector.accessor.ScalarWriter;
+import org.apache.drill.test.SubOperatorTest;
+import org.apache.drill.test.rowSet.RowSetWriter;
+import org.apache.drill.test.rowSet.RowSet.ExtendableRowSet;
+import org.apache.drill.test.rowSet.RowSet.SingleRowSet;
+import org.apache.drill.test.rowSet.schema.SchemaBuilder;
+import org.apache.drill.test.rowSet.RowSetComparison;
+import org.apache.drill.test.rowSet.RowSetReader;
+import org.junit.Test;
+
+/**
+ * Test reading with an indirection vector (sv2.) This form of
+ * indirection vector reorders values within a single batch.
+ * Since the indirection occurs only in the reader index, only
+ * light testing is done; all readers go through the same index class,
+ * so if the index works for one reader, it will for for all.
--- End diff --

typo: it will work for all (instead of for for all)


---


[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...

2018-04-19 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/1228

DRILL-6307: Handle empty batches in record batch sizer correctly

When we get an empty batch, the record batch sizer calculates the row width as 
zero. In that case, we do not do accounting and memory allocation correctly for 
outgoing batches. 

For example, for a left outer join, if the right side batch is empty, we still 
have to include the right side columns as null in the outgoing batch. Say the first 
batch is empty: then, for the outgoing batch, we allocate empty vectors with zero 
capacity, and when we read the next batch with data, we end up going through the 
realloc loop as we write values. Also, if we use a right side row width of 0 in the 
outgoing row width calculation, the number of rows (to include in the outgoing 
batch) we calculate will be too high, and later, when we get a non-empty batch, we 
might exceed the memory limits. 

This PR tries to address these problems by allocating memory based on standard 
size for an empty input batch. It uses the allocation width as the width of the 
batch in the row count calculation for binary operators. For unary operators, this 
is not a problem since we drop empty batches without doing any processing. 



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-6307

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1228.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1228


commit cd78209e9f75a59edc68df3e416f3936fb00f917
Author: Padma Penumarthy 
Date:   2018-04-06T19:56:06Z

DRILL-6307: Handle empty batches in record batch sizer correctly




---


[GitHub] drill pull request #1227: Drill 6236: batch sizing for hash join

2018-04-19 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/1227

Drill 6236: batch sizing for hash join



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-6236

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1227.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1227


commit f41aadab1cdf61bb1818724597aaa7726000af09
Author: Padma Penumarthy 
Date:   2018-04-07T03:53:26Z

DRILL-6236: batch sizing for hash join

commit e2ddf8761e006a16ed7b6ceea3c2fa849cc6a6e5
Author: Padma Penumarthy 
Date:   2018-04-19T05:23:54Z

DRILL-6343: bit vector copyFromSafe is not doing realloc




---


[GitHub] drill issue #1161: DRILL-6230: Extend row set readers to handle hyper vector...

2018-03-31 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1161
  
@paul-rogers ran the tests. They all pass. 


---


[GitHub] drill issue #1181: DRILL-6284: Add operator metrics for batch sizing for fla...

2018-03-31 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1181
  
@paul-rogers thanks a lot for the review. Updated the PR with review comments 
taken care of. 


---


[GitHub] drill issue #1181: DRILL-6284: Add operator metrics for batch sizing for fla...

2018-03-29 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1181
  
@paul-rogers I added metrics for merge join also. I refactored 
AbstractRecordBatchMemoryManager to handle batches from multiple streams. 
Please review when you get a chance. 


---


[GitHub] drill issue #1181: DRILL-6284: Add operator metrics for batch sizing for fla...

2018-03-28 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1181
  
@paul-rogers thanks for the review. Please take a look at updated changes.


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-28 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1181#discussion_r177810092
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -99,6 +100,22 @@ private void clear() {
 }
   }
 
+  public enum Metric implements MetricDef {
+    NUM_INCOMING_BATCHES,
+    AVG_INPUT_BATCH_SIZE,
+    AVG_INPUT_ROW_WIDTH,
--- End diff --

@paul-rogers Paul, I did not understand what you mean by parallel here and 
below. Do you mean they should be adjacent columns in the web UI?


---


[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...

2018-03-28 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1161#discussion_r177564887
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/model/ReaderIndex.java
 ---
@@ -28,26 +28,30 @@
 
 public abstract class ReaderIndex implements ColumnReaderIndex {
 
-  protected int rowIndex = -1;
+  protected int position = -1;
   protected final int rowCount;
 
   public ReaderIndex(int rowCount) {
     this.rowCount = rowCount;
   }
 
-  public int position() { return rowIndex; }
-  public void set(int index) { rowIndex = index; }
+  public void set(int index) {
+    assert position >= -1 && position <= rowCount;
+    position = index;
+  }
+
+  @Override
+  public int logicalIndex() { return position; }
+
+  @Override
+  public int size() { return rowCount; }
 
+  @Override
   public boolean next() {
-    if (++rowIndex < rowCount ) {
+    if (++position < rowCount) {
       return true;
-    } else {
-      rowIndex--;
-      return false;
     }
+    position = rowCount;
--- End diff --

Is there a need to set position to rowCount? It will come here when 
position == rowCount. 


---


[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...

2018-03-28 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1161#discussion_r177182331
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/rowSet/test/TestHyperVectorReaders.java
 ---
@@ -0,0 +1,365 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.test.rowSet.test;
+
+import static org.apache.drill.test.rowSet.RowSetUtilities.mapArray;
+import static org.apache.drill.test.rowSet.RowSetUtilities.mapValue;
+import static org.apache.drill.test.rowSet.RowSetUtilities.strArray;
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertFalse;
+import static org.junit.Assert.assertTrue;
+
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.record.metadata.TupleMetadata;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+import org.apache.drill.test.SubOperatorTest;
+import org.apache.drill.test.rowSet.HyperRowSetImpl;
+import org.apache.drill.test.rowSet.RowSet.ExtendableRowSet;
+import org.apache.drill.test.rowSet.RowSet.HyperRowSet;
+import org.apache.drill.test.rowSet.RowSet.SingleRowSet;
+import org.apache.drill.test.rowSet.RowSetBuilder;
+import org.apache.drill.test.rowSet.RowSetReader;
+import org.apache.drill.test.rowSet.RowSetUtilities;
+import org.apache.drill.test.rowSet.RowSetWriter;
+import org.apache.drill.test.rowSet.schema.SchemaBuilder;
+import org.junit.Test;
+
+/**
+ * Test the reader mechanism that reads rows indexed via an SV4.
+ * SV4's introduce an additional level of indexing: each row may
+ * come from a different batch. The readers use the SV4 to find
+ * the root batch and vector, then must navigate downward from that
+ * vector for maps, repeated maps, lists, unions, repeated lists,
+ * nullable vectors and variable-length vectors.
+ * 
+ * This test does not cover repeated vectors; those tests should be added.
--- End diff --

please file a JIRA for this.


---


[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...

2018-03-28 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1161#discussion_r177175974
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/test/rowSet/HyperRowSetImpl.java 
---
@@ -45,8 +50,67 @@ public RowSetReader buildReader(HyperRowSet rowSet, SelectionVector4 sv4) {
       TupleMetadata schema = rowSet.schema();
       HyperRowIndex rowIndex = new HyperRowIndex(sv4);
       return new RowSetReaderImpl(schema, rowIndex,
-          buildContainerChildren(rowSet.container(),
-              new MetadataRetrieval(schema)));
+          buildContainerChildren(rowSet.container(), schema));
     }
   }
 
+  public static class HyperRowSetBuilderImpl implements HyperRowSetBuilder {
+
+    private final BufferAllocator allocator;
+    private final List<VectorContainer> batches = new ArrayList<>();
+    private int totalRowCount;
+
+    public HyperRowSetBuilderImpl(BufferAllocator allocator) {
+      this.allocator = allocator;
+    }
+
+    @Override
+    public void addBatch(SingleRowSet rowSet) {
+      if (rowSet.rowCount() == 0) {
+        return;
+      }
+      if (rowSet.indirectionType() != SelectionVectorMode.NONE) {
+        throw new IllegalArgumentException("Batches must not have a selection vector.");
+      }
+      batches.add(rowSet.container());
+      totalRowCount += rowSet.rowCount();
+    }
+
+    @Override
+    public void addBatch(VectorContainer container) {
+      if (container.getRecordCount() == 0) {
+        return;
+      }
+      if (container.getSchema().getSelectionVectorMode() != SelectionVectorMode.NONE) {
+        throw new IllegalArgumentException("Batches must not have a selection vector.");
+      }
+      batches.add(container);
+      totalRowCount += container.getRecordCount();
+    }
+
+    @SuppressWarnings("resource")
+    @Override
+    public HyperRowSet build() throws SchemaChangeException {
+      SelectionVector4 sv4 = new SelectionVector4(allocator, totalRowCount);
+      ExpandableHyperContainer hyperContainer = new ExpandableHyperContainer();
+      for (VectorContainer container : batches) {
+        hyperContainer.addBatch(container);
+      }
+
+      // TODO: This has a bug. If the hyperset has two batches with unions,
+      // and the first union contains only VARCHAR, while the second contains

Is there a JIRA for this bug?


---


[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...

2018-03-21 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/1181

DRILL-6284: Add operator metrics for batch sizing for flatten

@kkhatua please review.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-6284

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1181.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1181


commit f0b7bed20aef64cdc9e025a5ca209e1ad6220aa6
Author: Padma Penumarthy 
Date:   2018-03-20T20:44:50Z

DRILL-6284: Add operator metrics for batch sizing for flatten




---


[GitHub] drill pull request #1179: DRILL-6254: IllegalArgumentException: the requeste...

2018-03-21 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/1179

DRILL-6254: IllegalArgumentException: the requested size must be non-…

…negative

We should limit memory allocation to the number of records that are going to be 
in the next batch, not the total number of records remaining. For a very large 
remaining record count, when multiplied by a high cardinality, the integer 
overflows and becomes negative.
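
To illustrate the overflow (illustrative numbers only, not from the bug report):

    // Sizing for ALL remaining records overflows 32-bit int arithmetic.
    int remainingRecords = 3_000_000;
    int bytesPerEntry = 50;
    int elementsPerArray = 20;
    int requested = remainingRecords * bytesPerEntry * elementsPerArray;
    // 3,000,000 * 50 * 20 = 3,000,000,000 wraps around to -1,294,967,296,
    // hence "the requested size must be non-negative".
    int nextBatchRows = Math.min(remainingRecords, 4 * 1024);  // size only the next batch
    long safe = (long) nextBatchRows * bytesPerEntry * elementsPerArray; // 4,096,000: fits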

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-6254

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1179.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1179


commit c26e8f44b9c873d0d7acabac3623a3a8e19086eb
Author: Padma Penumarthy 
Date:   2018-03-21T20:39:43Z

DRILL-6254: IllegalArgumentException: the requested size must be 
non-negative




---


[GitHub] drill issue #1175: DRILL-6262: IndexOutOfBoundException in RecordBatchSize f...

2018-03-19 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1175
  
LGTM. +1.


---


[GitHub] drill issue #1171: DRILL-6231: Fix memory allocation for repeated list vecto...

2018-03-18 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1171
  
@paul-rogers I added the null check. Also added a test case for a 3D array, 
i.e. a repeated repeated list.


---


[GitHub] drill issue #1171: DRILL-6231: Fix memory allocation for repeated list vecto...

2018-03-15 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1171
  
@paul-rogers please review.


---


[GitHub] drill issue #1161: DRILL-6230: Extend row set readers to handle hyper vector...

2018-03-15 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1161
  
@paul-rogers I started the review. Will get back soon.


---


[GitHub] drill pull request #1171: DRILL-6231: Fix memory allocation for repeated lis...

2018-03-15 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/1171

DRILL-6231: Fix memory allocation for repeated list vector



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-6231

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1171.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1171


commit 114a526240a272bd626862a02d1acd06b9c8b4cf
Author: Padma Penumarthy 
Date:   2018-03-16T04:50:54Z

DRILL-6231: Fix memory allocation for repeated list vector




---


[GitHub] drill issue #1150: DRILL-6210: Enhanced test schema utilities

2018-03-06 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1150
  
@Ben-Zvi Boaz, can you do a committer review for this PR and include it in 
this week's batch commit? 


---


[GitHub] drill issue #1125: DRILL-6126: Allocate memory for value vectors upfront in ...

2018-03-06 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1125
  
@paul-rogers Thank you Paul. I will be incorporating some of your 
suggestions in future PRs. This change will allow us to build more things on 
top. 


---


[GitHub] drill issue #1150: DRILL-6210: Enhanced test schema utilities

2018-03-05 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1150
  
This is very useful for testing complex types.  LGTM.  +1. 


---


[GitHub] drill issue #1125: DRILL-6126: Allocate memory for value vectors upfront in ...

2018-03-03 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1125
  
@paul-rogers Updated the PR with the latest changes. I have decided not to use 
the vector initializer for allocation, as it is subject to the alias issues you 
mentioned. Instead, I added an allocateVector method to ColumnSize in the batch 
sizer, which uses the internal sizing information it has to allocate memory 
(including all its children) for a particular record count. 
I also refactored the batch sizer code and added unit tests verifying sizing and 
allocation for different vector types. 
Please take a look when you get a chance and let me know what you think.
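
For reference, the per-column allocation has roughly this shape (simplified 
sketch; childVectorOf is a made-up helper, the real code walks the actual value 
vector tree, and getCardinality/getAllocSizePerEntry/childColumnSizes are from 
the diffs in this thread):

    // Allocate one column, and its children, for a given record count,
    // using the sizes this ColumnSize measured from incoming data.
    void allocateVector(ValueVector vector, int recordCount) {
      // entries needed: rows times average elements per row for arrays
      int childCount = (int) Math.ceil(recordCount * getCardinality());
      AllocationHelper.allocatePrecomputedChildCount(
          vector, recordCount, getAllocSizePerEntry(), childCount);
      for (ColumnSize child : childColumnSizes.values()) {
        child.allocateVector(childVectorOf(vector, child), childCount);
      }
    }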



---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-03-02 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r171999276
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -245,16 +251,30 @@ private void buildVectorInitializer(VectorInitializer initializer) {
       else if (width > 0) {
         initializer.variableWidth(name, width);
       }
+
+      for (ColumnSize columnSize : childColumnSizes.values()) {
+        columnSize.buildVectorInitializer(initializer);
+      }
     }
+
   }
 
   public static ColumnSize getColumn(ValueVector v, String prefix) {
     return new ColumnSize(v, prefix);
   }
 
+  public ColumnSize getColumn(String name) {
+    return allColumnSizes.get(name);
+  }
+
   public static final int MAX_VECTOR_SIZE = ValueVector.MAX_BUFFER_SIZE; // 16 MiB
 
-  private Map<String, ColumnSize> columnSizes = CaseInsensitiveMap.newHashMap();
+  // This keeps information for all columns i.e. all top columns and nested columns underneath
+  private Map<String, ColumnSize> allColumnSizes = CaseInsensitiveMap.newHashMap();
--- End diff --

Yes, I got rid of allColumnSizes. We will have only top-level columns.


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-03-02 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1125#discussion_r171999148
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java 
---
@@ -418,11 +438,13 @@ private void measureColumn(ValueVector v, String prefix) {
     netRowWidthCap50 += ! colSize.isVariableWidth ? colSize.estSize :
         8 /* offset vector */ + roundUpToPowerOf2(Math.min(colSize.estSize, 50));
     // above change 8 to 4 after DRILL-5446 is fixed
+
+    return colSize;
   }
 
-  private void expandMap(AbstractMapVector mapVector, String prefix) {
+  private void expandMap(ColumnSize colSize, AbstractMapVector mapVector, String prefix) {
     for (ValueVector vector : mapVector) {
-      measureColumn(vector, prefix);
+      colSize.childColumnSizes.put(prefix + vector.getField().getName(), measureColumn(vector, prefix));
--- End diff --

I made the change to maintain trees of columns for trees of maps. Anyone 
who wants to access the lower-level columns gets the top-level column and walks 
through the tree.
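
So access looks roughly like this (sketch; the column names are made up, while 
getColumn and childColumnSizes are from the diffs above):

    // Walk a nested column tree starting from a top-level map column.
    ColumnSize top = sizer.getColumn("m");                  // top-level map
    ColumnSize inner = top.childColumnSizes.get("m.a");     // one level down
    for (ColumnSize child : top.childColumnSizes.values()) {
      // visit child; recurse into child.childColumnSizes for deeper maps
    }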



---


[GitHub] drill pull request #1147: DRILL-6205: Reduce memory consumption of testFlatt...

2018-03-02 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/1147

DRILL-6205: Reduce memory consumption of testFlattenUpperLimit test



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-6205

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1147.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1147


commit d61ad8fe176996015209eba321e2a176d3a8d24f
Author: Padma Penumarthy 
Date:   2018-03-02T19:13:59Z

DRILL-6205: Reduce memory consumption of testFlattenUpperLimit test




---


[GitHub] drill issue #1129: DRILL-6180: Use System Option "output_batch_size" for Ext...

2018-02-28 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1129
  
@paul-rogers Made the change you suggested. Please take a look when you get 
a chance. 


---


[GitHub] drill issue #1125: DRILL-6126: Allocate memory for value vectors upfront in ...

2018-02-28 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1125
  
@paul-rogers Paul, thanks a lot for your review comments and for bringing up 
some good issues. Just want to let you know: I am working on refactoring the 
batch sizer code and writing a bunch of unit tests to test sizing and vector 
allocation for all the different vector types. I found some bugs in the process and 
fixed them. I will be posting the new changes soon and will need your review once 
they are ready.


---


[GitHub] drill pull request #1129: DRILL-6180: Use System Option "output_batch_size" ...

2018-02-22 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/1129

DRILL-6180: Use System Option "output_batch_size" for External Sort

External Sort has a boot-time configuration for output batch size, 
"drill.exec.sort.external.spill.merge_batch_size", which defaults to 16M.
To make batch sizing configuration uniform across all operators, change 
this to use the new system option that was added, 
"drill.exec.memory.operator.output_batch_size". 
That option had a default value of 32M; changed it to 16M.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-6180

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1129.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1129


commit 5121663c1fac618d0374667c97c20570197b7455
Author: Padma Penumarthy 
Date:   2018-02-23T00:41:47Z

DRILL-6180: Use System Option "output_batch_size" for External Sort




---


[GitHub] drill issue #1125: DRILL-6126: Allocate memory for value vectors upfront in ...

2018-02-21 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1125
  
@paul-rogers Paul, will you be able to review this?


---


[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...

2018-02-21 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/1125

 DRILL-6126: Allocate memory for value vectors upfront in flatten operator

Made changes to allocate memory upfront for the flatten operator based on 
sizing calculations. We need to allocate a single column (which can be nested) for 
a particular record count, with allocation hints; refactored the code a bit for that.
Instead of assuming a worst-case fragmentation factor of 2, changed the logic 
to round the calculated number of rows down to the nearest power of two (see the 
sketch below). This will allow us to pack value vectors more densely and will help 
with memory utilization.
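
A minimal sketch of the round-down (illustrative only; uses Java's 
Integer.highestOneBit):

    // Round the computed row count down to the nearest power of two.
    static int roundDownToPowerOfTwo(int n) {
      return n <= 1 ? 1 : Integer.highestOneBit(n);
    }
    // e.g. roundDownToPowerOfTwo(6000) == 4096; vectors then get filled
    // densely instead of reserving a 2x worst-case fragmentation cushion.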
RepeatedMapVector and RepeatedListVector are extended from 
RepeatedFixedWidthVectorLike. This is wrong and was causing problems in the 
allocation logic (allocatePrecomputedChildCount in AllocationHelper, more 
specifically). Fixed that.
This PR has 2 commits: one for all of the above, and a second one for 
DRILL-6162: Enhance record batch sizer to retain nesting information.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-6126

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1125.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1125


commit 58c6b9ad584e56c71d982feaaa43ad32b5011eef
Author: Padma Penumarthy 
Date:   2018-02-21T17:33:12Z

DRILL-6162: Enhance record batch sizer to retain nesting information for 
map columns.

commit f7c09131179b75d10ffe195785c9aef3b9c7ed97
Author: Padma Penumarthy 
Date:   2018-02-21T17:35:47Z

DRILL-6126: Allocate memory for value vectors upfront in flatten operator




---


[GitHub] drill pull request #1121: DRILL-6153: Operator framework

2018-02-19 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1121#discussion_r169155001
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java
 ---
@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.protocol;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.record.RecordBatch.IterOutcome;
+
+/**
+ * State machine that drives the operator executable. Converts
+ * between the iterator protocol and the operator executable protocol.
+ * Implemented as a separate class in anticipation of eventually
+ * changing the record batch (iterator) protocol.
+ */
+
+public class OperatorDriver {
+  public enum State {
+
+/**
+ * Before the first call to next().
+ */
+
+START,
+
+/**
+ * The first call to next() has been made and schema (only)
+ * was returned. On the subsequent call to next(), return any
+ * data that might have accompanied that first batch.
+ */
+
+SCHEMA,
+
+/**
+ * The second call to next() has been made and there is more
+ * data to deliver on subsequent calls.
+ */
+
+RUN,
+
+/**
+ * No more data to deliver.
+ */
+
+END,
+
+/**
+ * An error occurred. Operation was cancelled.
+ */
+
+FAILED,
+
+/**
+ * close() called and resources are released.
+ */
+
+CLOSED }
--- End diff --

Minor: put the closing brace on a separate line for better readability.


---


[GitHub] drill pull request #1121: DRILL-6153: Operator framework

2018-02-19 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1121#discussion_r169159418
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java
 ---
@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.protocol;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.record.RecordBatch.IterOutcome;
+
+/**
+ * State machine that drives the operator executable. Converts
+ * between the iterator protocol and the operator executable protocol.
+ * Implemented as a separate class in anticipation of eventually
+ * changing the record batch (iterator) protocol.
+ */
+
+public class OperatorDriver {
+  public enum State {
+
+/**
+ * Before the first call to next().
+ */
+
+START,
+
+/**
+ * The first call to next() has been made and schema (only)
+ * was returned. On the subsequent call to next(), return any
+ * data that might have accompanied that first batch.
+ */
+
+SCHEMA,
+
+/**
+ * The second call to next() has been made and there is more
+ * data to deliver on subsequent calls.
+ */
+
+RUN,
+
+/**
+ * No more data to deliver.
+ */
+
+END,
+
+/**
+ * An error occurred. Operation was cancelled.
+ */
+
+FAILED,
+
+/**
+ * close() called and resources are released.
+ */
+
+CLOSED }
+
+  private OperatorDriver.State state = State.START;
+
+  /**
+   * Operator context. The driver "owns" the context and is responsible
+   * for closing it.
+   */
+
+  private final OperatorContext opContext;
+  private final OperatorExec operatorExec;
+  private final BatchAccessor batchAccessor;
+  private int schemaVersion;
+
+  public OperatorDriver(OperatorContext opContext, OperatorExec opExec) {
+this.opContext = opContext;
+this.operatorExec = opExec;
+batchAccessor = operatorExec.batchAccessor();
+  }
+
+  /**
+   * Get the next batch. Performs initialization on the first call.
+   * @return the iteration outcome to send downstream
+   */
+
+  public IterOutcome next() {
+try {
+  switch (state) {
+  case START:
+return start();
+  case RUN:
+return doNext();
+      default:
+        OperatorRecordBatch.logger.debug("Extra call to next() in state " + state + ": " + operatorLabel());
+        return IterOutcome.NONE;
+  }
+} catch (UserException e) {
+  cancelSilently();
+  state = State.FAILED;
+  throw e;
+} catch (Throwable t) {
+  cancelSilently();
+  state = State.FAILED;
+  throw UserException.executionError(t)
+.addContext("Exception thrown from", operatorLabel())
+.build(OperatorRecordBatch.logger);
+}
+  }
+
+  /**
+   * Cancels the operator before reaching EOF.
+   */
+
+  public void cancel() {
+try {
+  switch (state) {
+  case START:
+  case RUN:
+cancelSilently();
+break;
+  default:
+break;
+  }
+} finally {
+  state = State.FAILED;
--- End diff --

I am thinking FAILED represents an internal failure within the operator.  
Cancel means we are explicitly canceling it (for whatever reason), i.e. the 
operator is being asked to shut down or close. For cancel, should we move the 
state to CLOSED instead of FAILED?
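
i.e. something like this (sketch of the suggestion, reusing the PR's 
cancelSilently and State from the diff above):

    public void cancel() {
      try {
        cancelSilently();        // as in the START/RUN cases above
      } finally {
        state = State.CLOSED;    // suggestion: was State.FAILED in the PR
      }
    }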


---


[GitHub] drill pull request #1121: DRILL-6153: Operator framework

2018-02-19 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1121#discussion_r169155210
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java
 ---
@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.protocol;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.record.RecordBatch.IterOutcome;
+
+/**
+ * State machine that drives the operator executable. Converts
+ * between the iterator protocol and the operator executable protocol.
+ * Implemented as a separate class in anticipation of eventually
+ * changing the record batch (iterator) protocol.
+ */
+
+public class OperatorDriver {
+  public enum State {
+
+/**
+ * Before the first call to next().
+ */
+
+START,
+
+/**
+ * The first call to next() has been made and schema (only)
+ * was returned. On the subsequent call to next(), return any
+ * data that might have accompanied that first batch.
+ */
+
+SCHEMA,
+
+/**
+ * The second call to next() has been made and there is more
+ * data to deliver on subsequent calls.
+ */
+
+RUN,
+
+/**
+ * No more data to deliver.
+ */
+
+END,
+
+/**
+ * An error occurred. Operation was cancelled.
+ */
+
+FAILED,
+
+/**
+ * close() called and resources are released.
+ */
+
+CLOSED }
+
+  private OperatorDriver.State state = State.START;
+
+  /**
+   * Operator context. The driver "owns" the context and is responsible
+   * for closing it.
+   */
+
+  private final OperatorContext opContext;
+  private final OperatorExec operatorExec;
+  private final BatchAccessor batchAccessor;
+  private int schemaVersion;
+
+  public OperatorDriver(OperatorContext opContext, OperatorExec opExec) {
+this.opContext = opContext;
+this.operatorExec = opExec;
+batchAccessor = operatorExec.batchAccessor();
+  }
+
+  /**
+   * Get the next batch. Performs initialization on the first call.
+   * @return the iteration outcome to send downstream
+   */
+
+  public IterOutcome next() {
+try {
+  switch (state) {
+  case START:
+return start();
+  case RUN:
+return doNext();
+   default:
--- End diff --

alignment


---


[GitHub] drill pull request #1121: DRILL-6153: Operator framework

2018-02-19 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1121#discussion_r169162527
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/VectorContainerAccessor.java
 ---
@@ -0,0 +1,132 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.protocol;
+
+import java.util.Collections;
+import java.util.Iterator;
+
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.record.BatchSchema;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.record.WritableBatch;
+import org.apache.drill.exec.record.selection.SelectionVector2;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+
+public class VectorContainerAccessor implements BatchAccessor {
+
+  public static class ContainerAndSv2Accessor extends VectorContainerAccessor {
+
+private SelectionVector2 sv2;
+
+public void setSelectionVector(SelectionVector2 sv2) {
+  this.sv2 = sv2;
+}
+
+@Override
+public SelectionVector2 getSelectionVector2() {
+  return sv2;
+}
+  }
+
+  public static class ContainerAndSv4Accessor extends VectorContainerAccessor {
+
+private SelectionVector4 sv4;
+
+@Override
+public SelectionVector4 getSelectionVector4() {
+  return sv4;
+}
+  }
+
+  private VectorContainer container;
+  private SchemaTracker schemaTracker = new SchemaTracker();
+
+  /**
+   * Set the vector container. Done initially, and any time the schema of
+   * the container may have changed. May be called with the same container
+   * as the previous call, or a different one. A schema change occurs
+   * unless the vectors are identical across the two containers.
+   *
+   * @param container the container that holds vectors to be sent
+   * downstream
+   */
+
+  public void setContainer(VectorContainer container) {
+this.container = container;
+if (container != null) {
+  schemaTracker.trackSchema(container);
+}
+  }
+
+  @Override
+  public BatchSchema getSchema() {
+return container == null ? null : container.getSchema();
+  }
+
+  @Override
+  public int schemaVersion() { return schemaTracker.schemaVersion(); }
+
+  @Override
+  public int getRowCount() {
+return container == null ? 0 : container.getRecordCount();
+  }
+
+  @Override
+  public VectorContainer getOutgoingContainer() { return container; }
+
+  @Override
+  public TypedFieldId getValueVectorId(SchemaPath path) {
+return container.getValueVectorId(path);
+  }
+
+  @Override
+  public VectorWrapper<?> getValueAccessorById(Class<?> clazz, int... ids) {
+return container.getValueAccessorById(clazz, ids);
+  }
+
+  @Override
+  public WritableBatch getWritableBatch() {
+return WritableBatch.get(container);
+  }
+
+  @Override
+  public SelectionVector2 getSelectionVector2() {
+// Throws an exception by default
--- End diff --

Should we make that explicit by indicating what exceptions it might throw? 


---


[GitHub] drill pull request #1121: DRILL-6153: Operator framework

2018-02-14 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1121#discussion_r168324742
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java
 ---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.protocol;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.record.RecordBatch.IterOutcome;
+
+/**
+ * State machine that drives the operator executable. Converts
+ * between the iterator protocol and the operator executable protocol.
+ * Implemented as a separate class in anticipation of eventually
+ * changing the record batch (iterator) protocol.
+ */
+
+public class OperatorDriver {
+  public enum State { START, SCHEMA, RUN, END, FAILED, CLOSED }
+
+  private OperatorDriver.State state = State.START;
+
+  /**
+   * Operator context. The driver "owns" the context and is responsible
+   * for closing it.
+   */
+
+  private final OperatorContext opContext;
+  private final OperatorExec operatorExec;
+  private final BatchAccessor batchAccessor;
+  private int schemaVersion;
+
+  public OperatorDriver(OperatorContext opServicees, OperatorExec opExec) {
+this.opContext = opServicees;
+this.operatorExec = opExec;
+batchAccessor = operatorExec.batchAccessor();
+  }
+
+  /**
+   * Get the next batch. Performs initialization on the first call.
+   * @return the iteration outcome to send downstream
+   */
+
+  public IterOutcome next() {
+try {
+  switch (state) {
+  case START:
+return start();
+  case RUN:
+return doNext();
+      default:
+        OperatorRecordBatch.logger.debug("Extra call to next() in state " + state + ": " + operatorLabel());
+        return IterOutcome.NONE;
+  }
+} catch (UserException e) {
+  cancelSilently();
+  state = State.FAILED;
+  throw e;
+} catch (Throwable t) {
+  cancelSilently();
+  state = State.FAILED;
+  throw UserException.executionError(t)
+.addContext("Exception thrown from", operatorLabel())
+.build(OperatorRecordBatch.logger);
+}
+  }
+
+  /**
+   * Cancels the operator before reaching EOF.
+   */
+
+  public void cancel() {
+try {
+  switch (state) {
+  case START:
+  case RUN:
+cancelSilently();
+break;
+  default:
+break;
+  }
+} finally {
+  state = State.FAILED;
+}
+  }
+
+ /**
+   * Start the operator executor. Bind it to the various contexts.
+   * Then start the executor and fetch the first schema.
+   * @return result of the first batch, which should contain
+   * only a schema, or EOF
+   */
+
+  private IterOutcome start() {
+state = State.SCHEMA;
+if (operatorExec.buildSchema()) {
+  schemaVersion = batchAccessor.schemaVersion();
+  state = State.RUN;
+  return IterOutcome.OK_NEW_SCHEMA;
+} else {
+  state = State.END;
+  return IterOutcome.NONE;
+}
+  }
+
+  /**
+   * Fetch a record batch, detecting EOF and a new schema.
+   * @return the IterOutcome for the above cases
+   */
+
+  private IterOutcome doNext() {
+if (! operatorExec.next()) {
+  state = State.END;
+  return IterOutcome.NONE;
+}
+int newVersion = batchAccessor.schemaVersion();
+if (newVersion != schemaVersion) {
+  schemaVersion = newVersion;
+  return IterOutcome.OK_NEW_SCHEMA;
+}
+return IterOutcome.OK;
+  }
+
+  /**
+   * Implement a cancellati

[GitHub] drill pull request #1121: DRILL-6153: Operator framework

2018-02-14 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1121#discussion_r168331779
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java
 ---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.protocol;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.record.RecordBatch.IterOutcome;
+
+/**
+ * State machine that drives the operator executable. Converts
+ * between the iterator protocol and the operator executable protocol.
+ * Implemented as a separate class in anticipation of eventually
+ * changing the record batch (iterator) protocol.
+ */
+
+public class OperatorDriver {
+  public enum State { START, SCHEMA, RUN, END, FAILED, CLOSED }
--- End diff --

Do we need the SCHEMA state, given that we move from START to RUN, END, or 
FAILED? Maybe combine the START and SCHEMA states into something like 
GET_SCHEMA? Also, would it be good to have two separate states, FAILED and 
CANCELLED, to differentiate whether the query was cancelled or failed due to 
an error?
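
Something like this, maybe (the state names are just suggestions):

  public enum State { GET_SCHEMA, RUN, END, FAILED, CANCELLED, CLOSED }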


---


[GitHub] drill pull request #1121: DRILL-6153: Operator framework

2018-02-14 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1121#discussion_r168337843
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java
 ---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.protocol;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.record.RecordBatch.IterOutcome;
+
+/**
+ * State machine that drives the operator executable. Converts
+ * between the iterator protocol and the operator executable protocol.
+ * Implemented as a separate class in anticipation of eventually
+ * changing the record batch (iterator) protocol.
+ */
+
+public class OperatorDriver {
+  public enum State { START, SCHEMA, RUN, END, FAILED, CLOSED }
+
+  private OperatorDriver.State state = State.START;
+
+  /**
+   * Operator context. The driver "owns" the context and is responsible
+   * for closing it.
+   */
+
+  private final OperatorContext opContext;
+  private final OperatorExec operatorExec;
+  private final BatchAccessor batchAccessor;
+  private int schemaVersion;
+
+  public OperatorDriver(OperatorContext opServicees, OperatorExec opExec) {
+this.opContext = opServicees;
+this.operatorExec = opExec;
+batchAccessor = operatorExec.batchAccessor();
+  }
+
+  /**
+   * Get the next batch. Performs initialization on the first call.
+   * @return the iteration outcome to send downstream
+   */
+
+  public IterOutcome next() {
+try {
+  switch (state) {
+  case START:
+return start();
+  case RUN:
+return doNext();
+   default:
+OperatorRecordBatch.logger.debug("Extra call to next() in state " 
+ state + ": " + operatorLabel());
+return IterOutcome.NONE;
+  }
+} catch (UserException e) {
+  cancelSilently();
+  state = State.FAILED;
+  throw e;
+} catch (Throwable t) {
+  cancelSilently();
+  state = State.FAILED;
+  throw UserException.executionError(t)
+.addContext("Exception thrown from", operatorLabel())
+.build(OperatorRecordBatch.logger);
+}
+  }
+
+  /**
+   * Cancels the operator before reaching EOF.
+   */
+
+  public void cancel() {
+try {
+  switch (state) {
+  case START:
+  case RUN:
+cancelSilently();
+break;
+  default:
+break;
+  }
+} finally {
+  state = State.FAILED;
+}
+  }
+
+ /**
+   * Start the operator executor. Bind it to the various contexts.
+   * Then start the executor and fetch the first schema.
+   * @return result of the first batch, which should contain
+   * only a schema, or EOF
+   */
+
+  private IterOutcome start() {
--- End diff --

It would be good if we could capture what exceptions each of these functions 
might throw.
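
For example, a @throws clause on each method (a documentation sketch only):

  /**
   * Start the operator executor and fetch the first schema.
   *
   * @return the outcome of the first (schema-only) batch
   * @throws UserException if the operator fails while starting or
   *         building the schema
   */
  private IterOutcome start() { ... }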


---


[GitHub] drill pull request #1121: DRILL-6153: Operator framework

2018-02-14 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1121#discussion_r168325261
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorExec.java
 ---
@@ -0,0 +1,127 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.protocol;
+
+import org.apache.drill.exec.ops.OperatorContext;
+
+/**
+ * Core protocol for a Drill operator execution.
+ *
+ * Lifecycle
+ *
+ * 
+ * Creation via an operator-specific constructor in the
+ * corresponding RecordBatchCreator.
+ * bind() called to provide the operator services.
+ * buildSchema() called to define the schema before
+ * fetching the first record batch.
+ * next() called repeatedly to prepare each new record
+ * batch until EOF or until cancellation.
+ * cancel() called if the operator should quit early.
+ * close() called to release resources. Note that
+ * close() is called in response to:
+ *   EOF
+ *   After cancel()
+ *   After an exception is thrown.
+ * 
+ *
+ * Error Handling
+ *
+ * Any method can throw an (unchecked) exception. (Drill does not use
+ * checked exceptions.) Preferably, the code will throw a
+ * UserException that explains the error to the user. If any
+ * other kind of exception is thrown, then the enclosing class wraps it
+ * in a generic UserException that indicates that "something went
+ * wrong", which is less than ideal.
+ *
+ * Result Set
+ * The operator "publishes" a result set in response to returning
+ * true from next() by populating a
+ * {@link BatchAccessor} provided via {@link #batchAccessor()}. For
+ * compatibility with other Drill operators, the set of vectors within
+ * the batch must be the same from one batch to the next.
+ */
+
+public interface OperatorExec {
+
+  /**
+   * Bind this operator to the context. The context provides access
+   * to per-operator, per-fragment and per-Drillbit services.
+   * Also provides access to the operator definition (AKA "pop
+   * config") for this operator.
+   *
+   * @param context operator context
+   */
+
+  public void bind(OperatorContext context);
+
+  /**
+   * Provides a generic access mechanism to the batch's output data.
+   * This method is called after a successful return from
+   * {@link #buildSchema()} and {@link #next()}. The batch itself
+   * can be held in a standard {@link VectorContainer}, or in some
+   * other structure more convenient for this operator.
+   *
+   * @return the access for the batch's output container
+   */
+
+  BatchAccessor batchAccessor();
+
+  /**
+   * Retrieves the schema of the batch before the first actual batch
+   * of data. The schema is returned via an empty batch (no rows,
+   * only schema) from {@link #batchAccessor()}.
+   *
+   * @return true if a schema is available, false if the operator
+   * reached EOF before a schema was found
+   */
+
+  boolean buildSchema();
+
+  /**
+   * Retrieves the next batch of data. The data is returned via
+   * the {@link #batchAccessor()} method.
+   *
+   * @return true if another batch of data is available, false if
+   * EOF was reached and no more data is available
+   */
+
+  boolean next();
+
+  /**
+   * Alerts the operator that the query was cancelled. Generally
+   * optional, but allows the operator to realize that a cancellation
--- End diff --

Why is this optional?


---


[GitHub] drill pull request #1121: DRILL-6153: Operator framework

2018-02-14 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1121#discussion_r168325765
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorRecordBatch.java
 ---
@@ -0,0 +1,167 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.protocol;
+
+import java.util.Iterator;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.ops.FragmentContextInterface;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.physical.base.PhysicalOperator;
+import org.apache.drill.exec.record.BatchSchema;
+import org.apache.drill.exec.record.CloseableRecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.record.WritableBatch;
+import org.apache.drill.exec.record.selection.SelectionVector2;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+
+/**
+ * Modular implementation of the standard Drill record batch iterator
+ * protocol. The protocol has two parts: control of the operator and
+ * access to the record batch. Each is encapsulated in separate
+ * implementation classes to allow easier customization for each
+ * situation. The operator internals are, themselves, abstracted to
+ * yet another class with the steps represented as method calls rather
+ * than as internal states as in the record batch iterator protocol.
+ * 
+ * Note that downstream operators make an assumption that the
+ * same vectors will appear from one batch to the next. That is,
+ * not only must the schema be the same, but if column "a" appears
+ * in two batches, the same value vector must back "a" in both
+ * batches. The TransferPair abstraction fails if different
+ * vectors appear across batches.
+ */
+
+public class OperatorRecordBatch implements CloseableRecordBatch {
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(OperatorRecordBatch.class);
+
+  private final OperatorDriver driver;
+  private final BatchAccessor batchAccessor;
+
+  public OperatorRecordBatch(FragmentContext context, PhysicalOperator 
config, OperatorExec opExec) {
+OperatorContext opContext = context.newOperatorContext(config);
+opContext.getStats().startProcessing();
+
+// Chicken-and-egg binding: the two objects must know about each 
other. Pass the
+// context to the operator exec via a bind method.
+
+try {
+  opExec.bind(opContext);
+  driver = new OperatorDriver(opContext, opExec);
+  batchAccessor = opExec.batchAccessor();
+} catch (UserException e) {
+  opContext.close();
+  throw e;
+} catch (Throwable t) {
+  opContext.close();
+  throw UserException.executionError(t)
+.addContext("Exception thrown from", 
opExec.getClass().getSimpleName() + ".bind()")
+.build(logger);
+}
+finally {
+  opContext.getStats().stopProcessing();
+}
+  }
+
+  @Override
+  public FragmentContext getContext() {
+
+// Backward compatibility with the full server context. Awkward for 
testing
+
+FragmentContext fragmentContext = fragmentContext();
+if (fragmentContext instanceof FragmentContext) {
--- End diff --

It's not clear what we are doing here. Why can't we just return 
fragmentContext?


---


[GitHub] drill pull request #1121: DRILL-6153: Operator framework

2018-02-14 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1121#discussion_r168317696
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java
 ---
@@ -0,0 +1,183 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.protocol;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.ops.OperatorContext;
+import org.apache.drill.exec.record.RecordBatch.IterOutcome;
+
+/**
+ * State machine that drives the operator executable. Converts
+ * between the iterator protocol and the operator executable protocol.
+ * Implemented as a separate class in anticipation of eventually
+ * changing the record batch (iterator) protocol.
+ */
+
+public class OperatorDriver {
+  public enum State { START, SCHEMA, RUN, END, FAILED, CLOSED }
+
+  private OperatorDriver.State state = State.START;
+
+  /**
+   * Operator context. The driver "owns" the context and is responsible
+   * for closing it.
+   */
+
+  private final OperatorContext opContext;
+  private final OperatorExec operatorExec;
+  private final BatchAccessor batchAccessor;
+  private int schemaVersion;
+
+  public OperatorDriver(OperatorContext opServicees, OperatorExec opExec) {
--- End diff --

Typo? opServices instead of opServicees?


---


[GitHub] drill pull request #1112: DRILL-6114: Metadata revisions

2018-02-14 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1112#discussion_r168289206
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/metadata/AbstractColumnMetadata.java
 ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.record.metadata;
+
+import org.apache.drill.common.types.TypeProtos.DataMode;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.record.MaterializedField;
+
+/**
+ * Abstract definition of column metadata. Allows applications to create
+ * specialized forms of a column metadata object by extending from this
+ * abstract class.
+ * 
+ * Note that, by design, primitive columns do not have a link to their
+ * tuple parent, or their index within that parent. This allows the same
+ * metadata to be shared between two views of a tuple, perhaps physical
+ * and projected views. This restriction does not apply to map columns,
+ * since maps (and the row itself) will, by definition, differ between
+ * the two views.
+ */
+
+public abstract class AbstractColumnMetadata implements ColumnMetadata {
+
+  // Capture the key schema information. We cannot use the 
MaterializedField
+  // or MajorType because they encode child information that we encode here
+  // as a child schema. Keeping the two in sync is nearly impossible.
+
+  protected final String name;
+  protected final MinorType type;
+  protected final DataMode mode;
+  protected final int precision;
+  protected final int scale;
+  protected boolean projected = true;
+
+  /**
+   * Predicted number of elements per array entry. Default is
+   * taken from the often hard-coded value of 10.
+   */
+
+  protected int expectedElementCount = 1;
+
+  public AbstractColumnMetadata(MaterializedField schema) {
+name = schema.getName();
+MajorType majorType = schema.getType();
+type = majorType.getMinorType();
+mode = majorType.getMode();
+precision = majorType.getPrecision();
+scale = majorType.getScale();
+if (isArray()) {
+  expectedElementCount = DEFAULT_ARRAY_SIZE;
+}
+  }
+
+  public AbstractColumnMetadata(String name, MinorType type, DataMode 
mode) {
+this.name = name;
+this.type = type;
+this.mode = mode;
+precision = 0;
+scale = 0;
+if (isArray()) {
+  expectedElementCount = DEFAULT_ARRAY_SIZE;
+}
+  }
+
+  public AbstractColumnMetadata(AbstractColumnMetadata from) {
+name = from.name;
+type = from.type;
+mode = from.mode;
+precision = from.precision;
+scale = from.scale;
+expectedElementCount = from.expectedElementCount;
+  }
+
+  protected void bind(TupleSchema parentTuple) { }
+
+  @Override
+  public String name() { return name; }
+
+  @Override
+  public MinorType type() { return type; }
+
+  @Override
+  public MajorType majorType() {
+return MajorType.newBuilder()
+.setMinorType(type())
+.setMode(mode())
+.build();
+  }
+
+  @Override
+  public DataMode mode() { return mode; }
+
+  @Override
+  public boolean isNullable() { return mode() == DataMode.OPTIONAL; }
+
+  @Override
+  public boolean isArray() { return mode() == DataMode.REPEATED; }
+
+  @Override
+  public int dimensions() { return isArray() ? 1 : 0; }
+
+  @Override
+  public boolean isMap() { return false; }
+
+  @Override
+  public boolean isVariant() { return false; }
--- End diff --

It seems like Variant is the new name for the old Union. Can we add a comment 
here?
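
For example (a sketch of the suggested comment, based on the StructureType 
javadoc in the related ColumnMetadata change):

  /**
   * True for variant columns: UNION columns and (non-repeated) LIST
   * columns, which are essentially repeated unions.
   */
  @Override
  public boolean isVariant() { return false; }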


---


[GitHub] drill pull request #1112: DRILL-6114: Metadata revisions

2018-02-14 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1112#discussion_r168291276
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/metadata/MetadataUtils.java
 ---
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.record.metadata;
+
+import java.util.List;
+
+import org.apache.drill.common.types.TypeProtos.DataMode;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.record.BatchSchema;
+import org.apache.drill.exec.record.MaterializedField;
+
+public class MetadataUtils {
+
+  public static TupleSchema fromFields(Iterable<MaterializedField> fields) {
+TupleSchema tuple = new TupleSchema();
+for (MaterializedField field : fields) {
+  tuple.add(field);
+}
+return tuple;
+  }
+
+  /**
+   * Create a column metadata object that holds the given
+   * {@link MaterializedField}. The type of the object will be either a
+   * primitive or map column, depending on the field's type. The logic
+   * here mimics the code as written, which is very messy in some places.
+   *
+   * @param field the materialized field to wrap
+   * @return the column metadata that wraps the field
+   */
+
+  public static AbstractColumnMetadata fromField(MaterializedField field) {
+MinorType type = field.getType().getMinorType();
+switch (type) {
+case MAP:
+  return MetadataUtils.newMap(field);
+case UNION:
+  if (field.getType().getMode() != DataMode.OPTIONAL) {
+throw new UnsupportedOperationException(type.name() + " type must 
be nullable");
+  }
+  return new VariantColumnMetadata(field);
+case LIST:
+  switch (field.getType().getMode()) {
+  case OPTIONAL:
+
+// List of unions (or a degenerate union of a single type.)
+// Not supported in Drill.
--- End diff --

If it is not supported, should we throw an exception?
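
For example (a sketch of the suggestion; the exception type and message are 
illustrative):

  case OPTIONAL:
    // List of unions (or a degenerate union of a single type)
    // is not supported in Drill, so fail fast.
    throw new UnsupportedOperationException(
        "Non-repeated LIST (list of unions) is not supported");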


---


[GitHub] drill pull request #1112: DRILL-6114: Metadata revisions

2018-02-14 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1112#discussion_r168301865
  
--- Diff: 
exec/vector/src/main/java/org/apache/drill/exec/record/metadata/ColumnMetadata.java
 ---
@@ -15,36 +15,115 @@
  * See the License for the specific language governing permissions and
  * limitations under the License.
  */
-package org.apache.drill.exec.record;
+package org.apache.drill.exec.record.metadata;
 
 import org.apache.drill.common.types.TypeProtos.DataMode;
 import org.apache.drill.common.types.TypeProtos.MajorType;
 import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.record.MaterializedField;
 
 /**
  * Metadata description of a column including names, types and structure
  * information.
  */
 
 public interface ColumnMetadata {
+
+  /**
+   * Rough characterization of Drill types into metadata categories.
+   * Various aspects of Drill's type system are very, very messy.
+   * However, Drill is defined by its code, not some abstract design,
+   * so the metadata system here does the best job it can to simplify
+   * the messy type system while staying close to the underlying
+   * implementation.
+   */
+
   enum StructureType {
-PRIMITIVE, LIST, TUPLE
+
+/**
+ * Primitive column (all types except List, Map and Union.)
+ * Includes (one-dimensional) arrays of those types.
+ */
+
+PRIMITIVE,
+
+/**
+ * Map or repeated map. Also describes the row as a whole.
+ */
+
+TUPLE,
+
+/**
+ * Union or (non-repeated) list. (A non-repeated list is,
+ * essentially, a repeated union.)
+ */
+
+VARIANT,
+
+/**
+ * A repeated list. A repeated list is not simply the repeated
+ * form of a list, it is something else entirely. It acts as
+ * a dimensional wrapper around any other type (except list)
+ * and adds a non-nullable extra dimension. Hence, this type is
+ * for 2D+ arrays.
+ * 
+ * In theory, a 2D list of, say, INT would be an INT column, but
+ * repeated in two dimensions. Alas, that is not how it is. Also,
+ * if we have a separate category for 2D lists, we should have
+ * a separate category for 1D lists. But, again, that is not how
+ * the code has evolved.
+ */
+
+MULTI_ARRAY
   }
 
-  public static final int DEFAULT_ARRAY_SIZE = 10;
+  int DEFAULT_ARRAY_SIZE = 10;
--- End diff --

Why is this changed from static final int to just int?
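
(Is it perhaps because fields declared in a Java interface are implicitly 
public static final, making the modifiers redundant? If so, 
int DEFAULT_ARRAY_SIZE = 10; inside the interface compiles to exactly the 
same constant either way.)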


---


[GitHub] drill pull request #1112: DRILL-6114: Metadata revisions

2018-02-14 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1112#discussion_r168295700
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/metadata/RepeatedListColumnMetadata.java
 ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.record.metadata;
+
+import org.apache.drill.common.types.TypeProtos.DataMode;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.record.MaterializedField;
+
+import com.google.common.base.Preconditions;
+
+public class RepeatedListColumnMetadata extends AbstractColumnMetadata {
+
+  private AbstractColumnMetadata childSchema;
+
+  public RepeatedListColumnMetadata(MaterializedField field) {
+super(field);
+Preconditions.checkArgument(field.getType().getMinorType() == 
MinorType.LIST);
+Preconditions.checkArgument(field.getType().getMode() == 
DataMode.REPEATED);
+Preconditions.checkArgument(field.getChildren().size() <= 1);
+if (! field.getChildren().isEmpty()) {
+  childSchema = 
MetadataUtils.fromField(field.getChildren().iterator().next());
+  Preconditions.checkArgument(childSchema.isArray());
+}
+  }
+
+  public RepeatedListColumnMetadata(String name, AbstractColumnMetadata 
childSchema) {
+super(name, MinorType.LIST, DataMode.REPEATED);
+if (childSchema != null) {
+  Preconditions.checkArgument(childSchema.isArray());
+}
+this.childSchema = childSchema;
+  }
+
+  public void childSchema(ColumnMetadata childMetadata) {
+Preconditions.checkState(childSchema == null);
+Preconditions.checkArgument(childMetadata.mode() == DataMode.REPEATED);
+childSchema = (AbstractColumnMetadata) childMetadata;
+  }
+
+  @Override
+  public StructureType structureType() { return StructureType.MULTI_ARRAY; 
}
+
+  @Override
+  public MaterializedField schema() {
+MaterializedField field = emptySchema();
+if (childSchema != null) {
+  field.addChild(childSchema.schema());
+}
+return field;
+  }
+
+  @Override
+  public MaterializedField emptySchema() {
+return MaterializedField.create(name(), majorType());
+  }
+
+  @Override
+  public ColumnMetadata cloneEmpty() {
+return new RepeatedListColumnMetadata(name, null);
+  }
+
+  @Override
+  public AbstractColumnMetadata copy() {
+return new RepeatedListColumnMetadata(name, childSchema);
+  }
+
+  @Override
+  public ColumnMetadata childSchema() { return childSchema; }
+
+  @Override
+  public int dimensions() {
+
+// If there is no child, then we don't know the
+// dimensionality.
+
+return childSchema == null ? -1
--- End diff --

Can we use a named static final constant instead of a bare -1?
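
For example (a sketch; the constant name is just a suggestion):

  public static final int UNKNOWN_DIMENSIONS = -1;
  ...
  return childSchema == null ? UNKNOWN_DIMENSIONS : ...;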


---


[GitHub] drill pull request #1112: DRILL-6114: Metadata revisions

2018-02-14 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1112#discussion_r168288500
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/metadata/AbstractColumnMetadata.java
 ---
@@ -0,0 +1,206 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.record.metadata;
+
+import org.apache.drill.common.types.TypeProtos.DataMode;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.record.MaterializedField;
+
+/**
+ * Abstract definition of column metadata. Allows applications to create
+ * specialized forms of a column metadata object by extending from this
+ * abstract class.
+ * 
+ * Note that, by design, primitive columns do not have a link to their
+ * tuple parent, or their index within that parent. This allows the same
+ * metadata to be shared between two views of a tuple, perhaps physical
+ * and projected views. This restriction does not apply to map columns,
+ * since maps (and the row itself) will, by definition, differ between
+ * the two views.
+ */
+
+public abstract class AbstractColumnMetadata implements ColumnMetadata {
+
+  // Capture the key schema information. We cannot use the 
MaterializedField
+  // or MajorType because they encode child information that we encode here
+  // as a child schema. Keeping the two in sync is nearly impossible.
+
+  protected final String name;
+  protected final MinorType type;
+  protected final DataMode mode;
+  protected final int precision;
+  protected final int scale;
+  protected boolean projected = true;
+
--- End diff --

I see that the current definition of MajorType also has a timeZone field. 
Don't we need that here?


---


[GitHub] drill issue #1107: DRILL-6123: Limit batch size for Merge Join based on memo...

2018-02-09 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1107
  
@paul-rogers Thank you very much for the review. Updated the PR with review 
comments taken care of. Please take a look. 


---


[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...

2018-02-09 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1107#discussion_r167380917
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/AbstractRecordBatchMemoryManager.java
 ---
@@ -0,0 +1,63 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.record;
+
+import org.apache.drill.exec.physical.impl.spill.RecordBatchSizer;
+import org.apache.drill.exec.vector.ValueVector;
+
+public abstract class AbstractRecordBatchMemoryManager {
+  private int outgoingRowWidth;
+  private int outputRowCount = ValueVector.MAX_ROW_COUNT;
+  protected static final int OFFSET_VECTOR_WIDTH = 4;
+  protected static final int WORST_CASE_FRAGMENTATION_FACTOR = 2;
+  protected static final int MAX_NUM_ROWS = ValueVector.MAX_ROW_COUNT;
+  protected static final int MIN_NUM_ROWS = 1;
+
+  public void update(int inputIndex) {};
+
+  public void update() {};
+
+  public int getOutputRowCount() {
+return outputRowCount;
+  }
+
+  /**
+   * Given batchSize and rowWidth, this will set output rowCount taking 
into account
+   * the min and max that is allowed.
+   */
+  public void setOutputRowCount(long batchSize, int rowWidth) {
+this.outputRowCount = Math.min(ValueVector.MAX_ROW_COUNT,
+  
Math.max(RecordBatchSizer.safeDivide(batchSize/WORST_CASE_FRAGMENTATION_FACTOR, 
rowWidth), MIN_NUM_ROWS));
--- End diff --

batchSize here is the configured outputBatchSize that we should conform to, 
not the incoming batch size. 
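
To make the formula concrete (illustrative numbers only): with a configured 
output batch size of 16 MB, the worst-case fragmentation factor of 2, and an 
average outgoing row width of 1,000 bytes, the target row count is 
min(64K, max(16 MB / 2 / 1,000, 1)) ≈ 8,389 rows; with a 100-byte row width 
the same formula yields about 83,886, which is capped at the 64K row maximum.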


---


[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...

2018-02-09 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1107#discussion_r167380761
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java
 ---
@@ -311,8 +311,8 @@ public static ColumnSize getColumn(ValueVector v, 
String prefix) {
 
   public RecordBatchSizer(RecordBatch batch) {
 this(batch,
- (batch.getSchema().getSelectionVectorMode() == 
BatchSchema.SelectionVectorMode.TWO_BYTE) ?
- batch.getSelectionVector2() : null);
+  (batch.getSchema() == null ? null : 
(batch.getSchema().getSelectionVectorMode() == 
BatchSchema.SelectionVectorMode.TWO_BYTE ?
--- End diff --

Yes, we can get empty batches with an empty schema, and I think it makes 
sense to add the check here instead of in the calling code. That way, it will 
be handled transparently underneath.


---


[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...

2018-02-09 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1107#discussion_r167380394
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java
 ---
@@ -137,10 +137,10 @@ public ColumnSize(ValueVector v, String prefix) {
   case MAP:
   case UNION:
 // No standard size for Union type
-dataSize = v.getPayloadByteCount(valueCount);
+dataSize = valueCount == 0 ? 0 : v.getPayloadByteCount(valueCount);
--- End diff --

When we get an empty batch, getPayloadByteCount will throw an out-of-bounds 
exception if the underlying method tries to read the value vector at index 0, 
i.e. when it is trying to read the offset vector. I added the check in those 
methods and removed it here.

I still have to retain the check here for the case where we are trying to 
read the offset vector.
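
For illustration, the kind of guard being described inside a variable-width 
vector (a sketch only, not the actual vector code):

  @Override
  public int getPayloadByteCount(int valueCount) {
    if (valueCount == 0) {
      return 0;   // empty batch: reading offsets[0] would go out of bounds
    }
    // The offset vector has valueCount + 1 entries, so this read is safe.
    return offsetVector.getAccessor().get(valueCount)
        + (valueCount + 1) * OFFSET_VECTOR_WIDTH;
  }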


---


[GitHub] drill pull request #1115: DRILL-6138: Move RecordBatchSizer to org.apache.dr...

2018-02-06 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/1115

DRILL-6138: Move RecordBatchSizer to org.apache.drill.exec.record pac…

…kage
Also, changed columnSizes in RecordBatchSizer from a list to a map so we can 
look up columns by field name.

@Ben-Zvi can you please review ?


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-6138

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1115.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1115


commit b4104e73ad37dee96e29e345b958412791ab9079
Author: Padma Penumarthy 
Date:   2018-02-06T05:41:45Z

DRILL-6138: Move RecordBatchSizer to org.apache.drill.exec.record package




---


[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...

2018-02-06 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1107#discussion_r166427892
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java
 ---
@@ -102,20 +105,78 @@
   private final List comparators;
   private final JoinRelType joinType;
   private JoinWorker worker;
+  private final long outputBatchSize;
 
   private static final String LEFT_INPUT = "LEFT INPUT";
   private static final String RIGHT_INPUT = "RIGHT INPUT";
 
+  private class MergeJoinMemoryManager extends 
AbstractRecordBatchMemoryManager {
+private int leftRowWidth;
+private int rightRowWidth;
+
+/**
+ * mergejoin operates on one record at a time from the left and right 
batches
+ * using RecordIterator abstraction. We have a callback mechanism to 
get notified
+ * when new batch is loaded in record iterator.
+ * This can get called in the middle of current output batch we are 
building.
+ * when this gets called, adjust number of output rows for the current 
batch and
+ * update the value to be used for subsequent batches.
+ */
+@Override
+public void update(int inputIndex) {
+  switch(inputIndex) {
+case 0:
+  final RecordBatchSizer leftSizer = new RecordBatchSizer(left);
+  leftRowWidth = leftSizer.netRowWidth();
+  break;
+case 1:
+  final RecordBatchSizer rightSizer = new RecordBatchSizer(right);
+  rightRowWidth = rightSizer.netRowWidth();
+default:
+  break;
+  }
+
+  final int newOutgoingRowWidth = leftRowWidth + rightRowWidth;
+
+  // If outgoing row width is 0, just return. This is possible for 
empty batches or
+  // when first set of batches come with OK_NEW_SCHEMA and no data.
+  if (newOutgoingRowWidth == 0) {
+return;
+  }
+
+  // update the value to be used for next batch(es)
+  setOutputRowCount(Math.min(ValueVector.MAX_ROW_COUNT,
+
Math.max(RecordBatchSizer.safeDivide(outputBatchSize/WORST_CASE_FRAGMENTATION_FACTOR,
 newOutgoingRowWidth), MIN_NUM_ROWS)));
+
+  // Adjust for the current batch.
+  // calculate memory used so far based on previous outgoing row width 
and how many rows we already processed.
+  final long memoryUsed = status.getOutPosition() * 
getOutgoingRowWidth();
+  // This is the remaining memory.
+  final long remainingMemory = 
Math.max(outputBatchSize/WORST_CASE_FRAGMENTATION_FACTOR - memoryUsed, 0);
+  // These are number of rows we can fit in remaining memory based on 
new outgoing row width.
+  final int numOutputRowsRemaining = 
RecordBatchSizer.safeDivide(remainingMemory, newOutgoingRowWidth);
+
+  final int adjustedOutputRowCount = Math.min(MAX_NUM_ROWS, 
Math.max(status.getOutPosition() + numOutputRowsRemaining, MIN_NUM_ROWS));
+  status.setOutputRowCount(adjustedOutputRowCount);
+  setOutgoingRowWidth(newOutgoingRowWidth);
--- End diff --

Yes, this is how it works.
We read from the left and right sides using the RecordIterator abstraction, 
which reads full record batches underneath and returns one record at a time 
from each next() call. I have a callback mechanism that fires when the record 
iterator loads a new batch, so the row widths can be adjusted. When I get the 
callback, I adjust the row count for the current output batch based on the 
memory remaining for it, and I also compute and save the row count to use for 
the next full batch. In innerNext, when we start working on the next output 
batch, I set the target output row count from this saved value.
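
To make the mid-batch adjustment concrete (illustrative numbers only, 
assuming a 16 MB configured output batch size and the worst-case 
fragmentation factor of 2, i.e. an 8 MB budget): if 2,000 rows have already 
been emitted at an old outgoing row width of 500 bytes, memoryUsed = 
1,000,000 bytes and remainingMemory = 8,388,608 - 1,000,000 = 7,388,608 
bytes; at a new outgoing row width of 700 bytes that leaves room for about 
10,555 more rows, so the current batch's target is adjusted to 
min(64K, 2,000 + 10,555) = 12,555 rows.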

Addressed all other code review comments. Please take a look when you get a 
chance.



---


[GitHub] drill issue #1112: DRILL-6114: Metadata revisions

2018-02-06 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1112
  
@paul-rogers I ran the pre commit tests. No issues. Everything passed. Will 
do one more time once code reviews are done.


---


[GitHub] drill issue #1107: DRILL-6123: Limit batch size for Merge Join based on memo...

2018-02-05 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1107
  
@sachouche @ilooner @paul-rogers Can one of you review this PR for me ?


---


[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...

2018-02-05 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r166142178
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java
 ---
@@ -65,6 +70,14 @@
 
 public int stdSize;
 
+/**
+ * If the we can determine the exact width of the row of a vector 
upfront,
+ * the row widths is saved here. If we cannot determine the exact width
+ * (for example for VarChar or Repeated vectors), then
+ */
+
+private int knownSize = -1;
--- End diff --

As I mentioned in the other comment, it seems like we can just use stdSize.


---


[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...

2018-02-05 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r166098364
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -140,6 +131,9 @@
   private OperatorContext oContext;
   private BufferAllocator allocator;
 
+  private Map keySizes;
+  // The size estimates for varchar value columns. The keys are the index 
of the varchar value columns.
+  private Map varcharValueSizes;
--- End diff --

Don't you need to adjust the size estimates for repeated types as well?


---


[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...

2018-02-05 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r166141274
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -733,28 +780,32 @@ private void restoreReservedMemory() {
* @param records
*/
   private void allocateOutgoing(int records) {
-// Skip the keys and only allocate for outputting the workspace values
-// (keys will be output through splitAndTransfer)
-Iterator<VectorWrapper<?>> outgoingIter = outContainer.iterator();
-for (int i = 0; i < numGroupByOutFields; i++) {
-  outgoingIter.next();
-}
-
 // try to preempt an OOM by using the reserved memory
 useReservedOutgoingMemory();
 long allocatedBefore = allocator.getAllocatedMemory();
 
-while (outgoingIter.hasNext()) {
+for (int columnIndex = numGroupByOutFields; columnIndex < 
outContainer.getNumberOfColumns(); columnIndex++) {
+  final VectorWrapper wrapper = 
outContainer.getValueVector(columnIndex);
   @SuppressWarnings("resource")
-  ValueVector vv = outgoingIter.next().getValueVector();
+  final ValueVector vv = wrapper.getValueVector();
 
-  AllocationHelper.allocatePrecomputedChildCount(vv, records, 
maxColumnWidth, 0);
+  final RecordBatchSizer.ColumnSize columnSizer = new 
RecordBatchSizer.ColumnSize(wrapper.getValueVector());
+  int columnSize;
+
+  if (columnSizer.hasKnownSize()) {
+// For fixed width vectors we know the size of each record
+columnSize = columnSizer.getKnownSize();
+  } else {
+// For var chars we need to use the input estimate
+columnSize = varcharValueSizes.get(columnIndex);
+  }
+
+  AllocationHelper.allocatePrecomputedChildCount(vv, records, 
columnSize, 0);
--- End diff --

I think we should also get the element count from the sizer and use that 
instead of passing 0.


---


[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...

2018-02-05 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r166096630
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -215,6 +206,7 @@ public BatchHolder() {
   MaterializedField outputField = materializedValueFields[i];
   // Create a type-specific ValueVector for this value
   vector = TypeHelper.getNewVector(outputField, allocator);
+  int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize;
--- End diff --

There is already stdSize, which does much the same thing. Can we use that 
instead of knownSize?


---


[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...

2018-02-05 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r166142507
  
--- Diff: exec/vector/src/main/codegen/templates/FixedValueVectors.java ---
@@ -298,6 +298,11 @@ public int getPayloadByteCount(int valueCount) {
 return valueCount * ${type.width};
   }
 
+  @Override
+  public int getValueWidth() {
--- End diff --

If we use stdSize and get the value from TypeHelper, we don't need all the 
value vectors to have this new function.


---


[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...

2018-02-05 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r166136279
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -226,7 +221,7 @@ public BatchHolder() {
 ((FixedWidthVector) vector).allocateNew(HashTable.BATCH_SIZE);
   } else if (vector instanceof VariableWidthVector) {
 // This case is never used  a varchar falls under 
ObjectVector which is allocated on the heap !
-((VariableWidthVector) vector).allocateNew(maxColumnWidth, 
HashTable.BATCH_SIZE);
+((VariableWidthVector) vector).allocateNew(columnSize, 
HashTable.BATCH_SIZE);
--- End diff --

For a just-allocated vector, estSize will return 0. How can we use that for 
allocation?


---


[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...

2018-02-01 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/1107

DRILL-6123: Limit batch size for Merge Join based on memory

Merge join limits the output batch size to 32K rows irrespective of row size. 
This can create large batches (in terms of memory), depending upon the 
average row width. Changed the logic to figure out the output row count based 
on the memory limit specified with the new outputBatchSize option and the 
average outgoing row width, where the average outgoing row width is the sum 
of the left and right batch row widths. The output row count is clamped to a 
minimum of 1 row and a maximum of 64K rows. 
Added an AbstractRecordBatchMemoryManager class to be used across all 
operators. Restructured the code a little bit for that.
  

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-6123

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1107.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1107


commit e01da78116730afbcd9b5062e316678cebc4848f
Author: Padma Penumarthy 
Date:   2018-01-31T00:58:57Z

DRILL-6123: Limit batch size for Merge Join based on memory




---


[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...

2018-01-31 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r165156589
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -215,6 +206,7 @@ public BatchHolder() {
   MaterializedField outputField = materializedValueFields[i];
   // Create a type-specific ValueVector for this value
   vector = TypeHelper.getNewVector(outputField, allocator);
+  int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize;
--- End diff --

@ilooner That is the point. If we know the exact value, why do we need 
RecordBatchSizer? We should use RecordBatchSizer when we need sizing 
information for a batch (in most cases, the incoming batch). In this case, 
you are allocating memory for the value vectors of the batch you are 
building. For fixed-width columns, you can get the column width for each type 
you are allocating using TypeHelper.getSize. For variable-width columns, 
TypeHelper.getSize assumes 50 bytes. If you want to adjust the memory you 
allocate for variable-width columns in the outgoing batch based on the 
incoming batch, that is when you use RecordBatchSizer on the actual incoming 
batch to figure out the average size of that column. You can also use 
RecordBatchSizer on the incoming batch to figure out how many values to 
allocate memory for in the outgoing batch. Note that, with your change, 
just-created value vectors for variable-width columns will return an estSize 
of 1, which is not what you want.
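
A rough sketch of the split being described (names taken from the code under 
review; treat the exact calls as illustrative):

  // Fixed-width output column: the width is known from the type alone.
  int fixedWidth = TypeHelper.getSize(outputField.getType());

  // Variable-width output column: estimate the average width from the
  // actual incoming batch instead.
  RecordBatchSizer sizer = new RecordBatchSizer(incoming);
  int estimatedWidth = sizer.getColumn("someVarcharColumn").estSize;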


---


[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...

2018-01-29 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164634551
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -215,6 +206,7 @@ public BatchHolder() {
   MaterializedField outputField = materializedValueFields[i];
   // Create a type-specific ValueVector for this value
   vector = TypeHelper.getNewVector(outputField, allocator);
+  int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize;
--- End diff --

Why not just use TypeHelper.getSize(outputField.getType())? It seems like you 
don't need RecordBatchSizer for this. estSize in RecordBatchSizer takes the 
actual memory allocation into account, i.e. it includes the overhead of 
unused vector space, and we need that value as well for different reasons. I 
also think it is doing the right thing by returning zero when there are no 
rows and no memory allocated.


---


[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...

2018-01-29 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1101#discussion_r164599681
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -215,6 +206,7 @@ public BatchHolder() {
   MaterializedField outputField = materializedValueFields[i];
   // Create a type-specific ValueVector for this value
   vector = TypeHelper.getNewVector(outputField, allocator);
+  int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize;
--- End diff --

I wonder what is wrong with estSize being 0 when there is no data. If there 
is no data for a column, why would we want to add its value width to the 
outgoing row width?


---


[GitHub] drill pull request #1091: DRILL-6071: Limit batch size for flatten operator

2018-01-26 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1091#discussion_r164231963
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java
 ---
@@ -94,8 +98,57 @@ private void clear() {
 }
   }
 
+  private class FlattenMemoryManager {
--- End diff --

Memory manager actually requires incoming batch. All the tests in 
testOutputBatch are exercising this code. I went down the path of writing tests 
just for memory manager. But, they are redundant and doing the same thing I am 
doing with the other tests. Also, did not see much point in validating output 
row count, since we are interested in batch size more than row count. Please 
let me know if you feel otherwise and I can include those tests as well. 


---


[GitHub] drill pull request #1091: DRILL-6071: Limit batch size for flatten operator

2018-01-26 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1091#discussion_r164233329
  
--- Diff: 
exec/java-exec/src/test/java/org/apache/drill/exec/physical/unit/TestOutputBatchSize.java
 ---
@@ -0,0 +1,498 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.physical.unit;
+
+import com.google.common.collect.Lists;
+import org.apache.drill.common.expression.SchemaPath;
+
+import org.apache.drill.exec.physical.base.AbstractBase;
+import org.apache.drill.exec.physical.base.PhysicalOperator;
+import org.apache.drill.exec.physical.config.FlattenPOP;
+import org.apache.drill.exec.physical.impl.ScanBatch;
+import org.apache.drill.exec.physical.impl.spill.RecordBatchSizer;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.VectorAccessible;
+import org.apache.drill.exec.util.JsonStringArrayList;
+import org.apache.drill.exec.util.JsonStringHashMap;
+import org.apache.drill.exec.util.Text;
+import org.junit.Ignore;
+import org.junit.Test;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.List;
+
+public class TestOutputBatchSize extends PhysicalOpUnitTestBase {
--- End diff --

I added a test case like this, testFlattenLargeRecords, and there are a bunch 
of other test cases as well.
All the tests verify the batch sizes and the number of batches.
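The checks have this general shape (illustrative only; sizeOf() stands in for however the test measures a batch, it is not an actual helper):

    int batchCount = 0;
    for (VectorAccessible batch : outputBatches) {
      // Every output batch must stay within the configured limit.
      assertTrue("batch over limit", sizeOf(batch) <= configuredOutputBatchSize);
      batchCount++;
    }
    assertEquals(expectedBatchCount, batchCount);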


---


[GitHub] drill issue #1091: DRILL-6071: Limit batch size for flatten operator

2018-01-26 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1091
  
@paul-rogers Thank you Paul. Ready for review. Please take a look when you 
get a chance. 


---


[GitHub] drill issue #1091: DRILL-6071: Limit batch size for flatten operator

2018-01-19 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1091
  
@paul-rogers Addressed code review comments. Added a new system/session 
option to configure output batch size. Please review. 


---


[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-16 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r161900034
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,286 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 0) {
+  matcherFcn = new MatcherZero();
+} else if (patternLength == 1) {
+  matcherFcn = new MatcherOne();
+} else if (patternLength == 2) {
+  matcherFcn = new MatcherTwo();
+} else if (patternLength == 3) {
+  matcherFcn = new MatcherThree();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherZero extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
+private MatcherZero() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
   return 1;
 }
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherOne extends MatcherFcn {
+
+private MatcherOne() {
+  super();
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
+
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length two */
+  private final class MatcherTwo extends MatcherFcn {
+
+private MatcherTwo() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
+
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
 
-final int txtLength = end - start;
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
+}
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length three */
+  private final class MatcherT

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-16 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r161895458
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,286 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 0) {
+  matcherFcn = new MatcherZero();
+} else if (patternLength == 1) {
+  matcherFcn = new MatcherOne();
+} else if (patternLength == 2) {
+  matcherFcn = new MatcherTwo();
+} else if (patternLength == 3) {
+  matcherFcn = new MatcherThree();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherZero extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
+private MatcherZero() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
   return 1;
 }
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherOne extends MatcherFcn {
+
+private MatcherOne() {
+  super();
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
--- End diff --

Can we name it firstPatternByte?


---


[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-16 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r161906070
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,286 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 0) {
+  matcherFcn = new MatcherZero();
+} else if (patternLength == 1) {
+  matcherFcn = new MatcherOne();
+} else if (patternLength == 2) {
+  matcherFcn = new MatcherTwo();
+} else if (patternLength == 3) {
+  matcherFcn = new MatcherThree();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherZero extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
+private MatcherZero() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
   return 1;
 }
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherOne extends MatcherFcn {
+
+private MatcherOne() {
+  super();
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
+
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length two */
+  private final class MatcherTwo extends MatcherFcn {
+
+private MatcherTwo() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
+
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
 
-final int txtLength = end - start;
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
--- End diff --

Add a space between + and 1.


---


[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-16 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r161899357
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,286 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 0) {
+  matcherFcn = new MatcherZero();
+} else if (patternLength == 1) {
+  matcherFcn = new MatcherOne();
+} else if (patternLength == 2) {
+  matcherFcn = new MatcherTwo();
+} else if (patternLength == 3) {
+  matcherFcn = new MatcherThree();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherZero extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
+private MatcherZero() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
   return 1;
 }
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherOne extends MatcherFcn {
+
+private MatcherOne() {
+  super();
--- End diff --

Redundant? The no-arg super() call is implicit.


---


[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-16 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r161905865
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,286 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 0) {
+  matcherFcn = new MatcherZero();
+} else if (patternLength == 1) {
+  matcherFcn = new MatcherOne();
+} else if (patternLength == 2) {
+  matcherFcn = new MatcherTwo();
+} else if (patternLength == 3) {
+  matcherFcn = new MatcherThree();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherZero extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
+private MatcherZero() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
   return 1;
 }
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherOne extends MatcherFcn {
+
+private MatcherOne() {
+  super();
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
+
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length two */
+  private final class MatcherTwo extends MatcherFcn {
+
+private MatcherTwo() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
--- End diff --

Can you initialize them in the constructor of the matcher function? That way 
you initialize once and reuse them for each match. 
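Something along these lines (a sketch of the suggestion against the quoted MatcherTwo):

    private final class MatcherTwo extends MatcherFcn {
      private final byte firstPatternByte;
      private final byte secondPatternByte;

      private MatcherTwo() {
        // Read the pattern bytes once, at construction time,
        // instead of on every call to match().
        firstPatternByte  = patternArray[0];
        secondPatternByte = patternArray[1];
      }

      @Override
      protected int match(int start, int end, DrillBuf drillBuf) {
        final int lengthToProcess = end - start - 1;
        for (int idx = 0; idx < lengthToProcess; idx++) {
          if (drillBuf.getByte(start + idx) == firstPatternByte
              && drillBuf.getByte(start + idx + 1) == secondPatternByte) {
            return 1;
          }
        }
        return 0;
      }
    }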


---


[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-16 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r161895158
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,286 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 0) {
+  matcherFcn = new MatcherZero();
+} else if (patternLength == 1) {
+  matcherFcn = new MatcherOne();
+} else if (patternLength == 2) {
+  matcherFcn = new MatcherTwo();
+} else if (patternLength == 3) {
+  matcherFcn = new MatcherThree();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
--- End diff --

Should be length zero.


---


[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2018-01-16 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r161899883
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,286 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 0) {
+  matcherFcn = new MatcherZero();
+} else if (patternLength == 1) {
+  matcherFcn = new MatcherOne();
+} else if (patternLength == 2) {
+  matcherFcn = new MatcherTwo();
+} else if (patternLength == 3) {
+  matcherFcn = new MatcherThree();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherZero extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
+private MatcherZero() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
   return 1;
 }
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherOne extends MatcherFcn {
+
+private MatcherOne() {
+  super();
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
+
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length two */
+  private final class MatcherTwo extends MatcherFcn {
+
+private MatcherTwo() {
+}
+
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
+
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
 
-final int txtLength = end - start;
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
+}
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length three */
+  private final class MatcherT

[GitHub] drill issue #1091: DRILL-6071: Limit batch size for flatten operator

2018-01-16 Thread ppadma
Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/1091
  
@paul-rogers Paul, can you please review this PR? 


---


[GitHub] drill pull request #1091: DRILL-6071: Limit batch size for flatten operator

2018-01-14 Thread ppadma
GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/1091

DRILL-6071: Limit batch size for flatten operator

Please see DRILL-6071 for details.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-6071

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1091.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1091


commit 8c7d60440e41efeab53fe78d9db579e7f85be149
Author: Padma Penumarthy 
Date:   2018-01-10T13:06:58Z

DRILL-6071: Limit batch size for flatten operator




---


[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158033645
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
 
   // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
   // so, we can just directly compare.
-  for (int patternIndex = 0; patternIndex < patternLength; 
patternIndex++) {
-if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start 
+ txtIndex + patternIndex)) {
-  continue outer;
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
+
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
 }
   }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length three */
+  private final class Matcher3 extends MatcherFcn {
 
 

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158029512
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
--- End diff --

This is a problem: it can become negative if end is the same as start, i.e. 
empty text. 
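One way to guard it (illustrative):

    // Clamp so an empty input (end == start) cannot drive the length negative.
    final int lengthToProcess = Math.max(0, end - start - 1);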


---


[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158025793
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
--- End diff --

Use a switch instead of the if chain?
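For example (a sketch against the class names in this diff):

    switch (patternLength) {
      case 1:
        matcherFcn = new Matcher1();
        break;
      case 2:
        matcherFcn = new Matcher2();
        break;
      case 3:
        matcherFcn = new Matcher3();
        break;
      default:
        // Short patterns scan directly; long ones amortize the
        // skip-table setup cost.
        matcherFcn = patternLength < 10 ? new MatcherN() : new BoyerMooreMatcher();
        break;
    }

And since the default branch falls back to a Boyer-Moore style matcher for long patterns, a minimal Boyer-Moore-Horspool sketch of the underlying idea (not the PR's actual implementation):

    // Precompute, per byte value, how far the window may shift when the
    // last byte of the window mismatches; then scan each window right-to-left.
    static int indexOf(byte[] text, byte[] pattern) {
      int[] shift = new int[256];
      java.util.Arrays.fill(shift, pattern.length);
      for (int i = 0; i < pattern.length - 1; i++) {
        shift[pattern[i] & 0xff] = pattern.length - 1 - i;
      }
      int pos = 0;
      while (pos <= text.length - pattern.length) {
        int j = pattern.length - 1;
        while (j >= 0 && text[pos + j] == pattern[j]) {
          j--;
        }
        if (j < 0) {
          return pos;        // match found at pos
        }
        pos += shift[text[pos + pattern.length - 1] & 0xff];
      }
      return -1;             // no match
    }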


---


[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158032627
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
--- End diff --

Is this true for a null (empty) pattern?
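For what it's worth (illustrative): hasArray() depends on how the buffer was created, not on the pattern length, so the assert holds for heap buffers even when empty, but not for direct buffers:

    import java.nio.ByteBuffer;

    ByteBuffer heap = ByteBuffer.wrap(new byte[0]);
    heap.hasArray();                   // true, even for length zero
    ByteBuffer direct = ByteBuffer.allocateDirect(0);
    direct.hasArray();                 // false: no accessible backing array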


---


[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158034838
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
 
   // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
   // so, we can just directly compare.
-  for (int patternIndex = 0; patternIndex < patternLength; 
patternIndex++) {
-if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start 
+ txtIndex + patternIndex)) {
-  continue outer;
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
+
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
 }
   }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length three */
+  private final class Matcher3 extends MatcherFcn {
 
 

[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...

2017-12-20 Thread ppadma
Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/1072#discussion_r158030478
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java
 ---
@@ -19,44 +19,283 @@
 
 import io.netty.buffer.DrillBuf;
 
-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends 
AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;
 
   public SqlPatternContainsMatcher(String patternString) {
 super(patternString);
+
+// Pattern matching is 1) a CPU intensive operation and 2) pattern and 
input dependent. The conclusion is
+// that there is no single implementation that can do it all well. So, 
we use multiple implementations
+// chosen based on the pattern length.
+if (patternLength == 1) {
+  matcherFcn = new Matcher1();
+} else if (patternLength == 2) {
+  matcherFcn = new Matcher2();
+} else if (patternLength == 3) {
+  matcherFcn = new Matcher3();
+} else if (patternLength < 10) {
+  matcherFcn = new MatcherN();
+} else {
+  matcherFcn = new BoyerMooreMatcher();
+}
   }
 
   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+return matcherFcn.match(start, end, drillBuf);
+  }
+
+  
//--
+  // Inner Data Structure
+  // 
--
+
+  /** Abstract matcher class to allow us pick the most efficient 
implementation */
+  private abstract class MatcherFcn {
+protected final byte[] patternArray;
+
+protected MatcherFcn() {
+  assert patternByteBuffer.hasArray();
+
+  patternArray = patternByteBuffer.array();
+}
+
+/**
+ * @return 1 if the pattern was matched; 0 otherwise
+ */
+protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {
 
-if (patternLength == 0) { // Everything should match for null pattern 
string
-  return 1;
+private Matcher1() {
+  super();
 }
 
-final int txtLength = end - start;
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start;
+  final byte firstPattByte  = patternArray[0];
 
-// no match if input string length is less than pattern length
-if (txtLength < patternLength) {
+  // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
+  // so, we can just directly compare.
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+byte inputByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != inputByte) {
+  continue;
+}
+return 1;
+  }
   return 0;
 }
+  }
 
+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {
 
-final int outerEnd = txtLength - patternLength;
+private Matcher2() {
+  super();
+}
 
-outer:
-for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+/** {@inheritDoc} */
+@Override
+protected final int match(int start, int end, DrillBuf drillBuf) {
+  final int lengthToProcess = end - start - 1;
+  final byte firstPattByte  = patternArray[0];
+  final byte secondPattByte = patternArray[1];
 
   // simplePattern string has meta characters i.e % and _ and escape 
characters removed.
   // so, we can just directly compare.
-  for (int patternIndex = 0; patternIndex < patternLength; 
patternIndex++) {
-if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start 
+ txtIndex + patternIndex)) {
-  continue outer;
+  for (int idx = 0; idx < lengthToProcess; idx++) {
+final byte firstInByte = drillBuf.getByte(start + idx);
+
+if (firstPattByte != firstInByte) {
+  continue;
+} else {
+  final byte secondInByte = drillBuf.getByte(start + idx +1);
+
+  if (secondInByte == secondPattByte) {
+return 1;
+  }
 }
   }
+  return 0;
+}
+  }
+
+  /** Handles patterns with length three */
+  private final class Matcher3 extends MatcherFcn {
 
 
