[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1228#discussion_r184483395 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java --- @@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) { */ private int netRowWidth; private int netRowWidthCap50; + + /** + * actual row size if input is not empty. Otherwise, standard size. + */ + private int rowAllocSize; --- End diff -- @paul-rogers yes, this PR is trying to do the second thing, i.e. make the best guess with no knowledge at all about lengths. Also, the right side can produce no rows for reasons other than an empty array; for example, we might be filtering everything out after unnest. Please let me know what you think about the approach. ---
[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1228#discussion_r184258865 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java --- @@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) { */ private int netRowWidth; private int netRowWidthCap50; + + /** + * actual row size if input is not empty. Otherwise, standard size. + */ + private int rowAllocSize; --- End diff -- actually, this is a problem only for lateral join. In lateral join, the right side works on one row at a time from the left side. If the right side produces zero rows because of an empty array or some other reason, for a left outer join we still have to finish the cross join for that left-side row by emitting nulls for the right-side columns. Then we go to the next row on the left side, continuing to fill the output batch till it is full. What that means is we have to allocate vectors based on that first batch on the right side, which can be empty. ---
[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1228#discussion_r184236170 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java --- @@ -277,18 +286,29 @@ public boolean isRepeatedList() { /** * This is the average per entry width, used for vector allocation. */ -public int getEntryWidth() { +private int getEntryWidthForAlloc() { int width = 0; if (isVariableWidth) { -width = getNetSizePerEntry() - OFFSET_VECTOR_WIDTH; +width = getAllocSizePerEntry() - OFFSET_VECTOR_WIDTH; // Subtract out the bits (is-set) vector width -if (metadata.getDataMode() == DataMode.OPTIONAL) { +if (isOptional) { width -= BIT_VECTOR_WIDTH; } + +if (isRepeated && getValueCount() == 0) { + return (safeDivide(width, STD_REPETITION_FACTOR)); --- End diff -- You are right. Row count can be non-zero while valueCount is still zero. Since this is intended for the empty batch case, I changed this to check for row count instead. ---
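A minimal sketch of the fallback discussed above, with hypothetical names (not Drill's actual `RecordBatchSizer` code): when the incoming batch is empty, there is no observed cardinality to average over, so the per-entry width of a repeated column falls back to dividing by a standard repetition factor.

```java
// Hypothetical sketch of the empty-batch fallback for per-entry width.
// STD_REPETITION_FACTOR of 5 matches the value settled on later in this
// review thread; the method names are illustrative only.
public class EntryWidthSketch {
    static final int STD_REPETITION_FACTOR = 5;

    // Guarded divide: returns 0 when the divisor is 0, rounds up otherwise.
    static int safeDivide(int num, float denom) {
        return denom == 0 ? 0 : (int) Math.ceil(num / denom);
    }

    static int entryWidthForAlloc(int width, boolean isRepeated,
                                  int rowCount, float cardinality) {
        if (isRepeated && rowCount == 0) {
            // Empty batch: no data to average over, use the standard guess.
            return safeDivide(width, STD_REPETITION_FACTOR);
        }
        // Non-empty batch: the observed per-row cardinality wins.
        return safeDivide(width, cardinality);
    }

    public static void main(String[] args) {
        // Empty repeated column: 40 bytes of width / factor 5 = 8.
        System.out.println(entryWidthForAlloc(40, true, 0, 0f));
        // Non-empty: 40 bytes / observed cardinality 10 = 4.
        System.out.println(entryWidthForAlloc(40, true, 100, 10f));
    }
}
```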
[GitHub] drill issue #1228: DRILL-6307: Handle empty batches in record batch sizer co...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1228 @paul-rogers Paul, I addressed the code review comments. Can you take a look when you get a chance? ---
[GitHub] drill issue #1227: DRILL-6236: batch sizing for hash join
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1227 @Ben-Zvi I manually added the PR link to the JIRA. All code review comments are addressed. Can you look at the latest changes? ---
[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1228#discussion_r184202508 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java --- @@ -536,6 +556,11 @@ public ColumnSize getColumn(String name) { */ private int netRowWidth; private int netRowWidthCap50; + + /** + * actual row size if input is not empty. Otherwise, standard size. + */ + private int rowAllocSize; --- End diff -- This is not just a problem with size estimation for vector memory allocation. Say one side of a join receives an empty batch as its first batch. If we use a row width of 0 in the outgoing row width calculation, the number of rows we calculate (to include in the outgoing batch) will be too high, and later, when we get a non-empty batch, we might exceed the memory limits. ---
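The arithmetic behind that concern can be illustrated in miniature (hypothetical names and values, not Drill's actual sizing path): the outgoing row count is a memory limit divided by the estimated outgoing row width, so an empty side contributing width 0 inflates the computed row count.

```java
// Illustrative sketch: how a zero row width from an empty first batch
// inflates the outgoing row count and risks blowing the memory limit
// once a real batch arrives.
public class OutgoingRowCount {
    static int outgoingRowCount(int memoryLimit, int leftRowWidth, int rightRowWidth) {
        int outRowWidth = leftRowWidth + rightRowWidth;
        if (outRowWidth == 0) {
            return 0; // fully degenerate case; caller must pick a default
        }
        return memoryLimit / outRowWidth;
    }

    public static void main(String[] args) {
        int limit = 16 * 1024 * 1024; // 16 MiB output batch budget

        // Right side empty, width naively taken as 0: the row count is
        // computed against only the left width, so it is twice as large...
        System.out.println(outgoingRowCount(limit, 64, 0));  // 262144 rows

        // ...as the estimate using a standard-size guess (assumed 64 bytes
        // here) for the empty side, which keeps memory use bounded.
        System.out.println(outgoingRowCount(limit, 64, 64)); // 131072 rows
    }
}
```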
[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1228#discussion_r184200281 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java --- @@ -277,18 +286,29 @@ public boolean isRepeatedList() { /** * This is the average per entry width, used for vector allocation. */ -public int getEntryWidth() { +private int getEntryWidthForAlloc() { int width = 0; if (isVariableWidth) { -width = getNetSizePerEntry() - OFFSET_VECTOR_WIDTH; +width = getAllocSizePerEntry() - OFFSET_VECTOR_WIDTH; // Subtract out the bits (is-set) vector width -if (metadata.getDataMode() == DataMode.OPTIONAL) { +if (isOptional) { width -= BIT_VECTOR_WIDTH; } + +if (isRepeated && getValueCount() == 0) { + return (safeDivide(width, STD_REPETITION_FACTOR)); +} } - return (safeDivide(width, cardinality)); + return (safeDivide(width, getEntryCardinalityForAlloc())); +} + +/** + * This is the average per entry cardinality, used for vector allocation. + */ +private float getEntryCardinalityForAlloc() { + return getCardinality() == 0 ? (isRepeated ? STD_REPETITION_FACTOR : 1) :getCardinality(); --- End diff -- This is for joins. We allocate vectors based on the first batch's sizing information, and if that first batch is empty, we allocate vectors with zero capacity. When we read the next batch with data, we end up going through the realloc loop as we write values. For example, for a left outer join, if the right-side batch is empty, we still have to include the right-side columns as null in the outgoing batch. With the new lateral join operator, we also see the problem if the input has an empty array as the first record in the unnest column. ---
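The cost of the "realloc loop" mentioned above can be sketched as follows (hypothetical code, not Drill's vector internals): a vector allocated with near-zero capacity must double repeatedly as values are written, so filling n values costs O(log n) reallocations plus the bytes copied on each one, all of which a sensible upfront allocation avoids.

```java
// Rough sketch of doubling-realloc cost when a vector starts with
// (near-)zero capacity. Counts how many capacity doublings are needed
// to accommodate a given number of values.
public class ReallocCost {
    static int reallocsNeeded(int initialCapacity, int valuesToWrite) {
        int cap = Math.max(initialCapacity, 1); // growing from 0 starts at 1
        int reallocs = 0;
        while (cap < valuesToWrite) {
            cap *= 2;       // each overflow doubles the buffer
            reallocs++;     // ...and copies everything written so far
        }
        return reallocs;
    }

    public static void main(String[] args) {
        // Zero-capacity start: 16 doublings (and copies) to reach 64K values.
        System.out.println(reallocsNeeded(1, 65536));
        // Properly pre-sized from the first batch's stats: no reallocs.
        System.out.println(reallocsNeeded(65536, 65536));
    }
}
```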
[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1228#discussion_r184192443 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java --- @@ -50,7 +50,7 @@ public class RecordBatchSizer { private static final int OFFSET_VECTOR_WIDTH = UInt4Vector.VALUE_WIDTH; private static final int BIT_VECTOR_WIDTH = UInt1Vector.VALUE_WIDTH; - private static final int STD_REPETITION_FACTOR = 10; + public static final int STD_REPETITION_FACTOR = 10; --- End diff -- done. using 5 in both places now. ---
[GitHub] drill issue #1218: DRILL-6335: Refactor row set abstractions to prepare for ...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1218 LGTM. +1 ---
[GitHub] drill issue #1227: Drill-6236: batch sizing for hash join
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1227 @Ben-Zvi my bad. I updated the title, but it has not updated the JIRA. Trying to figure this out. ---
[GitHub] drill pull request #1227: Drill 6236: batch sizing for hash join
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1227#discussion_r183171726 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java --- @@ -560,6 +554,40 @@ public void close() { super.close(); } + @Override + protected void updateBatchMemoryManagerStats() { +stats.setLongStat(Metric.LEFT_INPUT_BATCH_COUNT, batchMemoryManager.getNumIncomingBatches(LEFT_INDEX)); --- End diff -- @sohami I will create a JIRA and address that in a separate PR. For now, I would like to override this method. Is that OK? ---
[GitHub] drill issue #1227: Drill 6236: batch sizing for hash join
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1227 @sohami thanks for the review. Updated with review comments addressed. Please take a look. ---
[GitHub] drill pull request #1227: Drill 6236: batch sizing for hash join
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1227#discussion_r183112258 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java --- @@ -147,7 +150,19 @@ NUM_BUCKETS, NUM_ENTRIES, NUM_RESIZING, -RESIZING_TIME_MS; +RESIZING_TIME_MS, +LEFT_INPUT_BATCH_COUNT, +LEFT_AVG_INPUT_BATCH_BYTES, +LEFT_AVG_INPUT_ROW_BYTES, +LEFT_INPUT_RECORD_COUNT, +RIGHT_INPUT_BATCH_COUNT, +RIGHT_AVG_INPUT_BATCH_BYTES, +RIGHT_AVG_INPUT_ROW_BYTES, +RIGHT_INPUT_RECORD_COUNT, +OUTPUT_BATCH_COUNT, +AVG_OUTPUT_BATCH_BYTES, +AVG_OUTPUT_ROW_BYTES, +OUTPUT_RECORD_COUNT; --- End diff -- These metrics are relevant in the sense that they provide a high-level picture of the amount of data being processed, memory usage, etc. by each operator. They are also helpful when debugging, to figure out what is going on. ---
[GitHub] drill issue #1227: Drill 6236: batch sizing for hash join
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1227 @Ben-Zvi Thanks a lot for the review. updated PR with review comments taken care of. Please take a look. Regarding spill files, here are my thoughts. For build side, I am using aggregate statistics i.e. average of all batches. On probe side, I am using stats for each batch coming in and adjusting the output row count. So, we can skip applying sizing for batches spilled from build side and continue to do what I am doing on the probe side. Once your code is merged in, I will refactor the code as needed. ---
[GitHub] drill pull request #1227: Drill 6236: batch sizing for hash join
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1227#discussion_r183108078 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinBatch.java --- @@ -300,13 +322,14 @@ public void setupHashTable() throws IOException, SchemaChangeException, ClassTra public void executeBuildPhase() throws SchemaChangeException, ClassTransformationException, IOException { //Setup the underlying hash table - // skip first batch if count is zero, as it may be an empty schema batch if (isFurtherProcessingRequired(rightUpstream) && right.getRecordCount() == 0) { for (final VectorWrapper w : right) { w.clear(); } rightUpstream = next(right); + // For build side, use aggregate i.e. average row width across batches + batchMemoryManager.update(RIGHT_INDEX, 0,true); --- End diff -- There is a call to "next" right above the update. ---
[GitHub] drill pull request #1218: DRILL-6335: Refactor row set abstractions to prepa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1218#discussion_r18296 --- Diff: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/RowSetWriterImpl.java --- @@ -158,4 +159,10 @@ public SingleRowSet done() { public int lastWriteIndex() { return writerIndex.vectorIndex(); } + + @Override + public ColumnMetadata schema() { +// No column schema for the row as a whole. --- End diff -- same comment as above ---
[GitHub] drill pull request #1218: DRILL-6335: Refactor row set abstractions to prepa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1218#discussion_r182921553 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/impl/RowSetLoaderImpl.java --- @@ -95,4 +96,10 @@ public void endBatch() { @Override public int rowCount() { return rsLoader.rowCount(); } + + @Override + public ColumnMetadata schema() { +// No column for the row tuple +return null; --- End diff -- This comment is not clear to me. Is it OK to return null here? ---
[GitHub] drill pull request #1218: DRILL-6335: Refactor row set abstractions to prepa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1218#discussion_r182729279 --- Diff: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/test/TestHyperVectorReaders.java --- @@ -362,4 +363,136 @@ public void testRepeatedMap() { RowSetUtilities.verify(expected, hyperSet); } + + @Test + public void testUnion() { +TupleMetadata schema = new SchemaBuilder() +.add("a", MinorType.INT) +.addUnion("u") + .addType(MinorType.INT) + .addType(MinorType.VARCHAR) + .resumeSchema() +.buildSchema(); + +SingleRowSet rowSet1 = fixture.rowSetBuilder(schema) +.addRow(2, 20) +.addRow(4, "fourth") +.build(); + +SingleRowSet rowSet2 = fixture.rowSetBuilder(schema) +.addRow(1, "first") +.addRow(3, 30) +.build(); + +// Build the hyper batch + +HyperRowSet hyperSet = HyperRowSetImpl.fromRowSets(fixture.allocator(), rowSet1, rowSet2); +assertEquals(4, hyperSet.rowCount()); +SelectionVector4 sv4 = hyperSet.getSv4(); +sv4.set(0, 1, 0); +sv4.set(1, 0, 0); --- End diff -- Do we have to manually map each index like this? Maybe we can do this in the fromRowSets method itself? ---
[GitHub] drill pull request #1218: DRILL-6335: Refactor row set abstractions to prepa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1218#discussion_r182923416 --- Diff: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/test/TestIndirectReaders.java --- @@ -0,0 +1,151 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.test.rowSet.test; + +import static org.junit.Assert.*; + +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.record.metadata.TupleMetadata; +import org.apache.drill.exec.record.selection.SelectionVector2; +import org.apache.drill.exec.vector.accessor.ArrayReader; +import org.apache.drill.exec.vector.accessor.ArrayWriter; +import org.apache.drill.exec.vector.accessor.ScalarReader; +import org.apache.drill.exec.vector.accessor.ScalarWriter; +import org.apache.drill.test.SubOperatorTest; +import org.apache.drill.test.rowSet.RowSetWriter; +import org.apache.drill.test.rowSet.RowSet.ExtendableRowSet; +import org.apache.drill.test.rowSet.RowSet.SingleRowSet; +import org.apache.drill.test.rowSet.schema.SchemaBuilder; +import org.apache.drill.test.rowSet.RowSetComparison; +import org.apache.drill.test.rowSet.RowSetReader; +import org.junit.Test; + +/** + * Test reading with an indirection vector (sv2.) This form of + * indirection vector reorders values within a single batch. + * Since the indirection occurs only in the reader index, only + * light testing is done; all readers go through the same index class, + * so if the index works for one reader, it will for for all. --- End diff -- typo: it will work for all (instead of for for all) ---
[GitHub] drill pull request #1228: DRILL-6307: Handle empty batches in record batch s...
GitHub user ppadma opened a pull request: https://github.com/apache/drill/pull/1228 DRILL-6307: Handle empty batches in record batch sizer correctly When we get an empty batch, the record batch sizer calculates the row width as zero. In that case, we do not do accounting and memory allocation correctly for outgoing batches. For example, for a left outer join, if the right-side batch is empty, we still have to include the right-side columns as null in the outgoing batch. Say the first batch is empty. Then, for the outgoing batch, we allocate empty vectors with zero capacity. When we read the next batch with data, we end up going through the realloc loop as we write values. Also, if we use a right-side row width of 0 in the outgoing row width calculation, the number of rows we calculate (to include in the outgoing batch) will be too high, and later, when we get a non-empty batch, we might exceed the memory limits. This PR tries to address these problems by allocating memory based on standard size for an empty input batch, and by using the allocation width as the batch width in the row count calculation for binary operators. For unary operators, this is not a problem since we drop empty batches without doing any processing. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ppadma/drill DRILL-6307 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1228.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1228 commit cd78209e9f75a59edc68df3e416f3936fb00f917 Author: Padma Penumarthy Date: 2018-04-06T19:56:06Z DRILL-6307: Handle empty batches in record batch sizer correctly ---
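The core decision the PR describes can be condensed into a few lines (hypothetical names and an assumed standard width, not Drill's actual constants or API): use the measured row width when the batch has rows, otherwise fall back to a standard size so empty batches still yield usable allocations.

```java
// Condensed sketch of the empty-batch allocation policy described above.
// STD_ROW_WIDTH is an assumed illustrative value, not a Drill constant.
public class RowAllocSize {
    static final int STD_ROW_WIDTH = 50;

    // Actual row size if the input is not empty; otherwise, standard size
    // (mirroring the rowAllocSize javadoc in the diff under review).
    static int rowAllocSize(int rowCount, int measuredRowWidth) {
        return rowCount > 0 ? measuredRowWidth : STD_ROW_WIDTH;
    }

    public static void main(String[] args) {
        // Empty first batch: allocate using the standard-size guess.
        System.out.println(rowAllocSize(0, 0));
        // Batch with data: allocate using the measured width.
        System.out.println(rowAllocSize(1024, 72));
    }
}
```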
[GitHub] drill pull request #1227: Drill 6236: batch sizing for hash join
GitHub user ppadma opened a pull request: https://github.com/apache/drill/pull/1227 Drill 6236: batch sizing for hash join You can merge this pull request into a Git repository by running: $ git pull https://github.com/ppadma/drill DRILL-6236 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1227.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1227 commit f41aadab1cdf61bb1818724597aaa7726000af09 Author: Padma Penumarthy Date: 2018-04-07T03:53:26Z DRILL-6236: batch sizing for hash join commit e2ddf8761e006a16ed7b6ceea3c2fa849cc6a6e5 Author: Padma Penumarthy Date: 2018-04-19T05:23:54Z DRILL-6343: bit vector copyFromSafe is not doing realloc ---
[GitHub] drill issue #1161: DRILL-6230: Extend row set readers to handle hyper vector...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1161 @paul-rogers ran the tests. they all pass. ---
[GitHub] drill issue #1181: DRILL-6284: Add operator metrics for batch sizing for fla...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1181 @paul-rogers thanks a lot for the review. updated PR with review comments taken care of. ---
[GitHub] drill issue #1181: DRILL-6284: Add operator metrics for batch sizing for fla...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1181 @paul-rogers I added metrics for merge join also. I refactored AbstractRecordBatchMemoryManager to handle batches from multiple streams. Please review when you get a chance. ---
[GitHub] drill issue #1181: DRILL-6284: Add operator metrics for batch sizing for fla...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1181 @paul-rogers thanks for the review. Please take a look at updated changes. ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1181#discussion_r177810092 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java --- @@ -99,6 +100,22 @@ private void clear() { } } + public enum Metric implements MetricDef { +NUM_INCOMING_BATCHES, +AVG_INPUT_BATCH_SIZE, +AVG_INPUT_ROW_WIDTH, --- End diff -- @paul-rogers Paul, I did not understand what you mean by parallel here and below. Do you mean they should be adjacent columns in the web UI ? ---
[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1161#discussion_r177564887 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/rowSet/model/ReaderIndex.java --- @@ -28,26 +28,30 @@ public abstract class ReaderIndex implements ColumnReaderIndex { - protected int rowIndex = -1; + protected int position = -1; protected final int rowCount; public ReaderIndex(int rowCount) { this.rowCount = rowCount; } - public int position() { return rowIndex; } - public void set(int index) { rowIndex = index; } + public void set(int index) { +assert position >= -1 && position <= rowCount; +position = index; + } + + @Override + public int logicalIndex() { return position; } + + @Override + public int size() { return rowCount; } + @Override public boolean next() { -if (++rowIndex < rowCount ) { +if (++position < rowCount) { return true; -} else { - rowIndex--; - return false; } +position = rowCount; --- End diff -- Is there a need to set position to rowCount? It will come here when position equals rowCount. ---
[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1161#discussion_r177182331 --- Diff: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/test/TestHyperVectorReaders.java --- @@ -0,0 +1,365 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.test.rowSet.test; + +import static org.apache.drill.test.rowSet.RowSetUtilities.mapArray; +import static org.apache.drill.test.rowSet.RowSetUtilities.mapValue; +import static org.apache.drill.test.rowSet.RowSetUtilities.strArray; +import static org.junit.Assert.assertEquals; +import static org.junit.Assert.assertFalse; +import static org.junit.Assert.assertTrue; + +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.record.metadata.TupleMetadata; +import org.apache.drill.exec.record.selection.SelectionVector4; +import org.apache.drill.test.SubOperatorTest; +import org.apache.drill.test.rowSet.HyperRowSetImpl; +import org.apache.drill.test.rowSet.RowSet.ExtendableRowSet; +import org.apache.drill.test.rowSet.RowSet.HyperRowSet; +import org.apache.drill.test.rowSet.RowSet.SingleRowSet; +import org.apache.drill.test.rowSet.RowSetBuilder; +import org.apache.drill.test.rowSet.RowSetReader; +import org.apache.drill.test.rowSet.RowSetUtilities; +import org.apache.drill.test.rowSet.RowSetWriter; +import org.apache.drill.test.rowSet.schema.SchemaBuilder; +import org.junit.Test; + +/** + * Test the reader mechanism that reads rows indexed via an SV4. + * SV4's introduce an additional level of indexing: each row may + * come from a different batch. The readers use the SV4 to find + * the root batch and vector, then must navigate downward from that + * vector for maps, repeated maps, lists, unions, repeated lists, + * nullable vectors and variable-length vectors. + * + * This test does not cover repeated vectors; those tests should be added. --- End diff -- please file a JIRA for this. ---
[GitHub] drill pull request #1161: DRILL-6230: Extend row set readers to handle hyper...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1161#discussion_r177175974 --- Diff: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/HyperRowSetImpl.java --- @@ -45,8 +50,67 @@ public RowSetReader buildReader(HyperRowSet rowSet, SelectionVector4 sv4) { TupleMetadata schema = rowSet.schema(); HyperRowIndex rowIndex = new HyperRowIndex(sv4); return new RowSetReaderImpl(schema, rowIndex, - buildContainerChildren(rowSet.container(), - new MetadataRetrieval(schema))); + buildContainerChildren(rowSet.container(), schema)); +} + } + + public static class HyperRowSetBuilderImpl implements HyperRowSetBuilder { + +private final BufferAllocator allocator; +private final List batches = new ArrayList<>(); +private int totalRowCount; + +public HyperRowSetBuilderImpl(BufferAllocator allocator) { + this.allocator = allocator; +} + +@Override +public void addBatch(SingleRowSet rowSet) { + if (rowSet.rowCount() == 0) { +return; + } + if (rowSet.indirectionType() != SelectionVectorMode.NONE) { +throw new IllegalArgumentException("Batches must not have a selection vector."); + } + batches.add(rowSet.container()); + totalRowCount += rowSet.rowCount(); +} + +@Override +public void addBatch(VectorContainer container) { + if (container.getRecordCount() == 0) { +return; + } + if (container.getSchema().getSelectionVectorMode() != SelectionVectorMode.NONE) { +throw new IllegalArgumentException("Batches must not have a selection vector."); + } + batches.add(container); + totalRowCount += container.getRecordCount(); +} + +@SuppressWarnings("resource") +@Override +public HyperRowSet build() throws SchemaChangeException { + SelectionVector4 sv4 = new SelectionVector4(allocator, totalRowCount); + ExpandableHyperContainer hyperContainer = new ExpandableHyperContainer(); + for (VectorContainer container : batches) { +hyperContainer.addBatch(container); + } + + // TODO: This has a bug. 
If the hyperset has two batches with unions, + // and the first union contains only VARCHAR, while the second contains --- End diff -- Is there a JIRA for this bug? ---
[GitHub] drill pull request #1181: DRILL-6284: Add operator metrics for batch sizing ...
GitHub user ppadma opened a pull request: https://github.com/apache/drill/pull/1181 DRILL-6284: Add operator metrics for batch sizing for flatten @kkhatua please review. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ppadma/drill DRILL-6284 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1181.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1181 commit f0b7bed20aef64cdc9e025a5ca209e1ad6220aa6 Author: Padma Penumarthy Date: 2018-03-20T20:44:50Z DRILL-6284: Add operator metrics for batch sizing for flatten ---
[GitHub] drill pull request #1179: DRILL-6254: IllegalArgumentException: the requeste...
GitHub user ppadma opened a pull request: https://github.com/apache/drill/pull/1179 DRILL-6254: IllegalArgumentException: the requested size must be non-negative We should limit memory allocation to the number of records that are going to be in the next batch, not the total number of records remaining. For a very large remaining record count, multiplying by a high cardinality overflows the integer and produces a negative value. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ppadma/drill DRILL-6254 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1179.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1179 commit c26e8f44b9c873d0d7acabac3623a3a8e19086eb Author: Padma Penumarthy Date: 2018-03-21T20:39:43Z DRILL-6254: IllegalArgumentException: the requested size must be non-negative ---
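The overflow described in this PR, in miniature (illustrative numbers and names, not the actual Drill allocation path): Java int multiplication wraps silently, so a large remaining record count times a per-record size can go negative, while capping the row count at the next batch's size keeps the product small.

```java
// Demonstrates signed 32-bit wrap-around in a naive allocation-size
// computation, and the fix of sizing for only the next batch.
public class AllocOverflow {
    static int naiveAllocSize(int remainingRecords, int bytesPerRecord) {
        // 200_000_000 * 16 = 3.2e9 > Integer.MAX_VALUE: wraps negative.
        return remainingRecords * bytesPerRecord;
    }

    static int boundedAllocSize(int remainingRecords, int batchRowLimit, int bytesPerRecord) {
        // Only allocate for the rows that fit in the next batch.
        int rows = Math.min(remainingRecords, batchRowLimit);
        return rows * bytesPerRecord;
    }

    public static void main(String[] args) {
        int remaining = 200_000_000;
        int perRecord = 16;
        // The naive product wraps to a negative int...
        System.out.println(naiveAllocSize(remaining, perRecord) < 0); // true
        // ...while the bounded one stays a sane 1 MiB (65536 * 16).
        System.out.println(boundedAllocSize(remaining, 65536, perRecord));
    }
}
```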
[GitHub] drill issue #1175: DRILL-6262: IndexOutOfBoundException in RecordBatchSize f...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1175 LGTM. +1. ---
[GitHub] drill issue #1171: DRILL-6231: Fix memory allocation for repeated list vecto...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1171 @paul-rogers I added the null check. Also, added a test case for 3D array i.e. repeated repeated list. ---
[GitHub] drill issue #1171: DRILL-6231: Fix memory allocation for repeated list vecto...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1171 @paul-rogers please review. ---
[GitHub] drill issue #1161: DRILL-6230: Extend row set readers to handle hyper vector...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1161 @paul-rogers I started the review. will get back soon. ---
[GitHub] drill pull request #1171: DRILL-6231: Fix memory allocation for repeated lis...
GitHub user ppadma opened a pull request: https://github.com/apache/drill/pull/1171 DRILL-6231: Fix memory allocation for repeated list vector You can merge this pull request into a Git repository by running: $ git pull https://github.com/ppadma/drill DRILL-6231 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1171.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1171 commit 114a526240a272bd626862a02d1acd06b9c8b4cf Author: Padma Penumarthy Date: 2018-03-16T04:50:54Z DRILL-6231: Fix memory allocation for repeated list vector ---
[GitHub] drill issue #1150: DRILL-6210: Enhanced test schema utilities
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1150 @Ben-Zvi Boaz, can you do a committer review for this PR and include it in this week's batch commit? ---
[GitHub] drill issue #1125: DRILL-6126: Allocate memory for value vectors upfront in ...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1125 @paul-rogers Thank you Paul. I will be incorporating some of your suggestions in future PRs. This change will allow us to build more things on top. ---
[GitHub] drill issue #1150: DRILL-6210: Enhanced test schema utilities
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1150 This is very useful for testing complex types. LGTM. +1. ---
[GitHub] drill issue #1125: DRILL-6126: Allocate memory for value vectors upfront in ...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1125 @paul-rogers Updated the PR with the latest changes. I have decided not to use the vector initializer for allocation, as it is subject to the alias issues you mentioned. Instead, I added an allocate vector method to ColumnSize in the batch sizer, which uses the internal sizing information it has to allocate memory (including all its children) for a particular record count. I refactored the batch sizer code and added unit tests verifying sizing and allocation for different vector types. Please take a look when you get a chance and let me know what you think. ---
[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1125#discussion_r171999276 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java --- @@ -245,16 +251,30 @@ private void buildVectorInitializer(VectorInitializer initializer) { else if (width > 0) { initializer.variableWidth(name, width); } + + for (ColumnSize columnSize : childColumnSizes.values()) { +columnSize.buildVectorInitializer(initializer); + } } + } public static ColumnSize getColumn(ValueVector v, String prefix) { return new ColumnSize(v, prefix); } + public ColumnSize getColumn(String name) { +return allColumnSizes.get(name); + } + public static final int MAX_VECTOR_SIZE = ValueVector.MAX_BUFFER_SIZE; // 16 MiB - private Map columnSizes = CaseInsensitiveMap.newHashMap(); + // This keeps information for all columns i.e. all top columns and nested columns underneath + private Map allColumnSizes = CaseInsensitiveMap.newHashMap(); --- End diff -- yes, I got rid of allColumnSizes. We will have only top level columns. ---
[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1125#discussion_r171999148 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java --- @@ -418,11 +438,13 @@ private void measureColumn(ValueVector v, String prefix) { netRowWidthCap50 += ! colSize.isVariableWidth ? colSize.estSize : 8 /* offset vector */ + roundUpToPowerOf2(Math.min(colSize.estSize,50)); // above change 8 to 4 after DRILL-5446 is fixed + +return colSize; } - private void expandMap(AbstractMapVector mapVector, String prefix) { + private void expandMap(ColumnSize colSize, AbstractMapVector mapVector, String prefix) { for (ValueVector vector : mapVector) { - measureColumn(vector, prefix); + colSize.childColumnSizes.put(prefix + vector.getField().getName(), measureColumn(vector, prefix)); --- End diff -- I made the change to maintain trees of columns for trees of maps. Anyone who wants to access the lower-level columns gets the top-level column and walks through the tree. ---
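The "get the top-level column and walk through the tree" access pattern can be sketched as below. ColumnNode and find are illustrative names, not Drill's ColumnSize API; the sketch only shows how a nested column such as "b.c" is reached by descending one map level per path segment.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: a node in a tree of per-column sizing info, as the
// comment above describes for nested maps. Callers fetch the top-level
// column and walk down one level per dotted-path segment.
class ColumnNode {
    private final Map<String, ColumnNode> children = new HashMap<>();

    ColumnNode addChild(String name) {
        ColumnNode child = new ColumnNode();
        children.put(name, child);
        return child;
    }

    ColumnNode child(String name) {
        return children.get(name);
    }

    // Walks a dotted path such as "b.c" below this node; returns null if
    // any segment is missing.
    ColumnNode find(String dottedPath) {
        ColumnNode current = this;
        for (String part : dottedPath.split("\\.")) {
            if (current == null) {
                return null;
            }
            current = current.child(part);
        }
        return current;
    }
}
```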
[GitHub] drill pull request #1147: DRILL-6205: Reduce memory consumption of testFlatt...
GitHub user ppadma opened a pull request: https://github.com/apache/drill/pull/1147 DRILL-6205: Reduce memory consumption of testFlattenUpperLimit test You can merge this pull request into a Git repository by running: $ git pull https://github.com/ppadma/drill DRILL-6205 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1147.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1147 commit d61ad8fe176996015209eba321e2a176d3a8d24f Author: Padma Penumarthy Date: 2018-03-02T19:13:59Z DRILL-6205: Reduce memory consumption of testFlattenUpperLimit test ---
[GitHub] drill issue #1129: DRILL-6180: Use System Option "output_batch_size" for Ext...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1129 @paul-rogers Made the change you suggested. Please take a look when you get a chance. ---
[GitHub] drill issue #1125: DRILL-6126: Allocate memory for value vectors upfront in ...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1125 @paul-rogers Paul, thanks a lot for your review comments and for bringing up some good issues. Just to let you know, I am refactoring the batch sizer code and writing a bunch of unit tests that verify sizing and vector allocation for all the different vector types. I found some bugs in the process and fixed them. I will post the new changes soon and will need your review once they are ready. ---
[GitHub] drill pull request #1129: DRILL-6180: Use System Option "output_batch_size" ...
GitHub user ppadma opened a pull request: https://github.com/apache/drill/pull/1129 DRILL-6180: Use System Option "output_batch_size" for External Sort External Sort has a boot-time configuration for its output batch size, "drill.exec.sort.external.spill.merge_batch_size", which defaults to 16M. To make batch sizing configuration uniform across all operators, change this to use the newly added system option "drill.exec.memory.operator.output_batch_size". That option has a default value of 32M; changed it to 16M. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ppadma/drill DRILL-6180 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1129.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1129 commit 5121663c1fac618d0374667c97c20570197b7455 Author: Padma Penumarthy Date: 2018-02-23T00:41:47Z DRILL-6180: Use System Option "output_batch_size" for External Sort ---
[GitHub] drill issue #1125: DRILL-6126: Allocate memory for value vectors upfront in ...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1125 @paul-rogers Paul, will you be able to review this? ---
[GitHub] drill pull request #1125: DRILL-6126: Allocate memory for value vectors upfr...
GitHub user ppadma opened a pull request: https://github.com/apache/drill/pull/1125 DRILL-6126: Allocate memory for value vectors upfront in flatten operator Made changes to allocate memory upfront for the flatten operator based on sizing calculations. We need to allocate a single column (which can be nested) for a particular record count with allocation hints, so the code is refactored a bit for that. Instead of assuming a worst-case fragmentation factor of 2, the logic now rounds the calculated number of rows down to the nearest power of two. This lets us pack value vectors more densely and helps with memory utilization. RepeatedMapVector and RepeatedListVector extend RepeatedFixedWidthVectorLike. This is wrong and was causing problems in the allocation logic (allocatePrecomputedChildCount in AllocationHelper, specifically); fixed that. This PR has two commits: one for all of the above and a second for DRILL-6162: Enhance record batch sizer to retain nesting information. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ppadma/drill DRILL-6126 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1125.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1125 commit 58c6b9ad584e56c71d982feaaa43ad32b5011eef Author: Padma Penumarthy Date: 2018-02-21T17:33:12Z DRILL-6162: Enhance record batch sizer to retain nesting information for map columns. commit f7c09131179b75d10ffe195785c9aef3b9c7ed97 Author: Padma Penumarthy Date: 2018-02-21T17:35:47Z DRILL-6126: Allocate memory for value vectors upfront in flatten operator ---
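The "round the row count down to the nearest power of two" step mentioned in the PR description can be illustrated with a minimal standalone helper. This is not the actual Drill code, just the arithmetic: `Integer.highestOneBit` keeps only the top set bit, which is exactly the nearest power of two at or below a positive value.

```java
// Minimal illustration of the rounding idea described above (not the
// actual Drill implementation): instead of applying a worst-case 2x
// fragmentation factor, round the computed row count down to the nearest
// power of two so value vectors are packed densely.
class RowCountRounder {
    static int roundDownToPowerOfTwo(int rows) {
        if (rows < 1) {
            throw new IllegalArgumentException("rows must be positive");
        }
        // Integer.highestOneBit keeps only the top set bit, e.g. 1000 -> 512.
        return Integer.highestOneBit(rows);
    }
}
```

Allocating at a power-of-two row count matters because vector buffers themselves grow in power-of-two sizes; a row count just over a power of two would otherwise double the buffer and waste almost half of it.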
[GitHub] drill pull request #1121: DRILL-6153: Operator framework
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1121#discussion_r169155001 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java --- @@ -0,0 +1,222 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.protocol; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.record.RecordBatch.IterOutcome; + +/** + * State machine that drives the operator executable. Converts + * between the iterator protocol and the operator executable protocol. + * Implemented as a separate class in anticipation of eventually + * changing the record batch (iterator) protocol. + */ + +public class OperatorDriver { + public enum State { + +/** + * Before the first call to next(). + */ + +START, + +/** + * The first call to next() has been made and schema (only) + * was returned. On the subsequent call to next(), return any + * data that might have accompanied that first batch. + */ + +SCHEMA, + +/** + * The second call to next() has been made and there is more + * data to deliver on subsequent calls. 
+ */ + +RUN, + +/** + * No more data to deliver. + */ + +END, + +/** + * An error occurred. Operation was cancelled. + */ + +FAILED, + +/** + * close() called and resources are released. + */ + +CLOSED } --- End diff -- minor: put closing braces on a separate line for better readability. ---
[GitHub] drill pull request #1121: DRILL-6153: Operator framework
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1121#discussion_r169159418 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java --- @@ -0,0 +1,222 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.protocol; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.record.RecordBatch.IterOutcome; + +/** + * State machine that drives the operator executable. Converts + * between the iterator protocol and the operator executable protocol. + * Implemented as a separate class in anticipation of eventually + * changing the record batch (iterator) protocol. + */ + +public class OperatorDriver { + public enum State { + +/** + * Before the first call to next(). + */ + +START, + +/** + * The first call to next() has been made and schema (only) + * was returned. On the subsequent call to next(), return any + * data that might have accompanied that first batch. + */ + +SCHEMA, + +/** + * The second call to next() has been made and there is more + * data to deliver on subsequent calls. 
+ */ + +RUN, + +/** + * No more data to deliver. + */ + +END, + +/** + * An error occurred. Operation was cancelled. + */ + +FAILED, + +/** + * close() called and resources are released. + */ + +CLOSED } + + private OperatorDriver.State state = State.START; + + /** + * Operator context. The driver "owns" the context and is responsible + * for closing it. + */ + + private final OperatorContext opContext; + private final OperatorExec operatorExec; + private final BatchAccessor batchAccessor; + private int schemaVersion; + + public OperatorDriver(OperatorContext opContext, OperatorExec opExec) { +this.opContext = opContext; +this.operatorExec = opExec; +batchAccessor = operatorExec.batchAccessor(); + } + + /** + * Get the next batch. Performs initialization on the first call. + * @return the iteration outcome to send downstream + */ + + public IterOutcome next() { +try { + switch (state) { + case START: +return start(); + case RUN: +return doNext(); + default: +OperatorRecordBatch.logger.debug("Extra call to next() in state " + state + ": " + operatorLabel()); +return IterOutcome.NONE; + } +} catch (UserException e) { + cancelSilently(); + state = State.FAILED; + throw e; +} catch (Throwable t) { + cancelSilently(); + state = State.FAILED; + throw UserException.executionError(t) +.addContext("Exception thrown from", operatorLabel()) +.build(OperatorRecordBatch.logger); +} + } + + /** + * Cancels the operator before reaching EOF. + */ + + public void cancel() { +try { + switch (state) { + case START: + case RUN: +cancelSilently(); +break; + default: +break; + } +} finally { + state = State.FAILED; --- End diff -- I am thinking FAILED represents internal failure with in the operator. Cancel means we are explicitly canceling it (for whatever reasons) i.e. operator is being asked to shutdown or close. For cancel, should we move the state to CLOSED instead of FAILED. ---
[GitHub] drill pull request #1121: DRILL-6153: Operator framework
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1121#discussion_r169155210 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java --- @@ -0,0 +1,222 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.protocol; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.record.RecordBatch.IterOutcome; + +/** + * State machine that drives the operator executable. Converts + * between the iterator protocol and the operator executable protocol. + * Implemented as a separate class in anticipation of eventually + * changing the record batch (iterator) protocol. + */ + +public class OperatorDriver { + public enum State { + +/** + * Before the first call to next(). + */ + +START, + +/** + * The first call to next() has been made and schema (only) + * was returned. On the subsequent call to next(), return any + * data that might have accompanied that first batch. + */ + +SCHEMA, + +/** + * The second call to next() has been made and there is more + * data to deliver on subsequent calls. 
+ */ + +RUN, + +/** + * No more data to deliver. + */ + +END, + +/** + * An error occurred. Operation was cancelled. + */ + +FAILED, + +/** + * close() called and resources are released. + */ + +CLOSED } + + private OperatorDriver.State state = State.START; + + /** + * Operator context. The driver "owns" the context and is responsible + * for closing it. + */ + + private final OperatorContext opContext; + private final OperatorExec operatorExec; + private final BatchAccessor batchAccessor; + private int schemaVersion; + + public OperatorDriver(OperatorContext opContext, OperatorExec opExec) { +this.opContext = opContext; +this.operatorExec = opExec; +batchAccessor = operatorExec.batchAccessor(); + } + + /** + * Get the next batch. Performs initialization on the first call. + * @return the iteration outcome to send downstream + */ + + public IterOutcome next() { +try { + switch (state) { + case START: +return start(); + case RUN: +return doNext(); + default: --- End diff -- alignment ---
[GitHub] drill pull request #1121: DRILL-6153: Operator framework
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1121#discussion_r169162527 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/VectorContainerAccessor.java --- @@ -0,0 +1,132 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.physical.impl.protocol; + +import java.util.Collections; +import java.util.Iterator; + +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.exec.record.BatchSchema; +import org.apache.drill.exec.record.TypedFieldId; +import org.apache.drill.exec.record.VectorContainer; +import org.apache.drill.exec.record.VectorWrapper; +import org.apache.drill.exec.record.WritableBatch; +import org.apache.drill.exec.record.selection.SelectionVector2; +import org.apache.drill.exec.record.selection.SelectionVector4; + +public class VectorContainerAccessor implements BatchAccessor { + + public static class ContainerAndSv2Accessor extends VectorContainerAccessor { + +private SelectionVector2 sv2; + +public void setSelectionVector(SelectionVector2 sv2) { + this.sv2 = sv2; +} + +@Override +public SelectionVector2 getSelectionVector2() { + return sv2; +} + } + + public static class ContainerAndSv4Accessor extends VectorContainerAccessor { + +private SelectionVector4 sv4; + +@Override +public SelectionVector4 getSelectionVector4() { + return sv4; +} + } + + private VectorContainer container; + private SchemaTracker schemaTracker = new SchemaTracker(); + + /** + * Set the vector container. Done initially, and any time the schema of + * the container may have changed. May be called with the same container + * as the previous call, or a different one. A schema change occurs + * unless the vectors are identical across the two containers. + * + * @param container the container that holds vectors to be sent + * downstream + */ + + public void setContainer(VectorContainer container) { +this.container = container; +if (container != null) { + schemaTracker.trackSchema(container); +} + } + + @Override + public BatchSchema getSchema() { +return container == null ? 
null : container.getSchema(); + } + + @Override + public int schemaVersion() { return schemaTracker.schemaVersion(); } + + @Override + public int getRowCount() { +return container == null ? 0 : container.getRecordCount(); + } + + @Override + public VectorContainer getOutgoingContainer() { return container; } + + @Override + public TypedFieldId getValueVectorId(SchemaPath path) { +return container.getValueVectorId(path); + } + + @Override + public VectorWrapper getValueAccessorById(Class clazz, int... ids) { +return container.getValueAccessorById(clazz, ids); + } + + @Override + public WritableBatch getWritableBatch() { +return WritableBatch.get(container); + } + + @Override + public SelectionVector2 getSelectionVector2() { +// Throws an exception by default --- End diff -- should we make that explicit by indicating what exceptions it might throw. ---
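One way to make the reviewer's suggestion concrete is to document and throw a specific exception in the base accessor, so only subclasses that actually carry a selection vector override it. This is purely an illustrative sketch (BaseAccessorSketch is a made-up name, not Drill's VectorContainerAccessor, and the return type is loosened to Object to keep the sketch self-contained):

```java
// Illustrative only: making the "throws by default" behavior explicit in
// the base accessor, as the comment above suggests. BaseAccessorSketch is
// a hypothetical stand-in, not a Drill class.
class BaseAccessorSketch {
    /**
     * @throws UnsupportedOperationException if this batch type carries no
     *         selection vector; subclasses that hold an SV2 override this
     */
    Object getSelectionVector2() {
        throw new UnsupportedOperationException(
            getClass().getSimpleName() + " does not hold a selection vector");
    }
}
```

Documenting the unchecked exception in the Javadoc `@throws` tag costs nothing at runtime but tells callers exactly which accessor subclasses are safe to ask for an SV2.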
[GitHub] drill pull request #1121: DRILL-6153: Operator framework
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1121#discussion_r168324742 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.protocol; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.record.RecordBatch.IterOutcome; + +/** + * State machine that drives the operator executable. Converts + * between the iterator protocol and the operator executable protocol. + * Implemented as a separate class in anticipation of eventually + * changing the record batch (iterator) protocol. + */ + +public class OperatorDriver { + public enum State { START, SCHEMA, RUN, END, FAILED, CLOSED } + + private OperatorDriver.State state = State.START; + + /** + * Operator context. The driver "owns" the context and is responsible + * for closing it. 
+ */ + + private final OperatorContext opContext; + private final OperatorExec operatorExec; + private final BatchAccessor batchAccessor; + private int schemaVersion; + + public OperatorDriver(OperatorContext opServicees, OperatorExec opExec) { +this.opContext = opServicees; +this.operatorExec = opExec; +batchAccessor = operatorExec.batchAccessor(); + } + + /** + * Get the next batch. Performs initialization on the first call. + * @return the iteration outcome to send downstream + */ + + public IterOutcome next() { +try { + switch (state) { + case START: +return start(); + case RUN: +return doNext(); + default: +OperatorRecordBatch.logger.debug("Extra call to next() in state " + state + ": " + operatorLabel()); +return IterOutcome.NONE; + } +} catch (UserException e) { + cancelSilently(); + state = State.FAILED; + throw e; +} catch (Throwable t) { + cancelSilently(); + state = State.FAILED; + throw UserException.executionError(t) +.addContext("Exception thrown from", operatorLabel()) +.build(OperatorRecordBatch.logger); +} + } + + /** + * Cancels the operator before reaching EOF. + */ + + public void cancel() { +try { + switch (state) { + case START: + case RUN: +cancelSilently(); +break; + default: +break; + } +} finally { + state = State.FAILED; +} + } + + /** + * Start the operator executor. Bind it to the various contexts. + * Then start the executor and fetch the first schema. + * @return result of the first batch, which should contain + * only a schema, or EOF + */ + + private IterOutcome start() { +state = State.SCHEMA; +if (operatorExec.buildSchema()) { + schemaVersion = batchAccessor.schemaVersion(); + state = State.RUN; + return IterOutcome.OK_NEW_SCHEMA; +} else { + state = State.END; + return IterOutcome.NONE; +} + } + + /** + * Fetch a record batch, detecting EOF and a new schema. + * @return the IterOutcome for the above cases + */ + + private IterOutcome doNext() { +if (! 
operatorExec.next()) { + state = State.END; + return IterOutcome.NONE; +} +int newVersion = batchAccessor.schemaVersion(); +if (newVersion != schemaVersion) { + schemaVersion = newVersion; + return IterOutcome.OK_NEW_SCHEMA; +} +return IterOutcome.OK; + } + + /** + * Implement a cancellati
[GitHub] drill pull request #1121: DRILL-6153: Operator framework
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1121#discussion_r168331779 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.protocol; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.record.RecordBatch.IterOutcome; + +/** + * State machine that drives the operator executable. Converts + * between the iterator protocol and the operator executable protocol. + * Implemented as a separate class in anticipation of eventually + * changing the record batch (iterator) protocol. + */ + +public class OperatorDriver { + public enum State { START, SCHEMA, RUN, END, FAILED, CLOSED } --- End diff -- Do we need the SCHEMA state, given that we move from START to RUN, END, or FAILED? Maybe combine the START and SCHEMA states into something like GET_SCHEMA. Also, would it be good to have two states, FAILED and CANCELLED, to differentiate whether the query was cancelled or failed due to an error? ---
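The alternative floated in this comment can be sketched as a small state machine: a single GET_SCHEMA state in place of START/SCHEMA, and CANCELLED kept distinct from FAILED so the two exits can be told apart afterwards. This is purely illustrative of the suggestion, not what the PR implements; DriverStateSketch and its method names are made up.

```java
// Sketch of the suggested alternative: merge START/SCHEMA into GET_SCHEMA
// and record an explicit cancel separately from an internal failure.
// Illustrative names; not Drill's OperatorDriver.
class DriverStateSketch {
    enum State { GET_SCHEMA, RUN, END, FAILED, CANCELLED, CLOSED }

    private State state = State.GET_SCHEMA;

    State state() { return state; }

    // Internal error inside the operator.
    void fail() { state = State.FAILED; }

    // Explicit external cancel: only meaningful while still producing.
    void cancel() {
        if (state == State.GET_SCHEMA || state == State.RUN) {
            state = State.CANCELLED;
        }
    }

    void startRunning() {
        if (state == State.GET_SCHEMA) {
            state = State.RUN;
        }
    }
}
```

With two terminal states, close() handling can report "query was cancelled" versus "query failed" without inspecting exception state elsewhere.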
[GitHub] drill pull request #1121: DRILL-6153: Operator framework
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1121#discussion_r168337843 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.protocol; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.record.RecordBatch.IterOutcome; + +/** + * State machine that drives the operator executable. Converts + * between the iterator protocol and the operator executable protocol. + * Implemented as a separate class in anticipation of eventually + * changing the record batch (iterator) protocol. + */ + +public class OperatorDriver { + public enum State { START, SCHEMA, RUN, END, FAILED, CLOSED } + + private OperatorDriver.State state = State.START; + + /** + * Operator context. The driver "owns" the context and is responsible + * for closing it. 
+ */ + + private final OperatorContext opContext; + private final OperatorExec operatorExec; + private final BatchAccessor batchAccessor; + private int schemaVersion; + + public OperatorDriver(OperatorContext opServicees, OperatorExec opExec) { +this.opContext = opServicees; +this.operatorExec = opExec; +batchAccessor = operatorExec.batchAccessor(); + } + + /** + * Get the next batch. Performs initialization on the first call. + * @return the iteration outcome to send downstream + */ + + public IterOutcome next() { +try { + switch (state) { + case START: +return start(); + case RUN: +return doNext(); + default: +OperatorRecordBatch.logger.debug("Extra call to next() in state " + state + ": " + operatorLabel()); +return IterOutcome.NONE; + } +} catch (UserException e) { + cancelSilently(); + state = State.FAILED; + throw e; +} catch (Throwable t) { + cancelSilently(); + state = State.FAILED; + throw UserException.executionError(t) +.addContext("Exception thrown from", operatorLabel()) +.build(OperatorRecordBatch.logger); +} + } + + /** + * Cancels the operator before reaching EOF. + */ + + public void cancel() { +try { + switch (state) { + case START: + case RUN: +cancelSilently(); +break; + default: +break; + } +} finally { + state = State.FAILED; +} + } + + /** + * Start the operator executor. Bind it to the various contexts. + * Then start the executor and fetch the first schema. + * @return result of the first batch, which should contain + * only a schema, or EOF + */ + + private IterOutcome start() { --- End diff -- would be good if we can capture what exceptions each of these functions might throw. ---
[GitHub] drill pull request #1121: DRILL-6153: Operator framework
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1121#discussion_r168325261 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorExec.java --- @@ -0,0 +1,127 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.protocol; + +import org.apache.drill.exec.ops.OperatorContext; + +/** + * Core protocol for a Drill operator execution. + * + * Lifecycle + * + * + * Creation via an operator-specific constructor in the + * corresponding RecordBatchCreator. + * bind() called to provide the operator services. + * buildSchema() called to define the schema before + * fetching the first record batch. + * next() called repeatedly to prepare each new record + * batch until EOF or until cancellation. + * cancel() called if the operator should quit early. + * close() called to release resources. Note that + * close() is called in response to: + * EOF + * After cancel() + * After an exception is thrown. + * + * + * Error Handling + * + * Any method can throw an (unchecked) exception. (Drill does not use + * checked exceptions.) 
Preferably, the code will throw a + * UserException that explains the error to the user. If any + * other kind of exception is thrown, then the enclosing class wraps it + * in a generic UserException that indicates that "something went + * wrong", which is less than ideal. + * + * Result Set + * The operator "publishes" a result set in response to returning + * true from next() by populating a + * {@link BatchAccesor} provided via {@link #batchAccessor()}. For + * compatibility with other Drill operators, the set of vectors within + * the batch must be the same from one batch to the next. + */ + +public interface OperatorExec { + + /** + * Bind this operator to the context. The context provides access + * to per-operator, per-fragment and per-Drillbit services. + * Also provides access to the operator definition (AKA "pop + * config") for this operator. + * + * @param context operator context + */ + + public void bind(OperatorContext context); + + /** + * Provides a generic access mechanism to the batch's output data. + * This method is called after a successful return from + * {@link #buildSchema()} and {@link #next()}. The batch itself + * can be held in a standard {@link VectorContainer}, or in some + * other structure more convenient for this operator. + * + * @return the access for the batch's output container + */ + + BatchAccessor batchAccessor(); + + /** + * Retrieves the schema of the batch before the first actual batch + * of data. The schema is returned via an empty batch (no rows, + * only schema) from {@link #batchAccessor()}. + * + * @return true if a schema is available, false if the operator + * reached EOF before a schema was found + */ + + boolean buildSchema(); + + /** + * Retrieves the next batch of data. The data is returned via + * the {@link #batchAccessor()} method. 
+ * + * @return true if another batch of data is available, false if + * EOF was reached and no more data is available + */ + + boolean next(); + + /** + * Alerts the operator that the query was cancelled. Generally + * optional, but allows the operator to realize that a cancellation --- End diff -- why is this optional ? ---
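The lifecycle this javadoc describes (bind, buildSchema, repeated next() calls, close() on EOF or failure) can be sketched as a driver loop. The types below are simplified stand-ins, not Drill's actual OperatorExec/OperatorContext interfaces:

```java
// A minimal sketch of the operator lifecycle described above.
// SimpleOperatorExec is an illustrative stand-in, not Drill's OperatorExec.
public class LifecycleSketch {

  /** Simplified operator protocol: buildSchema(), next(), close(). */
  public interface SimpleOperatorExec {
    boolean buildSchema();   // false => EOF before any schema was found
    boolean next();          // false => EOF, no more batches
    void close();
  }

  /** Toy operator that "produces" a fixed number of batches. */
  public static class FixedBatchOperator implements SimpleOperatorExec {
    private final int totalBatches;
    private int delivered;

    public FixedBatchOperator(int totalBatches) { this.totalBatches = totalBatches; }

    @Override public boolean buildSchema() { return totalBatches > 0; }
    @Override public boolean next() {
      if (delivered >= totalBatches) { return false; }
      delivered++;
      return true;
    }
    @Override public void close() { }
  }

  /** Drives an operator through its full lifecycle; returns batches seen. */
  public static int drive(SimpleOperatorExec op) {
    int batches = 0;
    try {
      if (!op.buildSchema()) {
        return 0;                 // EOF before a schema appeared
      }
      while (op.next()) {         // one call per published batch
        batches++;
      }
    } finally {
      op.close();                 // close() runs on EOF, cancel, or exception
    }
    return batches;
  }
}
```

Note how close() sits in a finally block, matching the javadoc's statement that close() is called after EOF, after cancel(), or after an exception is thrown.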
[GitHub] drill pull request #1121: DRILL-6153: Operator framework
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1121#discussion_r168325765 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorRecordBatch.java --- @@ -0,0 +1,167 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.physical.impl.protocol; + +import java.util.Iterator; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.common.expression.SchemaPath; +import org.apache.drill.exec.ops.FragmentContext; +import org.apache.drill.exec.ops.FragmentContextInterface; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.physical.base.PhysicalOperator; +import org.apache.drill.exec.record.BatchSchema; +import org.apache.drill.exec.record.CloseableRecordBatch; +import org.apache.drill.exec.record.TypedFieldId; +import org.apache.drill.exec.record.VectorContainer; +import org.apache.drill.exec.record.VectorWrapper; +import org.apache.drill.exec.record.WritableBatch; +import org.apache.drill.exec.record.selection.SelectionVector2; +import org.apache.drill.exec.record.selection.SelectionVector4; + +/** + * Modular implementation of the standard Drill record batch iterator + * protocol. The protocol has two parts: control of the operator and + * access to the record batch. Each is encapsulated in separate + * implementation classes to allow easier customization for each + * situation. The operator internals are, themselves, abstracted to + * yet another class with the steps represented as method calls rather + * than as internal states as in the record batch iterator protocol. + * + * Note that downstream operators make an assumption that the + * same vectors will appear from one batch to the next. That is, + * not only must the schema be the same, but if column "a" appears + * in two batches, the same value vector must back "a" in both + * batches. The TransferPair abstraction fails if different + * vectors appear across batches. 
+ */ + +public class OperatorRecordBatch implements CloseableRecordBatch { + static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(OperatorRecordBatch.class); + + private final OperatorDriver driver; + private final BatchAccessor batchAccessor; + + public OperatorRecordBatch(FragmentContext context, PhysicalOperator config, OperatorExec opExec) { +OperatorContext opContext = context.newOperatorContext(config); +opContext.getStats().startProcessing(); + +// Chicken-and-egg binding: the two objects must know about each other. Pass the +// context to the operator exec via a bind method. + +try { + opExec.bind(opContext); + driver = new OperatorDriver(opContext, opExec); + batchAccessor = opExec.batchAccessor(); +} catch (UserException e) { + opContext.close(); + throw e; +} catch (Throwable t) { + opContext.close(); + throw UserException.executionError(t) +.addContext("Exception thrown from", opExec.getClass().getSimpleName() + ".bind()") +.build(logger); +} +finally { + opContext.getStats().stopProcessing(); +} + } + + @Override + public FragmentContext getContext() { + +// Backward compatibility with the full server context. Awkward for testing + +FragmentContext fragmentContext = fragmentContext(); +if (fragmentContext instanceof FragmentContext) { --- End diff -- not clear what we are doing here. why can't we just return fragmentContext ? ---
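The `instanceof` check being questioned looks like a narrow-interface pattern: the driver holds a test-friendly interface (FragmentContextInterface), while legacy callers of getContext() need the full concrete server type. A hypothetical sketch of that pattern, with illustrative names rather than Drill's real classes:

```java
// Sketch of the narrowing pattern apparently at work in getContext().
// A checked downcast returns the full context when available and null
// when the operator was built against a lightweight test context.
public class ContextNarrowing {

  /** Narrow, easily mocked view used by the operator internals. */
  public interface NarrowContext { String name(); }

  /** Full server context; has services the narrow view does not expose. */
  public static class FullServerContext implements NarrowContext {
    @Override public String name() { return "server"; }
    public String serverOnlyService() { return "rpc"; }  // not on the interface
  }

  /** Test-only context that never carries the server services. */
  public static class TestOnlyContext implements NarrowContext {
    @Override public String name() { return "test"; }
  }

  /** Returns the full context if this is a real server run, else null. */
  public static FullServerContext asFullContext(NarrowContext ctx) {
    return (ctx instanceof FullServerContext) ? (FullServerContext) ctx : null;
  }
}
```

Under this reading, "just returning fragmentContext" would not compile because the declared return type of getContext() is the concrete class, not the interface the driver holds.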
[GitHub] drill pull request #1121: DRILL-6153: Operator framework
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1121#discussion_r168317696 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/protocol/OperatorDriver.java --- @@ -0,0 +1,183 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.protocol; + +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.exec.ops.OperatorContext; +import org.apache.drill.exec.record.RecordBatch.IterOutcome; + +/** + * State machine that drives the operator executable. Converts + * between the iterator protocol and the operator executable protocol. + * Implemented as a separate class in anticipation of eventually + * changing the record batch (iterator) protocol. + */ + +public class OperatorDriver { + public enum State { START, SCHEMA, RUN, END, FAILED, CLOSED } + + private OperatorDriver.State state = State.START; + + /** + * Operator context. The driver "owns" the context and is responsible + * for closing it. 
+ */ + + private final OperatorContext opContext; + private final OperatorExec operatorExec; + private final BatchAccessor batchAccessor; + private int schemaVersion; + + public OperatorDriver(OperatorContext opServicees, OperatorExec opExec) { --- End diff -- typo ? opServices instead of opServicees ? ---
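The driver's job, converting OperatorExec calls into iterator-protocol outcomes, can be sketched as a small state machine. The states and outcomes below are simplified stand-ins for OperatorDriver.State and RecordBatch.IterOutcome:

```java
// Sketch of the state machine idea: the first next() delivers the schema,
// subsequent calls deliver data batches, and EOF transitions to END.
public class DriverSketch {
  public enum State { START, RUN, END }
  public enum Outcome { OK_NEW_SCHEMA, OK, NONE }

  private State state = State.START;
  private final java.util.Iterator<?> batches;  // stand-in for the operator's output

  public DriverSketch(java.util.Iterator<?> batches) { this.batches = batches; }

  public Outcome next() {
    switch (state) {
      case START:                     // first call: publish the schema
        state = State.RUN;
        return Outcome.OK_NEW_SCHEMA;
      case RUN:
        if (batches.hasNext()) {
          batches.next();
          return Outcome.OK;          // another data batch available
        }
        state = State.END;
        return Outcome.NONE;          // EOF reached
      default:
        return Outcome.NONE;          // repeated calls after EOF stay at NONE
    }
  }
}
```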
[GitHub] drill pull request #1112: DRILL-6114: Metadata revisions
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1112#discussion_r168289206 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/metadata/AbstractColumnMetadata.java --- @@ -0,0 +1,206 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.record.metadata; + +import org.apache.drill.common.types.TypeProtos.DataMode; +import org.apache.drill.common.types.TypeProtos.MajorType; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.record.MaterializedField; + +/** + * Abstract definition of column metadata. Allows applications to create + * specialized forms of a column metadata object by extending from this + * abstract class. + * + * Note that, by design, primitive columns do not have a link to their + * tuple parent, or their index within that parent. This allows the same + * metadata to be shared between two views of a tuple, perhaps physical + * and projected views. This restriction does not apply to map columns, + * since maps (and the row itself) will, by definition, differ between + * the two views. 
+ */ + +public abstract class AbstractColumnMetadata implements ColumnMetadata { + + // Capture the key schema information. We cannot use the MaterializedField + // or MajorType because then encode child information that we encode here + // as a child schema. Keeping the two in sync is nearly impossible. + + protected final String name; + protected final MinorType type; + protected final DataMode mode; + protected final int precision; + protected final int scale; + protected boolean projected = true; + + /** + * Predicted number of elements per array entry. Default is + * taken from the often hard-coded value of 10. + */ + + protected int expectedElementCount = 1; + + public AbstractColumnMetadata(MaterializedField schema) { +name = schema.getName(); +MajorType majorType = schema.getType(); +type = majorType.getMinorType(); +mode = majorType.getMode(); +precision = majorType.getPrecision(); +scale = majorType.getScale(); +if (isArray()) { + expectedElementCount = DEFAULT_ARRAY_SIZE; +} + } + + public AbstractColumnMetadata(String name, MinorType type, DataMode mode) { +this.name = name; +this.type = type; +this.mode = mode; +precision = 0; +scale = 0; +if (isArray()) { + expectedElementCount = DEFAULT_ARRAY_SIZE; +} + } + + public AbstractColumnMetadata(AbstractColumnMetadata from) { +name = from.name; +type = from.type; +mode = from.mode; +precision = from.precision; +scale = from.scale; +expectedElementCount = from.expectedElementCount; + } + + protected void bind(TupleSchema parentTuple) { } + + @Override + public String name() { return name; } + + @Override + public MinorType type() { return type; } + + @Override + public MajorType majorType() { +return MajorType.newBuilder() +.setMinorType(type()) +.setMode(mode()) +.build(); + } + + @Override + public DataMode mode() { return mode; } + + @Override + public boolean isNullable() { return mode() == DataMode.OPTIONAL; } + + @Override + public boolean isArray() { return mode() == DataMode.REPEATED; } + + @Override 
+ public int dimensions() { return isArray() ? 1 : 0; } + + @Override + public boolean isMap() { return false; } + + @Override + public boolean isVariant() { return false; } --- End diff -- Seems like Variant is new name for old Union. Can we add a comment here ? ---
[GitHub] drill pull request #1112: DRILL-6114: Metadata revisions
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1112#discussion_r168291276 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/metadata/MetadataUtils.java --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.record.metadata; + +import java.util.List; + +import org.apache.drill.common.types.TypeProtos.DataMode; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.record.BatchSchema; +import org.apache.drill.exec.record.MaterializedField; + +public class MetadataUtils { + + public static TupleSchema fromFields(Iterable fields) { +TupleSchema tuple = new TupleSchema(); +for (MaterializedField field : fields) { + tuple.add(field); +} +return tuple; + } + + /** + * Create a column metadata object that holds the given + * {@link MaterializedField}. The type of the object will be either a + * primitive or map column, depending on the field's type. The logic + * here mimics the code as written, which is very messy in some places. 
+ * + * @param field the materialized field to wrap + * @return the column metadata that wraps the field + */ + + public static AbstractColumnMetadata fromField(MaterializedField field) { +MinorType type = field.getType().getMinorType(); +switch (type) { +case MAP: + return MetadataUtils.newMap(field); +case UNION: + if (field.getType().getMode() != DataMode.OPTIONAL) { +throw new UnsupportedOperationException(type.name() + " type must be nullable"); + } + return new VariantColumnMetadata(field); +case LIST: + switch (field.getType().getMode()) { + case OPTIONAL: + +// List of unions (or a degenerate union of a single type.) +// Not supported in Drill. --- End diff -- If not supported, should we throw an exception ? ---
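The review question above, whether the unsupported "list of unions" case should throw, can be sketched as an explicit failure in the classification switch. The enum and message here are illustrative, not Drill's actual MetadataUtils code:

```java
// Sketch of the reviewer's suggestion: make the unsupported mode fail
// loudly instead of silently falling through.
public class UnsupportedCase {
  public enum ListMode { OPTIONAL, REQUIRED, REPEATED }

  /** Classifies a LIST field by mode, failing explicitly when unsupported. */
  public static String classify(ListMode mode) {
    switch (mode) {
      case REPEATED:
        return "repeated-list";   // dimensional wrapper, 2D+ arrays
      case OPTIONAL:
        // List of unions (or degenerate union of one type): the diff says
        // this is not supported in Drill, so surface that as an error.
        throw new UnsupportedOperationException("List of unions is not supported");
      default:
        return "list";
    }
  }
}
```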
[GitHub] drill pull request #1112: DRILL-6114: Metadata revisions
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1112#discussion_r168301865 --- Diff: exec/vector/src/main/java/org/apache/drill/exec/record/metadata/ColumnMetadata.java --- @@ -15,36 +15,115 @@ * See the License for the specific language governing permissions and * limitations under the License. */ -package org.apache.drill.exec.record; +package org.apache.drill.exec.record.metadata; import org.apache.drill.common.types.TypeProtos.DataMode; import org.apache.drill.common.types.TypeProtos.MajorType; import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.record.MaterializedField; /** * Metadata description of a column including names, types and structure * information. */ public interface ColumnMetadata { + + /** + * Rough characterization of Drill types into metadata categories. + * Various aspects of Drill's type system are very, very messy. + * However, Drill is defined by its code, not some abstract design, + * so the metadata system here does the best job it can to simplify + * the messy type system while staying close to the underlying + * implementation. + */ + enum StructureType { -PRIMITIVE, LIST, TUPLE + +/** + * Primitive column (all types except List, Map and Union.) + * Includes (one-dimensional) arrays of those types. + */ + +PRIMITIVE, + +/** + * Map or repeated map. Also describes the row as a whole. + */ + +TUPLE, + +/** + * Union or (non-repeated) list. (A non-repeated list is, + * essentially, a repeated union.) + */ + +VARIANT, + +/** + * A repeated list. A repeated list is not simply the repeated + * form of a list, it is something else entirely. It acts as + * a dimensional wrapper around any other type (except list) + * and adds a non-nullable extra dimension. Hence, this type is + * for 2D+ arrays. + * + * In theory, a 2D list of, say, INT would be an INT column, but + * repeated in to dimensions. Alas, that is not how it is. 
Also, + * if we have a separate category for 2D lists, we should have + * a separate category for 1D lists. But, again, that is not how + * the code has evolved. + */ + +MULTI_ARRAY } - public static final int DEFAULT_ARRAY_SIZE = 10; + int DEFAULT_ARRAY_SIZE = 10; --- End diff -- why is this changed from static final int to int ? ---
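On the `static final` question: fields declared in a Java interface are implicitly public, static, and final, so `int DEFAULT_ARRAY_SIZE = 10` in the interface is identical to the earlier `public static final int` declaration; only redundant modifiers were dropped. A minimal demonstration:

```java
// Interface fields are implicitly public static final (JLS 9.3), so the
// bare declaration below is a compile-time constant accessed statically.
public class InterfaceConstants {
  public interface WithConstant {
    int DEFAULT_ARRAY_SIZE = 10;   // implicitly public static final
  }

  public static int readConstant() {
    return WithConstant.DEFAULT_ARRAY_SIZE;  // no instance needed
  }
}
```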
[GitHub] drill pull request #1112: DRILL-6114: Metadata revisions
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1112#discussion_r168295700 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/metadata/RepeatedListColumnMetadata.java --- @@ -0,0 +1,94 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.record.metadata; + +import org.apache.drill.common.types.TypeProtos.DataMode; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.record.MaterializedField; + +import com.google.common.base.Preconditions; + +public class RepeatedListColumnMetadata extends AbstractColumnMetadata { + + private AbstractColumnMetadata childSchema; + + public RepeatedListColumnMetadata(MaterializedField field) { +super(field); +Preconditions.checkArgument(field.getType().getMinorType() == MinorType.LIST); +Preconditions.checkArgument(field.getType().getMode() == DataMode.REPEATED); +Preconditions.checkArgument(field.getChildren().size() <= 1); +if (! 
field.getChildren().isEmpty()) { + childSchema = MetadataUtils.fromField(field.getChildren().iterator().next()); + Preconditions.checkArgument(childSchema.isArray()); +} + } + + public RepeatedListColumnMetadata(String name, AbstractColumnMetadata childSchema) { +super(name, MinorType.LIST, DataMode.REPEATED); +if (childSchema != null) { + Preconditions.checkArgument(childSchema.isArray()); +} +this.childSchema = childSchema; + } + + public void childSchema(ColumnMetadata childMetadata) { +Preconditions.checkState(childSchema == null); +Preconditions.checkArgument(childMetadata.mode() == DataMode.REPEATED); +childSchema = (AbstractColumnMetadata) childMetadata; + } + + @Override + public StructureType structureType() { return StructureType.MULTI_ARRAY; } + + @Override + public MaterializedField schema() { +MaterializedField field = emptySchema(); +if (childSchema != null) { + field.addChild(childSchema.schema()); +} +return field; + } + + @Override + public MaterializedField emptySchema() { +return MaterializedField.create(name(), majorType()); + } + + @Override + public ColumnMetadata cloneEmpty() { +return new RepeatedListColumnMetadata(name, null); + } + + @Override + public AbstractColumnMetadata copy() { +return new RepeatedListColumnMetadata(name, childSchema); + } + + @Override + public ColumnMetadata childSchema() { return childSchema; } + + @Override + public int dimensions() { + +// If there is no child, then we don't know the +// dimensionality. + +return childSchema == null ? -1 --- End diff -- can we use a static final instead of -1 ? ---
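The reviewer's suggestion, a named constant instead of the bare -1, might look like the sketch below. The constant name and the +1 over the child's dimensionality are illustrative assumptions, not the actual Drill code:

```java
// Sketch of a named "unknown" sentinel for dimensions(). The +1 recursion
// over a child's dimension count is an assumed reading of the elided code.
public class Dimensions {
  public static final int UNKNOWN_DIMENSIONS = -1;

  /** One more dimension than the child, or unknown when there is no child. */
  public static int dimensions(Integer childDimensions) {
    return childDimensions == null ? UNKNOWN_DIMENSIONS : childDimensions + 1;
  }
}
```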
[GitHub] drill pull request #1112: DRILL-6114: Metadata revisions
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1112#discussion_r168288500 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/metadata/AbstractColumnMetadata.java --- @@ -0,0 +1,206 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.record.metadata; + +import org.apache.drill.common.types.TypeProtos.DataMode; +import org.apache.drill.common.types.TypeProtos.MajorType; +import org.apache.drill.common.types.TypeProtos.MinorType; +import org.apache.drill.exec.record.MaterializedField; + +/** + * Abstract definition of column metadata. Allows applications to create + * specialized forms of a column metadata object by extending from this + * abstract class. + * + * Note that, by design, primitive columns do not have a link to their + * tuple parent, or their index within that parent. This allows the same + * metadata to be shared between two views of a tuple, perhaps physical + * and projected views. This restriction does not apply to map columns, + * since maps (and the row itself) will, by definition, differ between + * the two views. 
+ */ + +public abstract class AbstractColumnMetadata implements ColumnMetadata { + + // Capture the key schema information. We cannot use the MaterializedField + // or MajorType because then encode child information that we encode here + // as a child schema. Keeping the two in sync is nearly impossible. + + protected final String name; + protected final MinorType type; + protected final DataMode mode; + protected final int precision; + protected final int scale; + protected boolean projected = true; + --- End diff -- I see in the current definition of MajorType, there is also timeZone. We don't need that here ? ---
[GitHub] drill issue #1107: DRILL-6123: Limit batch size for Merge Join based on memo...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1107 @paul-rogers Thank you very much for the review. Updated the PR with the review comments addressed. Please take a look. ---
[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1107#discussion_r167380917 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/record/AbstractRecordBatchMemoryManager.java --- @@ -0,0 +1,63 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.drill.exec.record; + +import org.apache.drill.exec.physical.impl.spill.RecordBatchSizer; +import org.apache.drill.exec.vector.ValueVector; + +public abstract class AbstractRecordBatchMemoryManager { + private int outgoingRowWidth; + private int outputRowCount = ValueVector.MAX_ROW_COUNT; + protected static final int OFFSET_VECTOR_WIDTH = 4; + protected static final int WORST_CASE_FRAGMENTATION_FACTOR = 2; + protected static final int MAX_NUM_ROWS = ValueVector.MAX_ROW_COUNT; + protected static final int MIN_NUM_ROWS = 1; + + public void update(int inputIndex) {}; + + public void update() {}; + + public int getOutputRowCount() { +return outputRowCount; + } + + /** + * Given batchSize and rowWidth, this will set output rowCount taking into account + * the min and max that is allowed. 
+ */ + public void setOutputRowCount(long batchSize, int rowWidth) { +this.outputRowCount = Math.min(ValueVector.MAX_ROW_COUNT, + Math.max(RecordBatchSizer.safeDivide(batchSize/WORST_CASE_FRAGMENTATION_FACTOR, rowWidth), MIN_NUM_ROWS)); --- End diff -- batchSize here is the configured outputBatchSize that we should conform to, not the incoming batch size. ---
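The clamp in setOutputRowCount can be sketched numerically. MAX_ROW_COUNT is assumed here to be 64K (Drill's usual value-vector row limit), and safeDivide is simplified to plain integer division, which may round differently from Drill's version:

```java
// Sketch of the output-row-count clamp: halve the configured batch size
// for worst-case fragmentation, divide by row width, then clamp to
// [MIN_NUM_ROWS, MAX_ROW_COUNT].
public class RowCountSketch {
  public static final int MAX_ROW_COUNT = 65536;               // assumed 64K limit
  public static final int MIN_NUM_ROWS = 1;
  public static final int WORST_CASE_FRAGMENTATION_FACTOR = 2;

  /** Simplified stand-in for RecordBatchSizer.safeDivide. */
  public static int safeDivide(long num, long den) {
    return den == 0 ? 0 : (int) (num / den);
  }

  /** batchSize is the configured output budget, not the incoming batch size. */
  public static int outputRowCount(long batchSize, int rowWidth) {
    int fit = safeDivide(batchSize / WORST_CASE_FRAGMENTATION_FACTOR, rowWidth);
    return Math.min(MAX_ROW_COUNT, Math.max(fit, MIN_NUM_ROWS));
  }
}
```

With a 16 MB budget and 100-byte rows the divide yields ~84K rows, so the 64K cap applies; with rows wider than the budget the floor of one row applies.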
[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1107#discussion_r167380761 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java --- @@ -311,8 +311,8 @@ public static ColumnSize getColumn(ValueVector v, String prefix) { public RecordBatchSizer(RecordBatch batch) { this(batch, - (batch.getSchema().getSelectionVectorMode() == BatchSchema.SelectionVectorMode.TWO_BYTE) ? - batch.getSelectionVector2() : null); + (batch.getSchema() == null ? null : (batch.getSchema().getSelectionVectorMode() == BatchSchema.SelectionVectorMode.TWO_BYTE ? --- End diff -- yes, we can get empty batches with an empty schema, and I think it makes sense to add the check here instead of in the calling code. That way, it will be handled transparently underneath. ---
[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1107#discussion_r167380394 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java --- @@ -137,10 +137,10 @@ public ColumnSize(ValueVector v, String prefix) { case MAP: case UNION: // No standard size for Union type -dataSize = v.getPayloadByteCount(valueCount); +dataSize = valueCount == 0 ? 0 : v.getPayloadByteCount(valueCount); --- End diff -- When we get an empty batch, getPayloadByteCount will throw an out-of-bounds exception if the underlying method tries to read the value vector at index 0, i.e. when it is trying to read the offset vector. I added the check in those methods and removed it here. I still have to retain the check here for the case when we are trying to read the offset vector. ---
[GitHub] drill pull request #1115: DRILL-6138: Move RecordBatchSizer to org.apache.dr...
GitHub user ppadma opened a pull request: https://github.com/apache/drill/pull/1115 DRILL-6138: Move RecordBatchSizer to org.apache.drill.exec.record package Also, changed columnSizes in RecordBatchSizer from list to map so we can look up entries by field name. @Ben-Zvi can you please review ? You can merge this pull request into a Git repository by running: $ git pull https://github.com/ppadma/drill DRILL-6138 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1115.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1115 commit b4104e73ad37dee96e29e345b958412791ab9079 Author: Padma Penumarthy Date: 2018-02-06T05:41:45Z DRILL-6138: Move RecordBatchSizer to org.apache.drill.exec.record package ---
[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1107#discussion_r166427892 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/MergeJoinBatch.java --- @@ -102,20 +105,78 @@ private final List comparators; private final JoinRelType joinType; private JoinWorker worker; + private final long outputBatchSize; private static final String LEFT_INPUT = "LEFT INPUT"; private static final String RIGHT_INPUT = "RIGHT INPUT"; + private class MergeJoinMemoryManager extends AbstractRecordBatchMemoryManager { +private int leftRowWidth; +private int rightRowWidth; + +/** + * mergejoin operates on one record at a time from the left and right batches + * using RecordIterator abstraction. We have a callback mechanism to get notified + * when new batch is loaded in record iterator. + * This can get called in the middle of current output batch we are building. + * when this gets called, adjust number of output rows for the current batch and + * update the value to be used for subsequent batches. + */ +@Override +public void update(int inputIndex) { + switch(inputIndex) { +case 0: + final RecordBatchSizer leftSizer = new RecordBatchSizer(left); + leftRowWidth = leftSizer.netRowWidth(); + break; +case 1: + final RecordBatchSizer rightSizer = new RecordBatchSizer(right); + rightRowWidth = rightSizer.netRowWidth(); +default: + break; + } + + final int newOutgoingRowWidth = leftRowWidth + rightRowWidth; + + // If outgoing row width is 0, just return. This is possible for empty batches or + // when first set of batches come with OK_NEW_SCHEMA and no data. + if (newOutgoingRowWidth == 0) { +return; + } + + // update the value to be used for next batch(es) + setOutputRowCount(Math.min(ValueVector.MAX_ROW_COUNT, + Math.max(RecordBatchSizer.safeDivide(outputBatchSize/WORST_CASE_FRAGMENTATION_FACTOR, newOutgoingRowWidth), MIN_NUM_ROWS))); + + // Adjust for the current batch. 
+ // calculate memory used so far based on previous outgoing row width and how many rows we already processed. + final long memoryUsed = status.getOutPosition() * getOutgoingRowWidth(); + // This is the remaining memory. + final long remainingMemory = Math.max(outputBatchSize/WORST_CASE_FRAGMENTATION_FACTOR - memoryUsed, 0); + // These are number of rows we can fit in remaining memory based on new outgoing row width. + final int numOutputRowsRemaining = RecordBatchSizer.safeDivide(remainingMemory, newOutgoingRowWidth); + + final int adjustedOutputRowCount = Math.min(MAX_NUM_ROWS, Math.max(status.getOutPosition() + numOutputRowsRemaining, MIN_NUM_ROWS)); + status.setOutputRowCount(adjustedOutputRowCount); + setOutgoingRowWidth(newOutgoingRowWidth); --- End diff -- Yes, this is how it works. We read from left and right side using RecordIterator abstraction, which reads full record batches underneath and gives one record at a time with its next call. I have a callback mechanism when we read a new batch in record iterator to adjust the row widths. When I get the callback, I adjust row count for the current batch we are working on based on remaining memory available for the current batch and also compute and save the row count I should use for next full batch. In the innerNext, when we start working on the next output batch, I am setting the target output row count based on this value. Addressed all other code review comments. Please take a look when you get a chance. ---
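The mid-batch adjustment described here, recomputing what has been spent at the old row width and how many rows of the new width still fit, can be sketched as follows. The constants and plain integer division are stand-ins for Drill's actual values and safeDivide:

```java
// Sketch of the merge-join mid-batch adjustment: when a new input batch
// changes the outgoing row width partway through an output batch, keep the
// rows already written and refit the remainder of the memory budget.
public class MidBatchAdjust {
  public static final int MAX_NUM_ROWS = 65536;   // assumed vector row limit
  public static final int MIN_NUM_ROWS = 1;

  public static int adjustedRowCount(long budget, int rowsSoFar,
                                     int oldRowWidth, int newRowWidth) {
    long memoryUsed = (long) rowsSoFar * oldRowWidth;     // already written
    long remaining = Math.max(budget - memoryUsed, 0);    // budget left over
    int moreRows = newRowWidth == 0 ? 0 : (int) (remaining / newRowWidth);
    return Math.min(MAX_NUM_ROWS, Math.max(rowsSoFar + moreRows, MIN_NUM_ROWS));
  }
}
```

For example, with a 1000-byte budget, 10 rows already written at width 20, and a new width of 40, 800 bytes remain, so 20 more rows fit and the adjusted count is 30.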
[GitHub] drill issue #1112: DRILL-6114: Metadata revisions
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1112 @paul-rogers I ran the pre-commit tests. No issues. Everything passed. Will run them one more time once code reviews are done. ---
[GitHub] drill issue #1107: DRILL-6123: Limit batch size for Merge Join based on memo...
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1107 @sachouche @ilooner @paul-rogers Can one of you review this PR for me ? ---
[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r166142178 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/spill/RecordBatchSizer.java --- @@ -65,6 +70,14 @@ public int stdSize; +/** + * If the we can determine the exact width of the row of a vector upfront, + * the row widths is saved here. If we cannot determine the exact width + * (for example for VarChar or Repeated vectors), then + */ + +private int knownSize = -1; --- End diff -- Like I mentioned in other comment, seems like we can just use stdSize. ---
[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r166098364 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -140,6 +131,9 @@ private OperatorContext oContext; private BufferAllocator allocator; + private Map keySizes; + // The size estimates for varchar value columns. The keys are the index of the varchar value columns. + private Map varcharValueSizes; --- End diff -- Don't you need to adjust size estimates for repeated types also? ---
[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r166141274 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -733,28 +780,32 @@ private void restoreReservedMemory() { * @param records */ private void allocateOutgoing(int records) { -// Skip the keys and only allocate for outputting the workspace values -// (keys will be output through splitAndTransfer) -Iterator> outgoingIter = outContainer.iterator(); -for (int i = 0; i < numGroupByOutFields; i++) { - outgoingIter.next(); -} - // try to preempt an OOM by using the reserved memory useReservedOutgoingMemory(); long allocatedBefore = allocator.getAllocatedMemory(); -while (outgoingIter.hasNext()) { +for (int columnIndex = numGroupByOutFields; columnIndex < outContainer.getNumberOfColumns(); columnIndex++) { + final VectorWrapper wrapper = outContainer.getValueVector(columnIndex); @SuppressWarnings("resource") - ValueVector vv = outgoingIter.next().getValueVector(); + final ValueVector vv = wrapper.getValueVector(); - AllocationHelper.allocatePrecomputedChildCount(vv, records, maxColumnWidth, 0); + final RecordBatchSizer.ColumnSize columnSizer = new RecordBatchSizer.ColumnSize(wrapper.getValueVector()); + int columnSize; + + if (columnSizer.hasKnownSize()) { +// For fixed width vectors we know the size of each record +columnSize = columnSizer.getKnownSize(); + } else { +// For var chars we need to use the input estimate +columnSize = varcharValueSizes.get(columnIndex); + } + + AllocationHelper.allocatePrecomputedChildCount(vv, records, columnSize, 0); --- End diff -- I think we should also get elementCount from sizer and use that instead of passing 0. ---
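As context for the elementCount suggestion above, here is a rough sketch of how per-record width and, for repeated types, average elements per record feed an allocation size. The names are hypothetical, not Drill's actual AllocationHelper API:

```java
// Illustrative allocation sizing: a flat vector needs records * width bytes,
// while a repeated vector sizes its child by the average elements per record.
public class AllocationSketch {
    /** Bytes for a non-repeated column: one value per record. */
    static long dataBytesNeeded(int records, int avgBytesPerRecord) {
        return (long) records * avgBytesPerRecord;
    }

    /** Bytes for a repeated column's child vector (elementCount idea). */
    static long repeatedDataBytesNeeded(int records, double avgElementsPerRecord,
                                        int avgBytesPerElement) {
        return (long) Math.ceil(records * avgElementsPerRecord) * avgBytesPerElement;
    }

    public static void main(String[] args) {
        System.out.println(dataBytesNeeded(1024, 50));
        System.out.println(repeatedDataBytesNeeded(100, 2.5, 8));
    }
}
```

Passing 0 for the element count, as the diff does, ignores the second term entirely, which is why the reviewer suggests taking elementCount from the sizer.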
[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r166096630 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -215,6 +206,7 @@ public BatchHolder() { MaterializedField outputField = materializedValueFields[i]; // Create a type-specific ValueVector for this value vector = TypeHelper.getNewVector(outputField, allocator); + int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize; --- End diff -- There is already stdSize, which is kind of doing the same thing. Can we use that instead of knownSize? ---
[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r166142507 --- Diff: exec/vector/src/main/codegen/templates/FixedValueVectors.java --- @@ -298,6 +298,11 @@ public int getPayloadByteCount(int valueCount) { return valueCount * ${type.width}; } + @Override + public int getValueWidth() { --- End diff -- If we are using stdSize and get the value from TypeHelper, we don't need all value vectors to have this new function. ---
[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r166136279 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -226,7 +221,7 @@ public BatchHolder() { ((FixedWidthVector) vector).allocateNew(HashTable.BATCH_SIZE); } else if (vector instanceof VariableWidthVector) { // This case is never used a varchar falls under ObjectVector which is allocated on the heap ! -((VariableWidthVector) vector).allocateNew(maxColumnWidth, HashTable.BATCH_SIZE); +((VariableWidthVector) vector).allocateNew(columnSize, HashTable.BATCH_SIZE); --- End diff -- For a just-allocated vector, estSize will return 0. How can we use that for allocation? ---
[GitHub] drill pull request #1107: DRILL-6123: Limit batch size for Merge Join based ...
GitHub user ppadma opened a pull request: https://github.com/apache/drill/pull/1107 DRILL-6123: Limit batch size for Merge Join based on memory Merge join limits output batch size to 32K rows irrespective of row size. This can create large batches (in terms of memory), depending upon average row width. Changed the logic to figure out output row count based on the memory specified with the new outputBatchSize option and the average outgoing row width. The average outgoing row width is the sum of the left and right batch row widths. The output row count is clamped to a minimum of 1 and a maximum of 64K. Added an AbstractRecordBatchMemoryManager class to be used across all operators. Restructured the code a little bit for that. You can merge this pull request into a Git repository by running: $ git pull https://github.com/ppadma/drill DRILL-6123 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/drill/pull/1107.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1107 commit e01da78116730afbcd9b5062e316678cebc4848f Author: Padma Penumarthy Date: 2018-01-31T00:58:57Z DRILL-6123: Limit batch size for Merge Join based on memory ---
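The row-count logic in the description can be sketched as follows. It is standalone Java; the clamping bounds come from the prose above, while the method and constant names are assumed for illustration:

```java
// Sketch of the merge-join output row count: batch memory budget divided by
// the combined left+right row width, clamped to [1, 64K].
public class MergeJoinRowCount {
    static final int MIN_ROWS = 1;
    static final int MAX_ROWS = 64 * 1024; // the 64K cap from the description

    static int outputRowCount(long outputBatchSize, int leftRowWidth, int rightRowWidth) {
        // average outgoing row width is the sum of the left and right widths
        long outgoingWidth = (long) leftRowWidth + rightRowWidth;
        long rows = outgoingWidth == 0 ? MAX_ROWS : outputBatchSize / outgoingWidth;
        return (int) Math.min(MAX_ROWS, Math.max(rows, MIN_ROWS));
    }

    public static void main(String[] args) {
        // 16 MB budget, 100-byte left rows joined with 412-byte right rows.
        System.out.println(outputRowCount(16 * 1024 * 1024, 100, 412));
    }
}
```

With wide rows the count drops well below the old fixed 32K, which is exactly the memory-control behavior the PR is after.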
[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r165156589 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -215,6 +206,7 @@ public BatchHolder() { MaterializedField outputField = materializedValueFields[i]; // Create a type-specific ValueVector for this value vector = TypeHelper.getNewVector(outputField, allocator); + int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize; --- End diff -- @ilooner That is the point. If we know the exact value, why do we need RecordBatchSizer? We should use RecordBatchSizer when we need sizing information for a batch (in most cases, the incoming batch). In this case, you are allocating memory for value vectors for the batch you are building. For fixed-width columns, you can get the column width for each type you are allocating memory for using TypeHelper.getSize. For variable-width columns, TypeHelper.getSize assumes 50 bytes. If you want to adjust the memory you are allocating for variable-width columns in the outgoing batch based on the incoming batch, that is when you use RecordBatchSizer on the actual incoming batch to figure out the average size of that column. You can also use RecordBatchSizer on the incoming batch if you want to figure out how many values to allocate memory for in the outgoing batch. Note that, with your change, for just-created value vectors, variable-width columns will return an estSize of 1, which is not what you want. ---
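To make the distinction concrete, here is an illustrative sketch (not Drill code, though the names mimic estSize and TypeHelper.getSize) contrasting an observed average width, which is 0 for an empty batch, with a fixed per-type width known up front:

```java
// Two sources of a column's width: measured from an actual batch vs. a
// static per-type default. The type widths below are assumptions.
public class ColumnSizeSketch {
    /** Observed average bytes per value; 0 for an empty batch, by design. */
    static int estSize(long allocatedBytes, int valueCount) {
        return valueCount == 0 ? 0 : (int) (allocatedBytes / valueCount);
    }

    /** Fixed per-type width known up front (the TypeHelper.getSize idea). */
    static int typeWidth(String minorType) {
        switch (minorType) {
            case "INT":     return 4;
            case "BIGINT":  return 8;
            case "VARCHAR": return 50; // variable-width default mentioned above
            default: throw new IllegalArgumentException(minorType);
        }
    }

    public static void main(String[] args) {
        System.out.println(estSize(640, 10));     // measured average
        System.out.println(typeWidth("VARCHAR")); // static default
    }
}
```

Measuring a just-created, empty vector gives 0 (or a degenerate value), which is why the reviewer argues the static per-type width is the right input when allocating a brand-new batch.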
[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164634551 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -215,6 +206,7 @@ public BatchHolder() { MaterializedField outputField = materializedValueFields[i]; // Create a type-specific ValueVector for this value vector = TypeHelper.getNewVector(outputField, allocator); + int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize; --- End diff -- Why not just use TypeHelper.getSize(outputField.getType())? It seems like you don't need RecordBatchSizer for this. estSize in RecordBatchSizer takes actual memory allocation into account, i.e. it includes the overhead of unused vector space, and we need that value as well for different reasons. I also think it is doing the right thing by returning zero when there are no rows and no memory allocated. ---
[GitHub] drill pull request #1101: DRILL-6032: Made the batch sizing for HashAgg more...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1101#discussion_r164599681 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java --- @@ -215,6 +206,7 @@ public BatchHolder() { MaterializedField outputField = materializedValueFields[i]; // Create a type-specific ValueVector for this value vector = TypeHelper.getNewVector(outputField, allocator); + int columnSize = new RecordBatchSizer.ColumnSize(vector).estSize; --- End diff -- I wonder what is wrong if estSize is 0 when there is no data. If there is no data for a column, why would we want to add its value width to the outgoing row width? ---
[GitHub] drill pull request #1091: DRILL-6071: Limit batch size for flatten operator
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1091#discussion_r164231963 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/flatten/FlattenRecordBatch.java --- @@ -94,8 +98,57 @@ private void clear() { } } + private class FlattenMemoryManager { --- End diff -- The memory manager actually requires an incoming batch. All the tests in testOutputBatch exercise this code. I went down the path of writing tests just for the memory manager, but they are redundant, doing the same thing as the other tests. Also, I did not see much point in validating output row count, since we are interested in batch size more than row count. Please let me know if you feel otherwise and I can include those tests as well. ---
[GitHub] drill pull request #1091: DRILL-6071: Limit batch size for flatten operator
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1091#discussion_r164233329 --- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/unit/TestOutputBatchSize.java --- @@ -0,0 +1,498 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ + +package org.apache.drill.exec.physical.unit; + +import com.google.common.collect.Lists; +import org.apache.drill.common.expression.SchemaPath; + +import org.apache.drill.exec.physical.base.AbstractBase; +import org.apache.drill.exec.physical.base.PhysicalOperator; +import org.apache.drill.exec.physical.config.FlattenPOP; +import org.apache.drill.exec.physical.impl.ScanBatch; +import org.apache.drill.exec.physical.impl.spill.RecordBatchSizer; +import org.apache.drill.exec.record.RecordBatch; +import org.apache.drill.exec.record.VectorAccessible; +import org.apache.drill.exec.util.JsonStringArrayList; +import org.apache.drill.exec.util.JsonStringHashMap; +import org.apache.drill.exec.util.Text; +import org.junit.Ignore; +import org.junit.Test; + +import java.util.ArrayList; +import java.util.Arrays; +import java.util.List; + +public class TestOutputBatchSize extends PhysicalOpUnitTestBase { --- End diff -- I added a test case like this, testFlattenLargeRecords, and there are a bunch of other test cases as well. All the tests verify the batch sizes and number of batches. ---
[GitHub] drill issue #1091: DRILL-6071: Limit batch size for flatten operator
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1091 @paul-rogers Thank you Paul. Ready for review. Please take a look when you get a chance. ---
[GitHub] drill issue #1091: DRILL-6071: Limit batch size for flatten operator
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1091 @paul-rogers Addressed code review comments. Added a new system/session option to configure output batch size. Please review. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r161900034 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- @@ -19,44 +19,286 @@ import io.netty.buffer.DrillBuf; -public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { +/** SQL Pattern Contains implementation */ +public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher { + private final MatcherFcn matcherFcn; public SqlPatternContainsMatcher(String patternString) { super(patternString); + +// Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is +// that there is no single implementation that can do it all well. So, we use multiple implementations +// chosen based on the pattern length. +if (patternLength == 0) { + matcherFcn = new MatcherZero(); +} else if (patternLength == 1) { + matcherFcn = new MatcherOne(); +} else if (patternLength == 2) { + matcherFcn = new MatcherTwo(); +} else if (patternLength == 3) { + matcherFcn = new MatcherThree(); +} else if (patternLength < 10) { + matcherFcn = new MatcherN(); +} else { + matcherFcn = new BoyerMooreMatcher(); +} } @Override public int match(int start, int end, DrillBuf drillBuf) { +return matcherFcn.match(start, end, drillBuf); + } + + //-- + // Inner Data Structure + // -- + + /** Abstract matcher class to allow us pick the most efficient implementation */ + private abstract class MatcherFcn { +protected final byte[] patternArray; + +protected MatcherFcn() { + assert patternByteBuffer.hasArray(); + + patternArray = patternByteBuffer.array(); +} + +/** + * @return 1 if the pattern was matched; 0 otherwise + */ +protected abstract int match(int start, int end, DrillBuf drillBuf); + } + + /** Handles patterns with length one */ + private final class MatcherZero extends MatcherFcn { -if (patternLength == 0) { // Everything should match for null pattern 
string +private MatcherZero() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { return 1; } + } + + /** Handles patterns with length one */ + private final class MatcherOne extends MatcherFcn { + +private MatcherOne() { + super(); +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; + + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +byte inputByte = drillBuf.getByte(start + idx); + +if (firstPattByte != inputByte) { + continue; +} +return 1; + } + return 0; +} + } + + /** Handles patterns with length two */ + private final class MatcherTwo extends MatcherFcn { + +private MatcherTwo() { +} + +/** {@inheritDoc} */ +@Override +protected final int match(int start, int end, DrillBuf drillBuf) { + final int lengthToProcess = end - start - 1; + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; + + // simplePattern string has meta characters i.e % and _ and escape characters removed. + // so, we can just directly compare. + for (int idx = 0; idx < lengthToProcess; idx++) { +final byte firstInByte = drillBuf.getByte(start + idx); -final int txtLength = end - start; +if (firstPattByte != firstInByte) { + continue; +} else { + final byte secondInByte = drillBuf.getByte(start + idx +1); -// no match if input string length is less than pattern length -if (txtLength < patternLength) { + if (secondInByte == secondPattByte) { +return 1; + } +} + } return 0; } + } + /** Handles patterns with length three */ + private final class MatcherT
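The dispatch-by-pattern-length idea in the diff above can be condensed into a small, self-contained sketch. This is plain Java over byte arrays rather than DrillBuf, and every class and method name here is illustrative, not the PR's actual code:

```java
// Pick a specialized matcher once at construction, then reuse it per value:
// trivial match for an empty pattern, a tight single-byte scan for length 1,
// and a general nested-loop scan standing in for MatcherN / Boyer-Moore.
public class ContainsMatcherSketch {
    interface MatcherFcn { int match(byte[] input); }

    final byte[] pattern;
    final MatcherFcn matcher;

    ContainsMatcherSketch(String patternString) {
        pattern = patternString.getBytes(java.nio.charset.StandardCharsets.UTF_8);
        if (pattern.length == 0) {
            matcher = input -> 1;          // empty pattern matches everything
        } else if (pattern.length == 1) {
            final byte p0 = pattern[0];    // cache the byte once, per the review
            matcher = input -> {
                for (byte b : input) { if (b == p0) return 1; }
                return 0;
            };
        } else {
            matcher = this::generalMatch;  // fallback for longer patterns
        }
    }

    int generalMatch(byte[] input) {
        outer:
        for (int i = 0; i + pattern.length <= input.length; i++) {
            for (int j = 0; j < pattern.length; j++) {
                if (input[i + j] != pattern[j]) continue outer;
            }
            return 1;
        }
        return 0;
    }

    int match(String s) {
        return matcher.match(s.getBytes(java.nio.charset.StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        System.out.println(new ContainsMatcherSketch("ell").match("hello"));
    }
}
```

The design point is that the length check and virtual dispatch happen once per pattern, not once per row, which matters when the matcher runs over millions of values.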
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r161895458 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- + final int lengthToProcess = end - start; + final byte firstPattByte = patternArray[0]; --- End diff -- Can we name it firstPatternByte? ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r161906070 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- + final byte secondInByte = drillBuf.getByte(start + idx +1); --- End diff -- Space between + and 1. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r161899357 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- +private MatcherOne() { + super(); --- End diff -- Redundant? ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r161905865 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- + final byte firstPattByte = patternArray[0]; + final byte secondPattByte = patternArray[1]; --- End diff -- Can you initialize them in the constructor for the matcher function? That way you can initialize once and reuse for each match. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r161895158 --- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java --- + /** Handles patterns with length one */ + private final class MatcherZero extends MatcherFcn { --- End diff -- length zero. ---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r161899883

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---

@@ -19,44 +19,286 @@
 import io.netty.buffer.DrillBuf;

-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;

   public SqlPatternContainsMatcher(String patternString) {
     super(patternString);
+
+    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
+    // that there is no single implementation that can do it all well. So, we use multiple implementations
+    // chosen based on the pattern length.
+    if (patternLength == 0) {
+      matcherFcn = new MatcherZero();
+    } else if (patternLength == 1) {
+      matcherFcn = new MatcherOne();
+    } else if (patternLength == 2) {
+      matcherFcn = new MatcherTwo();
+    } else if (patternLength == 3) {
+      matcherFcn = new MatcherThree();
+    } else if (patternLength < 10) {
+      matcherFcn = new MatcherN();
+    } else {
+      matcherFcn = new BoyerMooreMatcher();
+    }
   }

   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+    return matcherFcn.match(start, end, drillBuf);
+  }
+
+  // --------------------------------------------------------------------
+  // Inner Data Structure
+  // --------------------------------------------------------------------
+
+  /** Abstract matcher class to allow us pick the most efficient implementation */
+  private abstract class MatcherFcn {
+    protected final byte[] patternArray;
+
+    protected MatcherFcn() {
+      assert patternByteBuffer.hasArray();
+
+      patternArray = patternByteBuffer.array();
+    }
+
+    /**
+     * @return 1 if the pattern was matched; 0 otherwise
+     */
+    protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherZero extends MatcherFcn {

-    if (patternLength == 0) { // Everything should match for null pattern string
+    private MatcherZero() {
+    }
+
+    /** {@inheritDoc} */
+    @Override
+    protected final int match(int start, int end, DrillBuf drillBuf) {
       return 1;
     }
+  }
+
+  /** Handles patterns with length one */
+  private final class MatcherOne extends MatcherFcn {
+
+    private MatcherOne() {
+      super();
+    }
+
+    /** {@inheritDoc} */
+    @Override
+    protected final int match(int start, int end, DrillBuf drillBuf) {
+      final int lengthToProcess = end - start;
+      final byte firstPattByte = patternArray[0];
+
+      // simplePattern string has meta characters i.e % and _ and escape characters removed.
+      // so, we can just directly compare.
+      for (int idx = 0; idx < lengthToProcess; idx++) {
+        byte inputByte = drillBuf.getByte(start + idx);
+
+        if (firstPattByte != inputByte) {
+          continue;
+        }
+        return 1;
+      }
+      return 0;
+    }
+  }
+
+  /** Handles patterns with length two */
+  private final class MatcherTwo extends MatcherFcn {
+
+    private MatcherTwo() {
+    }
+
+    /** {@inheritDoc} */
+    @Override
+    protected final int match(int start, int end, DrillBuf drillBuf) {
+      final int lengthToProcess = end - start - 1;
+      final byte firstPattByte = patternArray[0];
+      final byte secondPattByte = patternArray[1];
+
+      // simplePattern string has meta characters i.e % and _ and escape characters removed.
+      // so, we can just directly compare.
+      for (int idx = 0; idx < lengthToProcess; idx++) {
+        final byte firstInByte = drillBuf.getByte(start + idx);

-    final int txtLength = end - start;
+        if (firstPattByte != firstInByte) {
+          continue;
+        } else {
+          final byte secondInByte = drillBuf.getByte(start + idx + 1);

-    // no match if input string length is less than pattern length
-    if (txtLength < patternLength) {
+          if (secondInByte == secondPattByte) {
+            return 1;
+          }
+        }
+      }
       return 0;
     }
+  }

+  /** Handles patterns with length three */
+  private final class MatcherT
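For patterns of length ten or more, the dispatch in this diff falls back to a BoyerMooreMatcher, whose body is not included in this archive. As a rough illustration of the idea only, here is a hypothetical Boyer-Moore-Horspool "contains" search over a plain byte array; the class name, the byte[] input, and the 1/0 return convention mirroring the PR are all assumptions, since the real matcher reads from a DrillBuf:

```java
import java.util.Arrays;

// Hypothetical sketch of a Boyer-Moore-Horspool "contains" matcher.
// Assumes pattern.length >= 1 (the PR only uses this path for length >= 10).
final class HorspoolContains {
  private final byte[] pattern;
  private final int[] shift = new int[256];

  HorspoolContains(byte[] pattern) {
    this.pattern = pattern;
    // Default shift: a byte absent from the pattern skips the whole pattern.
    Arrays.fill(shift, pattern.length);
    // Bytes occurring in the pattern (except its last byte) get smaller shifts.
    for (int i = 0; i < pattern.length - 1; i++) {
      shift[pattern[i] & 0xff] = pattern.length - 1 - i;
    }
  }

  /** @return 1 if the pattern occurs in input[start, end); 0 otherwise */
  int match(byte[] input, int start, int end) {
    final int m = pattern.length;
    int i = start;
    while (i + m <= end) {
      int j = m - 1;
      // Compare right-to-left at the current alignment.
      while (j >= 0 && input[i + j] == pattern[j]) {
        j--;
      }
      if (j < 0) {
        return 1; // full pattern matched at offset i
      }
      // Skip ahead based on the input byte under the pattern's last position.
      i += shift[input[i + m - 1] & 0xff];
    }
    return 0;
  }
}
```

The win over the naive loop is the shift table: on a mismatch the alignment advances by up to the full pattern length instead of by one byte, which is why it only pays off for longer patterns.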
[GitHub] drill issue #1091: DRILL-6071: Limit batch size for flatten operator
Github user ppadma commented on the issue: https://github.com/apache/drill/pull/1091

@paul-rogers Paul, can you please review this PR?

---
[GitHub] drill pull request #1091: DRILL-6071: Limit batch size for flatten operator
GitHub user ppadma opened a pull request: https://github.com/apache/drill/pull/1091

DRILL-6071: Limit batch size for flatten operator

Please see DRILL-6071 for details.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ppadma/drill DRILL-6071

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/drill/pull/1091.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #1091

commit 8c7d60440e41efeab53fe78d9db579e7f85be149
Author: Padma Penumarthy
Date: 2018-01-10T13:06:58Z

    DRILL-6071: Limit batch size for flatten operator

---
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158033645

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---

@@ -19,44 +19,283 @@
 import io.netty.buffer.DrillBuf;

-public class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+/** SQL Pattern Contains implementation */
+public final class SqlPatternContainsMatcher extends AbstractSqlPatternMatcher {
+  private final MatcherFcn matcherFcn;

   public SqlPatternContainsMatcher(String patternString) {
     super(patternString);
+
+    // Pattern matching is 1) a CPU intensive operation and 2) pattern and input dependent. The conclusion is
+    // that there is no single implementation that can do it all well. So, we use multiple implementations
+    // chosen based on the pattern length.
+    if (patternLength == 1) {
+      matcherFcn = new Matcher1();
+    } else if (patternLength == 2) {
+      matcherFcn = new Matcher2();
+    } else if (patternLength == 3) {
+      matcherFcn = new Matcher3();
+    } else if (patternLength < 10) {
+      matcherFcn = new MatcherN();
+    } else {
+      matcherFcn = new BoyerMooreMatcher();
+    }
   }

   @Override
   public int match(int start, int end, DrillBuf drillBuf) {
+    return matcherFcn.match(start, end, drillBuf);
+  }
+
+  // --------------------------------------------------------------------
+  // Inner Data Structure
+  // --------------------------------------------------------------------
+
+  /** Abstract matcher class to allow us pick the most efficient implementation */
+  private abstract class MatcherFcn {
+    protected final byte[] patternArray;
+
+    protected MatcherFcn() {
+      assert patternByteBuffer.hasArray();
+
+      patternArray = patternByteBuffer.array();
+    }
+
+    /**
+     * @return 1 if the pattern was matched; 0 otherwise
+     */
+    protected abstract int match(int start, int end, DrillBuf drillBuf);
+  }
+
+  /** Handles patterns with length one */
+  private final class Matcher1 extends MatcherFcn {

-    if (patternLength == 0) { // Everything should match for null pattern string
-      return 1;
+    private Matcher1() {
+      super();
     }

-    final int txtLength = end - start;
+    /** {@inheritDoc} */
+    @Override
+    protected final int match(int start, int end, DrillBuf drillBuf) {
+      final int lengthToProcess = end - start;
+      final byte firstPattByte = patternArray[0];

-    // no match if input string length is less than pattern length
-    if (txtLength < patternLength) {
+      // simplePattern string has meta characters i.e % and _ and escape characters removed.
+      // so, we can just directly compare.
+      for (int idx = 0; idx < lengthToProcess; idx++) {
+        byte inputByte = drillBuf.getByte(start + idx);
+
+        if (firstPattByte != inputByte) {
+          continue;
+        }
+        return 1;
+      }
       return 0;
     }
+  }

+  /** Handles patterns with length two */
+  private final class Matcher2 extends MatcherFcn {

-    final int outerEnd = txtLength - patternLength;
+    private Matcher2() {
+      super();
+    }

-    outer:
-    for (int txtIndex = 0; txtIndex <= outerEnd; txtIndex++) {
+    /** {@inheritDoc} */
+    @Override
+    protected final int match(int start, int end, DrillBuf drillBuf) {
+      final int lengthToProcess = end - start - 1;
+      final byte firstPattByte = patternArray[0];
+      final byte secondPattByte = patternArray[1];

       // simplePattern string has meta characters i.e % and _ and escape characters removed.
       // so, we can just directly compare.
-      for (int patternIndex = 0; patternIndex < patternLength; patternIndex++) {
-        if (patternByteBuffer.get(patternIndex) != drillBuf.getByte(start + txtIndex + patternIndex)) {
-          continue outer;
+      for (int idx = 0; idx < lengthToProcess; idx++) {
+        final byte firstInByte = drillBuf.getByte(start + idx);
+
+        if (firstPattByte != firstInByte) {
+          continue;
+        } else {
+          final byte secondInByte = drillBuf.getByte(start + idx + 1);
+
+          if (secondInByte == secondPattByte) {
+            return 1;
+          }
         }
       }
+      return 0;
+    }
+  }
+
+  /** Handles patterns with length three */
+  private final class Matcher3 extends MatcherFcn {
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158029512

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---

[excerpt at the commented line in Matcher2.match; the full diff is quoted earlier in this thread]

+    /** {@inheritDoc} */
+    @Override
+    protected final int match(int start, int end, DrillBuf drillBuf) {
+      final int lengthToProcess = end - start - 1;

--- End diff --

This is a problem. It can become negative if end is same as start i.e. null text.

---
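The issue flagged here, `end - start - 1` going negative for empty input, can be made harmless with an explicit bound check (strictly, the loop condition already fails for a non-positive length, so the loop body never runs, but the guard makes the intent clear). A hedged sketch of the two-byte logic over a plain byte array rather than the PR's DrillBuf; the method name and parameters are assumptions:

```java
// Sketch of a two-byte "contains" check guarding the zero/one-byte input
// case flagged in the review. Returns 1 on match, 0 otherwise, as in the PR.
public class TwoByteMatch {
  static int matchTwo(byte[] input, int start, int end, byte first, byte second) {
    // Guard: with fewer than 2 input bytes there can be no two-byte match,
    // and end - start - 1 would be zero or negative.
    if (end - start < 2) {
      return 0;
    }
    final int lengthToProcess = end - start - 1;
    for (int idx = 0; idx < lengthToProcess; idx++) {
      if (input[start + idx] == first && input[start + idx + 1] == second) {
        return 1; // both pattern bytes found at consecutive positions
      }
    }
    return 0;
  }
}
```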
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158025793

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---

[excerpt at the commented line in the constructor's dispatch; the full diff is quoted earlier in this thread]

+    if (patternLength == 1) {
+      matcherFcn = new Matcher1();
+    } else if (patternLength == 2) {
+      matcherFcn = new Matcher2();
+    } else if (patternLength == 3) {

--- End diff --

switch instead of if

---
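The suggested switch-based dispatch could look roughly like the following. This is a standalone sketch, not the PR's code: the real constructor assigns concrete matcher instances, so here the selector returns the matcher's name as a label purely so the dispatch logic can be exercised on its own:

```java
// Standalone sketch of the switch-based length dispatch suggested in the
// review. Mirrors the revision under review (no dedicated length-0 case).
final class MatcherDispatch {
  static String selectMatcher(int patternLength) {
    switch (patternLength) {
      case 1:
        return "Matcher1";
      case 2:
        return "Matcher2";
      case 3:
        return "Matcher3";
      default:
        // No dedicated implementation: short patterns use the generic loop,
        // long ones the Boyer-Moore matcher.
        return patternLength < 10 ? "MatcherN" : "BoyerMooreMatcher";
    }
  }
}
```

A switch keeps the per-length cases visually flat, though the range tests (< 10) still need a conditional in the default branch, which is likely why the PR kept the if/else chain.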
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158032627

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---

[excerpt at the commented line in the MatcherFcn constructor; the full diff is quoted earlier in this thread]

+    protected MatcherFcn() {
+      assert patternByteBuffer.hasArray();

--- End diff --

is this true for null pattern?

---
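Whether `hasArray()` holds depends on how `patternByteBuffer` is allocated, not on the pattern's length: heap buffers are array-backed even at zero capacity, while direct buffers are not. A small java.nio check illustrating this (the class and method names are mine; `patternByteBuffer` itself is not reproduced here):

```java
import java.nio.ByteBuffer;

// Demonstrates when ByteBuffer.hasArray() holds: heap-allocated and wrapped
// buffers are backed by an accessible array even for zero-length content;
// direct buffers report no accessible backing array.
public class HasArrayCheck {
  static boolean heapEmpty() {
    return ByteBuffer.allocate(0).hasArray();
  }

  static boolean wrapEmpty() {
    return ByteBuffer.wrap(new byte[0]).hasArray();
  }

  static boolean directSmall() {
    return ByteBuffer.allocateDirect(8).hasArray();
  }
}
```

So the assert is safe for an empty (zero-length) pattern as long as the superclass allocates the pattern buffer on the heap; it would fire if the buffer were ever switched to a direct allocation.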
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158032627 (continued at https://github.com/apache/drill/pull/1072#discussion_r158034838)

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---

[quoted diff omitted; identical to the one quoted earlier in this thread, and the review comment itself was truncated in the archive]
[GitHub] drill pull request #1072: DRILL-5879: Improved SQL Pattern Contains Performa...
Github user ppadma commented on a diff in the pull request: https://github.com/apache/drill/pull/1072#discussion_r158030478

--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/SqlPatternContainsMatcher.java ---

[quoted diff omitted; identical to the one quoted earlier in this thread, and the review comment itself was truncated in the archive]