[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)
[ https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579169#comment-16579169 ] ASF GitHub Bot commented on DRILL-6385: --- sohami commented on issue #1334: DRILL-6385: Support JPPD feature URL: https://github.com/apache/drill/pull/1334#issuecomment-412738302

@weijietong - Thanks for making the changes. But I am seeing issues in the implementation related to how the internals work, the protocol of Drill's iterative model, and a few other things. I have left a few comments for now and was wondering whether it would be easier for me to provide a commit myself to address those issues and help you get this PR pushed in faster. Also, I was working on a change for SV2 optimization [here](https://github.com/sohami/drill/commits/SV2_Optimization). Given you also have a subset of that change, with some issues, it would be great if you could rebase on top of my change. Please let me know what you think.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Support JPPD (Join Predicate Push Down)
> ---
>
> Key: DRILL-6385
> URL: https://issues.apache.org/jira/browse/DRILL-6385
> Project: Apache Drill
> Issue Type: New Feature
> Components: Server, Execution - Flow
> Affects Versions: 1.14.0
> Reporter: weijie.tong
> Assignee: weijie.tong
> Priority: Major
>
> This feature adds support for JPPD (Join Predicate Push Down). It will benefit HashJoin and Broadcast HashJoin performance by reducing the number of rows sent across the network and the memory consumed. This feature is already supported by Impala, which calls it RuntimeFilter ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]).
> The first PR will try to push down a bloom filter from the HashJoin node to Parquet's scan node. The proposed basic procedure is as follows:
> 1. The HashJoin build side accumulates the equal-join-condition rows to construct a bloom filter, then sends the bloom filter to the foreman node.
> 2. The foreman node passively accepts the bloom filters from all the fragments that have the HashJoin operator. It then aggregates the bloom filters to form a global bloom filter.
> 3. The foreman node broadcasts the global bloom filter to all the probe-side scan nodes, which may already have sent out partial data to the hash join nodes (currently the hash join node prefetches one batch from both sides).
> 4. The scan node accepts the global bloom filter from the foreman node and uses it to filter the remaining rows.
>
> To implement the above execution flow, the main new notions are described below:
> 1. RuntimeFilter
> A filter container which may contain a BloomFilter or a MinMaxFilter.
> 2. RuntimeFilterReporter
> Wraps the logic to send the hash join's bloom filter to the foreman. The serialized bloom filter will be sent out through the data tunnel. This object will be instantiated by the FragmentExecutor and passed to the FragmentContext, so the HashJoin operator can obtain it through the FragmentContext.
> 3. RuntimeFilterRequestHandler
> Responsible for accepting a SendRuntimeFilterRequest RPC and stripping the actual BloomFilter from the network. It then hands this filter to the WorkerBee's new interface registerRuntimeFilter.
> Another RPC type is BroadcastRuntimeFilterRequest. It registers the accepted global bloom filter with the WorkerBee via the registerRuntimeFilter method, and the filter is then propagated to the FragmentContext, through which the probe-side scan node can fetch the aggregated bloom filter.
> 4. RuntimeFilterManager
> The foreman will instantiate a RuntimeFilterManager. It will indirectly obtain every RuntimeFilter via the WorkerBee. Once all the BloomFilters have been accepted and aggregated, it will broadcast the aggregated bloom filter to all the probe-side scan nodes through the data tunnel via a BroadcastRuntimeFilterRequest RPC.
> 5. RuntimeFilterEnableOption
> A global option will be added to decide whether to enable this new feature.
>
> Welcome suggestions and advice. The related PR will be presented as soon as possible. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
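The build-and-aggregate flow above can be sketched in a few lines of Java. This is an illustrative toy, not Drill's actual BloomFilter or RuntimeFilterManager code (a real bloom filter uses several hash functions per key; this uses one): each build-side fragment inserts its join-key hashes into a local filter, and the foreman merges the local filters with a bitwise OR to form the global filter.

```java
// Illustrative toy only -- NOT Drill's BloomFilter/RuntimeFilterManager classes.
// Each build-side fragment inserts its join-key hashes into a local filter;
// the foreman merges all local filters with a bitwise OR to get the global one.
class SketchBloomFilter {
  final long[] bits;

  SketchBloomFilter(int words) {
    bits = new long[words];
  }

  private int bitIndex(long hash) {
    // Map the hash to one bit position in the filter.
    return (int) ((hash & Long.MAX_VALUE) % (bits.length * 64L));
  }

  void put(long hash) {
    int bit = bitIndex(hash);
    bits[bit >>> 6] |= 1L << (bit & 63);
  }

  boolean mightContain(long hash) {
    int bit = bitIndex(hash);
    return (bits[bit >>> 6] & (1L << (bit & 63))) != 0;
  }

  // Foreman-side aggregation: OR together equally sized local filters.
  void or(SketchBloomFilter other) {
    for (int i = 0; i < bits.length; i++) {
      bits[i] |= other.bits[i];
    }
  }
}
```

The OR-merge is why the aggregation is cheap and order-independent: the global filter accepts anything any local filter accepted, so it can only produce false positives, never false negatives, for probe-side rows.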
[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)
[ https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579168#comment-16579168 ] ASF GitHub Bot commented on DRILL-6385: --- sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD feature URL: https://github.com/apache/drill/pull/1334#discussion_r209803392

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/filter/RuntimeFilterRecordBatch.java

@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.filter;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.ExpressionPosition;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.PathSegment;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ValueVectorReadExpression;
+import org.apache.drill.exec.expr.fn.impl.ValueVectorHashHelper;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.Filter;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.record.selection.SelectionVector2;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+import org.apache.drill.exec.work.filter.BloomFilter;
+import org.apache.drill.exec.work.filter.RuntimeFilterWritable;
+import java.util.ArrayList;
+import java.util.BitSet;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * A RuntimeFilterRecordBatch steps over the ScanBatch. If the ScanBatch participates
+ * in the HashJoinBatch and can be applied by a RuntimeFilter, it will generate a filtered
+ * SV2, otherwise will generate a same recordCount-originalRecordCount SV2 which will not affect
+ * the Query's performance ,but just do a memory transfer by the later RemovingRecordBatch op.
+ */
+public class RuntimeFilterRecordBatch extends AbstractSingleRecordBatch<Filter> {
+  private SelectionVector2 sv2;
+
+  private ValueVectorHashHelper.Hash64 hash64;
+  private Map<String, Integer> field2id = new HashMap<>();
+  private List<String> toFilterFields;
+  private int originalRecordCount;
+  private int recordCount;
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(RuntimeFilterRecordBatch.class);
+
+  public RuntimeFilterRecordBatch(Filter pop, RecordBatch incoming, FragmentContext context) throws OutOfMemoryException {
+    super(pop, context, incoming);
+  }
+
+  @Override
+  public FragmentContext getContext() {
+    return context;
+  }
+
+  @Override
+  public int getRecordCount() {
+    return sv2.getCount();
+  }
+
+  @Override
+  public SelectionVector2 getSelectionVector2() {
+    return sv2;
+  }
+
+  @Override
+  public SelectionVector4 getSelectionVector4() {
+    return null;
+  }
+
+  @Override
+  protected IterOutcome doWork() {
+    container.transferIn(incoming.getContainer());
+    originalRecordCount = incoming.getRecordCount();
+    sv2.setOriginalRecordCount(originalRecordCount);
+    try {
+      applyRuntimeFilter();
+    } catch (SchemaChangeException e) {
+      throw new UnsupportedOperationException(e);
+    }
+    return getFinalOutcome(false);
+  }
+
+  @Override
+  public void close() {
+    if (sv2 != null) {
+      sv2.clear();
+    }
+    super.close();
+  }
+
+  @Override
+  protected boolean setupNewSchema() throws SchemaChangeException {

Review comment: In `setupNewSchema` the output container needs to be prepared properly before `doWork` can be called.
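The SV2 behavior the javadoc above describes can be illustrated with a small standalone sketch. This is not Drill's SelectionVector2 API, only the idea behind it: the incoming batch is left untouched and the filter records the indices of surviving rows. When nothing is filtered the selection is simply 0..n-1 (the recordCount == originalRecordCount case), and the later RemovingRecordBatch just performs the copy.

```java
import java.util.Arrays;
import java.util.function.LongPredicate;

// Illustrative sketch only -- not Drill's SelectionVector2. A selection
// vector stores row indices instead of copying row data, so applying the
// runtime filter is a cheap index pass over the batch.
class Sv2Sketch {
  static int[] buildSelection(long[] rowKeyHashes, LongPredicate mightContain) {
    int[] selection = new int[rowKeyHashes.length];
    int count = 0;
    for (int i = 0; i < rowKeyHashes.length; i++) {
      if (mightContain.test(rowKeyHashes[i])) {
        selection[count++] = i; // row survives; remember its index only
      }
    }
    return Arrays.copyOf(selection, count);
  }
}
```

For example, filtering hashes `{10, 20, 30}` with a predicate that rejects `20` yields the selection `{0, 2}`; downstream operators then read only rows 0 and 2 of the untouched batch.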
[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)
[ https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579162#comment-16579162 ] ASF GitHub Bot commented on DRILL-6385: --- sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD feature URL: https://github.com/apache/drill/pull/1334#discussion_r209718874

## File path: protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java

@@ -585,6 +585,10 @@ private FragmentState(int index, int value) {
 * PARTITION_LIMIT = 54;
 */
 PARTITION_LIMIT(54, 54),
+/**
+ * RUNTIME_FILTER = 55;
+ */
+RUNTIME_FILTER(55, 55)

Review comment: this file is generated after making changes in the `.proto` files. Please add this new operator type [here](https://github.com/apache/drill/blob/master/protocol/src/main/protobuf/UserBitShared.proto#L345) and later generate the .java and .cc files following the instructions [here](https://github.com/apache/drill/blob/master/protocol/readme.txt).
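The reviewer's instruction above amounts to editing the enum in `UserBitShared.proto` and regenerating the Java/C++ sources, roughly as below. The enum name `CoreOperatorType` is an assumption on my part (inferred from the linked line); the quoted generated code only shows the neighboring `PARTITION_LIMIT = 54` value:

```protobuf
enum CoreOperatorType {
  // ... existing operator types ...
  PARTITION_LIMIT = 54;
  RUNTIME_FILTER = 55;  // new operator type; regenerate the .java/.cc files after adding
}
```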
[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)
[ https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579167#comment-16579167 ] ASF GitHub Bot commented on DRILL-6385: --- sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD feature URL: https://github.com/apache/drill/pull/1334#discussion_r209815818

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/filter/RuntimeFilterRecordBatch.java

@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.filter;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.ExpressionPosition;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.PathSegment;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ValueVectorReadExpression;
+import org.apache.drill.exec.expr.fn.impl.ValueVectorHashHelper;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.Filter;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.record.selection.SelectionVector2;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+import org.apache.drill.exec.work.filter.BloomFilter;
+import org.apache.drill.exec.work.filter.RuntimeFilterWritable;
+import java.util.ArrayList;
+import java.util.BitSet;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * A RuntimeFilterRecordBatch steps over the ScanBatch. If the ScanBatch participates
+ * in the HashJoinBatch and can be applied by a RuntimeFilter, it will generate a filtered
+ * SV2, otherwise will generate a same recordCount-originalRecordCount SV2 which will not affect
+ * the Query's performance ,but just do a memory transfer by the later RemovingRecordBatch op.
+ */
+public class RuntimeFilterRecordBatch extends AbstractSingleRecordBatch<Filter> {
+  private SelectionVector2 sv2;
+
+  private ValueVectorHashHelper.Hash64 hash64;
+  private Map<String, Integer> field2id = new HashMap<>();
+  private List<String> toFilterFields;
+  private int originalRecordCount;
+  private int recordCount;
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(RuntimeFilterRecordBatch.class);
+
+  public RuntimeFilterRecordBatch(Filter pop, RecordBatch incoming, FragmentContext context) throws OutOfMemoryException {
+    super(pop, context, incoming);
+  }
+
+  @Override
+  public FragmentContext getContext() {
+    return context;
+  }
+
+  @Override
+  public int getRecordCount() {
+    return sv2.getCount();
+  }
+
+  @Override
+  public SelectionVector2 getSelectionVector2() {
+    return sv2;
+  }
+
+  @Override
+  public SelectionVector4 getSelectionVector4() {
+    return null;
+  }
+
+  @Override
+  protected IterOutcome doWork() {
+    container.transferIn(incoming.getContainer());
+    originalRecordCount = incoming.getRecordCount();
+    sv2.setOriginalRecordCount(originalRecordCount);
+    try {
+      applyRuntimeFilter();
+    } catch (SchemaChangeException e) {
+      throw new UnsupportedOperationException(e);
+    }
+    return getFinalOutcome(false);
+  }
+
+  @Override
+  public void close() {
+    if (sv2 != null) {
+      sv2.clear();
+    }
+    super.close();
+  }
+
+  @Override
+  protected boolean setupNewSchema() throws SchemaChangeException {
+    if (sv2 != null) {
+      sv2.clear();
+    }
+
+    switch (incoming.getSchema().getSelectionVectorMode()) {
+      case NONE:
+        if (sv2 == null) {
+          sv2 = new SelectionVector2(oContext.getAllocator());
+        }
+        break;
+      case TWO_BYTE:
+        sv2 = new SelectionVector2(oContext.getAllocator());
+        break;
+      case FOUR_BYTE:
+
[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)
[ https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579164#comment-16579164 ] ASF GitHub Bot commented on DRILL-6385: --- sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD feature URL: https://github.com/apache/drill/pull/1334#discussion_r209723766

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DefaultSqlHandler.java

@@ -594,15 +595,24 @@ protected Prel convertToPrel(RelNode drel, RelDataType validatedRowType) throws
 */
 phyRelNode = InsertLocalExchangeVisitor.insertLocalExchanges(phyRelNode, queryOptions);
+/*
+ * 8.)
+ * Insert RuntimeFilter over Scan nodes
+ */
+if (context.isRuntimeFilterEnabled()) {
+  phyRelNode = RuntimeFilterPrelVisitor.addRuntimeFilterPrelOverScanPrel(phyRelNode);

Review comment: How will this know to add the `RuntimeFilter` operator downstream of only the probe-side Scan and not the build-side Scan?
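The review question above (how does the visitor know to wrap only the probe-side scan?) comes down to carrying join-side context down the plan tree during the rewrite. The sketch below is purely illustrative, with plain classes rather than Drill's Prel/visitor API, and assumes input 0 of a hash join is the probe side:

```java
import java.util.ArrayList;
import java.util.List;

// Purely illustrative plan-tree model -- not Drill's Prel/visitor API.
// The point: the rewrite must carry "am I under a probe input?" context
// down the tree, wrapping only scans reached through the probe side.
class ProbeSideSketch {
  static abstract class Plan {
    final List<Plan> inputs = new ArrayList<>();
  }
  static class Scan extends Plan { }
  static class RuntimeFilter extends Plan {
    RuntimeFilter(Plan child) { inputs.add(child); }
  }
  static class HashJoin extends Plan {
    HashJoin(Plan probe, Plan build) { inputs.add(probe); inputs.add(build); }
  }

  // Assumption for this sketch: input 0 of a hash join is the probe side.
  static Plan rewrite(Plan node, boolean onProbeSide) {
    if (node instanceof Scan) {
      // Wrap only scans reached through a probe input.
      return onProbeSide ? new RuntimeFilter(node) : node;
    }
    for (int i = 0; i < node.inputs.size(); i++) {
      boolean childOnProbe = (node instanceof HashJoin) ? (i == 0) : onProbeSide;
      node.inputs.set(i, rewrite(node.inputs.get(i), childOnProbe));
    }
    return node;
  }
}
```

A visitor that inspects only the current node (as the quoted `RuntimeFilterPrelVisitor` does) has no such context, which is exactly the gap the reviewer points out.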
[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)
[ https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579166#comment-16579166 ] ASF GitHub Bot commented on DRILL-6385: --- sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD feature URL: https://github.com/apache/drill/pull/1334#discussion_r209791450

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/visitor/RuntimeFilterPrelVisitor.java

@@ -0,0 +1,38 @@
+package org.apache.drill.exec.planner.physical.visitor;
+
+import com.google.common.collect.Lists;
+import org.apache.calcite.rel.RelNode;
+import org.apache.drill.exec.planner.physical.Prel;
+import org.apache.drill.exec.planner.physical.RuntimeFilterPrel;
+import org.apache.drill.exec.planner.physical.ScanPrel;
+
+import java.util.List;
+
+/**
+ * Generate a RuntimeFilterPrel over all the ScanPrel.
+ */
+public class RuntimeFilterPrelVisitor extends BasePrelVisitor<Prel, Void, RuntimeException> {
+
+  private static RuntimeFilterPrelVisitor INSTANCE = new RuntimeFilterPrelVisitor();
+
+  public static Prel addRuntimeFilterPrelOverScanPrel(Prel prel) {
+    return prel.accept(INSTANCE, null);
+  }
+
+  @Override
+  public Prel visitPrel(Prel prel, Void value) throws RuntimeException {
+    if (prel instanceof ScanPrel) {
+      List<RelNode> children = Lists.newArrayList();
+      RuntimeFilterPrel runtimeFilterPrel = new RuntimeFilterPrel(prel);
+      children.add(runtimeFilterPrel);
+      return (Prel) prel.copy(prel.getTraitSet(), children);

Review comment: This will call `copy` on `ScanPrel`, which ignores the children, probably because `Scan` is a leaf node and there won't be any child. I think you should just return `runtimeFilterPrel` instead?
Or override the method `visitScan(ScanPrel prel, EXTRA value)` and put the code under the if inside that method, which will be as below:

```
@Override
public Prel visitScan(ScanPrel prel, Void value) throws RuntimeException {
  RuntimeFilterPrel runtimeFilterPrel = new RuntimeFilterPrel(prel);
  return runtimeFilterPrel;
}
```
[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)
[ https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579165#comment-16579165 ] ASF GitHub Bot commented on DRILL-6385: --- sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD feature URL: https://github.com/apache/drill/pull/1334#discussion_r209791940

## File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/RuntimeFilterPrel.java

@@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.planner.physical;
+
+import org.apache.calcite.plan.RelOptCluster;
+import org.apache.calcite.plan.RelTraitSet;
+import org.apache.calcite.rel.RelNode;
+import org.apache.drill.exec.physical.base.PhysicalOperator;
+import org.apache.drill.exec.physical.config.RuntimeFilterPOP;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+
+import java.io.IOException;
+import java.util.List;
+
+public class RuntimeFilterPrel extends SinglePrel {
+  static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(RuntimeFilterPrel.class);
+
+  public RuntimeFilterPrel(Prel child) {
+    super(child.getCluster(), child.getTraitSet(), child);
+  }
+
+  public RuntimeFilterPrel(RelOptCluster cluster, RelTraitSet traits, RelNode child) {
+    super(cluster, traits, child);
+  }
+
+  @Override
+  public RelNode copy(RelTraitSet traitSet, List<RelNode> inputs) {
+    return new RuntimeFilterPrel(this.getCluster(), traitSet, inputs.get(0));
+  }
+
+  @Override
+  public PhysicalOperator getPhysicalOperator(PhysicalPlanCreator creator) throws IOException {
+    RuntimeFilterPOP r = new RuntimeFilterPOP(((Prel) getInput()).getPhysicalOperator(creator));
+    return creator.addMetadata(this, r);
+  }
+
+  @Override
+  public SelectionVectorMode getEncoding() {
+    return SelectionVectorMode.TWO_BYTE;
+  }
+
+  @Override
+  public Prel prepareForLateralUnnestPipeline(List<RelNode> children) {
+    return (Prel) this.copy(this.traitSet, children);

Review comment: `prepareForLateralUnnestPipeline` should be the same as the `FilterPrel` implementation [here](https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/FilterPrel.java#L91).
[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)
[ https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579163#comment-16579163 ] ASF GitHub Bot commented on DRILL-6385: --- sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD feature URL: https://github.com/apache/drill/pull/1334#discussion_r209793080
## File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/filter/RuntimeFilterRecordBatch.java
##
@@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.filter;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.ExpressionPosition;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.PathSegment;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ValueVectorReadExpression;
+import org.apache.drill.exec.expr.fn.impl.ValueVectorHashHelper;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.Filter;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.record.selection.SelectionVector2;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+import org.apache.drill.exec.work.filter.BloomFilter;
+import org.apache.drill.exec.work.filter.RuntimeFilterWritable;
+import java.util.ArrayList;
+import java.util.BitSet;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * A RuntimeFilterRecordBatch steps over the ScanBatch. If the ScanBatch participates
+ * in the HashJoinBatch and a RuntimeFilter can be applied to it, this batch generates a
+ * filtered SV2; otherwise it generates a pass-through SV2 whose count equals the original
+ * record count, which does not affect the query's performance but only adds a memory
+ * transfer done by the later RemovingRecordBatch operator.
+ */
+public class RuntimeFilterRecordBatch extends AbstractSingleRecordBatch {
+  private SelectionVector2 sv2;
+
+  private ValueVectorHashHelper.Hash64 hash64;
+  private Map field2id = new HashMap<>();
+  private List toFilterFields;
+  private int originalRecordCount;
+  private int recordCount;
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(RuntimeFilterRecordBatch.class);
+
+  public RuntimeFilterRecordBatch(Filter pop, RecordBatch incoming, FragmentContext context) throws OutOfMemoryException {
+    super(pop, context, incoming);
+  }
+
+  @Override
+  public FragmentContext getContext() {
+    return context;
+  }
+
+  @Override
+  public int getRecordCount() {
+    return sv2.getCount();
+  }
+
+  @Override
+  public SelectionVector2 getSelectionVector2() {
+    return sv2;
+  }
+
+  @Override
+  public SelectionVector4 getSelectionVector4() {
+    return null;
+  }
+
+  @Override
+  protected IterOutcome doWork() {
+    container.transferIn(incoming.getContainer());
+    originalRecordCount = incoming.getRecordCount();
+    sv2.setOriginalRecordCount(originalRecordCount);
+    try {
+      applyRuntimeFilter();
+    } catch (SchemaChangeException e) {
+      throw new UnsupportedOperationException(e);
+    }
+    return getFinalOutcome(false);
+  }
+
+  @Override
+  public void close() {
+    if (sv2 != null) {
+      sv2.clear();
+    }
+    super.close();
+  }
+
+  @Override
+  protected boolean setupNewSchema() throws SchemaChangeException {
+    if (sv2 != null) {
+      sv2.clear();
+    }
+
+    switch (incoming.getSchema().getSelectionVectorMode()) {
+      case NONE:
+        if (sv2 == null) {
+          sv2 = new SelectionVector2(oContext.getAllocator());
+        }
+        break;
+      case TWO_BYTE:
+        sv2 = new SelectionVector2(oContext.getAllocator());
+        break;
+      case FOUR_BYTE:
+
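To make the SV2 mechanism in the Javadoc above concrete, here is a toy, self-contained sketch of how a bloom filter built from join-key hashes can drive a selection vector: surviving row offsets are recorded in an SV2-like index array instead of copying row data. Every name here (`Sv2Sketch`, `bit1`, `bit2`, etc.) is an illustrative stand-in, not Drill's actual `BloomFilter` or `SelectionVector2` API.

```java
import java.util.Arrays;
import java.util.BitSet;

public class Sv2Sketch {

  // Map a 64-bit key hash to two bit positions (a tiny two-function bloom filter).
  private static int bit1(long hash, int numBits) { return (int) Math.floorMod(hash, (long) numBits); }
  private static int bit2(long hash, int numBits) { return (int) Math.floorMod(hash >>> 32, (long) numBits); }

  // Build side: record a join-key hash in the filter.
  public static void add(BitSet bloom, long hash, int numBits) {
    bloom.set(bit1(hash, numBits));
    bloom.set(bit2(hash, numBits));
  }

  // Probe side: may return false positives, never false negatives.
  public static boolean mightContain(BitSet bloom, long hash, int numBits) {
    return bloom.get(bit1(hash, numBits)) && bloom.get(bit2(hash, numBits));
  }

  // Produce an SV2-like array of surviving row offsets; the data vectors stay untouched.
  public static int[] filter(long[] probeHashes, BitSet bloom, int numBits) {
    int[] sv2 = new int[probeHashes.length];
    int count = 0;
    for (int i = 0; i < probeHashes.length; i++) {
      if (mightContain(bloom, probeHashes[i], numBits)) {
        sv2[count++] = i; // keep only the offset of the surviving row
      }
    }
    return Arrays.copyOf(sv2, count);
  }

  public static void main(String[] args) {
    int numBits = 1024;
    BitSet bloom = new BitSet(numBits);
    add(bloom, 42L, numBits);                 // build side saw key hash 42
    int[] sv2 = filter(new long[] {42L, 7L, 42L}, bloom, numBits);
    System.out.println(Arrays.toString(sv2)); // offsets of probe rows that may match
  }
}
```

This mirrors the "pass-through vs. filtered" idea: when no runtime filter arrives, the SV2 simply lists every offset; when one arrives, only offsets passing the bloom test survive, and a downstream copy step compacts the batch.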
[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader
[ https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579152#comment-16579152 ] ASF GitHub Bot commented on DRILL-6676: --- ilooner commented on issue #1429: DRILL-6676: Add Union, List and Repeated List types to Result Set Loader URL: https://github.com/apache/drill/pull/1429#issuecomment-412734987 Thanks for the help @ppadma. Ran the tests. There are a few failures, but those are known issues unrelated to your change, so I think we are all good with respect to the build. > Add Union, List and Repeated List types to Result Set Loader > > > Key: DRILL-6676 > URL: https://issues.apache.org/jira/browse/DRILL-6676 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.15.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Major > Fix For: 1.15.0 > > > Add support for the "obscure" vector types to the {{ResultSetLoader}}: > * Union > * List > * Repeated List -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader
[ https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579097#comment-16579097 ] ASF GitHub Bot commented on DRILL-6676: --- paul-rogers commented on a change in pull request #1429: DRILL-6676: Add Union, List and Repeated List types to Result Set Loader URL: https://github.com/apache/drill/pull/1429#discussion_r209803870
## File path: exec/vector/src/main/java/org/apache/drill/exec/vector/complex/ListVector.java
##
@@ -250,22 +270,140 @@ public void load(UserBitShared.SerializedField metadata, DrillBuf buffer) {
     bits.load(bitMetadata, buffer.slice(offsetLength, bitLength));
     final UserBitShared.SerializedField vectorMetadata = metadata.getChild(2);
-    if (getDataVector() == DEFAULT_DATA_VECTOR) {
+    if (isEmptyType()) {
       addOrGetVector(VectorDescriptor.create(vectorMetadata.getMajorType()));
     }
     final int vectorLength = vectorMetadata.getBufferLength();
     vector.load(vectorMetadata, buffer.slice(offsetLength + bitLength, vectorLength));
   }
+
+  public boolean isEmptyType() {
+    return getDataVector() == DEFAULT_DATA_VECTOR;
+  }
+
+  @Override
+  public void setChildVector(ValueVector childVector) {
+
+    // Unlike the repeated list vector, the (plain) list vector
+    // adds the dummy vector as a child type.
+
+    assert field.getChildren().size() == 1;
+    assert field.getChildren().iterator().next().getType().getMinorType() == MinorType.LATE;
+    field.removeChild(vector.getField());
+
+    super.setChildVector(childVector);
+
+    // Initial LATE type vector not added as a subtype initially.
+    // So, no need to remove it, just add the new subtype. Since the
+    // MajorType is immutable, must build a new one and replace the type
+    // in the materialized field. (We replace the type, rather than creating
+    // a new materialized field, to preserve the link to this field from
+    // a parent map, list or union.)
+
+    assert field.getType().getSubTypeCount() == 0;
+    field.replaceType(
+        field.getType().toBuilder()
+            .addSubType(childVector.getField().getType().getMinorType())
+            .build());
+  }
+
+  /**
+   * Promote the list to a union. Called from old-style writers. This implementation
+   * relies on the caller to set the types vector for any existing values.
+   * This method simply clears the existing vector.
+   *
+   * @return the new union vector
+   */
   public UnionVector promoteToUnion() {
-    MaterializedField newField = MaterializedField.create(getField().getName(), Types.optional(MinorType.UNION));
-    UnionVector vector = new UnionVector(newField, allocator, null);
+    UnionVector vector = createUnion();
+
+    // Replace the current vector, clearing its data. (This is the
+    // old behavior.)
+
     replaceDataVector(vector);
     reader = new UnionListReader(this);
     return vector;
   }
+
+  /**
+   * Revised form of promote to union that correctly fixes up the list
+   * field metadata to match the new union type.
+   *
+   * @return the new union vector
+   */
+  public UnionVector promoteToUnion2() {
+    UnionVector unionVector = createUnion();
+
+    // Replace the current vector, clearing its data. (This is the
+    // old behavior.)
+
+    setChildVector(unionVector);
+    return unionVector;
+  }
+
+  /**
+   * Promote to a union, preserving the existing data vector as a member of
+   * the new union. Back-fill the types vector with the proper type value
+   * for existing rows.
+   *
+   * @return the new union vector
+   */
+  public UnionVector convertToUnion(int allocValueCount, int valueCount) {
+    assert allocValueCount >= valueCount;
+    UnionVector unionVector = createUnion();
+    unionVector.allocateNew(allocValueCount);
+
+    // Preserve the current vector (and its data) if it is other than
+    // the default. (New behavior used by column writers.)
+
+    if (!isEmptyType()) {
+      unionVector.addType(vector);
+      int prevType = vector.getField().getType().getMinorType().getNumber();
+      UInt1Vector.Mutator typeMutator = unionVector.getTypeVector().getMutator();
+
+      // If the previous vector was nullable, then promote the nullable state
+      // to the type vector by setting either the null marker or the type
+      // marker depending on the original nullable values.
+
+      if (vector instanceof NullableVector) {
+        UInt1Vector.Accessor bitsAccessor =
+            ((UInt1Vector) ((NullableVector) vector).getBitsVector()).getAccessor();
+        for (int i = 0; i < valueCount; i++) {
+          typeMutator.setSafe(i, (bitsAccessor.get(i) == 0)
+              ? UnionVector.NULL_MARKER
+              : prevType);
+        }
+      } else {
+
+        // The value is not nullable. (Perhaps it is a map.)
+        // Note that the original design of lists have a flaw: if the sole member
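The nullable back-fill loop in `convertToUnion()` above can be modeled in isolation. This is an illustrative sketch only: the `bits` array stands in for a nullable vector's bits vector (1 = value present, 0 = null), and `NULL_MARKER` here is a hypothetical constant, not necessarily the value of Drill's `UnionVector.NULL_MARKER`.

```java
public class TypeBackfillSketch {
  public static final int NULL_MARKER = 0; // hypothetical stand-in for UnionVector.NULL_MARKER

  // For each existing row, write either the null marker (row was null) or the
  // promoted vector's original minor-type number into the union's type vector.
  public static int[] backfill(int[] bits, int prevType) {
    int[] typeVector = new int[bits.length];
    for (int i = 0; i < bits.length; i++) {
      typeVector[i] = (bits[i] == 0) ? NULL_MARKER : prevType;
    }
    return typeVector;
  }
}
```

The point of the back-fill is that rows written before the promotion never touched a type vector, so the union must reconstruct, per row, whether the value was null or of the old type.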
[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader
[ https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579096#comment-16579096 ] ASF GitHub Bot commented on DRILL-6676: --- paul-rogers commented on a change in pull request #1429: DRILL-6676: Add Union, List and Repeated List types to Result Set Loader URL: https://github.com/apache/drill/pull/1429#discussion_r209803508 ## File path: exec/vector/src/main/java/org/apache/drill/exec/vector/complex/ListVector.java ##
[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader
[ https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579088#comment-16579088 ] ASF GitHub Bot commented on DRILL-6676: --- paul-rogers commented on issue #1429: DRILL-6676: Add Union, List and Repeated List types to Result Set Loader URL: https://github.com/apache/drill/pull/1429#issuecomment-412715369 @ilooner, thanks much for taking a look at this PR. Agree it would be great for @ppadma to review this, as she's been following along with the earlier work in this series. The code here can be overwhelming if you dive right in. The best way to start is to look at the tests to see how the code is intended to be used. The API is intended to be simple; the large bulk of code is needed to create that simple API by hiding the highly complex work that must be done to keep multiple vectors in sync, with ever-increasing complexity as you move from fixed-width to nullable to variable-width to arrays to maps to unions to lists. Rebased on the latest master. Fixed checkstyle issues from the latest rules.
[jira] [Commented] (DRILL-6683) move getSelectionVector2 and getSelectionVector4 from VectorAccessible interface to RecordBatch interface
[ https://issues.apache.org/jira/browse/DRILL-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579069#comment-16579069 ] Paul Rogers commented on DRILL-6683: While this seems a good idea, it does get to the core of the design of the {{VectorContainer}} vs. {{RecordBatch}} abstractions. Despite its name, {{RecordBatch}} is an *operator*, not a batch of data. A {{RecordBatch}} (operator) has an associated output batch of data (a record batch, but not a {{RecordBatch}}) represented by a {{VectorContainer}}. Metadata for that container is described by {{BatchSchema}}, which is stored in the {{VectorContainer}}. Since a full record batch is defined by a set of vectors *and* its associated selection vector, it seems odd to disassociate them. Rather than remove the methods from {{VectorContainer}}, a better longer-term change would be to move the selection vector into the {{VectorContainer}}. Today, it is an odd add-on maintained by the operator (the so-called {{RecordBatch}}), not by the record batch (the so-called {{VectorContainer}}). As you've seen in the {{RowSet}} classes, a {{RowSet}} is the logical equivalent of (actually a wrapper for) both a {{VectorContainer}} and a selection vector. Also, the newer work to come that builds on the result set loader splits the operator interface into three responsibilities:
* Operator
* Outgoing batch
* Iterator protocol driver
In this world, a {{RowSet}} (or the result set loader equivalent for reading) would represent the outgoing batch, and the operator would handle the work of transforming batches. So, a long comment, because the design in this area needs work (which this bug suggests), but the fixes are subtle.
> move getSelectionVector2 and getSelectionVector4 from VectorAccessible > interface to RecordBatch interface > - > > Key: DRILL-6683 > URL: https://issues.apache.org/jira/browse/DRILL-6683 > Project: Apache Drill > Issue Type: Improvement >Reporter: Timothy Farkas >Priority: Major >
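The direction sketched in the comment above, making the data batch own its selection vector rather than the operator, could look roughly like this. These are illustrative types only, not Drill's real `VectorContainer` or `SelectionVector2`:

```java
// A batch of data that is self-describing: it knows whether a selection
// vector is attached and reports its logical record count accordingly.
class SelectionVector2Sketch {
  private final int[] offsets;
  SelectionVector2Sketch(int[] offsets) { this.offsets = offsets; }
  int getCount() { return offsets.length; }
  int getIndex(int i) { return offsets[i]; }
}

class VectorContainerSketch {
  private final int physicalRowCount;  // rows actually stored in the value vectors
  private SelectionVector2Sketch sv2;  // owned by the container, not by the operator

  VectorContainerSketch(int physicalRowCount) { this.physicalRowCount = physicalRowCount; }

  void setSelectionVector(SelectionVector2Sketch sv2) { this.sv2 = sv2; }

  // Logical record count: the selection vector's count when one is attached,
  // otherwise the physical row count.
  int getRecordCount() {
    return sv2 == null ? physicalRowCount : sv2.getCount();
  }
}
```

With this shape, `getSelectionVector2()` naturally lives with the batch of data, and an operator interface only needs to hand out the container, which matches the comment's observation that a full record batch is the vectors *plus* their selection vector.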
[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader
[ https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579066#comment-16579066 ] ASF GitHub Bot commented on DRILL-6676: --- ppadma commented on issue #1429: DRILL-6676: Add Union, List and Repeated List types to Result Set Loader URL: https://github.com/apache/drill/pull/1429#issuecomment-412711621 @paul-rogers @ilooner I can review this PR. It will take some time; I will post comments as I make progress. Meanwhile, @ilooner, can you run the functional tests?
[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader
[ https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579045#comment-16579045 ] ASF GitHub Bot commented on DRILL-6676: --- ilooner commented on issue #1429: DRILL-6676: Add Union, List and Repeated List types to Result Set Loader URL: https://github.com/apache/drill/pull/1429#issuecomment-412703438 @paul-rogers I added some more checkstyle checks recently, and they fail when the PR is rebased onto the latest master. Could you rebase and fix the few checkstyle errors?
[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader
[ https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579041#comment-16579041 ] ASF GitHub Bot commented on DRILL-6676: --- ilooner commented on a change in pull request #1429: DRILL-6676: Add Union, List and Repeated List types to Result Set Loader URL: https://github.com/apache/drill/pull/1429#discussion_r209791763 ## File path: exec/vector/src/main/java/org/apache/drill/exec/vector/complex/ListVector.java ##
[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader
[ https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579040#comment-16579040 ] ASF GitHub Bot commented on DRILL-6676: --- ilooner commented on a change in pull request #1429: DRILL-6676: Add Union, List and Repeated List types to Result Set Loader URL: https://github.com/apache/drill/pull/1429#discussion_r209791651 ## File path: exec/vector/src/main/java/org/apache/drill/exec/vector/complex/ListVector.java ##
[jira] [Updated] (DRILL-6685) Error in parquet record reader
[ https://issues.apache.org/jira/browse/DRILL-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Hou updated DRILL-6685: -- Attachment: drillbit.log.6685 > Error in parquet record reader > -- > > Key: DRILL-6685 > URL: https://issues.apache.org/jira/browse/DRILL-6685 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.14.0 >Reporter: Robert Hou >Assignee: salim achouche >Priority: Major > Fix For: 1.15.0 > > Attachments: drillbit.log.6685 > > > This is the query: > select VarbinaryValue1 from > dfs.`/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB.parquet` limit > 36; > It appears to be caused by this commit: > DRILL-6570: Fixed IndexOutofBoundException in Parquet Reader > aee899c1b26ebb9a5781d280d5a73b42c273d4d5 > This is the stack trace: > {noformat} > Error: INTERNAL_ERROR ERROR: Error in parquet record reader. > Message: > Hadoop path: > /drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB.parquet/0_0_0.parquet > Total records read: 0 > Row group index: 0 > Records in row group: 1250 > Parquet Metadata: ParquetMetaData{FileMetaData{schema: message root { > optional int64 Index; > optional binary VarbinaryValue1; > optional int64 BigIntValue; > optional boolean BooleanValue; > optional int32 DateValue (DATE); > optional float FloatValue; > optional binary VarcharValue1 (UTF8); > optional double DoubleValue; > optional int32 IntegerValue; > optional int32 TimeValue (TIME_MILLIS); > optional int64 TimestampValue (TIMESTAMP_MILLIS); > optional binary VarbinaryValue2; > optional fixed_len_byte_array(12) IntervalYearValue (INTERVAL); > optional fixed_len_byte_array(12) IntervalDayValue (INTERVAL); > optional fixed_len_byte_array(12) IntervalSecondValue (INTERVAL); > optional binary VarcharValue2 (UTF8); > } > , metadata: {drill-writer.version=2, drill.version=1.14.0-SNAPSHOT}}, blocks: > [BlockMetaData{1250, 23750308 [ColumnMetaData{UNCOMPRESSED [Index] optional > int64 Index [PLAIN, 
RLE, BIT_PACKED], 4}, ColumnMetaData{UNCOMPRESSED > [VarbinaryValue1] optional binary VarbinaryValue1 [PLAIN, RLE, BIT_PACKED], > 10057}, ColumnMetaData{UNCOMPRESSED [BigIntValue] optional int64 BigIntValue > [PLAIN, RLE, BIT_PACKED], 8174655}, ColumnMetaData{UNCOMPRESSED > [BooleanValue] optional boolean BooleanValue [PLAIN, RLE, BIT_PACKED], > 8179722}, ColumnMetaData{UNCOMPRESSED [DateValue] optional int32 DateValue > (DATE) [PLAIN, RLE, BIT_PACKED], 8179916}, ColumnMetaData{UNCOMPRESSED > [FloatValue] optional float FloatValue [PLAIN, RLE, BIT_PACKED], 8184959}, > ColumnMetaData{UNCOMPRESSED [VarcharValue1] optional binary VarcharValue1 > (UTF8) [PLAIN, RLE, BIT_PACKED], 8190002}, ColumnMetaData{UNCOMPRESSED > [DoubleValue] optional double DoubleValue [PLAIN, RLE, BIT_PACKED], > 10230058}, ColumnMetaData{UNCOMPRESSED [IntegerValue] optional int32 > IntegerValue [PLAIN, RLE, BIT_PACKED], 10240111}, > ColumnMetaData{UNCOMPRESSED [TimeValue] optional int32 TimeValue > (TIME_MILLIS) [PLAIN, RLE, BIT_PACKED], 10245154}, > ColumnMetaData{UNCOMPRESSED [TimestampValue] optional int64 TimestampValue > (TIMESTAMP_MILLIS) [PLAIN, RLE, BIT_PACKED], 10250197}, > ColumnMetaData{UNCOMPRESSED [VarbinaryValue2] optional binary VarbinaryValue2 > [PLAIN, RLE, BIT_PACKED], 10260250}, ColumnMetaData{UNCOMPRESSED > [IntervalYearValue] optional fixed_len_byte_array(12) IntervalYearValue > (INTERVAL) [PLAIN, RLE, BIT_PACKED], 19632385}, ColumnMetaData{UNCOMPRESSED > [IntervalDayValue] optional fixed_len_byte_array(12) IntervalDayValue > (INTERVAL) [PLAIN, RLE, BIT_PACKED], 19647446}, ColumnMetaData{UNCOMPRESSED > [IntervalSecondValue] optional fixed_len_byte_array(12) IntervalSecondValue > (INTERVAL) [PLAIN, RLE, BIT_PACKED], 19662507}, ColumnMetaData{UNCOMPRESSED > [VarcharValue2] optional binary VarcharValue2 (UTF8) [PLAIN, RLE, > BIT_PACKED], 19677568}]}]} > Fragment 0:0 > [Error Id: 25852cdb-3217-4041-9743-66e9f3a2fbe4 on qa-node186.qa.lab:31010] > (state=,code=0) > {noformat} > 
Table can be found in 10.10.100.186:/tmp/fourvarchar_asc_nulls_16MB.parquet > sys.version is: > 1.15.0-SNAPSHOT a05f17d6fcd80f0d21260d3b1074ab895f457bacChanged > PROJECT_OUTPUT_BATCH_SIZE to System + Session 30.07.2018 @ 17:12:53 PDT > r...@mapr.com 30.07.2018 @ 17:25:21 PDT^M > fourvarchar_asc_nulls70.q -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6685) Error in parquet record reader
[ https://issues.apache.org/jira/browse/DRILL-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Hou updated DRILL-6685: -- Description: This is the query: select VarbinaryValue1 from dfs.`/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB.parquet` limit 36; It appears to be caused by this commit: DRILL-6570: Fixed IndexOutofBoundException in Parquet Reader aee899c1b26ebb9a5781d280d5a73b42c273d4d5 This is the stack trace: {noformat} Error: INTERNAL_ERROR ERROR: Error in parquet record reader. Message: Hadoop path: /drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB.parquet/0_0_0.parquet Total records read: 0 Row group index: 0 Records in row group: 1250 Parquet Metadata: ParquetMetaData{FileMetaData{schema: message root { optional int64 Index; optional binary VarbinaryValue1; optional int64 BigIntValue; optional boolean BooleanValue; optional int32 DateValue (DATE); optional float FloatValue; optional binary VarcharValue1 (UTF8); optional double DoubleValue; optional int32 IntegerValue; optional int32 TimeValue (TIME_MILLIS); optional int64 TimestampValue (TIMESTAMP_MILLIS); optional binary VarbinaryValue2; optional fixed_len_byte_array(12) IntervalYearValue (INTERVAL); optional fixed_len_byte_array(12) IntervalDayValue (INTERVAL); optional fixed_len_byte_array(12) IntervalSecondValue (INTERVAL); optional binary VarcharValue2 (UTF8); } , metadata: {drill-writer.version=2, drill.version=1.14.0-SNAPSHOT}}, blocks: [BlockMetaData{1250, 23750308 [ColumnMetaData{UNCOMPRESSED [Index] optional int64 Index [PLAIN, RLE, BIT_PACKED], 4}, ColumnMetaData{UNCOMPRESSED [VarbinaryValue1] optional binary VarbinaryValue1 [PLAIN, RLE, BIT_PACKED], 10057}, ColumnMetaData{UNCOMPRESSED [BigIntValue] optional int64 BigIntValue [PLAIN, RLE, BIT_PACKED], 8174655}, ColumnMetaData{UNCOMPRESSED [BooleanValue] optional boolean BooleanValue [PLAIN, RLE, BIT_PACKED], 8179722}, ColumnMetaData{UNCOMPRESSED [DateValue] optional int32 DateValue (DATE) [PLAIN, 
RLE, BIT_PACKED], 8179916}, ColumnMetaData{UNCOMPRESSED [FloatValue] optional float FloatValue [PLAIN, RLE, BIT_PACKED], 8184959}, ColumnMetaData{UNCOMPRESSED [VarcharValue1] optional binary VarcharValue1 (UTF8) [PLAIN, RLE, BIT_PACKED], 8190002}, ColumnMetaData{UNCOMPRESSED [DoubleValue] optional double DoubleValue [PLAIN, RLE, BIT_PACKED], 10230058}, ColumnMetaData{UNCOMPRESSED [IntegerValue] optional int32 IntegerValue [PLAIN, RLE, BIT_PACKED], 10240111}, ColumnMetaData{UNCOMPRESSED [TimeValue] optional int32 TimeValue (TIME_MILLIS) [PLAIN, RLE, BIT_PACKED], 10245154}, ColumnMetaData{UNCOMPRESSED [TimestampValue] optional int64 TimestampValue (TIMESTAMP_MILLIS) [PLAIN, RLE, BIT_PACKED], 10250197}, ColumnMetaData{UNCOMPRESSED [VarbinaryValue2] optional binary VarbinaryValue2 [PLAIN, RLE, BIT_PACKED], 10260250}, ColumnMetaData{UNCOMPRESSED [IntervalYearValue] optional fixed_len_byte_array(12) IntervalYearValue (INTERVAL) [PLAIN, RLE, BIT_PACKED], 19632385}, ColumnMetaData{UNCOMPRESSED [IntervalDayValue] optional fixed_len_byte_array(12) IntervalDayValue (INTERVAL) [PLAIN, RLE, BIT_PACKED], 19647446}, ColumnMetaData{UNCOMPRESSED [IntervalSecondValue] optional fixed_len_byte_array(12) IntervalSecondValue (INTERVAL) [PLAIN, RLE, BIT_PACKED], 19662507}, ColumnMetaData{UNCOMPRESSED [VarcharValue2] optional binary VarcharValue2 (UTF8) [PLAIN, RLE, BIT_PACKED], 19677568}]}]} Fragment 0:0 [Error Id: 25852cdb-3217-4041-9743-66e9f3a2fbe4 on qa-node186.qa.lab:31010] (state=,code=0) {noformat} Table can be found in 10.10.100.186:/tmp/fourvarchar_asc_nulls_16MB.parquet sys.version is: 1.15.0-SNAPSHOT a05f17d6fcd80f0d21260d3b1074ab895f457bacChanged PROJECT_OUTPUT_BATCH_SIZE to System + Session 30.07.2018 @ 17:12:53 PDT r...@mapr.com 30.07.2018 @ 17:25:21 PDT^M fourvarchar_asc_nulls70.q was: This is the query: alter session set `drill.exec.memory.operator.project.output_batch_size` = 131072; alter session set `planner.width.max_per_node` = 1; alter session set 
`planner.width.max_per_query` = 1; select * from ( select BigIntValue, BooleanValue, DateValue, FloatValue, DoubleValue, IntegerValue, TimeValue, TimestampValue, IntervalYearValue, IntervalDayValue, IntervalSecondValue, VarbinaryValue1, VarcharValue1, VarbinaryValue2, VarcharValue2 from (select * from dfs.`/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet` order by IntegerValue)) d where d.VarcharValue1 = 'Fq'; It appears to be caused by this commit: DRILL-6570: Fixed IndexOutofBoundException in Parquet Reader aee899c1b26ebb9a5781d280d5a73b42c273d4d5 This is the stack trace: {noformat} oadd.org.apache.drill.common.exceptions.UserRemoteException: INTERNAL_ERROR ERROR: Error in parquet record reader.^M Message: ^M Hadoop path: /drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet/0_0_0.parquet^M Total records read:
[jira] [Updated] (DRILL-6685) Error in parquet record reader
[ https://issues.apache.org/jira/browse/DRILL-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Robert Hou updated DRILL-6685: -- Description: This is the query: alter session set `drill.exec.memory.operator.project.output_batch_size` = 131072; alter session set `planner.width.max_per_node` = 1; alter session set `planner.width.max_per_query` = 1; select * from ( select BigIntValue, BooleanValue, DateValue, FloatValue, DoubleValue, IntegerValue, TimeValue, TimestampValue, IntervalYearValue, IntervalDayValue, IntervalSecondValue, VarbinaryValue1, VarcharValue1, VarbinaryValue2, VarcharValue2 from (select * from dfs.`/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet` order by IntegerValue)) d where d.VarcharValue1 = 'Fq'; It appears to be caused by this commit: DRILL-6570: Fixed IndexOutofBoundException in Parquet Reader aee899c1b26ebb9a5781d280d5a73b42c273d4d5 This is the stack trace: {noformat} oadd.org.apache.drill.common.exceptions.UserRemoteException: INTERNAL_ERROR ERROR: Error in parquet record reader.^M Message: ^M Hadoop path: /drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet/0_0_0.parquet^M Total records read: 0^M Row group index: 0^M Records in row group: 14565^M Parquet Metadata: ParquetMetaData{FileMetaData{schema: message root {^M optional int64 Index;^M optional binary VarbinaryValue1;^M optional int64 BigIntValue;^M optional boolean BooleanValue;^M optional int32 DateValue (DATE);^M optional float FloatValue;^M optional binary VarcharValue1 (UTF8);^M optional double DoubleValue;^M optional int32 IntegerValue;^M optional int32 TimeValue (TIME_MILLIS);^M optional int64 TimestampValue (TIMESTAMP_MILLIS);^M optional binary VarbinaryValue2;^M optional fixed_len_byte_array(12) IntervalYearValue (INTERVAL);^M optional fixed_len_byte_array(12) IntervalDayValue (INTERVAL);^M optional fixed_len_byte_array(12) IntervalSecondValue (INTERVAL);^M optional binary VarcharValue2 (UTF8);^M }^M , metadata: 
{drill-writer.version=2, drill.version=1.14.0-SNAPSHOT}}, blocks: [BlockMetaData{14565, 268477520 [ColumnMetaData{UNCOMPRESSED [Index] optional int64 Index [PLAIN, RLE, BIT_PACKED], 4}, ColumnMetaData{UNCOMPRESSED [VarbinaryValue1] optional binary VarbinaryValue1 [PLAIN, RLE, BIT_PACKED], 116579}, ColumnMetaData{UNCOMPRESSED [BigIntValue] optional int64 BigIntValue [PLAIN, RLE, BIT_PACKED], 91098467}, ColumnMetaData{UNCOMPRESSED [BooleanValue] optional boolean BooleanValue [PLAIN, RLE, BIT_PACKED], 91155431}, ColumnMetaData{UNCOMPRESSED [DateValue] optional int32 DateValue (DATE) [PLAIN, RLE, BIT_PACKED], 91157291}, ColumnMetaData{UNCOMPRESSED [FloatValue] optional float FloatValue [PLAIN, RLE, BIT_PACKED], 91215598}, ColumnMetaData{UNCOMPRESSED [VarcharValue1] optional binary VarcharValue1 (UTF8) [PLAIN, RLE, BIT_PACKED], 91273905}, ColumnMetaData{UNCOMPRESSED [DoubleValue] optional double DoubleValue [PLAIN, RLE, BIT_PACKED], 114039039}, ColumnMetaData{UNCOMPRESSED [IntegerValue] optional int32 IntegerValue [PLAIN, RLE, BIT_PACKED], 114155614}, ColumnMetaData{UNCOMPRESSED [TimeValue] optional int32 TimeValue (TIME_MILLIS) [PLAIN, RLE, BIT_PACKED], 114213921}, ColumnMetaData{UNCOMPRESSED [TimestampValue] optional int64 TimestampValue (TIMESTAMP_MILLIS) [PLAIN, RLE, BIT_PACKED], 114272228}, ColumnMetaData{UNCOMPRESSED [VarbinaryValue2] optional binary VarbinaryValue2 [PLAIN, RLE, BIT_PACKED], 114388803}, ColumnMetaData{UNCOMPRESSED [IntervalYearValue] optional fixed_len_byte_array(12) IntervalYearValue (INTERVAL) [PLAIN, RLE, BIT_PACKED], 222455665}, ColumnMetaData{UNCOMPRESSED [IntervalDayValue] optional fixed_len_byte_array(12) IntervalDayValue (INTERVAL) [PLAIN, RLE, BIT_PACKED], 222630508}, ColumnMetaData{UNCOMPRESSED [IntervalSecondValue] optional fixed_len_byte_array(12) IntervalSecondValue (INTERVAL) [PLAIN, RLE, BIT_PACKED], 222805351}, ColumnMetaData{UNCOMPRESSED [VarcharValue2] optional binary VarcharValue2 (UTF8) [PLAIN, RLE, BIT_PACKED], 
222980194}]}]}^M ^M Fragment 0:0^M ^M [Error Id: c6690ea1-2f28-4fbe-969f-d8b90da488fb on qa-node186.qa.lab:31010]^M ^M (org.apache.drill.common.exceptions.DrillRuntimeException) Error in parquet record reader.^M Message: ^M Hadoop path: /drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet/0_0_0.parquet^M Total records read: 0^M Row group index: 0^M Records in row group: 14565^M Parquet Metadata: ParquetMetaData{FileMetaData{schema: message root {^M optional int64 Index;^M optional binary VarbinaryValue1;^M optional int64 BigIntValue;^M optional boolean BooleanValue;^M optional int32 DateValue (DATE);^M optional float FloatValue;^M optional binary VarcharValue1 (UTF8);^M optional double DoubleValue;^M optional int32 IntegerValue;^M optional int32 TimeValue (TIME_MILLIS);^M optional int64 TimestampValue (TIMESTAMP_MILLIS);^M optional binary
[jira] [Created] (DRILL-6685) Error in parquet record reader
Robert Hou created DRILL-6685: - Summary: Error in parquet record reader Key: DRILL-6685 URL: https://issues.apache.org/jira/browse/DRILL-6685 Project: Apache Drill Issue Type: Bug Components: Storage - Parquet Affects Versions: 1.14.0 Reporter: Robert Hou Assignee: salim achouche Fix For: 1.15.0 This is the query: alter session set `drill.exec.memory.operator.project.output_batch_size` = 131072; alter session set `planner.width.max_per_node` = 1; alter session set `planner.width.max_per_query` = 1; select * from ( select BigIntValue, BooleanValue, DateValue, FloatValue, DoubleValue, IntegerValue, TimeValue, TimestampValue, IntervalYearValue, IntervalDayValue, IntervalSecondValue, VarbinaryValue1, VarcharValue1, VarbinaryValue2, VarcharValue2 from (select * from dfs.`/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet` order by IntegerValue)) d where d.VarcharValue1 = 'Fq'; It appears to be caused by this commit: DRILL-6570: Fixed IndexOutofBoundException in Parquet Reader aee899c1b26ebb9a5781d280d5a73b42c273d4d5 This is the stack trace: {noformat} oadd.org.apache.drill.common.exceptions.UserRemoteException: INTERNAL_ERROR ERROR: Error in parquet record reader.^M Message: ^M Hadoop path: /drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet/0_0_0.parquet^M Total records read: 0^M Row group index: 0^M Records in row group: 14565^M Parquet Metadata: ParquetMetaData{FileMetaData{schema: message root {^M optional int64 Index;^M optional binary VarbinaryValue1;^M optional int64 BigIntValue;^M optional boolean BooleanValue;^M optional int32 DateValue (DATE);^M optional float FloatValue;^M optional binary VarcharValue1 (UTF8);^M optional double DoubleValue;^M optional int32 IntegerValue;^M optional int32 TimeValue (TIME_MILLIS);^M optional int64 TimestampValue (TIMESTAMP_MILLIS);^M optional binary VarbinaryValue2;^M optional fixed_len_byte_array(12) IntervalYearValue (INTERVAL);^M optional fixed_len_byte_array(12) IntervalDayValue 
(INTERVAL);^M optional fixed_len_byte_array(12) IntervalSecondValue (INTERVAL);^M optional binary VarcharValue2 (UTF8);^M }^M , metadata: {drill-writer.version=2, drill.version=1.14.0-SNAPSHOT}}, blocks: [BlockMetaData{14565, 268477520 [ColumnMetaData{UNCOMPRESSED [Index] optional int64 Index [PLAIN, RLE, BIT_PACKED], 4}, ColumnMetaData{UNCOMPRESSED [VarbinaryValue1] optional binary VarbinaryValue1 [PLAIN, RLE, BIT_PACKED], 116579}, ColumnMetaData{UNCOMPRESSED [BigIntValue] optional int64 BigIntValue [PLAIN, RLE, BIT_PACKED], 91098467}, ColumnMetaData{UNCOMPRESSED [BooleanValue] optional boolean BooleanValue [PLAIN, RLE, BIT_PACKED], 91155431}, ColumnMetaData{UNCOMPRESSED [DateValue] optional int32 DateValue (DATE) [PLAIN, RLE, BIT_PACKED], 91157291}, ColumnMetaData{UNCOMPRESSED [FloatValue] optional float FloatValue [PLAIN, RLE, BIT_PACKED], 91215598}, ColumnMetaData{UNCOMPRESSED [VarcharValue1] optional binary VarcharValue1 (UTF8) [PLAIN, RLE, BIT_PACKED], 91273905}, ColumnMetaData{UNCOMPRESSED [DoubleValue] optional double DoubleValue [PLAIN, RLE, BIT_PACKED], 114039039}, ColumnMetaData{UNCOMPRESSED [IntegerValue] optional int32 IntegerValue [PLAIN, RLE, BIT_PACKED], 114155614}, ColumnMetaData{UNCOMPRESSED [TimeValue] optional int32 TimeValue (TIME_MILLIS) [PLAIN, RLE, BIT_PACKED], 114213921}, ColumnMetaData{UNCOMPRESSED [TimestampValue] optional int64 TimestampValue (TIMESTAMP_MILLIS) [PLAIN, RLE, BIT_PACKED], 114272228}, ColumnMetaData{UNCOMPRESSED [VarbinaryValue2] optional binary VarbinaryValue2 [PLAIN, RLE, BIT_PACKED], 114388803}, ColumnMetaData{UNCOMPRESSED [IntervalYearValue] optional fixed_len_byte_array(12) IntervalYearValue (INTERVAL) [PLAIN, RLE, BIT_PACKED], 222455665}, ColumnMetaData{UNCOMPRESSED [IntervalDayValue] optional fixed_len_byte_array(12) IntervalDayValue (INTERVAL) [PLAIN, RLE, BIT_PACKED], 222630508}, ColumnMetaData{UNCOMPRESSED [IntervalSecondValue] optional fixed_len_byte_array(12) IntervalSecondValue (INTERVAL) [PLAIN, RLE, 
BIT_PACKED], 222805351}, ColumnMetaData{UNCOMPRESSED [VarcharValue2] optional binary VarcharValue2 (UTF8) [PLAIN, RLE, BIT_PACKED], 222980194}]}]}^M ^M Fragment 0:0^M ^M [Error Id: c6690ea1-2f28-4fbe-969f-d8b90da488fb on qa-node186.qa.lab:31010]^M ^M (org.apache.drill.common.exceptions.DrillRuntimeException) Error in parquet record reader.^M Message: ^M Hadoop path: /drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet/0_0_0.parquet^M Total records read: 0^M Row group index: 0^M Records in row group: 14565^M Parquet Metadata: ParquetMetaData{FileMetaData{schema: message root {^M optional int64 Index;^M optional binary VarbinaryValue1;^M optional int64 BigIntValue;^M optional boolean BooleanValue;^M optional int32 DateValue (DATE);^M optiona
[jira] [Updated] (DRILL-3988) Create a sys.functions table to expose available Drill functions
[ https://issues.apache.org/jira/browse/DRILL-3988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kunal Khatua updated DRILL-3988: Fix Version/s: 1.15.0 > Create a sys.functions table to expose available Drill functions > > > Key: DRILL-3988 > URL: https://issues.apache.org/jira/browse/DRILL-3988 > Project: Apache Drill > Issue Type: Sub-task > Components: Metadata >Reporter: Jacques Nadeau >Priority: Major > Labels: newbie > Fix For: 1.15.0 > > > Create a new sys.functions table that returns a list of all available > functions. > Key considerations: > - one row per name or one per argument set. I'm inclined to the latter so people > can use queries to get to the data. > - we need to create a delineation between user functions and internal > functions and only show user functions. 'CastInt' isn't something the user > should be able to see (or run). > - should we add a description annotation that could be included in the > sys.functions table? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader
[ https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579016#comment-16579016 ] ASF GitHub Bot commented on DRILL-6676: --- ilooner commented on issue #1429: DRILL-6676: Add Union, List and Repeated List types to Result Set Loader URL: https://github.com/apache/drill/pull/1429#issuecomment-412693176 Thanks @paul-rogers . I'll run the functional tests and let you know if there are any issues. Since your code is always well documented and tested, I feel comfortable merging this pretty quickly. So maybe we can time-box the review to 2 weeks at most, unless some issue pops up. @sohami @bitblender since you guys have experience doing low level work with the vectors could you help with looking at this PR? The PR is quite big and would take me a while to go through by myself. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Union, List and Repeated List types to Result Set Loader > > > Key: DRILL-6676 > URL: https://issues.apache.org/jira/browse/DRILL-6676 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.15.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Major > Fix For: 1.15.0 > > > Add support for the "obscure" vector types to the {{ResultSetLoader}}: > * Union > * List > * Repeated List -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6211) Optimizations for SelectionVectorRemover
[ https://issues.apache.org/jira/browse/DRILL-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kunal Khatua updated DRILL-6211: Fix Version/s: 1.15.0 > Optimizations for SelectionVectorRemover > - > > Key: DRILL-6211 > URL: https://issues.apache.org/jira/browse/DRILL-6211 > Project: Apache Drill > Issue Type: Bug > Components: Execution - Codegen >Reporter: Kunal Khatua >Assignee: Karthikeyan Manivannan >Priority: Major > Fix For: 1.15.0 > > Attachments: 255d264c-f55e-b343-0bef-49d3e672d93f.sys.drill, > 255d2664-2418-19e0-00ea-2076a06572a2.sys.drill, > 255d2682-8481-bed0-fc22-197a75371c04.sys.drill, > 255d26ae-2c0b-6cd6-ae71-4ad04c992daf.sys.drill, > 255d2880-48a2-d86b-5410-29ce0cd249ed.sys.drill > > > Currently, when a SelectionVectorRemover receives a record batch from an > upstream operator (like a Filter), it immediately starts copying over records > into a new outgoing batch. > It can be worthwhile if the RecordBatch can be enriched with some additional > summary statistics about the attached SelectionVector, such as > # number of records that need to be removed/copied > # total number of records in the record-batch > The benefit of this would be that in extreme cases, if *all* the records in a > batch need to be either truncated or copied, the SelectionVectorRemover can > simply drop the record-batch or simply forward it to the next downstream > operator. > While the extreme cases of simply dropping the batch kind of work (because > there is no overhead in copying), for cases where the record batch should > pass through, the overhead remains (and is actually more than 35% of the > time, if you discount for the streaming agg cost within the tests). 
> Here are the statistics of having such an optimization > ||Selectivity||Query Time||%Time used by SVR||Time||Profile|| > |0%|6.996|0.13%|0.0090948|[^255d264c-f55e-b343-0bef-49d3e672d93f.sys.drill]| > |10%|7.836|7.97%|0.6245292|[^255d2682-8481-bed0-fc22-197a75371c04.sys.drill]| > |50%|11.225|25.59%|2.8724775|[^255d2664-2418-19e0-00ea-2076a06572a2.sys.drill]| > |90%|14.966|33.91%|5.0749706|[^255d26ae-2c0b-6cd6-ae71-4ad04c992daf.sys.drill]| > |100%|19.003|35.73%|6.7897719|[^255d2880-48a2-d86b-5410-29ce0cd249ed.sys.drill]| > To summarize, the SVR should avoid creating new batches as much as possible. > A more generic (non-trivial) optimization should take into account the fact > that multiple batches emitted can be coalesced, but we don't currently have > test metrics for that. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
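The extreme-case decision motivated by the statistics above (drop the batch at 0% selectivity, forward it unchanged at 100%, copy only in between) can be sketched as follows. This is a hypothetical illustration only — the class, enum, and method names are invented and are not Drill's actual SelectionVectorRemover API:

```java
// Hypothetical sketch (not Drill's real API) of the pass-through/drop
// decision: with summary statistics attached to the selection vector,
// the remover can skip the expensive copy in the two extreme cases.
public class SvrDecision {
    enum Action { DROP, PASS_THROUGH, COPY }

    // selectedCount comes from the (proposed) selection-vector summary;
    // totalRecords is the row count of the incoming record batch.
    static Action decide(int selectedCount, int totalRecords) {
        if (selectedCount == 0) {
            return Action.DROP;          // nothing survives the filter
        }
        if (selectedCount == totalRecords) {
            return Action.PASS_THROUGH;  // batch unchanged; forward as-is
        }
        return Action.COPY;              // compact survivors into a new batch
    }

    public static void main(String[] args) {
        System.out.println(decide(0, 1000));     // DROP
        System.out.println(decide(1000, 1000));  // PASS_THROUGH
        System.out.println(decide(350, 1000));   // COPY
    }
}
```

The 100% case is the one the measurements above single out: today the copy still happens even though the output is identical to the input, which is where the ~35% overhead comes from.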
[jira] [Assigned] (DRILL-6084) Expose list of Drill functions for consumption by JavaScript libraries
[ https://issues.apache.org/jira/browse/DRILL-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kunal Khatua reassigned DRILL-6084: --- Assignee: Kunal Khatua > Expose list of Drill functions for consumption by JavaScript libraries > -- > > Key: DRILL-6084 > URL: https://issues.apache.org/jira/browse/DRILL-6084 > Project: Apache Drill > Issue Type: Improvement > Components: Web Server >Reporter: Kunal Khatua >Assignee: Kunal Khatua >Priority: Minor > Labels: javascript > Fix For: 1.15.0 > > > DRILL-5868 introduces SQL highlighter and the auto-complete feature in the > web-UI's query editor. > For the backend JavaScript to highlight the keywords correctly as SQL > reserved words or functions, it needs a list of all the Drill functions to be > already defined in a source JavaScript ( {{../mode-sql.js}} ) . > While this works well for standard Drill functions, it means that any new > function added to Drill needs to be explicitly hard-coded into the file, > rather than be programmatically discovered and loaded for the library to > highlight. > It would help if this can be done dynamically without contributors having to > explicitly alter the script file to add new function names. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6084) Expose list of Drill functions for consumption by JavaScript libraries
[ https://issues.apache.org/jira/browse/DRILL-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kunal Khatua updated DRILL-6084: Fix Version/s: (was: Future) 1.15.0 > Expose list of Drill functions for consumption by JavaScript libraries > -- > > Key: DRILL-6084 > URL: https://issues.apache.org/jira/browse/DRILL-6084 > Project: Apache Drill > Issue Type: Improvement > Components: Web Server >Reporter: Kunal Khatua >Priority: Minor > Labels: javascript > Fix For: 1.15.0 > > > DRILL-5868 introduces SQL highlighter and the auto-complete feature in the > web-UI's query editor. > For the backend JavaScript to highlight the keywords correctly as SQL > reserved words or functions, it needs a list of all the Drill functions to be > already defined in a source JavaScript ( {{../mode-sql.js}} ) . > While this works well for standard Drill functions, it means that any new > function added to Drill needs to be explicitly hard-coded into the file, > rather than be programmatically discovered and loaded for the library to > highlight. > It would help if this can be done dynamically without contributors having to > explicitly alter the script file to add new function names. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-6684) Swap sys.options and sys.options_val tables
Kunal Khatua created DRILL-6684: --- Summary: Swap sys.options and sys.options_val tables Key: DRILL-6684 URL: https://issues.apache.org/jira/browse/DRILL-6684 Project: Apache Drill Issue Type: Improvement Components: Storage - Information Schema Affects Versions: 1.14.0 Reporter: Kunal Khatua Fix For: 1.15.0 The current sys.options table has a verbose layout, because of which sys.options_internal was introduced. The latter will also support descriptions, which means it makes sense to have that table as the new `sys.options`. The recommendation is to deprecate the old format, so that any dependencies continue to make use of it as long as required. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (DRILL-6683) move getSelectionVector2 and getSelectionVector4 from VectorAccessible interface to RecordBatch interface
Timothy Farkas created DRILL-6683: - Summary: move getSelectionVector2 and getSelectionVector4 from VectorAccessible interface to RecordBatch interface Key: DRILL-6683 URL: https://issues.apache.org/jira/browse/DRILL-6683 Project: Apache Drill Issue Type: Improvement Reporter: Timothy Farkas -- This message was sent by Atlassian JIRA (v7.6.3#76005)
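The refactoring proposed in DRILL-6683 can be sketched as below. The interfaces here are simplified, hypothetical stand-ins for Drill's real VectorAccessible and RecordBatch (which carry many more members); the sketch only illustrates where the selection-vector getters would move:

```java
// Simplified stand-ins for Drill's selection-vector classes.
class SelectionVector2 {}
class SelectionVector4 {}

// After the proposed move: VectorAccessible keeps only what every
// vector holder has (e.g. a record count), while the selection-vector
// getters live on RecordBatch, which actually pairs data with a
// selection mode. Plain holders like VectorContainer then no longer
// need to expose getters they cannot meaningfully implement.
interface VectorAccessible {
    int getRecordCount();
}

interface RecordBatch extends VectorAccessible {
    SelectionVector2 getSelectionVector2();  // moved down from VectorAccessible
    SelectionVector4 getSelectionVector4();  // moved down from VectorAccessible
}

// Minimal implementation so the sketch is runnable.
public class InterfaceMoveSketch implements RecordBatch {
    @Override public int getRecordCount() { return 0; }
    @Override public SelectionVector2 getSelectionVector2() { return new SelectionVector2(); }
    @Override public SelectionVector4 getSelectionVector4() { return new SelectionVector4(); }

    public static void main(String[] args) {
        RecordBatch batch = new InterfaceMoveSketch();
        // Only RecordBatch implementations expose selection vectors now.
        System.out.println(batch.getSelectionVector2() != null);
        System.out.println(batch.getRecordCount());
    }
}
```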
[jira] [Commented] (DRILL-6461) Add Basic Data Correctness Unit Tests
[ https://issues.apache.org/jira/browse/DRILL-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578993#comment-16578993 ] ASF GitHub Bot commented on DRILL-6461: --- ilooner commented on a change in pull request #1344: DRILL-6461: Added basic data correctness tests for hash agg, and improved operator unit testing framework. URL: https://github.com/apache/drill/pull/1344#discussion_r209779765 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java ## @@ -42,6 +42,8 @@ private final BufferAllocator allocator; protected final List> wrappers = Lists.newArrayList(); private BatchSchema schema; + private SelectionVector2 selectionVector2; + private SelectionVector4 selectionVector4; Review comment: Cool made https://issues.apache.org/jira/browse/DRILL-6683 to track that change. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Basic Data Correctness Unit Tests > - > > Key: DRILL-6461 > URL: https://issues.apache.org/jira/browse/DRILL-6461 > Project: Apache Drill > Issue Type: Sub-task >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > > There are no data correctness unit tests for HashAgg. We need to add some. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6461) Add Basic Data Correctness Unit Tests
[ https://issues.apache.org/jira/browse/DRILL-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578991#comment-16578991 ] ASF GitHub Bot commented on DRILL-6461: --- ilooner commented on a change in pull request #1344: DRILL-6461: Added basic data correctness tests for hash agg, and improved operator unit testing framework. URL: https://github.com/apache/drill/pull/1344#discussion_r209779483 ## File path: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/RowSetBatch.java ## @@ -29,13 +30,28 @@ import org.apache.drill.exec.record.selection.SelectionVector2; import org.apache.drill.exec.record.selection.SelectionVector4; +import java.util.ArrayList; import java.util.Iterator; +import java.util.List; -public class RowSetBatch implements RecordBatch { - private final RowSet rowSet; +/** + * A mock operator that returns the provided {@link RowSet}s as batches. Currently it's assumed that all the {@link RowSet}s have the same schema. + */ +public class RowSetBatch implements CloseableRecordBatch { Review comment: @sohami It's okay, I've already made the move to MockRecordBatch for this PR. Just doing a final test runs, and debugging straggling issues. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Basic Data Correctness Unit Tests > - > > Key: DRILL-6461 > URL: https://issues.apache.org/jira/browse/DRILL-6461 > Project: Apache Drill > Issue Type: Sub-task >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > > There are no data correctness unit tests for HashAgg. We need to add some. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6461) Add Basic Data Correctness Unit Tests
[ https://issues.apache.org/jira/browse/DRILL-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578988#comment-16578988 ] ASF GitHub Bot commented on DRILL-6461: --- ilooner commented on a change in pull request #1344: DRILL-6461: Added basic data correctness tests for hash agg, and improved operator unit testing framework. URL: https://github.com/apache/drill/pull/1344#discussion_r209779109 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java ## @@ -639,6 +641,8 @@ public ColumnSize getColumn(String name) { // columns can be obtained from children of topColumns. private Map columnSizes = CaseInsensitiveMap.newHashMap(); + private List columnSizesList = new ArrayList<>(); Review comment: done. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Add Basic Data Correctness Unit Tests > - > > Key: DRILL-6461 > URL: https://issues.apache.org/jira/browse/DRILL-6461 > Project: Apache Drill > Issue Type: Sub-task >Reporter: Timothy Farkas >Assignee: Timothy Farkas >Priority: Major > > There are no data correctness unit tests for HashAgg. We need to add some. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6461) Add Basic Data Correctness Unit Tests
[ https://issues.apache.org/jira/browse/DRILL-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578595#comment-16578595 ] ASF GitHub Bot commented on DRILL-6461: --- sohami commented on a change in pull request #1344: DRILL-6461: Added basic data correctness tests for hash agg, and improved operator unit testing framework. URL: https://github.com/apache/drill/pull/1344#discussion_r209677503 ## File path: exec/java-exec/src/test/java/org/apache/drill/test/rowSet/RowSetBatch.java ## @@ -29,13 +30,28 @@ import org.apache.drill.exec.record.selection.SelectionVector2; import org.apache.drill.exec.record.selection.SelectionVector4; +import java.util.ArrayList; import java.util.Iterator; +import java.util.List; -public class RowSetBatch implements RecordBatch { - private final RowSet rowSet; +/** + * A mock operator that returns the provided {@link RowSet}s as batches. Currently it's assumed that all the {@link RowSet}s have the same schema. + */ +public class RowSetBatch implements CloseableRecordBatch { Review comment: If you want you can leave it as is for now. I can take a look on how to refactor MockRecordBatch class along with other tests which uses it, which I have to do anyways. I was also thinking of making it use RowSets instead of container. The reason for creating separate class was to use it specifically for testing different IterOutcomes in context of Lateral&Unnest only whereas RowSetBatch looked more of a very simple implementation to just hold one container. Was not sure if MockRecordBatch will play well with other operators in general. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. 
[jira] [Commented] (DRILL-6461) Add Basic Data Correctness Unit Tests
[ https://issues.apache.org/jira/browse/DRILL-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578560#comment-16578560 ] ASF GitHub Bot commented on DRILL-6461: --- sohami commented on a change in pull request #1344: DRILL-6461: Added basic data correctness tests for hash agg, and improved operator unit testing framework. URL: https://github.com/apache/drill/pull/1344#discussion_r209672777 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java ## @@ -42,6 +42,8 @@ private final BufferAllocator allocator; protected final List<VectorWrapper<?>> wrappers = Lists.newArrayList(); private BatchSchema schema; + private SelectionVector2 selectionVector2; + private SelectionVector4 selectionVector4; Review comment: I agree with that and think it would be a good refactoring to move `getSelectionVector2` and `getSelectionVector4` from the `VectorAccessible` interface to the `RecordBatch` interface. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
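The interface move sohami agrees with above can be sketched as a plain-Java split. This is a hypothetical, simplified sketch, not Drill's real code: the SelectionVector2/SelectionVector4 stand-ins are empty placeholders, and Drill's actual interfaces carry many more methods.

```java
// Hypothetical sketch of the proposed refactoring: selection-vector accessors
// live on RecordBatch rather than on the broader VectorAccessible, so a plain
// VectorContainer would no longer need to implement (or stub out) them.
// The SelectionVector2/4 stand-ins below are placeholders, not Drill's types.
class SelectionVector2 { }
class SelectionVector4 { }

interface VectorAccessible {
  int getRecordCount();
}

// After the split, only record batches expose selection vectors.
interface RecordBatch extends VectorAccessible {
  SelectionVector2 getSelectionVector2();
  SelectionVector4 getSelectionVector4();
}

public class InterfaceSplitSketch implements RecordBatch {
  @Override public int getRecordCount() { return 0; }
  @Override public SelectionVector2 getSelectionVector2() { return new SelectionVector2(); }
  @Override public SelectionVector4 getSelectionVector4() { return new SelectionVector4(); }

  public static void main(String[] args) {
    RecordBatch batch = new InterfaceSplitSketch();
    System.out.println(batch.getSelectionVector2() != null); // prints true
  }
}
```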
[jira] [Updated] (DRILL-6680) Expose SHOW FILES command into INFORMATION_SCHEMA
[ https://issues.apache.org/jira/browse/DRILL-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Volodymyr Vysotskyi updated DRILL-6680: --- Labels: doc-impacting ready-to-commit (was: doc-impacting re) > Expose SHOW FILES command into INFORMATION_SCHEMA > - > > Key: DRILL-6680 > URL: https://issues.apache.org/jira/browse/DRILL-6680 > Project: Apache Drill > Issue Type: New Feature >Affects Versions: 1.14.0 >Reporter: Arina Ielchiieva >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.15.0 > > > Link to design document - > https://docs.google.com/document/d/1UnvATwH4obn1-XsA83xMz3LtylbMu867eBLN9r3eV3k/edit# -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (DRILL-6680) Expose SHOW FILES command into INFORMATION_SCHEMA
[ https://issues.apache.org/jira/browse/DRILL-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Volodymyr Vysotskyi updated DRILL-6680: --- Labels: doc-impacting re (was: doc-impacting) > Expose SHOW FILES command into INFORMATION_SCHEMA > - > > Key: DRILL-6680 > URL: https://issues.apache.org/jira/browse/DRILL-6680 > Project: Apache Drill > Issue Type: New Feature >Affects Versions: 1.14.0 >Reporter: Arina Ielchiieva >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting, re > Fix For: 1.15.0 > > > Link to design document - > https://docs.google.com/document/d/1UnvATwH4obn1-XsA83xMz3LtylbMu867eBLN9r3eV3k/edit# -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6680) Expose SHOW FILES command into INFORMATION_SCHEMA
[ https://issues.apache.org/jira/browse/DRILL-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578302#comment-16578302 ] ASF GitHub Bot commented on DRILL-6680: --- arina-ielchiieva commented on issue #1430: DRILL-6680: Expose show files command into INFORMATION_SCHEMA URL: https://github.com/apache/drill/pull/1430#issuecomment-412522612 @vvysotskyi thanks for the code review, addressed CR comments. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Expose SHOW FILES command into INFORMATION_SCHEMA > - > > Key: DRILL-6680 > URL: https://issues.apache.org/jira/browse/DRILL-6680 > Project: Apache Drill > Issue Type: New Feature >Affects Versions: 1.14.0 >Reporter: Arina Ielchiieva >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting > Fix For: 1.15.0 > > > Link to design document - > https://docs.google.com/document/d/1UnvATwH4obn1-XsA83xMz3LtylbMu867eBLN9r3eV3k/edit# -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6670) Error in parquet record reader - previously readable file fails to be read in 1.14
[ https://issues.apache.org/jira/browse/DRILL-6670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578162#comment-16578162 ] ASF GitHub Bot commented on DRILL-6670: --- okalinin commented on issue #1428: DRILL-6670: align Parquet TIMESTAMP_MICROS logical type handling with earlier versions + minor fixes URL: https://github.com/apache/drill/pull/1428#issuecomment-412493398 Comment addressed, conflicts resolved. PR should be OK for further review. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Error in parquet record reader - previously readable file fails to be read in > 1.14 > -- > > Key: DRILL-6670 > URL: https://issues.apache.org/jira/browse/DRILL-6670 > Project: Apache Drill > Issue Type: Bug > Components: Storage - Parquet >Affects Versions: 1.14.0 >Reporter: Dave Challis >Assignee: Oleksandr Kalinin >Priority: Major > Fix For: 1.15.0 > > Attachments: example.parquet > > > Parquet file which was generated by PyArrow was readable in Apache Drill 1.12 > and 1.13, but fails to be read with 1.14. > Running the query "SELECT * FROM dfs.`foo.parquet`" results in the following > error message from the Drill web query UI: > {code} > Query Failed: An Error Occurred > org.apache.drill.common.exceptions.UserRemoteException: INTERNAL_ERROR ERROR: > Error in parquet record reader. 
Message: Failure in setting up reader Parquet > Metadata: ParquetMetaData{FileMetaData{schema: message schema { optional > binary name (UTF8); optional binary creation_parameters (UTF8); optional > int64 creation_date (TIMESTAMP_MICROS); optional int32 data_version; optional > int32 schema_version; } , metadata: {pandas={"index_columns": [], > "column_indexes": [], "columns": [{"name": "name", "field_name": "name", > "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": > "creation_parameters", "field_name": "creation_parameters", "pandas_type": > "unicode", "numpy_type": "object", "metadata": null}, {"name": > "creation_date", "field_name": "creation_date", "pandas_type": "datetime", > "numpy_type": "datetime64[ns]", "metadata": null}, {"name": "data_version", > "field_name": "data_version", "pandas_type": "int32", "numpy_type": "int32", > "metadata": null}, {"name": "schema_version", "field_name": "schema_version", > "pandas_type": "int32", "numpy_type": "int32", "metadata": null}], > "pandas_version": "0.22.0"}}}, blocks: [BlockMetaData{1, 27142 > [ColumnMetaData{SNAPPY [name] optional binary name (UTF8) [PLAIN, RLE], 4}, > ColumnMetaData{SNAPPY [creation_parameters] optional binary > creation_parameters (UTF8) [PLAIN, RLE], 252}, ColumnMetaData{SNAPPY > [creation_date] optional int64 creation_date (TIMESTAMP_MICROS) [PLAIN, RLE], > 46334}, ColumnMetaData{SNAPPY [data_version] optional int32 data_version > [PLAIN, RLE], 46478}, ColumnMetaData{SNAPPY [schema_version] optional int32 > schema_version [PLAIN, RLE], 46593}]}]} Fragment 0:0 [Error Id: > bdb2e4d5-5982-4cc6-b95e-244782f827d2 on f9d0456cddd2:31010] > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
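Background that may help when reading this message (my summary, not from the ticket): a Parquet TIMESTAMP_MICROS column stores an int64 count of microseconds since the epoch, while Drill's TIMESTAMP vectors hold milliseconds, so reading such a column involves a truncating division. The helper below is a hypothetical illustration of that conversion, not Drill's API:

```java
// Hypothetical helper illustrating the TIMESTAMP_MICROS -> milliseconds
// conversion involved in reading this logical type. Drill's reader performs
// the equivalent inside its Parquet column converters, not via a helper
// like this one.
public class MicrosToMillis {
  static long microsToMillis(long micros) {
    return micros / 1_000L; // truncation drops sub-millisecond precision
  }

  public static void main(String[] args) {
    long micros = 1_534_118_400_000_000L; // example epoch-microseconds value
    System.out.println(microsToMillis(micros)); // prints 1534118400000
  }
}
```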
[jira] [Commented] (DRILL-6453) TPC-DS query 72 has regressed
[ https://issues.apache.org/jira/browse/DRILL-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578160#comment-16578160 ] ASF GitHub Bot commented on DRILL-6453: --- asfgit closed pull request #1408: DRILL-6453: Resolve deadlock when reading from build and probe sides simultaneously in HashJoin URL: https://github.com/apache/drill/pull/1408 This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance: As this is a foreign pull request (from a fork), the diff is supplied below (as it won't show otherwise due to GitHub magic): diff --git a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashPartition.java b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashPartition.java index eaccd335527..fbdc4f3b8a1 100644 --- a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashPartition.java +++ b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashPartition.java @@ -17,6 +17,7 @@ */ package org.apache.drill.exec.physical.impl.common; +import com.google.common.base.Preconditions; import com.google.common.collect.Lists; import org.apache.drill.common.exceptions.RetryAfterSpillException; import org.apache.drill.common.exceptions.UserException; @@ -122,6 +123,7 @@ private List inMemoryBatchStats = Lists.newArrayList(); private long partitionInMemorySize; private long numInMemoryRecords; + private boolean updatedRecordsPerBatch = false; public HashPartition(FragmentContext context, BufferAllocator allocator, ChainedHashTable baseHashTable, RecordBatch buildBatch, RecordBatch probeBatch, @@ -155,6 +157,18 @@ public HashPartition(FragmentContext context, BufferAllocator allocator, Chained } } + /** + * Configure a different temporary batch size when spilling probe batches. + * @param newRecordsPerBatch The new temporary batch size to use. 
+ */ + public void updateProbeRecordsPerBatch(int newRecordsPerBatch) { +Preconditions.checkArgument(newRecordsPerBatch > 0); +Preconditions.checkState(!updatedRecordsPerBatch); // Only allow updating once +Preconditions.checkState(processingOuter); // We can only update the records per batch when probing. + +recordsPerBatch = newRecordsPerBatch; + } + /** * Allocate a new vector container for either right or left record batch * Add an additional special vector for the hash values diff --git a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/BatchSizePredictor.java b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/BatchSizePredictor.java new file mode 100644 index 000..912e4feaf3c --- /dev/null +++ b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/BatchSizePredictor.java @@ -0,0 +1,79 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.drill.exec.physical.impl.join; + +import org.apache.drill.exec.record.RecordBatch; + +/** + * This class predicts the sizes of batches given an input batch. 
+ * + * Invariants + * + * The {@link BatchSizePredictor} assumes that a {@link RecordBatch} is in a state where it can return a valid record count. + * + */ +public interface BatchSizePredictor { + /** + * Gets the batchSize computed in the call to {@link #updateStats()}. Returns 0 if {@link #hadDataLastTime()} is false. + * @return Gets the batchSize computed in the call to {@link #updateStats()}. Returns 0 if {@link #hadDataLastTime()} is false. + * @throws IllegalStateException if {@link #updateStats()} was never called. + */ + long getBatchSize(); + + /** + * Gets the number of records computed in the call to {@link #updateStats()}. Returns 0 if {@link #hadDataLastTime()} is false. + * @return Gets the number of records computed in the call to {@link #updateStats()}. Returns 0 if {@link #hadDataLastTime()} is false. + * @throws IllegalStateException if {@link #updateStats()} was never called. + */ + int getNumRecords(); + + /** + * True if the i
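The `updateProbeRecordsPerBatch` method in the diff above guards a one-time, probe-phase-only update with three preconditions. Here is a stand-alone sketch of the same pattern in plain Java (a hypothetical class: Guava's `Preconditions` calls are replaced by explicit throws, and setting the `updated` flag is implied by the "Only allow updating once" comment, since the quoted diff is truncated):

```java
// Hypothetical stand-alone sketch of the one-time update guard in
// HashPartition.updateProbeRecordsPerBatch. Plain throws replace Guava's
// Preconditions; the processingOuter flag models "we are in the probe phase".
public class ProbeBatchSize {
  private int recordsPerBatch = 1024;     // initial batch size
  private boolean updated = false;        // only one update is allowed
  private boolean processingOuter = true; // assume probe phase for the sketch

  public void updateProbeRecordsPerBatch(int newRecordsPerBatch) {
    if (newRecordsPerBatch <= 0) {
      throw new IllegalArgumentException("batch size must be positive");
    }
    if (updated) {
      throw new IllegalStateException("records per batch may only be updated once");
    }
    if (!processingOuter) {
      throw new IllegalStateException("may only update while probing");
    }
    recordsPerBatch = newRecordsPerBatch;
    updated = true;
  }

  public int getRecordsPerBatch() {
    return recordsPerBatch;
  }

  public static void main(String[] args) {
    ProbeBatchSize p = new ProbeBatchSize();
    p.updateProbeRecordsPerBatch(512);
    System.out.println(p.getRecordsPerBatch()); // prints 512
    try {
      p.updateProbeRecordsPerBatch(256);        // second update must fail
    } catch (IllegalStateException e) {
      System.out.println("second update rejected");
    }
  }
}
```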
[jira] [Commented] (DRILL-5179) Dropping NULL Columns from putting in parquet files
[ https://issues.apache.org/jira/browse/DRILL-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578123#comment-16578123 ] mehran commented on DRILL-5179: --- It would be convenient to have a skip_null_columns option in the Parquet writer. > Dropping NULL Columns from putting in parquet files > --- > > Key: DRILL-5179 > URL: https://issues.apache.org/jira/browse/DRILL-5179 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.9.0 >Reporter: mehran >Priority: Major > > Due to the schemaless nature of Drill and Parquet files, it would be suitable for all > NULL columns to be dropped by the "CREATE TABLE" command when writing from JSON. This would > improve the speed of header interpretation. I think this is a big enhancement. > This could be an option in the configuration parameters.
[jira] [Commented] (DRILL-6433) a process(find) that never finishes slows down apache drill
[ https://issues.apache.org/jira/browse/DRILL-6433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578120#comment-16578120 ] mehran commented on DRILL-6433: --- I found the problem. You have the following statement in drill-config.sh: JAVA=`find -L "$JAVA_HOME" -name $JAVA_BIN -type f | head -n 1`. The -L option is the problem: this command converts to `find -L / -name java -type f`, which creates a loop. You should omit "-L" from this statement. > a process(find) that never finishes slows down apache drill > --- > > Key: DRILL-6433 > URL: https://issues.apache.org/jira/browse/DRILL-6433 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.13.0 >Reporter: mehran >Priority: Major > > In version 1.13 we have a process as follows: > find -L / -name java -type f > This process is added to the system every day, and it never finishes. > I have 100 such processes on my server. > I did not succeed in setting JAVA_HOME in drill-env.sh, so I set it in /etc/bashrc. > These many processes slow down Apache Drill.
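The fix mehran suggests amounts to dropping `-L` from the java lookup in drill-config.sh so a symlink loop cannot make `find` run forever. A self-contained sketch of the suggested form follows; to keep it runnable it points JAVA_HOME at a small stand-in directory rather than a real JDK, and the variable names mirror drill-config.sh:

```shell
# Sketch of the suggested fix, with a stand-in JAVA_HOME so the example is
# self-contained and fast; a real drill-config.sh would use the actual JDK path.
JAVA_HOME=$(mktemp -d)
JAVA_BIN=java
mkdir -p "$JAVA_HOME/bin"
touch "$JAVA_HOME/bin/java"   # stand-in for the real java binary

# Original (problematic) form, kept for reference:
#   JAVA=`find -L "$JAVA_HOME" -name $JAVA_BIN -type f | head -n 1`
# Suggested form without -L:
JAVA=$(find "$JAVA_HOME" -name "$JAVA_BIN" -type f | head -n 1)
echo "resolved JAVA=$JAVA"
```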
[jira] [Commented] (DRILL-6680) Expose SHOW FILES command into INFORMATION_SCHEMA
[ https://issues.apache.org/jira/browse/DRILL-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578082#comment-16578082 ] ASF GitHub Bot commented on DRILL-6680: --- vvysotskyi commented on a change in pull request #1430: DRILL-6680: Expose show files command into INFORMATION_SCHEMA URL: https://github.com/apache/drill/pull/1430#discussion_r209556134 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/store/ischema/Records.java ## @@ -543,4 +550,38 @@ public Schema(String catalog, String name, String owner, String type, boolean is this.IS_MUTABLE = isMutable ? "YES" : "NO"; } } -} + + public static class File { Review comment: Could you please add Javadoc similar to Javadocs for other nested classes in this class. This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Expose SHOW FILES command into INFORMATION_SCHEMA > - > > Key: DRILL-6680 > URL: https://issues.apache.org/jira/browse/DRILL-6680 > Project: Apache Drill > Issue Type: New Feature >Affects Versions: 1.14.0 >Reporter: Arina Ielchiieva >Assignee: Arina Ielchiieva >Priority: Major > Labels: doc-impacting > Fix For: 1.15.0 > > > Link to design document - > https://docs.google.com/document/d/1UnvATwH4obn1-XsA83xMz3LtylbMu867eBLN9r3eV3k/edit# -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (DRILL-6680) Expose SHOW FILES command into INFORMATION_SCHEMA
[ https://issues.apache.org/jira/browse/DRILL-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578080#comment-16578080 ] ASF GitHub Bot commented on DRILL-6680: --- vvysotskyi commented on a change in pull request #1430: DRILL-6680: Expose show files command into INFORMATION_SCHEMA URL: https://github.com/apache/drill/pull/1430#discussion_r209540538 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/ShowFilesHandler.java ## @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.planner.sql.handlers; + +import org.apache.calcite.schema.SchemaPlus; +import org.apache.calcite.sql.SqlIdentifier; +import org.apache.calcite.sql.SqlLiteral; +import org.apache.calcite.sql.SqlNode; +import org.apache.calcite.sql.SqlNodeList; +import org.apache.calcite.sql.SqlSelect; +import org.apache.calcite.sql.fun.SqlStdOperatorTable; +import org.apache.calcite.sql.parser.SqlParserPos; +import org.apache.calcite.util.Util; +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.exec.planner.sql.SchemaUtilites; +import org.apache.drill.exec.planner.sql.parser.DrillParserUtil; +import org.apache.drill.exec.planner.sql.parser.SqlShowFiles; +import org.apache.drill.exec.store.AbstractSchema; +import org.apache.drill.exec.store.dfs.WorkspaceSchemaFactory.WorkspaceSchema; +import org.apache.drill.exec.store.ischema.InfoSchemaTableType; +import org.apache.drill.exec.work.foreman.ForemanSetupException; + +import java.util.Arrays; +import java.util.Collections; +import java.util.List; + +import static org.apache.drill.exec.store.ischema.InfoSchemaConstants.FILES_COL_RELATIVE_PATH; +import static org.apache.drill.exec.store.ischema.InfoSchemaConstants.FILES_COL_SCHEMA_NAME; +import static org.apache.drill.exec.store.ischema.InfoSchemaConstants.IS_SCHEMA_NAME; + + +public class ShowFilesHandler extends DefaultSqlHandler { + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(SetOptionHandler.class); + + public ShowFilesHandler(SqlHandlerConfig config) { +super(config); + } + + /** Rewrite the parse tree as SELECT ... FROM INFORMATION_SCHEMA.FILES ... */ + @Override + public SqlNode rewrite(SqlNode sqlNode) throws ForemanSetupException { + +List selectList = Collections.singletonList(SqlIdentifier.star(SqlParserPos.ZERO)); + +SqlNode fromClause = new SqlIdentifier( Review comment: The constructor which takes only list of names and `SqlParserPos` may be used here. 
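For readers following the handler code quoted above: the rewrite turns `SHOW FILES [IN <schema>]` into a query over `INFORMATION_SCHEMA.FILES`. The real implementation builds a Calcite `SqlSelect` parse tree; the string-level sketch below only illustrates the shape of the resulting query, and the exact `SCHEMA_NAME` filter is an assumption based on the `FILES_COL_SCHEMA_NAME` constant imported in the handler.

```java
// Hypothetical, string-level illustration of the SHOW FILES rewrite.
// Drill's ShowFilesHandler constructs a Calcite parse tree instead; the
// SCHEMA_NAME filter below is an assumption mirroring FILES_COL_SCHEMA_NAME.
public class ShowFilesRewrite {
  static String rewrite(String schema) {
    String base = "SELECT * FROM INFORMATION_SCHEMA.`FILES`";
    // With no FROM/IN clause the default schema is used and no filter is added.
    return schema == null ? base : base + " WHERE SCHEMA_NAME = '" + schema + "'";
  }

  public static void main(String[] args) {
    System.out.println(rewrite(null));
    System.out.println(rewrite("dfs.tmp"));
  }
}
```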
[jira] [Commented] (DRILL-6680) Expose SHOW FILES command into INFORMATION_SCHEMA
[ https://issues.apache.org/jira/browse/DRILL-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578081#comment-16578081 ] ASF GitHub Bot commented on DRILL-6680: --- vvysotskyi commented on a change in pull request #1430: DRILL-6680: Expose show files command into INFORMATION_SCHEMA URL: https://github.com/apache/drill/pull/1430#discussion_r209544436 ## File path: exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/ShowFilesHandler.java ## @@ -0,0 +1,120 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.drill.exec.planner.sql.handlers; + +import org.apache.calcite.schema.SchemaPlus; +import org.apache.calcite.sql.SqlIdentifier; +import org.apache.calcite.sql.SqlLiteral; +import org.apache.calcite.sql.SqlNode; +import org.apache.calcite.sql.SqlNodeList; +import org.apache.calcite.sql.SqlSelect; +import org.apache.calcite.sql.fun.SqlStdOperatorTable; +import org.apache.calcite.sql.parser.SqlParserPos; +import org.apache.calcite.util.Util; +import org.apache.drill.common.exceptions.UserException; +import org.apache.drill.exec.planner.sql.SchemaUtilites; +import org.apache.drill.exec.planner.sql.parser.DrillParserUtil; +import org.apache.drill.exec.planner.sql.parser.SqlShowFiles; +import org.apache.drill.exec.store.AbstractSchema; +import org.apache.drill.exec.store.dfs.WorkspaceSchemaFactory.WorkspaceSchema; +import org.apache.drill.exec.store.ischema.InfoSchemaTableType; +import org.apache.drill.exec.work.foreman.ForemanSetupException; + +import java.util.Arrays; +import java.util.Collections; +import java.util.List; + +import static org.apache.drill.exec.store.ischema.InfoSchemaConstants.FILES_COL_RELATIVE_PATH; +import static org.apache.drill.exec.store.ischema.InfoSchemaConstants.FILES_COL_SCHEMA_NAME; +import static org.apache.drill.exec.store.ischema.InfoSchemaConstants.IS_SCHEMA_NAME; + + +public class ShowFilesHandler extends DefaultSqlHandler { + private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(SetOptionHandler.class); + + public ShowFilesHandler(SqlHandlerConfig config) { +super(config); + } + + /** Rewrite the parse tree as SELECT ... FROM INFORMATION_SCHEMA.FILES ... 
*/ + @Override + public SqlNode rewrite(SqlNode sqlNode) throws ForemanSetupException { + +List selectList = Collections.singletonList(SqlIdentifier.star(SqlParserPos.ZERO)); + +SqlNode fromClause = new SqlIdentifier( +Arrays.asList(IS_SCHEMA_NAME, InfoSchemaTableType.FILES.name()), null, SqlParserPos.ZERO, null); + +SchemaPlus defaultSchema = config.getConverter().getDefaultSchema(); +SchemaPlus drillSchema = defaultSchema; + +SqlShowFiles showFiles = unwrap(sqlNode, SqlShowFiles.class); +SqlIdentifier from = showFiles.getDb(); +boolean addRelativePathLikeClause = false; + +// Show files can be used without from clause, in which case we display the files in the default schema +if (from != null) { + // We are not sure if the full from clause is just the schema or includes table name, + // first try to see if the full path specified is a schema + drillSchema = SchemaUtilites.findSchema(defaultSchema, from.names); + if (drillSchema == null) { +// Entire from clause is not a schema, try to obtain the schema without the last part of the specified clause. +drillSchema = SchemaUtilites.findSchema(defaultSchema, from.names.subList(0, from.names.size() - 1)); +addRelativePathLikeClause = true; + } + + if (drillSchema == null) { +throw UserException.validationError() +.message("Invalid FROM/IN clause [%s]", from.toString()) +.build(logger); + } +} + +WorkspaceSchema wsSchema; + +try { + AbstractSchema abstractSchema = drillSchema.unwrap(AbstractSchema.class); + if (abstractSchema instanceof WorkspaceSchema) { Review comment: `WorkspaceSchema` uses implementation of `getDefaultSchema()` method from `AbstractSchema` which returns `this` instance. So I think `if` block may be replaced with the statement from `else` branc