[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579169#comment-16579169
 ] 

ASF GitHub Bot commented on DRILL-6385:
---

sohami commented on issue #1334: DRILL-6385: Support JPPD feature
URL: https://github.com/apache/drill/pull/1334#issuecomment-412738302
 
 
   @weijietong - Thanks for making the changes. But I am seeing issues in 
implementation related to how internals work, following protocol of iterative 
model of Drill and few other things. I have left few comments for now and was 
wondering that it will be easier to provide a commit myself to address those 
issues and help you get this PR pushed in faster.
   
   Also I was working on a change for SV2 optimization 
[here](https://github.com/sohami/drill/commits/SV2_Optimization). Given you 
also have subset of the change for it with some issues, it would be great if 
you can rebase on top of my change.
   Please let me know what you think.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support JPPD (Join Predicate Push Down)
> ---
>
> Key: DRILL-6385
> URL: https://issues.apache.org/jira/browse/DRILL-6385
> Project: Apache Drill
>  Issue Type: New Feature
>  Components:  Server, Execution - Flow
>Affects Versions: 1.14.0
>Reporter: weijie.tong
>Assignee: weijie.tong
>Priority: Major
>
> This feature is to support the JPPD (Join Predicate Push Down). It will 
> benefit the HashJoin ,Broadcast HashJoin performance by reducing the number 
> of rows to send across the network ,the memory consumed. This feature is 
> already supported by Impala which calls it RuntimeFilter 
> ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]).
>  The first PR will try to push down a bloom filter of HashJoin node to 
> Parquet’s scan node.   The propose basic procedure is described as follow:
>  # The HashJoin build side accumulate the equal join condition rows to 
> construct a bloom filter. Then it sends out the bloom filter to the foreman 
> node.
>  # The foreman node accept the bloom filters passively from all the fragments 
> that has the HashJoin operator. It then aggregates the bloom filters to form 
> a global bloom filter.
>  # The foreman node broadcasts the global bloom filter to all the probe side 
> scan nodes which maybe already have send out partial data to the hash join 
> nodes(currently the hash join node will prefetch one batch from both sides ).
>       4.  The scan node accepts a global bloom filter from the foreman node. 
> It will filter the rest rows satisfying the bloom filter.
>  
> To implement above execution flow, some main new notion described as below:
>       1. RuntimeFilter
> It’s a filter container which may contain BloomFilter or MinMaxFilter.
>       2. RuntimeFilterReporter
> It wraps the logic to send hash join’s bloom filter to the foreman.The 
> serialized bloom filter will be sent out through the data tunnel.This object 
> will be instanced by the FragmentExecutor and passed to the 
> FragmentContext.So the HashJoin operator can obtain it through the 
> FragmentContext.
>      3. RuntimeFilterRequestHandler
> It is responsible to accept a SendRuntimeFilterRequest RPC to strip the 
> actual BloomFilter from the network. It then translates this filter to the 
> WorkerBee’s new interface registerRuntimeFilter.
> Another RPC type is BroadcastRuntimeFilterRequest. It will register the 
> accepted global bloom filter to the WorkerBee by the registerRuntimeFilter 
> method and then propagate to the FragmentContext through which the probe side 
> scan node can fetch the aggregated bloom filter.
>       4.RuntimeFilterManager
> The foreman will instance a RuntimeFilterManager .It will indirectly get 
> every RuntimeFilter by the WorkerBee. Once all the BloomFilters have been 
> accepted and aggregated . It will broadcast the aggregated bloom filter to 
> all the probe side scan nodes through the data tunnel by a 
> BroadcastRuntimeFilterRequest RPC.
>      5. RuntimeFilterEnableOption 
>  A global option will be added to decide whether to enable this new feature.
>  
> Welcome suggestion and advice from you.The related PR will be presented as 
> soon as possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579168#comment-16579168
 ] 

ASF GitHub Bot commented on DRILL-6385:
---

sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD 
feature
URL: https://github.com/apache/drill/pull/1334#discussion_r209803392
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/filter/RuntimeFilterRecordBatch.java
 ##
 @@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.filter;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.ExpressionPosition;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.PathSegment;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ValueVectorReadExpression;
+import org.apache.drill.exec.expr.fn.impl.ValueVectorHashHelper;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.Filter;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.record.selection.SelectionVector2;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+import org.apache.drill.exec.work.filter.BloomFilter;
+import org.apache.drill.exec.work.filter.RuntimeFilterWritable;
+import java.util.ArrayList;
+import java.util.BitSet;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * A RuntimeFilterRecordBatch steps over the ScanBatch. If the ScanBatch 
participates
+ * in the HashJoinBatch and can be applied by a RuntimeFilter, it will 
generate a filtered
+ * SV2, otherwise will generate a same recordCount-originalRecordCount SV2 
which will not affect
+ * the Query's performance ,but just do a memory transfer by the later 
RemovingRecordBatch op.
+ */
+public class RuntimeFilterRecordBatch extends 
AbstractSingleRecordBatch {
+  private SelectionVector2 sv2;
+
+  private ValueVectorHashHelper.Hash64 hash64;
+  private Map field2id = new HashMap<>();
+  private List toFilterFields;
+  private int originalRecordCount;
+  private int recordCount;
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(RuntimeFilterRecordBatch.class);
+
+  public RuntimeFilterRecordBatch(Filter pop, RecordBatch incoming, 
FragmentContext context) throws OutOfMemoryException {
+super(pop, context, incoming);
+  }
+
+  @Override
+  public FragmentContext getContext() {
+return context;
+  }
+
+  @Override
+  public int getRecordCount() {
+return sv2.getCount();
+  }
+
+  @Override
+  public SelectionVector2 getSelectionVector2() {
+return sv2;
+  }
+
+  @Override
+  public SelectionVector4 getSelectionVector4() {
+return null;
+  }
+
+  @Override
+  protected IterOutcome doWork() {
+container.transferIn(incoming.getContainer());
+originalRecordCount = incoming.getRecordCount();
+sv2.setOriginalRecordCount(originalRecordCount);
+try {
+  applyRuntimeFilter();
+} catch (SchemaChangeException e) {
+  throw new UnsupportedOperationException(e);
+}
+return getFinalOutcome(false);
+  }
+
+  @Override
+  public void close() {
+if (sv2 != null) {
+  sv2.clear();
+}
+super.close();
+  }
+
+  @Override
+  protected boolean setupNewSchema() throws SchemaChangeException {
 
 Review comment:
   In `setupNewSchema` the output container needs to prepared properly before 
`doWork` can be called.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service,

[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579162#comment-16579162
 ] 

ASF GitHub Bot commented on DRILL-6385:
---

sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD 
feature
URL: https://github.com/apache/drill/pull/1334#discussion_r209718874
 
 

 ##
 File path: 
protocol/src/main/java/org/apache/drill/exec/proto/UserBitShared.java
 ##
 @@ -585,6 +585,10 @@ private FragmentState(int index, int value) {
  * PARTITION_LIMIT = 54;
  */
 PARTITION_LIMIT(54, 54),
+/**
+ * RUNTIME_FILTER = 55;
+ */
+RUNTIME_FILTER(55, 55)
 
 Review comment:
   this file is generated after making changes in the `.proto` files. Please 
add this new operator type 
[here](https://github.com/apache/drill/blob/master/protocol/src/main/protobuf/UserBitShared.proto#L345)
 and later generate the .java and .cc files following instructions 
[here](https://github.com/apache/drill/blob/master/protocol/readme.txt).


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support JPPD (Join Predicate Push Down)
> ---
>
> Key: DRILL-6385
> URL: https://issues.apache.org/jira/browse/DRILL-6385
> Project: Apache Drill
>  Issue Type: New Feature
>  Components:  Server, Execution - Flow
>Affects Versions: 1.14.0
>Reporter: weijie.tong
>Assignee: weijie.tong
>Priority: Major
>
> This feature is to support the JPPD (Join Predicate Push Down). It will 
> benefit the HashJoin ,Broadcast HashJoin performance by reducing the number 
> of rows to send across the network ,the memory consumed. This feature is 
> already supported by Impala which calls it RuntimeFilter 
> ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]).
>  The first PR will try to push down a bloom filter of HashJoin node to 
> Parquet’s scan node.   The propose basic procedure is described as follow:
>  # The HashJoin build side accumulate the equal join condition rows to 
> construct a bloom filter. Then it sends out the bloom filter to the foreman 
> node.
>  # The foreman node accept the bloom filters passively from all the fragments 
> that has the HashJoin operator. It then aggregates the bloom filters to form 
> a global bloom filter.
>  # The foreman node broadcasts the global bloom filter to all the probe side 
> scan nodes which maybe already have send out partial data to the hash join 
> nodes(currently the hash join node will prefetch one batch from both sides ).
>       4.  The scan node accepts a global bloom filter from the foreman node. 
> It will filter the rest rows satisfying the bloom filter.
>  
> To implement above execution flow, some main new notion described as below:
>       1. RuntimeFilter
> It’s a filter container which may contain BloomFilter or MinMaxFilter.
>       2. RuntimeFilterReporter
> It wraps the logic to send hash join’s bloom filter to the foreman.The 
> serialized bloom filter will be sent out through the data tunnel.This object 
> will be instanced by the FragmentExecutor and passed to the 
> FragmentContext.So the HashJoin operator can obtain it through the 
> FragmentContext.
>      3. RuntimeFilterRequestHandler
> It is responsible to accept a SendRuntimeFilterRequest RPC to strip the 
> actual BloomFilter from the network. It then translates this filter to the 
> WorkerBee’s new interface registerRuntimeFilter.
> Another RPC type is BroadcastRuntimeFilterRequest. It will register the 
> accepted global bloom filter to the WorkerBee by the registerRuntimeFilter 
> method and then propagate to the FragmentContext through which the probe side 
> scan node can fetch the aggregated bloom filter.
>       4.RuntimeFilterManager
> The foreman will instance a RuntimeFilterManager .It will indirectly get 
> every RuntimeFilter by the WorkerBee. Once all the BloomFilters have been 
> accepted and aggregated . It will broadcast the aggregated bloom filter to 
> all the probe side scan nodes through the data tunnel by a 
> BroadcastRuntimeFilterRequest RPC.
>      5. RuntimeFilterEnableOption 
>  A global option will be added to decide whether to enable this new feature.
>  
> Welcome suggestion and advice from you.The related PR will be presented as 
> soon as possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579167#comment-16579167
 ] 

ASF GitHub Bot commented on DRILL-6385:
---

sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD 
feature
URL: https://github.com/apache/drill/pull/1334#discussion_r209815818
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/filter/RuntimeFilterRecordBatch.java
 ##
 @@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.filter;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.ExpressionPosition;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.PathSegment;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ValueVectorReadExpression;
+import org.apache.drill.exec.expr.fn.impl.ValueVectorHashHelper;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.Filter;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.record.selection.SelectionVector2;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+import org.apache.drill.exec.work.filter.BloomFilter;
+import org.apache.drill.exec.work.filter.RuntimeFilterWritable;
+import java.util.ArrayList;
+import java.util.BitSet;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * A RuntimeFilterRecordBatch steps over the ScanBatch. If the ScanBatch 
participates
+ * in the HashJoinBatch and can be applied by a RuntimeFilter, it will 
generate a filtered
+ * SV2, otherwise will generate a same recordCount-originalRecordCount SV2 
which will not affect
+ * the Query's performance ,but just do a memory transfer by the later 
RemovingRecordBatch op.
+ */
+public class RuntimeFilterRecordBatch extends 
AbstractSingleRecordBatch {
+  private SelectionVector2 sv2;
+
+  private ValueVectorHashHelper.Hash64 hash64;
+  private Map field2id = new HashMap<>();
+  private List toFilterFields;
+  private int originalRecordCount;
+  private int recordCount;
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(RuntimeFilterRecordBatch.class);
+
+  public RuntimeFilterRecordBatch(Filter pop, RecordBatch incoming, 
FragmentContext context) throws OutOfMemoryException {
+super(pop, context, incoming);
+  }
+
+  @Override
+  public FragmentContext getContext() {
+return context;
+  }
+
+  @Override
+  public int getRecordCount() {
+return sv2.getCount();
+  }
+
+  @Override
+  public SelectionVector2 getSelectionVector2() {
+return sv2;
+  }
+
+  @Override
+  public SelectionVector4 getSelectionVector4() {
+return null;
+  }
+
+  @Override
+  protected IterOutcome doWork() {
+container.transferIn(incoming.getContainer());
+originalRecordCount = incoming.getRecordCount();
+sv2.setOriginalRecordCount(originalRecordCount);
+try {
+  applyRuntimeFilter();
+} catch (SchemaChangeException e) {
+  throw new UnsupportedOperationException(e);
+}
+return getFinalOutcome(false);
+  }
+
+  @Override
+  public void close() {
+if (sv2 != null) {
+  sv2.clear();
+}
+super.close();
+  }
+
+  @Override
+  protected boolean setupNewSchema() throws SchemaChangeException {
+if (sv2 != null) {
+  sv2.clear();
+}
+
+switch (incoming.getSchema().getSelectionVectorMode()) {
+  case NONE:
+if (sv2 == null) {
+  sv2 = new SelectionVector2(oContext.getAllocator());
+}
+break;
+  case TWO_BYTE:
+sv2 = new SelectionVector2(oContext.getAllocator());
+break;
+  case FOUR_BYTE:
+
+  

[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579164#comment-16579164
 ] 

ASF GitHub Bot commented on DRILL-6385:
---

sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD 
feature
URL: https://github.com/apache/drill/pull/1334#discussion_r209723766
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DefaultSqlHandler.java
 ##
 @@ -594,15 +595,24 @@ protected Prel convertToPrel(RelNode drel, RelDataType 
validatedRowType) throws
  */
 phyRelNode = InsertLocalExchangeVisitor.insertLocalExchanges(phyRelNode, 
queryOptions);
 
+/*
+ * 8.)
+ * Insert RuntimeFilter over Scan nodes
+ */
+if (context.isRuntimeFilterEnabled()) {
+  phyRelNode = 
RuntimeFilterPrelVisitor.addRuntimeFilterPrelOverScanPrel(phyRelNode);
 
 Review comment:
   How will this know to add `RuntimeFilter` operator downstream of only probe 
side Scan and not on build side Scan ?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support JPPD (Join Predicate Push Down)
> ---
>
> Key: DRILL-6385
> URL: https://issues.apache.org/jira/browse/DRILL-6385
> Project: Apache Drill
>  Issue Type: New Feature
>  Components:  Server, Execution - Flow
>Affects Versions: 1.14.0
>Reporter: weijie.tong
>Assignee: weijie.tong
>Priority: Major
>
> This feature is to support the JPPD (Join Predicate Push Down). It will 
> benefit the HashJoin ,Broadcast HashJoin performance by reducing the number 
> of rows to send across the network ,the memory consumed. This feature is 
> already supported by Impala which calls it RuntimeFilter 
> ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]).
>  The first PR will try to push down a bloom filter of HashJoin node to 
> Parquet’s scan node.   The propose basic procedure is described as follow:
>  # The HashJoin build side accumulate the equal join condition rows to 
> construct a bloom filter. Then it sends out the bloom filter to the foreman 
> node.
>  # The foreman node accept the bloom filters passively from all the fragments 
> that has the HashJoin operator. It then aggregates the bloom filters to form 
> a global bloom filter.
>  # The foreman node broadcasts the global bloom filter to all the probe side 
> scan nodes which maybe already have send out partial data to the hash join 
> nodes(currently the hash join node will prefetch one batch from both sides ).
>       4.  The scan node accepts a global bloom filter from the foreman node. 
> It will filter the rest rows satisfying the bloom filter.
>  
> To implement above execution flow, some main new notion described as below:
>       1. RuntimeFilter
> It’s a filter container which may contain BloomFilter or MinMaxFilter.
>       2. RuntimeFilterReporter
> It wraps the logic to send hash join’s bloom filter to the foreman.The 
> serialized bloom filter will be sent out through the data tunnel.This object 
> will be instanced by the FragmentExecutor and passed to the 
> FragmentContext.So the HashJoin operator can obtain it through the 
> FragmentContext.
>      3. RuntimeFilterRequestHandler
> It is responsible to accept a SendRuntimeFilterRequest RPC to strip the 
> actual BloomFilter from the network. It then translates this filter to the 
> WorkerBee’s new interface registerRuntimeFilter.
> Another RPC type is BroadcastRuntimeFilterRequest. It will register the 
> accepted global bloom filter to the WorkerBee by the registerRuntimeFilter 
> method and then propagate to the FragmentContext through which the probe side 
> scan node can fetch the aggregated bloom filter.
>       4.RuntimeFilterManager
> The foreman will instance a RuntimeFilterManager .It will indirectly get 
> every RuntimeFilter by the WorkerBee. Once all the BloomFilters have been 
> accepted and aggregated . It will broadcast the aggregated bloom filter to 
> all the probe side scan nodes through the data tunnel by a 
> BroadcastRuntimeFilterRequest RPC.
>      5. RuntimeFilterEnableOption 
>  A global option will be added to decide whether to enable this new feature.
>  
> Welcome suggestion and advice from you.The related PR will be presented as 
> soon as possible.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579166#comment-16579166
 ] 

ASF GitHub Bot commented on DRILL-6385:
---

sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD 
feature
URL: https://github.com/apache/drill/pull/1334#discussion_r209791450
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/visitor/RuntimeFilterPrelVisitor.java
 ##
 @@ -0,0 +1,38 @@
+package org.apache.drill.exec.planner.physical.visitor;
+
+import com.google.common.collect.Lists;
+import org.apache.calcite.rel.RelNode;
+import org.apache.drill.exec.planner.physical.Prel;
+import org.apache.drill.exec.planner.physical.RuntimeFilterPrel;
+import org.apache.drill.exec.planner.physical.ScanPrel;
+
+import java.util.List;
+
+/**
+ * Generate a RuntimeFilterPrel over all the ScanPrel.
+ */
+public class RuntimeFilterPrelVisitor extends BasePrelVisitor{
+
+  private static RuntimeFilterPrelVisitor INSTANCE = new 
RuntimeFilterPrelVisitor();
+
+  public static Prel addRuntimeFilterPrelOverScanPrel(Prel prel) {
+return prel.accept(INSTANCE, null);
+  }
+
+  @Override
+  public Prel visitPrel(Prel prel, Void value) throws RuntimeException {
+if (prel instanceof ScanPrel) {
+  List children = Lists.newArrayList();
+  RuntimeFilterPrel runtimeFilterPrel = new RuntimeFilterPrel(prel);
+  children.add(runtimeFilterPrel);
+  return (Prel) prel.copy(prel.getTraitSet(), children);
 
 Review comment:
   This will call `copy` on `ScanPrel` which ignores the children probably 
because `Scan` is leaf node and there won't be any child. I think you should 
just return `runtimeFilterPrel` instead ? 
   Or override method called `visitScan(ScanPrel prel, EXTRA value)` and put 
the code under if inside that method which will be as below:
   
   ```
   @Override
   public Prel visitScan(ScanPrel prel, Void value) throws RuntimeException {
 RuntimeFilterPrel runtimeFilterPrel = new RuntimeFilterPrel(prel);
 return runtimeFilterPrel;
   }
   ```


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support JPPD (Join Predicate Push Down)
> ---
>
> Key: DRILL-6385
> URL: https://issues.apache.org/jira/browse/DRILL-6385
> Project: Apache Drill
>  Issue Type: New Feature
>  Components:  Server, Execution - Flow
>Affects Versions: 1.14.0
>Reporter: weijie.tong
>Assignee: weijie.tong
>Priority: Major
>
> This feature is to support the JPPD (Join Predicate Push Down). It will 
> benefit the HashJoin ,Broadcast HashJoin performance by reducing the number 
> of rows to send across the network ,the memory consumed. This feature is 
> already supported by Impala which calls it RuntimeFilter 
> ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]).
>  The first PR will try to push down a bloom filter of HashJoin node to 
> Parquet’s scan node.   The propose basic procedure is described as follow:
>  # The HashJoin build side accumulate the equal join condition rows to 
> construct a bloom filter. Then it sends out the bloom filter to the foreman 
> node.
>  # The foreman node accept the bloom filters passively from all the fragments 
> that has the HashJoin operator. It then aggregates the bloom filters to form 
> a global bloom filter.
>  # The foreman node broadcasts the global bloom filter to all the probe side 
> scan nodes which maybe already have send out partial data to the hash join 
> nodes(currently the hash join node will prefetch one batch from both sides ).
>       4.  The scan node accepts a global bloom filter from the foreman node. 
> It will filter the rest rows satisfying the bloom filter.
>  
> To implement above execution flow, some main new notion described as below:
>       1. RuntimeFilter
> It’s a filter container which may contain BloomFilter or MinMaxFilter.
>       2. RuntimeFilterReporter
> It wraps the logic to send hash join’s bloom filter to the foreman.The 
> serialized bloom filter will be sent out through the data tunnel.This object 
> will be instanced by the FragmentExecutor and passed to the 
> FragmentContext.So the HashJoin operator can obtain it through the 
> FragmentContext.
>      3. RuntimeFilterRequestHandler
> It is responsible to accept a SendRuntimeFilterRequest RPC to strip the 
> actual BloomFilter from the network. It then translates this filter to the 
> WorkerBee’s new interface registerRuntimeFilter.
> Another RPC type is Broad

[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579165#comment-16579165
 ] 

ASF GitHub Bot commented on DRILL-6385:
---

sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD 
feature
URL: https://github.com/apache/drill/pull/1334#discussion_r209791940
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/RuntimeFilterPrel.java
 ##
 @@ -0,0 +1,61 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.planner.physical;
+
+import org.apache.calcite.plan.RelOptCluster;
+import org.apache.calcite.plan.RelTraitSet;
+import org.apache.calcite.rel.RelNode;
+import org.apache.drill.exec.physical.base.PhysicalOperator;
+import org.apache.drill.exec.physical.config.RuntimeFilterPOP;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+
+import java.io.IOException;
+import java.util.List;
+
+public class RuntimeFilterPrel extends SinglePrel{
+  static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(RuntimeFilterPrel.class);
+
+  public RuntimeFilterPrel(Prel child){
+super(child.getCluster(), child.getTraitSet(), child);
+  }
+
+  public RuntimeFilterPrel(RelOptCluster cluster, RelTraitSet traits, RelNode 
child) {
+super(cluster, traits, child);
+  }
+
+  @Override
+  public RelNode copy(RelTraitSet traitSet, List inputs) {
+return new RuntimeFilterPrel(this.getCluster(), traitSet, inputs.get(0));
+  }
+
+  @Override
+  public PhysicalOperator getPhysicalOperator(PhysicalPlanCreator creator) 
throws IOException {
+RuntimeFilterPOP r =  new RuntimeFilterPOP( 
((Prel)getInput()).getPhysicalOperator(creator));
+return creator.addMetadata(this, r);
+  }
+
+  @Override
+  public SelectionVectorMode getEncoding() {
+return SelectionVectorMode.TWO_BYTE;
+  }
+
+  @Override
+  public Prel prepareForLateralUnnestPipeline(List children) {
+return (Prel) this.copy(this.traitSet, children);
 
 Review comment:
   `prepareForLateralUnnestPipeline` be same as `FilterPrel` implementation 
[here](https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/physical/FilterPrel.java#L91)


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support JPPD (Join Predicate Push Down)
> ---
>
> Key: DRILL-6385
> URL: https://issues.apache.org/jira/browse/DRILL-6385
> Project: Apache Drill
>  Issue Type: New Feature
>  Components:  Server, Execution - Flow
>Affects Versions: 1.14.0
>Reporter: weijie.tong
>Assignee: weijie.tong
>Priority: Major
>
> This feature is to support the JPPD (Join Predicate Push Down). It will 
> benefit the HashJoin ,Broadcast HashJoin performance by reducing the number 
> of rows to send across the network ,the memory consumed. This feature is 
> already supported by Impala which calls it RuntimeFilter 
> ([https://www.cloudera.com/documentation/enterprise/5-9-x/topics/impala_runtime_filtering.html]).
>  The first PR will try to push down a bloom filter of HashJoin node to 
> Parquet’s scan node.   The propose basic procedure is described as follow:
>  # The HashJoin build side accumulate the equal join condition rows to 
> construct a bloom filter. Then it sends out the bloom filter to the foreman 
> node.
>  # The foreman node accept the bloom filters passively from all the fragments 
> that has the HashJoin operator. It then aggregates the bloom filters to form 
> a global bloom filter.
>  # The foreman node broadcasts the global bloom filter to all the probe side 
> scan nodes which maybe already have send out partial data to the hash join 
> nodes(currently the hash join node will prefetch one batch from both sides ).
> 

[jira] [Commented] (DRILL-6385) Support JPPD (Join Predicate Push Down)

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579163#comment-16579163
 ] 

ASF GitHub Bot commented on DRILL-6385:
---

sohami commented on a change in pull request #1334: DRILL-6385: Support JPPD 
feature
URL: https://github.com/apache/drill/pull/1334#discussion_r209793080
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/filter/RuntimeFilterRecordBatch.java
 ##
 @@ -0,0 +1,222 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.filter;
+
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.common.expression.ExpressionPosition;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.PathSegment;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ValueVectorReadExpression;
+import org.apache.drill.exec.expr.fn.impl.ValueVectorHashHelper;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.Filter;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.record.selection.SelectionVector2;
+import org.apache.drill.exec.record.selection.SelectionVector4;
+import org.apache.drill.exec.work.filter.BloomFilter;
+import org.apache.drill.exec.work.filter.RuntimeFilterWritable;
+import java.util.ArrayList;
+import java.util.BitSet;
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+/**
+ * A RuntimeFilterRecordBatch steps over the ScanBatch. If the ScanBatch 
participates
+ * in the HashJoinBatch and can be applied by a RuntimeFilter, it will 
generate a filtered
+ * SV2, otherwise will generate a same recordCount-originalRecordCount SV2 
which will not affect
+ * the Query's performance ,but just do a memory transfer by the later 
RemovingRecordBatch op.
+ */
+public class RuntimeFilterRecordBatch extends 
AbstractSingleRecordBatch {
+  private SelectionVector2 sv2;
+
+  private ValueVectorHashHelper.Hash64 hash64;
+  private Map field2id = new HashMap<>();
+  private List toFilterFields;
+  private int originalRecordCount;
+  private int recordCount;
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(RuntimeFilterRecordBatch.class);
+
+  public RuntimeFilterRecordBatch(Filter pop, RecordBatch incoming, 
FragmentContext context) throws OutOfMemoryException {
+super(pop, context, incoming);
+  }
+
+  @Override
+  public FragmentContext getContext() {
+return context;
+  }
+
+  @Override
+  public int getRecordCount() {
+return sv2.getCount();
+  }
+
+  @Override
+  public SelectionVector2 getSelectionVector2() {
+return sv2;
+  }
+
+  @Override
+  public SelectionVector4 getSelectionVector4() {
+return null;
+  }
+
+  @Override
+  protected IterOutcome doWork() {
+container.transferIn(incoming.getContainer());
+originalRecordCount = incoming.getRecordCount();
+sv2.setOriginalRecordCount(originalRecordCount);
+try {
+  applyRuntimeFilter();
+} catch (SchemaChangeException e) {
+  throw new UnsupportedOperationException(e);
+}
+return getFinalOutcome(false);
+  }
+
+  @Override
+  public void close() {
+if (sv2 != null) {
+  sv2.clear();
+}
+super.close();
+  }
+
+  @Override
+  protected boolean setupNewSchema() throws SchemaChangeException {
+if (sv2 != null) {
+  sv2.clear();
+}
+
+switch (incoming.getSchema().getSelectionVectorMode()) {
+  case NONE:
+if (sv2 == null) {
+  sv2 = new SelectionVector2(oContext.getAllocator());
+}
+break;
+  case TWO_BYTE:
+sv2 = new SelectionVector2(oContext.getAllocator());
+break;
+  case FOUR_BYTE:
+
+  

[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579152#comment-16579152
 ] 

ASF GitHub Bot commented on DRILL-6676:
---

ilooner commented on issue #1429: DRILL-6676: Add Union, List and Repeated List 
types to Result Set Loader
URL: https://github.com/apache/drill/pull/1429#issuecomment-412734987
 
 
   Thanks for the help @ppadma. Ran the tests. There are a few failures but 
those are known issues unrelated to your change. So I think we are all good 
with respect to the build.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Union, List and Repeated List types to Result Set Loader
> 
>
> Key: DRILL-6676
> URL: https://issues.apache.org/jira/browse/DRILL-6676
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.15.0
>
>
> Add support for the "obscure" vector types to the {{ResultSetLoader}}:
> * Union
> * List
> * Repeated List



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579097#comment-16579097
 ] 

ASF GitHub Bot commented on DRILL-6676:
---

paul-rogers commented on a change in pull request #1429: DRILL-6676: Add Union, 
List and Repeated List types to Result Set Loader
URL: https://github.com/apache/drill/pull/1429#discussion_r209803870
 
 

 ##
 File path: 
exec/vector/src/main/java/org/apache/drill/exec/vector/complex/ListVector.java
 ##
 @@ -250,22 +270,140 @@ public void load(UserBitShared.SerializedField 
metadata, DrillBuf buffer) {
 bits.load(bitMetadata, buffer.slice(offsetLength, bitLength));
 
 final UserBitShared.SerializedField vectorMetadata = metadata.getChild(2);
-if (getDataVector() == DEFAULT_DATA_VECTOR) {
+if (isEmptyType()) {
   addOrGetVector(VectorDescriptor.create(vectorMetadata.getMajorType()));
 }
 
 final int vectorLength = vectorMetadata.getBufferLength();
 vector.load(vectorMetadata, buffer.slice(offsetLength + bitLength, 
vectorLength));
   }
 
+  public boolean isEmptyType() {
+return getDataVector() == DEFAULT_DATA_VECTOR;
+  }
+
+  @Override
+  public void setChildVector(ValueVector childVector) {
+
+// Unlike the repeated list vector, the (plain) list vector
+// adds the dummy vector as a child type.
+
+assert field.getChildren().size() == 1;
+assert field.getChildren().iterator().next().getType().getMinorType() == 
MinorType.LATE;
+field.removeChild(vector.getField());
+
+super.setChildVector(childVector);
+
+// Initial LATE type vector not added as a subtype initially.
+// So, no need to remove it, just add the new subtype. Since the
+// MajorType is immutable, must build a new one and replace the type
+// in the materialized field. (We replace the type, rather than creating
+// a new materialized field, to preserve the link to this field from
+// a parent map, list or union.)
+
+assert field.getType().getSubTypeCount() == 0;
+field.replaceType(
+field.getType().toBuilder()
+  .addSubType(childVector.getField().getType().getMinorType())
+  .build());
+  }
+
+  /**
+   * Promote the list to a union. Called from old-style writers. This 
implementation
+   * relies on the caller to set the types vector for any existing values.
+   * This method simply clears the existing vector.
+   *
+   * @return the new union vector
+   */
+
   public UnionVector promoteToUnion() {
-MaterializedField newField = 
MaterializedField.create(getField().getName(), Types.optional(MinorType.UNION));
-UnionVector vector = new UnionVector(newField, allocator, null);
+UnionVector vector = createUnion();
+
+// Replace the current vector, clearing its data. (This is the
+// old behavior.
+
 replaceDataVector(vector);
 reader = new UnionListReader(this);
 return vector;
   }
 
+  /**
+   * Revised form of promote to union that correctly fixes up the list
+   * field metadata to match the new union type.
+   *
+   * @return the new union vector
+   */
+
+  public UnionVector promoteToUnion2() {
+UnionVector unionVector = createUnion();
+
+// Replace the current vector, clearing its data. (This is the
+// old behavior.
+
+setChildVector(unionVector);
+return unionVector;
+  }
+
+  /**
+   * Promote to a union, preserving the existing data vector as a member of
+   * the new union. Back-fill the types vector with the proper type value
+   * for existing rows.
+   *
+   * @return the new union vector
+   */
+
+  public UnionVector convertToUnion(int allocValueCount, int valueCount) {
+assert allocValueCount >= valueCount;
+UnionVector unionVector = createUnion();
+unionVector.allocateNew(allocValueCount);
+
+// Preserve the current vector (and its data) if it is other than
+// the default. (New behavior used by column writers.)
+
+if (! isEmptyType()) {
+  unionVector.addType(vector);
+  int prevType = vector.getField().getType().getMinorType().getNumber();
+  UInt1Vector.Mutator typeMutator = 
unionVector.getTypeVector().getMutator();
+
+  // If the previous vector was nullable, then promote the nullable state
+  // to the type vector by setting either the null marker or the type
+  // marker depending on the original nullable values.
+
+  if (vector instanceof NullableVector) {
+UInt1Vector.Accessor bitsAccessor =
+((UInt1Vector) ((NullableVector) 
vector).getBitsVector()).getAccessor();
+for (int i = 0; i < valueCount; i++) {
+  typeMutator.setSafe(i, (bitsAccessor.get(i) == 0)
+  ? UnionVector.NULL_MARKER
+  : prevType);
+}
+  } else {
+
+// The value is not nullable. (Perhaps it is a map.)
+// Note that the original design of lists have a flaw: if the sole 
member
+  

[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579096#comment-16579096
 ] 

ASF GitHub Bot commented on DRILL-6676:
---

paul-rogers commented on a change in pull request #1429: DRILL-6676: Add Union, 
List and Repeated List types to Result Set Loader
URL: https://github.com/apache/drill/pull/1429#discussion_r209803508
 
 

 ##
 File path: 
exec/vector/src/main/java/org/apache/drill/exec/vector/complex/ListVector.java
 ##
 @@ -250,22 +270,140 @@ public void load(UserBitShared.SerializedField 
metadata, DrillBuf buffer) {
 bits.load(bitMetadata, buffer.slice(offsetLength, bitLength));
 
 final UserBitShared.SerializedField vectorMetadata = metadata.getChild(2);
-if (getDataVector() == DEFAULT_DATA_VECTOR) {
+if (isEmptyType()) {
   addOrGetVector(VectorDescriptor.create(vectorMetadata.getMajorType()));
 }
 
 final int vectorLength = vectorMetadata.getBufferLength();
 vector.load(vectorMetadata, buffer.slice(offsetLength + bitLength, 
vectorLength));
   }
 
+  public boolean isEmptyType() {
+return getDataVector() == DEFAULT_DATA_VECTOR;
+  }
+
+  @Override
+  public void setChildVector(ValueVector childVector) {
+
+// Unlike the repeated list vector, the (plain) list vector
+// adds the dummy vector as a child type.
+
+assert field.getChildren().size() == 1;
+assert field.getChildren().iterator().next().getType().getMinorType() == 
MinorType.LATE;
+field.removeChild(vector.getField());
+
+super.setChildVector(childVector);
+
+// Initial LATE type vector not added as a subtype initially.
+// So, no need to remove it, just add the new subtype. Since the
+// MajorType is immutable, must build a new one and replace the type
+// in the materialized field. (We replace the type, rather than creating
+// a new materialized field, to preserve the link to this field from
+// a parent map, list or union.)
+
+assert field.getType().getSubTypeCount() == 0;
+field.replaceType(
+field.getType().toBuilder()
+  .addSubType(childVector.getField().getType().getMinorType())
+  .build());
+  }
+
+  /**
+   * Promote the list to a union. Called from old-style writers. This 
implementation
+   * relies on the caller to set the types vector for any existing values.
+   * This method simply clears the existing vector.
+   *
+   * @return the new union vector
+   */
+
   public UnionVector promoteToUnion() {
-MaterializedField newField = 
MaterializedField.create(getField().getName(), Types.optional(MinorType.UNION));
-UnionVector vector = new UnionVector(newField, allocator, null);
+UnionVector vector = createUnion();
+
+// Replace the current vector, clearing its data. (This is the
+// old behavior.
+
 replaceDataVector(vector);
 reader = new UnionListReader(this);
 return vector;
   }
 
+  /**
+   * Revised form of promote to union that correctly fixes up the list
+   * field metadata to match the new union type.
+   *
+   * @return the new union vector
+   */
+
+  public UnionVector promoteToUnion2() {
+UnionVector unionVector = createUnion();
+
+// Replace the current vector, clearing its data. (This is the
+// old behavior.
+
+setChildVector(unionVector);
+return unionVector;
+  }
+
+  /**
+   * Promote to a union, preserving the existing data vector as a member of
+   * the new union. Back-fill the types vector with the proper type value
+   * for existing rows.
+   *
+   * @return the new union vector
+   */
+
+  public UnionVector convertToUnion(int allocValueCount, int valueCount) {
+assert allocValueCount >= valueCount;
+UnionVector unionVector = createUnion();
+unionVector.allocateNew(allocValueCount);
+
+// Preserve the current vector (and its data) if it is other than
+// the default. (New behavior used by column writers.)
+
+if (! isEmptyType()) {
+  unionVector.addType(vector);
+  int prevType = vector.getField().getType().getMinorType().getNumber();
+  UInt1Vector.Mutator typeMutator = 
unionVector.getTypeVector().getMutator();
+
+  // If the previous vector was nullable, then promote the nullable state
+  // to the type vector by setting either the null marker or the type
+  // marker depending on the original nullable values.
+
+  if (vector instanceof NullableVector) {
+UInt1Vector.Accessor bitsAccessor =
+((UInt1Vector) ((NullableVector) 
vector).getBitsVector()).getAccessor();
+for (int i = 0; i < valueCount; i++) {
+  typeMutator.setSafe(i, (bitsAccessor.get(i) == 0)
+  ? UnionVector.NULL_MARKER
+  : prevType);
+}
+  } else {
+
+// The value is not nullable. (Perhaps it is a map.)
+// Note that the original design of lists have a flaw: if the sole 
member
+  

[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579088#comment-16579088
 ] 

ASF GitHub Bot commented on DRILL-6676:
---

paul-rogers commented on issue #1429: DRILL-6676: Add Union, List and Repeated 
List types to Result Set Loader
URL: https://github.com/apache/drill/pull/1429#issuecomment-412715369
 
 
   @ilooner, thanks much for taking a look at this PR. Agree it would be great 
for @ppadma to review this as she's been following allow with earlier work in 
this series.
   
   The code here can be overwhelming if you dive right in. The best way to 
start is to look at the tests: to see how the code is intended to be used. The 
API is intended to be simple; the large bulk of code is needed to create that 
simple API by hiding the highly complex stuff that must be done to keep 
multiple vectors in sync -- with ever increasing complexity as you move from 
fixed-width to nullable to variable width to arrays to maps to unions to lists.
   
   Rebased on latest master. Fixed check style issues from latest rules.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Union, List and Repeated List types to Result Set Loader
> 
>
> Key: DRILL-6676
> URL: https://issues.apache.org/jira/browse/DRILL-6676
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.15.0
>
>
> Add support for the "obscure" vector types to the {{ResultSetLoader}}:
> * Union
> * List
> * Repeated List



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6683) move getSelectionVector2 and getSelectionVector4 from VectorAccessible interface to RecordBatch interface

2018-08-13 Thread Paul Rogers (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579069#comment-16579069
 ] 

Paul Rogers commented on DRILL-6683:


While this seems a good idea; it does get to the core of the design of the 
{{VectorContainer}} vs. {{RecordBatch}} abstractions.

Despite its name, {{RecordBatch}} is an *operator*, not a batch of data. A 
{{RecordBatch}} (operator) has an associated output batch of data (a record 
batch but not a {{RecordBatch}}) represented by a {{VectorContainer}}. Metadata 
for that container is described by {{BatchSchema}}, which is stored in the 
{{VectorContainer}}. Since a full record batch is defined by a set of vectors 
*and* it associated selection vector, it seems odd to disassociate them.

Rather than remove the methods from {{VectorContainer}}, a better longer-term 
change would be to move the selection vector into the {{VectorContainer}}. 
Today, it is an odd add-on maintained by the operator, (the so-called 
{{RecordBatch}}), not the record batch (the so-called {{VectorContainer}}.)

As you've seen in the {{RowSet}} classes, a {{RowSet}} is the logical 
equivalent of (actually a wrapper for) both a {{VectorContainer}} and a 
selection vector.

Also, the newer stuff to come that builds on the result set loader splits the 
operator interface into three responsibilities:

* Operator
* Outgoing batch
* Iterator protocol driver

In this world, a {{RowSet}} (or the result set loader equivalent for reading) 
would represent the outgoing batch, the operator handle the work of 
transforming batches.

So, long comment, because the design in this area needs work (which this bug 
suggests), but the fixes are subtle.

> move getSelectionVector2 and getSelectionVector4 from VectorAccessible 
> interface to RecordBatch interface
> -
>
> Key: DRILL-6683
> URL: https://issues.apache.org/jira/browse/DRILL-6683
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Timothy Farkas
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579066#comment-16579066
 ] 

ASF GitHub Bot commented on DRILL-6676:
---

ppadma commented on issue #1429: DRILL-6676: Add Union, List and Repeated List 
types to Result Set Loader
URL: https://github.com/apache/drill/pull/1429#issuecomment-412711621
 
 
   @paul-rogers @ilooner I can review this PR. It will take some time. I will 
post comments as I make progress.  meanwhile, @ilooner, can you run functional 
tests ? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Union, List and Repeated List types to Result Set Loader
> 
>
> Key: DRILL-6676
> URL: https://issues.apache.org/jira/browse/DRILL-6676
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.15.0
>
>
> Add support for the "obscure" vector types to the {{ResultSetLoader}}:
> * Union
> * List
> * Repeated List



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579045#comment-16579045
 ] 

ASF GitHub Bot commented on DRILL-6676:
---

ilooner commented on issue #1429: DRILL-6676: Add Union, List and Repeated List 
types to Result Set Loader
URL: https://github.com/apache/drill/pull/1429#issuecomment-412703438
 
 
   @paul-rogers I added some more checkstyle checks recently, and they fail 
when the PR is rebased onto the latest master. Could you rebase and fix the few 
checkstyle errors?


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Union, List and Repeated List types to Result Set Loader
> 
>
> Key: DRILL-6676
> URL: https://issues.apache.org/jira/browse/DRILL-6676
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.15.0
>
>
> Add support for the "obscure" vector types to the {{ResultSetLoader}}:
> * Union
> * List
> * Repeated List



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579041#comment-16579041
 ] 

ASF GitHub Bot commented on DRILL-6676:
---

ilooner commented on a change in pull request #1429: DRILL-6676: Add Union, 
List and Repeated List types to Result Set Loader
URL: https://github.com/apache/drill/pull/1429#discussion_r209791763
 
 

 ##
 File path: 
exec/vector/src/main/java/org/apache/drill/exec/vector/complex/ListVector.java
 ##
 @@ -250,22 +270,140 @@ public void load(UserBitShared.SerializedField 
metadata, DrillBuf buffer) {
 bits.load(bitMetadata, buffer.slice(offsetLength, bitLength));
 
 final UserBitShared.SerializedField vectorMetadata = metadata.getChild(2);
-if (getDataVector() == DEFAULT_DATA_VECTOR) {
+if (isEmptyType()) {
   addOrGetVector(VectorDescriptor.create(vectorMetadata.getMajorType()));
 }
 
 final int vectorLength = vectorMetadata.getBufferLength();
 vector.load(vectorMetadata, buffer.slice(offsetLength + bitLength, 
vectorLength));
   }
 
+  public boolean isEmptyType() {
+return getDataVector() == DEFAULT_DATA_VECTOR;
+  }
+
+  @Override
+  public void setChildVector(ValueVector childVector) {
+
+// Unlike the repeated list vector, the (plain) list vector
+// adds the dummy vector as a child type.
+
+assert field.getChildren().size() == 1;
+assert field.getChildren().iterator().next().getType().getMinorType() == 
MinorType.LATE;
+field.removeChild(vector.getField());
+
+super.setChildVector(childVector);
+
+// Initial LATE type vector not added as a subtype initially.
+// So, no need to remove it, just add the new subtype. Since the
+// MajorType is immutable, must build a new one and replace the type
+// in the materialized field. (We replace the type, rather than creating
+// a new materialized field, to preserve the link to this field from
+// a parent map, list or union.)
+
+assert field.getType().getSubTypeCount() == 0;
+field.replaceType(
+field.getType().toBuilder()
+  .addSubType(childVector.getField().getType().getMinorType())
+  .build());
+  }
+
+  /**
+   * Promote the list to a union. Called from old-style writers. This 
implementation
+   * relies on the caller to set the types vector for any existing values.
+   * This method simply clears the existing vector.
+   *
+   * @return the new union vector
+   */
+
   public UnionVector promoteToUnion() {
-MaterializedField newField = 
MaterializedField.create(getField().getName(), Types.optional(MinorType.UNION));
-UnionVector vector = new UnionVector(newField, allocator, null);
+UnionVector vector = createUnion();
+
+// Replace the current vector, clearing its data. (This is the
+// old behavior.
+
 replaceDataVector(vector);
 reader = new UnionListReader(this);
 return vector;
   }
 
+  /**
+   * Revised form of promote to union that correctly fixes up the list
+   * field metadata to match the new union type.
+   *
+   * @return the new union vector
+   */
+
+  public UnionVector promoteToUnion2() {
+UnionVector unionVector = createUnion();
+
+// Replace the current vector, clearing its data. (This is the
+// old behavior.
+
+setChildVector(unionVector);
+return unionVector;
+  }
+
+  /**
+   * Promote to a union, preserving the existing data vector as a member of
+   * the new union. Back-fill the types vector with the proper type value
+   * for existing rows.
+   *
+   * @return the new union vector
+   */
+
+  public UnionVector convertToUnion(int allocValueCount, int valueCount) {
+assert allocValueCount >= valueCount;
+UnionVector unionVector = createUnion();
+unionVector.allocateNew(allocValueCount);
+
+// Preserve the current vector (and its data) if it is other than
+// the default. (New behavior used by column writers.)
+
+if (! isEmptyType()) {
+  unionVector.addType(vector);
+  int prevType = vector.getField().getType().getMinorType().getNumber();
+  UInt1Vector.Mutator typeMutator = 
unionVector.getTypeVector().getMutator();
+
+  // If the previous vector was nullable, then promote the nullable state
+  // to the type vector by setting either the null marker or the type
+  // marker depending on the original nullable values.
+
+  if (vector instanceof NullableVector) {
+UInt1Vector.Accessor bitsAccessor =
+((UInt1Vector) ((NullableVector) 
vector).getBitsVector()).getAccessor();
+for (int i = 0; i < valueCount; i++) {
+  typeMutator.setSafe(i, (bitsAccessor.get(i) == 0)
+  ? UnionVector.NULL_MARKER
+  : prevType);
+}
+  } else {
+
+// The value is not nullable. (Perhaps it is a map.)
+// Note that the original design of lists have a flaw: if the sole 
member
+  

[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579040#comment-16579040
 ] 

ASF GitHub Bot commented on DRILL-6676:
---

ilooner commented on a change in pull request #1429: DRILL-6676: Add Union, 
List and Repeated List types to Result Set Loader
URL: https://github.com/apache/drill/pull/1429#discussion_r209791651
 
 

 ##
 File path: 
exec/vector/src/main/java/org/apache/drill/exec/vector/complex/ListVector.java
 ##
 @@ -250,22 +270,140 @@ public void load(UserBitShared.SerializedField 
metadata, DrillBuf buffer) {
 bits.load(bitMetadata, buffer.slice(offsetLength, bitLength));
 
 final UserBitShared.SerializedField vectorMetadata = metadata.getChild(2);
-if (getDataVector() == DEFAULT_DATA_VECTOR) {
+if (isEmptyType()) {
   addOrGetVector(VectorDescriptor.create(vectorMetadata.getMajorType()));
 }
 
 final int vectorLength = vectorMetadata.getBufferLength();
 vector.load(vectorMetadata, buffer.slice(offsetLength + bitLength, 
vectorLength));
   }
 
+  public boolean isEmptyType() {
+return getDataVector() == DEFAULT_DATA_VECTOR;
+  }
+
+  @Override
+  public void setChildVector(ValueVector childVector) {
+
+// Unlike the repeated list vector, the (plain) list vector
+// adds the dummy vector as a child type.
+
+assert field.getChildren().size() == 1;
+assert field.getChildren().iterator().next().getType().getMinorType() == 
MinorType.LATE;
+field.removeChild(vector.getField());
+
+super.setChildVector(childVector);
+
+// Initial LATE type vector not added as a subtype initially.
+// So, no need to remove it, just add the new subtype. Since the
+// MajorType is immutable, must build a new one and replace the type
+// in the materialized field. (We replace the type, rather than creating
+// a new materialized field, to preserve the link to this field from
+// a parent map, list or union.)
+
+assert field.getType().getSubTypeCount() == 0;
+field.replaceType(
+field.getType().toBuilder()
+  .addSubType(childVector.getField().getType().getMinorType())
+  .build());
+  }
+
+  /**
+   * Promote the list to a union. Called from old-style writers. This 
implementation
+   * relies on the caller to set the types vector for any existing values.
+   * This method simply clears the existing vector.
+   *
+   * @return the new union vector
+   */
+
   public UnionVector promoteToUnion() {
-MaterializedField newField = 
MaterializedField.create(getField().getName(), Types.optional(MinorType.UNION));
-UnionVector vector = new UnionVector(newField, allocator, null);
+UnionVector vector = createUnion();
+
+// Replace the current vector, clearing its data. (This is the
+// old behavior.
+
 replaceDataVector(vector);
 reader = new UnionListReader(this);
 return vector;
   }
 
+  /**
+   * Revised form of promote to union that correctly fixes up the list
+   * field metadata to match the new union type.
+   *
+   * @return the new union vector
+   */
+
+  public UnionVector promoteToUnion2() {
+UnionVector unionVector = createUnion();
+
+// Replace the current vector, clearing its data. (This is the
+// old behavior.
+
+setChildVector(unionVector);
+return unionVector;
+  }
+
+  /**
+   * Promote to a union, preserving the existing data vector as a member of
+   * the new union. Back-fill the types vector with the proper type value
+   * for existing rows.
+   *
+   * @return the new union vector
+   */
+
+  public UnionVector convertToUnion(int allocValueCount, int valueCount) {
+assert allocValueCount >= valueCount;
+UnionVector unionVector = createUnion();
+unionVector.allocateNew(allocValueCount);
+
+// Preserve the current vector (and its data) if it is other than
+// the default. (New behavior used by column writers.)
+
+if (! isEmptyType()) {
+  unionVector.addType(vector);
+  int prevType = vector.getField().getType().getMinorType().getNumber();
+  UInt1Vector.Mutator typeMutator = 
unionVector.getTypeVector().getMutator();
+
+  // If the previous vector was nullable, then promote the nullable state
+  // to the type vector by setting either the null marker or the type
+  // marker depending on the original nullable values.
+
+  if (vector instanceof NullableVector) {
+UInt1Vector.Accessor bitsAccessor =
+((UInt1Vector) ((NullableVector) 
vector).getBitsVector()).getAccessor();
+for (int i = 0; i < valueCount; i++) {
+  typeMutator.setSafe(i, (bitsAccessor.get(i) == 0)
+  ? UnionVector.NULL_MARKER
+  : prevType);
+}
+  } else {
+
+// The value is not nullable. (Perhaps it is a map.)
+// Note that the original design of lists have a flaw: if the sole 
member
+  

[jira] [Updated] (DRILL-6685) Error in parquet record reader

2018-08-13 Thread Robert Hou (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Hou updated DRILL-6685:
--
Attachment: drillbit.log.6685

> Error in parquet record reader
> --
>
> Key: DRILL-6685
> URL: https://issues.apache.org/jira/browse/DRILL-6685
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.14.0
>Reporter: Robert Hou
>Assignee: salim achouche
>Priority: Major
> Fix For: 1.15.0
>
> Attachments: drillbit.log.6685
>
>
> This is the query:
> select VarbinaryValue1 from 
> dfs.`/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB.parquet` limit 
> 36;
> It appears to be caused by this commit:
> DRILL-6570: Fixed IndexOutofBoundException in Parquet Reader
> aee899c1b26ebb9a5781d280d5a73b42c273d4d5
> This is the stack trace:
> {noformat}
> Error: INTERNAL_ERROR ERROR: Error in parquet record reader.
> Message: 
> Hadoop path: 
> /drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB.parquet/0_0_0.parquet
> Total records read: 0
> Row group index: 0
> Records in row group: 1250
> Parquet Metadata: ParquetMetaData{FileMetaData{schema: message root {
>   optional int64 Index;
>   optional binary VarbinaryValue1;
>   optional int64 BigIntValue;
>   optional boolean BooleanValue;
>   optional int32 DateValue (DATE);
>   optional float FloatValue;
>   optional binary VarcharValue1 (UTF8);
>   optional double DoubleValue;
>   optional int32 IntegerValue;
>   optional int32 TimeValue (TIME_MILLIS);
>   optional int64 TimestampValue (TIMESTAMP_MILLIS);
>   optional binary VarbinaryValue2;
>   optional fixed_len_byte_array(12) IntervalYearValue (INTERVAL);
>   optional fixed_len_byte_array(12) IntervalDayValue (INTERVAL);
>   optional fixed_len_byte_array(12) IntervalSecondValue (INTERVAL);
>   optional binary VarcharValue2 (UTF8);
> }
> , metadata: {drill-writer.version=2, drill.version=1.14.0-SNAPSHOT}}, blocks: 
> [BlockMetaData{1250, 23750308 [ColumnMetaData{UNCOMPRESSED [Index] optional 
> int64 Index  [PLAIN, RLE, BIT_PACKED], 4}, ColumnMetaData{UNCOMPRESSED 
> [VarbinaryValue1] optional binary VarbinaryValue1  [PLAIN, RLE, BIT_PACKED], 
> 10057}, ColumnMetaData{UNCOMPRESSED [BigIntValue] optional int64 BigIntValue  
> [PLAIN, RLE, BIT_PACKED], 8174655}, ColumnMetaData{UNCOMPRESSED 
> [BooleanValue] optional boolean BooleanValue  [PLAIN, RLE, BIT_PACKED], 
> 8179722}, ColumnMetaData{UNCOMPRESSED [DateValue] optional int32 DateValue 
> (DATE)  [PLAIN, RLE, BIT_PACKED], 8179916}, ColumnMetaData{UNCOMPRESSED 
> [FloatValue] optional float FloatValue  [PLAIN, RLE, BIT_PACKED], 8184959}, 
> ColumnMetaData{UNCOMPRESSED [VarcharValue1] optional binary VarcharValue1 
> (UTF8)  [PLAIN, RLE, BIT_PACKED], 8190002}, ColumnMetaData{UNCOMPRESSED 
> [DoubleValue] optional double DoubleValue  [PLAIN, RLE, BIT_PACKED], 
> 10230058}, ColumnMetaData{UNCOMPRESSED [IntegerValue] optional int32 
> IntegerValue  [PLAIN, RLE, BIT_PACKED], 10240111}, 
> ColumnMetaData{UNCOMPRESSED [TimeValue] optional int32 TimeValue 
> (TIME_MILLIS)  [PLAIN, RLE, BIT_PACKED], 10245154}, 
> ColumnMetaData{UNCOMPRESSED [TimestampValue] optional int64 TimestampValue 
> (TIMESTAMP_MILLIS)  [PLAIN, RLE, BIT_PACKED], 10250197}, 
> ColumnMetaData{UNCOMPRESSED [VarbinaryValue2] optional binary VarbinaryValue2 
>  [PLAIN, RLE, BIT_PACKED], 10260250}, ColumnMetaData{UNCOMPRESSED 
> [IntervalYearValue] optional fixed_len_byte_array(12) IntervalYearValue 
> (INTERVAL)  [PLAIN, RLE, BIT_PACKED], 19632385}, ColumnMetaData{UNCOMPRESSED 
> [IntervalDayValue] optional fixed_len_byte_array(12) IntervalDayValue 
> (INTERVAL)  [PLAIN, RLE, BIT_PACKED], 19647446}, ColumnMetaData{UNCOMPRESSED 
> [IntervalSecondValue] optional fixed_len_byte_array(12) IntervalSecondValue 
> (INTERVAL)  [PLAIN, RLE, BIT_PACKED], 19662507}, ColumnMetaData{UNCOMPRESSED 
> [VarcharValue2] optional binary VarcharValue2 (UTF8)  [PLAIN, RLE, 
> BIT_PACKED], 19677568}]}]}
> Fragment 0:0
> [Error Id: 25852cdb-3217-4041-9743-66e9f3a2fbe4 on qa-node186.qa.lab:31010] 
> (state=,code=0)
> {noformat}
> Table can be found in 10.10.100.186:/tmp/fourvarchar_asc_nulls_16MB.parquet
> sys.version is:
> 1.15.0-SNAPSHOT a05f17d6fcd80f0d21260d3b1074ab895f457bacChanged 
> PROJECT_OUTPUT_BATCH_SIZE to System + Session   30.07.2018 @ 17:12:53 PDT 
>   r...@mapr.com   30.07.2018 @ 17:25:21 PDT^M
> fourvarchar_asc_nulls70.q



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6685) Error in parquet record reader

2018-08-13 Thread Robert Hou (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Hou updated DRILL-6685:
--
Description: 
This is the query:
select VarbinaryValue1 from 
dfs.`/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB.parquet` limit 36;

It appears to be caused by this commit:
DRILL-6570: Fixed IndexOutofBoundException in Parquet Reader
aee899c1b26ebb9a5781d280d5a73b42c273d4d5

This is the stack trace:
{noformat}
Error: INTERNAL_ERROR ERROR: Error in parquet record reader.
Message: 
Hadoop path: 
/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB.parquet/0_0_0.parquet
Total records read: 0
Row group index: 0
Records in row group: 1250
Parquet Metadata: ParquetMetaData{FileMetaData{schema: message root {
  optional int64 Index;
  optional binary VarbinaryValue1;
  optional int64 BigIntValue;
  optional boolean BooleanValue;
  optional int32 DateValue (DATE);
  optional float FloatValue;
  optional binary VarcharValue1 (UTF8);
  optional double DoubleValue;
  optional int32 IntegerValue;
  optional int32 TimeValue (TIME_MILLIS);
  optional int64 TimestampValue (TIMESTAMP_MILLIS);
  optional binary VarbinaryValue2;
  optional fixed_len_byte_array(12) IntervalYearValue (INTERVAL);
  optional fixed_len_byte_array(12) IntervalDayValue (INTERVAL);
  optional fixed_len_byte_array(12) IntervalSecondValue (INTERVAL);
  optional binary VarcharValue2 (UTF8);
}
, metadata: {drill-writer.version=2, drill.version=1.14.0-SNAPSHOT}}, blocks: 
[BlockMetaData{1250, 23750308 [ColumnMetaData{UNCOMPRESSED [Index] optional 
int64 Index  [PLAIN, RLE, BIT_PACKED], 4}, ColumnMetaData{UNCOMPRESSED 
[VarbinaryValue1] optional binary VarbinaryValue1  [PLAIN, RLE, BIT_PACKED], 
10057}, ColumnMetaData{UNCOMPRESSED [BigIntValue] optional int64 BigIntValue  
[PLAIN, RLE, BIT_PACKED], 8174655}, ColumnMetaData{UNCOMPRESSED [BooleanValue] 
optional boolean BooleanValue  [PLAIN, RLE, BIT_PACKED], 8179722}, 
ColumnMetaData{UNCOMPRESSED [DateValue] optional int32 DateValue (DATE)  
[PLAIN, RLE, BIT_PACKED], 8179916}, ColumnMetaData{UNCOMPRESSED [FloatValue] 
optional float FloatValue  [PLAIN, RLE, BIT_PACKED], 8184959}, 
ColumnMetaData{UNCOMPRESSED [VarcharValue1] optional binary VarcharValue1 
(UTF8)  [PLAIN, RLE, BIT_PACKED], 8190002}, ColumnMetaData{UNCOMPRESSED 
[DoubleValue] optional double DoubleValue  [PLAIN, RLE, BIT_PACKED], 10230058}, 
ColumnMetaData{UNCOMPRESSED [IntegerValue] optional int32 IntegerValue  [PLAIN, 
RLE, BIT_PACKED], 10240111}, ColumnMetaData{UNCOMPRESSED [TimeValue] optional 
int32 TimeValue (TIME_MILLIS)  [PLAIN, RLE, BIT_PACKED], 10245154}, 
ColumnMetaData{UNCOMPRESSED [TimestampValue] optional int64 TimestampValue 
(TIMESTAMP_MILLIS)  [PLAIN, RLE, BIT_PACKED], 10250197}, 
ColumnMetaData{UNCOMPRESSED [VarbinaryValue2] optional binary VarbinaryValue2  
[PLAIN, RLE, BIT_PACKED], 10260250}, ColumnMetaData{UNCOMPRESSED 
[IntervalYearValue] optional fixed_len_byte_array(12) IntervalYearValue 
(INTERVAL)  [PLAIN, RLE, BIT_PACKED], 19632385}, ColumnMetaData{UNCOMPRESSED 
[IntervalDayValue] optional fixed_len_byte_array(12) IntervalDayValue 
(INTERVAL)  [PLAIN, RLE, BIT_PACKED], 19647446}, ColumnMetaData{UNCOMPRESSED 
[IntervalSecondValue] optional fixed_len_byte_array(12) IntervalSecondValue 
(INTERVAL)  [PLAIN, RLE, BIT_PACKED], 19662507}, ColumnMetaData{UNCOMPRESSED 
[VarcharValue2] optional binary VarcharValue2 (UTF8)  [PLAIN, RLE, BIT_PACKED], 
19677568}]}]}

Fragment 0:0

[Error Id: 25852cdb-3217-4041-9743-66e9f3a2fbe4 on qa-node186.qa.lab:31010] 
(state=,code=0)
{noformat}

Table can be found in 10.10.100.186:/tmp/fourvarchar_asc_nulls_16MB.parquet

sys.version is:
1.15.0-SNAPSHOT a05f17d6fcd80f0d21260d3b1074ab895f457bacChanged 
PROJECT_OUTPUT_BATCH_SIZE to System + Session   30.07.2018 @ 17:12:53 PDT   
r...@mapr.com   30.07.2018 @ 17:25:21 PDT^M


fourvarchar_asc_nulls70.q


  was:
This is the query:
alter session set `drill.exec.memory.operator.project.output_batch_size` = 
131072;
alter session set `planner.width.max_per_node` = 1;
alter session set `planner.width.max_per_query` = 1;
select * from (
select BigIntValue, BooleanValue, DateValue, FloatValue, DoubleValue, 
IntegerValue, TimeValue, TimestampValue, IntervalYearValue, IntervalDayValue, 
IntervalSecondValue, VarbinaryValue1, VarcharValue1, VarbinaryValue2, 
VarcharValue2 from (select * from 
dfs.`/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet` order 
by IntegerValue)) d where d.VarcharValue1 = 'Fq';

It appears to be caused by this commit:
DRILL-6570: Fixed IndexOutofBoundException in Parquet Reader
aee899c1b26ebb9a5781d280d5a73b42c273d4d5

This is the stack trace:
{noformat}
oadd.org.apache.drill.common.exceptions.UserRemoteException: INTERNAL_ERROR 
ERROR: Error in parquet record reader.^M
Message: ^M
Hadoop path: 
/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet/0_0_0.parquet^M
Total records read:

[jira] [Updated] (DRILL-6685) Error in parquet record reader

2018-08-13 Thread Robert Hou (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6685?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Hou updated DRILL-6685:
--
Description: 
This is the query:
alter session set `drill.exec.memory.operator.project.output_batch_size` = 
131072;
alter session set `planner.width.max_per_node` = 1;
alter session set `planner.width.max_per_query` = 1;
select * from (
select BigIntValue, BooleanValue, DateValue, FloatValue, DoubleValue, 
IntegerValue, TimeValue, TimestampValue, IntervalYearValue, IntervalDayValue, 
IntervalSecondValue, VarbinaryValue1, VarcharValue1, VarbinaryValue2, 
VarcharValue2 from (select * from 
dfs.`/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet` order 
by IntegerValue)) d where d.VarcharValue1 = 'Fq';

It appears to be caused by this commit:
DRILL-6570: Fixed IndexOutofBoundException in Parquet Reader
aee899c1b26ebb9a5781d280d5a73b42c273d4d5

This is the stack trace:
{noformat}
oadd.org.apache.drill.common.exceptions.UserRemoteException: INTERNAL_ERROR 
ERROR: Error in parquet record reader.^M
Message: ^M
Hadoop path: 
/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet/0_0_0.parquet^M
Total records read: 0^M
Row group index: 0^M
Records in row group: 14565^M
Parquet Metadata: ParquetMetaData{FileMetaData{schema: message root {^M
  optional int64 Index;^M
  optional binary VarbinaryValue1;^M
  optional int64 BigIntValue;^M
  optional boolean BooleanValue;^M
  optional int32 DateValue (DATE);^M
  optional float FloatValue;^M
  optional binary VarcharValue1 (UTF8);^M
  optional double DoubleValue;^M
  optional int32 IntegerValue;^M
  optional int32 TimeValue (TIME_MILLIS);^M
  optional int64 TimestampValue (TIMESTAMP_MILLIS);^M
  optional binary VarbinaryValue2;^M
  optional fixed_len_byte_array(12) IntervalYearValue (INTERVAL);^M
  optional fixed_len_byte_array(12) IntervalDayValue (INTERVAL);^M
  optional fixed_len_byte_array(12) IntervalSecondValue (INTERVAL);^M
  optional binary VarcharValue2 (UTF8);^M
}^M
, metadata: {drill-writer.version=2, drill.version=1.14.0-SNAPSHOT}}, blocks: 
[BlockMetaData{14565, 268477520 [ColumnMetaData{UNCOMPRESSED [Index] optional 
int64 Index  [PLAIN, RLE, BIT_PACKED], 4}, ColumnMetaData{UNCOMPRESSED 
[VarbinaryValue1] optional binary VarbinaryValue1  [PLAIN, RLE, BIT_PACKED], 
116579}, ColumnMetaData{UNCOMPRESSED [BigIntValue] optional int64 BigIntValue  
[PLAIN, RLE, BIT_PACKED], 91098467}, ColumnMetaData{UNCOMPRESSED [BooleanValue] 
optional boolean BooleanValue  [PLAIN, RLE, BIT_PACKED], 91155431}, 
ColumnMetaData{UNCOMPRESSED [DateValue] optional int32 DateValue (DATE)  
[PLAIN, RLE, BIT_PACKED], 91157291}, ColumnMetaData{UNCOMPRESSED [FloatValue] 
optional float FloatValue  [PLAIN, RLE, BIT_PACKED], 91215598}, 
ColumnMetaData{UNCOMPRESSED [VarcharValue1] optional binary VarcharValue1 
(UTF8)  [PLAIN, RLE, BIT_PACKED], 91273905}, ColumnMetaData{UNCOMPRESSED 
[DoubleValue] optional double DoubleValue  [PLAIN, RLE, BIT_PACKED], 
114039039}, ColumnMetaData{UNCOMPRESSED [IntegerValue] optional int32 
IntegerValue  [PLAIN, RLE, BIT_PACKED], 114155614}, ColumnMetaData{UNCOMPRESSED 
[TimeValue] optional int32 TimeValue (TIME_MILLIS)  [PLAIN, RLE, BIT_PACKED], 
114213921}, ColumnMetaData{UNCOMPRESSED [TimestampValue] optional int64 
TimestampValue (TIMESTAMP_MILLIS)  [PLAIN, RLE, BIT_PACKED], 114272228}, 
ColumnMetaData{UNCOMPRESSED [VarbinaryValue2] optional binary VarbinaryValue2  
[PLAIN, RLE, BIT_PACKED], 114388803}, ColumnMetaData{UNCOMPRESSED 
[IntervalYearValue] optional fixed_len_byte_array(12) IntervalYearValue 
(INTERVAL)  [PLAIN, RLE, BIT_PACKED], 222455665}, ColumnMetaData{UNCOMPRESSED 
[IntervalDayValue] optional fixed_len_byte_array(12) IntervalDayValue 
(INTERVAL)  [PLAIN, RLE, BIT_PACKED], 222630508}, ColumnMetaData{UNCOMPRESSED 
[IntervalSecondValue] optional fixed_len_byte_array(12) IntervalSecondValue 
(INTERVAL)  [PLAIN, RLE, BIT_PACKED], 222805351}, ColumnMetaData{UNCOMPRESSED 
[VarcharValue2] optional binary VarcharValue2 (UTF8)  [PLAIN, RLE, BIT_PACKED], 
222980194}]}]}^M
^M
Fragment 0:0^M
^M
[Error Id: c6690ea1-2f28-4fbe-969f-d8b90da488fb on qa-node186.qa.lab:31010]^M
^M
  (org.apache.drill.common.exceptions.DrillRuntimeException) Error in parquet 
record reader.^M
Message: ^M
Hadoop path: 
/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet/0_0_0.parquet^M
Total records read: 0^M
Row group index: 0^M
Records in row group: 14565^M
Parquet Metadata: ParquetMetaData{FileMetaData{schema: message root {^M
  optional int64 Index;^M
  optional binary VarbinaryValue1;^M
  optional int64 BigIntValue;^M
  optional boolean BooleanValue;^M
  optional int32 DateValue (DATE);^M
  optional float FloatValue;^M
  optional binary VarcharValue1 (UTF8);^M
  optional double DoubleValue;^M
  optional int32 IntegerValue;^M
  optional int32 TimeValue (TIME_MILLIS);^M
  optional int64 TimestampValue (TIMESTAMP_MILLIS);^M
  optional binary

[jira] [Created] (DRILL-6685) Error in parquet record reader

2018-08-13 Thread Robert Hou (JIRA)
Robert Hou created DRILL-6685:
-

 Summary: Error in parquet record reader
 Key: DRILL-6685
 URL: https://issues.apache.org/jira/browse/DRILL-6685
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Parquet
Affects Versions: 1.14.0
Reporter: Robert Hou
Assignee: salim achouche
 Fix For: 1.15.0


This is the query:
alter session set `drill.exec.memory.operator.project.output_batch_size` = 
131072;
alter session set `planner.width.max_per_node` = 1;
alter session set `planner.width.max_per_query` = 1;
select * from (
select BigIntValue, BooleanValue, DateValue, FloatValue, DoubleValue, 
IntegerValue, TimeValue, TimestampValue, IntervalYearValue, IntervalDayValue, 
IntervalSecondValue, VarbinaryValue1, VarcharValue1, VarbinaryValue2, 
VarcharValue2 from (select * from 
dfs.`/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet` order 
by IntegerValue)) d where d.VarcharValue1 = 'Fq';

It appears to be caused by this commit:
DRILL-6570: Fixed IndexOutofBoundException in Parquet Reader
aee899c1b26ebb9a5781d280d5a73b42c273d4d5

This is the stack trace:
{noformat}
oadd.org.apache.drill.common.exceptions.UserRemoteException: INTERNAL_ERROR 
ERROR: Error in parquet record reader.^M
Message: ^M
Hadoop path: 
/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet/0_0_0.parquet^M
Total records read: 0^M
Row group index: 0^M
Records in row group: 14565^M
Parquet Metadata: ParquetMetaData{FileMetaData{schema: message root {^M
  optional int64 Index;^M
  optional binary VarbinaryValue1;^M
  optional int64 BigIntValue;^M
  optional boolean BooleanValue;^M
  optional int32 DateValue (DATE);^M
  optional float FloatValue;^M
  optional binary VarcharValue1 (UTF8);^M
  optional double DoubleValue;^M
  optional int32 IntegerValue;^M
  optional int32 TimeValue (TIME_MILLIS);^M
  optional int64 TimestampValue (TIMESTAMP_MILLIS);^M
  optional binary VarbinaryValue2;^M
  optional fixed_len_byte_array(12) IntervalYearValue (INTERVAL);^M
  optional fixed_len_byte_array(12) IntervalDayValue (INTERVAL);^M
  optional fixed_len_byte_array(12) IntervalSecondValue (INTERVAL);^M
  optional binary VarcharValue2 (UTF8);^M
}^M
, metadata: {drill-writer.version=2, drill.version=1.14.0-SNAPSHOT}}, blocks: 
[BlockMetaData{14565, 268477520 [ColumnMetaData{UNCOMPRESSED [Index] optional 
int64 Index  [PLAIN, RLE, BIT_PACKED], 4}, ColumnMetaData{UNCOMPRESSED 
[VarbinaryValue1] optional binary VarbinaryValue1  [PLAIN, RLE, BIT_PACKED], 
116579}, ColumnMetaData{UNCOMPRESSED [BigIntValue] optional int64 BigIntValue  
[PLAIN, RLE, BIT_PACKED], 91098467}, ColumnMetaData{UNCOMPRESSED [BooleanValue] 
optional boolean BooleanValue  [PLAIN, RLE, BIT_PACKED], 91155431}, 
ColumnMetaData{UNCOMPRESSED [DateValue] optional int32 DateValue (DATE)  
[PLAIN, RLE, BIT_PACKED], 91157291}, ColumnMetaData{UNCOMPRESSED [FloatValue] 
optional float FloatValue  [PLAIN, RLE, BIT_PACKED], 91215598}, 
ColumnMetaData{UNCOMPRESSED [VarcharValue1] optional binary VarcharValue1 
(UTF8)  [PLAIN, RLE, BIT_PACKED], 91273905}, ColumnMetaData{UNCOMPRESSED 
[DoubleValue] optional double DoubleValue  [PLAIN, RLE, BIT_PACKED], 
114039039}, ColumnMetaData{UNCOMPRESSED [IntegerValue] optional int32 
IntegerValue  [PLAIN, RLE, BIT_PACKED], 114155614}, ColumnMetaData{UNCOMPRESSED 
[TimeValue] optional int32 TimeValue (TIME_MILLIS)  [PLAIN, RLE, BIT_PACKED], 
114213921}, ColumnMetaData{UNCOMPRESSED [TimestampValue] optional int64 
TimestampValue (TIMESTAMP_MILLIS)  [PLAIN, RLE, BIT_PACKED], 114272228}, 
ColumnMetaData{UNCOMPRESSED [VarbinaryValue2] optional binary VarbinaryValue2  
[PLAIN, RLE, BIT_PACKED], 114388803}, ColumnMetaData{UNCOMPRESSED 
[IntervalYearValue] optional fixed_len_byte_array(12) IntervalYearValue 
(INTERVAL)  [PLAIN, RLE, BIT_PACKED], 222455665}, ColumnMetaData{UNCOMPRESSED 
[IntervalDayValue] optional fixed_len_byte_array(12) IntervalDayValue 
(INTERVAL)  [PLAIN, RLE, BIT_PACKED], 222630508}, ColumnMetaData{UNCOMPRESSED 
[IntervalSecondValue] optional fixed_len_byte_array(12) IntervalSecondValue 
(INTERVAL)  [PLAIN, RLE, BIT_PACKED], 222805351}, ColumnMetaData{UNCOMPRESSED 
[VarcharValue2] optional binary VarcharValue2 (UTF8)  [PLAIN, RLE, BIT_PACKED], 
222980194}]}]}^M
^M
Fragment 0:0^M
^M
[Error Id: c6690ea1-2f28-4fbe-969f-d8b90da488fb on qa-node186.qa.lab:31010]^M
^M
  (org.apache.drill.common.exceptions.DrillRuntimeException) Error in parquet 
record reader.^M
Message: ^M
Hadoop path: 
/drill/testdata/batch_memory/fourvarchar_asc_nulls_16MB_1GB.parquet/0_0_0.parquet^M
Total records read: 0^M
Row group index: 0^M
Records in row group: 14565^M
Parquet Metadata: ParquetMetaData{FileMetaData{schema: message root {^M
  optional int64 Index;^M
  optional binary VarbinaryValue1;^M
  optional int64 BigIntValue;^M
  optional boolean BooleanValue;^M
  optional int32 DateValue (DATE);^M
  optiona

[jira] [Updated] (DRILL-3988) Create a sys.functions table to expose available Drill functions

2018-08-13 Thread Kunal Khatua (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-3988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua updated DRILL-3988:

Fix Version/s: 1.15.0

> Create a sys.functions table to expose available Drill functions
> 
>
> Key: DRILL-3988
> URL: https://issues.apache.org/jira/browse/DRILL-3988
> Project: Apache Drill
>  Issue Type: Sub-task
>  Components: Metadata
>Reporter: Jacques Nadeau
>Priority: Major
>  Labels: newbie
> Fix For: 1.15.0
>
>
> Create a new sys.functions table that returns a list of all available 
> functions.
> Key considerations: 
> - one row per name or one per argument set. I'm inclined to latter so people 
> can use queries to get to data.
> - we need to create a delineation between user functions and internal 
> functions and only show user functions. 'CastInt' isn't something the user 
> should be able to see (or run).
> - should we add a description annotation that could be included in the 
> sys.functions table?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6676) Add Union, List and Repeated List types to Result Set Loader

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579016#comment-16579016
 ] 

ASF GitHub Bot commented on DRILL-6676:
---

ilooner commented on issue #1429: DRILL-6676: Add Union, List and Repeated List 
types to Result Set Loader
URL: https://github.com/apache/drill/pull/1429#issuecomment-412693176
 
 
   Thanks @paul-rogers . I'll run the functional tests and let you know if 
there are any issues. Since your code is always well documented and tested, I 
feel comfortable merging this pretty quickly. So maybe we can time-box the 
review to 2 weeks at most, unless some issue pops up.
   
   @sohami @bitblender since you guys have experience doing low level work with 
the vectors could you help with looking at this PR? The PR is quite big and 
would take me a while to go through by myself.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Union, List and Repeated List types to Result Set Loader
> 
>
> Key: DRILL-6676
> URL: https://issues.apache.org/jira/browse/DRILL-6676
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.15.0
>
>
> Add support for the "obscure" vector types to the {{ResultSetLoader}}:
> * Union
> * List
> * Repeated List



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6211) Optimizations for SelectionVectorRemover

2018-08-13 Thread Kunal Khatua (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua updated DRILL-6211:

Fix Version/s: 1.15.0

> Optimizations for SelectionVectorRemover 
> -
>
> Key: DRILL-6211
> URL: https://issues.apache.org/jira/browse/DRILL-6211
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Execution - Codegen
>Reporter: Kunal Khatua
>Assignee: Karthikeyan Manivannan
>Priority: Major
> Fix For: 1.15.0
>
> Attachments: 255d264c-f55e-b343-0bef-49d3e672d93f.sys.drill, 
> 255d2664-2418-19e0-00ea-2076a06572a2.sys.drill, 
> 255d2682-8481-bed0-fc22-197a75371c04.sys.drill, 
> 255d26ae-2c0b-6cd6-ae71-4ad04c992daf.sys.drill, 
> 255d2880-48a2-d86b-5410-29ce0cd249ed.sys.drill
>
>
> Currently, when a SelectionVectorRemover receives a record batch from an 
> upstream operator (like a Filter), it immediately starts copying over records 
> into a new outgoing batch.
>  It can be worthwhile if the RecordBatch can be enriched with some additional 
> summary statistics about the attached SelectionVector, such as
>  # number of records that need to be removed/copied
>  # total number of records in the record-batch
> The benefit of this would be that in extreme cases, if *all* the records in a 
> batch need to be either truncated or copies, the SelectionVectorRemover can 
> simply drop the record-batch or simply forward it to the next downstream 
> operator.
> While the extreme cases of simply dropping the batch kind of works (because 
> there is no overhead in copying), for cases where the record batch should 
> pass through, the overhead remains (and is actually more than 35% of the 
> time, if you discount for the streaming agg cost within the tests).
> Here are the statistics of having such an optimization
> ||Selectivity||Query Time||%Time used by SVR||Time||Profile||
> |0%|6.996|0.13%|0.0090948|[^255d264c-f55e-b343-0bef-49d3e672d93f.sys.drill]|
> |10%|7.836|7.97%|0.6245292|[^255d2682-8481-bed0-fc22-197a75371c04.sys.drill]|
> |50%|11.225|25.59%|2.8724775|[^255d2664-2418-19e0-00ea-2076a06572a2.sys.drill]|
> |90%|14.966|33.91%|5.0749706|[^255d26ae-2c0b-6cd6-ae71-4ad04c992daf.sys.drill]|
> |100%|19.003|35.73%|6.7897719|[^255d2880-48a2-d86b-5410-29ce0cd249ed.sys.drill]|
> To summarize, the SVR should avoid creating new batches as much as possible.
> A more generic (non-trivial) optimization should take into account the fact 
> that multiple batches emitted can be coalesced, but we don't currently have 
> test metrics for that.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (DRILL-6084) Expose list of Drill functions for consumption by JavaScript libraries

2018-08-13 Thread Kunal Khatua (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua reassigned DRILL-6084:
---

Assignee: Kunal Khatua

> Expose list of Drill functions for consumption by JavaScript libraries
> --
>
> Key: DRILL-6084
> URL: https://issues.apache.org/jira/browse/DRILL-6084
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Web Server
>Reporter: Kunal Khatua
>Assignee: Kunal Khatua
>Priority: Minor
>  Labels: javascript
> Fix For: 1.15.0
>
>
> DRILL-5868 introduces SQL highlighter and the auto-complete feature in the 
> web-UI's query editor.
> For the backend Javascript to highlight the keywords correctly as SQL 
> reserved words or functions, it needs a list of all the Drill functions to be 
> already defined in a source JavaScript ( {{../mode-sql.js}} ) .
> While this works well for standard Drill functions, it means that any new 
> function added to Drill needs to be explicitly hard-coded into the file, 
> rather than be programatically discovered and loaded for the library to 
> highlight.
> It would help if this can be done dynamically without contributors having to 
> explicitly alter the script file to add new function names.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6084) Expose list of Drill functions for consumption by JavaScript libraries

2018-08-13 Thread Kunal Khatua (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6084?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kunal Khatua updated DRILL-6084:

Fix Version/s: (was: Future)
   1.15.0

> Expose list of Drill functions for consumption by JavaScript libraries
> --
>
> Key: DRILL-6084
> URL: https://issues.apache.org/jira/browse/DRILL-6084
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Web Server
>Reporter: Kunal Khatua
>Priority: Minor
>  Labels: javascript
> Fix For: 1.15.0
>
>
> DRILL-5868 introduces SQL highlighter and the auto-complete feature in the 
> web-UI's query editor.
> For the backend Javascript to highlight the keywords correctly as SQL 
> reserved words or functions, it needs a list of all the Drill functions to be 
> already defined in a source JavaScript ( {{../mode-sql.js}} ) .
> While this works well for standard Drill functions, it means that any new 
> function added to Drill needs to be explicitly hard-coded into the file, 
> rather than be programatically discovered and loaded for the library to 
> highlight.
> It would help if this can be done dynamically without contributors having to 
> explicitly alter the script file to add new function names.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6684) Swap sys.options and sys.options_val tables

2018-08-13 Thread Kunal Khatua (JIRA)
Kunal Khatua created DRILL-6684:
---

 Summary: Swap sys.options and sys.options_val tables
 Key: DRILL-6684
 URL: https://issues.apache.org/jira/browse/DRILL-6684
 Project: Apache Drill
  Issue Type: Improvement
  Components: Storage - Information Schema
Affects Versions: 1.14.0
Reporter: Kunal Khatua
 Fix For: 1.15.0


The current sys.options table has a verbose layout, because of which 
sys.options_internal was introduced. The latter will also support descriptions, 
which means it makes sense to have that table as the new `sys.options`. The 
recommendation is to deprecate the old format, so that any dependencies 
continue to make use of it as long as required.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-6683) move getSelectionVector2 and getSelectionVector4 from VectorAccessible interface to RecordBatch interface

2018-08-13 Thread Timothy Farkas (JIRA)
Timothy Farkas created DRILL-6683:
-

 Summary: move getSelectionVector2 and getSelectionVector4 from 
VectorAccessible interface to RecordBatch interface
 Key: DRILL-6683
 URL: https://issues.apache.org/jira/browse/DRILL-6683
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Timothy Farkas






--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6461) Add Basic Data Correctness Unit Tests

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578993#comment-16578993
 ] 

ASF GitHub Bot commented on DRILL-6461:
---

ilooner commented on a change in pull request #1344: DRILL-6461: Added basic 
data correctness tests for hash agg, and improved operator unit testing 
framework.
URL: https://github.com/apache/drill/pull/1344#discussion_r209779765
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java
 ##
 @@ -42,6 +42,8 @@
   private final BufferAllocator allocator;
   protected final List> wrappers = Lists.newArrayList();
   private BatchSchema schema;
+  private SelectionVector2 selectionVector2;
+  private SelectionVector4 selectionVector4;
 
 Review comment:
   Cool made https://issues.apache.org/jira/browse/DRILL-6683 to track that 
change.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Basic Data Correctness Unit Tests
> -
>
> Key: DRILL-6461
> URL: https://issues.apache.org/jira/browse/DRILL-6461
> Project: Apache Drill
>  Issue Type: Sub-task
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
>
> There are no data correctness unit tests for HashAgg. We need to add some.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6461) Add Basic Data Correctness Unit Tests

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578991#comment-16578991
 ] 

ASF GitHub Bot commented on DRILL-6461:
---

ilooner commented on a change in pull request #1344: DRILL-6461: Added basic 
data correctness tests for hash agg, and improved operator unit testing 
framework.
URL: https://github.com/apache/drill/pull/1344#discussion_r209779483
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/test/rowSet/RowSetBatch.java
 ##
 @@ -29,13 +30,28 @@
 import org.apache.drill.exec.record.selection.SelectionVector2;
 import org.apache.drill.exec.record.selection.SelectionVector4;
 
+import java.util.ArrayList;
 import java.util.Iterator;
+import java.util.List;
 
-public class RowSetBatch implements RecordBatch {
-  private final RowSet rowSet;
+/**
+ * A mock operator that returns the provided {@link RowSet}s as batches. 
Currently it's assumed that all the {@link RowSet}s have the same schema.
+ */
+public class RowSetBatch implements CloseableRecordBatch {
 
 Review comment:
   @sohami It's okay, I've already made the move to MockRecordBatch for this 
PR. Just doing a final test runs, and debugging straggling issues.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Basic Data Correctness Unit Tests
> -
>
> Key: DRILL-6461
> URL: https://issues.apache.org/jira/browse/DRILL-6461
> Project: Apache Drill
>  Issue Type: Sub-task
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
>
> There are no data correctness unit tests for HashAgg. We need to add some.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6461) Add Basic Data Correctness Unit Tests

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578988#comment-16578988
 ] 

ASF GitHub Bot commented on DRILL-6461:
---

ilooner commented on a change in pull request #1344: DRILL-6461: Added basic 
data correctness tests for hash agg, and improved operator unit testing 
framework.
URL: https://github.com/apache/drill/pull/1344#discussion_r209779109
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/RecordBatchSizer.java
 ##
 @@ -639,6 +641,8 @@ public ColumnSize getColumn(String name) {
   // columns can be obtained from children of topColumns.
   private Map columnSizes = 
CaseInsensitiveMap.newHashMap();
 
+  private List columnSizesList = new ArrayList<>();
 
 Review comment:
   done.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Basic Data Correctness Unit Tests
> -
>
> Key: DRILL-6461
> URL: https://issues.apache.org/jira/browse/DRILL-6461
> Project: Apache Drill
>  Issue Type: Sub-task
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
>
> There are no data correctness unit tests for HashAgg. We need to add some.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6461) Add Basic Data Correctness Unit Tests

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578595#comment-16578595
 ] 

ASF GitHub Bot commented on DRILL-6461:
---

sohami commented on a change in pull request #1344: DRILL-6461: Added basic 
data correctness tests for hash agg, and improved operator unit testing 
framework.
URL: https://github.com/apache/drill/pull/1344#discussion_r209677503
 
 

 ##
 File path: 
exec/java-exec/src/test/java/org/apache/drill/test/rowSet/RowSetBatch.java
 ##
 @@ -29,13 +30,28 @@
 import org.apache.drill.exec.record.selection.SelectionVector2;
 import org.apache.drill.exec.record.selection.SelectionVector4;
 
+import java.util.ArrayList;
 import java.util.Iterator;
+import java.util.List;
 
-public class RowSetBatch implements RecordBatch {
-  private final RowSet rowSet;
+/**
+ * A mock operator that returns the provided {@link RowSet}s as batches. 
Currently it's assumed that all the {@link RowSet}s have the same schema.
+ */
+public class RowSetBatch implements CloseableRecordBatch {
 
 Review comment:
   If you want you can leave it as is for now. I can take a look on how to 
refactor MockRecordBatch class along with other tests which uses it, which I 
have to do anyways. I was also thinking of making it use RowSets instead of 
container. 
   The reason for creating separate class was to use it specifically for 
testing different IterOutcomes in context of Lateral&Unnest only whereas 
RowSetBatch looked more of a very simple implementation to just hold one 
container. Was not sure if MockRecordBatch will play well with other operators 
in general.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Basic Data Correctness Unit Tests
> -
>
> Key: DRILL-6461
> URL: https://issues.apache.org/jira/browse/DRILL-6461
> Project: Apache Drill
>  Issue Type: Sub-task
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
>
> There are no data correctness unit tests for HashAgg. We need to add some.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6461) Add Basic Data Correctness Unit Tests

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578560#comment-16578560
 ] 

ASF GitHub Bot commented on DRILL-6461:
---

sohami commented on a change in pull request #1344: DRILL-6461: Added basic 
data correctness tests for hash agg, and improved operator unit testing 
framework.
URL: https://github.com/apache/drill/pull/1344#discussion_r209672777
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java
 ##
 @@ -42,6 +42,8 @@
   private final BufferAllocator allocator;
   protected final List> wrappers = Lists.newArrayList();
   private BatchSchema schema;
+  private SelectionVector2 selectionVector2;
+  private SelectionVector4 selectionVector4;
 
 Review comment:
   I agree with that and think it would be a good refactoring to move 
`getSelectionVector2` and `getSelectionVector4` from `VectorAccessible` 
interface to `RecordBatch` interface. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add Basic Data Correctness Unit Tests
> -
>
> Key: DRILL-6461
> URL: https://issues.apache.org/jira/browse/DRILL-6461
> Project: Apache Drill
>  Issue Type: Sub-task
>Reporter: Timothy Farkas
>Assignee: Timothy Farkas
>Priority: Major
>
> There are no data correctness unit tests for HashAgg. We need to add some.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6680) Expose SHOW FILES command into INFORMATION_SCHEMA

2018-08-13 Thread Volodymyr Vysotskyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Volodymyr Vysotskyi updated DRILL-6680:
---
Labels: doc-impacting ready-to-commit  (was: doc-impacting re)

> Expose SHOW FILES command into INFORMATION_SCHEMA
> -
>
> Key: DRILL-6680
> URL: https://issues.apache.org/jira/browse/DRILL-6680
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.14.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.15.0
>
>
> Link to design document - 
> https://docs.google.com/document/d/1UnvATwH4obn1-XsA83xMz3LtylbMu867eBLN9r3eV3k/edit#



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-6680) Expose SHOW FILES command into INFORMATION_SCHEMA

2018-08-13 Thread Volodymyr Vysotskyi (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Volodymyr Vysotskyi updated DRILL-6680:
---
Labels: doc-impacting re  (was: doc-impacting)

> Expose SHOW FILES command into INFORMATION_SCHEMA
> -
>
> Key: DRILL-6680
> URL: https://issues.apache.org/jira/browse/DRILL-6680
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.14.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting, re
> Fix For: 1.15.0
>
>
> Link to design document - 
> https://docs.google.com/document/d/1UnvATwH4obn1-XsA83xMz3LtylbMu867eBLN9r3eV3k/edit#



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6680) Expose SHOW FILES command into INFORMATION_SCHEMA

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578302#comment-16578302
 ] 

ASF GitHub Bot commented on DRILL-6680:
---

arina-ielchiieva commented on issue #1430: DRILL-6680: Expose show files 
command into INFORMATION_SCHEMA
URL: https://github.com/apache/drill/pull/1430#issuecomment-412522612
 
 
   @vvysotskyi thanks for the code review, addressed CR comments.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Expose SHOW FILES command into INFORMATION_SCHEMA
> -
>
> Key: DRILL-6680
> URL: https://issues.apache.org/jira/browse/DRILL-6680
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.14.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.15.0
>
>
> Link to design document - 
> https://docs.google.com/document/d/1UnvATwH4obn1-XsA83xMz3LtylbMu867eBLN9r3eV3k/edit#



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6670) Error in parquet record reader - previously readable file fails to be read in 1.14

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578162#comment-16578162
 ] 

ASF GitHub Bot commented on DRILL-6670:
---

okalinin commented on issue #1428: DRILL-6670: align Parquet TIMESTAMP_MICROS 
logical type handling with earlier versions + minor fixes
URL: https://github.com/apache/drill/pull/1428#issuecomment-412493398
 
 
   Comment addressed, conflicts resolved. PR should be OK for further review.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Error in parquet record reader - previously readable file fails to be read in 
> 1.14
> --
>
> Key: DRILL-6670
> URL: https://issues.apache.org/jira/browse/DRILL-6670
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.14.0
>Reporter: Dave Challis
>Assignee: Oleksandr Kalinin
>Priority: Major
> Fix For: 1.15.0
>
> Attachments: example.parquet
>
>
> Parquet file which was generated by PyArrow was readable in Apache Drill 1.12 
> and 1.13, but fails to be read with 1.14.
> Running the query "SELECT * FROM dfs.`foo.parquet`" results in the following 
> error message from the Drill web query UI:
> {code}
> Query Failed: An Error Occurred
> org.apache.drill.common.exceptions.UserRemoteException: INTERNAL_ERROR ERROR: 
> Error in parquet record reader. Message: Failure in setting up reader Parquet 
> Metadata: ParquetMetaData{FileMetaData{schema: message schema { optional 
> binary name (UTF8); optional binary creation_parameters (UTF8); optional 
> int64 creation_date (TIMESTAMP_MICROS); optional int32 data_version; optional 
> int32 schema_version; } , metadata: {pandas={"index_columns": [], 
> "column_indexes": [], "columns": [{"name": "name", "field_name": "name", 
> "pandas_type": "unicode", "numpy_type": "object", "metadata": null}, {"name": 
> "creation_parameters", "field_name": "creation_parameters", "pandas_type": 
> "unicode", "numpy_type": "object", "metadata": null}, {"name": 
> "creation_date", "field_name": "creation_date", "pandas_type": "datetime", 
> "numpy_type": "datetime64[ns]", "metadata": null}, {"name": "data_version", 
> "field_name": "data_version", "pandas_type": "int32", "numpy_type": "int32", 
> "metadata": null}, {"name": "schema_version", "field_name": "schema_version", 
> "pandas_type": "int32", "numpy_type": "int32", "metadata": null}], 
> "pandas_version": "0.22.0"}}}, blocks: [BlockMetaData{1, 27142 
> [ColumnMetaData{SNAPPY [name] optional binary name (UTF8) [PLAIN, RLE], 4}, 
> ColumnMetaData{SNAPPY [creation_parameters] optional binary 
> creation_parameters (UTF8) [PLAIN, RLE], 252}, ColumnMetaData{SNAPPY 
> [creation_date] optional int64 creation_date (TIMESTAMP_MICROS) [PLAIN, RLE], 
> 46334}, ColumnMetaData{SNAPPY [data_version] optional int32 data_version 
> [PLAIN, RLE], 46478}, ColumnMetaData{SNAPPY [schema_version] optional int32 
> schema_version [PLAIN, RLE], 46593}]}]} Fragment 0:0 [Error Id: 
> bdb2e4d5-5982-4cc6-b95e-244782f827d2 on f9d0456cddd2:31010] 
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6453) TPC-DS query 72 has regressed

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578160#comment-16578160
 ] 

ASF GitHub Bot commented on DRILL-6453:
---

asfgit closed pull request #1408: DRILL-6453: Resolve deadlock when reading 
from build and probe sides simultaneously in HashJoin
URL: https://github.com/apache/drill/pull/1408
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git 
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashPartition.java
 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashPartition.java
index eaccd335527..fbdc4f3b8a1 100644
--- 
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashPartition.java
+++ 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashPartition.java
@@ -17,6 +17,7 @@
  */
 package org.apache.drill.exec.physical.impl.common;
 
+import com.google.common.base.Preconditions;
 import com.google.common.collect.Lists;
 import org.apache.drill.common.exceptions.RetryAfterSpillException;
 import org.apache.drill.common.exceptions.UserException;
@@ -122,6 +123,7 @@
   private List inMemoryBatchStats = 
Lists.newArrayList();
   private long partitionInMemorySize;
   private long numInMemoryRecords;
+  private boolean updatedRecordsPerBatch = false;
 
   public HashPartition(FragmentContext context, BufferAllocator allocator, 
ChainedHashTable baseHashTable,
RecordBatch buildBatch, RecordBatch probeBatch,
@@ -155,6 +157,18 @@ public HashPartition(FragmentContext context, 
BufferAllocator allocator, Chained
 }
   }
 
+  /**
+   * Configure a different temporary batch size when spilling probe batches.
+   * @param newRecordsPerBatch The new temporary batch size to use.
+   */
+  public void updateProbeRecordsPerBatch(int newRecordsPerBatch) {
+Preconditions.checkArgument(newRecordsPerBatch > 0);
+Preconditions.checkState(!updatedRecordsPerBatch); // Only allow updating 
once
+Preconditions.checkState(processingOuter); // We can only update the 
records per batch when probing.
+
+recordsPerBatch = newRecordsPerBatch;
+  }
+
   /**
* Allocate a new vector container for either right or left record batch
* Add an additional special vector for the hash values
diff --git 
a/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/BatchSizePredictor.java
 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/BatchSizePredictor.java
new file mode 100644
index 000..912e4feaf3c
--- /dev/null
+++ 
b/exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/BatchSizePredictor.java
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.join;
+
+import org.apache.drill.exec.record.RecordBatch;
+
+/**
+ * This class predicts the sizes of batches given an input batch.
+ *
+ * Invariants
+ * 
+ *   The {@link BatchSizePredictor} assumes that a {@link RecordBatch} is 
in a state where it can return a valid record count.
+ * 
+ */
+public interface BatchSizePredictor {
+  /**
+   * Gets the batchSize computed in the call to {@link #updateStats()}. 
Returns 0 if {@link #hadDataLastTime()} is false.
+   * @return Gets the batchSize computed in the call to {@link 
#updateStats()}. Returns 0 if {@link #hadDataLastTime()} is false.
+   * @throws IllegalStateException if {@link #updateStats()} was never called.
+   */
+  long getBatchSize();
+
+  /**
+   * Gets the number of records computed in the call to {@link 
#updateStats()}. Returns 0 if {@link #hadDataLastTime()} is false.
+   * @return Gets the number of records computed in the call to {@link 
#updateStats()}. Returns 0 if {@link #hadDataLastTime()} is false.
+   * @throws IllegalStateException if {@link #updateStats()} was never called.
+   */
+  int getNumRecords();
+
+  /**
+   * True if the i

[jira] [Commented] (DRILL-5179) Dropping NULL Columns from putting in parquet files

2018-08-13 Thread mehran (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578123#comment-16578123
 ] 

mehran commented on DRILL-5179:
---

It is convenient to have a  skip_null_columns in parquet.writer.

 

> Dropping NULL Columns from putting in parquet files
> ---
>
> Key: DRILL-5179
> URL: https://issues.apache.org/jira/browse/DRILL-5179
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.9.0
>Reporter: mehran
>Priority: Major
>
> Due to schemaless nature of drill and parquet files. It is suitable that all 
> NULL columns be dropped in "CREATE TABLE" command from json. This will 
> enhance speed of header interpretation. I think this is a big enhancement.
>  This can be an option in configuration parameters.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6433) a process(find) that never finishes slows down apache drill

2018-08-13 Thread mehran (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6433?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578120#comment-16578120
 ] 

mehran commented on DRILL-6433:
---

I found the problem. you have  the following statement in drill-config.sh

 JAVA=`find -L "$JAVA_HOME" -name $JAVA_BIN -type f | head -n 1`

the option of -L is the problem. this command converts to" find -L  / -name 
java -type f" that creates a loop. you should omit " -L" in this statement.

 

> a process(find) that never finishes slows down apache drill
> ---
>
> Key: DRILL-6433
> URL: https://issues.apache.org/jira/browse/DRILL-6433
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: mehran
>Priority: Major
>
> IN version 13 we have a process as follows
> find -L / -name java -type f
> this process is added to system every day. and it will never finishes.
> I have 100 process in my server.
> I did not succeed to set JAVA_HOME  in drill-env.sh so i set it in /etc/bashrc
> these many processes slows down apache drill.
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6680) Expose SHOW FILES command into INFORMATION_SCHEMA

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578082#comment-16578082
 ] 

ASF GitHub Bot commented on DRILL-6680:
---

vvysotskyi commented on a change in pull request #1430: DRILL-6680: Expose show 
files command into INFORMATION_SCHEMA
URL: https://github.com/apache/drill/pull/1430#discussion_r209556134
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/ischema/Records.java
 ##
 @@ -543,4 +550,38 @@ public Schema(String catalog, String name, String owner, 
String type, boolean is
   this.IS_MUTABLE = isMutable ? "YES" : "NO";
 }
   }
-}
+
+  public static class File {
 
 Review comment:
   Could you please add Javadoc similar to Javadocs for other nested classes in 
this class.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Expose SHOW FILES command into INFORMATION_SCHEMA
> -
>
> Key: DRILL-6680
> URL: https://issues.apache.org/jira/browse/DRILL-6680
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.14.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.15.0
>
>
> Link to design document - 
> https://docs.google.com/document/d/1UnvATwH4obn1-XsA83xMz3LtylbMu867eBLN9r3eV3k/edit#



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6680) Expose SHOW FILES command into INFORMATION_SCHEMA

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578080#comment-16578080
 ] 

ASF GitHub Bot commented on DRILL-6680:
---

vvysotskyi commented on a change in pull request #1430: DRILL-6680: Expose show 
files command into INFORMATION_SCHEMA
URL: https://github.com/apache/drill/pull/1430#discussion_r209540538
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/ShowFilesHandler.java
 ##
 @@ -0,0 +1,120 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.planner.sql.handlers;
+
+import org.apache.calcite.schema.SchemaPlus;
+import org.apache.calcite.sql.SqlIdentifier;
+import org.apache.calcite.sql.SqlLiteral;
+import org.apache.calcite.sql.SqlNode;
+import org.apache.calcite.sql.SqlNodeList;
+import org.apache.calcite.sql.SqlSelect;
+import org.apache.calcite.sql.fun.SqlStdOperatorTable;
+import org.apache.calcite.sql.parser.SqlParserPos;
+import org.apache.calcite.util.Util;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.planner.sql.SchemaUtilites;
+import org.apache.drill.exec.planner.sql.parser.DrillParserUtil;
+import org.apache.drill.exec.planner.sql.parser.SqlShowFiles;
+import org.apache.drill.exec.store.AbstractSchema;
+import org.apache.drill.exec.store.dfs.WorkspaceSchemaFactory.WorkspaceSchema;
+import org.apache.drill.exec.store.ischema.InfoSchemaTableType;
+import org.apache.drill.exec.work.foreman.ForemanSetupException;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import static 
org.apache.drill.exec.store.ischema.InfoSchemaConstants.FILES_COL_RELATIVE_PATH;
+import static 
org.apache.drill.exec.store.ischema.InfoSchemaConstants.FILES_COL_SCHEMA_NAME;
+import static 
org.apache.drill.exec.store.ischema.InfoSchemaConstants.IS_SCHEMA_NAME;
+
+
+public class ShowFilesHandler extends DefaultSqlHandler {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(SetOptionHandler.class);
+
+  public ShowFilesHandler(SqlHandlerConfig config) {
+super(config);
+  }
+
+  /** Rewrite the parse tree as SELECT ... FROM INFORMATION_SCHEMA.FILES ... */
+  @Override
+  public SqlNode rewrite(SqlNode sqlNode) throws ForemanSetupException {
+
+List selectList = 
Collections.singletonList(SqlIdentifier.star(SqlParserPos.ZERO));
+
+SqlNode fromClause = new SqlIdentifier(
 
 Review comment:
   The constructor which takes only list of names and `SqlParserPos` may be 
used here.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Expose SHOW FILES command into INFORMATION_SCHEMA
> -
>
> Key: DRILL-6680
> URL: https://issues.apache.org/jira/browse/DRILL-6680
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.14.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: doc-impacting
> Fix For: 1.15.0
>
>
> Link to design document - 
> https://docs.google.com/document/d/1UnvATwH4obn1-XsA83xMz3LtylbMu867eBLN9r3eV3k/edit#



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-6680) Expose SHOW FILES command into INFORMATION_SCHEMA

2018-08-13 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6680?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16578081#comment-16578081
 ] 

ASF GitHub Bot commented on DRILL-6680:
---

vvysotskyi commented on a change in pull request #1430: DRILL-6680: Expose show 
files command into INFORMATION_SCHEMA
URL: https://github.com/apache/drill/pull/1430#discussion_r209544436
 
 

 ##
 File path: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/ShowFilesHandler.java
 ##
 @@ -0,0 +1,120 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.planner.sql.handlers;
+
+import org.apache.calcite.schema.SchemaPlus;
+import org.apache.calcite.sql.SqlIdentifier;
+import org.apache.calcite.sql.SqlLiteral;
+import org.apache.calcite.sql.SqlNode;
+import org.apache.calcite.sql.SqlNodeList;
+import org.apache.calcite.sql.SqlSelect;
+import org.apache.calcite.sql.fun.SqlStdOperatorTable;
+import org.apache.calcite.sql.parser.SqlParserPos;
+import org.apache.calcite.util.Util;
+import org.apache.drill.common.exceptions.UserException;
+import org.apache.drill.exec.planner.sql.SchemaUtilites;
+import org.apache.drill.exec.planner.sql.parser.DrillParserUtil;
+import org.apache.drill.exec.planner.sql.parser.SqlShowFiles;
+import org.apache.drill.exec.store.AbstractSchema;
+import org.apache.drill.exec.store.dfs.WorkspaceSchemaFactory.WorkspaceSchema;
+import org.apache.drill.exec.store.ischema.InfoSchemaTableType;
+import org.apache.drill.exec.work.foreman.ForemanSetupException;
+
+import java.util.Arrays;
+import java.util.Collections;
+import java.util.List;
+
+import static 
org.apache.drill.exec.store.ischema.InfoSchemaConstants.FILES_COL_RELATIVE_PATH;
+import static 
org.apache.drill.exec.store.ischema.InfoSchemaConstants.FILES_COL_SCHEMA_NAME;
+import static 
org.apache.drill.exec.store.ischema.InfoSchemaConstants.IS_SCHEMA_NAME;
+
+
+public class ShowFilesHandler extends DefaultSqlHandler {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(SetOptionHandler.class);
+
+  public ShowFilesHandler(SqlHandlerConfig config) {
+super(config);
+  }
+
+  /** Rewrite the parse tree as SELECT ... FROM INFORMATION_SCHEMA.FILES ... */
+  @Override
+  public SqlNode rewrite(SqlNode sqlNode) throws ForemanSetupException {
+
+List selectList = 
Collections.singletonList(SqlIdentifier.star(SqlParserPos.ZERO));
+
+SqlNode fromClause = new SqlIdentifier(
+Arrays.asList(IS_SCHEMA_NAME, InfoSchemaTableType.FILES.name()), null, 
SqlParserPos.ZERO, null);
+
+SchemaPlus defaultSchema = config.getConverter().getDefaultSchema();
+SchemaPlus drillSchema = defaultSchema;
+
+SqlShowFiles showFiles = unwrap(sqlNode, SqlShowFiles.class);
+SqlIdentifier from = showFiles.getDb();
+boolean addRelativePathLikeClause = false;
+
+// Show files can be used without from clause, in which case we display 
the files in the default schema
+if (from != null) {
+  // We are not sure if the full from clause is just the schema or 
includes table name,
+  // first try to see if the full path specified is a schema
+  drillSchema = SchemaUtilites.findSchema(defaultSchema, from.names);
+  if (drillSchema == null) {
+// Entire from clause is not a schema, try to obtain the schema 
without the last part of the specified clause.
+drillSchema = SchemaUtilites.findSchema(defaultSchema, 
from.names.subList(0, from.names.size() - 1));
+addRelativePathLikeClause = true;
+  }
+
+  if (drillSchema == null) {
+throw UserException.validationError()
+.message("Invalid FROM/IN clause [%s]", from.toString())
+.build(logger);
+  }
+}
+
+WorkspaceSchema wsSchema;
+
+try {
+  AbstractSchema abstractSchema = drillSchema.unwrap(AbstractSchema.class);
+  if (abstractSchema instanceof WorkspaceSchema) {
 
 Review comment:
   `WorkspaceSchema` uses implementation of `getDefaultSchema()` method from 
`AbstractSchema` which returns `this` instance. So I think `if` block may be 
replaced with the statement from `else` branc