[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-03-01 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103806768
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/ContextInformation.java 
---
@@ -29,13 +29,15 @@
   private final long queryStartTime;
   private final int rootFragmentTimeZone;
   private final String sessionId;
+  private final int hllAccuracy;
 
   public ContextInformation(final UserCredentials userCredentials, final 
QueryContextInformation queryContextInfo) {
 this.queryUser = userCredentials.getUserName();
 this.currentDefaultSchema = queryContextInfo.getDefaultSchemaName();
 this.queryStartTime = queryContextInfo.getQueryStartTime();
 this.rootFragmentTimeZone = queryContextInfo.getTimeZone();
 this.sessionId = queryContextInfo.getSessionId();
+this.hllAccuracy = queryContextInfo.getHllAccuracy();
--- End diff --

Perhaps consider using @Inject to inject some new context, such as a StatsContext, into the UDF.
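For illustration, a minimal sketch of what such a dedicated injectable could look like. `StatsContext` and its accessor are hypothetical names echoing the suggestion above, not code from this PR; in Drill, a new injectable type would also need to be exposed through `UdfUtilities` alongside `ContextInformation`.

    // Hypothetical sketch only: a narrow, stats-specific context that a UDF
    // could receive via @Inject instead of widening ContextInformation.
    package org.apache.drill.exec.ops;

    public class StatsContext {
      private final int hllAccuracy;

      public StatsContext(int hllAccuracy) {
        this.hllAccuracy = hllAccuracy;
      }

      /** @return the number of bits the HLL-based NDV functions should use */
      public int getHllAccuracy() {
        return hllAccuracy;
      }
    }

    // Inside a statistics UDF the field would then be declared as:
    //   @Inject StatsContext statsContext;
    // and read in setup() via statsContext.getHllAccuracy().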


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-03-01 Thread amansinha100
Github user amansinha100 commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103796907
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
@@ -390,4 +391,15 @@
 
   String DYNAMIC_UDF_SUPPORT_ENABLED = "exec.udf.enable_dynamic_support";
   BooleanValidator DYNAMIC_UDF_SUPPORT_ENABLED_VALIDATOR = new 
BooleanValidator(DYNAMIC_UDF_SUPPORT_ENABLED, true, true);
+
+  /**
+   * Option whose value is a long value representing the number of bits 
required for computing ndv (using HLL)
+   */
+  LongValidator NDV_MEMORY_LIMIT = new 
PositiveLongValidator("exec.statistics.ndv_memory_limit", 30, 20);
+
+  /**
+   * Option whose value represents the current version of the statistics. 
Decreasing the value will generate
+   * the older version of statistics
+   */
+  LongValidator STATISTICS_VERSION = new 
NonNegativeLongValidator("exec.statistics.capability_version", 1, 1);
--- End diff --

Agree with @paul-rogers on this. 
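For reference, a minimal sketch of how a LongValidator-backed option such as `exec.statistics.ndv_memory_limit` would typically be read at run time; the helper class and call site below are assumptions for illustration, not part of the diff.

    import org.apache.drill.exec.ExecConstants;
    import org.apache.drill.exec.ops.FragmentContext;

    // Sketch: read the validator-backed option through the fragment's option manager.
    class StatsOptionsSketch {
      static long ndvMemoryLimitBits(FragmentContext context) {
        // Assumes the typed OptionManager.getOption(LongValidator) accessor.
        return context.getOptions().getOption(ExecConstants.NDV_MEMORY_LIMIT);
      }
    }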




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-03-01 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103636828
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/ContextInformation.java 
---
@@ -29,13 +29,15 @@
   private final long queryStartTime;
   private final int rootFragmentTimeZone;
   private final String sessionId;
+  private final int hllAccuracy;
 
   public ContextInformation(final UserCredentials userCredentials, final 
QueryContextInformation queryContextInfo) {
 this.queryUser = userCredentials.getUserName();
 this.currentDefaultSchema = queryContextInfo.getDefaultSchemaName();
 this.queryStartTime = queryContextInfo.getQueryStartTime();
 this.rootFragmentTimeZone = queryContextInfo.getTimeZone();
 this.sessionId = queryContextInfo.getSessionId();
+this.hllAccuracy = queryContextInfo.getHllAccuracy();
--- End diff --

This is not required within the `RecordBatch` but within the UDFs. The `ContextInformation` seems to be the existing mechanism for passing information to the UDFs.




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-03-01 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103636512
  
--- Diff: 
protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryContextInformation.java
 ---
@@ -51,6 +51,7 @@ public static QueryContextInformation getDefaultInstance()
 private int timeZone;
 private String defaultSchemaName;
 private String sessionId;
+private int hllAccuracy;
--- End diff --

hllAccuracy is required within the HLL UDFs. The only way I found to pass information into the UDFs is by injecting the `@Inject ContextInformation` available in `UDFUtilities.java`. Hence, I define it in the QueryContext, eventually passing it to the `ContextInformation`. Please suggest if there is a better way to do the same.
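As a concrete illustration of that mechanism, a minimal sketch of a statistics UDF picking up the HLL accuracy via the injected `ContextInformation`. The function name, holders, and the accessor name are illustrative assumptions rather than code from this PR.

    import javax.inject.Inject;

    import org.apache.drill.exec.expr.DrillAggFunc;
    import org.apache.drill.exec.expr.annotations.FunctionTemplate;
    import org.apache.drill.exec.expr.annotations.Output;
    import org.apache.drill.exec.expr.annotations.Param;
    import org.apache.drill.exec.expr.annotations.Workspace;
    import org.apache.drill.exec.expr.holders.BigIntHolder;
    import org.apache.drill.exec.expr.holders.NullableVarBinaryHolder;
    import org.apache.drill.exec.expr.holders.ObjectHolder;
    import org.apache.drill.exec.ops.ContextInformation;

    public class HllUdfSketch {

      @FunctionTemplate(name = "hll_example", scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE)
      public static class HllExample implements DrillAggFunc {
        @Param BigIntHolder in;
        @Workspace ObjectHolder work;            // holds the HyperLogLog sketch
        @Output NullableVarBinaryHolder out;
        @Inject ContextInformation contextInfo;  // the existing injection mechanism referred to above

        public void setup() {
          // Accuracy propagated from QueryContextInformation into ContextInformation;
          // the accessor name is an assumption.
          int hllAccuracy = contextInfo.getHllAccuracy();
          work.obj = new com.clearspring.analytics.stream.cardinality.HyperLogLog(hllAccuracy);
        }

        public void add() { /* update the sketch with the incoming value */ }

        public void output() { /* serialize the sketch into `out` */ }

        public void reset() { /* start a fresh sketch */ }
      }
    }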




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103619932
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AbstractMergedStatistic.java
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.exec.vector.ValueVector;
+
+public abstract class AbstractMergedStatistic extends Statistic implements 
MergedStatistic {
+  @Override
+  public String getName() {
+throw new UnsupportedOperationException("getName() not implemented");
+  }
+
+  @Override
+  public String getInput() {
+throw new UnsupportedOperationException("getInput() not implemented");
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+throw new UnsupportedOperationException("merge() not implemented");
+  }
+
+  @Override
+  public Object getStat(String colName) {
+throw new UnsupportedOperationException("getStat() not implemented");
+  }
+
+  @Override
+  public void setOutput(ValueVector output) {
+throw new UnsupportedOperationException("getOutput() not implemented");
+  }
+
+  @Override
+  public void configure(Object configurations) {
--- End diff --

Changed the types to be specific. This config does not do hash lookups. However, functions like `merge()` and `setOutput()` do hash lookups, which might not be as expensive given the alternative (column mappings for the `MapVector`, which would need to be recomputed for each incoming batch). Also, each `MergedStatistic` and the `StatisticsMergeBatch` contain a single map, and given that neither the incoming batches (one per minor fragment) nor the outgoing batches (one) would be many, we can keep the code simple here.




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103618796
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging 
statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  return (long) (colSumHolder.value/ getRowCount(colName));
+}
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+// Dependencies have been configured correctly
+assert (configureComplete == true);
+MapVector outputMap = (MapVector) output;
+
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  NullableFloat8Vector vv = (NullableFloat8Vector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, (colSumHolder.value / 
getRowCount(colName)));
+}
+mergeComplete = true;
+  }
+
+  @Override
+  public void configure(Object configurations) {
+List statistics = (List) 
configurations;
--- End diff --

Done



[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103617262
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging 
statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  return (long) (colSumHolder.value/ getRowCount(colName));
+}
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+// Dependencies have been configured correctly
+assert (configureComplete == true);
+MapVector outputMap = (MapVector) output;
+
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  NullableFloat8Vector vv = (NullableFloat8Vector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, (colSumHolder.value / 
getRowCount(colName)));
+}
+mergeComplete = true;
+  }
+
+  @Override
+  public void configure(Object configurations) {
+List statistics = (List) 
configurations;
+for (MergedStatistic statistic : statistics) {
+  if (statistic.getName().equals("type")) {
+types = statistic;
+  } else if (statistic.getName().equals("statcount")) {
+statCounts = statistic;
+  } else if 

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103614296
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List keyList = null;
+  private Map dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103614271
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List keyList = null;
+  private Map dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103612078
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), 
cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this 
number is incremented.
+ *"computed" : BIGINT - What time is it computed?
+ *"columns"   : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
--- End diff --

Explained here and in `AnalyzePrule` where all the functions are defined.




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103611152
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
--- End diff --

Refactored out the Javadoc comment since it includes a TODO tag




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103611163
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), 
cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this 
number is incremented.
+ *"computed" : BIGINT - What time is it computed?
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103611021
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java
 ---
@@ -82,11 +82,18 @@
   private boolean specialBatchSent = false;
   private static final int SPECIAL_BATCH_COUNT = 1;
 
-  public StreamingAggBatch(StreamingAggregate popConfig, RecordBatch 
incoming, FragmentContext context) throws OutOfMemoryException {
+  public StreamingAggBatch(StreamingAggregate popConfig, RecordBatch 
incoming, FragmentContext context)
+  throws OutOfMemoryException {
 super(popConfig, context);
 this.incoming = incoming;
   }
 
+  public StreamingAggBatch(StreamingAggregate popConfig, RecordBatch 
incoming, FragmentContext context,
+   final boolean buildSchema) throws 
OutOfMemoryException {
--- End diff --

I am not familiar with the entire inner implementation of 
StreamingAggBatch. @amansinha100 could you please explain why we need another 
buildSchema flag?




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103609111
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/ContextInformation.java 
---
@@ -63,4 +65,11 @@ public long getQueryStartTime() {
   public int getRootFragmentTimeZone() {
 return rootFragmentTimeZone;
   }
+
+  /**
+   * @return HLL accuracy parameter
+   */
+  public int getHllMemoryLimit() {
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103608235
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
@@ -390,4 +391,15 @@
 
   String DYNAMIC_UDF_SUPPORT_ENABLED = "exec.udf.enable_dynamic_support";
   BooleanValidator DYNAMIC_UDF_SUPPORT_ENABLED_VALIDATOR = new 
BooleanValidator(DYNAMIC_UDF_SUPPORT_ENABLED, true, true);
+
+  /**
+   * Option whose value is a long value representing the number of bits 
required for computing ndv (using HLL)
+   */
+  LongValidator NDV_MEMORY_LIMIT = new 
PositiveLongValidator("exec.statistics.ndv_memory_limit", 30, 20);
+
+  /**
+   * Option whose value represents the current version of the statistics. 
Decreasing the value will generate
+   * the older version of statistics
+   */
+  LongValidator STATISTICS_VERSION = new 
NonNegativeLongValidator("exec.statistics.capability_version", 1, 1);
--- End diff --

Also, the version in the option will be changed to the latest whenever the statistics change. However, this option gives users flexibility for the scenarios described in the earlier commit.




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103608069
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java
 ---
@@ -169,4 +175,43 @@ private static boolean containIdentity(List exps,
 }
 return true;
   }
+
+  /**
+   * Returns whether statistics-based estimates or guesses are used by the 
optimizer
+   * */
+  public static boolean guessRows(RelNode rel) {
+final PlannerSettings settings =
+
rel.getCluster().getPlanner().getContext().unwrap(PlannerSettings.class);
+if (!settings.useStatistics()) {
+  return true;
+}
+if (rel instanceof RelSubset) {
--- End diff --

Added comment explaining the special treatment.




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103608020
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java
 ---
@@ -169,4 +182,99 @@ private static boolean containIdentity(List exps,
 }
 return true;
   }
+
+  /**
+   * Returns whether statistics-based estimates or guesses are used by the 
optimizer
+   * */
+  public static boolean guessRows(RelNode rel) {
+final PlannerSettings settings =
+
rel.getCluster().getPlanner().getContext().unwrap(PlannerSettings.class);
+if (!settings.useStatistics()) {
+  return true;
+}
+if (rel instanceof RelSubset) {
+  if (((RelSubset) rel).getBest() != null) {
+return guessRows(((RelSubset) rel).getBest());
+  } else if (((RelSubset) rel).getOriginal() != null) {
+return guessRows(((RelSubset) rel).getOriginal());
+  }
+} else if (rel instanceof HepRelVertex) {
+  if (((HepRelVertex) rel).getCurrentRel() != null) {
+return guessRows(((HepRelVertex) rel).getCurrentRel());
+  }
+} else if (rel instanceof TableScan) {
+  DrillTable table = rel.getTable().unwrap(DrillTable.class);
+  if (table == null) {
+table = 
rel.getTable().unwrap(DrillTranslatableTable.class).getDrillTable();
+  }
+  if (table != null && table.getStatsTable() != null) {
+return false;
+  } else {
+return true;
+  }
+} else {
+  for (RelNode child : rel.getInputs()) {
+if (guessRows(child)) { // at least one child is a guess
+  return true;
+}
+  }
+}
+return false;
+  }
+
+  private static boolean findLikeOrRangePredicate(RexNode predicate) {
+if ((predicate == null) || predicate.isAlwaysTrue()) {
+  return false;
+}
+for (RexNode pred : RelOptUtil.conjunctions(predicate)) {
+  for (RexNode orPred : RelOptUtil.disjunctions(pred)) {
+if (!orPred.isA(SqlKind.EQUALS) ||
+ orPred.isA(SqlKind.LIKE)) {
+  return true;
+}
+  }
+}
+return false;
+  }
+
+  public static boolean analyzeSimpleEquiJoin(Join join, int[] 
joinFieldOrdinals) {
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103607754
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), 
cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this 
number is incremented.
+ *"computed" : BIGINT - What time is it computed?
+ *"columns"   : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
+ *   "region_id"  : BIGINT - statscount(region_id) - aggregation over 
all values of region_id
+ *  in incoming batch
+ *   "sales_city" : BIGINT - statscount(sales_city)
+ *   "cnt": BIGINT - statscount(cnt)
+ *"nonnullstatcount" : MAP
+ *   "region_id"  : BIGINT - nonnullstatcount(region_id)
+ *   "sales_city" : BIGINT - nonnullstatcount(sales_city)
+ *   "cnt": BIGINT - nonnullstatcount(cnt)
+ *    another map for next stats function 
+ */
+public class StatisticsAggBatch extends StreamingAggBatch {
+  private List functions;
+  private int schema = 0;
+
+  public StatisticsAggBatch(StatisticsAggregate popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws OutOfMemoryException {
+super(popConfig, incoming, context);
+this.functions = popConfig.getFunctions();
+  }
+
+  private void createKeyColumn(String name, LogicalExpression expr, 
List keyExprs,
+  List 

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103606988
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
+  /*
+   * Creates the appropriate statistics object given the name of the 
statistics and the input statistic
+   */
+  public static MergedStatistic getMergedStatistic(String outputStatName, 
String inputStatName) {
+if (outputStatName == null || inputStatName == null) {
+  return null;
+} else if (outputStatName.equals(Statistic.COLNAME)) {
+  return new ColumnMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.COLTYPE)) {
+  return new ColTypeMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.STATCOUNT)) {
+  return new StatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NNSTATCOUNT)) {
+  return new NNStatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.AVG_WIDTH)) {
+  return new AvgWidthMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.HLL_MERGE)) {
+  return new HLLMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NDV)) {
+  return new NDVMergedStatistic(outputStatName, inputStatName);
--- End diff --

Done using the factory approach you described above. Thanks so much for the 
suggestion.
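For readers following along, one possible shape of such a registry-driven factory. The exact version committed is not quoted in this thread, so the class below is only a hedged sketch; the statistic-name constants and the (name, inputName) constructors come from the diff above, while the registry and reflective construction are assumptions.

    import java.lang.reflect.Constructor;
    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch: a name -> class registry replacing the if/else chain.
    public class MergedStatisticFactorySketch {
      private static final Map<String, Class<? extends MergedStatistic>> STATS = new HashMap<>();
      static {
        STATS.put(Statistic.COLNAME, ColumnMergedStatistic.class);
        STATS.put(Statistic.COLTYPE, ColTypeMergedStatistic.class);
        STATS.put(Statistic.STATCOUNT, StatCountMergedStatistic.class);
        STATS.put(Statistic.NNSTATCOUNT, NNStatCountMergedStatistic.class);
        STATS.put(Statistic.AVG_WIDTH, AvgWidthMergedStatistic.class);
        STATS.put(Statistic.HLL_MERGE, HLLMergedStatistic.class);
        STATS.put(Statistic.NDV, NDVMergedStatistic.class);
      }

      public static MergedStatistic getMergedStatistic(String outputStatName, String inputStatName) {
        Class<? extends MergedStatistic> clazz =
            outputStatName == null ? null : STATS.get(outputStatName);
        if (clazz == null || inputStatName == null) {
          return null;
        }
        try {
          // Each merged statistic in the diff exposes a public (name, inputName) constructor.
          Constructor<? extends MergedStatistic> ctor = clazz.getConstructor(String.class, String.class);
          return ctor.newInstance(outputStatName, inputStatName);
        } catch (ReflectiveOperationException e) {
          throw new IllegalStateException("Cannot construct statistic " + outputStatName, e);
        }
      }
    }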




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103606914
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/NNStatCountMergedStatistic.java
 ---
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.Map;
+
+public class NNStatCountMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean mergeComplete = false;
+  private Map sumHolder;
+
+  public NNStatCountMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+  }
+
+  @Override
+  public String getName() {
+  return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  BigIntHolder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (BigIntHolder) sumHolder.get(colName);
+  } else {
+colSumHolder = new BigIntHolder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (long) val;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+if (mergeComplete != true) {
+  throw new IllegalStateException(String.format("Statistic `%s` has 
not completed merging statistics",
+  name));
+}
+BigIntHolder colSumHolder = (BigIntHolder) sumHolder.get(colName);
+return colSumHolder.value;
+  }
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector outputMap = (MapVector) output;
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  BigIntHolder holder = (BigIntHolder) sumHolder.get(colName);
+  NullableBigIntVector vv = (NullableBigIntVector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, holder);
+}
+mergeComplete = true;
--- End diff --

The value vector and the results map are all type-specific, so I do not see a 
way to factor this out into a common function. The only common portion is two 
lines: the `for` loop and getting the column name. The same goes for the 
`merge()` function. Hence, leaving it as-is.
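
For reference, a minimal sketch of what the shared piece would amount to if it were 
pulled into a helper (hypothetical, not part of the patch); the holder lookup and the 
mutator call would still need a type-specific callback per statistic, which is the 
trade-off described above:

```
import org.apache.drill.exec.vector.ValueVector;
import org.apache.drill.exec.vector.complex.MapVector;

// Hypothetical sketch only: the shareable part is just the iteration over the
// map vector and the column-name lookup; the per-column work stays type-specific.
abstract class ColumnWalkSketch {

  interface ColumnVisitor {
    void visit(String colName, ValueVector column);
  }

  // The two "common" lines factored into one helper.
  protected void forEachColumn(MapVector map, ColumnVisitor visitor) {
    for (ValueVector vv : map) {
      visitor.visit(vv.getField().getLastName(), vv);
    }
  }
}
```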




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103419914
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), 
cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this 
number is incremented.
+ *"computed" : BIGINT - What time is it computed?
+ *"columns"   : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
--- End diff --

This is the row count for each column.




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103417821
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
+  /*
+   * Creates the appropriate statistics object given the name of the 
statistics and the input statistic
+   */
+  public static MergedStatistic getMergedStatistic(String outputStatName, 
String inputStatName) {
+if (outputStatName == null || inputStatName == null) {
+  return null;
+} else if (outputStatName.equals(Statistic.COLNAME)) {
+  return new ColumnMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.COLTYPE)) {
+  return new ColTypeMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.STATCOUNT)) {
+  return new StatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NNSTATCOUNT)) {
+  return new NNStatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.AVG_WIDTH)) {
+  return new AvgWidthMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.HLL_MERGE)) {
+  return new HLLMergedStatistic(outputStatName, inputStatName);
--- End diff --

Yes, the output name can be hard-coded in the class. I want to avoid 
hard-coding the input name because then, for each new statistic, we would have to 
define/duplicate the mapping in multiple places. Right now the mapping is 
defined only once, in AnalyzePRule.java. Also, there might be a many:1 mapping, 
e.g. if the input implementation for a statistic changes but the interface 
remains the same. I don't see this happening now, but maybe in the future.
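
For illustration, a usage sketch of the many:1 case with the factory signature quoted 
above; the input statistic name "hll" below is just a placeholder, not the real constant:

```
// Hypothetical usage sketch: two merged statistics consuming the same input
// statistic, so the caller supplies the input name instead of each class
// hard-coding it.
MergedStatistic hllMerge = MergedStatisticFactory.getMergedStatistic(Statistic.HLL_MERGE, "hll");
MergedStatistic ndv = MergedStatisticFactory.getMergedStatistic(Statistic.NDV, "hll");
```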




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103411133
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
--- End diff --

GetStat() returns different types based on the Statistic Type - addressed 
in a subsequent comment.




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103410529
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/ColTypeMergedStatistic.java
 ---
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.IntHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.IntVector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.Map;
+
+public class ColTypeMergedStatistic extends AbstractMergedStatistic {
+  private String name;
+  private String inputName;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> typeHolder;
+
+
+  public ColTypeMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+typeHolder = new HashMap<>();
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  if (typeHolder.get(colName) == null) {
+IntHolder colType = new IntHolder();
+((IntVector) vv).getAccessor().get(0, colType);
+typeHolder.put(colName, colType);
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103408205
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
+  /*
+   * Creates the appropriate statistics object given the name of the 
statistics and the input statistic
+   */
+  public static MergedStatistic getMergedStatistic(String outputStatName, 
String inputStatName) {
+if (outputStatName == null || inputStatName == null) {
+  return null;
+} else if (outputStatName.equals(Statistic.COLNAME)) {
+  return new ColumnMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.COLTYPE)) {
+  return new ColTypeMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.STATCOUNT)) {
+  return new StatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NNSTATCOUNT)) {
+  return new NNStatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.AVG_WIDTH)) {
+  return new AvgWidthMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.HLL_MERGE)) {
+  return new HLLMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NDV)) {
+  return new NDVMergedStatistic(outputStatName, inputStatName);
+} else {
+  return null;
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103408160
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/Statistic.java
 ---
@@ -0,0 +1,38 @@
+/**
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103407891
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
+  /*
+   * Creates the appropriate statistics object given the name of the 
statistics and the input statistic
+   */
+  public static MergedStatistic getMergedStatistic(String outputStatName, 
String inputStatName) {
+if (outputStatName == null || inputStatName == null) {
+  return null;
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103406448
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103405690
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging 
statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  return (long) (colSumHolder.value/ getRowCount(colName));
+}
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+// Dependencies have been configured correctly
+assert (configureComplete == true);
+MapVector outputMap = (MapVector) output;
+
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  NullableFloat8Vector vv = (NullableFloat8Vector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, (colSumHolder.value / 
getRowCount(colName)));
+}
+mergeComplete = true;
+  }
+
+  @Override
+  public void configure(Object configurations) {
+List<MergedStatistic> statistics = (List<MergedStatistic>) configurations;
+for (MergedStatistic statistic : statistics) {
+  if (statistic.getName().equals("type")) {
+types = statistic;
+  } else if (statistic.getName().equals("statcount")) {
+statCounts = statistic;
+  } else if 

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103403068
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging 
statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  return (long) (colSumHolder.value/ getRowCount(colName));
+}
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+// Dependencies have been configured correctly
+assert (configureComplete == true);
+MapVector outputMap = (MapVector) output;
+
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  NullableFloat8Vector vv = (NullableFloat8Vector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, (colSumHolder.value / 
getRowCount(colName)));
+}
--- End diff --

Since we create these VVs in the StatisticsMergeBatch, we will only create 
one per statistic per column. We can redesign this in a subsequent 
implementation; I am not sure how we can get by without generating a VV, or 
maybe you mean generating fewer VVs?



[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103383633
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
--- End diff --

Yes, it is not strictly necessary. However, having the statistics move through 
distinct states makes it easier to debug/track issues. Also, incremental statistics 
would probably increase the number of run-time state interactions. As suggested, I 
added a state enum `enum State {Config, Merge, Complete};`
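
A rough sketch of how that enum could replace the two booleans (hypothetical, not the 
actual patch code):

```
// Hypothetical sketch: a single State field makes the lifecycle explicit and
// turns illegal transitions into clear failures, instead of tracking
// configureComplete/mergeComplete separately.
abstract class LifecycleSketch {
  enum State {Config, Merge, Complete}

  private State state = State.Config;

  void startMerge() {
    if (state != State.Config) {
      throw new IllegalStateException("Cannot start merging in state " + state);
    }
    state = State.Merge;
  }

  void markComplete() {
    if (state != State.Merge) {
      throw new IllegalStateException("Cannot complete merging in state " + state);
    }
    state = State.Complete;
  }

  Object getStat(String colName) {
    if (state != State.Complete) {
      throw new IllegalStateException("Statistic has not completed merging, state " + state);
    }
    return doGetStat(colName);
  }

  abstract Object doGetStat(String colName);
}
```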




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103371017
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103367008
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AbstractMergedStatistic.java
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.exec.vector.ValueVector;
+
+public abstract class AbstractMergedStatistic extends Statistic implements 
MergedStatistic {
+  @Override
+  public String getName() {
+throw new UnsupportedOperationException("getName() not implemented");
+  }
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103366767
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/QueryContext.java ---
@@ -283,4 +288,22 @@ public void close() throws Exception {
   closed = true;
 }
   }
+
+  /**
+  * @param stmtType : Sets the type {@link SqlStatementType} of the 
statement e.g. CTAS, ANALYZE
+  */
+  public void setSQLStatementType(SqlStatementType stmtType) {
+if (this.stmtType == null) {
+  this.stmtType = stmtType;
+} else {
+  throw new UnsupportedOperationException("SQL Statement type is 
already set");
+}
+  }
+
+  /**
+   * @return Get the type {@link SqlStatementType} of the statement e.g. 
CTAS, ANALYZE
+   */
+  public SqlStatementType getSQLStatementType() {
+return this.stmtType;
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103366732
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/QueryContext.java ---
@@ -283,4 +288,22 @@ public void close() throws Exception {
   closed = true;
 }
   }
+
+  /**
+  * @param stmtType : Sets the type {@link SqlStatementType} of the 
statement e.g. CTAS, ANALYZE
+  */
+  public void setSQLStatementType(SqlStatementType stmtType) {
+if (this.stmtType == null) {
+  this.stmtType = stmtType;
+} else {
+  throw new UnsupportedOperationException("SQL Statement type is 
already set");
--- End diff --

Changed to `IllegalStateException`




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103366521
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java ---
@@ -245,6 +247,18 @@ public SchemaPlus getRootSchema() {
   }
 
   /**
+   * Returns the statement type (e.g. SELECT, CTAS, ANALYZE) from the 
query context.
+   * @return query statement type {@link SqlStatementType}, if known.
+   */
+  public SqlStatementType getSQLStatementType() {
+if (queryContext == null) {
+  fail(new UnsupportedOperationException("Statement type is only valid 
for root fragment. " +
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103365674
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
@@ -390,4 +391,15 @@
 
   String DYNAMIC_UDF_SUPPORT_ENABLED = "exec.udf.enable_dynamic_support";
   BooleanValidator DYNAMIC_UDF_SUPPORT_ENABLED_VALIDATOR = new 
BooleanValidator(DYNAMIC_UDF_SUPPORT_ENABLED, true, true);
+
+  /**
+   * Option whose value is a long value representing the number of bits 
required for computing ndv (using HLL)
+   */
+  LongValidator NDV_MEMORY_LIMIT = new 
PositiveLongValidator("exec.statistics.ndv_memory_limit", 30, 20);
+
+  /**
+   * Option whose value represents the current version of the statistics. 
Decreasing the value will generate
+   * the older version of statistics
+   */
+  LongValidator STATISTICS_VERSION = new 
NonNegativeLongValidator("exec.statistics.capability_version", 1, 1);
--- End diff --

Say in the next version (v2) we add histograms. Computing stats is 
expensive, so users might prefer to remain on the present version (v1), perhaps 
because their queries do not involve many inequality predicates. Always generating 
the latest version of the stats would force users to compute the latest and 
greatest stats without needing them. On the other hand, providing individual 
control over which statistics to compute puts too much burden on the user to 
figure out exactly which statistics would help their use cases.
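
Roughly, the idea is that the capability version would gate which statistics ANALYZE 
computes. A hypothetical sketch (the statistic names and the version cutoff below are 
made up for the example):

```
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch: users who stay on capability version 1 keep computing
// only the cheaper v1 statistics; raising the version opts into the newer,
// more expensive ones (e.g. histograms in a future v2).
class StatsVersionSketch {
  static List<String> statsFor(long capabilityVersion) {
    List<String> stats = new ArrayList<>(Arrays.asList(
        "column", "majortype", "statcount", "nonnullstatcount", "avg_width", "ndv"));
    if (capabilityVersion >= 2) {
      stats.add("histogram");  // hypothetical v2-only statistic
    }
    return stats;
  }
}
```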




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-24 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103060381
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/NNStatCountMergedStatistic.java
 ---
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.Map;
+
+public class NNStatCountMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+
+  public NNStatCountMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+  }
+
+  @Override
+  public String getName() {
+  return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  BigIntHolder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (BigIntHolder) sumHolder.get(colName);
+  } else {
+colSumHolder = new BigIntHolder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (long) val;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+if (mergeComplete != true) {
+  throw new IllegalStateException(String.format("Statistic `%s` has 
not completed merging statistics",
+  name));
+}
+BigIntHolder colSumHolder = (BigIntHolder) sumHolder.get(colName);
+return colSumHolder.value;
+  }
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector outputMap = (MapVector) output;
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  BigIntHolder holder = (BigIntHolder) sumHolder.get(colName);
+  NullableBigIntVector vv = (NullableBigIntVector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, holder);
+}
+mergeComplete = true;
--- End diff --

The bulk of the code here is copy & paste across implementations. Any way 
to factor out the common code? The part that seems unique is the type-specific 
set of the mutator on line 94. 
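
A rough sketch of one possible factoring (hypothetical, not code proposed in the patch): 
the base class owns the loop over the output map vector and the bookkeeping, and each 
statistic overrides only the type-specific write of a single column's merged value:

```
import java.util.HashMap;
import java.util.Map;
import org.apache.drill.exec.expr.holders.BigIntHolder;
import org.apache.drill.exec.vector.NullableBigIntVector;
import org.apache.drill.exec.vector.ValueVector;
import org.apache.drill.exec.vector.complex.MapVector;

// Hypothetical sketch of the template-method factoring being suggested.
abstract class MergedStatisticTemplateSketch {
  protected boolean mergeComplete = false;

  public void setOutput(MapVector output) {
    for (ValueVector outMapCol : output) {
      // Common skeleton: iterate the output map and hand each column to the
      // type-specific writer.
      writeColumn(outMapCol.getField().getLastName(), outMapCol);
    }
    mergeComplete = true;
  }

  // The only per-statistic piece: the cast and the mutator set.
  protected abstract void writeColumn(String colName, ValueVector outMapCol);
}

// Example subclass corresponding to the non-null count above.
class NNStatCountSketch extends MergedStatisticTemplateSketch {
  private final Map<String, BigIntHolder> sumHolder = new HashMap<>();

  @Override
  protected void writeColumn(String colName, ValueVector outMapCol) {
    NullableBigIntVector vv = (NullableBigIntVector) outMapCol;
    vv.allocateNewSafe();
    vv.getMutator().setSafe(0, sumHolder.get(colName));
  }
}
```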




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102872722
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/ColTypeMergedStatistic.java
 ---
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.IntHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.IntVector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.Map;
+
+public class ColTypeMergedStatistic extends AbstractMergedStatistic {
+  private String name;
+  private String inputName;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> typeHolder;
+
+
+  public ColTypeMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+typeHolder = new HashMap<>();
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  if (typeHolder.get(colName) == null) {
+IntHolder colType = new IntHolder();
+((IntVector) vv).getAccessor().get(0, colType);
+typeHolder.put(colName, colType);
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
--- End diff --

I see. Some stats are longs, some are ints?

Then getStat() should not be on the base class; it should be defined, 
with a specific return type, on each subclass. Why? I have to know the subclass 
to know how to interpret the returned Object, so I might as well just use the 
subclass directly and get the unboxed primitive. That is, rather than:

```
AbstractMergedStatistic stat = ...
Object value = stat.getStat();
if (stat instanceof ColTypeMergedStatistic) {
  int colType = (int) value;
}
```
Just do:

```
ColTypeMergedStatistic colTypeStat = ...;
int colType = colTypeStat.getType(colName);
```




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102875132
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
+  /*
+   * Creates the appropriate statistics object given the name of the 
statistics and the input statistic
+   */
+  public static MergedStatistic getMergedStatistic(String outputStatName, 
String inputStatName) {
+if (outputStatName == null || inputStatName == null) {
+  return null;
+} else if (outputStatName.equals(Statistic.COLNAME)) {
+  return new ColumnMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.COLTYPE)) {
+  return new ColTypeMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.STATCOUNT)) {
+  return new StatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NNSTATCOUNT)) {
+  return new NNStatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.AVG_WIDTH)) {
+  return new AvgWidthMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.HLL_MERGE)) {
+  return new HLLMergedStatistic(outputStatName, inputStatName);
--- End diff --

If the outputStatName determines the class, then we don't have to tell the 
class the name of its statistic, do we? The stat name can be hard-coded for 
each class to the matching constant...

Otherwise, it raises the question of whether a single stats implementation can 
be known by multiple names...

Is there a 1:1 mapping from input stat to output stat? Then the input stat 
name can be hard-coded in each implementation also, right?
```
public class HLLMergedStatistic extends AbstractMergedStatistic {
  ...
  @Override
  public String getName() { return Statistic.HLL_MERGE; }
}
```




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102873542
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
--- End diff --

Since multiple classes have member variables for name, inputName, and so 
on, go ahead and move them into the common base class to avoid duplication.
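
A sketch of what that could look like (hypothetical rewrite of the base class, using 
the field names from the subclasses above):

```
// Hypothetical sketch: the duplicated fields and trivial getters live once in
// the abstract base; each concrete statistic just chains to super(name, inputName).
public abstract class AbstractMergedStatistic extends Statistic implements MergedStatistic {
  protected final String name;
  protected final String inputName;

  protected AbstractMergedStatistic(String name, String inputName) {
    this.name = name;
    this.inputName = inputName;
  }

  @Override
  public String getName() {
    return name;
  }

  @Override
  public String getInput() {
    return inputName;
  }
}
```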




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102871257
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging 
statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  return (long) (colSumHolder.value/ getRowCount(colName));
+}
+
+  @Override
+  public void setOutput(ValueVector output) {
--- End diff --

```
setOutput(MapVector output)
```

Does not make sense with any other vector type.
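
If the contract is really map-in / map-out, one option is to encode that in 
the interface itself (sketch; assumes `MergedStatistic` can be changed 
accordingly):
```
public interface MergedStatistic {
  // Taking MapVector directly removes the asserts and casts in every implementation.
  void merge(MapVector input);
  void setOutput(MapVector output);
  // ... remaining methods unchanged
}
```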


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102875193
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/Statistic.java
 ---
@@ -0,0 +1,38 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+/*
+ * Base Statistics class - all statistics classes should extend this class
+ */
+public abstract class Statistic {
+  /*
+   * List of statistics used in Drill.
+   */
+  public static final String COLNAME = "column";
+  public static final String COLTYPE = "type";
+  public static final String SCHEMA = "schema";
+  public static final String COMPUTED = "computed";
+  public static final String STATCOUNT = "statcount";
+  public static final String NNSTATCOUNT = "nonnullstatcount";
+  public static final String NDV = "ndv";
+  public static final String HLL_MERGE = "hll_merge";
+  public static final String HLL = "hll";
+  public static final String AVG_WIDTH = "avg_width";
+  public static final String SUM_WIDTH = "sum_width";
--- End diff --

Nice!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102871196
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging 
statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  return (long) (colSumHolder.value/ getRowCount(colName));
+}
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+// Dependencies have been configured correctly
+assert (configureComplete == true);
+MapVector outputMap = (MapVector) output;
+
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  NullableFloat8Vector vv = (NullableFloat8Vector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, (colSumHolder.value / 
getRowCount(colName)));
+}
--- End diff --

This is a very expensive implementation! Creating value vectors is pretty 
heavy-weight.

```
for (String colName : sumHolder.keySet()) {
  NullableFloat8Vector vv = outputMap.getChild(colName, NullableFloat8Vector.class);
  vv.allocateNewSafe();
  vv.getMutator().setSafe(0, getStat(colName));
}
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102874413
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
--- End diff --

If the goal is to use only the static function, then:
```
// Can't instantiate
private MergedStatisticFactory() { }
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102871374
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging 
statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  return (long) (colSumHolder.value/ getRowCount(colName));
+}
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+// Dependencies have been configured correctly
+assert (configureComplete == true);
+MapVector outputMap = (MapVector) output;
+
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  NullableFloat8Vector vv = (NullableFloat8Vector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, (colSumHolder.value / 
getRowCount(colName)));
+}
+mergeComplete = true;
+  }
+
+  @Override
+  public void configure(Object configurations) {
+    List<MergedStatistic> statistics = (List<MergedStatistic>) configurations;
--- End diff --

```
public void configure(List<MergedStatistic> configurations) {
```

Please don't use generic Object and casting unless absolutely necessary. 
(Sometimes it is. This doesn't appear to be one of those times...)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102874162
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatistic.java
 ---
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.exec.vector.ValueVector;
+/*
+ * Interface for implementing a merged statistic. A merged statistic can 
merge
+ * the input statistics to get the overall value. e.g. `rowcount` merged 
statistic
+ * should merge all `rowcount` input statistic and return the overall 
`rowcount`.
+ * Given `rowcount`s 10 and 20, the `rowcount` merged statistic will 
return 30.
+ */
+public interface MergedStatistic {
+  // Gets the name of the merged statistic
--- End diff --

If you use Javadoc comments, then the description will show up in any 
generated documentation:
```
/** Gets the name ... */
```
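
For instance (wording is illustrative):
```
/**
 * Returns the name of this merged statistic, e.g. {@link Statistic#AVG_WIDTH}.
 */
String getName();

/**
 * Returns the name of the input statistic that this merged statistic consumes.
 */
String getInput();
```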


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102866471
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/ContextInformation.java 
---
@@ -29,13 +29,15 @@
   private final long queryStartTime;
   private final int rootFragmentTimeZone;
   private final String sessionId;
+  private final int hllAccuracy;
 
   public ContextInformation(final UserCredentials userCredentials, final 
QueryContextInformation queryContextInfo) {
 this.queryUser = userCredentials.getUserName();
 this.currentDefaultSchema = queryContextInfo.getDefaultSchemaName();
 this.queryStartTime = queryContextInfo.getQueryStartTime();
 this.rootFragmentTimeZone = queryContextInfo.getTimeZone();
 this.sessionId = queryContextInfo.getSessionId();
+this.hllAccuracy = queryContextInfo.getHllAccuracy();
--- End diff --

The query context is very general and is probably not the place to store 
specific options such as the hllAccuracy. The operator and/or factory can get 
the value directly from the option manager.

Actually, any stats options should be set on the operator definition itself 
(`StatisticsAggregate`, etc.) so that all fragments use the same value: the one 
selected when creating the plan. This behavior would mimic how we set memory 
for sort operators, etc.
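
A rough sketch of that approach, with illustrative field and accessor names:
```
// The planner resolves the option once and bakes it into the operator definition,
// so every fragment deserializes and uses the same value.
public class StatisticsAggregate /* extends the existing physical-operator base */ {
  private final long hllAccuracy;   // read from the option manager at plan time

  public StatisticsAggregate(long hllAccuracy) {
    this.hllAccuracy = hllAccuracy;
  }

  public long getHllAccuracy() {
    return hllAccuracy;
  }
}
```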


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102868884
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AbstractMergedStatistic.java
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.exec.vector.ValueVector;
+
+public abstract class AbstractMergedStatistic extends Statistic implements 
MergedStatistic {
+  @Override
+  public String getName() {
+throw new UnsupportedOperationException("getName() not implemented");
+  }
--- End diff --

The classic way to do this is:

```
public abstract String getName();
```

Or, since this is implementing/extending some other class, just leave the 
method unimplemented. This way, the compiler will tell you that you forgot to 
implement a required method.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102875234
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/Statistic.java
 ---
@@ -0,0 +1,38 @@
+/**
--- End diff --

/** --> /*
The copyright notice need not be Javadoc.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102869973
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
--- End diff --

Are all stats doubles? If so, return a double, not an Object.
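
That is, something along these lines (sketch):
```
// A typed accessor avoids boxing and the cast at every call site.
double getStat(String colName);
```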


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102863904
  
--- Diff: exec/java-exec/src/main/codegen/data/Parser.tdd ---
@@ -39,7 +39,13 @@
 "METADATA",
 "DATABASE",
 "IF",
-"JAR"
+"JAR",
+"ANALYZE",
+"COMPUTE",
+"ESTIMATE",
+"STATISTICS",
+"SAMPLE",
+"PERCENT"
--- End diff --

We need a solution to this problem. For this feature, we probably just have 
to live with breaking existing queries. Please be sure the list of new keywords 
goes into both release notes and a [list of reserved 
words](http://drill.apache.org/docs/reserved-keywords/) on the Apache Drill web 
site.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102865583
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
@@ -390,4 +391,15 @@
 
   String DYNAMIC_UDF_SUPPORT_ENABLED = "exec.udf.enable_dynamic_support";
   BooleanValidator DYNAMIC_UDF_SUPPORT_ENABLED_VALIDATOR = new 
BooleanValidator(DYNAMIC_UDF_SUPPORT_ENABLED, true, true);
+
+  /**
+   * Option whose value is a long value representing the number of bits 
required for computing ndv (using HLL)
+   */
+  LongValidator NDV_MEMORY_LIMIT = new 
PositiveLongValidator("exec.statistics.ndv_memory_limit", 30, 20);
+
+  /**
+   * Option whose value represents the current version of the statistics. 
Decreasing the value will generate
+   * the older version of statistics
+   */
+  LongValidator STATISTICS_VERSION = new 
NonNegativeLongValidator("exec.statistics.capability_version", 1, 1);
--- End diff --

Having a statistics version number makes sense. What I disagree on is how 
we are managing the version.

The version is defined by the code that gathers and writes the stats. If 
I'm running a Drill that has version 3 of the implementation, I write version 3 
files. That version number should be a constant defined in the code. When we 
change stats format, we bump the version number.

The reader should handle old versions of the file: at least one older 
version (to ease software upgrades.) The reader retrieves the version from the 
file and checks if it is supported by the reader implementation.

This is all very standard practice.

Where, then, is there room for the user to specify a version? What does 
specifying a version mean? This is the question we need to clarify.
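
For illustration, the usual shape of this (constant and accessor names are 
made up):
```
// Writer side: the build defines the one version it produces.
public static final int STATS_VERSION = 1;   // bump whenever the file format changes

// Reader side: accept the current version and (at least) the previous one.
int fileVersion = statsFile.getVersion();    // hypothetical accessor on the stats file
if (fileVersion < STATS_VERSION - 1 || fileVersion > STATS_VERSION) {
  throw new IllegalStateException("Unsupported statistics version " + fileVersion
      + "; this build reads versions " + (STATS_VERSION - 1) + " to " + STATS_VERSION);
}
```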


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102874232
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
+  /*
+   * Creates the appropriate statistics object given the name of the 
statistics and the input statistic
+   */
+  public static MergedStatistic getMergedStatistic(String outputStatName, 
String inputStatName) {
+if (outputStatName == null || inputStatName == null) {
+  return null;
+} else if (outputStatName.equals(Statistic.COLNAME)) {
+  return new ColumnMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.COLTYPE)) {
+  return new ColTypeMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.STATCOUNT)) {
+  return new StatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NNSTATCOUNT)) {
+  return new NNStatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.AVG_WIDTH)) {
+  return new AvgWidthMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.HLL_MERGE)) {
+  return new HLLMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NDV)) {
+  return new NDVMergedStatistic(outputStatName, inputStatName);
+} else {
+  return null;
--- End diff --

Is this an expected case? Or
```
throw new IllegalArgumentException("No implementation for " + 
outputStatName);
```


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102871725
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging 
statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  return (long) (colSumHolder.value/ getRowCount(colName));
+}
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+// Dependencies have been configured correctly
+assert (configureComplete == true);
+MapVector outputMap = (MapVector) output;
+
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  NullableFloat8Vector vv = (NullableFloat8Vector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, (colSumHolder.value / 
getRowCount(colName)));
+}
+mergeComplete = true;
+  }
+
+  @Override
+  public void configure(Object configurations) {
+    List<MergedStatistic> statistics = (List<MergedStatistic>) configurations;
+for (MergedStatistic statistic : statistics) {
+  if (statistic.getName().equals("type")) {
+types = statistic;
+  } else if (statistic.getName().equals("statcount")) {
+statCounts = statistic;
+  } else if 

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102872366
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AbstractMergedStatistic.java
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.exec.vector.ValueVector;
+
+public abstract class AbstractMergedStatistic extends Statistic implements 
MergedStatistic {
+  @Override
+  public String getName() {
+throw new UnsupportedOperationException("getName() not implemented");
+  }
+
+  @Override
+  public String getInput() {
+throw new UnsupportedOperationException("getInput() not implemented");
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+throw new UnsupportedOperationException("merge() not implemented");
+  }
+
+  @Override
+  public Object getStat(String colName) {
+throw new UnsupportedOperationException("getStat() not implemented");
+  }
+
+  @Override
+  public void setOutput(ValueVector output) {
+throw new UnsupportedOperationException("getOutput() not implemented");
+  }
+
+  @Override
+  public void configure(Object configurations) {
--- End diff --

See comments below. Types should not be so generic.

Also, given the number of hash lookups, I wonder if we need a class to hold 
the information about each column?
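
For example, a small per-column holder (names are illustrative) would replace 
the parallel name-keyed maps and repeated lookups with a single map access:
```
// One object per column instead of several maps keyed by column name.
class ColumnStats {
  TypeProtos.MinorType type;   // from the "type" statistic
  long statCount;              // rows seen for this column
  long nonNullStatCount;       // non-null rows seen for this column
  double widthSum;             // running sum used by avg_width
}

// In the merged statistic:
private final Map<String, ColumnStats> columns = new HashMap<>();
```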


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102866270
  
--- Diff: 
protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryContextInformation.java
 ---
@@ -51,6 +51,7 @@ public static QueryContextInformation getDefaultInstance()
 private int timeZone;
 private String defaultSchemaName;
 private String sessionId;
+private int hllAccuracy;
--- End diff --

This need not be set here. Instead, get it from the option manager 
available to each fragment. Looks like this class is for static info common to 
all queries, not for caching system options.
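
Roughly (the validator name is made up, and the exact accessor may differ; the 
point is that the fragment's option manager already knows the value):
```
// Read the option where it is used instead of caching it on the query context.
long hllAccuracy = context.getOptions().getOption(ExecConstants.HLL_ACCURACY_VALIDATOR);
```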


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102872916
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AbstractMergedStatistic.java
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.exec.vector.ValueVector;
+
+public abstract class AbstractMergedStatistic extends Statistic implements 
MergedStatistic {
--- End diff --

In general, I like how this is working out. Having the core logic in these 
classes is much clearer. See below for a few Java-related refinements.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102867208
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/QueryContext.java ---
@@ -283,4 +288,22 @@ public void close() throws Exception {
   closed = true;
 }
   }
+
+  /**
+  * @param stmtType : Sets the type {@link SqlStatementType} of the 
statement e.g. CTAS, ANALYZE
+  */
+  public void setSQLStatementType(SqlStatementType stmtType) {
+if (this.stmtType == null) {
+  this.stmtType = stmtType;
+} else {
+  throw new UnsupportedOperationException("SQL Statement type is 
already set");
--- End diff --

`UnsupportedOperationException` is for cases where the user requests an operation that 
Drill does not support. Here, you mean either `IllegalArgumentException` or 
`IllegalStateException` - both indicate a programming error.
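
For instance, in `setSQLStatementType()` (sketch using Guava's `Preconditions`):
```
// Setting the type twice is a programming error; IllegalStateException is the right signal.
Preconditions.checkState(stmtType == null, "SQL statement type is already set");
this.stmtType = stmtType;
```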


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102872187
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging 
statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  return (long) (colSumHolder.value/ getRowCount(colName));
+}
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+// Dependencies have been configured correctly
+assert (configureComplete == true);
+MapVector outputMap = (MapVector) output;
+
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  NullableFloat8Vector vv = (NullableFloat8Vector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, (colSumHolder.value / 
getRowCount(colName)));
+}
+mergeComplete = true;
+  }
+
+  @Override
+  public void configure(Object configurations) {
+    List<MergedStatistic> statistics = (List<MergedStatistic>) configurations;
+for (MergedStatistic statistic : statistics) {
+  if (statistic.getName().equals("type")) {
+types = statistic;
+  } else if (statistic.getName().equals("statcount")) {
+statCounts = statistic;
+  } else if 

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102874862
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
+  /*
+   * Creates the appropriate statistics object given the name of the 
statistics and the input statistic
+   */
+  public static MergedStatistic getMergedStatistic(String outputStatName, 
String inputStatName) {
+if (outputStatName == null || inputStatName == null) {
+  return null;
+} else if (outputStatName.equals(Statistic.COLNAME)) {
+  return new ColumnMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.COLTYPE)) {
+  return new ColTypeMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.STATCOUNT)) {
+  return new StatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NNSTATCOUNT)) {
+  return new NNStatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.AVG_WIDTH)) {
+  return new AvgWidthMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.HLL_MERGE)) {
+  return new HLLMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NDV)) {
+  return new NDVMergedStatistic(outputStatName, inputStatName);
--- End diff --

Consider:

```
if (outputStatName.equals(Statistic.NDV)) {
  stat = new NDVMergedStatistic();
}
...
stat.init(outputStatName, inputStatName);
```
How often will this factory be used? If frequently, then consider:
```
class Factory {
  private static final Factory instance = new Factory();
  private final Map<String, Class<? extends MergedStatistic>> statsClasses = new HashMap<>();

  private Factory() {
    statsClasses.put(Statistic.NDV, NDVMergedStatistic.class);
    ...
  }

  public static MergedStatistic getMergedStatistic(String outputStatName, String inputStatName) {
    return instance.newMergedStatistic(outputStatName, inputStatName);
  }

  private MergedStatistic newMergedStatistic(String outputStatName, String inputStatName) {
    try {
      MergedStatistic stat = statsClasses.get(outputStatName).newInstance();
      stat.init(outputStatName, inputStatName);
      return stat;
    } catch (InstantiationException | IllegalAccessException e) {
      throw new IllegalStateException("Cannot create statistic " + outputStatName, e);
    }
  }
}
```
Lots of ways to do the above; this is just an example.
 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102870259
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging 
statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  return (long) (colSumHolder.value/ getRowCount(colName));
--- End diff --

In Drill, different columns generally cannot have different row counts. In 
any vector, all columns must have the same number of values. Or, is the 
`getRowCount` method really `getNonNullValueCount`: the number of rows for 
which the column in question had non-null values?

By discarding the holder, the above reduces to:

```
return Math.round(colSum / rowCount);
```

Also, I'm not really sure what this is doing. The sum seems global, but the 
value returned is per column, with only the row count differing between 
columns? Perhaps explain that a bit?...



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102874299
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
+  /*
+   * Creates the appropriate statistics object given the name of the 
statistics and the input statistic
+   */
+  public static MergedStatistic getMergedStatistic(String outputStatName, 
String inputStatName) {
+if (outputStatName == null || inputStatName == null) {
+  return null;
--- End diff --

This is probably an error:
```
throw new IllegalArgumentException("Names cannot be null");
```




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102867030
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java ---
@@ -245,6 +247,18 @@ public SchemaPlus getRootSchema() {
   }
 
   /**
+   * Returns the statement type (e.g. SELECT, CTAS, ANALYZE) from the 
query context.
+   * @return query statement type {@link SqlStatementType}, if known.
+   */
+  public SqlStatementType getSQLStatementType() {
+if (queryContext == null) {
+  fail(new UnsupportedOperationException("Statement type is only valid 
for root fragment. " +
--- End diff --

The `fail()` call is for runtime errors due to external causes, user error, 
etc. Here, we have a programming error. Better ways to handle this:

```
if (queryContext == null) {
  throw new IllegalStateException("Statement type...");
}
```
Or just:

```
Preconditions.checkNotNull(queryContext);
```
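
If the message is worth keeping, Guava's `checkNotNull` also has an overload that takes one, so the intent stays visible in the stack trace:

```
Preconditions.checkNotNull(queryContext,
    "Statement type is only valid for the root fragment");
```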




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102867347
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/QueryContext.java ---
@@ -283,4 +288,22 @@ public void close() throws Exception {
   closed = true;
 }
   }
+
+  /**
+  * @param stmtType : Sets the type {@link SqlStatementType} of the 
statement e.g. CTAS, ANALYZE
+  */
+  public void setSQLStatementType(SqlStatementType stmtType) {
+if (this.stmtType == null) {
+  this.stmtType = stmtType;
+} else {
+  throw new UnsupportedOperationException("SQL Statement type is 
already set");
+}
+  }
+
+  /**
+   * @return Get the type {@link SqlStatementType} of the statement e.g. 
CTAS, ANALYZE
+   */
+  public SqlStatementType getSQLStatementType() {
+return this.stmtType;
--- End diff --

No need for "this."




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-23 Thread paul-rogers
Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102875280
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,230 @@
+/**
--- End diff --

Stopped here for now, will continue with the rest tomorrow.




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-22 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102604228
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
@@ -390,4 +391,15 @@
 
   String DYNAMIC_UDF_SUPPORT_ENABLED = "exec.udf.enable_dynamic_support";
   BooleanValidator DYNAMIC_UDF_SUPPORT_ENABLED_VALIDATOR = new 
BooleanValidator(DYNAMIC_UDF_SUPPORT_ENABLED, true, true);
+
+  /**
+   * Option whose value is a long value representing the number of bits 
required for computing ndv (using HLL)
+   */
+  LongValidator NDV_MEMORY_LIMIT = new 
PositiveLongValidator("exec.statistics.ndv_memory_limit", 30, 20);
--- End diff --

We are not mixing different HLL lengths (accuracies) during the same run. The session setting at the foreman is passed along in the plan fragment, so non-foreman fragments use the same value. We also do not mix lengths across different runs, so this should not be an issue.
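
To make the constraint concrete, here is a small self-contained sketch (hypothetical class and values) using the stream-lib `HyperLogLog` the patch already imports: the two per-fragment sketches can only be merged into one NDV estimate because they were built with the same `log2m`, which is exactly what shipping the session setting in the plan fragment guarantees.

```
// Sketch only: illustrates why all fragments must share one log2m/accuracy.
import com.clearspring.analytics.stream.cardinality.CardinalityMergeException;
import com.clearspring.analytics.stream.cardinality.HyperLogLog;

public class NdvMergeSketch {
  public static void main(String[] args) throws CardinalityMergeException {
    final int log2m = 20;  // hypothetical value, taken once from the session option / plan fragment

    HyperLogLog fragmentA = new HyperLogLog(log2m);
    HyperLogLog fragmentB = new HyperLogLog(log2m);

    for (int i = 0; i < 1000; i++) {
      fragmentA.offer("key-" + i);          // values seen by fragment A
      fragmentB.offer("key-" + (i + 500));  // overlapping values seen by fragment B
    }

    // Merging only works because both sketches use the same register count.
    long ndvEstimate = fragmentA.merge(fragmentB).cardinality();
    System.out.println("estimated NDV = " + ndvEstimate);
  }
}
```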




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-22 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102602855
  
--- Diff: exec/java-exec/src/main/codegen/data/Parser.tdd ---
@@ -39,7 +39,13 @@
 "METADATA",
 "DATABASE",
 "IF",
-"JAR"
+"JAR",
+"ANALYZE",
+"COMPUTE",
+"ESTIMATE",
+"STATISTICS",
+"SAMPLE",
+"PERCENT"
--- End diff --

@sudheeshkatkam mentioned
> Something like this came up before, where a list of non-reserved keywords might result in some ambiguous queries. See DRILL-2116. Also DRILL-3875.

Hence, these keywords were not added to the non-reserved keyword list. 
Also, I am not sure how we can preserve backward compatibility here.




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102345408
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdSelectivity.java
 ---
@@ -0,0 +1,219 @@

+/***
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ 
**/
+package org.apache.drill.exec.planner.cost;
+
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.plan.volcano.RelSubset;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.SingleRel;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.TableScan;
+import org.apache.calcite.rel.metadata.ReflectiveRelMetadataProvider;
+import org.apache.calcite.rel.metadata.RelMdSelectivity;
+import org.apache.calcite.rel.metadata.RelMdUtil;
+import org.apache.calcite.rel.metadata.RelMetadataProvider;
+import org.apache.calcite.rel.metadata.RelMetadataQuery;
+import org.apache.calcite.rel.type.RelDataType;
+import org.apache.calcite.rex.RexBuilder;
+import org.apache.calcite.rex.RexCall;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.rex.RexUtil;
+import org.apache.calcite.rex.RexVisitor;
+import org.apache.calcite.rex.RexVisitorImpl;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.BuiltInMethod;
+import org.apache.calcite.util.Util;
+import org.apache.drill.exec.planner.common.DrillJoinRelBase;
+import org.apache.drill.exec.planner.common.DrillRelOptUtil;
+import org.apache.drill.exec.planner.logical.DrillTable;
+import org.apache.drill.exec.planner.logical.DrillTranslatableTable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class DrillRelMdSelectivity extends RelMdSelectivity {
+  private static final org.slf4j.Logger logger =
+  org.slf4j.LoggerFactory.getLogger(RelMdSelectivity.class);
+
+  private static final DrillRelMdSelectivity INSTANCE =
+  new DrillRelMdSelectivity();
+
+  public static final RelMetadataProvider SOURCE =
+  ReflectiveRelMetadataProvider.reflectiveSource(
+  BuiltInMethod.SELECTIVITY.method, INSTANCE);
+
+  @Override
+  public Double getSelectivity(RelNode rel, RexNode predicate) {
+if (rel instanceof TableScan) {
+  return getScanSelectivity((TableScan) rel, predicate);
+} else if (rel instanceof DrillJoinRelBase) {
+  return getJoinSelectivity(((DrillJoinRelBase) rel), predicate);
+} else if (rel instanceof SingleRel && 
!DrillRelOptUtil.guessRows(rel)) {
+return RelMetadataQuery.getSelectivity(((SingleRel) 
rel).getInput(), predicate);
+} else if (rel instanceof RelSubset && 
!DrillRelOptUtil.guessRows(rel)) {
+  if (((RelSubset) rel).getBest() != null) {
+return RelMetadataQuery.getSelectivity(((RelSubset)rel).getBest(), 
predicate);
+  } else if (((RelSubset)rel).getOriginal() != null) {
+return 
RelMetadataQuery.getSelectivity(((RelSubset)rel).getOriginal(), predicate);
+  } else {
+return super.getSelectivity(rel, predicate);
+  }
+} else {
+  return super.getSelectivity(rel, predicate);
+}
+  }
+
+  private Double getJoinSelectivity(DrillJoinRelBase rel, RexNode 
predicate) {
+double sel = 1.0;
+// determine which filters apply to the left vs right
+RexNode leftPred = null;
+RexNode rightPred = null;
+JoinRelType joinType = rel.getJoinType();
+final RexBuilder rexBuilder = rel.getCluster().getRexBuilder();
+int[] adjustments = new int[rel.getRowType().getFieldCount()];
+
+if (DrillRelOptUtil.guessRows(rel)) {
+  return super.getSelectivity(rel, predicate);
+}
+
+if (predicate != 

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102327621
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), 
cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this 
number is incremented.
+ *"computed" : BIGINT - What time is it computed?
+ *"columns"   : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
+ *   "region_id"  : BIGINT - statscount(region_id) - aggregation over 
all values of region_id
+ *  in incoming batch
+ *   "sales_city" : BIGINT - statscount(sales_city)
+ *   "cnt": BIGINT - statscount(cnt)
+ *"nonnullstatcount" : MAP
+ *   "region_id"  : BIGINT - nonnullstatcount(region_id)
+ *   "sales_city" : BIGINT - nonnullstatcount(sales_city)
+ *   "cnt": BIGINT - nonnullstatcount(cnt)
+ *    another map for next stats function 
+ */
+public class StatisticsAggBatch extends StreamingAggBatch {
+  private List functions;
+  private int schema = 0;
+
+  public StatisticsAggBatch(StatisticsAggregate popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws OutOfMemoryException {
+super(popConfig, incoming, context);
+this.functions = popConfig.getFunctions();
+  }
+
+  private void createKeyColumn(String name, LogicalExpression expr, 
List keyExprs,
+  List 

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102326907
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List keyList = null;
+  private Map dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map copySrcVecMap = null;
+  private Map> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102325216
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List keyList = null;
+  private Map dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map copySrcVecMap = null;
+  private Map> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102324705
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List keyList = null;
+  private Map dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map copySrcVecMap = null;
+  private Map> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102324596
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List keyList = null;
+  private Map dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map copySrcVecMap = null;
+  private Map> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102323849
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List keyList = null;
+  private Map dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map copySrcVecMap = null;
+  private Map> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102322555
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List keyList = null;
+  private Map dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map copySrcVecMap = null;
+  private Map> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102320614
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List keyList = null;
+  private Map dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map copySrcVecMap = null;
+  private Map> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102319897
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List keyList = null;
+  private Map dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map copySrcVecMap = null;
+  private Map> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102316247
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/UnpivotMaps.java
 ---
@@ -0,0 +1,59 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.config;
+
+import java.util.List;
+
+import org.apache.drill.exec.physical.base.AbstractSingle;
+import org.apache.drill.exec.physical.base.PhysicalOperator;
+import org.apache.drill.exec.physical.base.PhysicalVisitor;
+import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+
+@JsonTypeName("unpivot-maps")
+public class UnpivotMaps extends AbstractSingle {
--- End diff --

This is generic in the sense that it can unpivot any given set of maps, provided they are the same size and contain the same keys (columns).
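
As a rough illustration of the idea (plain Java maps with hypothetical names, not the operator's vector code), the unpivot turns several statistic maps that share the same column keys into one row per column, with one field per statistic:

```
// Hypothetical sketch of the unpivot idea: stat name -> (column -> value)
// becomes column -> (stat name -> value), one output row per column.
import java.util.LinkedHashMap;
import java.util.Map;

public class UnpivotSketch {
  public static Map<String, Map<String, Object>> unpivot(Map<String, Map<String, Object>> statMaps) {
    Map<String, Map<String, Object>> rows = new LinkedHashMap<>();
    for (Map.Entry<String, Map<String, Object>> stat : statMaps.entrySet()) {
      for (Map.Entry<String, Object> col : stat.getValue().entrySet()) {
        rows.computeIfAbsent(col.getKey(), k -> new LinkedHashMap<>())
            .put(stat.getKey(), col.getValue());  // e.g. row "region_id" gets field "statcount"
      }
    }
    return rows;
  }
}
```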




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102314383
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillStatsTable.java
 ---
@@ -0,0 +1,347 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.planner.common;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import com.fasterxml.jackson.annotation.JsonIgnore;
+import com.fasterxml.jackson.annotation.JsonGetter;
+import com.fasterxml.jackson.annotation.JsonSetter;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonSubTypes;
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.fasterxml.jackson.databind.DeserializationFeature;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.google.common.base.Stopwatch;
+import com.google.common.collect.Maps;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.RelVisitor;
+import org.apache.calcite.rel.core.TableScan;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.apache.drill.exec.ops.QueryContext;
+import org.apache.drill.exec.planner.logical.DrillTable;
+import org.apache.drill.exec.util.ImpersonationUtil;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.joda.time.DateTime;
+
+/**
+ * Wraps the stats table info including schema and tableName. Also 
materializes stats from storage
+ * and keeps them in memory.
+ */
+public class DrillStatsTable {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(DrillStatsTable.class);
+  private final FileSystem fs;
+  private final Path tablePath;
+
+  /**
+   * List of columns in stats table.
+   */
+  public static final String COL_COLUMN = "column";
+  public static final String COL_COMPUTED = "computed";
+  public static final String COL_STATCOUNT = "statcount";
+  public static final String COL_NDV = "ndv";
+
+  private final String schemaName;
+  private final String tableName;
+
+  private final Map<String, Long> ndv = Maps.newHashMap();
+  private double rowCount = -1;
+
+  private boolean materialized = false;
+
+  private TableStatistics statistics = null;
+
+  public DrillStatsTable(String schemaName, String tableName, Path 
tablePath, FileSystem fs) {
+this.schemaName = schemaName;
+this.tableName = tableName;
+this.tablePath = tablePath;
+this.fs = 
ImpersonationUtil.createFileSystem(ImpersonationUtil.getProcessUserName(), 
fs.getConf());
+  }
+
+  public String getSchemaName() {
+return schemaName;
+  }
+
+  public String getTableName() {
+return tableName;
+  }
+  /**
+   * Get number of distinct values of given column. If stats are not 
present for the given column,
+   * a null is returned.
+   *
+   * Note: returned data may not be accurate. Accuracy depends on whether 
the table data has changed after the
+   * stats are computed.
+   *
+   * @param col
+   * @return
+   */
+  public Double getNdv(String col) {
+// Stats might not have materialized because of errors.
+if (!materialized) {
+  return null;
+}
+final String upperCol = col.toUpperCase();
+final Long ndvCol = ndv.get(upperCol);
+// Ndv estimation techniques like HLL may over-estimate, hence cap it 
at rowCount
+if (ndvCol != null) {
+  return Math.min(ndvCol, rowCount);
--- End diff --

Histograms would help with data skew. When we have histograms, the NDV 
would be obtained from the histograms. Stats will be off by default (so not as 
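
As an aside, a minimal, self-contained sketch of why the cap at rowCount matters, using the same clearspring HyperLogLog class the merge batch imports; the log2m value of 10 and the data are illustrative assumptions, and the estimate may land above or below the true count:

import com.clearspring.analytics.stream.cardinality.HyperLogLog;

public class NdvCapSketch {
  public static void main(String[] args) {
    long rowCount = 1000;
    HyperLogLog hll = new HyperLogLog(10);       // log2m = 10: a small sketch, coarser estimate
    for (long i = 0; i < rowCount; i++) {
      hll.offer(i);                              // every row carries a distinct value
    }
    long estimate = hll.cardinality();           // probabilistic estimate; can exceed rowCount
    long ndv = Math.min(estimate, rowCount);     // the same cap applied in getNdv() above
    System.out.println("estimate=" + estimate + ", capped ndv=" + ndv);
  }
}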

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102310795
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<String, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<String, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap<String, ValueHolder>());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102306890
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<String, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<String, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap<String, ValueHolder>());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102295174
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<String, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<String, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap<String, ValueHolder>());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102295106
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<String, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<String, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap<String, ValueHolder>());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102294356
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<String, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<String, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap<String, ValueHolder>());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102292075
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<String, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<String, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap<String, ValueHolder>());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102291655
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillFilterRelBase.java
 ---
@@ -113,6 +114,6 @@ public double getRows() {
 selectivity = filterMaxSelectivityEstimateFactor;
   }
 }
-return selectivity*RelMetadataQuery.getRowCount(getInput());
+return NumberUtil.multiply(selectivity, 
RelMetadataQuery.getRowCount(getInput()));
--- End diff --

Done
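
For readers without the patch open: a hedged sketch of what a null-safe multiply helper along these lines presumably does (the actual NumberUtil class and signature in Drill may differ); returning null propagates an unknown row count instead of throwing a NullPointerException during selectivity estimation:

// Illustrative only; not the actual org.apache.drill NumberUtil implementation.
public final class NumberUtilSketch {
  private NumberUtilSketch() {}

  public static Double multiply(Double left, Double right) {
    if (left == null || right == null) {
      return null;                               // propagate "unknown" rather than fail
    }
    return left * right;
  }

  public static void main(String[] args) {
    System.out.println(multiply(0.5, 1000.0));   // 500.0
    System.out.println(multiply(0.5, null));     // null -> row count stays unknown
  }
}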


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102291099
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unpivot/UnpivotMapsRecordBatch.java
 ---
@@ -0,0 +1,276 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.unpivot;
+
+import java.util.List;
+import java.util.Map;
+
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.common.types.TypeProtos.MinorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.UnpivotMaps;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+
+/**
+ * Unpivot maps. Assumptions are:
+ *  1) all child vectors in a map are of same type.
+ *  2) Each map contains the same number of fields and field names are 
also same (types could be different).
+ *
+ * Example input and output:
+ * Schema of input:
+ *"schema": BIGINT - Schema number. For each schema change 
this number is incremented.
+ *"computed"  : BIGINT - What time is it computed?
+ *"columns" : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
+ *   "region_id"  : BIGINT - statscount(region_id) - aggregation over 
all values of region_id
+ *  in incoming batch
+ *   "sales_city" : BIGINT - statscount(sales_city)
+ *   "cnt": BIGINT - statscount(cnt)
+ *"nonnullstatcount" : MAP
+ *   "region_id"  : BIGINT - nonnullstatcount(region_id)
+ *   "sales_city" : BIGINT - nonnullstatcount(sales_city)
+ *   "cnt": BIGINT - nonnullstatcount(cnt)
+ *    another map for next stats function 
+ *
+ * Schema of output:
--- End diff --

For now, we stick to the original reviewed design. We can revisit the 
design later, if required. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102290203
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatchCreator.java
 ---
@@ -0,0 +1,39 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.base.Preconditions;
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.physical.impl.BatchCreator;
+import org.apache.drill.exec.record.CloseableRecordBatch;
+import org.apache.drill.exec.record.RecordBatch;
+
+import java.util.List;
+
+@SuppressWarnings("unused")
--- End diff --

Copy-paste artifact. Removed it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102290224
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/unpivot/UnpivotMapsBatchCreator.java
 ---
@@ -0,0 +1,40 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.unpivot;
+
+import java.util.List;
+
+import org.apache.drill.common.exceptions.ExecutionSetupException;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.UnpivotMaps;
+import org.apache.drill.exec.physical.impl.BatchCreator;
+import org.apache.drill.exec.record.CloseableRecordBatch;
+import org.apache.drill.exec.record.RecordBatch;
+
+import com.google.common.base.Preconditions;
+
+@SuppressWarnings("unused")
--- End diff --

Copy-paste artifact. Removed it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102289330
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<String, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<String, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap<String, ValueHolder>());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102286549
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<String, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<String, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap<String, ValueHolder>());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-14 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r101180535
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<String, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<String, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap<String, ValueHolder>());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-14 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r101176671
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends 
AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = 
org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<String, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<String, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch 
incoming,
+  FragmentContext context) throws 
OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap<String, ValueHolder>());
+  if (key.equalsIgnoreCase("statcount") || 
key.equalsIgnoreCase("nonnullstatcount")) {
+functions.put(key, "sum");
+  } else if (key.equalsIgnoreCase("hll")) {
+

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-14 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r101173499
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), 
cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this 
number is incremented.
+ *"computed" : BIGINT - What time is it computed?
+ *"columns"   : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
+ *   "region_id"  : BIGINT - statscount(region_id) - aggregation over 
all values of region_id
+ *  in incoming batch
+ *   "sales_city" : BIGINT - statscount(sales_city)
+ *   "cnt": BIGINT - statscount(cnt)
+ *"nonnullstatcount" : MAP
+ *   "region_id"  : BIGINT - nonnullstatcount(region_id)
+ *   "sales_city" : BIGINT - nonnullstatcount(sales_city)
+ *   "cnt": BIGINT - nonnullstatcount(cnt)
+ *    another map for next stats function 
+ */
+public class StatisticsAggBatch extends StreamingAggBatch {
+  private List<String> functions;
+  private int schema = 0;
+
+  public StatisticsAggBatch(StatisticsAggregate popConfig, RecordBatch incoming,
+      FragmentContext context) throws OutOfMemoryException {
+    super(popConfig, incoming, context);
+    this.functions = popConfig.getFunctions();
+  }
+
+  private void createKeyColumn(String name, LogicalExpression expr, List keyExprs,
+      List 
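As a rough illustration of what the aggregate batch sets up for each incoming column, the sketch below builds one function call per requested statistic using FunctionCallFactory and SchemaPath from the imports in the diff. The helper, its name, and the return shape are assumptions for illustration only.

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.drill.common.expression.FunctionCallFactory;
    import org.apache.drill.common.expression.LogicalExpression;
    import org.apache.drill.common.expression.SchemaPath;

    public class StatsCallSketch {
      // For every column, wrap a reference to it in one call per statistic
      // (e.g. statcount, nonnullstatcount, hll) named in the functions list.
      static List<LogicalExpression> buildCalls(List<String> functions, List<String> columns) {
        List<LogicalExpression> calls = new ArrayList<>();
        for (String column : columns) {
          LogicalExpression colRef = SchemaPath.getSimplePath(column);
          for (String function : functions) {
            List<LogicalExpression> args = new ArrayList<>();
            args.add(colRef);
            calls.add(FunctionCallFactory.createExpression(function, args));
          }
        }
        return calls;
      }
    }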

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-14 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r101167425
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), 
cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this 
number is incremented.
+ *"computed" : BIGINT - What time is it computed?
+ *"columns"   : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
+ *   "region_id"  : BIGINT - statscount(region_id) - aggregation over 
all values of region_id
+ *  in incoming batch
+ *   "sales_city" : BIGINT - statscount(sales_city)
+ *   "cnt": BIGINT - statscount(cnt)
+ *"nonnullstatcount" : MAP
+ *   "region_id"  : BIGINT - nonnullstatcount(region_id)
+ *   "sales_city" : BIGINT - nonnullstatcount(sales_city)
+ *   "cnt": BIGINT - nonnullstatcount(cnt)
+ *    another map for next stats function 
+ */
+public class StatisticsAggBatch extends StreamingAggBatch {
+  private List<String> functions;
+  private int schema = 0;
+
+  public StatisticsAggBatch(StatisticsAggregate popConfig, RecordBatch incoming,
+      FragmentContext context) throws OutOfMemoryException {
+    super(popConfig, incoming, context);
+    this.functions = popConfig.getFunctions();
+  }
+
+  private void createKeyColumn(String name, LogicalExpression expr, List keyExprs,
+      List 

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-14 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r101155927
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), 
cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this 
number is incremented.
+ *"computed" : BIGINT - What time is it computed?
+ *"columns"   : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
+ *   "region_id"  : BIGINT - statscount(region_id) - aggregation over 
all values of region_id
+ *  in incoming batch
+ *   "sales_city" : BIGINT - statscount(sales_city)
+ *   "cnt": BIGINT - statscount(cnt)
+ *"nonnullstatcount" : MAP
+ *   "region_id"  : BIGINT - nonnullstatcount(region_id)
+ *   "sales_city" : BIGINT - nonnullstatcount(sales_city)
+ *   "cnt": BIGINT - nonnullstatcount(cnt)
+ *    another map for next stats function 
--- End diff --

We do not plan to support nested types in the current version. We can revisit the design in a subsequent version if we decide to support them.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---
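
Since the output described in the javadoc above stays flat (one entry per top-level column under each statistic), here is a minimal sketch of that shape using plain java.util maps and made-up column data. The real operator emits Drill MAP value vectors; this is only meant to show how statcount and nonnullstatcount line up per column.

    import java.util.Arrays;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class StatsOutputShape {
      // Builds a structure shaped like the documented output: one inner map per
      // statistic, keyed by column name.
      static Map<String, Map<String, Long>> summarize(Map<String, List<Object>> columns) {
        Map<String, Long> statcount = new HashMap<>();
        Map<String, Long> nonNullStatcount = new HashMap<>();
        for (Map.Entry<String, List<Object>> col : columns.entrySet()) {
          long total = col.getValue().size();
          long nonNull = col.getValue().stream().filter(v -> v != null).count();
          statcount.put(col.getKey(), total);
          nonNullStatcount.put(col.getKey(), nonNull);
        }
        Map<String, Map<String, Long>> out = new HashMap<>();
        out.put("statscount", statcount);
        out.put("nonnullstatcount", nonNullStatcount);
        return out;
      }

      public static void main(String[] args) {
        Map<String, List<Object>> batch = new HashMap<>();
        batch.put("region_id", Arrays.asList("R1", "R2", null));
        batch.put("sales_city", Arrays.asList("SFO", null, null));
        System.out.println(summarize(batch));
        // {statscount={region_id=3, sales_city=3}, nonnullstatcount={region_id=2, sales_city=1}}
      }
    }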


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-13 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r100869218
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List keyList = null;
+  private Map dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch incoming,
+      FragmentContext context) throws OutOfMemoryException {
+    super(popConfig, context, incoming);
+    this.functions = new HashMap<>();
+    this.aggregationMap = new HashMap<>();
+
+    /*for (String key : popConfig.getFunctions()) {
+      aggregationMap.put(key, new HashMap());
+      if (key.equalsIgnoreCase("statcount") || key.equalsIgnoreCase("nonnullstatcount")) {
+        functions.put(key, "sum");
+      } else if (key.equalsIgnoreCase("hll")) {
+
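A minimal sketch of the running-merge bookkeeping suggested by the commented-out constructor code above: one holder per (statistic, column), with statcount and nonnullstatcount merged by summation. The mergeSum helper and the plain java.util maps stand in for the operator's aggregationMap and are illustrative only.

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.drill.exec.expr.holders.BigIntHolder;

    public class RunningStatMerge {
      // One running holder per (statistic, column); column names and values are made up.
      private final Map<String, Map<String, BigIntHolder>> aggregationMap = new HashMap<>();

      void mergeSum(String statKey, String column, long incomingValue) {
        Map<String, BigIntHolder> perColumn =
            aggregationMap.computeIfAbsent(statKey, k -> new HashMap<>());
        BigIntHolder holder = perColumn.computeIfAbsent(column, c -> new BigIntHolder());
        holder.value += incomingValue;   // "sum" merge used for statcount/nonnullstatcount
      }

      long result(String statKey, String column) {
        return aggregationMap.get(statKey).get(column).value;
      }
    }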

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-12 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r100709767
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), 
cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this 
number is incremented.
+ *"computed" : BIGINT - What time is it computed?
+ *"columns"   : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
+ *   "region_id"  : BIGINT - statscount(region_id) - aggregation over 
all values of region_id
+ *  in incoming batch
+ *   "sales_city" : BIGINT - statscount(sales_city)
+ *   "cnt": BIGINT - statscount(cnt)
+ *"nonnullstatcount" : MAP
+ *   "region_id"  : BIGINT - nonnullstatcount(region_id)
+ *   "sales_city" : BIGINT - nonnullstatcount(sales_city)
+ *   "cnt": BIGINT - nonnullstatcount(cnt)
+ *    another map for next stats function 
+ */
+public class StatisticsAggBatch extends StreamingAggBatch {
+  private List<String> functions;
+  private int schema = 0;
+
+  public StatisticsAggBatch(StatisticsAggregate popConfig, RecordBatch incoming,
+      FragmentContext context) throws OutOfMemoryException {
+    super(popConfig, incoming, context);
+    this.functions = popConfig.getFunctions();
+  }
+
+  private void createKeyColumn(String name, LogicalExpression expr, List keyExprs,
+      List 

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-12 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r100709733
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), 
cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this 
number is incremented.
+ *"computed" : BIGINT - What time is it computed?
+ *"columns"   : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
+ *   "region_id"  : BIGINT - statscount(region_id) - aggregation over 
all values of region_id
+ *  in incoming batch
+ *   "sales_city" : BIGINT - statscount(sales_city)
+ *   "cnt": BIGINT - statscount(cnt)
+ *"nonnullstatcount" : MAP
+ *   "region_id"  : BIGINT - nonnullstatcount(region_id)
+ *   "sales_city" : BIGINT - nonnullstatcount(sales_city)
+ *   "cnt": BIGINT - nonnullstatcount(cnt)
+ *    another map for next stats function 
+ */
+public class StatisticsAggBatch extends StreamingAggBatch {
+  private List<String> functions;
+  private int schema = 0;
+
+  public StatisticsAggBatch(StatisticsAggregate popConfig, RecordBatch incoming,
+      FragmentContext context) throws OutOfMemoryException {
+    super(popConfig, incoming, context);
+    this.functions = popConfig.getFunctions();
+  }
+
+  private void createKeyColumn(String name, LogicalExpression expr, List keyExprs,
+      List 
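As background on sizing the hll statistic referenced throughout the diff, the standard HyperLogLog accuracy/memory trade-off is: with m = 2^log2m registers the relative standard error is about 1.04/sqrt(m). The log2m values below are examples only, not defaults from the patch.

    public class HllErrorSketch {
      // Approximate relative standard error of an HLL sketch with 2^log2m registers.
      static double relativeError(int log2m) {
        return 1.04 / Math.sqrt(Math.pow(2, log2m));
      }

      public static void main(String[] args) {
        for (int log2m : new int[] {10, 14, 20}) {
          System.out.printf("log2m=%d -> ~%.2f%% error%n", log2m, 100 * relativeError(log2m));
        }
      }
    }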
