[GitHub] drill pull request #1204: DRILL-6318

2018-04-10 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/1204#discussion_r180538860
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushLimitToScanRule.java ---
@@ -74,9 +74,12 @@ public boolean matches(RelOptRuleCall call) {
   // mess up the schema since Convert_FromJson() is different from other regular functions in that it only knows
   // the output schema after evaluation is performed. When input has 0 row, Drill essentially does not have a way
   // to know the output type.
+  // DRILL-6318:
--- End diff --

Please remove JIRA ref here and below - others can get it via annotations.


---


[GitHub] drill pull request #1204: DRILL-6318

2018-04-10 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/1204#discussion_r180539160
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushLimitToScanRule.java ---
@@ -89,14 +92,17 @@ public void onMatch(RelOptRuleCall call) {
   RelNode child = projectRel.getInput();
   final RelNode limitUnderProject = limitRel.copy(limitRel.getTraitSet(), ImmutableList.of(child));
   final RelNode newProject = projectRel.copy(projectRel.getTraitSet(), ImmutableList.of(limitUnderProject));
-  if (DrillRelOptUtil.isProjectOutputRowcountUnknown(projectRel)) {
-    //Preserve limit above the project since Flatten can produce more rows. Also mark it so we do not fire the rule again.
-    final RelNode limitAboveProject = new DrillLimitRel(limitRel.getCluster(), limitRel.getTraitSet(), newProject, limitRel.getOffset(), limitRel.getFetch(), true);
-    call.transformTo(limitAboveProject);
-  } else {
-    call.transformTo(newProject);
-  }
+  call.transformTo(newProject);
+  // DRILL-6318:
--- End diff --

Clean up the commented-out code.


---


[GitHub] drill pull request #1204: DRILL-6318

2018-04-10 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/1204#discussion_r180539313
  
--- Diff: exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/flatten/TestFlattenPlanning.java ---
@@ -66,8 +66,11 @@ public void testPushFilterPastProjectWithFlattenNeg() throws Exception {
   @Test // DRILL-6099 : push limit past flatten(project)
   public void testLimitPushdownPastFlatten() throws Exception {
     final String query = "select rownum, flatten(complex) comp from cp.`store/json/test_flatten_mappify2.json` limit 1";
-    final String[] expectedPatterns = {".*Limit\\(fetch=\\[1\\]\\).*",".*Flatten.*",".*Limit\\(fetch=\\[1\\]\\).*"};
-    final String[] excludedPatterns = null;
+    // DRILL-6318 : limit should not push past flatten(project)
+    // P.S. There was an error in this pattern. Even with the Limit missing after Flatten, it matches the plan
--- End diff --

Remove the P.S. and commented out pattern.


---


[GitHub] drill pull request #:

2018-03-06 Thread gparai
Github user gparai commented on the pull request:


https://github.com/apache/drill/commit/9073aed67d89e8b2188870d6c812706085c9c41b#commitcomment-27961774
  
In logical/src/main/java/org/apache/drill/common/expression/SchemaPath.java on line 46:
What would be the implication of DYNAMIC_STAR on parseFromString()? The Drill Lexer/Parser do not support DYNAMIC_STAR and will run into an error when encountering it during parsing.


---


[GitHub] drill issue #1096: DRILL-6099 : Push limit past flatten(project) without pus...

2018-03-01 Thread gparai
Github user gparai commented on the issue:

https://github.com/apache/drill/pull/1096
  
@amansinha100 I have addressed your review comments. Please take a look. 
Thanks!


---


[GitHub] drill pull request #1096: DRILL-6099 : Push limit past flatten(project) with...

2018-03-01 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/1096#discussion_r171708636
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushLimitToScanRule.java ---
@@ -55,18 +62,21 @@ public void onMatch(RelOptRuleCall call) {
     }
   };
 
-  public static DrillPushLimitToScanRule LIMIT_ON_PROJECT =
-      new DrillPushLimitToScanRule(
-          RelOptHelper.some(DrillLimitRel.class, RelOptHelper.some(
-              DrillProjectRel.class, RelOptHelper.any(DrillScanRel.class))),
-          "DrillPushLimitToScanRule_LimitOnProject") {
+  public static DrillPushLimitToScanRule LIMIT_ON_PROJECT = new DrillPushLimitToScanRule(
+      RelOptHelper.some(DrillLimitRel.class, RelOptHelper.any(DrillProjectRel.class)), "DrillPushLimitToScanRule_LimitOnProject") {
     @Override
     public boolean matches(RelOptRuleCall call) {
       DrillLimitRel limitRel = call.rel(0);
-      DrillScanRel scanRel = call.rel(2);
-      // For now only applies to Parquet. And pushdown only apply limit but not offset,
+      DrillProjectRel projectRel = call.rel(1);
+      // pushdown only apply limit but not offset,
       // so if getFetch() return null no need to run this rule.
-      if (scanRel.getGroupScan().supportsLimitPushdown() && (limitRel.getFetch() != null)) {
--- End diff --

Without a FLATTEN, the LIMIT would be fully pushed past the PROJECT, i.e. we would not have a LIMIT on top of the project.
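A toy, self-contained illustration of the point above (not Drill code; all names here are made up). FLATTEN turns each input row into several output rows, so a limit pushed fully below the flatten can admit too many rows, and the limit above must be preserved:

```java
import java.util.ArrayList;
import java.util.List;

public class FlattenLimitDemo {
  // Each input "row" holds an array that flatten expands into one row per element.
  static List<Integer> flatten(List<List<Integer>> rows) {
    List<Integer> out = new ArrayList<>();
    for (List<Integer> row : rows) {
      out.addAll(row);
    }
    return out;
  }

  // Applies a fetch count on top of its input.
  static List<Integer> limit(List<Integer> rows, int fetch) {
    return rows.subList(0, Math.min(fetch, rows.size()));
  }

  public static void main(String[] args) {
    List<List<Integer>> input = List.of(List.of(1, 2, 3), List.of(4, 5));
    // Limit kept above the flatten: exactly 1 row.
    System.out.println(limit(flatten(input), 1).size()); // 1
    // Limit pushed fully below the flatten (keep only the first input row):
    // flatten still multiplies it into 3 rows, exceeding the fetch count.
    System.out.println(flatten(input.subList(0, 1)).size()); // 3
  }
}
```

This is why, when a flatten is present, the rule keeps a LIMIT above the project in addition to the one pushed below it.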


---


[GitHub] drill pull request #1096: DRILL-6099 : Push limit past flatten(project) with...

2018-03-01 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/1096#discussion_r171708439
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java ---
@@ -224,4 +226,64 @@ public Void visitInputRef(RexInputRef inputRef) {
     }
   }
 
+  public static boolean isLimit0(RexNode fetch) {
+    if (fetch != null && fetch.isA(SqlKind.LITERAL)) {
+      RexLiteral l = (RexLiteral) fetch;
+      switch (l.getTypeName()) {
+        case BIGINT:
+        case INTEGER:
+        case DECIMAL:
+          if (((long) l.getValue2()) == 0) {
+            return true;
+          }
+      }
+    }
+    return false;
+  }
+
+  public static boolean isProjectOutputRowcountUnknown(RelNode project) {
+    assert project instanceof Project : "Rel is NOT an instance of project!";
+    try {
+      RexVisitor visitor =
+          new RexVisitorImpl(true) {
+            public Void visitCall(RexCall call) {
+              if ("flatten".equals(call.getOperator().getName().toLowerCase())) {
+                throw new Util.FoundOne(call); /* throw exception to interrupt tree walk (this is similar to other utility methods in RexUtil.java */
+              }
+              return super.visitCall(call);
+            }
+          };
+      for (RexNode rex : ((Project) project).getProjects()) {
+        rex.accept(visitor);
+      }
+    } catch (Util.FoundOne e) {
+      Util.swallow(e, null);
+      return true;
+    }
+    return false;
+  }
+
+  public static boolean isProjectOutputSchemaUnknown(RelNode project) {
--- End diff --

Done


---


[GitHub] drill pull request #1096: DRILL-6099 : Push limit past flatten(project) with...

2018-03-01 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/1096#discussion_r171708410
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java ---
@@ -224,4 +226,64 @@ public Void visitInputRef(RexInputRef inputRef) {
     }
   }
 
+  public static boolean isLimit0(RexNode fetch) {
+    if (fetch != null && fetch.isA(SqlKind.LITERAL)) {
+      RexLiteral l = (RexLiteral) fetch;
+      switch (l.getTypeName()) {
+        case BIGINT:
+        case INTEGER:
+        case DECIMAL:
+          if (((long) l.getValue2()) == 0) {
+            return true;
+          }
+      }
+    }
+    return false;
+  }
+
+  public static boolean isProjectOutputRowcountUnknown(RelNode project) {
--- End diff --

Done


---


[GitHub] drill pull request #1096: DRILL-6099 : Push limit past flatten(project) with...

2018-03-01 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/1096#discussion_r171708384
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java ---
@@ -224,4 +226,64 @@ public Void visitInputRef(RexInputRef inputRef) {
     }
   }
 
+  public static boolean isLimit0(RexNode fetch) {
+    if (fetch != null && fetch.isA(SqlKind.LITERAL)) {
+      RexLiteral l = (RexLiteral) fetch;
+      switch (l.getTypeName()) {
+        case BIGINT:
+        case INTEGER:
+        case DECIMAL:
+          if (((long) l.getValue2()) == 0) {
+            return true;
+          }
+      }
+    }
+    return false;
+  }
+
+  public static boolean isProjectOutputRowcountUnknown(RelNode project) {
+    assert project instanceof Project : "Rel is NOT an instance of project!";
+    try {
+      RexVisitor visitor =
--- End diff --

Yes, you are correct. If the rewrite does not consider it as embedded 
within other expressions then it is fine for the utility function to do the 
same.
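A simplified stand-in for what "embedded within other expressions" means here (not Calcite's RexVisitor API; the `Expr` type and names below are made up for illustration). The real utility walks the project's Rex tree and short-circuits via `Util.FoundOne`; the same detection can be sketched with plain recursion:

```java
import java.util.List;

public class FindFlattenDemo {
  // Minimal expression-tree node: an operator name plus nested operands.
  static final class Expr {
    final String op;
    final List<Expr> operands;
    Expr(String op, Expr... operands) {
      this.op = op;
      this.operands = List.of(operands);
    }
  }

  // Returns true if any node in the tree is a flatten call,
  // including flatten nested inside another call, e.g. upper(flatten(col)).
  static boolean containsFlatten(Expr e) {
    if ("flatten".equalsIgnoreCase(e.op)) {
      return true;
    }
    for (Expr child : e.operands) {
      if (containsFlatten(child)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Expr embedded = new Expr("upper", new Expr("flatten", new Expr("col")));
    Expr plain = new Expr("concat", new Expr("col1"), new Expr("col2"));
    System.out.println(containsFlatten(embedded)); // true
    System.out.println(containsFlatten(plain));    // false
  }
}
```

If the planner's rewrite only ever produces flatten at the top level of a project expression, the top-level check alone suffices, which is the point being agreed on above.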


---


[GitHub] drill pull request #1096: DRILL-6099 : Push limit past flatten(project) with...

2018-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/1096#discussion_r171086924
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushLimitToScanRule.java ---
@@ -55,18 +62,21 @@ public void onMatch(RelOptRuleCall call) {
     }
   };
 
-  public static DrillPushLimitToScanRule LIMIT_ON_PROJECT =
-      new DrillPushLimitToScanRule(
-          RelOptHelper.some(DrillLimitRel.class, RelOptHelper.some(
-              DrillProjectRel.class, RelOptHelper.any(DrillScanRel.class))),
-          "DrillPushLimitToScanRule_LimitOnProject") {
+  public static DrillPushLimitToScanRule LIMIT_ON_PROJECT = new DrillPushLimitToScanRule(
--- End diff --

There are many instances where we would have a PROJECT on top of the SCAN. 
The way the rule is refactored now the LIMIT_SCAN rule would not work unless we 
do LIMIT_PROJECT. Hence, these rules should go together for LIMIT_SCAN to work 
effectively. That is the reason I kept the rule here rather than creating a new 
rule.


---


[GitHub] drill pull request #1096: DRILL-6099 : Push limit past flatten(project) with...

2018-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/1096#discussion_r171085780
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushLimitToScanRule.java ---
@@ -55,18 +62,21 @@ public void onMatch(RelOptRuleCall call) {
     }
   };
 
-  public static DrillPushLimitToScanRule LIMIT_ON_PROJECT =
-      new DrillPushLimitToScanRule(
-          RelOptHelper.some(DrillLimitRel.class, RelOptHelper.some(
-              DrillProjectRel.class, RelOptHelper.any(DrillScanRel.class))),
-          "DrillPushLimitToScanRule_LimitOnProject") {
+  public static DrillPushLimitToScanRule LIMIT_ON_PROJECT = new DrillPushLimitToScanRule(
+      RelOptHelper.some(DrillLimitRel.class, RelOptHelper.any(DrillProjectRel.class)), "DrillPushLimitToScanRule_LimitOnProject") {
     @Override
     public boolean matches(RelOptRuleCall call) {
       DrillLimitRel limitRel = call.rel(0);
-      DrillScanRel scanRel = call.rel(2);
-      // For now only applies to Parquet. And pushdown only apply limit but not offset,
+      DrillProjectRel projectRel = call.rel(1);
+      // pushdown only apply limit but not offset,
       // so if getFetch() return null no need to run this rule.
-      if (scanRel.getGroupScan().supportsLimitPushdown() && (limitRel.getFetch() != null)) {
--- End diff --

We still have the LIMIT_ON_SCAN rule which does that check. This rule is 
changed from LIMIT_PROJECT_SCAN to LIMIT_PROJECT. The LIMIT_SCAN along with the 
LIMIT_PROJECT would work as the LIMIT_PROJECT_SCAN.
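A toy illustration of this rule composition (not Drill or Calcite code; plans are modeled as strings purely to show the idea). Splitting the old combined LIMIT_PROJECT_SCAN pattern into two smaller rewrites still covers the same plans when both rules fire:

```java
public class RuleCompositionDemo {
  // Stand-in for LIMIT_PROJECT: Limit(Project(x)) -> Project(Limit(x))
  static String limitProject(String plan) {
    return plan.replace("Limit(Project(", "Project(Limit(");
  }

  // Stand-in for LIMIT_SCAN: Limit(Scan) -> a scan with the limit pushed in
  static String limitScan(String plan) {
    return plan.replace("Limit(Scan)", "ScanWithLimit");
  }

  public static void main(String[] args) {
    String plan = "Limit(Project(Scan))";
    // Applying the two small rules in sequence reproduces the effect of
    // the old single LIMIT_PROJECT_SCAN rule.
    System.out.println(limitScan(limitProject(plan))); // Project(ScanWithLimit)
  }
}
```

The design choice being defended is exactly this: neither small rule alone handles the full pattern, so they are kept together so the optimizer can chain them.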


---


[GitHub] drill pull request #1096: DRILL-6099 : Push limit past flatten(project) with...

2018-01-29 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/1096#discussion_r164593749
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushLimitToScanRule.java ---
@@ -121,4 +132,50 @@ protected void doOnMatch(RelOptRuleCall call, DrillLimitRel limitRel, DrillScanR
     }
 
   }
+
+  private static boolean isProjectFlatten(RelNode project) {
--- End diff --

Done.


---


[GitHub] drill issue #1096: DRILL-6099 : Push limit past flatten(project) without pus...

2018-01-27 Thread gparai
Github user gparai commented on the issue:

https://github.com/apache/drill/pull/1096
  
@chunhui-shi can you please review the new changes (in commit e6dcf14)? 
Thanks!


---


[GitHub] drill pull request #1096: DRILL - 6099 : Push limit past flatten(project) wi...

2018-01-19 Thread gparai
GitHub user gparai opened a pull request:

https://github.com/apache/drill/pull/1096

DRILL - 6099 : Push limit past flatten(project) without pushdown into scan

@amansinha100 @chunhui-shi can you please review this PR? Thanks!

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gparai/drill DRILL-6099-master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1096.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1096


commit dbd1b22a47de7493d1ead97c699763aad17584f7
Author: Gautam Parai <gparai@...>
Date:   2018-01-18T23:46:42Z

DRILL - 6099 : Push limit past flatten(project) without pushdown into scan




---


[GitHub] drill issue #1093: DRILL-6093 : Account for simple columns in project cpu co...

2018-01-18 Thread gparai
Github user gparai commented on the issue:

https://github.com/apache/drill/pull/1093
  
@amansinha100 I have addressed your comments.


---


[GitHub] drill pull request #1093: DRILL-6093 : Account for simple columns in project...

2018-01-16 Thread gparai
GitHub user gparai opened a pull request:

https://github.com/apache/drill/pull/1093

DRILL-6093 : Account for simple columns in project cpu costing

@amansinha100 can you please review this PR? Thanks!

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gparai/drill DRILL-6093-master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/1093.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1093


commit 30f1934e6b57aeaf33f72ac701bd431f2c11e403
Author: Gautam Parai <gparai@...>
Date:   2018-01-16T23:16:16Z

DRILL-6093 : Account for simple columns in project cpu costing




---


[GitHub] drill issue #979: DRILL-5853 : Update Calcite to get NULL direction for sort...

2017-10-06 Thread gparai
Github user gparai commented on the issue:

https://github.com/apache/drill/pull/979
  
@amansinha100 Can you please review the PR? Thanks!


---


[GitHub] drill pull request #979: DRILL-5853 : Update Calcite to get NULL direction f...

2017-10-06 Thread gparai
GitHub user gparai opened a pull request:

https://github.com/apache/drill/pull/979

DRILL-5853 : Update Calcite to get NULL direction for sort removal



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/gparai/drill Drill-5853-ACM

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/979.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #979


commit 89b01d45c5d26d4cc1f20d75fd26d25fc5765bc5
Author: Gautam Parai <gpa...@maprtech.com>
Date:   2017-10-07T01:48:42Z

DRILL-5853 : Update Calcite to get NULL direction for sort removal




---


[GitHub] drill issue #729: Drill 1328: Support table statistics for Parquet

2017-06-23 Thread gparai
Github user gparai commented on the issue:

https://github.com/apache/drill/pull/729
  
Thanks for the reminder @paul-rogers. Based on the last discussion with the 
reviewers and Drill community members, we would hold off on the PR because it 
also causes regressions in queries in TPC-H, TPC-DS benchmarks. We identified 
that we need histograms and other enhancements to fully address the 
regressions. I will post a new PR once these issues are addressed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill issue #811: DRILL-5423: Refactor ScanBatch to allow unit testing recor...

2017-05-04 Thread gparai
Github user gparai commented on the issue:

https://github.com/apache/drill/pull/811
  
Thanks for making the changes Paul. 

LGTM +1


---


[GitHub] drill pull request #817: DRILL-5429: Improve query performance for MapR DB J...

2017-05-02 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/817#discussion_r114429838
  
--- Diff: contrib/format-maprdb/src/main/java/org/apache/drill/exec/store/mapr/db/json/JsonTableGroupScan.java ---
@@ -100,30 +104,46 @@ public GroupScan clone(List columns) {
     return newScan;
   }
 
+  public JsonTableGroupScan clone(JsonScanSpec scanSpec) {
+    JsonTableGroupScan newScan = new JsonTableGroupScan(this);
+    newScan.scanSpec = scanSpec;
+    newScan.computeRegionsToScan();
+    return newScan;
+  }
+
+  private void computeRegionsToScan() {
+    boolean foundStartRegion = false;
+
+    regionsToScan = new TreeMap<TabletFragmentInfo, String>();
+    for (TabletInfo tabletInfo : tabletInfos) {
+      TabletInfoImpl tabletInfoImpl = (TabletInfoImpl) tabletInfo;
+      if (!foundStartRegion && !isNullOrEmpty(scanSpec.getStartRow()) && !tabletInfoImpl.containsRow(scanSpec.getStartRow())) {
+        continue;
+      }
+      foundStartRegion = true;
+      regionsToScan.put(new TabletFragmentInfo(tabletInfoImpl), tabletInfo.getLocations()[0]);
+      if (!isNullOrEmpty(scanSpec.getStopRow()) && tabletInfoImpl.containsRow(scanSpec.getStopRow())) {
+        break;
+      }
+    }
+  }
+
   private void init() {
     logger.debug("Getting tablet locations");
     try {
       Configuration conf = new Configuration();
-      Table t = MapRDB.getTable(scanSpec.getTableName());
-      TabletInfo[] tabletInfos = t.getTabletInfos(scanSpec.getCondition());
-      tableStats = new MapRDBTableStats(conf, scanSpec.getTableName());
 
-      boolean foundStartRegion = false;
-      regionsToScan = new TreeMap<TabletFragmentInfo, String>();
+      // Fetch table and tabletInfo only once and cache.
+      table = MapRDB.getTable(scanSpec.getTableName());
+      tabletInfos = table.getTabletInfos(scanSpec.getCondition());
+
+      // Calculate totalRowCount for the table
       for (TabletInfo tabletInfo : tabletInfos) {
-        TabletInfoImpl tabletInfoImpl = (TabletInfoImpl) tabletInfo;
-        if (!foundStartRegion
-            && !isNullOrEmpty(scanSpec.getStartRow())
-            && !tabletInfoImpl.containsRow(scanSpec.getStartRow())) {
-          continue;
-        }
-        foundStartRegion = true;
-        regionsToScan.put(new TabletFragmentInfo(tabletInfoImpl), tabletInfo.getLocations()[0]);
-        if (!isNullOrEmpty(scanSpec.getStopRow())
-            && tabletInfoImpl.containsRow(scanSpec.getStopRow())) {
-          break;
-        }
+        totalRowCount += tabletInfo.getEstimatedNumRows();
       }
+
--- End diff --

Please add your explanation as a comment:
> We should recompute regionsToScan as it depends upon scanSpec


---


[GitHub] drill pull request #817: DRILL-5429: Improve query performance for MapR DB J...

2017-05-02 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/817#discussion_r114430230
  
--- Diff: contrib/format-maprdb/src/main/java/org/apache/drill/exec/store/mapr/db/json/JsonTableGroupScan.java ---
@@ -100,30 +104,46 @@ public GroupScan clone(List columns) {
     return newScan;
   }
 
+  public JsonTableGroupScan clone(JsonScanSpec scanSpec) {
+    JsonTableGroupScan newScan = new JsonTableGroupScan(this);
+    newScan.scanSpec = scanSpec;
+    newScan.computeRegionsToScan();
+    return newScan;
+  }
+
+  private void computeRegionsToScan() {
+    boolean foundStartRegion = false;
+
+    regionsToScan = new TreeMap<TabletFragmentInfo, String>();
+    for (TabletInfo tabletInfo : tabletInfos) {
+      TabletInfoImpl tabletInfoImpl = (TabletInfoImpl) tabletInfo;
+      if (!foundStartRegion && !isNullOrEmpty(scanSpec.getStartRow()) && !tabletInfoImpl.containsRow(scanSpec.getStartRow())) {
+        continue;
+      }
+      foundStartRegion = true;
+      regionsToScan.put(new TabletFragmentInfo(tabletInfoImpl), tabletInfo.getLocations()[0]);
+      if (!isNullOrEmpty(scanSpec.getStopRow()) && tabletInfoImpl.containsRow(scanSpec.getStopRow())) {
+        break;
+      }
+    }
+  }
+
   private void init() {
     logger.debug("Getting tablet locations");
     try {
       Configuration conf = new Configuration();
-      Table t = MapRDB.getTable(scanSpec.getTableName());
-      TabletInfo[] tabletInfos = t.getTabletInfos(scanSpec.getCondition());
-      tableStats = new MapRDBTableStats(conf, scanSpec.getTableName());
 
-      boolean foundStartRegion = false;
-      regionsToScan = new TreeMap<TabletFragmentInfo, String>();
+      // Fetch table and tabletInfo only once and cache.
+      table = MapRDB.getTable(scanSpec.getTableName());
+      tabletInfos = table.getTabletInfos(scanSpec.getCondition());
+
+      // Calculate totalRowCount for the table
--- End diff --

Please add a comment explaining why we compute the totalRowCount like so: `totalRowCount += tabletInfo.getEstimatedNumRows();`


---


[GitHub] drill pull request #817: DRILL-5429: Improve query performance for MapR DB J...

2017-05-02 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/817#discussion_r114429261
  
--- Diff: contrib/format-maprdb/src/main/java/org/apache/drill/exec/store/mapr/db/json/JsonTableGroupScan.java ---
@@ -100,30 +104,46 @@ public GroupScan clone(List columns) {
     return newScan;
   }
 
+  public JsonTableGroupScan clone(JsonScanSpec scanSpec) {
+    JsonTableGroupScan newScan = new JsonTableGroupScan(this);
+    newScan.scanSpec = scanSpec;
+    newScan.computeRegionsToScan();
+    return newScan;
+  }
+
--- End diff --

Please add comments describing the function


---


[GitHub] drill pull request #811: DRILL-5423: Refactor ScanBatch to allow unit testin...

2017-05-02 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/811#discussion_r114399171
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/OperatorExecContext.java ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.ops;
+
+import org.apache.drill.exec.memory.BufferAllocator;
+import org.apache.drill.exec.testing.ExecutionControls;
+
+import io.netty.buffer.DrillBuf;
+
+/**
+ * Narrowed version of the {@link OpeartorContext} used to create an
--- End diff --

Typo -> OpeartorContext


---


[GitHub] drill pull request #811: DRILL-5423: Refactor ScanBatch to allow unit testin...

2017-05-02 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/811#discussion_r114411352
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/ScanBatch.java ---
@@ -323,11 +318,31 @@ public TypedFieldId getValueVectorId(SchemaPath path) {
 return container.getValueAccessorById(clazz, ids);
   }
 
-  private class Mutator implements OutputMutator {
+  public static class Mutator implements OutputMutator {
--- End diff --

Maybe add comments about `Mutator` since the class is `public static`?


---


[GitHub] drill pull request #811: DRILL-5423: Refactor ScanBatch to allow unit testin...

2017-05-02 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/811#discussion_r114403519
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/AbstractOperatorExecContext.java ---
@@ -0,0 +1,94 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.ops;
+
+import org.apache.drill.exec.memory.BufferAllocator;
+import org.apache.drill.exec.physical.base.PhysicalOperator;
+import org.apache.drill.exec.testing.ExecutionControls;
+
+import io.netty.buffer.DrillBuf;
+
+/**
+ * Implementation of {@link OperatorExecContext} that provides services
+ * needed by most run-time operators. Excludes services that need the
+ * entire Drillbit. Allows easy testing of operator code that uses this
+ * interface.
+ */
+
+public class AbstractOperatorExecContext implements OperatorExecContext {
+
+  protected final BufferAllocator allocator;
+  protected final ExecutionControls executionControls;
+  protected final PhysicalOperator popConfig;
+  protected final BufferManager manager;
+  protected OperatorStatReceiver statsWriter;
+
+  public AbstractOperatorExecContext(BufferAllocator allocator, PhysicalOperator popConfig,
+                                     ExecutionControls executionControls,
+                                     OperatorStatReceiver stats) {
+this.allocator = allocator;
+this.popConfig = popConfig;
+this.manager = new BufferManagerImpl(allocator);
+statsWriter = stats;
+
+this.executionControls = executionControls;
+  }
+
+  @Override
+  public DrillBuf replace(DrillBuf old, int newSize) {
+return manager.replace(old, newSize);
+  }
+
+  @Override
+  public DrillBuf getManagedBuffer() {
+return manager.getManagedBuffer();
+  }
+
+  @Override
+  public DrillBuf getManagedBuffer(int size) {
+return manager.getManagedBuffer(size);
+  }
+
+  @Override
+  public ExecutionControls getExecutionControls() {
+return executionControls;
+  }
+
+  @Override
+  public BufferAllocator getAllocator() {
+if (allocator == null) {
+  throw new UnsupportedOperationException("Operator context does not have an allocator");
+}
+return allocator;
+  }
+
+  public void close() {
+try {
+  manager.close();
+} finally {
+  if (allocator != null) {
+allocator.close();
+  }
+}
+  }
+
+  @Override
+  public OperatorStatReceiver getStatsWriter() {
+return statsWriter;
--- End diff --

This may be called prior to initializing `statsWriter = stats` in the `OperatorContextImpl` constructor. Should we add an assert?
OR
Should we even pass `OperatorStatReceiver stats` in the `AbstractOperatorExecContext` constructor - maybe only use a getter/setter for `statsWriter`?
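A hedged sketch of the getter/setter alternative being suggested (the class and interface names below are illustrative, not Drill's actual API): take the receiver out of the constructor, set it later, and guard the getter against access before initialization.

```java
public class StatsHolderDemo {
  // Illustrative stand-in for OperatorStatReceiver.
  interface StatReceiver { }

  private StatReceiver statsWriter;

  // Setter replaces the constructor parameter, so the value can be
  // supplied after the enclosing context is fully constructed.
  public void setStatsWriter(StatReceiver stats) {
    this.statsWriter = stats;
  }

  // Getter fails fast instead of silently returning null.
  public StatReceiver getStatsWriter() {
    if (statsWriter == null) {
      throw new IllegalStateException("statsWriter accessed before initialization");
    }
    return statsWriter;
  }

  public static void main(String[] args) {
    StatsHolderDemo ctx = new StatsHolderDemo();
    try {
      ctx.getStatsWriter();
    } catch (IllegalStateException e) {
      System.out.println("guarded: " + e.getMessage());
    }
    ctx.setStatsWriter(new StatReceiver() { });
    System.out.println(ctx.getStatsWriter() != null); // true
  }
}
```

The fail-fast getter makes the ordering bug the comment worries about surface immediately rather than as a null propagated downstream.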


---


[GitHub] drill pull request #811: DRILL-5423: Refactor ScanBatch to allow unit testin...

2017-05-02 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/811#discussion_r114398903
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/OperatorContextImpl.java ---
@@ -143,20 +104,18 @@ public void close() {
     }
     logger.debug("Closing context for {}", popConfig != null ? popConfig.getClass().getName() : null);
 
-manager.close();
-
-if (allocator != null) {
-  allocator.close();
-}
-
-if (fs != null) {
-  try {
-fs.close();
-  } catch (IOException e) {
-throw new DrillRuntimeException(e);
+closed = true;
--- End diff --

What is the rationale for moving `closed = true` earlier? What would happen if we have an exception while closing?
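A toy illustration of the tradeoff this question raises (not the actual Drill class; names are made up). Setting the flag before cleanup makes `close()` idempotent even when cleanup throws, but the object then reports itself closed although some resources may not have been released:

```java
public class CloseOrderDemo {
  private boolean closed = false;
  private int cleanupRuns = 0;

  public void close() {
    if (closed) {
      return; // repeated call is a no-op
    }
    closed = true;      // set early: re-entry is blocked even if cleanup fails
    releaseResources(); // if this throws, 'closed' is already true
  }

  private void releaseResources() {
    cleanupRuns++; // stand-in for closing allocators, file systems, etc.
  }

  public boolean isClosed() {
    return closed;
  }

  public int cleanupRuns() {
    return cleanupRuns;
  }

  public static void main(String[] args) {
    CloseOrderDemo ctx = new CloseOrderDemo();
    ctx.close();
    ctx.close(); // second call is skipped
    System.out.println(ctx.isClosed());    // true
    System.out.println(ctx.cleanupRuns()); // 1
  }
}
```

Setting the flag last would have the opposite property: an exception mid-cleanup leaves the object "open", so a retrying caller would re-run cleanup, which is exactly the behavioral question the review asks the author to justify.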


---


[GitHub] drill pull request #811: DRILL-5423: Refactor ScanBatch to allow unit testin...

2017-05-02 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/811#discussion_r114399420
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/ops/OperatorUtilities.java ---
@@ -0,0 +1,41 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.ops;
+
+import java.util.Iterator;
+
+import org.apache.drill.exec.physical.base.PhysicalOperator;
+
+public class OperatorUtilities {
+
+  private OperatorUtilities() { }
+
+  public static int getChildCount(PhysicalOperator popConfig) {
+Iterator iter = popConfig.iterator();
+int i = 0;
--- End diff --

Refactor i -> count


---


[GitHub] drill pull request #817: DRILL-5429: Cache tableStats per query for MapR DB ...

2017-04-18 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/817#discussion_r112072391
  
--- Diff: 
contrib/format-maprdb/src/main/java/org/apache/drill/exec/store/mapr/db/MapRDBPushFilterIntoScan.java
 ---
@@ -137,11 +137,13 @@ protected void 
doPushFilterIntoJsonGroupScan(RelOptRuleCall call,
   return; //no filter pushdown ==> No transformation.
 }
 
+// Pass tableStats from old groupScan so we do not go and fetch stats 
(an expensive operation) again from MapR DB client.
 final JsonTableGroupScan newGroupsScan = new 
JsonTableGroupScan(groupScan.getUserName(),
 
groupScan.getStoragePlugin(),
 
groupScan.getFormatPlugin(),
 
newScanSpec,
-
groupScan.getColumns());
+
groupScan.getColumns(),
+
groupScan.getTableStats());
--- End diff --

We should try to use `clone()` here. All we are doing is copying stuff from 
one groupscan to another. `JsonTableGroupScan` already has a clone which clones 
everything except columns.

`@Override
public GroupScan clone(List<SchemaPath> columns) {
  JsonTableGroupScan newScan = new JsonTableGroupScan(this);
  newScan.columns = columns;
  return newScan;
}`
We can create another overload that clones everything except the scanSpec. 
This can be used to pass in the `newScanSpec` generated here. Doing this would 
also clone `regionsToScan`, saving the call to `init()`.
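As a sketch of the copy-constructor idiom being suggested - with a toy class standing in for `JsonTableGroupScan`, since the real one needs the Drill codebase - one overload swaps the columns, another swaps the scan spec, and cached state such as table stats is carried over instead of being refetched:

```java
import java.util.List;

// Toy stand-in for a group scan with a copy constructor and two clone
// overloads; cachedRowCount models expensive state that should be copied.
class ToyGroupScan {
  List<String> columns;
  String scanSpec;
  long cachedRowCount;

  ToyGroupScan(List<String> columns, String scanSpec, long cachedRowCount) {
    this.columns = columns;
    this.scanSpec = scanSpec;
    this.cachedRowCount = cachedRowCount;
  }

  // Copy constructor: carries everything over, including cached stats.
  ToyGroupScan(ToyGroupScan that) {
    this(that.columns, that.scanSpec, that.cachedRowCount);
  }

  // Existing style of clone: everything except columns.
  ToyGroupScan cloneWithColumns(List<String> columns) {
    ToyGroupScan newScan = new ToyGroupScan(this);
    newScan.columns = columns;
    return newScan;
  }

  // Proposed overload: everything except the scan spec, so no re-init.
  ToyGroupScan cloneWithScanSpec(String scanSpec) {
    ToyGroupScan newScan = new ToyGroupScan(this);
    newScan.scanSpec = scanSpec;
    return newScan;
  }
}
```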


---


[GitHub] drill pull request #817: DRILL-5429: Cache tableStats per query for MapR DB ...

2017-04-18 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/817#discussion_r112072733
  
--- Diff: 
contrib/format-maprdb/src/main/java/org/apache/drill/exec/store/mapr/db/json/JsonTableGroupScan.java
 ---
@@ -106,7 +115,11 @@ private void init() {
   Configuration conf = new Configuration();
   Table t = MapRDB.getTable(scanSpec.getTableName());
   TabletInfo[] tabletInfos = t.getTabletInfos(scanSpec.getCondition());
-  tableStats = new MapRDBTableStats(conf, scanSpec.getTableName());
+
+  // Fetch tableStats only once and cache it.
+  if (tableStats == null) {
--- End diff --

This can probably be removed if we call clone(). However, it may be a 
useful check if we end up calling it from some other code-paths. Maybe add some 
logging to ensure we are not recreating the tableStats?
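A minimal sketch of the guarded lazy fetch plus the suggested logging (toy names; the real code talks to the MapR DB client): stats are fetched at most once, and a call that would otherwise recreate them is logged instead.

```java
// Toy sketch of a fetch-once stats cache with a log line on cache hits,
// so an unexpected second fetch path would be visible in the logs.
class StatsCacheSketch {
  private Long tableStats;   // null until first fetch
  int fetchCount = 0;        // exposed for testing

  long getStats(String tableName) {
    if (tableStats == null) {
      fetchCount++;
      tableStats = expensiveFetch(tableName);
    } else {
      // Reviewer's suggestion: make a would-be refetch observable.
      System.err.println("tableStats already cached for " + tableName
          + "; skipping refetch");
    }
    return tableStats;
  }

  private long expensiveFetch(String tableName) {
    return 500L; // stands in for a MapR DB client round trip
  }
}
```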


---


[GitHub] drill issue #787: DRILL-5319: Refactor "contexts" for unit testing

2017-04-06 Thread gparai
Github user gparai commented on the issue:

https://github.com/apache/drill/pull/787
  
Thanks for the detailed and clear comments.
+1 LGTM


---


[GitHub] drill pull request #787: DRILL-5319: Refactor "contexts" for unit testing

2017-04-05 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/787#discussion_r110018766
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentExecContext.java 
---
@@ -0,0 +1,49 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.ops;
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.drill.common.config.DrillConfig;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.fn.FunctionImplementationRegistry;
+import org.apache.drill.exec.server.options.OptionSet;
+import org.apache.drill.exec.testing.ExecutionControls;
+
+public interface FragmentExecContext {
+  FunctionImplementationRegistry getFunctionRegistry();
+  OptionSet getOptionSet();
+
+  <T> T getImplementationClass(final ClassGenerator<T> cg)
--- End diff --

It would be good to have comments describing the interface methods.


---


[GitHub] drill pull request #787: DRILL-5319: Refactor "contexts" for unit testing

2017-04-05 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/787#discussion_r109986447
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/options/TypeValidators.java
 ---
@@ -90,19 +90,19 @@ public void validate(final OptionValue v, final 
OptionManager manager) {
   }
 
   public static class MinRangeDoubleValidator extends RangeDoubleValidator 
{
-private final double min;
-private final double max;
+//private final double min;
--- End diff --

Please remove the comments here and below


---


[GitHub] drill pull request #787: DRILL-5319: Refactor "contexts" for unit testing

2017-04-05 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/787#discussion_r109982950
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/server/options/OptionSet.java
 ---
@@ -0,0 +1,72 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.server.options;
+
+/**
+ * Immutable set of options accessible by name or validator.
+ */
+
+public interface OptionSet {
--- End diff --

Maybe rename to OptionReader instead of OptionSet. An OptionManager would 
extend an OptionReader and OptionModifier interfaces.
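A sketch of the proposed split, with toy method sets (the real Drill interfaces have many typed getters; the names here only illustrate the shape): read-only access lives in one interface, mutation in another, and the manager composes both.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical shape of the proposed hierarchy: OptionReader for lookups,
// OptionModifier for writes, OptionManager extending both.
interface OptionReader {
  Object getOption(String name);
}

interface OptionModifier {
  void setOption(String name, Object value);
}

interface OptionManager extends OptionReader, OptionModifier { }

// Trivial map-backed implementation showing the composition compiles.
class SimpleOptionManager implements OptionManager {
  private final Map<String, Object> opts = new HashMap<>();

  @Override
  public Object getOption(String name) { return opts.get(name); }

  @Override
  public void setOption(String name, Object value) { opts.put(name, value); }
}
```

Code that only reads options could then take an `OptionReader`, which is easy to stub in unit tests.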


---


[GitHub] drill issue #802: DRILL-5394: Optimize query planning for MapR-DB tables by ...

2017-03-30 Thread gparai
Github user gparai commented on the issue:

https://github.com/apache/drill/pull/802
  
The new changes look good! Thanks for further refactoring the code. LGTM  +1


---


[GitHub] drill issue #802: DRILL-5394: Optimize query planning for MapR-DB tables by ...

2017-03-30 Thread gparai
Github user gparai commented on the issue:

https://github.com/apache/drill/pull/802
  
Further code refactoring may be worth the effort: we leave tableStats 
as-is and pass it around instead of using the new `rowCount`. This would 
clean up the code further and make it more robust to future changes in 
tableStats.

However, in the interest of time I would not hold up the pull request for 
the above code refactoring.
+1 


---


[GitHub] drill pull request #802: DRILL-5394: Optimize query planning for MapR-DB tab...

2017-03-29 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/802#discussion_r108823495
  
--- Diff: 
contrib/format-maprdb/src/main/java/org/apache/drill/exec/store/mapr/db/MapRDBPushFilterIntoScan.java
 ---
@@ -184,6 +184,8 @@ protected void doPushFilterIntoBinaryGroupScan(final 
RelOptRuleCall call,
   return; //no filter pushdown ==> No transformation.
 }
 
+// Set rowCount in newScanSpec so we do not go and fetch rowCount (an 
expensive operation) again from MapR DB client.
+newScanSpec.setRowCount(groupScan.getHBaseScanSpec().getRowCount());
--- End diff --

Change the constructor for BinaryTableGroupScan to also pass in TableStats, 
and add a getter for TableStats. This should be sufficient for passing the 
row count around.


---


[GitHub] drill pull request #802: DRILL-5394: Optimize query planning for MapR-DB tab...

2017-03-29 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/802#discussion_r108822233
  
--- Diff: 
contrib/format-maprdb/src/main/java/org/apache/drill/exec/store/mapr/db/binary/BinaryTableGroupScan.java
 ---
@@ -115,8 +112,11 @@ private void init() {
 try (Admin admin = formatPlugin.getConnection().getAdmin();
  RegionLocator locator = 
formatPlugin.getConnection().getRegionLocator(tableName)) {
   hTableDesc = admin.getTableDescriptor(tableName);
-  tableStats = new MapRDBTableStats(getHBaseConf(), 
hbaseScanSpec.getTableName());
-
+  // Fetch rowCount only once and cache it in hbaseScanSpec.
+  if (hbaseScanSpec.getRowCount() == hbaseScanSpec.ROW_COUNT_UNKNOWN) {
+MapRDBTableStats tableStats = new MapRDBTableStats(getHBaseConf(), 
hbaseScanSpec.getTableName());
+hbaseScanSpec.setRowCount(tableStats.getNumRows());
--- End diff --

This looks weird. We create the TableStats relying on some information from 
the ScanSpec and then proceed to modify the same ScanSpec with the information 
retrieved from TableStats. Please look at the comment below as well regarding 
Spec mutability.

Should we instead just overload the MapRDBTableStats constructor to allow 
passing numRows - since that is what the existing constructor ends up doing, 
except that it makes a call to the DB client? So instead of populating the 
ScanSpec, we populate the tableStats using this new constructor?
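A sketch of the suggested constructor overload (toy class; the real `MapRDBTableStats` takes a Hadoop `Configuration` and calls the MapR DB client): the existing constructor fetches the row count, while the new one accepts a count that is already known and skips the round trip.

```java
// Toy sketch of the proposed overload on a table-stats holder.
class ToyTableStats {
  private final long numRows;

  // Existing behavior: an expensive fetch (simulated here).
  ToyTableStats(String tableName) {
    this.numRows = fetchRowCountFromClient(tableName);
  }

  // Proposed overload: reuse an already-known row count, no client call.
  ToyTableStats(long numRows) {
    this.numRows = numRows;
  }

  long getNumRows() { return numRows; }

  private static long fetchRowCountFromClient(String tableName) {
    return 1_000L; // stands in for a MapR DB client call
  }
}
```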


---


[GitHub] drill issue #801: DRILL-5378: Put more information for schema change excepti...

2017-03-29 Thread gparai
Github user gparai commented on the issue:

https://github.com/apache/drill/pull/801
  
Would it be a good idea to add tests for the same? Otherwise, LGTM.


---


[GitHub] drill pull request #801: DRILL-5378: Put more information for schema change ...

2017-03-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/801#discussion_r108492668
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/exception/SchemaChangeException.java
 ---
@@ -49,4 +50,16 @@ public SchemaChangeException(String message, 
Object...objects){
   public SchemaChangeException(String message, Throwable cause, 
Object...objects){
 super(String.format(message, objects), cause);
   }
+
+  public static SchemaChangeException schemChanged(String message, 
BatchSchema priorSchema, BatchSchema newSchema) {
--- End diff --

Please correct the typo: schemChanged to schemaChanged


---


[GitHub] drill pull request #801: DRILL-5378: Put more information for schema change ...

2017-03-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/801#discussion_r108493803
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -324,9 +328,9 @@ public AggOutcome doWork() {
   logger.debug("Received new schema.  Batch has {} 
records.", incoming.getRecordCount());
 }
 //newSchema = true;
-this.cleanup();
 // TODO: new schema case needs to be handled appropriately
-return AggOutcome.UPDATE_AGGREGATOR;
+this.cleanup();
+throw SchemaChangeException.schemChanged("Hash aggregate 
does not support schema changes", schema, incoming.getSchema());
--- End diff --

Please correct to the renamed function here and elsewhere.


---


[GitHub] drill pull request #801: DRILL-5378: Put more information for schema change ...

2017-03-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/801#discussion_r108493691
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java
 ---
@@ -278,7 +281,7 @@ public void setup(HashAggregate hashAggrConfig, 
HashTableConfig htConfig, Fragme
   }
 
   @Override
-  public AggOutcome doWork() {
+  public AggOutcome doWork() throws SchemaChangeException {
--- End diff --

`doWork()` is used elsewhere as well, e.g. StreamingAggregator. Would it 
make sense to address the other `doWork()`s in this PR? Better yet, factor out 
another interface which HashAggregator/StreamingAggregator implement - this 
would contain `public AggOutcome doWork() throws SchemaChangeException`.
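The factored-out interface could look like the following sketch (toy stand-ins for Drill's types; the real `AggOutcome` and `SchemaChangeException` live in the Drill codebase):

```java
// Toy stand-ins so the sketch is self-contained.
class SchemaChangeException extends Exception {
  SchemaChangeException(String message) { super(message); }
}

enum AggOutcome { RETURN_OUTCOME, CLEANUP_AND_RETURN, UPDATE_AGGREGATOR }

// The shared interface: both aggregators declare the checked exception once.
interface Aggregator {
  AggOutcome doWork() throws SchemaChangeException;
}

class HashAggregatorSketch implements Aggregator {
  @Override
  public AggOutcome doWork() throws SchemaChangeException {
    // A hash aggregate cannot handle a schema change mid-stream.
    throw new SchemaChangeException("Hash aggregate does not support schema changes");
  }
}

class StreamingAggregatorSketch implements Aggregator {
  @Override
  public AggOutcome doWork() {
    return AggOutcome.RETURN_OUTCOME; // happy path, no schema change
  }
}
```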


---


[GitHub] drill issue #729: Drill 1328: Support table statistics for Parquet

2017-03-01 Thread gparai
Github user gparai commented on the issue:

https://github.com/apache/drill/pull/729
  
I have addressed the second set of comments. @amansinha100 @paul-rogers 
could you please review and/or approve? Thanks!


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-03-01 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103636828
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/ContextInformation.java 
---
@@ -29,13 +29,15 @@
   private final long queryStartTime;
   private final int rootFragmentTimeZone;
   private final String sessionId;
+  private final int hllAccuracy;
 
   public ContextInformation(final UserCredentials userCredentials, final 
QueryContextInformation queryContextInfo) {
 this.queryUser = userCredentials.getUserName();
 this.currentDefaultSchema = queryContextInfo.getDefaultSchemaName();
 this.queryStartTime = queryContextInfo.getQueryStartTime();
 this.rootFragmentTimeZone = queryContextInfo.getTimeZone();
 this.sessionId = queryContextInfo.getSessionId();
+this.hllAccuracy = queryContextInfo.getHllAccuracy();
--- End diff --

This is not required for use within the `RecordBatch` but within the UDFs. 
The `ContextInformation` seems to be the existing mechanism to pass information 
to the UDFs.


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-03-01 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103636512
  
--- Diff: 
protocol/src/main/java/org/apache/drill/exec/proto/beans/QueryContextInformation.java
 ---
@@ -51,6 +51,7 @@ public static QueryContextInformation getDefaultInstance()
 private int timeZone;
 private String defaultSchemaName;
 private String sessionId;
+private int hllAccuracy;
--- End diff --

hllAccuracy is required within the HLL UDFs. The only way I found to pass 
information inside the UDFs is by injecting the `@Inject ContextInformation` 
available in `UDFUtilities.java`.
Hence, I define it in the QueryContext, eventually passing it to the 
`ContextInformation`. Please suggest if there is a better way to do the same.


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103619932
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AbstractMergedStatistic.java
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.exec.vector.ValueVector;
+
+public abstract class AbstractMergedStatistic extends Statistic implements 
MergedStatistic {
+  @Override
+  public String getName() {
+throw new UnsupportedOperationException("getName() not implemented");
+  }
+
+  @Override
+  public String getInput() {
+throw new UnsupportedOperationException("getInput() not implemented");
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+throw new UnsupportedOperationException("merge() not implemented");
+  }
+
+  @Override
+  public Object getStat(String colName) {
+throw new UnsupportedOperationException("getStat() not implemented");
+  }
+
+  @Override
+  public void setOutput(ValueVector output) {
+throw new UnsupportedOperationException("getOutput() not implemented");
+  }
+
+  @Override
+  public void configure(Object configurations) {
--- End diff --

Changed the types to be specific. This config does not do hash lookups. 
However, functions like `merge()` and `setOutput()` do hash lookups, which 
might not be as expensive given the alternative (column mappings for 
`MapVector`, which need to be recomputed for each incoming batch). Also, all 
the `MergedStatistic`s and the `StatisticsMergeBatch` contain a single map, and 
given that neither the incoming batches (one per minor fragment) nor the 
outgoing batches (1) would be many, we can keep the code simple here.


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103618796
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging 
statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  return (long) (colSumHolder.value/ getRowCount(colName));
+}
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+// Dependencies have been configured correctly
+assert (configureComplete == true);
+MapVector outputMap = (MapVector) output;
+
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) 
sumHolder.get(colName);
+  NullableFloat8Vector vv = (NullableFloat8Vector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, (colSumHolder.value / 
getRowCount(colName)));
+}
+mergeComplete = true;
+  }
+
+  @Override
+  public void configure(Object configurations) {
+List<MergedStatistic> statistics = (List<MergedStatistic>) 
configurations;
--- End diff --

Done


---

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103617262
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == 
TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  return (long) (colSumHolder.value / getRowCount(colName));
+}
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == TypeProtos.MinorType.MAP);
+// Dependencies have been configured correctly
+assert (configureComplete == true);
+MapVector outputMap = (MapVector) output;
+
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  NullableFloat8Vector vv = (NullableFloat8Vector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, (colSumHolder.value / getRowCount(colName)));
+}
+mergeComplete = true;
+  }
+
+  @Override
+  public void configure(Object configurations) {
+List<MergedStatistic> statistics = (List<MergedStatistic>) configurations;
+for (MergedStatistic statistic : statistics) {
+  if (statistic.getName().equals("type")) {
+types = statistic;
+  } else if (statistic.getName().equals("statcount")) {
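The merge logic quoted above keeps a per-column running sum across incoming batches and divides by the row count when the statistic is read. A minimal sketch of that accumulation pattern, outside of Drill's value-vector machinery (the `AvgWidthSketch`, `mergeBatch`, and `avgWidth` names below are illustrative, not Drill APIs):

```java
import java.util.HashMap;
import java.util.Map;

public class AvgWidthSketch {
  // Per-column running sum of widths, merged across incoming batches.
  private final Map<String, Double> sums = new HashMap<>();
  private long rowCount = 0;

  // Merge one batch: each entry maps a column name to the total width
  // observed for that column in the batch.
  public void mergeBatch(Map<String, Double> batchWidths, long batchRows) {
    for (Map.Entry<String, Double> e : batchWidths.entrySet()) {
      sums.merge(e.getKey(), e.getValue(), Double::sum);
    }
    rowCount += batchRows;
  }

  // The average is only meaningful once all batches are merged,
  // mirroring the mergeComplete check in the quoted code.
  public long avgWidth(String col) {
    if (rowCount == 0) {
      throw new IllegalStateException("no rows merged yet");
    }
    return (long) (sums.getOrDefault(col, 0.0) / rowCount);
  }

  public static void main(String[] args) {
    AvgWidthSketch s = new AvgWidthSketch();
    Map<String, Double> b1 = new HashMap<>();
    b1.put("name", 40.0);   // total bytes of "name" across batch 1
    s.mergeBatch(b1, 10);
    Map<String, Double> b2 = new HashMap<>();
    b2.put("name", 60.0);
    s.mergeBatch(b2, 10);
    System.out.println(s.avgWidth("name")); // (40 + 60) / 20 rows = 5
  }
}
```

The same shape explains why `getStat` throws before the merge completes: a partial sum divided by a partial row count would silently return a wrong average.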

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103614296
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<MaterializedField, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<MaterializedField, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch incoming,
+  FragmentContext context) throws OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap<String, ValueHolder>());
+  if (key.equalsIgnoreCase("statcount") || key.equalsIgnoreCase("nonnullstatcount")) {

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103614271
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<MaterializedField, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<MaterializedField, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch incoming,
+  FragmentContext context) throws OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap<String, ValueHolder>());
+  if (key.equalsIgnoreCase("statcount") || key.equalsIgnoreCase("nonnullstatcount")) {

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103612078
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), 
cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this 
number is incremented.
+ *"computed" : BIGINT - What time is it computed?
+ *"columns"   : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
--- End diff --

Explained here and in `AnalyzePrule` where all the functions are defined.
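The Javadoc above describes the nested layout of one statistics output row: top-level scalars plus one map per statistic function, keyed by column. A plain-Java mockup of that shape, assuming placeholder values (this is an illustration of the layout, not Drill's `MapVector` representation):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StatsOutputShape {
  // Build one output row of the statistics aggregate following the
  // Javadoc layout; all values here are placeholders.
  static Map<String, Object> exampleRow() {
    Map<String, Object> row = new LinkedHashMap<>();
    row.put("schema", 1L);                         // incremented on each schema change
    row.put("computed", System.currentTimeMillis());

    Map<String, String> columns = new LinkedHashMap<>();
    columns.put("region_id", "region_id");
    columns.put("sales_city", "sales_city");
    columns.put("cnt", "cnt");
    row.put("columns", columns);

    // One entry per column: aggregated row count over the incoming batch.
    Map<String, Long> statsCount = new LinkedHashMap<>();
    statsCount.put("region_id", 1000L);
    statsCount.put("sales_city", 1000L);
    statsCount.put("cnt", 1000L);
    row.put("statscount", statsCount);
    return row;
  }

  public static void main(String[] args) {
    System.out.println(exampleRow().keySet()); // [schema, computed, columns, statscount]
  }
}
```

Each additional statistic function (`nonnullstatcount`, `ndv`, and so on) would add another per-column map alongside `statscount`.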


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103611152
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
--- End diff --

Refactored out the Javadoc comment since it includes a TODO tag




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103611163
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we 
compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), 
cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this 
number is incremented.
+ *"computed" : BIGINT - What time is it computed?
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103611021
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java
 ---
@@ -82,11 +82,18 @@
   private boolean specialBatchSent = false;
   private static final int SPECIAL_BATCH_COUNT = 1;
 
-  public StreamingAggBatch(StreamingAggregate popConfig, RecordBatch incoming, FragmentContext context) throws OutOfMemoryException {
+  public StreamingAggBatch(StreamingAggregate popConfig, RecordBatch incoming, FragmentContext context)
+  throws OutOfMemoryException {
 super(popConfig, context);
 this.incoming = incoming;
   }
 
+  public StreamingAggBatch(StreamingAggregate popConfig, RecordBatch incoming, FragmentContext context,
+   final boolean buildSchema) throws OutOfMemoryException {
--- End diff --

I am not familiar with the entire inner implementation of 
StreamingAggBatch. @amansinha100 could you please explain why we need another 
buildSchema flag?




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103609111
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/ContextInformation.java 
---
@@ -63,4 +65,11 @@ public long getQueryStartTime() {
   public int getRootFragmentTimeZone() {
 return rootFragmentTimeZone;
   }
+
+  /**
+   * @return HLL accuracy parameter
+   */
+  public int getHllMemoryLimit() {
--- End diff --

Done




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103608235
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
@@ -390,4 +391,15 @@
 
   String DYNAMIC_UDF_SUPPORT_ENABLED = "exec.udf.enable_dynamic_support";
  BooleanValidator DYNAMIC_UDF_SUPPORT_ENABLED_VALIDATOR = new BooleanValidator(DYNAMIC_UDF_SUPPORT_ENABLED, true, true);
+
+  /**
+   * Option whose value is a long value representing the number of bits 
required for computing ndv (using HLL)
+   */
  LongValidator NDV_MEMORY_LIMIT = new PositiveLongValidator("exec.statistics.ndv_memory_limit", 30, 20);
+
+  /**
+   * Option whose value represents the current version of the statistics. 
Decreasing the value will generate
+   * the older version of statistics
+   */
  LongValidator STATISTICS_VERSION = new NonNegativeLongValidator("exec.statistics.capability_version", 1, 1);
--- End diff --

Also, the version in the option will change to the latest for any changes 
to the statistics. However, this option will give users flexibility for 
scenarios described in the earlier commit.
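The option's Javadoc describes its value as "the number of bits required for computing ndv (using HLL)". Assuming that maps to HyperLogLog's usual log2m parameter (an assumption based on the comment's wording, not confirmed by the quoted diff), the standard trade-off is: m = 2^log2m registers of memory against a relative standard error of about 1.04/sqrt(m). A sketch of that arithmetic for the option's default (20) and maximum (30):

```java
public class HllSizing {
  // m = 2^log2m registers; each register is typically a few bits.
  static long registers(int log2m) {
    return 1L << log2m;
  }

  // Standard HyperLogLog relative standard error: ~1.04 / sqrt(m).
  static double relativeError(int log2m) {
    return 1.04 / Math.sqrt((double) registers(log2m));
  }

  public static void main(String[] args) {
    // 20 and 30 are the default and maximum of the option above.
    for (int log2m : new int[] {20, 30}) {
      System.out.printf("log2m=%d -> %d registers, ~%.5f%% relative error%n",
          log2m, registers(log2m), 100 * relativeError(log2m));
    }
  }
}
```

So raising the option from 20 toward 30 buys roughly 32x better accuracy at roughly 1024x the sketch memory, which is why it is exposed as a tunable rather than hard-coded.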




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103608069
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java
 ---
@@ -169,4 +175,43 @@ private static boolean containIdentity(List exps,
 }
 return true;
   }
+
+  /**
+   * Returns whether statistics-based estimates or guesses are used by the 
optimizer
+   * */
+  public static boolean guessRows(RelNode rel) {
+final PlannerSettings settings =
+rel.getCluster().getPlanner().getContext().unwrap(PlannerSettings.class);
+if (!settings.useStatistics()) {
+  return true;
+}
+if (rel instanceof RelSubset) {
--- End diff --

Added comment explaining the special treatment.




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103608020
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillRelOptUtil.java
 ---
@@ -169,4 +182,99 @@ private static boolean containIdentity(List exps,
 }
 return true;
   }
+
+  /**
+   * Returns whether statistics-based estimates or guesses are used by the 
optimizer
+   * */
+  public static boolean guessRows(RelNode rel) {
+final PlannerSettings settings =
+rel.getCluster().getPlanner().getContext().unwrap(PlannerSettings.class);
+if (!settings.useStatistics()) {
+  return true;
+}
+if (rel instanceof RelSubset) {
+  if (((RelSubset) rel).getBest() != null) {
+return guessRows(((RelSubset) rel).getBest());
+  } else if (((RelSubset) rel).getOriginal() != null) {
+return guessRows(((RelSubset) rel).getOriginal());
+  }
+} else if (rel instanceof HepRelVertex) {
+  if (((HepRelVertex) rel).getCurrentRel() != null) {
+return guessRows(((HepRelVertex) rel).getCurrentRel());
+  }
+} else if (rel instanceof TableScan) {
+  DrillTable table = rel.getTable().unwrap(DrillTable.class);
+  if (table == null) {
+table = rel.getTable().unwrap(DrillTranslatableTable.class).getDrillTable();
+  }
+  if (table != null && table.getStatsTable() != null) {
+return false;
+  } else {
+return true;
+  }
+} else {
+  for (RelNode child : rel.getInputs()) {
+if (guessRows(child)) { // at least one child is a guess
+  return true;
+}
+  }
+}
+return false;
+  }
+
+  private static boolean findLikeOrRangePredicate(RexNode predicate) {
+if ((predicate == null) || predicate.isAlwaysTrue()) {
+  return false;
+}
+for (RexNode pred : RelOptUtil.conjunctions(predicate)) {
+  for (RexNode orPred : RelOptUtil.disjunctions(pred)) {
+if (!orPred.isA(SqlKind.EQUALS) ||
+ orPred.isA(SqlKind.LIKE)) {
+  return true;
+}
+  }
+}
+return false;
+  }
+
+  public static boolean analyzeSimpleEquiJoin(Join join, int[] joinFieldOrdinals) {
--- End diff --

Done
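The `guessRows` method in the diff above walks the plan tree, unwrapping planner wrappers (`RelSubset`, `HepRelVertex`) until it reaches scans, and reports a guess if any scan lacks a statistics table: an "any child is a guess" recursion. The traversal pattern can be sketched over a toy tree (the `Node` class below is illustrative, not Calcite's `RelNode`):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class GuessRowsSketch {
  // Toy plan node: a scan leaf either has stats or not; an inner node
  // just has children, mirroring rel.getInputs() in the quoted code.
  static class Node {
    final boolean isScan;
    final boolean hasStats;
    final List<Node> children;
    Node(boolean isScan, boolean hasStats, List<Node> children) {
      this.isScan = isScan;
      this.hasStats = hasStats;
      this.children = children;
    }
  }

  // Same shape as guessRows: a scan is a guess when it has no statistics;
  // an inner node is a guess when at least one child is.
  static boolean guessRows(Node rel) {
    if (rel.isScan) {
      return !rel.hasStats;
    }
    for (Node child : rel.children) {
      if (guessRows(child)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    Node scanWithStats = new Node(true, true, Collections.emptyList());
    Node scanNoStats = new Node(true, false, Collections.emptyList());
    Node join = new Node(false, false, Arrays.asList(scanWithStats, scanNoStats));
    System.out.println(guessRows(join)); // true: one input lacks statistics
  }
}
```

The "at least one" propagation is deliberately conservative: an estimate derived from even one guessed input is itself a guess, so the optimizer should not treat it as statistics-backed.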




[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103607754
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this number is incremented.
+ *"computed" : BIGINT - What time is it computed?
+ *"columns"   : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
+ *   "region_id"  : BIGINT - statscount(region_id) - aggregation over all values of region_id
+ *  in incoming batch
+ *   "sales_city" : BIGINT - statscount(sales_city)
+ *   "cnt": BIGINT - statscount(cnt)
+ *"nonnullstatcount" : MAP
+ *   "region_id"  : BIGINT - nonnullstatcount(region_id)
+ *   "sales_city" : BIGINT - nonnullstatcount(sales_city)
+ *   "cnt": BIGINT - nonnullstatcount(cnt)
+ *    another map for next stats function 
+ */
+public class StatisticsAggBatch extends StreamingAggBatch {
+  private List functions;
+  private int schema = 0;
+
+  public StatisticsAggBatch(StatisticsAggregate popConfig, RecordBatch incoming,
+  FragmentContext context) throws OutOfMemoryException {
+super(popConfig, incoming, context);
+th

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103606988
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
+  /*
+   * Creates the appropriate statistics object given the name of the statistics and the input statistic
+   */
+  public static MergedStatistic getMergedStatistic(String outputStatName, String inputStatName) {
+if (outputStatName == null || inputStatName == null) {
+  return null;
+} else if (outputStatName.equals(Statistic.COLNAME)) {
+  return new ColumnMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.COLTYPE)) {
+  return new ColTypeMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.STATCOUNT)) {
+  return new StatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NNSTATCOUNT)) {
+  return new NNStatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.AVG_WIDTH)) {
+  return new AvgWidthMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.HLL_MERGE)) {
+  return new HLLMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NDV)) {
+  return new NDVMergedStatistic(outputStatName, inputStatName);
--- End diff --

Done using the factory approach you described above. Thanks so much for the 
suggestion.
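
[Editor's note] The registry-based factory referred to here can be sketched as below. This is a hypothetical illustration under stated assumptions, not Drill's actual implementation: the statistic name "statcount" and the stand-in `MergedStatistic`/`StatCountMergedStatistic` types are invented for the sketch. The point of the pattern is that each statistic registers a constructor reference once, so adding a new statistic is a one-line change instead of another `else if` branch.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Hypothetical sketch of a registry-based MergedStatistic factory.
public class MergedStatisticRegistry {

  // Minimal stand-in for Drill's MergedStatistic interface.
  interface MergedStatistic {
    String getName();
    String getInput();
  }

  // Minimal stand-in for one concrete statistic.
  static class StatCountMergedStatistic implements MergedStatistic {
    private final String name;
    private final String inputName;
    StatCountMergedStatistic(String name, String inputName) {
      this.name = name;
      this.inputName = inputName;
    }
    public String getName() { return name; }
    public String getInput() { return inputName; }
  }

  // Each statistic registers its constructor once, keyed by output stat name.
  private static final Map<String, BiFunction<String, String, MergedStatistic>> CTORS =
      new HashMap<>();
  static {
    CTORS.put("statcount", StatCountMergedStatistic::new);
    // ...register the remaining statistics the same way
  }

  public static MergedStatistic getMergedStatistic(String outputStatName, String inputStatName) {
    BiFunction<String, String, MergedStatistic> ctor =
        (outputStatName == null || inputStatName == null) ? null : CTORS.get(outputStatName);
    return ctor == null ? null : ctor.apply(outputStatName, inputStatName);
  }

  public static void main(String[] args) {
    MergedStatistic s = getMergedStatistic("statcount", "sum");
    System.out.println(s.getName() + " <- " + s.getInput()); // prints: statcount <- sum
  }
}
```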


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103606914
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/NNStatCountMergedStatistic.java
 ---
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.Map;
+
+public class NNStatCountMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+
+  public NNStatCountMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+  }
+
+  @Override
+  public String getName() {
+  return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  BigIntHolder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (BigIntHolder) sumHolder.get(colName);
+  } else {
+colSumHolder = new BigIntHolder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (long) val;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+if (mergeComplete != true) {
+  throw new IllegalStateException(String.format("Statistic `%s` has not completed merging statistics",
+  name));
+}
+BigIntHolder colSumHolder = (BigIntHolder) sumHolder.get(colName);
+return colSumHolder.value;
+  }
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == TypeProtos.MinorType.MAP);
+MapVector outputMap = (MapVector) output;
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  BigIntHolder holder = (BigIntHolder) sumHolder.get(colName);
+  NullableBigIntVector vv = (NullableBigIntVector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, holder);
+}
+mergeComplete = true;
--- End diff --

The value vector and the results map are all type-specific. I do not see a way to factor them out into a common function. The only common portion is two lines: the `for` loop and getting the column name. The same goes for the `merge()` function. Hence, leaving as-is.
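
[Editor's note] For context, the shared portion the comment refers to (iterate the columns of the map vector, look up the column name, then run a type-specific body) could in principle take the body as a callback. A minimal sketch under stated assumptions: Drill's `MapVector` is stood in for by a plain `Map`, and the body mimics the non-null accumulation in `merge()`. The small size of the shared part is exactly why leaving the duplication as-is is defensible.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch of factoring the shared per-column iteration out of
// merge()/setOutput(). Only the loop and the column-name lookup are shared;
// the type-specific body dominates, as the reviewer notes.
public class PerColumnLoop {

  static void forEachColumn(Map<String, Object> inputMap,
                            BiConsumer<String, Object> typeSpecificBody) {
    for (Map.Entry<String, Object> col : inputMap.entrySet()) {
      typeSpecificBody.accept(col.getKey(), col.getValue());
    }
  }

  public static void main(String[] args) {
    // Stand-in for one incoming batch: column name -> first value.
    Map<String, Object> batch = new LinkedHashMap<>();
    batch.put("region_id", 10L);
    batch.put("cnt", 5L);

    Map<String, Long> sums = new LinkedHashMap<>();
    // Type-specific body: accumulate non-null longs, as merge() does.
    forEachColumn(batch, (name, val) -> {
      if (val != null) {
        sums.merge(name, (Long) val, Long::sum);
      }
    });
    System.out.println(sums); // prints: {region_id=10, cnt=5}
  }
}
```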


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103419914
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this number is incremented.
+ *"computed" : BIGINT - What time is it computed?
+ *"columns"   : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
--- End diff --

Rowcount for each column


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103417821
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
+  /*
+   * Creates the appropriate statistics object given the name of the statistics and the input statistic
+   */
+  public static MergedStatistic getMergedStatistic(String outputStatName, String inputStatName) {
+if (outputStatName == null || inputStatName == null) {
+  return null;
+} else if (outputStatName.equals(Statistic.COLNAME)) {
+  return new ColumnMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.COLTYPE)) {
+  return new ColTypeMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.STATCOUNT)) {
+  return new StatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NNSTATCOUNT)) {
+  return new NNStatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.AVG_WIDTH)) {
+  return new AvgWidthMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.HLL_MERGE)) {
+  return new HLLMergedStatistic(outputStatName, inputStatName);
--- End diff --

Yes, the output name can be hard-coded in the class. I want to avoid hard-coding the input name because then, for each new statistic, we would have to define/duplicate the mapping in multiple places. Right now the mapping is defined only once, in AnalyzePRule.java. Also, there might be a many-to-one mapping, e.g. if the input implementation for a statistic changes but the interface remains the same. I don't see this happening now, but maybe in the future.
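
[Editor's note] The single-definition mapping described here (output statistic name to input statistic name, kept in one place) can be sketched as one lookup table. The concrete names below are assumptions for illustration only; they are not the actual constants in Statistic.java or the real mapping in AnalyzePRule.java.

```java
import java.util.Map;

// Hypothetical sketch: define the output->input statistic mapping once, so a
// changed input implementation (the many-to-one case) is a one-line edit.
public class StatMapping {
  static final Map<String, String> OUTPUT_TO_INPUT = Map.of(
      "statcount", "statcount",                // assumed names throughout
      "nonnullstatcount", "nonnullstatcount",
      "avg_width", "sum_width",                // output derived from a sum input
      "approx_count_distinct", "hll"           // NDV merged from HLL sketches
  );

  public static void main(String[] args) {
    System.out.println(OUTPUT_TO_INPUT.get("approx_count_distinct")); // prints: hll
  }
}
```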


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103411133
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
--- End diff --

`getStat()` returns different types based on the statistic type; this is addressed in a subsequent comment.
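
[Editor's note] The problem the comment refers to, sketched under stated assumptions: because each statistic yields a different type (long row counts, double average widths, ...), `getStat()` returns `Object` and callers cast. One alternative is a class-token accessor; the `TypedStat` class below is a hypothetical illustration, not Drill's actual API.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a type-safe accessor over heterogeneous statistics.
public class TypedStat {
  private final Map<String, Object> stats = new HashMap<>();

  void put(String col, Object value) {
    stats.put(col, value);
  }

  // Class token lets the caller get a checked cast instead of a blind one.
  <T> T getStat(String col, Class<T> type) {
    return type.cast(stats.get(col));
  }

  public static void main(String[] args) {
    TypedStat s = new TypedStat();
    s.put("region_id", 100L);  // statcount: long
    s.put("avg_width", 7.5d);  // average width: double
    long rows = s.getStat("region_id", Long.class);
    double width = s.getStat("avg_width", Double.class);
    System.out.println(rows + " " + width); // prints: 100 7.5
  }
}
```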


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103410529
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/ColTypeMergedStatistic.java
 ---
@@ -0,0 +1,93 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.IntHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.IntVector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.Map;
+
+public class ColTypeMergedStatistic extends AbstractMergedStatistic {
+  private String name;
+  private String inputName;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> typeHolder;
+
+
+  public ColTypeMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+typeHolder = new HashMap<>();
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  if (typeHolder.get(colName) == null) {
+IntHolder colType = new IntHolder();
+((IntVector) vv).getAccessor().get(0, colType);
+typeHolder.put(colName, colType);
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
--- End diff --

Done


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103408205
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
+  /*
+   * Creates the appropriate statistics object given the name of the statistics and the input statistic
+   */
+  public static MergedStatistic getMergedStatistic(String outputStatName, String inputStatName) {
+if (outputStatName == null || inputStatName == null) {
+  return null;
+} else if (outputStatName.equals(Statistic.COLNAME)) {
+  return new ColumnMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.COLTYPE)) {
+  return new ColTypeMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.STATCOUNT)) {
+  return new StatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NNSTATCOUNT)) {
+  return new NNStatCountMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.AVG_WIDTH)) {
+  return new AvgWidthMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.HLL_MERGE)) {
+  return new HLLMergedStatistic(outputStatName, inputStatName);
+} else if (outputStatName.equals(Statistic.NDV)) {
+  return new NDVMergedStatistic(outputStatName, inputStatName);
+} else {
+  return null;
--- End diff --

Done


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103408160
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/Statistic.java
 ---
@@ -0,0 +1,38 @@
+/**
--- End diff --

Done


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103407891
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
+  /*
+   * Creates the appropriate statistics object given the name of the statistics and the input statistic
+   */
+  public static MergedStatistic getMergedStatistic(String outputStatName, String inputStatName) {
+if (outputStatName == null || inputStatName == null) {
+  return null;
--- End diff --

Done


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103406448
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/MergedStatisticFactory.java
 ---
@@ -0,0 +1,46 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+public class MergedStatisticFactory {
--- End diff --

Done


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103405690
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  return (long) (colSumHolder.value/ getRowCount(colName));
+}
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == TypeProtos.MinorType.MAP);
+// Dependencies have been configured correctly
+assert (configureComplete == true);
+MapVector outputMap = (MapVector) output;
+
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  NullableFloat8Vector vv = (NullableFloat8Vector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, (colSumHolder.value / getRowCount(colName)));
+}
+mergeComplete = true;
+  }
+
+  @Override
+  public void configure(Object configurations) {
+List<MergedStatistic> statistics = (List<MergedStatistic>) configurations;
+for (MergedStatistic statistic : statistics) {
+  if (statistic.getName().equals("type")) {
+types = statistic;
+  } else if (statistic.getName().equals("statcount&qu

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-28 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103403068
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
+  private Map<String, ValueHolder> sumHolder;
+  MergedStatistic types, nonNullStatCounts, statCounts;
+
+  public AvgWidthMergedStatistic (String name, String inputName) {
+this.name = name;
+this.inputName = inputName;
+this.sumHolder = new HashMap<>();
+types = nonNullStatCounts = statCounts = null;
+  }
+
+  @Override
+  public String getName() {
+return name;
+  }
+
+  @Override
+  public String getInput() {
+return inputName;
+  }
+
+  @Override
+  public void merge(ValueVector input) {
+// Check the input is a Map Vector
+assert (input.getField().getType().getMinorType() == TypeProtos.MinorType.MAP);
+MapVector inputMap = (MapVector) input;
+for (ValueVector vv : inputMap) {
+  String colName = vv.getField().getLastName();
+  NullableFloat8Holder colSumHolder;
+  if (sumHolder.get(colName) != null) {
+colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  } else {
+colSumHolder = new NullableFloat8Holder();
+sumHolder.put(colName, colSumHolder);
+  }
+  Object val = vv.getAccessor().getObject(0);
+  if (val != null) {
+colSumHolder.value += (double) val;
+colSumHolder.isSet = 1;
+  }
+}
+  }
+
+  @Override
+  public Object getStat(String colName) {
+  if (mergeComplete != true) {
+throw new IllegalStateException(
+String.format("Statistic `%s` has not completed merging statistics", name));
+  }
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  return (long) (colSumHolder.value/ getRowCount(colName));
+}
+
+  @Override
+  public void setOutput(ValueVector output) {
+// Check the input is a Map Vector
+assert (output.getField().getType().getMinorType() == TypeProtos.MinorType.MAP);
+// Dependencies have been configured correctly
+assert (configureComplete == true);
+MapVector outputMap = (MapVector) output;
+
+for (ValueVector outMapCol : outputMap) {
+  String colName = outMapCol.getField().getLastName();
+  NullableFloat8Holder colSumHolder = (NullableFloat8Holder) sumHolder.get(colName);
+  NullableFloat8Vector vv = (NullableFloat8Vector) outMapCol;
+  vv.allocateNewSafe();
+  vv.getMutator().setSafe(0, (colSumHolder.value / getRowCount(colName)));
+}
--- End diff --

Since we create these VVs in the StatisticsMergeBatch, we only create one per statistic per column. We can redesign this in a subsequent implementation; I am not sure how we can get by without generating a VV, or do you mean generating fewer VVs?


---

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103383633
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
+  private boolean configureComplete = false;
+  private boolean mergeComplete = false;
--- End diff --

Yes, they are not strictly necessary. However, having the statistics move through distinct states makes it easier to debug and track issues, and incremental statistics would likely add more run-time state interactions. As suggested, I added a state enum: `enum State {Config, Merge, Complete};`
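A minimal sketch of the Config/Merge/Complete lifecycle mentioned above; the enum constants follow the reply, while the class and method names are hypothetical illustrations, not the Drill implementation:

```java
// Sketch of a statistic lifecycle guard: each operation is legal in exactly
// one state, and calling it out of order raises IllegalStateException.
public class StatisticLifecycle {
    enum State { Config, Merge, Complete }
    private State state = State.Config;

    void configure() {           // dependencies wired up first
        require(State.Config, "configure");
        state = State.Merge;
    }

    void merge() {               // repeated once per incoming batch
        require(State.Merge, "merge");
    }

    void finish() {              // transition to the terminal state
        require(State.Merge, "finish");
        state = State.Complete;
    }

    Object getStat() {           // results readable only when complete
        require(State.Complete, "getStat");
        return 42L;              // placeholder result
    }

    private void require(State expected, String op) {
        if (state != expected) {
            throw new IllegalStateException(op + " called in state " + state);
        }
    }

    public static void main(String[] args) {
        StatisticLifecycle s = new StatisticLifecycle();
        s.configure();
        s.merge();
        s.finish();
        System.out.println(s.getStat());
    }
}
```

Compared with two independent booleans, the single enum makes illegal interleavings (e.g. merging before configuration) unrepresentable rather than merely unchecked.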


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103371017
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AvgWidthMergedStatistic.java
 ---
@@ -0,0 +1,135 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.util.HashMap;
+import java.util.List;
+import java.util.Map;
+
+public class AvgWidthMergedStatistic extends AbstractMergedStatistic {
+
+  private String name;
+  private String inputName;
--- End diff --

Done


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103367008
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/AbstractMergedStatistic.java
 ---
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import org.apache.drill.exec.vector.ValueVector;
+
+public abstract class AbstractMergedStatistic extends Statistic implements 
MergedStatistic {
+  @Override
+  public String getName() {
+throw new UnsupportedOperationException("getName() not implemented");
+  }
--- End diff --

Done


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103366767
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/QueryContext.java ---
@@ -283,4 +288,22 @@ public void close() throws Exception {
   closed = true;
 }
   }
+
+  /**
+  * @param stmtType : Sets the type {@link SqlStatementType} of the statement e.g. CTAS, ANALYZE
+  */
+  public void setSQLStatementType(SqlStatementType stmtType) {
+if (this.stmtType == null) {
+  this.stmtType = stmtType;
+} else {
+  throw new UnsupportedOperationException("SQL Statement type is already set");
+}
+  }
+
+  /**
+   * @return Get the type {@link SqlStatementType} of the statement e.g. CTAS, ANALYZE
+   */
+  public SqlStatementType getSQLStatementType() {
+return this.stmtType;
--- End diff --

Done


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103366732
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/QueryContext.java ---
@@ -283,4 +288,22 @@ public void close() throws Exception {
   closed = true;
 }
   }
+
+  /**
+  * @param stmtType : Sets the type {@link SqlStatementType} of the statement e.g. CTAS, ANALYZE
+  */
+  public void setSQLStatementType(SqlStatementType stmtType) {
+if (this.stmtType == null) {
+  this.stmtType = stmtType;
+} else {
+  throw new UnsupportedOperationException("SQL Statement type is already set");
--- End diff --

Changed to `IllegalStateException`
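The set-once behavior under review, with the exception type mentioned in this reply, can be sketched generically; the `SetOnce` class and its names are illustrative, not part of Drill:

```java
// Minimal sketch of the set-once pattern discussed above: the value may be
// assigned exactly once; a second assignment is a programming error,
// reported with IllegalStateException (the operation exists but is invoked
// at the wrong time) rather than UnsupportedOperationException.
public class SetOnce<T> {
    private T value;

    public void set(T v) {
        if (value != null) {
            throw new IllegalStateException("value is already set");
        }
        value = v;
    }

    public T get() { return value; }

    public static void main(String[] args) {
        SetOnce<String> stmtType = new SetOnce<>();
        stmtType.set("ANALYZE");
        System.out.println(stmtType.get());
        try {
            stmtType.set("CTAS");          // second assignment must fail
        } catch (IllegalStateException e) {
            System.out.println("second set rejected");
        }
    }
}
```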


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103366521
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java ---
@@ -245,6 +247,18 @@ public SchemaPlus getRootSchema() {
   }
 
   /**
+   * Returns the statement type (e.g. SELECT, CTAS, ANALYZE) from the query context.
+   * @return query statement type {@link SqlStatementType}, if known.
+   */
+  public SqlStatementType getSQLStatementType() {
+if (queryContext == null) {
+  fail(new UnsupportedOperationException("Statement type is only valid for root fragment. " +
--- End diff --

Done


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-27 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r103365674
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
@@ -390,4 +391,15 @@
 
   String DYNAMIC_UDF_SUPPORT_ENABLED = "exec.udf.enable_dynamic_support";
  BooleanValidator DYNAMIC_UDF_SUPPORT_ENABLED_VALIDATOR = new BooleanValidator(DYNAMIC_UDF_SUPPORT_ENABLED, true, true);
+
+  /**
+   * Option whose value is a long value representing the number of bits required for computing ndv (using HLL)
+   */
  LongValidator NDV_MEMORY_LIMIT = new PositiveLongValidator("exec.statistics.ndv_memory_limit", 30, 20);
+
+  /**
+   * Option whose value represents the current version of the statistics. Decreasing the value will generate
+   * the older version of statistics
+   */
  LongValidator STATISTICS_VERSION = new NonNegativeLongValidator("exec.statistics.capability_version", 1, 1);
--- End diff --

Say in the next version (v2) we add histograms. Computing stats is expensive, so users might prefer to remain on the present version (v1), perhaps because their queries do not involve many inequalities. Always generating the latest version of the stats would force users to compute the latest and greatest stats without needing them. On the other hand, providing individual control over which statistic to compute moves too much burden onto the user to figure out exactly which statistics would help their use cases.
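A hypothetical sketch of this versioning idea: each capability version layers additional statistics on top of the previous one, so pinning the option to v1 avoids computing v2's costlier additions. The statistic names and the registry structure are illustrative only, not Drill's implementation:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical version-gated statistics selection: the statistics computed
// for version N are the union of all statistics introduced up to N.
public class StatsVersionGate {
    private static final TreeMap<Integer, List<String>> BY_VERSION =
        new TreeMap<>(Map.of(
            1, List.of("rowcount", "nonnullrowcount", "ndv", "avgwidth"),
            2, List.of("histogram")));   // v2 adds histograms on top of v1

    // All statistics whose introducing version is <= the requested version.
    static List<String> statisticsFor(int version) {
        return BY_VERSION.headMap(version, true).values().stream()
            .flatMap(List::stream)
            .toList();
    }

    public static void main(String[] args) {
        System.out.println(statisticsFor(1)); // v1: no histograms
        System.out.println(statisticsFor(2)); // v2: histograms included
    }
}
```

The single version knob then expresses "how much stats work to do" without asking users to reason about individual statistics.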


---


[GitHub] drill issue #741: DRILL-5196: init MongoDB cluster when run a single test ca...

2017-02-24 Thread gparai
Github user gparai commented on the issue:

https://github.com/apache/drill/pull/741
  
+1 LGTM


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-22 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102604228
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/ExecConstants.java ---
@@ -390,4 +391,15 @@
 
   String DYNAMIC_UDF_SUPPORT_ENABLED = "exec.udf.enable_dynamic_support";
  BooleanValidator DYNAMIC_UDF_SUPPORT_ENABLED_VALIDATOR = new BooleanValidator(DYNAMIC_UDF_SUPPORT_ENABLED, true, true);
+
+  /**
+   * Option whose value is a long value representing the number of bits required for computing ndv (using HLL)
+   */
  LongValidator NDV_MEMORY_LIMIT = new PositiveLongValidator("exec.statistics.ndv_memory_limit", 30, 20);
--- End diff --

We are not mixing different lengths during the same run. The session setting at the foreman is passed along in the plan fragment, so non-foreman fragments use the same value. We also do not mix lengths across different runs, so this should not be an issue.
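For intuition on why a consistent bit count matters: with HyperLogLog, 2^bits registers give a relative standard error of roughly 1.04/sqrt(2^bits), so the option trades memory for NDV accuracy. The sketch below only illustrates that arithmetic; it is not Drill's HLL implementation, and the chosen bit values are just examples:

```java
import java.util.Locale;

// Relates an HLL bit-count setting (like exec.statistics.ndv_memory_limit)
// to register count m = 2^bits and expected relative error ~ 1.04/sqrt(m).
public class HllSizing {
    static long registers(int bits) {
        return 1L << bits;               // m = 2^bits
    }

    static double standardError(int bits) {
        return 1.04 / Math.sqrt(registers(bits));
    }

    public static void main(String[] args) {
        for (int bits : new int[] {10, 20, 30}) {
            // Locale.ROOT keeps the decimal separator a '.' everywhere.
            System.out.printf(Locale.ROOT, "bits=%d registers=%d stderr=%.5f%n",
                bits, registers(bits), standardError(bits));
        }
    }
}
```

Doubling the bit count squares the register count while only halving the error's denominator's square root, which is why a moderate default with a bounded maximum is a reasonable validator design.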


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-22 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102602855
  
--- Diff: exec/java-exec/src/main/codegen/data/Parser.tdd ---
@@ -39,7 +39,13 @@
 "METADATA",
 "DATABASE",
 "IF",
-"JAR"
+"JAR",
+"ANALYZE",
+"COMPUTE",
+"ESTIMATE",
+"STATISTICS",
+"SAMPLE",
+"PERCENT"
--- End diff --

@sudheeshkatkam mentioned
> Something like this came up before where a list of non reserved keyword 
might result in some ambiguous queries. See DRILL-2116. Also DRILL-3875.

Hence, these keywords were not added to the non-reserved keyword list. 
Also, I am not sure how we can preserve backward compatibility here.


---


[GitHub] drill issue #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on the issue:

https://github.com/apache/drill/pull/729
  
I have addressed the comments from the earlier pull request. Please take a look.


---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102345408
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/cost/DrillRelMdSelectivity.java
 ---
@@ -0,0 +1,219 @@

+/***
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ 
**/
+package org.apache.drill.exec.planner.cost;
+
+import org.apache.calcite.plan.RelOptUtil;
+import org.apache.calcite.plan.volcano.RelSubset;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.SingleRel;
+import org.apache.calcite.rel.core.JoinRelType;
+import org.apache.calcite.rel.core.TableScan;
+import org.apache.calcite.rel.metadata.ReflectiveRelMetadataProvider;
+import org.apache.calcite.rel.metadata.RelMdSelectivity;
+import org.apache.calcite.rel.metadata.RelMdUtil;
+import org.apache.calcite.rel.metadata.RelMetadataProvider;
+import org.apache.calcite.rel.metadata.RelMetadataQuery;
+import org.apache.calcite.rel.type.RelDataType;
+import org.apache.calcite.rex.RexBuilder;
+import org.apache.calcite.rex.RexCall;
+import org.apache.calcite.rex.RexInputRef;
+import org.apache.calcite.rex.RexNode;
+import org.apache.calcite.rex.RexUtil;
+import org.apache.calcite.rex.RexVisitor;
+import org.apache.calcite.rex.RexVisitorImpl;
+import org.apache.calcite.sql.SqlKind;
+import org.apache.calcite.util.BuiltInMethod;
+import org.apache.calcite.util.Util;
+import org.apache.drill.exec.planner.common.DrillJoinRelBase;
+import org.apache.drill.exec.planner.common.DrillRelOptUtil;
+import org.apache.drill.exec.planner.logical.DrillTable;
+import org.apache.drill.exec.planner.logical.DrillTranslatableTable;
+
+import java.util.ArrayList;
+import java.util.List;
+
+public class DrillRelMdSelectivity extends RelMdSelectivity {
+  private static final org.slf4j.Logger logger =
+  org.slf4j.LoggerFactory.getLogger(RelMdSelectivity.class);
+
+  private static final DrillRelMdSelectivity INSTANCE =
+  new DrillRelMdSelectivity();
+
+  public static final RelMetadataProvider SOURCE =
+  ReflectiveRelMetadataProvider.reflectiveSource(
+  BuiltInMethod.SELECTIVITY.method, INSTANCE);
+
+  @Override
+  public Double getSelectivity(RelNode rel, RexNode predicate) {
+if (rel instanceof TableScan) {
+  return getScanSelectivity((TableScan) rel, predicate);
+} else if (rel instanceof DrillJoinRelBase) {
+  return getJoinSelectivity(((DrillJoinRelBase) rel), predicate);
+} else if (rel instanceof SingleRel && !DrillRelOptUtil.guessRows(rel)) {
+return RelMetadataQuery.getSelectivity(((SingleRel) rel).getInput(), predicate);
+} else if (rel instanceof RelSubset && !DrillRelOptUtil.guessRows(rel)) {
+  if (((RelSubset) rel).getBest() != null) {
+return RelMetadataQuery.getSelectivity(((RelSubset)rel).getBest(), predicate);
+  } else if (((RelSubset)rel).getOriginal() != null) {
+return RelMetadataQuery.getSelectivity(((RelSubset)rel).getOriginal(), predicate);
+  } else {
+return super.getSelectivity(rel, predicate);
+  }
+} else {
+  return super.getSelectivity(rel, predicate);
+}
+  }
+
+  private Double getJoinSelectivity(DrillJoinRelBase rel, RexNode predicate) {
+double sel = 1.0;
+// determine which filters apply to the left vs right
+RexNode leftPred = null;
+RexNode rightPred = null;
+JoinRelType joinType = rel.getJoinType();
+final RexBuilder rexBuilder = rel.getCluster().getRexBuilder();
+int[] adjustments = new int[rel.getRowType().getFieldCount()];
+
+if (DrillRelOptUtil.guessRows(rel)) {
+  return super.getSelectivity(rel, predicate);
   

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102327621
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsAggBatch.java
 ---
@@ -0,0 +1,256 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.sun.codemodel.JExpr;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.exec.exception.ClassTransformationException;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ClassGenerator;
+import org.apache.drill.exec.expr.CodeGenerator;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.ValueVectorWriteExpression;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsAggregate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggBatch;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggTemplate;
+import org.apache.drill.exec.physical.impl.aggregate.StreamingAggregator;
+import org.apache.drill.exec.planner.physical.StatsAggPrel.OperatorPhase;
+import org.apache.drill.exec.record.BatchSchema.SelectionVectorMode;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TypedFieldId;
+import org.apache.drill.exec.store.ImplicitColumnExplorer;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.complex.FieldIdUtil;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.IOException;
+import java.util.GregorianCalendar;
+import java.util.List;
+import java.util.TimeZone;
+
+/**
+ * TODO: This needs cleanup. Currently the key values are constants and we compare the constants
+ * for every record. Seems unnecessary.
+ *
+ * Example input and output:
+ * Schema of incoming batch: region_id (VARCHAR), sales_city (VARCHAR), cnt (BIGINT)
+ * Schema of output:
+ *"schema" : BIGINT - Schema number. For each schema change this number is incremented.
+ *"computed" : BIGINT - What time is it computed?
+ *"columns"   : MAP - Column names
+ *   "region_id"  : VARCHAR
+ *   "sales_city" : VARCHAR
+ *   "cnt": VARCHAR
+ *"statscount" : MAP
+ *   "region_id"  : BIGINT - statscount(region_id) - aggregation over all values of region_id
+ *  in incoming batch
+ *   "sales_city" : BIGINT - statscount(sales_city)
+ *   "cnt": BIGINT - statscount(cnt)
+ *"nonnullstatcount" : MAP
+ *   "region_id"  : BIGINT - nonnullstatcount(region_id)
+ *   "sales_city" : BIGINT - nonnullstatcount(sales_city)
+ *   "cnt": BIGINT - nonnullstatcount(cnt)
+ *    another map for next stats function 
+ */
+public class StatisticsAggBatch extends StreamingAggBatch {
+  private List functions;
+  private int schema = 0;
+
+  public StatisticsAggBatch(StatisticsAggregate popConfig, RecordBatch incoming,
+  FragmentContext context) throws OutOfMemoryException {
+super(popConfig, incoming, context);
+th

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102326907
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends AbstractSingleRecordBatch {
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List keyList = null;
+  private Map<MaterializedField, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<MaterializedField, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch incoming,
+      FragmentContext context) throws OutOfMemoryException {
+    super(popConfig, context, incoming);
+    this.functions = new HashMap<>();
+    this.aggregationMap = new HashMap<>();
+
+    /*for (String key : popConfig.getFunctions()) {
+      aggregationMap.put(key, new HashMap<String, ValueHolder>());
+      if (key.equalsIgnoreCase("statcount") || key.equalsIgnoreCase("nonnullstatcount")) {
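The `aggregationMap` field above (outer key = statistics function name, inner key = column name) suggests a nested-map accumulation pattern. A minimal sketch of that pattern, assuming a hypothetical `MergeSketch` class and using summed longs in place of the operator's typed `ValueHolder`s:

```java
import java.util.HashMap;
import java.util.Map;

// Illustration only: nested-map merge keyed by statistics function name,
// then by column name. Each incoming partial value for (function, column)
// is folded into the running aggregate; here the fold is a plain sum.
public class MergeSketch {
  private final Map<String, Map<String, Long>> aggregationMap = new HashMap<>();

  public void mergePartial(String function, String column, long value) {
    aggregationMap
        .computeIfAbsent(function, f -> new HashMap<>())  // lazily create the per-function map
        .merge(column, value, Long::sum);                 // fold the partial into the running total
  }

  public long get(String function, String column) {
    return aggregationMap.get(function).get(column);
  }
}
```

For example, merging `statcount` partials of 60 and 50 for `region_id` from two fragments would yield a total of 110 for that (function, column) pair.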

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102325216
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102324705
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102324596
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102323849
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102322555
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102320614
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java
 ---
@@ -0,0 +1,532 @@

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102319897
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<MaterializedField, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<MaterializedField, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch incoming,
+      FragmentContext context) throws OutOfMemoryException {
+    super(popConfig, context, incoming);
+    this.functions = new HashMap<>();
+    this.aggregationMap = new HashMap<>();
+
+    /*for (String key : popConfig.getFunctions()) {
+      aggregationMap.put(key, new HashMap<String, ValueHolder>());
+      if (key.equalsIgnoreCase("statcount") || key.equalsIgnoreCase("nonnullstatcount")) {
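
The quoted operator merges per-column statistics (statcount, nonnullstatcount, ndv, ...) computed by upstream fragments. As a hypothetical, much-simplified sketch of that merge step (plain Java collections, invented class names, and an exact Set standing in for the HyperLogLog sketch Drill actually uses for NDV):

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical, simplified stand-in for the statistics merge phase:
 * each input "fragment" reports per-column stats and the merge combines them.
 */
public class StatsMergeSketch {
  public static class ColumnStats {
    public long statcount;                         // total rows seen
    public long nonNullStatcount;                  // non-null rows seen
    public Set<Object> distinct = new HashSet<>(); // stand-in for an HLL sketch
  }

  /** Merge fragment-local stats per column: counts add, distinct sets union. */
  public static Map<String, ColumnStats> merge(List<Map<String, ColumnStats>> fragments) {
    Map<String, ColumnStats> merged = new HashMap<>();
    for (Map<String, ColumnStats> fragment : fragments) {
      for (Map.Entry<String, ColumnStats> e : fragment.entrySet()) {
        ColumnStats acc = merged.computeIfAbsent(e.getKey(), k -> new ColumnStats());
        acc.statcount += e.getValue().statcount;
        acc.nonNullStatcount += e.getValue().nonNullStatcount;
        // A real HLL merge would OR the sketch registers instead of a set union.
        acc.distinct.addAll(e.getValue().distinct);
      }
    }
    return merged;
  }
}
```

The key property this relies on is that each of these statistics is mergeable: the combined value over all fragments can be computed from the per-fragment summaries alone, without revisiting the rows.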

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102316247
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/UnpivotMaps.java ---
@@ -0,0 +1,59 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.config;
+
+import java.util.List;
+
+import org.apache.drill.exec.physical.base.AbstractSingle;
+import org.apache.drill.exec.physical.base.PhysicalOperator;
+import org.apache.drill.exec.physical.base.PhysicalVisitor;
+import org.apache.drill.exec.proto.UserBitShared.CoreOperatorType;
+
+import com.fasterxml.jackson.annotation.JsonCreator;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+
+@JsonTypeName("unpivot-maps")
+public class UnpivotMaps extends AbstractSingle {
--- End diff --

This is generic in the sense that it can unpivot any given set of maps, provided they are of
the same size and contain the same keys (columns).
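
As a hypothetical illustration of that idea (plain Java collections rather than Drill's value-vector machinery; the class and method names here are invented), unpivoting a set of same-keyed maps turns N maps over K shared keys into K entries of N values each:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class UnpivotSketch {
  /**
   * Unpivot a list of maps that all share the same key set: the result
   * has one entry per key, whose value collects that key's value from
   * each input map, in input order.
   */
  public static Map<String, List<Object>> unpivot(List<Map<String, Object>> maps) {
    Map<String, List<Object>> result = new LinkedHashMap<>();
    if (maps.isEmpty()) {
      return result;
    }
    // All input maps are assumed to have the same keys (columns).
    for (String key : maps.get(0).keySet()) {
      List<Object> column = new ArrayList<>();
      for (Map<String, Object> m : maps) {
        if (!m.containsKey(key)) {
          throw new IllegalArgumentException("maps do not share key: " + key);
        }
        column.add(m.get(key));
      }
      result.put(key, column);
    }
    return result;
  }
}
```

This is why the operator needs no knowledge of which statistics it is unpivoting: any maps satisfying the same-size, same-keys contract can be processed.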


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102314383
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/planner/common/DrillStatsTable.java ---
@@ -0,0 +1,347 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.planner.common;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import java.util.concurrent.TimeUnit;
+import com.fasterxml.jackson.annotation.JsonIgnore;
+import com.fasterxml.jackson.annotation.JsonGetter;
+import com.fasterxml.jackson.annotation.JsonSetter;
+import com.fasterxml.jackson.annotation.JsonProperty;
+import com.fasterxml.jackson.annotation.JsonSubTypes;
+import com.fasterxml.jackson.annotation.JsonTypeInfo;
+import com.fasterxml.jackson.annotation.JsonTypeName;
+import com.fasterxml.jackson.databind.DeserializationFeature;
+import com.fasterxml.jackson.databind.ObjectMapper;
+import com.google.common.base.Stopwatch;
+import com.google.common.collect.Maps;
+import org.apache.calcite.rel.RelNode;
+import org.apache.calcite.rel.RelVisitor;
+import org.apache.calcite.rel.core.TableScan;
+import org.apache.drill.common.exceptions.DrillRuntimeException;
+import org.apache.drill.exec.ops.QueryContext;
+import org.apache.drill.exec.planner.logical.DrillTable;
+import org.apache.drill.exec.util.ImpersonationUtil;
+import org.apache.hadoop.fs.FSDataInputStream;
+import org.apache.hadoop.fs.FileSystem;
+import org.apache.hadoop.fs.Path;
+import org.joda.time.DateTime;
+
+/**
+ * Wraps the stats table info including schema and tableName. Also materializes stats from storage
+ * and keeps them in memory.
+ */
+public class DrillStatsTable {
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(DrillStatsTable.class);
+  private final FileSystem fs;
+  private final Path tablePath;
+
+  /**
+   * List of columns in stats table.
+   */
+  public static final String COL_COLUMN = "column";
+  public static final String COL_COMPUTED = "computed";
+  public static final String COL_STATCOUNT = "statcount";
+  public static final String COL_NDV = "ndv";
+
+  private final String schemaName;
+  private final String tableName;
+
+  private final Map<String, Long> ndv = Maps.newHashMap();
+  private double rowCount = -1;
+
+  private boolean materialized = false;
+
+  private TableStatistics statistics = null;
+
+  public DrillStatsTable(String schemaName, String tableName, Path tablePath, FileSystem fs) {
+    this.schemaName = schemaName;
+    this.tableName = tableName;
+    this.tablePath = tablePath;
+    this.fs = ImpersonationUtil.createFileSystem(ImpersonationUtil.getProcessUserName(), fs.getConf());
+  }
+
+  public String getSchemaName() {
+    return schemaName;
+  }
+
+  public String getTableName() {
+    return tableName;
+  }
+  /**
+   * Get the number of distinct values of the given column. If stats are not present for the
+   * given column, null is returned.
+   *
+   * Note: the returned data may not be accurate. Accuracy depends on whether the table data
+   * has changed after the stats are computed.
+   *
+   * @param col the column name
+   * @return the estimated number of distinct values, or null if stats are unavailable
+   */
+  public Double getNdv(String col) {
+    // Stats might not have materialized because of errors.
+    if (!materialized) {
+      return null;
+    }
+    final String upperCol = col.toUpperCase();
+    final Long ndvCol = ndv.get(upperCol);
+    // Ndv estimation techniques like HLL may over-estimate, hence cap it at rowCount
+    if (ndvCol != null) {
+      return Math.min(ndvCol, rowCount);
--- End diff --

Histograms would help with the data skew. When we have histograms, the NDV 
would be obtained f
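
The capping in the quoted `getNdv` is worth illustrating on its own. A minimal sketch (plain Java; `NdvCap` and `cappedNdv` are invented names, not Drill API) of why the clamp matters: a probabilistic estimator such as HyperLogLog has a relative error bound, so on low-cardinality input the raw estimate can exceed the row count, which is impossible for the true NDV:

```java
public class NdvCap {
  /**
   * Clamp an estimated distinct-value count to the known row count.
   * Sketch-based estimators (e.g. HyperLogLog) may over-estimate, and
   * the true NDV can never exceed the number of rows in the table.
   */
  public static double cappedNdv(long estimatedNdv, double rowCount) {
    return Math.min(estimatedNdv, rowCount);
  }
}
```

For example, an HLL estimate of 105 distinct values over a 100-row table would be capped to 100, while an estimate of 42 passes through unchanged.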

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102310795
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java ---

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102306890
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java ---

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102295174
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java ---

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102295106
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java ---

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102294356
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java ---

[GitHub] drill pull request #729: Drill 1328: Support table statistics for Parquet

2017-02-21 Thread gparai
Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/729#discussion_r102292075
  
--- Diff: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/statistics/StatisticsMergeBatch.java ---
@@ -0,0 +1,532 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.physical.impl.statistics;
+
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import org.apache.drill.common.expression.LogicalExpression;
+import org.apache.drill.common.expression.ErrorCollector;
+import org.apache.drill.common.expression.ErrorCollectorImpl;
+import org.apache.drill.common.expression.FunctionCallFactory;
+import org.apache.drill.common.expression.SchemaPath;
+import org.apache.drill.common.expression.ValueExpressions;
+import org.apache.drill.common.types.TypeProtos;
+import org.apache.drill.common.types.TypeProtos.MajorType;
+import org.apache.drill.exec.exception.OutOfMemoryException;
+import org.apache.drill.exec.exception.SchemaChangeException;
+import org.apache.drill.exec.expr.ExpressionTreeMaterializer;
+import org.apache.drill.exec.expr.TypeHelper;
+import org.apache.drill.exec.expr.holders.BigIntHolder;
+import org.apache.drill.exec.expr.holders.NullableFloat8Holder;
+import org.apache.drill.exec.expr.holders.ObjectHolder;
+import org.apache.drill.exec.expr.holders.ValueHolder;
+import org.apache.drill.exec.ops.FragmentContext;
+import org.apache.drill.exec.physical.config.StatisticsMerge;
+import org.apache.drill.exec.record.AbstractSingleRecordBatch;
+import org.apache.drill.exec.record.MaterializedField;
+import org.apache.drill.exec.record.RecordBatch;
+import org.apache.drill.exec.record.TransferPair;
+import org.apache.drill.exec.record.VectorContainer;
+import org.apache.drill.exec.record.VectorWrapper;
+import org.apache.drill.exec.vector.ValueVector;
+import org.apache.drill.exec.vector.DateVector;
+import org.apache.drill.exec.vector.BigIntVector;
+import org.apache.drill.exec.vector.NullableBigIntVector;
+import org.apache.drill.exec.vector.NullableFloat8Vector;
+import org.apache.drill.exec.vector.NullableVarBinaryVector;
+import org.apache.drill.exec.vector.VarCharVector;
+import org.apache.drill.exec.vector.complex.MapVector;
+
+import java.io.ByteArrayInputStream;
+import java.io.DataInputStream;
+import java.io.IOException;
+import java.util.List;
+import java.util.GregorianCalendar;
+import java.util.HashMap;
+import java.util.Map;
+import java.util.TimeZone;
+
+import com.clearspring.analytics.stream.cardinality.HyperLogLog;
+
+public class StatisticsMergeBatch extends AbstractSingleRecordBatch<StatisticsMerge> {
+  private static final org.slf4j.Logger logger = org.slf4j.LoggerFactory.getLogger(StatisticsMergeBatch.class);
+  private Map<String, String> functions;
+  private boolean first = true;
+  private boolean finished = false;
+  private int schema = 0;
+  private int recordCount = 0;
+  private List<String> keyList = null;
+  private Map<MaterializedField, ValueVector> dataSrcVecMap = null;
+  // Map of non-map fields to VV in the incoming schema
+  private Map<MaterializedField, ValueVector> copySrcVecMap = null;
+  private Map<String, Map<String, ValueHolder>> aggregationMap = null;
+  public StatisticsMergeBatch(StatisticsMerge popConfig, RecordBatch incoming,
+      FragmentContext context) throws OutOfMemoryException {
+super(popConfig, context, incoming);
+this.functions = new HashMap<>();
+this.aggregationMap = new HashMap<>();
+
+/*for (String key : popConfig.getFunctions()) {
+  aggregationMap.put(key, new HashMap<String, ValueHolder>());
+  if (key.equalsIgnoreCase("statcount") || key.equalsIgnoreCase("nonnullstatcount")) {

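The constructor above pre-allocates one aggregation map per statistics function ("statcount", "nonnullstatcount", etc.), keyed by column, which the merge batch then folds incoming per-fragment partial results into. A minimal standalone sketch of that merge idea, using plain Java collections instead of Drill's value vectors and holders (all names here are illustrative, not Drill APIs):

```java
import java.util.HashMap;
import java.util.Map;

public class StatMergeSketch {
    // Per-column running totals, mirroring the "statcount" /
    // "nonnullstatcount" entries kept in aggregationMap above.
    private final Map<String, long[]> totals = new HashMap<>();

    // Fold one fragment's partial counts for a column into the running merge.
    public void merge(String column, long rowCount, long nonNullCount) {
        long[] t = totals.computeIfAbsent(column, c -> new long[2]);
        t[0] += rowCount;      // statcount: total rows seen
        t[1] += nonNullCount;  // nonnullstatcount: non-null rows seen
    }

    public long statCount(String column)        { return totals.get(column)[0]; }
    public long nonNullStatCount(String column) { return totals.get(column)[1]; }

    public static void main(String[] args) {
        StatMergeSketch s = new StatMergeSketch();
        s.merge("l_orderkey", 1000, 1000);  // partial stats from fragment 1
        s.merge("l_orderkey", 500, 480);    // partial stats from fragment 2
        System.out.println(s.statCount("l_orderkey"));        // 1500
        System.out.println(s.nonNullStatCount("l_orderkey")); // 1480
    }
}
```

Counts merge by simple addition; the NDV statistic is the reason for the `HyperLogLog` import in the diff, since distinct counts cannot be summed and instead require merging the per-fragment sketches.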