[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15525244#comment-15525244
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
Paul,
The code you have listed is semantically equivalent to what I have already 
submitted in this pull request, and it will not handle all malformed JSON 
records. Also, the code for reporting the error records works correctly as 
long as the error is reported by the parser correctly.

As I explained earlier, the JSON parser is not just a simple tokenizer: it 
keeps track of internal state, hence the issue. SerDes in Hive etc. work 
because they are record oriented, with clean record demarcations using a 
newline.

One solution is to submit a patch to the Jackson parser to expose a method to 
skip to the next newline in the event of a parsing exception. This can be 
parameterized so that the behavior can be customized.
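For contrast, here is a minimal sketch (not Drill code and not the submitted 
patch) of why clean newline demarcation makes recovery trivial: when each 
record is confined to one line, a failed parse can simply be discarded and the 
reader resumes at the next line. The class name and the one-record-per-line 
assumption are illustrative only.

    import com.fasterxml.jackson.core.JsonProcessingException;
    import com.fasterxml.jackson.databind.JsonNode;
    import com.fasterxml.jackson.databind.ObjectMapper;

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;

    public class TolerantJsonLineReader {
      public static void main(String[] args) throws IOException {
        ObjectMapper mapper = new ObjectMapper();
        long badRecords = 0;
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
          String line;
          while ((line = in.readLine()) != null) {
            try {
              JsonNode record = mapper.readTree(line); // one record per line
              // ... hand the record to the consumer ...
            } catch (JsonProcessingException e) {
              badRecords++; // drop the malformed line; the next newline is a clean record boundary
            }
          }
        }
        System.out.println("Skipped " + badRecords + " malformed records");
      }
    }

A streaming parser whose records span lines has no such safe resume point, 
which is exactly why a skip-to-newline hook in Jackson would help.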



> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: Future
>
>
> Currently a Drill query terminates upon the first encounter of an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15525212#comment-15525212
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/518
  
The open question was how we can discard a partly-built record during 
recovery. As far as I can tell (veterans, please correct me), the 
JSONRecordReader keeps track of the record count. So, all we have to do is not 
increment the count when we want to discard a record. Look in

JSONRecordReader.next()
  ...
  outside: while (recordCount < DEFAULT_ROWS_PER_BATCH) {
    writer.setPosition(recordCount);  // Sets the position for the next read.
    write = jsonReader.write(writer); // Write the record. We can catch errors
                                      // and recover here??
    ...
    recordCount++;                    // Don't do this on a bad record
  ...
  writer.setValueCount(recordCount);  // The record reader controls the record count.

This seems to show the elements of a solution:

1. Try to read the record.
2. If a failure occurs, catch it here and clean up, as in the previous post.
3. Don't increment the record count. We reuse the current position on the next 
record read.

Now the only open question is how we clean up the in-flight record in case 
some columns are not present in the next record. Anyone know how to set a 
vector position to null (for optional), a default value (for required), or 
zero-length (for repeated)?
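To make those elements concrete, a sketch of the recovery skeleton (loosely 
following the snippet above; the ReadState checks and the 
skipMalformedRecord() helper are assumptions, not existing Drill APIs):

    while (recordCount < DEFAULT_ROWS_PER_BATCH) {
      writer.setPosition(recordCount);
      try {
        write = jsonReader.write(writer);   // try to read/write the record
        if (write == ReadState.END_OF_STREAM) {
          break;
        }
        recordCount++;                      // count only records that parsed cleanly
      } catch (JsonParseException e) {      // assumed exception type
        jsonReader.skipMalformedRecord();   // hypothetical: advance past the bad input
        // recordCount is NOT incremented, so the next good record
        // overwrites the partially built vector position.
      }
    }
    writer.setValueCount(recordCount);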


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: Future
>
>
> Currently a Drill query terminates upon the first encounter of an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524922#comment-15524922
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80612967
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
     return newScan;
   }
 
+  // clone to create new groupscan with new file selection and batchSize.
+  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
+    ParquetGroupScan newScan = new ParquetGroupScan(this);
+    newScan.modifyFileSelection(selection);
+    newScan.cacheFileRoot = selection.cacheFileRoot;
+    newScan.init(selection.getMetaContext());
+    newScan.batchSize = batchSize;
--- End diff --

Should we have a setter method for this? Or builder pattern?
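For illustration, one possible shape of that suggestion (hypothetical names, 
not actual Drill code):

    // Setter-based alternative; setCacheFileRoot/setBatchSize are hypothetical.
    public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
      ParquetGroupScan newScan = new ParquetGroupScan(this);
      newScan.modifyFileSelection(selection);
      newScan.setCacheFileRoot(selection.cacheFileRoot);
      newScan.init(selection.getMetaContext());
      newScan.setBatchSize(batchSize);
      return newScan;
    }

    private void setBatchSize(int batchSize) {
      this.batchSize = batchSize;
    }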


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from  limit N; 
> where N < size of a Parquet row group, we are reading 32K/64K rows or the entire 
> row group. This needs to be optimized to read only N rows.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524920#comment-15524920
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80612914
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
     return newScan;
   }
 
+  // clone to create new groupscan with new file selection and batchSize.
+  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
+    ParquetGroupScan newScan = new ParquetGroupScan(this);
+    newScan.modifyFileSelection(selection);
+    newScan.cacheFileRoot = selection.cacheFileRoot;
--- End diff --

Should we have a setter method for this?


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from  limit N; 
> where N < size of a Parquet row group, we are reading 32K/64K rows or the entire 
> row group. This needs to be optimized to read only N rows.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524921#comment-15524921
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user gparai commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80613372
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -926,16 +944,22 @@ public GroupScan applyLimit(long maxRecords) {
       fileNames.add(rowGroupInfo.getPath());
     }
 
-    if (fileNames.size() == fileSet.size()) {
+    // If there is no change in fileSet and maxRecords is >= batchSize, no need to create a new groupScan.
+    if (fileNames.size() == fileSet.size() && (maxRecords >= batchSize)) {
       // There is no reduction of rowGroups. Return the original groupScan.
       logger.debug("applyLimit() does not apply!");
       return null;
     }
 
+    // If limit maxRecords is less than batchSize, update batchSize to the limit size.
+    if (maxRecords < batchSize) {
+      batchSize = (int) maxRecords;
--- End diff --

Can we avoid the cast?
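Since this branch only runs when maxRecords < batchSize, and batchSize is an 
int, the narrowing cast cannot actually overflow; one way to make that 
reasoning explicit without a raw cast (a sketch, Java 8+):

    if (maxRecords < batchSize) {
      batchSize = Math.toIntExact(maxRecords); // throws ArithmeticException if it ever overflows
    }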


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from  limit N; 
> where N < size of a Parquet row group, we are reading 32K/64K rows or the entire 
> row group. This needs to be optimized to read only N rows.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524755#comment-15524755
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80608344
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
 ---
@@ -107,7 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupScan
         if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val
             && !isComplex(footers.get(e.getPath()))) {
           readers.add(
               new ParquetRecordReader(
-                  context, e.getPath(), e.getRowGroupIndex(), fs,
+                  context, rowGroupScan.getBatchSize(), e.getPath(), e.getRowGroupIndex(), fs,
--- End diff --

What about DrillParquetReader? 


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from  limit N; 
> where N < size of a Parquet row group, we are reading 32K/64K rows or the entire 
> row group. This needs to be optimized to read only N rows.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524754#comment-15524754
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80606955
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -115,6 +115,8 @@
   private List<RowGroupInfo> rowGroupInfos;
   private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
   private String cacheFileRoot = null;
+  private int batchSize;
+  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
--- End diff --

Why do we have this DEFAULT_BATCH_LENGTH, in addition to the new option 
"store.parquet.record_batch_size"?


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from  limit N; 
> where N < size of a Parquet row group, we are reading 32K/64K rows or the entire 
> row group. This needs to be optimized to read only N rows.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524649#comment-15524649
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

GitHub user ppadma opened a pull request:

https://github.com/apache/drill/pull/597

DRILL-4905: Push down the LIMIT to the parquet reader scan.

For a limit N query, where N is less than the current default record batchSize 
(256K for all fixed-length, 32K otherwise), we still end up reading all 256K/32K 
rows from disk if the rowGroup has that many rows. This causes performance 
degradation, especially when there are a large number of columns. 
This fix tries to address the problem by changing the record batchSize the 
parquet record reader uses, so we don't read more than what is needed. 
Also added a sys option (store.parquet.record_batch_size) to be able to 
set the record batch size.
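For example, the option could be exercised over JDBC like this (a sketch; the 
connection URL, table path, and the value 1024 are illustrative, while the 
option name comes from this PR):

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;

    public class BatchSizeOptionExample {
      public static void main(String[] args) throws SQLException {
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
             Statement stmt = conn.createStatement()) {
          // Lower the parquet reader batch size before running a small LIMIT query.
          stmt.execute("ALTER SESSION SET `store.parquet.record_batch_size` = 1024");
          try (ResultSet rs = stmt.executeQuery(
              "SELECT * FROM dfs.`/data/example.parquet` LIMIT 10")) {
            while (rs.next()) { /* consume the 10 rows */ }
          }
        }
      }
    }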

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ppadma/drill DRILL-4905

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/597.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #597


commit cd665ebdba11f8685ba446f5ec535c81ddd6edc7
Author: Padma Penumarthy 
Date:   2016-09-26T17:51:07Z

DRILL-4905: Push down the LIMIT to the parquet reader scan to limit the 
numbers of records read




> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from  limit N; 
> where N < size of a Parquet row group, we are reading 32K/64K rows or the entire 
> row group. This needs to be optimized to read only N rows.
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (DRILL-4880) Support JDBC driver registration using ServiceLoader

2016-09-26 Thread Laurent Goujon (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Laurent Goujon reassigned DRILL-4880:
-

Assignee: Laurent Goujon

> Support JDBC driver registration using ServiceLoader 
> -
>
> Key: DRILL-4880
> URL: https://issues.apache.org/jira/browse/DRILL-4880
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Client - JDBC
>Affects Versions: 1.8.0
> Environment: Windows Server 2012
>Reporter: Sudip Mukherjee
>Assignee: Laurent Goujon
> Fix For: 1.9.0
>
>
> Currently drill-jdbc-all*.jar doesn't contain a 
> META-INF/services/java.sql.Driver file, which is apparently used to discover a 
> service by the Java ServiceLoader API.
> Can the Drill JDBC driver have this file, like all the other JDBC drivers, so that 
> the driver can be loaded using ServiceLoader instead of a direct 
> Class.forName?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4880) Support JDBC driver registration using ServiceLoader

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4880?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524631#comment-15524631
 ] 

ASF GitHub Bot commented on DRILL-4880:
---

GitHub user laurentgo opened a pull request:

https://github.com/apache/drill/pull/596

DRILL-4880: Support JDBC driver registration using ServiceLoader

Support loading the Drill driver using ServiceLoader. From the user perspective,
it means being able to use the driver without first registering it (for example,
via Class.forName("org.apache.drill.jdbc.Driver")).
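Under JDBC 4.0, DriverManager discovers any driver named in a 
META-INF/services/java.sql.Driver resource on the classpath via ServiceLoader, 
so user code reduces to a plain getConnection call. A sketch (the zk=local URL 
assumes an embedded/local Drillbit):

    // drill-jdbc-all would ship the resource
    //   META-INF/services/java.sql.Driver
    // containing the single line: org.apache.drill.jdbc.Driver

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.SQLException;

    public class DrillServiceLoaderExample {
      public static void main(String[] args) throws SQLException {
        // No Class.forName("org.apache.drill.jdbc.Driver") needed.
        try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local")) {
          System.out.println("Connected: " + !conn.isClosed());
        }
      }
    }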

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/laurentgo/drill laurent/DRILL-4880

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/596.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #596


commit b3112b084d29eb9d4d09a1497457ca725335da84
Author: Laurent Goujon 
Date:   2016-09-27T00:14:59Z

DRILL-4880: Support JDBC driver registration using ServiceLoader

Support loading the Drill driver using ServiceLoader. From the user perspective,
it means being able to use the driver without first registering it (for example,
via Class.forName("org.apache.drill.jdbc.Driver")).




> Support JDBC driver registration using ServiceLoader 
> -
>
> Key: DRILL-4880
> URL: https://issues.apache.org/jira/browse/DRILL-4880
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Client - JDBC
>Affects Versions: 1.8.0
> Environment: Windows Server 2012
>Reporter: Sudip Mukherjee
> Fix For: 1.9.0
>
>
> Currently drill-jdbc-all*.jar doesn't contain a 
> META-INF/services/java.sql.Driver file, which is apparently used to discover a 
> service by the Java ServiceLoader API.
> Can the Drill JDBC driver have this file, like all the other JDBC drivers, so that 
> the driver can be loaded using ServiceLoader instead of a direct 
> Class.forName?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524281#comment-15524281
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80547641
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
 ---
@@ -301,29 +323,120 @@ private ScanResult scan(ClassLoader classLoader, 
Path path, URL[] urls) throws I
 return RunTimeScan.dynamicPackageScan(drillConfig, 
Sets.newHashSet(urls));
   }
 }
-throw new FunctionValidationException(String.format("Marker file %s is 
missing in %s.",
+throw new JarValidationException(String.format("Marker file %s is 
missing in %s",
 CommonConstants.DRILL_JAR_MARKER_FILE_RESOURCE_PATHNAME, 
path.getName()));
   }
 
-  private static String getUdfDir() {
-return Preconditions.checkNotNull(System.getenv("DRILL_UDF_DIR"), 
"DRILL_UDF_DIR variable is not set");
+  /**
+   * Return list of jars that are missing in local function registry
+   * but present in remote function registry.
+   *
+   * @param remoteFunctionRegistry remote function registry
+   * @param localFunctionRegistry local function registry
+   * @return list of missing jars
+   */
+  private List<String> getMissingJars(RemoteFunctionRegistry remoteFunctionRegistry,
+                                      LocalFunctionRegistry localFunctionRegistry) {
+    List<Jar> remoteJars = remoteFunctionRegistry.getRegistry().getJarList();
+    List<String> localJars = localFunctionRegistry.getAllJarNames();
+    List<String> missingJars = Lists.newArrayList();
+    for (Jar jar : remoteJars) {
+      if (!localJars.contains(jar.getName())) {
+        missingJars.add(jar.getName());
+      }
+    }
+    return missingJars;
+  }
+
+  /**
+   * Creates local udf directory, if it doesn't exist.
+   * Checks that it is a directory and that the current application has write rights on it.
+   * Attempts to clean up the local udf directory in case jars were left after a previous drillbit run.
+   *
+   * @return path to local udf directory
+   */
+  private Path getLocalUdfDir() {
+String confDir = getConfDir();
--- End diff --

Unfortunately, this won't work in the case of Drill-on-YARN. The 
$DRILL_HOME and $DRILL_CONF_DIR directories are read-only in that case.

The new site directory (pointed to by DRILL_CONF_DIR) will contain a "jars" 
directory that contains statically-defined UDFs. In Drill-on-YARN, YARN copies 
all of the site directory to the local machine, but makes it read-only so that 
YARN can reuse that same "localized" copy for multiple runs. (That feature is 
handy for map/reduce, but is not that useful for Drill. Still, that's how YARN 
works...)

One solution: provide a config option that specifies the local UDF 
location. The Apache Drill default can be the config dir (assuming there is a 
way to reference the config dir from within drill-override.conf -- need to 
check that.) For DoY, we will change the location to be a temp directory 
location provided by YARN.

Using the YARN temp directory ensures that the local udf dir starts out 
empty on each run. But, what about the "stock" Drill case? The 
$DRILL_CONF_DIR/udf directory probably will contain jars from a previous run. 
Is this desired? Does the code handle this case? Do we clean out UDFs that were 
dropped while the Drillbit was offline? Do we handle a partially-downloaded jar 
that was left incomplete when the previous run crashed?

Or, would it be better to clear the udf directory on the start of each 
Drill run? If we do that, can we always write udfs to a temp directory? Perhaps 
review the temp directories available.

Since DoY defines the temp directory at runtime, we need to set the temp 
directory in drill-config.sh (which you did in a previous version). As it turns 
out, Drill already has temp directories set in the config system (for 
spill-to-disk), so we need to reconcile these two.

Perhaps this:

Define DRILL_TEMP_DIR in drill-config.sh. If it is set in the environment 
(the DoY case) or drill-env.sh (the non-DoY case), use it. Else, default to 
/tmp.

Under DoY, we can run multiple drillbits on the same host (by changing 
ports, etc.) So we need a unique path. Define the actual Drillbit temp 
directory to be

drillbit-temp-dir = $DRILL_TEMP_DIR/${drill-root}-${cluster-id}

We need both the root and cluster ID because neither is unique by itself, 
unfortunately.

Finally, udfs can reside in ${drillbit-temp-dir}/udf
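In code, that lookup order might look like the following (a sketch only: 
DRILL_TEMP_DIR and the root/cluster-id suffix come from the proposal above; 
the method name is hypothetical):

    // Mirrors the proposed resolution order: environment/drill-env.sh value, else /tmp.
    private static String resolveDrillbitTempDir(String drillRoot, String clusterId) {
      String base = System.getenv("DRILL_TEMP_DIR");
      if (base == null) {
        base = "/tmp"; // default when neither DoY nor drill-env.sh sets it
      }
      // Unique per Drillbit: neither the root nor the cluster id is unique by itself.
      return base + "/" + drillRoot + "-" + clusterId;
    }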

This is just one possibility to illustrate the issue. Feel fre

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524288#comment-15524288
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80549839
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
 ---
@@ -301,29 +323,120 @@ private ScanResult scan(ClassLoader classLoader, 
Path path, URL[] urls) throws I
 return RunTimeScan.dynamicPackageScan(drillConfig, 
Sets.newHashSet(urls));
   }
 }
-throw new FunctionValidationException(String.format("Marker file %s is 
missing in %s.",
+throw new JarValidationException(String.format("Marker file %s is 
missing in %s",
 CommonConstants.DRILL_JAR_MARKER_FILE_RESOURCE_PATHNAME, 
path.getName()));
   }
 
-  private static String getUdfDir() {
-return Preconditions.checkNotNull(System.getenv("DRILL_UDF_DIR"), 
"DRILL_UDF_DIR variable is not set");
+  /**
+   * Return list of jars that are missing in local function registry
+   * but present in remote function registry.
+   *
+   * @param remoteFunctionRegistry remote function registry
+   * @param localFunctionRegistry local function registry
+   * @return list of missing jars
+   */
+  private List<String> getMissingJars(RemoteFunctionRegistry remoteFunctionRegistry,
+                                      LocalFunctionRegistry localFunctionRegistry) {
+    List<Jar> remoteJars = remoteFunctionRegistry.getRegistry().getJarList();
+    List<String> localJars = localFunctionRegistry.getAllJarNames();
+    List<String> missingJars = Lists.newArrayList();
+    for (Jar jar : remoteJars) {
+      if (!localJars.contains(jar.getName())) {
+        missingJars.add(jar.getName());
+      }
+    }
+    return missingJars;
+  }
+
+  /**
+   * Creates local udf directory, if it doesn't exist.
+   * Checks that it is a directory and that the current application has write rights on it.
+   * Attempts to clean up the local udf directory in case jars were left after a previous drillbit run.
+   *
+   * @return path to local udf directory
+   */
+  private Path getLocalUdfDir() {
+String confDir = getConfDir();
+File udfDir = new File(confDir, "udf");
+String udfPath = udfDir.getPath();
+udfDir.mkdirs();
+Preconditions.checkState(udfDir.exists(), "Local udf directory [%s] 
must exist", udfPath);
+Preconditions.checkState(udfDir.isDirectory(), "Local udf directory 
[%s] must be a directory", udfPath);
+Preconditions.checkState(udfDir.canWrite(), "Local udf directory [%s] 
must be writable for application user", udfPath);
+try {
+  FileUtils.cleanDirectory(udfDir);
+} catch (IOException e) {
+  throw new DrillRuntimeException("Error during local udf directory 
clean up", e);
+}
+return new Path(udfDir.toURI());
+  }
+
+  /**
+   * First tries to get drill conf directory value from system properties,
+   * if value is missing, checks environment properties.
+   * Throws exception if value is null.
+   * @return drill conf dir path
+   */
+  private String getConfDir() {
+String drillConfDir = "DRILL_CONF_DIR";
+String value = System.getProperty(drillConfDir);
+if (value == null) {
+  value = Preconditions.checkNotNull(System.getenv(drillConfDir), "%s 
variable is not set", drillConfDir);
+}
+return value;
+  }
+
+  /**
+   * Copies jar from remote udf area to local udf area with numeric suffix,
+   * in order to achieve uniqueness for each locally copied jar.
+   * Ex: DrillUDF-1.0.jar -> DrillUDF-1.0_12200255588.jar
+   *
+   * @param jarName jar name to be copied
+   * @param remoteFunctionRegistry remote function registry
+   * @return local path to jar that was copied
+   * @throws IOException in case of problems during jar copying process
+   */
+  private Path copyJarToLocal(String jarName, RemoteFunctionRegistry 
remoteFunctionRegistry) throws IOException {
+String generatedName = String.format(generated_jar_name_pattern,
+Files.getNameWithoutExtension(jarName), System.nanoTime(), 
Files.getFileExtension(jarName));
+Path registryArea = remoteFunctionRegistry.getRegistryArea();
+FileSystem fs = remoteFunctionRegistry.getFs();
+Path remoteJar = new Path(registryArea, jarName);
+Path localJar = new Path(localUdfDir, generatedName);
+try {
+  fs.copyToLocalFile(remoteJar, localJar);
+} catch (IOException e) {
+  String message = String.format("Error during jar [%

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524283#comment-15524283
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80539844
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
 ---
@@ -120,9 +129,9 @@ public FunctionImplementationRegistry(DrillConfig 
config, ScanResult classpathSc
* Register functions in given operator table.
* @param operatorTable
*/
-  public void register(DrillOperatorTable operatorTable) {
+  public void register(DrillOperatorTable operatorTable, AtomicLong 
version) {
--- End diff --

Perhaps describe the semantics (purpose) of the atomic long. Is this for 
function versioning? If so, perhaps add a description of how this works to the 
Javadoc comment for this class so that the reader understands what's happening.


> Dynamic UDFs support
> 
>
> Key: DRILL-4726
> URL: https://issues.apache.org/jira/browse/DRILL-4726
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.6.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: Future
>
>
> Allow registering UDFs without restart of Drillbits.
> Design is described in document below:
> https://docs.google.com/document/d/1FfyJtWae5TLuyheHCfldYUpCdeIezR2RlNsrOTYyAB4/edit?usp=sharing
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524285#comment-15524285
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80548125
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
 ---
@@ -301,29 +323,120 @@ private ScanResult scan(ClassLoader classLoader, 
Path path, URL[] urls) throws I
 return RunTimeScan.dynamicPackageScan(drillConfig, 
Sets.newHashSet(urls));
   }
 }
-throw new FunctionValidationException(String.format("Marker file %s is 
missing in %s.",
+throw new JarValidationException(String.format("Marker file %s is 
missing in %s",
 CommonConstants.DRILL_JAR_MARKER_FILE_RESOURCE_PATHNAME, 
path.getName()));
   }
 
-  private static String getUdfDir() {
-return Preconditions.checkNotNull(System.getenv("DRILL_UDF_DIR"), 
"DRILL_UDF_DIR variable is not set");
+  /**
+   * Return list of jars that are missing in local function registry
+   * but present in remote function registry.
+   *
+   * @param remoteFunctionRegistry remote function registry
+   * @param localFunctionRegistry local function registry
+   * @return list of missing jars
+   */
+  private List<String> getMissingJars(RemoteFunctionRegistry remoteFunctionRegistry,
+                                      LocalFunctionRegistry localFunctionRegistry) {
+    List<Jar> remoteJars = remoteFunctionRegistry.getRegistry().getJarList();
+    List<String> localJars = localFunctionRegistry.getAllJarNames();
+    List<String> missingJars = Lists.newArrayList();
+    for (Jar jar : remoteJars) {
+      if (!localJars.contains(jar.getName())) {
+        missingJars.add(jar.getName());
+      }
+    }
+    return missingJars;
+  }
+
+  /**
+   * Creates local udf directory, if it doesn't exist.
+   * Checks that it is a directory and that the current application has write rights on it.
+   * Attempts to clean up the local udf directory in case jars were left after a previous drillbit run.
+   *
+   * @return path to local udf directory
+   */
+  private Path getLocalUdfDir() {
+String confDir = getConfDir();
+File udfDir = new File(confDir, "udf");
+String udfPath = udfDir.getPath();
+udfDir.mkdirs();
+Preconditions.checkState(udfDir.exists(), "Local udf directory [%s] 
must exist", udfPath);
+Preconditions.checkState(udfDir.isDirectory(), "Local udf directory 
[%s] must be a directory", udfPath);
+Preconditions.checkState(udfDir.canWrite(), "Local udf directory [%s] 
must be writable for application user", udfPath);
+try {
+  FileUtils.cleanDirectory(udfDir);
+} catch (IOException e) {
+  throw new DrillRuntimeException("Error during local udf directory 
clean up", e);
+}
+return new Path(udfDir.toURI());
+  }
+
+  /**
+   * First tries to get drill conf directory value from system properties,
+   * if value is missing, checks environment properties.
+   * Throws exception if value is null.
+   * @return drill conf dir path
+   */
+  private String getConfDir() {
+String drillConfDir = "DRILL_CONF_DIR";
+String value = System.getProperty(drillConfDir);
+if (value == null) {
+  value = Preconditions.checkNotNull(System.getenv(drillConfDir), "%s 
variable is not set", drillConfDir);
+}
+return value;
+  }
+
+  /**
+   * Copies jar from remote udf area to local udf area with numeric suffix,
+   * in order to achieve uniqueness for each locally copied jar.
+   * Ex: DrillUDF-1.0.jar -> DrillUDF-1.0_12200255588.jar
+   *
+   * @param jarName jar name to be copied
+   * @param remoteFunctionRegistry remote function registry
+   * @return local path to jar that was copied
+   * @throws IOException in case of problems during jar copying process
+   */
+  private Path copyJarToLocal(String jarName, RemoteFunctionRegistry 
remoteFunctionRegistry) throws IOException {
+String generatedName = String.format(generated_jar_name_pattern,
--- End diff --

File.createTempFile ?
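That is, something along these lines (a sketch; it assumes the local udf 
directory is available as a java.io.File, whereas the code above holds a 
Hadoop Path, so a conversion would be needed):

    // File.createTempFile guarantees a unique name in the given directory,
    // replacing the hand-rolled System.nanoTime() suffix.
    File localJarFile = File.createTempFile(
        Files.getNameWithoutExtension(jarName) + "_",  // prefix, e.g. "DrillUDF-1.0_"
        "." + Files.getFileExtension(jarName),         // suffix, e.g. ".jar"
        localUdfDirAsFile);                            // hypothetical File form of localUdfDir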


> Dynamic UDFs support
> 
>
> Key: DRILL-4726
> URL: https://issues.apache.org/jira/browse/DRILL-4726
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.6.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: Future
>
>
> Allow registering UDFs without restart of Drillbits.
> Design is described in document below:
> http

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524282#comment-15524282
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80548740
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
 ---
@@ -301,29 +323,120 @@ private ScanResult scan(ClassLoader classLoader, 
Path path, URL[] urls) throws I
 return RunTimeScan.dynamicPackageScan(drillConfig, 
Sets.newHashSet(urls));
   }
 }
-throw new FunctionValidationException(String.format("Marker file %s is 
missing in %s.",
+throw new JarValidationException(String.format("Marker file %s is 
missing in %s",
 CommonConstants.DRILL_JAR_MARKER_FILE_RESOURCE_PATHNAME, 
path.getName()));
   }
 
-  private static String getUdfDir() {
-return Preconditions.checkNotNull(System.getenv("DRILL_UDF_DIR"), 
"DRILL_UDF_DIR variable is not set");
+  /**
+   * Return list of jars that are missing in local function registry
+   * but present in remote function registry.
+   *
+   * @param remoteFunctionRegistry remote function registry
+   * @param localFunctionRegistry local function registry
+   * @return list of missing jars
+   */
+  private List<String> getMissingJars(RemoteFunctionRegistry remoteFunctionRegistry,
+                                      LocalFunctionRegistry localFunctionRegistry) {
+    List<Jar> remoteJars = remoteFunctionRegistry.getRegistry().getJarList();
+    List<String> localJars = localFunctionRegistry.getAllJarNames();
+    List<String> missingJars = Lists.newArrayList();
+    for (Jar jar : remoteJars) {
+      if (!localJars.contains(jar.getName())) {
+        missingJars.add(jar.getName());
+      }
+    }
+    return missingJars;
+  }
+
+  /**
+   * Creates local udf directory, if it doesn't exist.
+   * Checks that it is a directory and that the current application has write rights on it.
+   * Attempts to clean up the local udf directory in case jars were left after a previous drillbit run.
+   *
+   * @return path to local udf directory
+   */
+  private Path getLocalUdfDir() {
+String confDir = getConfDir();
+File udfDir = new File(confDir, "udf");
+String udfPath = udfDir.getPath();
+udfDir.mkdirs();
+Preconditions.checkState(udfDir.exists(), "Local udf directory [%s] 
must exist", udfPath);
+Preconditions.checkState(udfDir.isDirectory(), "Local udf directory 
[%s] must be a directory", udfPath);
+Preconditions.checkState(udfDir.canWrite(), "Local udf directory [%s] 
must be writable for application user", udfPath);
+try {
+  FileUtils.cleanDirectory(udfDir);
+} catch (IOException e) {
+  throw new DrillRuntimeException("Error during local udf directory 
clean up", e);
+}
+return new Path(udfDir.toURI());
+  }
+
+  /**
+   * First tries to get drill conf directory value from system properties,
+   * if value is missing, checks environment properties.
+   * Throws exception if value is null.
+   * @return drill conf dir path
+   */
+  private String getConfDir() {
+String drillConfDir = "DRILL_CONF_DIR";
+String value = System.getProperty(drillConfDir);
+if (value == null) {
+  value = Preconditions.checkNotNull(System.getenv(drillConfDir), "%s 
variable is not set", drillConfDir);
+}
+return value;
+  }
+
+  /**
+   * Copies jar from remote udf area to local udf area with numeric suffix,
+   * in order to achieve uniqueness for each locally copied jar.
+   * Ex: DrillUDF-1.0.jar -> DrillUDF-1.0_12200255588.jar
+   *
+   * @param jarName jar name to be copied
+   * @param remoteFunctionRegistry remote function registry
+   * @return local path to jar that was copied
+   * @throws IOException in case of problems during jar copying process
+   */
+  private Path copyJarToLocal(String jarName, RemoteFunctionRegistry 
remoteFunctionRegistry) throws IOException {
+String generatedName = String.format(generated_jar_name_pattern,
+Files.getNameWithoutExtension(jarName), System.nanoTime(), 
Files.getFileExtension(jarName));
+Path registryArea = remoteFunctionRegistry.getRegistryArea();
+FileSystem fs = remoteFunctionRegistry.getFs();
+Path remoteJar = new Path(registryArea, jarName);
+Path localJar = new Path(localUdfDir, generatedName);
+try {
+  fs.copyToLocalFile(remoteJar, localJar);
+} catch (IOException e) {
+  String message = String.format("Error during jar [%

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524291#comment-15524291
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80578885
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DropFunctionHandler.java
 ---
@@ -48,54 +48,77 @@ public DropFunctionHandler(SqlHandlerConfig config) {
   }
 
   /**
-   * Drops UDFs dynamically.
+   * Unregisters UDFs dynamically. Process consists of several steps:
+   * 1. Registers the jar in the jar registry to ensure that several jars with the same name are not being unregistered.
+   * 2. Starts the remote unregistration process, gets the list of all jars and excludes the jar to be deleted.
+   * 3. Signals drill bits to start the local unregistration process.
+   * 4. Removes source and binary jars from the registry area.
--- End diff --

Can we explain the semantics regarding in-flight queries? If query Q is 
executing while we unregister UDF U, what happens?

Suppose user A submits query Q. The foreman for Q finds that Q uses UDF U. 
The rest of the prepare takes a while.

At the same time, user B unregisters U (right after the Foreman for Q 
verified that U exists.)

The Foreman for Q now sends fragments to other Drillbits.

Should Q find that U is available on those Drillbits? Or, is it OK for Q to 
fail because at least one Drillbit has started to unregister U?

Now, suppose that Drill queues are turned on and Q sits in the queue for a 
minute before executing. Should it still find that U is available on the 
various Drillbits?

This all may be worked out, the point here is just to explain the intended 
behavior.


> Dynamic UDFs support
> 
>
> Key: DRILL-4726
> URL: https://issues.apache.org/jira/browse/DRILL-4726
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.6.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: Future
>
>
> Allow registering UDFs without restart of Drillbits.
> Design is described in document below:
> https://docs.google.com/document/d/1FfyJtWae5TLuyheHCfldYUpCdeIezR2RlNsrOTYyAB4/edit?usp=sharing
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524292#comment-15524292
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80559233
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/registry/FunctionRegistryHolder.java
 ---
@@ -0,0 +1,360 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.fn.registry;
+
+import com.google.common.collect.ArrayListMultimap;
+import com.google.common.collect.ListMultimap;
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.collect.Queues;
+import org.apache.drill.common.concurrent.AutoCloseableLock;
+import org.apache.drill.exec.expr.fn.DrillFuncHolder;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Queue;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.ReadWriteLock;
+import java.util.concurrent.locks.ReentrantReadWriteLock;
+
+/**
+ * Function registry holder stores function implementations by jar name and function name.
+ * Contains two maps that hold data by jars and functions respectively.
+ * Jars map contains each jar as a key and map of all its functions with 
collection of function signatures as value.
+ * Functions map contains function name as key and map of its signatures 
and function holder as value.
+ * All maps and collections used are concurrent to guarantee memory 
consistency effects.
+ * Such structure is chosen to achieve maximum speed while retrieving data 
by jar or by function name,
+ * since we expect infrequent registry changes.
+ * Holder is designed to allow concurrent reads and single writes to keep 
data consistent.
+ * This is achieved by using a {@link ReadWriteLock} implementation.
+ * The holder has a version number which changes every time new jars are added or removed; the initial version is 0.
+ * The version is also used when a user needs registry data together with the version it is based on.
+ *
+ * Structure example:
+ *
+ * JARS
+ * built-in   -> upper  -> upper(VARCHAR-REQUIRED)
+ *-> lower  -> lower(VARCHAR-REQUIRED)
+ *
+ * First.jar  -> upper  -> upper(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * Second.jar -> lower  -> lower(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * FUNCTIONS
+ * upper-> upper(VARCHAR-REQUIRED)-> function holder for 
upper(VARCHAR-REQUIRED)
+ *  -> upper(VARCHAR-OPTIONAL)-> function holder for 
upper(VARCHAR-OPTIONAL)
+ *
+ * lower-> lower(VARCHAR-REQUIRED)-> function holder for 
lower(VARCHAR-REQUIRED)
+ *  -> lower(VARCHAR-OPTIONAL)-> function holder for 
lower(VARCHAR-OPTIONAL)
+ *
+ * custom_upper -> custom_upper(VARCHAR-REQUIRED) -> function holder for 
custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL) -> function holder for 
custom_upper(VARCHAR-OPTIONAL)
+ *
+ * custom_lower -> custom_lower(VARCHAR-REQUIRED) -> function holder for 
custom_lower(VARCHAR-REQUIRED)
+ *  -> custom_lower(VARCHAR-OPTIONAL) -> function holder for 
custom_lower(VARCHAR-OPTIONAL)
+ *
+ * where
+ * First.jar is jar name represented by String
+ * upper is function name represented by String
+ * upper(VARCHAR-REQUIRED) is a signature name represented by String, which consists of the function name and the list of input parameters
+ * function holder for upper(VARCHAR-REQUIRED) is {@link Dr

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524284#comment-15524284
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80554787
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
 ---
@@ -301,29 +323,120 @@ private ScanResult scan(ClassLoader classLoader, 
Path path, URL[] urls) throws I
 return RunTimeScan.dynamicPackageScan(drillConfig, 
Sets.newHashSet(urls));
   }
 }
-throw new FunctionValidationException(String.format("Marker file %s is 
missing in %s.",
+throw new JarValidationException(String.format("Marker file %s is 
missing in %s",
 CommonConstants.DRILL_JAR_MARKER_FILE_RESOURCE_PATHNAME, 
path.getName()));
   }
 
-  private static String getUdfDir() {
-return Preconditions.checkNotNull(System.getenv("DRILL_UDF_DIR"), 
"DRILL_UDF_DIR variable is not set");
+  /**
+   * Return list of jars that are missing in local function registry
+   * but present in remote function registry.
+   *
+   * @param remoteFunctionRegistry remote function registry
+   * @param localFunctionRegistry local function registry
+   * @return list of missing jars
+   */
+  private List<String> getMissingJars(RemoteFunctionRegistry remoteFunctionRegistry,
+                                      LocalFunctionRegistry localFunctionRegistry) {
+    List<Jar> remoteJars = remoteFunctionRegistry.getRegistry().getJarList();
+    List<String> localJars = localFunctionRegistry.getAllJarNames();
+    List<String> missingJars = Lists.newArrayList();
+    for (Jar jar : remoteJars) {
+      if (!localJars.contains(jar.getName())) {
+        missingJars.add(jar.getName());
+      }
+    }
+    return missingJars;
+  }
+
+  /**
+   * Creates local udf directory, if it doesn't exist.
+   * Checks that it is a directory and that the current application has write rights on it.
+   * Attempts to clean up the local udf directory in case jars were left after a previous drillbit run.
+   *
+   * @return path to local udf directory
+   */
+  private Path getLocalUdfDir() {
+String confDir = getConfDir();
+File udfDir = new File(confDir, "udf");
+String udfPath = udfDir.getPath();
+udfDir.mkdirs();
+Preconditions.checkState(udfDir.exists(), "Local udf directory [%s] 
must exist", udfPath);
+Preconditions.checkState(udfDir.isDirectory(), "Local udf directory 
[%s] must be a directory", udfPath);
+Preconditions.checkState(udfDir.canWrite(), "Local udf directory [%s] 
must be writable for application user", udfPath);
+try {
+  FileUtils.cleanDirectory(udfDir);
+} catch (IOException e) {
+  throw new DrillRuntimeException("Error during local udf directory 
clean up", e);
+}
+return new Path(udfDir.toURI());
+  }
+
+  /**
+   * First tries to get drill conf directory value from system properties,
+   * if value is missing, checks environment properties.
+   * Throws exception if value is null.
+   * @return drill conf dir path
+   */
+  private String getConfDir() {
+String drillConfDir = "DRILL_CONF_DIR";
+String value = System.getProperty(drillConfDir);
+if (value == null) {
+  value = Preconditions.checkNotNull(System.getenv(drillConfDir), "%s 
variable is not set", drillConfDir);
+}
+return value;
+  }
+
+  /**
+   * Copies jar from remote udf area to local udf area with numeric suffix,
+   * in order to achieve uniqueness for each locally copied jar.
+   * Ex: DrillUDF-1.0.jar -> DrillUDF-1.0_12200255588.jar
+   *
+   * @param jarName jar name to be copied
+   * @param remoteFunctionRegistry remote function registry
+   * @return local path to jar that was copied
+   * @throws IOException in case of problems during jar copying process
+   */
+  private Path copyJarToLocal(String jarName, RemoteFunctionRegistry 
remoteFunctionRegistry) throws IOException {
+String generatedName = String.format(generated_jar_name_pattern,
+Files.getNameWithoutExtension(jarName), System.nanoTime(), 
Files.getFileExtension(jarName));
+Path registryArea = remoteFunctionRegistry.getRegistryArea();
+FileSystem fs = remoteFunctionRegistry.getFs();
+Path remoteJar = new Path(registryArea, jarName);
+Path localJar = new Path(localUdfDir, generatedName);
+try {
+  fs.copyToLocalFile(remoteJar, localJar);
+} catch (IOException e) {
+  String message = String.format("Error during jar [%

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524289#comment-15524289
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80558464
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/registry/FunctionRegistryHolder.java
 ---
@@ -0,0 +1,360 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.fn.registry;
+
+import com.google.common.collect.ArrayListMultimap;
+import com.google.common.collect.ListMultimap;
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.collect.Queues;
+import org.apache.drill.common.concurrent.AutoCloseableLock;
+import org.apache.drill.exec.expr.fn.DrillFuncHolder;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Queue;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.ReadWriteLock;
+import java.util.concurrent.locks.ReentrantReadWriteLock;
+
+/**
+ * Function registry holder stores function implementations by jar name and function name.
+ * Contains two maps that hold data by jars and functions respectively.
+ * Jars map contains each jar as a key and map of all its functions with 
collection of function signatures as value.
+ * Functions map contains function name as key and map of its signatures 
and function holder as value.
+ * All maps and collections used are concurrent to guarantee memory 
consistency effects.
+ * Such structure is chosen to achieve maximum speed while retrieving data 
by jar or by function name,
+ * since we expect infrequent registry changes.
+ * Holder is designed to allow concurrent reads and single writes to keep 
data consistent.
+ * This is achieved by using a {@link ReadWriteLock} implementation.
+ * The holder has a version number which changes every time new jars are added or removed; the initial version is 0.
+ * The version is also used when a user needs registry data together with the version it is based on.
+ *
+ * Structure example:
+ *
+ * JARS
+ * built-in   -> upper  -> upper(VARCHAR-REQUIRED)
+ *-> lower  -> lower(VARCHAR-REQUIRED)
+ *
+ * First.jar  -> upper  -> upper(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * Second.jar -> lower  -> lower(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * FUNCTIONS
+ * upper-> upper(VARCHAR-REQUIRED)-> function holder for 
upper(VARCHAR-REQUIRED)
+ *  -> upper(VARCHAR-OPTIONAL)-> function holder for 
upper(VARCHAR-OPTIONAL)
+ *
+ * lower-> lower(VARCHAR-REQUIRED)-> function holder for 
lower(VARCHAR-REQUIRED)
+ *  -> lower(VARCHAR-OPTIONAL)-> function holder for 
lower(VARCHAR-OPTIONAL)
+ *
+ * custom_upper -> custom_upper(VARCHAR-REQUIRED) -> function holder for 
custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL) -> function holder for 
custom_upper(VARCHAR-OPTIONAL)
+ *
+ * custom_lower -> custom_lower(VARCHAR-REQUIRED) -> function holder for 
custom_lower(VARCHAR-REQUIRED)
+ *  -> custom_lower(VARCHAR-OPTIONAL) -> function holder for 
custom_lower(VARCHAR-OPTIONAL)
+ *
+ * where
+ * First.jar is jar name represented by String
+ * upper is function name represented by String
+ * upper(VARCHAR-REQUIRED) is a signature name represented by String, which consists of the function name and the list of input parameters
+ * function holder for upper(VARCHAR-REQUIRED) is {@link Dr

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524287#comment-15524287
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80557682
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/registry/FunctionRegistryHolder.java
 ---
@@ -0,0 +1,360 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.fn.registry;
+
+import com.google.common.collect.ArrayListMultimap;
+import com.google.common.collect.ListMultimap;
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.collect.Queues;
+import org.apache.drill.common.concurrent.AutoCloseableLock;
+import org.apache.drill.exec.expr.fn.DrillFuncHolder;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Queue;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.ReadWriteLock;
+import java.util.concurrent.locks.ReentrantReadWriteLock;
+
+/**
+ * Function registry holder stores function implementations by jar name and function name.
+ * Contains two maps that hold data by jars and functions respectively.
+ * Jars map contains each jar as a key and map of all its functions with 
collection of function signatures as value.
+ * Functions map contains function name as key and map of its signatures 
and function holder as value.
+ * All maps and collections used are concurrent to guarantee memory 
consistency effects.
+ * Such structure is chosen to achieve maximum speed while retrieving data 
by jar or by function name,
+ * since we expect infrequent registry changes.
+ * Holder is designed to allow concurrent reads and single writes to keep 
data consistent.
+ * This is achieved by {@link ReadWriteLock} implementation usage.
+ * Holder has number version which changes every time new jars are added 
or removed. Initial version number is 0.
+ * Also version is used when user needs data from registry with version it 
is based on.
+ *
+ * Structure example:
+ *
+ * JARS
+ * built-in   -> upper  -> upper(VARCHAR-REQUIRED)
+ *-> lower  -> lower(VARCHAR-REQUIRED)
+ *
+ * First.jar  -> upper  -> upper(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * Second.jar -> lower  -> lower(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * FUNCTIONS
+ * upper-> upper(VARCHAR-REQUIRED)-> function holder for 
upper(VARCHAR-REQUIRED)
+ *  -> upper(VARCHAR-OPTIONAL)-> function holder for 
upper(VARCHAR-OPTIONAL)
+ *
+ * lower-> lower(VARCHAR-REQUIRED)-> function holder for 
lower(VARCHAR-REQUIRED)
+ *  -> lower(VARCHAR-OPTIONAL)-> function holder for 
lower(VARCHAR-OPTIONAL)
+ *
+ * custom_upper -> custom_upper(VARCHAR-REQUIRED) -> function holder for 
custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL) -> function holder for 
custom_upper(VARCHAR-OPTIONAL)
+ *
+ * custom_lower -> custom_lower(VARCHAR-REQUIRED) -> function holder for 
custom_lower(VARCHAR-REQUIRED)
+ *  -> custom_lower(VARCHAR-OPTIONAL) -> function holder for 
custom_lower(VARCHAR-OPTIONAL)
+ *
+ * where
+ * First.jar is jar name represented by String
+ * upper is function name represented by String
+ * upper(VARCHAR-REQUIRED) is signature name represented by String which 
consist of function name, list of input parameters
+ * function holder for upper(VARCHAR-REQUIRED) is {@link Dr
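
For reference, the scheme that Javadoc describes reduces to a ReadWriteLock wrapped
for try-with-resources plus a version counter. A minimal self-contained sketch (the
AutoCloseableLock stand-in below is illustrative, not Drill's class):

{code}
import java.util.concurrent.locks.Lock;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class ReadWriteHolderSketch {
  // Stand-in for Drill's AutoCloseableLock: acquires on open(),
  // releases on close(), so it composes with try-with-resources.
  static class AutoCloseableLock implements AutoCloseable {
    private final Lock lock;
    AutoCloseableLock(Lock lock) { this.lock = lock; }
    AutoCloseableLock open() { lock.lock(); return this; }
    @Override public void close() { lock.unlock(); }
  }

  private final ReadWriteLock readWriteLock = new ReentrantReadWriteLock();
  private final AutoCloseableLock readLock = new AutoCloseableLock(readWriteLock.readLock());
  private final AutoCloseableLock writeLock = new AutoCloseableLock(readWriteLock.writeLock());
  private long version = 0;

  public long getVersion() {
    try (AutoCloseableLock lock = readLock.open()) { // many concurrent readers
      return version;
    }
  }

  public void registerJars() {
    try (AutoCloseableLock lock = writeLock.open()) { // one writer at a time
      version++; // every add/remove of jars bumps the version
    }
  }
}
{code}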

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524290#comment-15524290
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80559591
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/registry/FunctionRegistryHolder.java
 ---
@@ -0,0 +1,360 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.fn.registry;
+
+import com.google.common.collect.ArrayListMultimap;
+import com.google.common.collect.ListMultimap;
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.collect.Queues;
+import org.apache.drill.common.concurrent.AutoCloseableLock;
+import org.apache.drill.exec.expr.fn.DrillFuncHolder;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Queue;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.ReadWriteLock;
+import java.util.concurrent.locks.ReentrantReadWriteLock;
+
+/**
+ * The function registry holder stores function implementations by jar name and function name.
+ * It contains two maps that hold data by jars and by functions respectively.
+ * The jars map contains each jar as a key, with a map of all its functions and their collections of function signatures as the value.
+ * The functions map contains the function name as a key, with a map of its signatures to function holders as the value.
+ * All maps and collections used are concurrent to guarantee memory consistency effects.
+ * This structure is chosen to achieve maximum speed when retrieving data by jar or by function name,
+ * since registry changes are expected to be infrequent.
+ * The holder is designed to allow concurrent reads and single writes to keep the data consistent.
+ * This is achieved using a {@link ReadWriteLock} implementation.
+ * The holder has a version number which changes every time jars are added or removed; the initial version number is 0.
+ * The version is also used when a caller needs registry data together with the version it is based on.
+ *
+ * Structure example:
+ *
+ * JARS
+ * built-in   -> upper         -> upper(VARCHAR-REQUIRED)
+ *            -> lower         -> lower(VARCHAR-REQUIRED)
+ *
+ * First.jar  -> upper         -> upper(VARCHAR-OPTIONAL)
+ *            -> custom_upper  -> custom_upper(VARCHAR-REQUIRED)
+ *                             -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * Second.jar -> lower         -> lower(VARCHAR-OPTIONAL)
+ *            -> custom_upper  -> custom_upper(VARCHAR-REQUIRED)
+ *                             -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * FUNCTIONS
+ * upper        -> upper(VARCHAR-REQUIRED)        -> function holder for upper(VARCHAR-REQUIRED)
+ *              -> upper(VARCHAR-OPTIONAL)        -> function holder for upper(VARCHAR-OPTIONAL)
+ *
+ * lower        -> lower(VARCHAR-REQUIRED)        -> function holder for lower(VARCHAR-REQUIRED)
+ *              -> lower(VARCHAR-OPTIONAL)        -> function holder for lower(VARCHAR-OPTIONAL)
+ *
+ * custom_upper -> custom_upper(VARCHAR-REQUIRED) -> function holder for custom_upper(VARCHAR-REQUIRED)
+ *              -> custom_upper(VARCHAR-OPTIONAL) -> function holder for custom_upper(VARCHAR-OPTIONAL)
+ *
+ * custom_lower -> custom_lower(VARCHAR-REQUIRED) -> function holder for custom_lower(VARCHAR-REQUIRED)
+ *              -> custom_lower(VARCHAR-OPTIONAL) -> function holder for custom_lower(VARCHAR-OPTIONAL)
+ *
+ * where
+ * First.jar is a jar name represented by a String
+ * upper is a function name represented by a String
+ * upper(VARCHAR-REQUIRED) is a signature name represented by a String which consists of the function name and its list of input parameters
+ * function holder for upper(VARCHAR-REQUIRED) is a {@link DrillFuncHolder}
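
The two-map layout described above can be pictured as nested concurrent maps keyed
by jar and by function name. A rough sketch of the shape (the types below are
illustrative placeholders, not the holder's exact fields):

{code}
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;

public class RegistryShapeSketch {
  // Placeholder for org.apache.drill.exec.expr.fn.DrillFuncHolder.
  static class FuncHolder {}

  // JARS: jar name -> (function name -> signatures contained in that jar).
  private final Map<String, Map<String, Queue<String>>> jars = new ConcurrentHashMap<>();

  // FUNCTIONS: function name -> (signature -> function holder).
  private final Map<String, Map<String, FuncHolder>> functions = new ConcurrentHashMap<>();

  public void register(String jar, String function, String signature, FuncHolder holder) {
    jars.computeIfAbsent(jar, j -> new ConcurrentHashMap<>())
        .computeIfAbsent(function, f -> new ConcurrentLinkedQueue<>())
        .add(signature);
    functions.computeIfAbsent(function, f -> new ConcurrentHashMap<>())
        .put(signature, holder);
  }

  // Retrieval by function name is a single map lookup, which is why the
  // structure favors read speed over compactness.
  public Map<String, FuncHolder> signaturesOf(String function) {
    return functions.getOrDefault(function, new ConcurrentHashMap<>());
  }
}
{code}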

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524286#comment-15524286
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80542988
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
 ---
@@ -179,14 +190,15 @@ public DrillFuncHolder findExactMatchingDrillFunction(String name, List<MajorType> argTypes, MajorType returnType, boolean retry) {
-    for (DrillFuncHolder h : drillFuncRegistry.getMethods(name)) {
+    AtomicLong version = new AtomicLong();
--- End diff --

Not entirely clear how the version works. We create it here and initialize it 
to 0. Then, in loadRemoteFunctions( ), we compare the local version (always 0) 
to the registry version (which will be zero only until, presumably, the first 
load). Are we missing an initialization here to the current registry version? 
Otherwise, please explain how this works.
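
For what it's worth, the pattern in question reads like an out parameter: the callee
stamps the AtomicLong with the version the returned data is based on, so initializing
it to 0 at the call site would be harmless. A hypothetical sketch of that contract
(names are illustrative):

{code}
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

class VersionStampSketch {
  static class LocalRegistry {
    private volatile long version = 0;

    // Returns the data and writes the version it is based on into versionOut.
    List<String> getMethods(String name, AtomicLong versionOut) {
      versionOut.set(version);
      return Collections.emptyList();
    }
  }

  static void lookup(LocalRegistry registry, long remoteVersion) {
    AtomicLong version = new AtomicLong(); // overwritten by the callee below
    List<String> methods = registry.getMethods("upper", version);
    if (version.get() != remoteVersion) {
      // Local data is stale relative to the remote registry: trigger a reload.
    }
  }
}
{code}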


> Dynamic UDFs support
> 
>
> Key: DRILL-4726
> URL: https://issues.apache.org/jira/browse/DRILL-4726
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.6.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: Future
>
>
> Allow registering UDFs without a restart of Drillbits.
> Design is described in the document below:
> https://docs.google.com/document/d/1FfyJtWae5TLuyheHCfldYUpCdeIezR2RlNsrOTYyAB4/edit?usp=sharing
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524293#comment-15524293
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80579839
  
--- Diff: exec/java-exec/src/main/resources/drill-module.conf ---
@@ -45,6 +45,8 @@ drill.client: {
   supports-complex-types: true
 }
 
+drill.home: "/tmp"
--- End diff --

Drill home is /tmp? Please explain...

From below, it seems that "drill.home" means the Drill home directory in 
DFS. (The problem is, we already use $DRILL_HOME to mean the location of the 
Drill install.)

Maybe change this to "drill.dfs.home" or "drill.dfs-home" to make it clear 
that this is for DFS.


> Dynamic UDFs support
> 
>
> Key: DRILL-4726
> URL: https://issues.apache.org/jira/browse/DRILL-4726
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.6.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: Future
>
>
> Allow registering UDFs without a restart of Drillbits.
> Design is described in the document below:
> https://docs.google.com/document/d/1FfyJtWae5TLuyheHCfldYUpCdeIezR2RlNsrOTYyAB4/edit?usp=sharing
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524280#comment-15524280
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user paul-rogers commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80534600
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/coord/zk/ZookeeperClient.java
 ---
@@ -257,14 +263,47 @@ public void put(final String path, final byte[] data, DataChangeVersion version)
       }
       if (hasNode) {
         if (version != null) {
-          curator.setData().withVersion(version.getVersion()).forPath(target, data);
+          try {
+            curator.setData().withVersion(version.getVersion()).forPath(target, data);
+          } catch (final KeeperException.BadVersionException e) {
+            throw new VersionMismatchException("Unable to put data. Version mismatch is detected.", version.getVersion(), e);
+          }
         } else {
           curator.setData().forPath(target, data);
         }
       }
       getCache().rebuildNode(target);
-    } catch (final KeeperException.BadVersionException e) {
-      throw new VersionMismatchException(e);
+    } catch (final VersionMismatchException e) {
+      throw e;
+    } catch (final Exception e) {
+      throw new DrillRuntimeException("unable to put ", e);
+    }
+  }
+
+  /**
+   * Puts the given byte sequence into the given path if the path does not exist.
+   *
+   * @param path  target path
+   * @param data  data to store
+   * @return null if the path was created, else the data stored for the given path
+   */
+  public byte[] putIfAbsent(final String path, final byte[] data) {
+    Preconditions.checkNotNull(path, "path is required");
+    Preconditions.checkNotNull(data, "data is required");
+
+    final String target = PathUtils.join(root, path);
+    try {
+      boolean hasNode = hasPath(path, true);
--- End diff --

Isn't this a race condition? What if some other client creates the node 
between the check for hasPath and the create call below? The ZK documentation 
is sparse here, but my tests show that create( ) will fail if the node exists; 
you'll get a NodeExistsException, so no need to do the hasPath check.
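
For illustration, the create-first shape suggested here, sketched with Curator (the
try/create/catch structure is the point; the wrapper around it is hypothetical):

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.zookeeper.KeeperException;

public class PutIfAbsentSketch {
  // Atomically create the node; if it already exists, return the current data.
  // There is no hasPath() pre-check, so no check-then-act race window.
  public static byte[] putIfAbsent(CuratorFramework curator, String target, byte[] data)
      throws Exception {
    try {
      curator.create().creatingParentsIfNeeded().forPath(target, data);
      return null; // we created the node
    } catch (KeeperException.NodeExistsException e) {
      return curator.getData().forPath(target); // another client won the race
    }
  }
}
{code}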


> Dynamic UDFs support
> 
>
> Key: DRILL-4726
> URL: https://issues.apache.org/jira/browse/DRILL-4726
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.6.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: Future
>
>
> Allow registering UDFs without a restart of Drillbits.
> Design is described in the document below:
> https://docs.google.com/document/d/1FfyJtWae5TLuyheHCfldYUpCdeIezR2RlNsrOTYyAB4/edit?usp=sharing
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4203) Parquet File : Date is stored wrongly

2016-09-26 Thread Rahul Challapalli (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15524235#comment-15524235
 ] 

Rahul Challapalli commented on DRILL-4203:
--

I came up with a test plan [1] based on the discussion on this jira. [~vitalii] 
& [~jaltekruse], can you take a look and provide some feedback?

[1] 
https://docs.google.com/spreadsheets/d/12Jor8I8lVFrsrWM4Am6W2LzeNloz_e1GB8XlikolbyU/edit?usp=sharing

> Parquet File : Date is stored wrongly
> -
>
> Key: DRILL-4203
> URL: https://issues.apache.org/jira/browse/DRILL-4203
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Stéphane Trou
>Assignee: Vitalii Diravka
>Priority: Critical
>
> Hello,
> I have some problems when I try to read Parquet files produced by Drill with 
> Spark: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv 
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) 
> as epoch_date from dfs.tmp.`date_parquet.csv`;
> ++-+
> |  name  | epoch_date  |
> ++-+
> | Epoch  | 1970-01-01  |
> ++-+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet`as select 
> columns[0] as name, cast(columns[1] as date) as epoch_date from 
> dfs.tmp.`date_parquet.csv`;
> +---++
> | Fragment  | Number of records written  |
> +---++
> | 0_0   | 1  |
> +---++
> {code}
> When I read the file with parquet-tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to 
> [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date], 
> epoch_date should be equal to 0.
> Meta : 
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file:file:/tmp/buggy_parquet/0_0_0.parquet 
> creator: parquet-mr version 1.8.1-drill-r0 (build 
> 6b605a4ea05b66e1a6bf843353abcb4834a4ced8) 
> extra:   drill.version = 1.4.0 
> file schema: root 
> 
> name:OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date:  OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4 
> 
> name: BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 
> ENC:RLE,BIT_PACKED,PLAIN
> epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 
> ENC:RLE,BIT_PACKED,PLAIN
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
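
For context on the value shown in the description above: 2440588 is the Julian day
number of 1970-01-01, and 2 × 2440588 = 4881176, so the corrupted output is
consistent with the epoch offset being applied twice on the write path. A quick
sanity check:

{code}
public class DateShiftCheck {
  public static void main(String[] args) {
    final int julianDayOfUnixEpoch = 2440588; // Julian day number of 1970-01-01
    System.out.println(2 * julianDayOfUnixEpoch); // prints 4881176, the value parquet-tools shows
  }
}
{code}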


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523997#comment-15523997
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/518
  
As it turns out, the sample code shown was actually tested with a stock 
Jackson JSON parser: it does work. No parser changes are needed.

The issue is not whether we can make the parser do what is needed: the code 
posted in the comment above demonstrated a solution.

The issue is how we incorporate that code into the JSON parser to clean up 
partial records and prevent schema changes. When I have time, I'll investigate 
that question in greater depth.

IMHO, without a proper fix, we should simply state that Drill does not 
support malformed JSON. If an input file might be incorrect, run it through a 
clean-up step before allowing Drill to scan it. Otherwise, we are opening the 
door to many hard-to-resolve bugs when people ask Drill to scan corrupt JSON: 
the result, without a proper fix, would be undefined, which is worse than the 
current behavior of simply failing the scan with an error.

Let's follow up again after I (or someone) has had a chance to figure out 
whether we can undo a partially built record. If we can do that, then we've got 
a path to a clean solution: recover the parser (as shown earlier) and discard 
the in-flight record (which still needs research).
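
To make the recovery idea concrete, here is a minimal standalone illustration of
resynchronizing on newlines with stock Jackson, assuming one record per line (the
prerequisite noted throughout this thread). This is not the vectorized
JSONRecordReader path, just the skip-to-newline behavior in isolation:

{code}
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

public class SkipBadJsonLines {
  public static void main(String[] args) throws IOException {
    String input = "{\"a\":1}\n{\"a\":oops}\n{\"a\":3}\n";
    ObjectMapper mapper = new ObjectMapper();
    try (BufferedReader reader = new BufferedReader(new StringReader(input))) {
      String line;
      while ((line = reader.readLine()) != null) {
        try {
          JsonNode record = mapper.readTree(line); // parse one record
          System.out.println("good: " + record);
        } catch (JsonProcessingException e) {
          // Malformed record: resume at the next newline instead of failing the scan.
          System.out.println("skipped malformed line: " + line);
        }
      }
    }
  }
}
{code}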


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: Future
>
>
> Currently a Drill query terminates upon the first encounter of an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the number of records read

2016-09-26 Thread Padma Penumarthy (JIRA)
Padma Penumarthy created DRILL-4905:
---

 Summary: Push down the LIMIT to the parquet reader scan to limit 
the number of records read
 Key: DRILL-4905
 URL: https://issues.apache.org/jira/browse/DRILL-4905
 Project: Apache Drill
  Issue Type: Bug
  Components: Storage - Parquet
Affects Versions: 1.8.0
Reporter: Padma Penumarthy
Assignee: Padma Penumarthy
 Fix For: 1.9.0


Limit the number of records read from disk by pushing down the limit to the 
parquet reader.

For queries like
select * from <table> limit N;

where N < the size of a Parquet row group, we currently read 32K/64K rows or 
the entire row group. This needs to be optimized to read only N rows.
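
The change reduces to threading the limit into the reader's fetch loop; a minimal
sketch of the intended behavior (names are illustrative, not Drill's reader API):

{code}
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class LimitPushdownSketch {
  // Stop pulling records from the row group once the pushed-down limit is
  // reached, instead of draining 32K/64K rows or the whole row group.
  static <T> List<T> scanWithLimit(Iterator<T> rowGroup, int limit) {
    List<T> out = new ArrayList<>();
    int remaining = limit;
    while (remaining > 0 && rowGroup.hasNext()) {
      out.add(rowGroup.next());
      remaining--;
    }
    return out;
  }
}
{code}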

 





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523852#comment-15523852
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
There is not much to do except change the JSON parser to support this 
functionality.

- Indicate to the parser that the current record terminates when it 
encounters a \n (of course, this then assumes that Drill also aligns to 
record separators using a newline). This is a change to make to the Jackson 
parser.

- Right now the current code works for all the standard cases except one, 
where the innermost sub-structure within a JSON record is malformed.

Given that this is a great recipe for approximation algorithms, I am 
requesting that this change be pulled in.

If need be, we can work on the change to the Jackson parser in a separate 
pull request.



> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: Future
>
>
> Currently a Drill query terminates upon the first encounter of an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4831) Running refresh table metadata concurrently randomly fails with JsonParseException

2016-09-26 Thread Aman Sinha (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15523322#comment-15523322
 ] 

Aman Sinha commented on DRILL-4831:
---

Supporting concurrent writers to the metadata cache file will require 
enhancements to the current implementation. The underlying file system (in 
this case MapR-FS, but this is not specific to MFS) provides consistency 
guarantees only at the file system level, not at the application (query) level. 
Thus, if there are multiple writers writing to the same byte offsets in the 
file, the file system will eventually produce a 'serializable' output (i.e. the 
net effect is that writes are performed in some sequential manner), but while 
the writes are happening, readers may see overlapping writes from different 
writers.

In order to address this, there are two options: (a) each writer writes to a 
unique temporary filename and renames it at the very end; the rename is 
atomic. (b) Create a lock file for the table and have the writers open the 
lock file in exclusive mode before writing to the actual file.
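
A minimal sketch of option (a) against the local filesystem API (on DFS the
equivalent would be the Hadoop FileSystem rename; the cache file name below is
illustrative):

{code}
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class AtomicCacheWriteSketch {
  // Write the new metadata cache to a unique temporary file, then atomically
  // rename it over the final name so readers never observe a partial file.
  static void writeCache(Path tableDir, byte[] cacheBytes) throws IOException {
    Path tmp = Files.createTempFile(tableDir, ".drill.parquet_metadata", ".tmp");
    Files.write(tmp, cacheBytes);
    // ATOMIC_MOVE maps to rename(2) on POSIX, which replaces the target atomically.
    Files.move(tmp, tableDir.resolve(".drill.parquet_metadata"),
        StandardCopyOption.ATOMIC_MOVE);
  }
}
{code}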

> Running refresh table metadata concurrently randomly fails with 
> JsonParseException
> --
>
> Key: DRILL-4831
> URL: https://issues.apache.org/jira/browse/DRILL-4831
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.8.0
>Reporter: Rahul Challapalli
> Attachments: error.log, l_3level.tgz
>
>
> git.commit.id.abbrev=f476eb5
> Just run the below command concurrently from 10 different JDBC connections. 
> There is a likelihood that you might encounter the below error.
> Extracts from the log
> {code}
> Caused By (java.lang.AssertionError) Internal error: Error while applying 
> rule DrillPushProjIntoScan, args 
> [rel#189411:LogicalProject.NONE.ANY([]).[](input=rel#189289:Subset#3.ENUMERABLE.ANY([]).[],l_orderkey=$1,dir0=$2,dir1=$3,dir2=$4,l_shipdate=$5,l_extendedprice=$6,l_discount=$7),
>  rel#189233:EnumerableTableScan.ENUMERABLE.ANY([]).[](table=[dfs, 
> metadata_caching_pp, l_3level])]
> org.apache.calcite.util.Util.newInternal():792
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch():251
> .
> .
>   java.lang.Thread.run():745
>   Caused By (org.apache.drill.common.exceptions.DrillRuntimeException) 
> com.fasterxml.jackson.core.JsonParseException: Illegal character ((CTRL-CHAR, 
> code 0)): only regular white space (\r, \n, \t) is allowed between tokens
>  at [Source: com.mapr.fs.MapRFsDataInputStream@57a574a8; line: 1, column: 2]
> org.apache.drill.exec.planner.logical.DrillPushProjIntoScan.onMatch():95
> {code}  
> Attached the complete log message and the data set



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)