[jira] [Updated] (DRILL-4909) Refinements to Drill web UI - Query profile page

2016-09-27 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-4909:
---
Description: 
The top of the page displays a histogram of minor fragment execution. However, 
it is hard to infer what it displays.

* Label the x-axis. The units seem to be seconds, but a legend of "Runtime 
(sec.)" would help.
* Label the y-axis. It seems to be colored by major fragment, with lines for 
minor fragments, but it took some sleuthing to figure this out.
* Add a tooltip on each color band to identify the major fragment. (Probably 
too fiddly to label minor fragment lines.)

In the tables:

* For each operator, list the number of rows processed. (Available in the 
details already.)
* In the table that summarizes major fragments, show the names of the minor 
fragments as a tooltip to give the numbers some meaning. That is, hovering over 
00-xx-xx should say "Project, Merging Receiver".
* In the table that shows minor fragments for major fragments, either add a 
list of minor fragment names to the title, or show them as a pop-up. That is, 
in the heading that says "Major Fragment: 02-xx-xx", add "(PARQUET_ROW_GROUP_SCAN, 
PROJECT, ...)".

In the Operator Profiles overview, add a tooltip with details about each 
operator such as:

* Number of vector allocations
* Number of vector extensions (increasing the size of vectors)
* Average vector utilization (ratio of selected to unselected rows)
* Average batch size: number of rows, bytes per row, bytes per batch

For scanners:

* Number of files scanned
* Number of bytes read (or file length if a table scan)
* Name of the file scanned (or first several if a group)

For filters:

* Rows in, rows out, and selectivity (as a ratio); a sketch of these 
calculations follows this section.

In the operator detail table:

* Add a line for totals (records, batches)
* Add a line for averages (most fields)
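To make the derived numbers concrete, here is a minimal sketch of the 
arithmetic (an editor's illustration with made-up counts and hypothetical 
names; not Drill web UI code):

{code}
// Hypothetical operator counters; real profile counter names differ.
long rowsIn = 1_000_000;                 // rows entering a filter
long rowsOut = 250_000;                  // rows surviving the filter
long batches = 128;
long totalBytes = 64_000_000;

double selectivity   = (double) rowsOut / rowsIn;       // 0.25
double rowsPerBatch  = (double) rowsOut / batches;      // ~1953 rows per batch
double bytesPerRow   = (double) totalBytes / rowsOut;   // 256 bytes per row
double bytesPerBatch = (double) totalBytes / batches;   // 500,000 bytes per batch
{code}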


  was:
The top of the page displays a histogram of minor fragment execution. However, 
it is hard to infer what it displays.

* Label the x-axis. The units seem to be seconds, but a legend of "Runtime 
(sec.)" would help.
* Label the y-axis. It seems to be colored by major fragment, with lines for 
minor fragments, but it took some sleuthing to figure this out.
* Add a tooltip on each color band to identify the major fragment. (Probably 
too fiddly to label minor fragment lines.)

In the tables:

* For each operator, list the number of rows processed.
* In the table that summarizes major fragments, show the names of the minor 
fragments as a tooltip to give the numbers some meaning. That is, hovering over 
00-xx-xx should say "Project, Merging Receiver".
* In the table that shows minor fragments for major fragments, either add a 
list of minor fragment names to the title, or show them as a pop-up. That is, 
in the heading that says "Major Fragment: 02-xx-xx", add "(PARQUET_ROW_GROUP_SCAN, 
PROJECT, ...)".

In the Operator Profiles overview, add a tooltip with details about each 
operator such as:

* Number of vector allocations
* Number of vector extensions (increasing the size of vectors)
* Average vector utilization (ratio of selected to unselected rows)
* Average batch size: number of rows, bytes per row, bytes per batch

For scanners:

* Number of files scanned
* Number of bytes read (or file length if a table scan)

For filters:

* Rows in, rows out and selectivity (as a ratio)



> Refinements to Drill web UI - Query profile page
> 
>
> Key: DRILL-4909
> URL: https://issues.apache.org/jira/browse/DRILL-4909
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Web Server
>Affects Versions: 1.8.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The top of the page displays a histogram of minor fragment execution. However, 
> it is hard to infer what it displays.
> * Label the x-axis. The units seem to be seconds, but a legend of "Runtime 
> (sec.)" would help.
> * Label the y-axis. It seems to be colored by major fragment, with lines for 
> minor fragments, but it took some sleuthing to figure this out.
> * Add a tooltip on each color band to identify the major fragment. (Probably 
> too fiddly to label minor fragment lines.)
> In the tables:
> * For each operator, list the number of rows processed. (Available in the 
> details already.)
> * In the table that summarizes major fragments, show the names of the minor 
> fragments as a tooltip to give the numbers some meaning. That is, hovering 
> over 00-xx-xx should say "Project, Merging Receiver".
> * In the table that shows minor fragments for major fragments, either add a 
> list of minor fragment names to the title, or show them as a pop-up. That is, 
> in the heading that says "Major Fragment: 02-xx-xx", add "(PARQUET_ROW_GROUP_SCAN, 
> PROJECT, ...)".
> In the Operator Profiles overview, add a tooltip with details about each 
> operator such as:
> * Number of vector allocations
> * Number of vector 

[jira] [Updated] (DRILL-4909) Refinements to Drill web UI - Query profile page

2016-09-27 Thread Paul Rogers (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-4909:
---
Description: 
The top of the page displays a histogram of minor fragment execution. However, 
it is hard to infer what it displays.

* Label the x-axis. The units seem to be seconds, but a legend of "Runtime 
(sec.)" would help.
* Label the y-axis. It seems to be colored by major fragment, with lines for 
minor fragments, but it took some sleuthing to figure this out.
* Add a tooltip on each color band to identify the major fragment. (Probably 
too fiddly to label minor fragment lines.)

In the tables:

* For each operator, list the number of rows processed.
* In the table that summarizes major fragments, show the names of the minor 
fragments as a tooltip to give the numbers some meaning. That is, hovering over 
00-xx-xx should say "Project, Merging Receiver".
* In the table that shows minor fragments for major fragments, either add a 
list of minor fragment names to the title, or show them as a pop-up. That is, 
in the heading that says "Major Fragment: 02-xx-xx", add "(PARQUET_ROW_GROUP_SCAN, 
PROJECT, ...)".

In the Operator Profiles overview, add a tooltip with details about each 
operator such as:

* Number of vector allocations
* Number of vector extensions (increasing the size of vectors)
* Average vector utilization (ratio of selected to unselected rows)
* Average batch size: number of rows, bytes per row, bytes per batch

  was:
The top of the page displays a histogram of minor fragment execution. However, 
it is hard to infer what it displays.

* Label the x-axis. The units seem to be seconds, but a legend of "Runtime 
(sec.)" would help.
* Label the y-axis. It seems to be colored by major fragment, with lines for 
minor fragments, but it took some sleuthing to figure this out.
* Add a tooltip on each color band to identify the major fragment. (Probably 
too fiddly to label minor fragment lines.)

In the tables:

* For each operator, list the number of rows processed.


> Refinements to Drill web UI - Query profile page
> 
>
> Key: DRILL-4909
> URL: https://issues.apache.org/jira/browse/DRILL-4909
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Web Server
>Affects Versions: 1.8.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The top of the page displays a histogram of minor fragment execution. However, 
> it is hard to infer what it displays.
> * Label the x-axis. The units seem to be seconds, but a legend of "Runtime 
> (sec.)" would help.
> * Label the y-axis. It seems to be colored by major fragment, with lines for 
> minor fragments, but it took some sleuthing to figure this out.
> * Add a tooltip on each color band to identify the major fragment. (Probably 
> too fiddly to label minor fragment lines.)
> In the tables:
> * For each operator, list the number of rows processed.
> * In the table that summarizes major fragments, show the names of the minor 
> fragments as a tooltip to give the numbers some meaning. That is, hovering 
> over 00-xx-xx should say "Project, Merging Receiver".
> * In the table that shows minor fragments for major fragments, either add a 
> list of minor fragment names to the title, or show them as a pop-up. That is, 
> in the heading that says "Major Fragment: 02-xx-xx", add "(PARQUET_ROW_GROUP_SCAN, 
> PROJECT, ...)".
> In the Operator Profiles overview, add a tooltip with details about each 
> operator such as:
> * Number of vector allocations
> * Number of vector extensions (increasing the size of vectors)
> * Average vector utilization (ratio of selected to unselected rows)
> * Average batch size: number of rows, bytes per row, bytes per batch





[jira] [Commented] (DRILL-4203) Parquet File : Date is stored wrongly

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4203?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527917#comment-15527917
 ] 

ASF GitHub Bot commented on DRILL-4203:
---

Github user parthchandra commented on the issue:

https://github.com/apache/drill/pull/595
  
@vdiravka can you repost this PR with the commits split into Jason's 
original work and then with your fixes on top? We should give credit where it 
is due. 


> Parquet File : Date is stored wrongly
> -
>
> Key: DRILL-4203
> URL: https://issues.apache.org/jira/browse/DRILL-4203
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.4.0
>Reporter: Stéphane Trou
>Assignee: Vitalii Diravka
>Priority: Critical
>
> Hello,
> I have some problems when I try to read parquet files produced by Drill with 
> Spark: all dates are corrupted.
> I think the problem comes from Drill :)
> {code}
> cat /tmp/date_parquet.csv 
> Epoch,1970-01-01
> {code}
> {code}
> 0: jdbc:drill:zk=local> select columns[0] as name, cast(columns[1] as date) 
> as epoch_date from dfs.tmp.`date_parquet.csv`;
> ++-+
> |  name  | epoch_date  |
> ++-+
> | Epoch  | 1970-01-01  |
> ++-+
> {code}
> {code}
> 0: jdbc:drill:zk=local> create table dfs.tmp.`buggy_parquet`as select 
> columns[0] as name, cast(columns[1] as date) as epoch_date from 
> dfs.tmp.`date_parquet.csv`;
> +---++
> | Fragment  | Number of records written  |
> +---++
> | 0_0   | 1  |
> +---++
> {code}
> When I read the file with parquet-tools, I found:
> {code}
> java -jar parquet-tools-1.8.1.jar head /tmp/buggy_parquet/
> name = Epoch
> epoch_date = 4881176
> {code}
> According to 
> [https://github.com/Parquet/parquet-format/blob/master/LogicalTypes.md#date], 
> epoch_date should be equal to 0.
> Meta : 
> {code}
> java -jar parquet-tools-1.8.1.jar meta /tmp/buggy_parquet/
> file:file:/tmp/buggy_parquet/0_0_0.parquet 
> creator: parquet-mr version 1.8.1-drill-r0 (build 
> 6b605a4ea05b66e1a6bf843353abcb4834a4ced8) 
> extra:   drill.version = 1.4.0 
> file schema: root 
> 
> name:OPTIONAL BINARY O:UTF8 R:0 D:1
> epoch_date:  OPTIONAL INT32 O:DATE R:0 D:1
> row group 1: RC:1 TS:93 OFFSET:4 
> 
> name: BINARY SNAPPY DO:0 FPO:4 SZ:52/50/0,96 VC:1 
> ENC:RLE,BIT_PACKED,PLAIN
> epoch_date:   INT32 SNAPPY DO:0 FPO:56 SZ:45/43/0,96 VC:1 
> ENC:RLE,BIT_PACKED,PLAIN
> {code}
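An editor's note on the corrupted value, not part of the original report: 
4881176 is exactly twice 2440588, the Julian day number of the Unix epoch 
(1970-01-01), which is consistent with a Julian-day offset being applied one 
time too many when the date was written. A quick check of the arithmetic:

{code}
// Editor's illustration; not Drill source code.
public class DateOffsetCheck {
  public static void main(String[] args) {
    int julianDayOfUnixEpoch = 2440588;  // Julian day number of 1970-01-01
    int daysSinceEpoch = 0;              // correct INT32 DATE value for 1970-01-01
    // A doubled Julian-day offset reproduces the corrupted value:
    System.out.println(daysSinceEpoch + 2 * julianDayOfUnixEpoch);  // prints 4881176
  }
}
{code}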





[jira] [Created] (DRILL-4909) Refinements to Drill web UI - Query profile page

2016-09-27 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-4909:
--

 Summary: Refinements to Drill web UI - Query profile page
 Key: DRILL-4909
 URL: https://issues.apache.org/jira/browse/DRILL-4909
 Project: Apache Drill
  Issue Type: Improvement
  Components: Web Server
Affects Versions: 1.8.0
Reporter: Paul Rogers
Priority: Minor


The top of the page displays a histogram of minor fragment execution. However, 
it is hard to infer what it displays.

* Label the x-axis. The units seem to be seconds, but a legend of "Runtime 
(sec.)" would help.
* Label the y-axis. It seems to be colored by major fragment, with lines for 
minor fragments, but it took some sleuthing to figure this out.
* Add a tooltip on each color band to identify the major fragment. (Probably 
too fiddly to label minor fragment lines.)

In the tables:

* For each operator, list the number of rows processed.





[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527763#comment-15527763
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80817027
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -115,6 +115,8 @@
   private List<RowGroupInfo> rowGroupInfos;
   private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
   private String cacheFileRoot = null;
+  private int batchSize;
+  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
--- End diff --

Max value for store.parquet.record_batch_size is 256K, so it cannot be set 
to 512K. I changed the name in ParquetGroupScan/ParquetRowGroupScan to 
recommendedBatchSize as we discussed. Please review the new diffs.


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from <table> limit N;
> where N < the size of the Parquet row group, we are reading 32K/64K rows or the 
> entire row group. This needs to be optimized to read only N rows.
>  





[jira] [Updated] (DRILL-4831) Running refresh table metadata concurrently randomly fails with JsonParseException

2016-09-27 Thread Zelaine Fong (JIRA)

 [ 
https://issues.apache.org/jira/browse/DRILL-4831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zelaine Fong updated DRILL-4831:

Assignee: Padma Penumarthy

> Running refresh table metadata concurrently randomly fails with 
> JsonParseException
> --
>
> Key: DRILL-4831
> URL: https://issues.apache.org/jira/browse/DRILL-4831
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.8.0
>Reporter: Rahul Challapalli
>Assignee: Padma Penumarthy
> Attachments: error.log, l_3level.tgz
>
>
> git.commit.id.abbrev=f476eb5
> Just run the below command concurrently from 10 different JDBC connections; 
> you might encounter the below error.
> Extracts from the log
> {code}
> Caused By (java.lang.AssertionError) Internal error: Error while applying 
> rule DrillPushProjIntoScan, args 
> [rel#189411:LogicalProject.NONE.ANY([]).[](input=rel#189289:Subset#3.ENUMERABLE.ANY([]).[],l_orderkey=$1,dir0=$2,dir1=$3,dir2=$4,l_shipdate=$5,l_extendedprice=$6,l_discount=$7),
>  rel#189233:EnumerableTableScan.ENUMERABLE.ANY([]).[](table=[dfs, 
> metadata_caching_pp, l_3level])]
> org.apache.calcite.util.Util.newInternal():792
> org.apache.calcite.plan.volcano.VolcanoRuleCall.onMatch():251
> .
> .
>   java.lang.Thread.run():745
>   Caused By (org.apache.drill.common.exceptions.DrillRuntimeException) 
> com.fasterxml.jackson.core.JsonParseException: Illegal character ((CTRL-CHAR, 
> code 0)): only regular white space (\r, \n, \t) is allowed between tokens
>  at [Source: com.mapr.fs.MapRFsDataInputStream@57a574a8; line: 1, column: 2]
> org.apache.drill.exec.planner.logical.DrillPushProjIntoScan.onMatch():95
> {code}  
> Attached the complete log message and the data set





[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527425#comment-15527425
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80796011
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -115,6 +115,8 @@
   private List<RowGroupInfo> rowGroupInfos;
   private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
   private String cacheFileRoot = null;
+  private int batchSize;
+  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
--- End diff --

Are you referring to the code here:
{code}
// Pick the minimum of recordsPerBatch calculated above, batchSize we got from
// rowGroupScan (based on limit) and user configured batchSize value.
recordsPerBatch = (int) Math.min(Math.min(recordsPerBatch, batchSize),
    fragmentContext.getOptions().getOption(ExecConstants.PARQUET_RECORD_BATCH_SIZE).num_val.intValue());
{code}

If I understand correctly, batchSize in ParquetRecordReader comes from 
ParquetRowGroupScan, which comes from ParquetGroupScan, which is set to 
DEFAULT_BATCH_LENGTH. If I have a RG with 512K rows, and I set 
"store.parquet.record_batch_size" to be 512K, will your code honor this 512K 
batch size, or will it use DEFAULT_BATCH_LENGTH since it's smallest?

Also, if "store.parquet.record_batch_size" is set to be different from 
DEFAULT_BATCH_LENGTH, why would we still use DEFAULT_BATCH_LENGTH in 
ParquetGroupScan / ParquetRowGroupScan? People might be confused if they look 
at the serialized physical plan, which shows "batchSize = DEFAULT_BATCH_LENGTH".
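To make the scenario concrete, a minimal sketch of the clamping arithmetic 
under the values discussed (an editor's illustration; variable names are 
simplified from the diff, and the 256K cap on the option mentioned elsewhere 
in this thread is deliberately ignored):

{code}
// Editor's sketch; not code from the PR.
int recordsPerBatch = 512 * 1024;  // rows available in the row group
int batchSize = 256 * 1024;        // DEFAULT_BATCH_LENGTH arriving via ParquetRowGroupScan
int optionValue = 512 * 1024;      // hypothetical store.parquet.record_batch_size setting
int effective = Math.min(Math.min(recordsPerBatch, batchSize), optionValue);
// effective == 262144: DEFAULT_BATCH_LENGTH wins and the 512K setting is not honored.
{code}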



> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from <table> limit N;
> where N < the size of the Parquet row group, we are reading 32K/64K rows or the 
> entire row group. This needs to be optimized to read only N rows.
>  





[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527347#comment-15527347
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80788617
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
     return newScan;
   }
 
+  // clone to create new groupscan with new file selection and batchSize.
+  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
+    ParquetGroupScan newScan = new ParquetGroupScan(this);
+    newScan.modifyFileSelection(selection);
+    newScan.cacheFileRoot = selection.cacheFileRoot;
+    newScan.init(selection.getMetaContext());
+    newScan.batchSize = batchSize;
--- End diff --

done


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from <table> limit N;
> where N < the size of the Parquet row group, we are reading 32K/64K rows or the 
> entire row group. This needs to be optimized to read only N rows.
>  





[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527346#comment-15527346
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80788614
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
     return newScan;
   }
 
+  // clone to create new groupscan with new file selection and batchSize.
+  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
+    ParquetGroupScan newScan = new ParquetGroupScan(this);
+    newScan.modifyFileSelection(selection);
+    newScan.cacheFileRoot = selection.cacheFileRoot;
--- End diff --

done


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from <table> limit N;
> where N < the size of the Parquet row group, we are reading 32K/64K rows or the 
> entire row group. This needs to be optimized to read only N rows.
>  





[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527343#comment-15527343
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80788478
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
 ---
@@ -107,7 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
       if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val
           && !isComplex(footers.get(e.getPath()))) {
         readers.add(
             new ParquetRecordReader(
-                context, e.getPath(), e.getRowGroupIndex(), fs,
+                context, rowGroupScan.getBatchSize(), e.getPath(), e.getRowGroupIndex(), fs,
--- End diff --

done.


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from <table> limit N;
> where N < the size of the Parquet row group, we are reading 32K/64K rows or the 
> entire row group. This needs to be optimized to read only N rows.
>  





[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527340#comment-15527340
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on the issue:

https://github.com/apache/drill/pull/597
  
updated with review comments.


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from <table> limit N;
> where N < the size of the Parquet row group, we are reading 32K/64K rows or the 
> entire row group. This needs to be optimized to read only N rows.
>  





[jira] [Commented] (DRILL-3423) Add New HTTPD format plugin

2016-09-27 Thread Parth Chandra (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-3423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527296#comment-15527296
 ] 

Parth Chandra commented on DRILL-3423:
--

[~cgivre] Since it looks like you're working on this, can I suggest that when 
you put together your PR, you include Jacques' original commit from his branch 
as well as Jim's changes on top of that, so that all parties get due credit?
Also, Jacques' original branch has a ComplexWriterFacade class which might be 
useful in writing the complex fields. 
If you could post a link to your branch, we can assist more.

> Add New HTTPD format plugin
> ---
>
> Key: DRILL-3423
> URL: https://issues.apache.org/jira/browse/DRILL-3423
> Project: Apache Drill
>  Issue Type: New Feature
>  Components: Storage - Other
>Reporter: Jacques Nadeau
>Assignee: Jim Scott
> Fix For: Future
>
>
> Add an HTTPD logparser based format plugin. The author has been kind enough 
> to move the logparser project to be released under the Apache License. You can 
> find it here:
> {code}
> <dependency>
>   <groupId>nl.basjes.parse.httpdlog</groupId>
>   <artifactId>httpdlog-parser</artifactId>
>   <version>2.0</version>
> </dependency>
> {code}
>  





[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-27 Thread Padma Penumarthy (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15527288#comment-15527288
 ] 

Padma Penumarthy commented on DRILL-4905:
-

Doing this fix for native parquet reader only.

> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from <table> limit N;
> where N < the size of the Parquet row group, we are reading 32K/64K rows or the 
> entire row group. This needs to be optimized to read only N rows.
>  





[jira] [Created] (DRILL-4908) Unable to setup Sybase JDBC Plugin with access to multiple databases

2016-09-27 Thread David Lee (JIRA)
David Lee created DRILL-4908:


 Summary: Unable to setup Sybase JDBC Plugin with access to 
multiple databases
 Key: DRILL-4908
 URL: https://issues.apache.org/jira/browse/DRILL-4908
 Project: Apache Drill
  Issue Type: Improvement
  Components: SQL Parser
Affects Versions: 1.8.0
 Environment: linux, sybase ase, sybase iq, windows
Reporter: David Lee
 Fix For: Future


This may also be a problem with Microsoft SQL Server, which uses the same SQL 
syntax.

I am unable to set up a single JDBC plugin which allows me to query tables on 
different databases on the server.

I can set up multiple JDBC plugins for each database on the server and join 
data across multiple JDBC connections, but this is extremely inefficient and 
SQL queries just hang.

Test Case: Create two tables on two different databases and write a single SQL 
statement to join them together. Try to replicate the results in Apache Drill.

A. Temp tables in Sybase:

use tempdb
go

create table phone_book
(
first_name varchar(10),
last_name varchar(20),
phone_number varchar(12)
)
go

insert phone_book values ('Bob','Marley','555-555-')
insert phone_book values ('Mary','Jane','111-111-')
insert phone_book values ('Bat','Man','911-911-')
go


use tempdb_adhoc
go

create table cities
(
first_name varchar(10),
last_name varchar(20),
city varchar(20)
)
go

insert cities values ('Bob','Marley','San Francisco')
insert cities values ('Mary','Jane','New York')
insert cities values ('Bat','Man','Gotham')
go


select a.first_name, a.last_name, a.phone_number, b.city
from tempdb.guest.phone_book a
join tempdb_adhoc.guest.cities b
on b.first_name = a.first_name
and b.last_name = a.last_name
go

Returns back in Sybase ISQL:

 first_name  last_name  phone_number  city
 ----------  ---------  ------------  -------------
 Bob         Marley     555-555-      San Francisco
 Mary        Jane       111-111-      New York
 Bat         Man        911-911-      Gotham

B. Drill JDBC Plugin Setups:

DEV:

{
  "type": "jdbc",
  "driver": "com.sybase.jdbc4.jdbc.SybDriver",
  "url": "jdbc:sybase:Tds:my_server:4100",
  "username": "my_login",
  "password": "my_password",
  "enabled": true
}


DEV_TEMPDB:

{
  "type": "jdbc",
  "driver": "com.sybase.jdbc4.jdbc.SybDriver",
  "url": "jdbc:sybase:Tds:my_server:4100/tempdb",
  "username": "my_login",
  "password": "my_password",
  "enabled": true
}


DEV_TEMPDB_ADHOC:

{
  "type": "jdbc",
  "driver": "com.sybase.jdbc4.jdbc.SybDriver",
  "url": "jdbc:sybase:Tds:my_server:4100/tempdb_adhoc",
  "username": "my_login",
  "password": "my_password",
  "enabled": true
}

C. Examples of Drill Statements which work and don't work.

1. Returns back redundant schemas for each JDBC plugin:

0: jdbc:drill:zk=local> show schemas;

+--+
| SCHEMA_NAME  |
+--+
| DEV.tempdb   |
| DEV.tempdb_adhoc |
| DEV_TEMPDB.tempdb|
| DEV_TEMPDB.tempdb_adhoc  |
| DEV_TEMPDB_ADHOC.tempdb  |
| DEV_TEMPDB_ADHOC.tempdb_adhoc|
+--+

2. SQL selects work within schemas and joins across schemas:

0: jdbc:drill:zk=local> select * from DEV_TEMPDB.tempdb.guest.phone_book;
+-++---+
| first_name  | last_name  | phone_number  |
+-++---+
| Bob | Marley | 555-555-  |
| Mary| Jane   | 111-111-  |
| Bat | Man| 911-911-  |
+-++---+
3 rows selected (1.585 seconds)

0: jdbc:drill:zk=local> select * from 
DEV_TEMPDB_ADHOC.tempdb_adhoc.guest.cities;
;
+-+++
| first_name  | last_name  |  city  |
+-+++
| Bob | Marley | San Francisco  |
| Mary| Jane   | New York   |
| Bat | Man| Gotham |
+-+++
3 rows selected (1.173 seconds)

0: jdbc:drill:zk=local> select a.first_name, a.last_name, a.phone_number, b.city
. . . . . . . . . . . > from DEV_TEMPDB.tempdb.guest.phone_book a
. . . . . . . . . . . > join DEV_TEMPDB_ADHOC.tempdb_adhoc.guest.cities b
. . . . . . . . . . . > on b.first_name = a.first_name
. . . . . . . . . . . > and b.last_name = a.last_name;
+-++---++
| first_name  | last_name  | phone_number  |  city  |
+-++---++
| Bob | Marley | 555-555-  | San Francisco  |
| Mary| Jane   | 111-111-  | New York   |
| Bat | Man| 911-911-  | Gotham |

[jira] [Created] (DRILL-4907) Wrong default value mentioned in documentation for "planner.width.max_per_node"

2016-09-27 Thread Rahul Challapalli (JIRA)
Rahul Challapalli created DRILL-4907:


 Summary: Wrong default value mentioned in documentation for 
"planner.width.max_per_node"
 Key: DRILL-4907
 URL: https://issues.apache.org/jira/browse/DRILL-4907
 Project: Apache Drill
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.8.0
Reporter: Rahul Challapalli
Priority: Minor


From the documentation of config options at [1], the default value for 
"planner.width.max_per_node" is mentioned as 3. This should be updated to 70% 
of the total processors on a node.

[1] https://drill.apache.org/docs/configuration-options-introduction/
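An editor's illustration of the corrected default (the exact rounding Drill 
applies is not stated in this report, so simple truncation is assumed):

{code}
// Hypothetical calculation; not Drill source code.
int cores = 24;                              // processors on the node
int maxWidthPerNode = (int) (cores * 0.70);  // 16 minor fragments per node
{code}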





[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526939#comment-15526939
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on the issue:

https://github.com/apache/drill/pull/597
  
comment


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from <table> limit N;
> where N < the size of the Parquet row group, we are reading 32K/64K rows or the 
> entire row group. This needs to be optimized to read only N rows.
>  





[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526935#comment-15526935
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user jinfengni commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80754698
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
 ---
@@ -107,7 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
       if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val
           && !isComplex(footers.get(e.getPath()))) {
         readers.add(
             new ParquetRecordReader(
-                context, e.getPath(), e.getRowGroupIndex(), fs,
+                context, rowGroupScan.getBatchSize(), e.getPath(), e.getRowGroupIndex(), fs,
--- End diff --

If it's only for one type of parquet reader, please document it in the 
JIRA, so that people will know this. 


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from <table> limit N;
> where N < the size of the Parquet row group, we are reading 32K/64K rows or the 
> entire row group. This needs to be optimized to read only N rows.
>  





[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526924#comment-15526924
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80754060
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -899,6 +907,16 @@ public FileGroupScan clone(FileSelection selection) throws IOException {
     return newScan;
   }
 
+  // clone to create new groupscan with new file selection and batchSize.
+  public ParquetGroupScan clone(FileSelection selection, int batchSize) throws IOException {
+    ParquetGroupScan newScan = new ParquetGroupScan(this);
+    newScan.modifyFileSelection(selection);
+    newScan.cacheFileRoot = selection.cacheFileRoot;
--- End diff --

yes, we can.  will do.


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from <table> limit N;
> where N < the size of the Parquet row group, we are reading 32K/64K rows or the 
> entire row group. This needs to be optimized to read only N rows.
>  





[jira] [Commented] (DRILL-4387) Improve execution side when it handles skipAll query

2016-09-27 Thread Jinfeng Ni (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526884#comment-15526884
 ] 

Jinfeng Ni commented on DRILL-4387:
---

[~khfaraaz], this incorrect query result is a related but separate 
issue. Could you please try on 1.5.0, before DRILL-4387 was merged, and see if it 
shows the same behavior? You may open a different JIRA to track this incorrect 
result issue. Thanks.


> Improve execution side when it handles skipAll query
> 
>
> Key: DRILL-4387
> URL: https://issues.apache.org/jira/browse/DRILL-4387
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Jinfeng Ni
>Assignee: Jinfeng Ni
> Fix For: 1.6.0
>
>
> DRILL-4279 changed the planner side and the RecordReader in the execution 
> side where they handle a skipAll query. However, it seems there are other 
> places in the codebase that do not handle a skipAll query efficiently. In 
> particular, in GroupScan or ScanBatchCreator, we will replace a NULL or empty 
> column list with the star column. This essentially will force the execution side 
> (RecordReader) to fetch all the columns from the data source. Such behavior will 
> lead to a big performance overhead for the SCAN operator.
> To improve Drill's performance, we should change those places as well, as 
> follow-up work after DRILL-4279.
> One simple example of this problem is:
> {code}
>SELECT DISTINCT substring(dir1, 5) from  dfs.`/Path/To/ParquetTable`;  
> {code}
> The query does not require any regular column from the parquet file. However, 
> ParquetRowGroupScan and ParquetScanBatchCreator will put the star column in the 
> column list. In case the table has dozens or hundreds of columns, this will make 
> the SCAN operator much more expensive than necessary. 





[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526883#comment-15526883
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80751210
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetScanBatchCreator.java
 ---
@@ -107,7 +107,7 @@ public ScanBatch getBatch(FragmentContext context, ParquetRowGroupScan rowGroupS
       if (!context.getOptions().getOption(ExecConstants.PARQUET_NEW_RECORD_READER).bool_val
           && !isComplex(footers.get(e.getPath()))) {
         readers.add(
             new ParquetRecordReader(
-                context, e.getPath(), e.getRowGroupIndex(), fs,
+                context, rowGroupScan.getBatchSize(), e.getPath(), e.getRowGroupIndex(), fs,
--- End diff --

This fix is done only for native parquet reader. 


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from <table> limit N;
> where N < the size of the Parquet row group, we are reading 32K/64K rows or the 
> entire row group. This needs to be optimized to read only N rows.
>  





[jira] [Commented] (DRILL-4905) Push down the LIMIT to the parquet reader scan to limit the numbers of records read

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4905?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526881#comment-15526881
 ] 

ASF GitHub Bot commented on DRILL-4905:
---

Github user ppadma commented on a diff in the pull request:

https://github.com/apache/drill/pull/597#discussion_r80751189
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/store/parquet/ParquetGroupScan.java
 ---
@@ -115,6 +115,8 @@
   private List<RowGroupInfo> rowGroupInfos;
   private Metadata.ParquetTableMetadataBase parquetTableMetadata = null;
   private String cacheFileRoot = null;
+  private int batchSize;
+  private static final int DEFAULT_BATCH_LENGTH = 256 * 1024;
--- End diff --

The default batch length is used here to compare with the limit value and 
decide if we want to create a new group scan. The new option has nothing to do 
with this. For the normal case, we do not touch/use the option. It is added so 
we can use it if we want to change it at run time for any reason. 


> Push down the LIMIT to the parquet reader scan to limit the numbers of 
> records read
> ---
>
> Key: DRILL-4905
> URL: https://issues.apache.org/jira/browse/DRILL-4905
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Parquet
>Affects Versions: 1.8.0
>Reporter: Padma Penumarthy
>Assignee: Padma Penumarthy
> Fix For: 1.9.0
>
>
> Limit the number of records read from disk by pushing down the limit to 
> parquet reader.
> For queries like
> select * from <table> limit N;
> where N < the size of the Parquet row group, we are reading 32K/64K rows or the 
> entire row group. This needs to be optimized to read only N rows.
>  





[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526675#comment-15526675
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80648171
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
 ---
@@ -301,29 +323,120 @@ private ScanResult scan(ClassLoader classLoader, Path path, URL[] urls) throws I
         return RunTimeScan.dynamicPackageScan(drillConfig, Sets.newHashSet(urls));
       }
     }
-    throw new FunctionValidationException(String.format("Marker file %s is missing in %s.",
+    throw new JarValidationException(String.format("Marker file %s is missing in %s",
         CommonConstants.DRILL_JAR_MARKER_FILE_RESOURCE_PATHNAME, path.getName()));
   }
 
-  private static String getUdfDir() {
-    return Preconditions.checkNotNull(System.getenv("DRILL_UDF_DIR"), "DRILL_UDF_DIR variable is not set");
+  /**
+   * Return list of jars that are missing in local function registry
+   * but present in remote function registry.
+   *
+   * @param remoteFunctionRegistry remote function registry
+   * @param localFunctionRegistry local function registry
+   * @return list of missing jars
+   */
+  private List<String> getMissingJars(RemoteFunctionRegistry remoteFunctionRegistry,
+                                      LocalFunctionRegistry localFunctionRegistry) {
+    List<Jar> remoteJars = remoteFunctionRegistry.getRegistry().getJarList();
+    List<String> localJars = localFunctionRegistry.getAllJarNames();
+    List<String> missingJars = Lists.newArrayList();
+    for (Jar jar : remoteJars) {
+      if (!localJars.contains(jar.getName())) {
+        missingJars.add(jar.getName());
+      }
+    }
+    return missingJars;
+  }
+
+  /**
+   * Creates local udf directory, if it doesn't exist.
+   * Checks that it is a directory and that the current application has write rights on it.
+   * Attempts to clean up the local udf directory in case jars were left after a previous drillbit run.
+   *
+   * @return path to local udf directory
+   */
+  private Path getLocalUdfDir() {
+    String confDir = getConfDir();
+    File udfDir = new File(confDir, "udf");
+    String udfPath = udfDir.getPath();
+    udfDir.mkdirs();
+    Preconditions.checkState(udfDir.exists(), "Local udf directory [%s] must exist", udfPath);
+    Preconditions.checkState(udfDir.isDirectory(), "Local udf directory [%s] must be a directory", udfPath);
+    Preconditions.checkState(udfDir.canWrite(), "Local udf directory [%s] must be writable for application user", udfPath);
+    try {
+      FileUtils.cleanDirectory(udfDir);
+    } catch (IOException e) {
+      throw new DrillRuntimeException("Error during local udf directory clean up", e);
+    }
+    return new Path(udfDir.toURI());
+  }
+
+  /**
+   * First tries to get the drill conf directory value from system properties;
+   * if the value is missing, checks environment variables.
+   * Throws an exception if the value is null.
+   * @return drill conf dir path
+   */
+  private String getConfDir() {
+    String drillConfDir = "DRILL_CONF_DIR";
+    String value = System.getProperty(drillConfDir);
+    if (value == null) {
+      value = Preconditions.checkNotNull(System.getenv(drillConfDir), "%s variable is not set", drillConfDir);
+    }
+    return value;
+  }
+
+  /**
+   * Copies jar from remote udf area to local udf area with a numeric suffix,
+   * in order to achieve uniqueness for each locally copied jar.
+   * Ex: DrillUDF-1.0.jar -> DrillUDF-1.0_12200255588.jar
+   *
+   * @param jarName jar name to be copied
+   * @param remoteFunctionRegistry remote function registry
+   * @return local path to jar that was copied
+   * @throws IOException in case of problems during the jar copying process
+   */
+  private Path copyJarToLocal(String jarName, RemoteFunctionRegistry remoteFunctionRegistry) throws IOException {
+    String generatedName = String.format(generated_jar_name_pattern,
+        Files.getNameWithoutExtension(jarName), System.nanoTime(), Files.getFileExtension(jarName));
+    Path registryArea = remoteFunctionRegistry.getRegistryArea();
+    FileSystem fs = remoteFunctionRegistry.getFs();
+    Path remoteJar = new Path(registryArea, jarName);
+    Path localJar = new Path(localUdfDir, generatedName);
+    try {
+      fs.copyToLocalFile(remoteJar, localJar);
+    } catch (IOException e) {
+      String message = String.format("Error during jar [%s] copying 

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526686#comment-15526686
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80648485
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/registry/FunctionRegistryHolder.java
 ---
@@ -0,0 +1,360 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.fn.registry;
+
+import com.google.common.collect.ArrayListMultimap;
+import com.google.common.collect.ListMultimap;
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.collect.Queues;
+import org.apache.drill.common.concurrent.AutoCloseableLock;
+import org.apache.drill.exec.expr.fn.DrillFuncHolder;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Queue;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.ReadWriteLock;
+import java.util.concurrent.locks.ReentrantReadWriteLock;
+
+/**
+ * Function registry holder stores function implementations by jar name and function name.
+ * Contains two maps that hold data by jars and functions respectively.
+ * The jars map contains each jar as a key and a map of all its functions with a collection of function signatures as value.
+ * The functions map contains function name as key and a map of its signatures and function holders as value.
+ * All maps and collections used are concurrent to guarantee memory consistency effects.
+ * Such a structure is chosen to achieve maximum speed while retrieving data by jar or by function name,
+ * since we expect infrequent registry changes.
+ * The holder is designed to allow concurrent reads and single writes to keep data consistent.
+ * This is achieved by use of a {@link ReadWriteLock} implementation.
+ * The holder has a version number which changes every time new jars are added or removed. The initial version number is 0.
+ * The version is also used when a user needs data from the registry along with the version it is based on.
+ *
+ * Structure example:
+ *
+ * JARS
+ * built-in   -> upper  -> upper(VARCHAR-REQUIRED)
+ *-> lower  -> lower(VARCHAR-REQUIRED)
+ *
+ * First.jar  -> upper  -> upper(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * Second.jar -> lower  -> lower(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * FUNCTIONS
+ * upper-> upper(VARCHAR-REQUIRED)-> function holder for 
upper(VARCHAR-REQUIRED)
+ *  -> upper(VARCHAR-OPTIONAL)-> function holder for 
upper(VARCHAR-OPTIONAL)
+ *
+ * lower-> lower(VARCHAR-REQUIRED)-> function holder for 
lower(VARCHAR-REQUIRED)
+ *  -> lower(VARCHAR-OPTIONAL)-> function holder for 
lower(VARCHAR-OPTIONAL)
+ *
+ * custom_upper -> custom_upper(VARCHAR-REQUIRED) -> function holder for 
custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL) -> function holder for 
custom_upper(VARCHAR-OPTIONAL)
+ *
+ * custom_lower -> custom_lower(VARCHAR-REQUIRED) -> function holder for 
custom_lower(VARCHAR-REQUIRED)
+ *  -> custom_lower(VARCHAR-OPTIONAL) -> function holder for 
custom_lower(VARCHAR-OPTIONAL)
+ *
+ * where
+ * First.jar is jar name represented by String
+ * upper is function name represented by String
+ * upper(VARCHAR-REQUIRED) is the signature name represented by String, which consists of the function name and the list of input parameters
+ * function holder for upper(VARCHAR-REQUIRED) is {@link 
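The javadoc above describes reads under a shared lock, a single writer under an 
exclusive lock, and a version bumped on every change. A minimal sketch of that 
pattern (an editor's illustration with simplified, hypothetical names; not the 
PR's FunctionRegistryHolder):

{code}
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class VersionedHolder<T> {
  private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();
  private final AtomicLong version = new AtomicLong(0);
  private T data;

  T get(AtomicLong versionOut) {
    lock.readLock().lock();
    try {
      versionOut.set(version.get());  // report the version this read is based on
      return data;
    } finally {
      lock.readLock().unlock();
    }
  }

  void set(T newData) {
    lock.writeLock().lock();
    try {
      data = newData;
      version.incrementAndGet();      // new version every time jars are added or removed
    } finally {
      lock.writeLock().unlock();
    }
  }
}
{code}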

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526677#comment-15526677
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80735634
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
 ---
@@ -301,29 +323,120 @@ private ScanResult scan(ClassLoader classLoader, Path path, URL[] urls) throws I
         return RunTimeScan.dynamicPackageScan(drillConfig, Sets.newHashSet(urls));
       }
     }
-    throw new FunctionValidationException(String.format("Marker file %s is missing in %s.",
+    throw new JarValidationException(String.format("Marker file %s is missing in %s",
         CommonConstants.DRILL_JAR_MARKER_FILE_RESOURCE_PATHNAME, path.getName()));
   }
 
-  private static String getUdfDir() {
-    return Preconditions.checkNotNull(System.getenv("DRILL_UDF_DIR"), "DRILL_UDF_DIR variable is not set");
+  /**
+   * Return list of jars that are missing in local function registry
+   * but present in remote function registry.
+   *
+   * @param remoteFunctionRegistry remote function registry
+   * @param localFunctionRegistry local function registry
+   * @return list of missing jars
+   */
+  private List<String> getMissingJars(RemoteFunctionRegistry remoteFunctionRegistry,
+                                      LocalFunctionRegistry localFunctionRegistry) {
+    List<Jar> remoteJars = remoteFunctionRegistry.getRegistry().getJarList();
+    List<String> localJars = localFunctionRegistry.getAllJarNames();
+    List<String> missingJars = Lists.newArrayList();
+    for (Jar jar : remoteJars) {
+      if (!localJars.contains(jar.getName())) {
+        missingJars.add(jar.getName());
+      }
+    }
+    return missingJars;
+  }
+
+  /**
+   * Creates local udf directory, if it doesn't exist.
+   * Checks that it is a directory and that the current application has write rights on it.
+   * Attempts to clean up the local udf directory in case jars were left after a previous drillbit run.
+   *
+   * @return path to local udf directory
+   */
+  private Path getLocalUdfDir() {
+    String confDir = getConfDir();
--- End diff --

Unfortunately, I didn't realize that $DRILL_HOME and $DRILL_CONF_DIR in DoY 
are not writable.
Yes, you are right: we do clean up the local udf directory from previously 
loaded jars each time the drillbit starts up, so it is definitely a good idea 
to keep the local udf directory in the tmp folder.
So I have added $DRILL_TMP_DIR in drill-config.sh, with a default of /tmp if 
not set.
In code I concatenate $DRILL_TMP_DIR + drill.exec.udf.directory.base, which 
defaults to ${drill.exec.zk.root} + "/udf".



> Dynamic UDFs support
> 
>
> Key: DRILL-4726
> URL: https://issues.apache.org/jira/browse/DRILL-4726
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.6.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: Future
>
>
> Allow register UDFs without  restart of Drillbits.
> Design is described in document below:
> https://docs.google.com/document/d/1FfyJtWae5TLuyheHCfldYUpCdeIezR2RlNsrOTYyAB4/edit?usp=sharing
>  





[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526680#comment-15526680
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80647672
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
 ---
@@ -179,14 +190,15 @@ public DrillFuncHolder findExactMatchingDrillFunction(String name, List<MajorType> argTypes, MajorType returnType, boolean retry) {
-    for (DrillFuncHolder h : drillFuncRegistry.getMethods(name)) {
+    AtomicLong version = new AtomicLong();
--- End diff --

We use an AtomicLong to hold the local registry version: when we call 
localFunctionRegistry.getMethods(name, version), we pass our AtomicLong and it is 
enriched with the current local registry version number. Please see the 
FunctionRegistryHolder.getMethods description for more details.


> Dynamic UDFs support
> 
>
> Key: DRILL-4726
> URL: https://issues.apache.org/jira/browse/DRILL-4726
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.6.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: Future
>
>
> Allow register UDFs without  restart of Drillbits.
> Design is described in document below:
> https://docs.google.com/document/d/1FfyJtWae5TLuyheHCfldYUpCdeIezR2RlNsrOTYyAB4/edit?usp=sharing
>  





[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526683#comment-15526683
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80646286
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/coord/zk/ZookeeperClient.java
 ---
@@ -257,14 +263,47 @@ public void put(final String path, final byte[] data, DataChangeVersion version)
       }
       if (hasNode) {
         if (version != null) {
-          curator.setData().withVersion(version.getVersion()).forPath(target, data);
+          try {
+            curator.setData().withVersion(version.getVersion()).forPath(target, data);
+          } catch (final KeeperException.BadVersionException e) {
+            throw new VersionMismatchException("Unable to put data. Version mismatch is detected.", version.getVersion(), e);
+          }
         } else {
           curator.setData().forPath(target, data);
         }
       }
       getCache().rebuildNode(target);
-    } catch (final KeeperException.BadVersionException e) {
-      throw new VersionMismatchException(e);
+    } catch (final VersionMismatchException e) {
+      throw e;
+    } catch (final Exception e) {
+      throw new DrillRuntimeException("unable to put ", e);
+    }
+  }
+
+  /**
+   * Puts the given byte sequence into the given path if the path does not exist.
+   *
+   * @param path  target path
+   * @param data  data to store
+   * @return null if the path was created, else the data stored for the given path
+   */
+  public byte[] putIfAbsent(final String path, final byte[] data) {
+    Preconditions.checkNotNull(path, "path is required");
+    Preconditions.checkNotNull(data, "data is required");
+
+    final String target = PathUtils.join(root, path);
+    try {
+      boolean hasNode = hasPath(path, true);
--- End diff --

Agree.
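
A hypothetical usage of the putIfAbsent contract shown in the diff (variable 
names are illustrative):

    // Returns null when this caller created the node; otherwise returns the
    // data already stored, so the caller can detect it lost the race.
    byte[] existing = zkClient.putIfAbsent("udf/registry", newData);
    if (existing == null) {
      // node created; newData is now stored at the path
    } else {
      // node already existed; 'existing' holds the current data
    }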


> Dynamic UDFs support
> 
>
> Key: DRILL-4726
> URL: https://issues.apache.org/jira/browse/DRILL-4726
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.6.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: Future
>
>
> Allow registering UDFs without restarting Drillbits.
> Design is described in the document below:
> https://docs.google.com/document/d/1FfyJtWae5TLuyheHCfldYUpCdeIezR2RlNsrOTYyAB4/edit?usp=sharing
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526676#comment-15526676
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80692935
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/registry/FunctionRegistryHolder.java
 ---
@@ -0,0 +1,360 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.fn.registry;
+
+import com.google.common.collect.ArrayListMultimap;
+import com.google.common.collect.ListMultimap;
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.collect.Queues;
+import org.apache.drill.common.concurrent.AutoCloseableLock;
+import org.apache.drill.exec.expr.fn.DrillFuncHolder;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Queue;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.ReadWriteLock;
+import java.util.concurrent.locks.ReentrantReadWriteLock;
+
+/**
+ * Function registry holder stores function implementations by jar name and function name.
+ * It contains two maps that hold data by jars and by functions respectively.
+ * The jars map contains each jar as a key and a map of all its functions with a collection of function signatures as value.
+ * The functions map contains the function name as key and a map of its signatures and function holders as value.
+ * All maps and collections used are concurrent to guarantee memory consistency effects.
+ * Such a structure is chosen to achieve maximum speed when retrieving data by jar or by function name,
+ * since we expect infrequent registry changes.
+ * The holder is designed to allow concurrent reads and single writes to keep data consistent.
+ * This is achieved by using a {@link ReadWriteLock} implementation.
+ * The holder has a version number which changes every time new jars are added or removed. The initial version number is 0.
+ * The version is also used when a user needs data from the registry together with the version it is based on.
+ *
+ * Structure example:
+ *
+ * JARS
+ * built-in   -> upper  -> upper(VARCHAR-REQUIRED)
+ *-> lower  -> lower(VARCHAR-REQUIRED)
+ *
+ * First.jar  -> upper  -> upper(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * Second.jar -> lower  -> lower(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * FUNCTIONS
+ * upper        -> upper(VARCHAR-REQUIRED)        -> function holder for upper(VARCHAR-REQUIRED)
+ *              -> upper(VARCHAR-OPTIONAL)        -> function holder for upper(VARCHAR-OPTIONAL)
+ *
+ * lower        -> lower(VARCHAR-REQUIRED)        -> function holder for lower(VARCHAR-REQUIRED)
+ *              -> lower(VARCHAR-OPTIONAL)        -> function holder for lower(VARCHAR-OPTIONAL)
+ *
+ * custom_upper -> custom_upper(VARCHAR-REQUIRED) -> function holder for custom_upper(VARCHAR-REQUIRED)
+ *              -> custom_upper(VARCHAR-OPTIONAL) -> function holder for custom_upper(VARCHAR-OPTIONAL)
+ *
+ * custom_lower -> custom_lower(VARCHAR-REQUIRED) -> function holder for custom_lower(VARCHAR-REQUIRED)
+ *              -> custom_lower(VARCHAR-OPTIONAL) -> function holder for custom_lower(VARCHAR-OPTIONAL)
+ *
+ * where
+ * First.jar is a jar name represented by a String,
+ * upper is a function name represented by a String,
+ * upper(VARCHAR-REQUIRED) is a signature name represented by a String which consists of the function name and its list of input parameters,
+ * function holder for upper(VARCHAR-REQUIRED) is {@link 
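
As a rough illustration of the two-map structure described above (generic 
types are assumptions, since the quoted javadoc does not show them):

    // jar name -> (function name -> collection of signature names)
    Map<String, ListMultimap<String, String>> jars = Maps.newConcurrentMap();
    // function name -> (signature name -> function holder)
    Map<String, Map<String, DrillFuncHolder>> functions = Maps.newConcurrentMap();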

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526678#comment-15526678
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80675326
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
 ---
@@ -301,29 +323,120 @@ private ScanResult scan(ClassLoader classLoader, Path path, URL[] urls) throws I
         return RunTimeScan.dynamicPackageScan(drillConfig, Sets.newHashSet(urls));
       }
     }
-    throw new FunctionValidationException(String.format("Marker file %s is missing in %s.",
+    throw new JarValidationException(String.format("Marker file %s is missing in %s",
         CommonConstants.DRILL_JAR_MARKER_FILE_RESOURCE_PATHNAME, path.getName()));
   }
 
-  private static String getUdfDir() {
-    return Preconditions.checkNotNull(System.getenv("DRILL_UDF_DIR"), "DRILL_UDF_DIR variable is not set");
+  /**
+   * Return list of jars that are missing in local function registry
+   * but present in remote function registry.
+   *
+   * @param remoteFunctionRegistry remote function registry
+   * @param localFunctionRegistry local function registry
+   * @return list of missing jars
+   */
+  private List<String> getMissingJars(RemoteFunctionRegistry remoteFunctionRegistry,
+                                      LocalFunctionRegistry localFunctionRegistry) {
+    List<Jar> remoteJars = remoteFunctionRegistry.getRegistry().getJarList();
+    List<String> localJars = localFunctionRegistry.getAllJarNames();
+    List<String> missingJars = Lists.newArrayList();
+    for (Jar jar : remoteJars) {
+      if (!localJars.contains(jar.getName())) {
+        missingJars.add(jar.getName());
+      }
+    }
+    return missingJars;
+  }
+
+  /**
+   * Creates local udf directory, if it doesn't exist.
+   * Checks that the local udf directory is a directory and that the current application has write rights on it.
+   * Attempts to clean up the local udf directory in case jars were left after a previous drillbit run.
+   *
+   * @return path to local udf directory
+   */
+  private Path getLocalUdfDir() {
+    String confDir = getConfDir();
+    File udfDir = new File(confDir, "udf");
+    String udfPath = udfDir.getPath();
+    udfDir.mkdirs();
+    Preconditions.checkState(udfDir.exists(), "Local udf directory [%s] must exist", udfPath);
+    Preconditions.checkState(udfDir.isDirectory(), "Local udf directory [%s] must be a directory", udfPath);
+    Preconditions.checkState(udfDir.canWrite(), "Local udf directory [%s] must be writable for application user", udfPath);
+    try {
+      FileUtils.cleanDirectory(udfDir);
+    } catch (IOException e) {
+      throw new DrillRuntimeException("Error during local udf directory clean up", e);
+    }
+    return new Path(udfDir.toURI());
+  }
+
+  /**
+   * First tries to get the drill conf directory value from system properties;
+   * if the value is missing, checks environment variables.
+   * Throws an exception if the value is null.
+   * @return drill conf dir path
+   */
+  private String getConfDir() {
+    String drillConfDir = "DRILL_CONF_DIR";
+    String value = System.getProperty(drillConfDir);
+    if (value == null) {
+      value = Preconditions.checkNotNull(System.getenv(drillConfDir), "%s variable is not set", drillConfDir);
+    }
+    return value;
+  }
+
+  /**
+   * Copies a jar from the remote udf area to the local udf area with a numeric suffix,
+   * in order to achieve uniqueness for each locally copied jar.
+   * Ex: DrillUDF-1.0.jar -> DrillUDF-1.0_12200255588.jar
+   *
+   * @param jarName jar name to be copied
+   * @param remoteFunctionRegistry remote function registry
+   * @return local path to the jar that was copied
+   * @throws IOException in case of problems during the jar copying process
+   */
+  private Path copyJarToLocal(String jarName, RemoteFunctionRegistry remoteFunctionRegistry) throws IOException {
+    String generatedName = String.format(generated_jar_name_pattern,
+        Files.getNameWithoutExtension(jarName), System.nanoTime(), Files.getFileExtension(jarName));
+    Path registryArea = remoteFunctionRegistry.getRegistryArea();
+    FileSystem fs = remoteFunctionRegistry.getFs();
+    Path remoteJar = new Path(registryArea, jarName);
+    Path localJar = new Path(localUdfDir, generatedName);
+    try {
+      fs.copyToLocalFile(remoteJar, localJar);
+    } catch (IOException e) {
+      String message = String.format("Error during jar [%s] copying 

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526679#comment-15526679
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80669448
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
 ---
@@ -301,29 +323,120 @@ private ScanResult scan(ClassLoader classLoader, Path path, URL[] urls) throws I
         return RunTimeScan.dynamicPackageScan(drillConfig, Sets.newHashSet(urls));
       }
     }
-    throw new FunctionValidationException(String.format("Marker file %s is missing in %s.",
+    throw new JarValidationException(String.format("Marker file %s is missing in %s",
         CommonConstants.DRILL_JAR_MARKER_FILE_RESOURCE_PATHNAME, path.getName()));
   }
 
-  private static String getUdfDir() {
-    return Preconditions.checkNotNull(System.getenv("DRILL_UDF_DIR"), "DRILL_UDF_DIR variable is not set");
+  /**
+   * Return list of jars that are missing in local function registry
+   * but present in remote function registry.
+   *
+   * @param remoteFunctionRegistry remote function registry
+   * @param localFunctionRegistry local function registry
+   * @return list of missing jars
+   */
+  private List<String> getMissingJars(RemoteFunctionRegistry remoteFunctionRegistry,
+                                      LocalFunctionRegistry localFunctionRegistry) {
+    List<Jar> remoteJars = remoteFunctionRegistry.getRegistry().getJarList();
+    List<String> localJars = localFunctionRegistry.getAllJarNames();
+    List<String> missingJars = Lists.newArrayList();
+    for (Jar jar : remoteJars) {
+      if (!localJars.contains(jar.getName())) {
+        missingJars.add(jar.getName());
+      }
+    }
+    return missingJars;
+  }
+
+  /**
+   * Creates local udf directory, if it doesn't exist.
+   * Checks that the local udf directory is a directory and that the current application has write rights on it.
+   * Attempts to clean up the local udf directory in case jars were left after a previous drillbit run.
+   *
+   * @return path to local udf directory
+   */
+  private Path getLocalUdfDir() {
+    String confDir = getConfDir();
+    File udfDir = new File(confDir, "udf");
+    String udfPath = udfDir.getPath();
+    udfDir.mkdirs();
+    Preconditions.checkState(udfDir.exists(), "Local udf directory [%s] must exist", udfPath);
+    Preconditions.checkState(udfDir.isDirectory(), "Local udf directory [%s] must be a directory", udfPath);
+    Preconditions.checkState(udfDir.canWrite(), "Local udf directory [%s] must be writable for application user", udfPath);
+    try {
+      FileUtils.cleanDirectory(udfDir);
+    } catch (IOException e) {
+      throw new DrillRuntimeException("Error during local udf directory clean up", e);
+    }
+    return new Path(udfDir.toURI());
+  }
+
+  /**
+   * First tries to get the drill conf directory value from system properties;
+   * if the value is missing, checks environment variables.
+   * Throws an exception if the value is null.
+   * @return drill conf dir path
+   */
+  private String getConfDir() {
+    String drillConfDir = "DRILL_CONF_DIR";
+    String value = System.getProperty(drillConfDir);
+    if (value == null) {
+      value = Preconditions.checkNotNull(System.getenv(drillConfDir), "%s variable is not set", drillConfDir);
+    }
+    return value;
+  }
+
+  /**
+   * Copies a jar from the remote udf area to the local udf area with a numeric suffix,
+   * in order to achieve uniqueness for each locally copied jar.
+   * Ex: DrillUDF-1.0.jar -> DrillUDF-1.0_12200255588.jar
+   *
+   * @param jarName jar name to be copied
+   * @param remoteFunctionRegistry remote function registry
+   * @return local path to the jar that was copied
+   * @throws IOException in case of problems during the jar copying process
+   */
+  private Path copyJarToLocal(String jarName, RemoteFunctionRegistry remoteFunctionRegistry) throws IOException {
+    String generatedName = String.format(generated_jar_name_pattern,
+        Files.getNameWithoutExtension(jarName), System.nanoTime(), Files.getFileExtension(jarName));
+    Path registryArea = remoteFunctionRegistry.getRegistryArea();
+    FileSystem fs = remoteFunctionRegistry.getFs();
+    Path remoteJar = new Path(registryArea, jarName);
+    Path localJar = new Path(localUdfDir, generatedName);
+    try {
+      fs.copyToLocalFile(remoteJar, localJar);
+    } catch (IOException e) {
+      String message = String.format("Error during jar [%s] copying 

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526685#comment-15526685
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80650009
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/planner/sql/handlers/DropFunctionHandler.java
 ---
@@ -48,54 +48,77 @@ public DropFunctionHandler(SqlHandlerConfig config) {
   }
 
   /**
-   * Drops UDFs dynamically.
+   * Unregisters UDFs dynamically. The process consists of several steps:
+   * 
+   * 1. Registers the jar in the jar registry to ensure that several jars with the same name are not being unregistered concurrently.
+   * 2. Starts the remote unregistration process: gets the list of all jars and excludes the jar to be deleted.
+   * 3. Signals drillbits to start the local unregistration process.
+   * 4. Removes the source and binary jars from the registry area.
+   * 
--- End diff --

As noted in previous comments, the expected behavior is to fail.
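
A heavily hedged outline of those steps (every method name below is a 
placeholder for illustration, not the actual handler API):

    // 1. Guard: record the jar name so a concurrent unregistration of the
    //    same jar fails fast, which is the expected behavior noted above.
    jarNameRegistry.addToUnregistration(jarName);
    // 2. Fetch the remote jar list and exclude the jar being deleted.
    Registry updated = excludeJar(remoteFunctionRegistry.getRegistry(), jarName);
    remoteFunctionRegistry.updateRegistry(updated);
    // 3. Signal drillbits to run their local unregistration process.
    remoteFunctionRegistry.submitForUnregistration(jarName);
    // 4. Remove the source and binary jars from the registry area.
    remoteFunctionRegistry.deleteJars(jarName);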


> Dynamic UDFs support
> 
>
> Key: DRILL-4726
> URL: https://issues.apache.org/jira/browse/DRILL-4726
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.6.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: Future
>
>
> Allow registering UDFs without restarting Drillbits.
> Design is described in the document below:
> https://docs.google.com/document/d/1FfyJtWae5TLuyheHCfldYUpCdeIezR2RlNsrOTYyAB4/edit?usp=sharing
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526684#comment-15526684
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80666252
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionImplementationRegistry.java
 ---
@@ -120,9 +129,9 @@ public FunctionImplementationRegistry(DrillConfig 
config, ScanResult classpathSc
* Register functions in given operator table.
* @param operatorTable
*/
-  public void register(DrillOperatorTable operatorTable) {
+  public void register(DrillOperatorTable operatorTable, AtomicLong 
version) {
--- End diff --

Agree. It's used as a holder to store the local function registry version the 
drill operator table was populated from.


> Dynamic UDFs support
> 
>
> Key: DRILL-4726
> URL: https://issues.apache.org/jira/browse/DRILL-4726
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.6.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: Future
>
>
> Allow registering UDFs without restarting Drillbits.
> Design is described in the document below:
> https://docs.google.com/document/d/1FfyJtWae5TLuyheHCfldYUpCdeIezR2RlNsrOTYyAB4/edit?usp=sharing
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526681#comment-15526681
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80649045
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/registry/FunctionRegistryHolder.java
 ---
@@ -0,0 +1,360 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.fn.registry;
+
+import com.google.common.collect.ArrayListMultimap;
+import com.google.common.collect.ListMultimap;
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.collect.Queues;
+import org.apache.drill.common.concurrent.AutoCloseableLock;
+import org.apache.drill.exec.expr.fn.DrillFuncHolder;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Queue;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.ReadWriteLock;
+import java.util.concurrent.locks.ReentrantReadWriteLock;
+
+/**
+ * Function registry holder stores function implementations by jar name and function name.
+ * It contains two maps that hold data by jars and by functions respectively.
+ * The jars map contains each jar as a key and a map of all its functions with a collection of function signatures as value.
+ * The functions map contains the function name as key and a map of its signatures and function holders as value.
+ * All maps and collections used are concurrent to guarantee memory consistency effects.
+ * Such a structure is chosen to achieve maximum speed when retrieving data by jar or by function name,
+ * since we expect infrequent registry changes.
+ * The holder is designed to allow concurrent reads and single writes to keep data consistent.
+ * This is achieved by using a {@link ReadWriteLock} implementation.
+ * The holder has a version number which changes every time new jars are added or removed. The initial version number is 0.
+ * The version is also used when a user needs data from the registry together with the version it is based on.
+ *
+ * Structure example:
+ *
+ * JARS
+ * built-in   -> upper  -> upper(VARCHAR-REQUIRED)
+ *-> lower  -> lower(VARCHAR-REQUIRED)
+ *
+ * First.jar  -> upper  -> upper(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * Second.jar -> lower  -> lower(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * FUNCTIONS
+ * upper        -> upper(VARCHAR-REQUIRED)        -> function holder for upper(VARCHAR-REQUIRED)
+ *              -> upper(VARCHAR-OPTIONAL)        -> function holder for upper(VARCHAR-OPTIONAL)
+ *
+ * lower        -> lower(VARCHAR-REQUIRED)        -> function holder for lower(VARCHAR-REQUIRED)
+ *              -> lower(VARCHAR-OPTIONAL)        -> function holder for lower(VARCHAR-OPTIONAL)
+ *
+ * custom_upper -> custom_upper(VARCHAR-REQUIRED) -> function holder for custom_upper(VARCHAR-REQUIRED)
+ *              -> custom_upper(VARCHAR-OPTIONAL) -> function holder for custom_upper(VARCHAR-OPTIONAL)
+ *
+ * custom_lower -> custom_lower(VARCHAR-REQUIRED) -> function holder for custom_lower(VARCHAR-REQUIRED)
+ *              -> custom_lower(VARCHAR-OPTIONAL) -> function holder for custom_lower(VARCHAR-OPTIONAL)
+ *
+ * where
+ * First.jar is a jar name represented by a String,
+ * upper is a function name represented by a String,
+ * upper(VARCHAR-REQUIRED) is a signature name represented by a String which consists of the function name and its list of input parameters,
+ * function holder for upper(VARCHAR-REQUIRED) is {@link 

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526682#comment-15526682
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80649687
  
--- Diff: 
exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/registry/FunctionRegistryHolder.java
 ---
@@ -0,0 +1,360 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ * 
+ * http://www.apache.org/licenses/LICENSE-2.0
+ * 
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.drill.exec.expr.fn.registry;
+
+import com.google.common.collect.ArrayListMultimap;
+import com.google.common.collect.ListMultimap;
+import com.google.common.collect.Lists;
+import com.google.common.collect.Maps;
+import com.google.common.collect.Queues;
+import org.apache.drill.common.concurrent.AutoCloseableLock;
+import org.apache.drill.exec.expr.fn.DrillFuncHolder;
+
+import java.util.List;
+import java.util.Map;
+import java.util.Queue;
+import java.util.concurrent.atomic.AtomicLong;
+import java.util.concurrent.locks.ReadWriteLock;
+import java.util.concurrent.locks.ReentrantReadWriteLock;
+
+/**
+ * Function registry holder stores function implementations by jar name and function name.
+ * It contains two maps that hold data by jars and by functions respectively.
+ * The jars map contains each jar as a key and a map of all its functions with a collection of function signatures as value.
+ * The functions map contains the function name as key and a map of its signatures and function holders as value.
+ * All maps and collections used are concurrent to guarantee memory consistency effects.
+ * Such a structure is chosen to achieve maximum speed when retrieving data by jar or by function name,
+ * since we expect infrequent registry changes.
+ * The holder is designed to allow concurrent reads and single writes to keep data consistent.
+ * This is achieved by using a {@link ReadWriteLock} implementation.
+ * The holder has a version number which changes every time new jars are added or removed. The initial version number is 0.
+ * The version is also used when a user needs data from the registry together with the version it is based on.
+ *
+ * Structure example:
+ *
+ * JARS
+ * built-in   -> upper  -> upper(VARCHAR-REQUIRED)
+ *-> lower  -> lower(VARCHAR-REQUIRED)
+ *
+ * First.jar  -> upper  -> upper(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * Second.jar -> lower  -> lower(VARCHAR-OPTIONAL)
+ *-> custom_upper   -> custom_upper(VARCHAR-REQUIRED)
+ *  -> custom_upper(VARCHAR-OPTIONAL)
+ *
+ * FUNCTIONS
+ * upper        -> upper(VARCHAR-REQUIRED)        -> function holder for upper(VARCHAR-REQUIRED)
+ *              -> upper(VARCHAR-OPTIONAL)        -> function holder for upper(VARCHAR-OPTIONAL)
+ *
+ * lower        -> lower(VARCHAR-REQUIRED)        -> function holder for lower(VARCHAR-REQUIRED)
+ *              -> lower(VARCHAR-OPTIONAL)        -> function holder for lower(VARCHAR-OPTIONAL)
+ *
+ * custom_upper -> custom_upper(VARCHAR-REQUIRED) -> function holder for custom_upper(VARCHAR-REQUIRED)
+ *              -> custom_upper(VARCHAR-OPTIONAL) -> function holder for custom_upper(VARCHAR-OPTIONAL)
+ *
+ * custom_lower -> custom_lower(VARCHAR-REQUIRED) -> function holder for custom_lower(VARCHAR-REQUIRED)
+ *              -> custom_lower(VARCHAR-OPTIONAL) -> function holder for custom_lower(VARCHAR-OPTIONAL)
+ *
+ * where
+ * First.jar is a jar name represented by a String,
+ * upper is a function name represented by a String,
+ * upper(VARCHAR-REQUIRED) is a signature name represented by a String which consists of the function name and its list of input parameters,
+ * function holder for upper(VARCHAR-REQUIRED) is {@link 

[jira] [Commented] (DRILL-4726) Dynamic UDFs support

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526687#comment-15526687
 ] 

ASF GitHub Bot commented on DRILL-4726:
---

Github user arina-ielchiieva commented on a diff in the pull request:

https://github.com/apache/drill/pull/574#discussion_r80737998
  
--- Diff: exec/java-exec/src/main/resources/drill-module.conf ---
@@ -45,6 +45,8 @@ drill.client: {
   supports-complex-types: true
 }
 
+drill.home: "/tmp"
--- End diff --

Changed drill.home to drill.dfs-home.
The reason I set the value to /tmp is that I am not quite sure which value to 
set it to.
If I set it to, for example, /user/drill, the drillbit will fail at startup 
since currently we don't have such a directory, and the user that runs the 
drillbit usually doesn't have rights to create a directory under /.


> Dynamic UDFs support
> 
>
> Key: DRILL-4726
> URL: https://issues.apache.org/jira/browse/DRILL-4726
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.6.0
>Reporter: Arina Ielchiieva
>Assignee: Arina Ielchiieva
> Fix For: Future
>
>
> Allow registering UDFs without restarting Drillbits.
> Design is described in the document below:
> https://docs.google.com/document/d/1FfyJtWae5TLuyheHCfldYUpCdeIezR2RlNsrOTYyAB4/edit?usp=sharing
>  



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4906) CASE Expression with constant generates class exception

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4906?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15526343#comment-15526343
 ] 

ASF GitHub Bot commented on DRILL-4906:
---

GitHub user Serhii-Harnyk opened a pull request:

https://github.com/apache/drill/pull/598

DRILL-4906 CASE Expression with constant generates class exception



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/Serhii-Harnyk/drill DRILL-4906

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/drill/pull/598.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #598


commit 46f0df7cf89fff8e7d8cf9a21a810e1b4e292bce
Author: Serhii-Harnyk 
Date:   2016-09-22T12:06:10Z

DRILL-4906 CASE Expression with constant generates class exception




> CASE Expression with constant generates class exception
> ---
>
> Key: DRILL-4906
> URL: https://issues.apache.org/jira/browse/DRILL-4906
> Project: Apache Drill
>  Issue Type: Bug
>  Components: SQL Parser
>Affects Versions: 1.6.0, 1.8.0
>Reporter: Serhii Harnyk
>Assignee: Serhii Harnyk
> Fix For: 1.9.0
>
>
> How to reproduce:
> select (case when (true) then 1 end) from (values(1));
> Error
> Error: SYSTEM ERROR: ClassCastException: 
> org.apache.drill.exec.expr.holders.NullableVarCharHolder cannot be cast to 
> org.apache.drill.exec.expr.holders.VarCharHolder



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (DRILL-4906) CASE Expression with constant generates class exception

2016-09-27 Thread Serhii Harnyk (JIRA)
Serhii Harnyk created DRILL-4906:


 Summary: CASE Expression with constant generates class exception
 Key: DRILL-4906
 URL: https://issues.apache.org/jira/browse/DRILL-4906
 Project: Apache Drill
  Issue Type: Bug
  Components: SQL Parser
Affects Versions: 1.8.0, 1.6.0
Reporter: Serhii Harnyk
Assignee: Serhii Harnyk
 Fix For: 1.9.0


How to reproduce:

select (case when (true) then 1 end) from (values(1));

Error
Error: SYSTEM ERROR: ClassCastException: 
org.apache.drill.exec.expr.holders.NullableVarCharHolder cannot be cast to 
org.apache.drill.exec.expr.holders.VarCharHolder



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525244#comment-15525244
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user ssriniva123 commented on the issue:

https://github.com/apache/drill/pull/518
  
Paul,
The code you have listed is semantically equivalent to what I have already 
submitted for the pull request and will not solve handling of all malformed 
JSON records. Also, the code for reporting the error records works correctly 
as long as the error is reported by the parser correctly.

As I explained earlier, the JSON parser is not just a simple tokenizer; it 
keeps track of internal state, hence the issue. SerDes in Hive etc. work 
because they are record oriented, with clean record demarcations using a new 
line.

One solution is to submit a patch to the Jackson parser to expose a method 
that skips to the next line in the event of a parsing exception. This can be 
parametrized so that the behavior can be customized; a rough sketch of the 
idea follows below.
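
A rough sketch of that idea (illustrative only; the skipping is done on the 
underlying java.io.Reader, since Jackson exposes no such method today):

    // Discard input up to the next newline so parsing can resume at the next
    // record. Assumes record-per-line JSON.
    static void skipToNextLine(Reader reader) throws IOException {
      int ch;
      while ((ch = reader.read()) != -1 && ch != '\n') {
        // drop characters belonging to the malformed record
      }
    }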



> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: Future
>
>
> Currently a Drill query terminates upon the first encounter of an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (DRILL-4653) Malformed JSON should not stop the entire query from progressing

2016-09-27 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-4653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15525212#comment-15525212
 ] 

ASF GitHub Bot commented on DRILL-4653:
---

Github user paul-rogers commented on the issue:

https://github.com/apache/drill/pull/518
  
The open question was how we can discard a partly-built record during 
recovery. As far as I can tell (veterans, please correct me), the 
JSONRecordReader keeps track of the record count. So, all we have to do is not 
increment the count when we want to discard a record. Look in

JSONRecordReader.next( )
  ...
  outside: while (recordCount < DEFAULT_ROWS_PER_BATCH) {
    writer.setPosition(recordCount);  // Sets the position for the next read.
    write = jsonReader.write(writer); // Write the record. We can catch errors
                                      // and recover here??
    ...
    recordCount++;                    // Don't do this on a bad record
  ...
  writer.setValueCount(recordCount);  // The record reader controls the record count.

This seems to show the elements of a solution:

1. Try to read the record.
2. If a failure occurs, catch it here and clean up, as in the previous post.
3. Don't increment the record count. We reuse the current one on the next 
record read.

Now the only open question is how we clean up the in-flight record in case 
some columns are not present in the next record. Anyone know how to set a 
vector position to null (for optional), a default value (for required), or 
zero-length (for repeated)?
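
Putting steps 1-3 together, a minimal sketch (names follow the snippet above; 
the recovery details, including the skip helper, are assumptions rather than 
actual reader code):

    while (recordCount < DEFAULT_ROWS_PER_BATCH) {
      writer.setPosition(recordCount);   // next record writes into this slot
      try {
        jsonReader.write(writer);        // may throw on malformed input
        recordCount++;                   // count only successfully parsed records
      } catch (JsonParseException e) {
        skipToNextLine(reader);          // hypothetical helper: discard the bad record
        // recordCount is not incremented; the slot is reused by the next record
      }
    }
    writer.setValueCount(recordCount);   // the reader controls the final record count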


> Malformed JSON should not stop the entire query from progressing
> 
>
> Key: DRILL-4653
> URL: https://issues.apache.org/jira/browse/DRILL-4653
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Storage - JSON
>Affects Versions: 1.6.0
>Reporter: subbu srinivasan
> Fix For: Future
>
>
> Currently a Drill query terminates upon the first encounter of an invalid JSON line.
> Drill has to continue progressing after ignoring the bad records. Something 
> similar to a setting of (ignore.malformed.json) would help.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)