[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF

2019-06-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872948#comment-16872948
 ] 

ASF GitHub Bot commented on DRILL-7293:
---

paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") 
plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-505730216
 
 
   One solution to the schema issue for table functions is to use the `columns` 
trick from the text reader. If no schema is provided, then instead of creating 
a set of `field_n` columns, create a single `columns` array column. 
Specifically, if there is no schema defined for the table, and no schema in the 
plugin config (perhaps because the plugin config was created via a table 
function), then just use `columns`.
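
   As a sketch, a schema-less query against such a table function might then 
look like this (the path and regex are hypothetical, and this assumes the 
proposed `columns` fallback were implemented):
   
   ```
   -- hypothetical: relies on the proposed `columns` fallback
   SELECT columns[0] AS `year`, columns[1] AS `month`
   FROM table(dfs.logs.`app.log`(
     type => 'logRegex',
     regex => '(\\d\\d\\d\\d)-(\\d\\d) .*'))
   ```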
   
   If I get some time, I'll try this out. With the EVF, this might actually be 
pretty simple. Might be best to add such a feature via another PR.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Convert the regex ("log") plugin to use EVF
> ---
>
> Key: DRILL-7293
> URL: https://issues.apache.org/jira/browse/DRILL-7293
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.17.0
>
>
> The "log" plugin (which uses a regex to define the row format) is the subject 
> of Chapter 12 of the Learning Apache Drill book (though the version in the 
> book is simpler than the one in the master branch.)
> The recently-completed "Enhanced Vector Framework" (EVF, AKA the "row set 
> framework") gives Drill control over the size of batches created by readers, 
> and allows readers to use the recently-added provided schema mechanism.
> We wish to use the log reader as an example for how to convert a Drill format 
> plugin to use the EVF so that other developers can convert their own plugins.
> This PR provides the first set of log plugin changes to enable us to publish 
> a tutorial on the EVF.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (DRILL-7309) Improve documentation for table functions

2019-06-25 Thread Paul Rogers (JIRA)
Paul Rogers created DRILL-7309:
--

 Summary: Improve documentation for table functions
 Key: DRILL-7309
 URL: https://issues.apache.org/jira/browse/DRILL-7309
 Project: Apache Drill
  Issue Type: Bug
  Components: Documentation
Affects Versions: 1.16.0
Reporter: Paul Rogers
Assignee: Bridget Bevens


Consider the [documentation of table 
functions|https://drill.apache.org/docs/plugin-configuration-basics/], the 
"Using the Formats Attributes as Table Function Parameters" section. The 
documentation is a bit sparse and it always takes me a long time to remember 
how to use table functions. Here are some improvements.

> ...use the table function syntax:

> select a, b from table({table function name}(parameters))

> The table function name is the table name, the type parameter is the format 
> name, ...

Change the second line to:

```
select a, b from table(<table name>(type => '<format name>', <other parameters>))
```

The use of the angle brackets is a bit more consistent with other doc pages 
such as [this one|https://drill.apache.org/docs/query-directory-functions/]. We 
already mentioned the {{type}} parameter, but did not show it in the template. 
Then say:

The type parameter must match the name of a format plugin. This is the name you 
put in the {{type}} field of your plugin JSON as explained above. Note that 
this is *not* the name of your format config. That is, it might be "text", not 
"csv" or "csvh".

The type parameter *must* be the first parameter. Other parameters can appear 
in any order. You must provide required parameters. Only string, Boolean and 
integer parameters are supported. Table functions do not support lists (so you 
cannot specify the {{extensions}} list, for example).

If parameter names are the same as SQL reserved words, quote the parameter with 
back-ticks as for table and column names. Quote string values with 
single-quotes. Do not quote integer values.

If the string value contains back-slashes, you must escape them with a second 
back-slash. If the string contains a single-quote, you must escape it with 
another single-quote. Example:

```
`regex` => '''(\\d)'''
```
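
Putting these rules together, a complete call might look like this (the 
workspace and file name are illustrative):

```
SELECT a, b FROM table(dfs.data.`app.log`(
  type => 'logRegex',
  `regex` => '(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*',
  maxErrors => 10))
```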



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7293) Convert the regex ("log") plugin to use EVF

2019-06-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7293?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872907#comment-16872907
 ] 

ASF GitHub Bot commented on DRILL-7293:
---

paul-rogers commented on issue #1807: DRILL-7293: Convert the regex ("log") 
plugin to use EVF
URL: https://github.com/apache/drill/pull/1807#issuecomment-505714454
 
 
   @arina-ielchiieva, I was able to get the plugin to work for this query:
   
   ```
   SELECT * FROM table(dfs.tf.table1(
 type => 'logRegex',
 regex => '(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*',
 maxErrors => 10))
   ```
   
   To do this, I had to fix some of the issues described in DRILL-7298. In 
particular, DRILL-6672 notes that table functions are not able to call 
{{setFoo()}} methods as Jackson can, so table functions only work if the format 
plugin config fields are {{public}}. They were not public for the log format 
plugin, so I changed them to {{public}} to get the above query to work.
   
   If we look at the code in 
[`FormatPluginOptionsDescriptor.createConfigForTable()`](https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/store/dfs/FormatPluginOptionsDescriptor.java#L123),
 we'll see that there is nothing that would handle the `values` syntax 
suggested in your note. The only supported types are Java primitives.
   
   When I tried this query:
   
   ```
   SELECT * FROM table(dfs.tf.noGroups(
 type => 'logRegex',
 regex => '(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*',
 `schema`=>values('month', 'VARCHAR')))
   ```
   
   I got this result:
   
   ```
   PARSE ERROR: Encountered "values" at line 1, column 115.
   
   SQL Query: SELECT * FROM table(dfs.tf.noGroups(type => 'logRegex', regex => 
'(\\d\\d\\d\\d)-(\\d\\d)-(\\d\\d) .*', `schema`=>values('month', 'VARCHAR')))

^
   ```
   
   So, looks like the {{values}} trick does not work. Even if it did, the code 
to produce the values argument would use some kind of Java collection which 
would not match the {{List}} of the {{schema}} field.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Convert the regex ("log") plugin to use EVF
> ---
>
> Key: DRILL-7293
> URL: https://issues.apache.org/jira/browse/DRILL-7293
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.17.0
>
>
> The "log" plugin (which uses a regex to define the row format) is the subject 
> of Chapter 12 of the Learning Apache Drill book (though the version in the 
> book is simpler than the one in the master branch.)
> The recently-completed "Enhanced Vector Framework" (EVF, AKA the "row set 
> framework") gives Drill control over the size of batches created by readers, 
> and allows readers to use the recently-added provided schema mechanism.
> We wish to use the log reader as an example for how to convert a Drill format 
> plugin to use the EVF so that other developers can convert their own plugins.
> This PR provides the first set of log plugin changes to enable us to publish 
> a tutorial on the EVF.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7306) Disable "fast schema" batch for new scan framework

2019-06-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872877#comment-16872877
 ] 

ASF GitHub Bot commented on DRILL-7306:
---

paul-rogers commented on issue #1813: DRILL-7306: Disable schema-only batch for 
new scan framework
URL: https://github.com/apache/drill/pull/1813#issuecomment-505698254
 
 
   Commits squashed. Note that we can commit either this PR or DRILL-7293, but 
not both at the same time. I will need to add one line: to DRILL-7293 if this 
PR is committed first, or to this PR if DRILL-7293 is committed first.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Disable "fast schema" batch for new scan framework
> --
>
> Key: DRILL-7306
> URL: https://issues.apache.org/jira/browse/DRILL-7306
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
>  The EVF framework is set up to return a "fast schema" empty batch with only 
> schema as its first batch because, when the code was written, it seemed 
> that's how we wanted operators to work. However, DRILL-7305 notes that many 
> operators cannot handle empty batches.
> Since the empty-batch bugs show that Drill does not, in fact, provide a "fast 
> schema" batch, this ticket asks to disable the feature in the new scan 
> framework. The feature is disabled with a config option; it can be re-enabled 
> if ever it is needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (DRILL-6952) Merge row set based "compliant" text reader

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy closed DRILL-6952.
---
Resolution: Fixed

> Merge row set based "compliant" text reader
> ---
>
> Key: DRILL-6952
> URL: https://issues.apache.org/jira/browse/DRILL-6952
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.16.0
>
>
> The result set loader project created a revised version of the compliant text 
> reader that uses the result set loader framework (which includes the 
> schema-based projection framework.)
> This task merges that work into master:
> * Review the history of the compliant text reader for changes made in the 
> last year since the code was written.
> * Apply those changes to the row set-based code, as necessary.
> * Issue a PR for the new version of the compliant text reader
> * Work through any test issues that crop up in the pre-commit tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (DRILL-6952) Merge row set based "compliant" text reader

2019-06-25 Thread Anton Gozhiy (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872306#comment-16872306
 ] 

Anton Gozhiy edited comment on DRILL-6952 at 6/25/19 1:35 PM:
--

Verified with Drill version 1.17.0-SNAPSHOT (commit 
f3f7dbd40f5e899f2aacba35db8f50ffedfa9d3d)
Cases checked:

Tested with different storage plugin parameters (extractHeader, delimiters, etc.)
The same with table functions.
Complex JSON files with nested maps and arrays.
Data with implicit columns (with the v3 reader, all such columns are moved to 
the end of each row)
Aggregate functions with specific columns and with the wildcard.
Large text fields (they were limited to 65536 characters; now fixed)
No significant changes in performance were discovered (compared test runs with 
the different readers).
Some bugs were fixed by the V3 reader:
DRILL-5487, DRILL-5554, DRILL- (partially fixed), DRILL-4814, DRILL-7034, 
DRILL-7082, DRILL-7083
Bugs that were introduced by the V3 reader and then fixed:
DRILL-7181, DRILL-7257, DRILL-7258


was (Author: angozhiy):
Verified with Drill version 1.17.0-SNAPSHOT (commit 
f3f7dbd40f5e899f2aacba35db8f50ffedfa9d3d)
Cases checked:

Tested with different storage plugin parameters (extractHeader, delimiters, etc.)
The same with table functions.
Complex JSON files with nested maps and arrays.
Data with implicit columns (with the v3 reader, all such columns are moved to 
the end of each row)
Aggregate functions with specific columns and with the wildcard.
Large text fields (they were limited to 65536 characters; now fixed)
No significant changes in performance were discovered (compared test runs with 
the different readers).
Some bugs were fixed by the V3 reader:
DRILL-5487, DRILL-5554, DRILL-, DRILL-4814, DRILL-7034, DRILL-7082, 
DRILL-7083
Bugs that were introduced by the V3 reader and then fixed:
DRILL-7181, DRILL-7257, DRILL-7258

> Merge row set based "compliant" text reader
> ---
>
> Key: DRILL-6952
> URL: https://issues.apache.org/jira/browse/DRILL-6952
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.16.0
>
>
> The result set loader project created a revised version of the compliant text 
> reader that uses the result set loader framework (which includes the 
> schema-based projection framework.)
> This task merges that work into master:
> * Review the history of the compliant text reader for changes made in the 
> last year since the code was written.
> * Apply those changes to the row set-based code, as necessary.
> * Issue a PR for the new version of the compliant text reader
> * Work through any test issues that crop up in the pre-commit tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (DRILL-6952) Merge row set based "compliant" text reader

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy reopened DRILL-6952:
-

> Merge row set based "compliant" text reader
> ---
>
> Key: DRILL-6952
> URL: https://issues.apache.org/jira/browse/DRILL-6952
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.16.0
>
>
> The result set loader project created a revised version of the compliant text 
> reader that uses the result set loader framework (which includes the 
> schema-based projection framework.)
> This task merges that work into master:
> * Review the history of the compliant text reader for changes made in the 
> last year since the code was written.
> * Apply those changes to the row set-based code, as necessary.
> * Issue a PR for the new version of the compliant text reader
> * Work through any test issues that crop up in the pre-commit tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (DRILL-5487) Vector corruption in CSV with headers and truncated last row

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy closed DRILL-5487.
---
Resolution: Fixed

> Vector corruption in CSV with headers and truncated last row
> 
>
> Key: DRILL-5487
> URL: https://issues.apache.org/jira/browse/DRILL-5487
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Major
> Fix For: 1.17.0
>
>
> The CSV format plugin allows two ways of reading data:
> * As named columns
> * As a single array, called {{columns}}, that holds all columns for a row
> The named columns feature will corrupt the offset vectors if the last row of 
> the file is truncated, that is, it leaves off one or more columns.
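> For reference, a minimal sketch of the two modes (using the same test file 
> described below):
> {code}
> -- Named columns, with extractHeader = true:
> SELECT h, u FROM `dfs.data`.`csv/test4.csv`;
> -- Single "columns" array, without headers:
> SELECT columns[0], columns[1] FROM `dfs.data`.`csv/test4.csv`;
> {code}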
> To illustrate the CSV data corruption, I created a CSV file, test4.csv, of 
> the following form:
> {code}
> h,u
> abc,def
> ghi
> {code}
> Note that the file is truncated: the comma and second field are missing on 
> the last line.
> Then, I created a simple test using the "cluster fixture" framework:
> {code}
>   @Test
>   public void readerTest() throws Exception {
> FixtureBuilder builder = ClusterFixture.builder()
> .maxParallelization(1);
> try (ClusterFixture cluster = builder.build();
>  ClientFixture client = cluster.clientFixture()) {
>   TextFormatConfig csvFormat = new TextFormatConfig();
>   csvFormat.fieldDelimiter = ',';
>   csvFormat.skipFirstLine = false;
>   csvFormat.extractHeader = true;
>   cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
>   String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
>   client.queryBuilder().sql(sql).printCsv();
> }
>   }
> {code}
> The results show we've got a problem:
> {code}
> Exception (no rows returned): 
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> IllegalArgumentException: length: -3 (expected: >= 0)
> {code}
> If the last line were:
> {code}
> efg,
> {code}
> Then the offset vector should look like this:
> {code}
> [0, 3, 3]
> {code}
> Very likely we have an offset vector that looks like this instead:
> {code}
> [0, 3, 0]
> {code}
> When we compute the second column of the second row, we should compute:
> {code}
> length = offset[2] - offset[1] = 3 - 3 = 0
> {code}
> Instead we get:
> {code}
> length = offset[2] - offset[1] = 0 - 3 = -3
> {code}
> The summary is that a premature EOF appears to cause the "missing" columns to 
> be skipped; they are not filled with a blank value to "bump" the offset 
> vectors to fill in the last row. Instead, they are left at 0, causing havoc 
> downstream in the query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (DRILL-7083) Wrong data type for explicit partition column beyond file depth

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy closed DRILL-7083.
---
Resolution: Fixed

> Wrong data type for explicit partition column beyond file depth
> ---
>
> Key: DRILL-7083
> URL: https://issues.apache.org/jira/browse/DRILL-7083
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Priority: Minor
> Fix For: 1.17.0
>
>
> Consider the simple case in DRILL-7082. That ticket talks about implicit 
> partition columns created by the wildcard. Consider a very similar case:
> {code:sql}
> SELECT a, b, c, dir0, dir1 FROM `myTable`
> {code}
> Where {{myTable}} is a directory of CSV files, each with schema {{(a, b, c)}}:
> {noformat}
> myTable
> |- file1.csv
> |- nested
>|- file2.csv
> {noformat}
> If the query is run in "stock" Drill, the planner will place both files 
> within a single scan operator as described in DRILL-7082. The result schema 
> will be:
> {noformat}
> (a VARCHAR, b VARCHAR, c VARCHAR, dir0 VARCHAR, dir1 INT)
> {noformat}
> Notice that last column: why is "dir1" a (nullable) INT? The partition 
> mechanism only recognizes partitions that actually exist, leaving the Project 
> operator to fill in (with a Nullable INT) any partitions that don't exist 
> (any directory levels not actually seen by the scan operator.)
> Now, using the same trick as in DRILL-7082, try the query
> {code:sql}
> SELECT a, b, c, dir0 FROM `myTable`
> {code}
> Again, the trick causes Drill to read each file in a separate scan operator 
> (simulating what happens when queries run at scale.)
> The scan operator for {{file1.csv}} will see no partitions, so it will omit 
> "dir0" and the Project operator will helpfully fill in a Nullable INT. The 
> scan operator for {{file2.csv}} sees one level of partition, so sets {{dir0}} 
> to {{nested}} as a Nullable VARCHAR.
> What does the client see? Two records: one with "dir0" as a Nullable INT, the 
> other as a Nullable VARCHAR. Clients such as JDBC and ODBC see a hard schema 
> change between the two records.
> The two cases described above are really two versions of the same issue. 
> Clients expect that, if they use the "dir0", "dir1", ... columns, the 
> type is always Nullable Varchar so that the schema stays consistent across 
> batches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7083) Wrong data type for explicit partition column beyond file depth

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy updated DRILL-7083:

Fix Version/s: 1.17.0

> Wrong data type for explicit partition column beyond file depth
> ---
>
> Key: DRILL-7083
> URL: https://issues.apache.org/jira/browse/DRILL-7083
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Priority: Minor
> Fix For: 1.17.0
>
>
> Consider the simple case in DRILL-7082. That ticket talks about implicit 
> partition columns created by the wildcard. Consider a very similar case:
> {code:sql}
> SELECT a, b, c, dir0, dir1 FROM `myTable`
> {code}
> Where {{myTable}} is a directory of CSV files, each with schema {{(a, b, c)}}:
> {noformat}
> myTable
> |- file1.csv
> |- nested
>|- file2.csv
> {noformat}
> If the query is run in "stock" Drill, the planner will place both files 
> within a single scan operator as described in DRILL-7082. The result schema 
> will be:
> {noformat}
> (a VARCHAR, b VARCHAR, c VARCHAR, dir0 VARCHAR, dir1 INT)
> {noformat}
> Notice that last column: why is "dir1" a (nullable) INT? The partition 
> mechanism only recognizes partitions that actually exist, leaving the Project 
> operator to fill in (with a Nullable INT) any partitions that don't exist 
> (any directory levels not actually seen by the scan operator.)
> Now, using the same trick as in DRILL-7082, try the query
> {code:sql}
> SELECT a, b, c, dir0 FROM `myTable`
> {code}
> Again, the trick causes Drill to read each file in a separate scan operator 
> (simulating what happens when queries run at scale.)
> The scan operator for {{file1.csv}} will see no partitions, so it will omit 
> "dir0" and the Project operator will helpfully fill in a Nullable INT. The 
> scan operator for {{file2.csv}} sees one level of partition, so sets {{dir0}} 
> to {{nested}} as a Nullable VARCHAR.
> What does the client see? Two records: one with "dir0" as a Nullable INT, the 
> other as a Nullable VARCHAR. Clients such as JDBC and ODBC see a hard schema 
> change between the two records.
> The two cases described above are really two versions of the same issue. 
> Clients expect that, if they use the "dir0", "dir1", ... columns, the 
> type is always Nullable Varchar so that the schema stays consistent across 
> batches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (DRILL-7083) Wrong data type for explicit partition column beyond file depth

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7083?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy reopened DRILL-7083:
-

> Wrong data type for explicit partition column beyond file depth
> ---
>
> Key: DRILL-7083
> URL: https://issues.apache.org/jira/browse/DRILL-7083
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Priority: Minor
>
> Consider the simple case in DRILL-7082. That ticket talks about implicit 
> partition columns created by the wildcard. Consider a very similar case:
> {code:sql}
> SELECT a, b, c, dir0, dir1 FROM `myTable`
> {code}
> Where {{myTable}} is a directory of CSV files, each with schema {{(a, b, c)}}:
> {noformat}
> myTable
> |- file1.csv
> |- nested
>|- file2.csv
> {noformat}
> If the query is run in "stock" Drill, the planner will place both files 
> within a single scan operator as described in DRILL-7082. The result schema 
> will be:
> {noformat}
> (a VARCHAR, b VARCHAR, c VARCHAR, dir0 VARCHAR, dir1 INT)
> {noformat}
> Notice that last column: why is "dir1" a (nullable) INT? The partition 
> mechanism only recognizes partitions that actually exist, leaving the Project 
> operator to fill in (with a Nullable INT) any partitions that don't exist 
> (any directory levels not actually seen by the scan operator.)
> Now, using the same trick as in DRILL-7082, try the query
> {code:sql}
> SELECT a, b, c, dir0 FROM `myTable`
> {code}
> Again, the trick causes Drill to read each file in a separate scan operator 
> (simulating what happens when queries run at scale.)
> The scan operator for {{file1.csv}} will see no partitions, so it will omit 
> "dir0" and the Project operator will helpfully fill in a Nullable INT. The 
> scan operator for {{file2.csv}} sees one level of partition, so sets {{dir0}} 
> to {{nested}} as a Nullable VARCHAR.
> What does the client see? Two records: one with "dir0" as a Nullable INT, the 
> other as a Nullable VARCHAR. Clients such as JDBC and ODBC see a hard schema 
> change between the two records.
> The two cases described above are really two versions of the same issue. 
> Clients expect that, if they use the "dir0", "dir1", ... columns, the 
> type is always Nullable Varchar so that the schema stays consistent across 
> batches.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7082) Inconsistent results with implicit partition columns, multi scans

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy updated DRILL-7082:

Fix Version/s: 1.17.0

> Inconsistent results with implicit partition columns, multi scans
> -
>
> Key: DRILL-7082
> URL: https://issues.apache.org/jira/browse/DRILL-7082
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Priority: Minor
> Fix For: 1.17.0
>
>
> The runtime behavior of implicit partition columns is wildly inconsistent to 
> the point of being unusable. Consider the following query:
> {code:sql}
> SELECT * FROM `myTable`
> {code}
> Where {{myTable}} is a directory of CSV files, each with schema {{(a, b, c)}}:
> {noformat}
> myTable
> |- file1.csv
> |- nested
>|- file2.csv
> {noformat}
> Our test files are small. It turns out that, even if we write a test that scans a 
> few files, such as the above example, Drill will group all the reads into a 
> single fragment with a single scan operator. When that happens:
> * The partition columns appear before the data columns: (dir0, a, b, c).
> * The partition columns always appear in every row.
> We get the above result because a single scan operator sees both files and 
> knows the right number of partition columns to create for each.
> But, we know that, if two scans each read files at different depths, the 
> "shallower" one won't see as many partition directories as the "deeper" one. 
> To test this, I modified the text reader to accept a new session option that 
> sets the minimum parallelization. I set it to 2 (same as the number of 
> files.) One could probably also see this by creating large text files so that 
> the Drill parallelizer will choose to create two fragments.
> Then, I ran the above query 10 times. Now, I get these results:
> * Half the time, the first row has only the data columns (a, b, c), the other 
> half of the time the first row has a partition column. (Depending on which 
> file returned data first.)
> * Some of the time the partition column appears in the first position (dir0, 
> a, b, c) and some of the time in the last (a, b, c, dir0). (I have no idea 
> why.)
> The result is, from a two-file query, depending on random factors, your first 
> row schema could be:
> * (a, b, c)
> * (dir0, a, b, c)
> * (a, b, c, dir0)
> In many cases, the second row comes with a hard schema change to a different 
> format.
> The above is demonstrated in the (soon to be provided) {{TestPartitionRace}} 
> unit test.
> IMHO, the behavior is basically unusable as any JDBC/ODBC client will see an 
> inconsistent, changing schema. Instead, what a user would expect is:
> * The partition columns are in the same location in every row (preferably at 
> the end, so data columns remain in fixed positions regardless of the number 
> of partition columns.)
> * The same number of columns in every row. This means that all scan operators 
> must use a single uniform partition depth count, preferably set at plan time 
> in the group scan node that has visibility to all the files to scan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (DRILL-7082) Inconsistent results with implicit partition columns, multi scans

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy reopened DRILL-7082:
-

> Inconsistent results with implicit partition columns, multi scans
> -
>
> Key: DRILL-7082
> URL: https://issues.apache.org/jira/browse/DRILL-7082
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The runtime behavior of implicit partition columns is wildly inconsistent to 
> the point of being unusable. Consider the following query:
> {code:sql}
> SELECT * FROM `myTable`
> {code}
> Where {{myTable}} is a directory of CSV files, each with schema {{(a, b, c)}}:
> {noformat}
> myTable
> |- file1.csv
> |- nested
>|- file2.csv
> {noformat}
> Our test files are small. It turns out that, even if we write a test that scans a 
> few files, such as the above example, Drill will group all the reads into a 
> single fragment with a single scan operator. When that happens:
> * The partition columns appear before the data columns: (dir0, a, b, c).
> * The partition columns always appear in every row.
> We get the above result because a single scan operator sees both files and 
> knows the right number of partition columns to create for each.
> But, we know that, if two scans each read files at different depths, the 
> "shallower" one won't see as many partition directories as the "deeper" one. 
> To test this, I modified the text reader to accept a new session option that 
> sets the minimum parallelization. I set it to 2 (same as the number of 
> files.) One could probably also see this by creating large text files so that 
> the Drill parallelizer will choose to create two fragments.
> Then, I ran the above query 10 times. Now, I get these results:
> * Half the time, the first row has only the data columns (a, b, c), the other 
> half of the time the first row has a partition column. (Depending on which 
> file returned data first.)
> * Some of the time the partition column appears in the first position (dir0, 
> a, b, c) and some of the time in the last (a, b, c, dir0). (I have no idea 
> why.)
> The result is, from a two-file query, depending on random factors, your first 
> row schema could be:
> * (a, b, c)
> * (dir0, a, b, c)
> * (a, b, c, dir0)
> In many cases, the second row comes with a hard schema change to a different 
> format.
> The above is demonstrated in the (soon to be provided) {{TestPartitionRace}} 
> unit test.
> IMHO, the behavior is basically unusable as any JDBC/ODBC client will see an 
> inconsistent, changing schema. Instead, what a user would expect is:
> * The partition columns are in the same location in every row (preferably at 
> the end, so data columns remain in fixed positions regardless of the number 
> of partition columns.)
> * The same number of columns in every row. This means that all scan operators 
> must use a single uniform partition depth count, preferably set at plan time 
> in the group scan node that has visibility to all the files to scan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (DRILL-7082) Inconsistent results with implicit partition columns, multi scans

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7082?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy closed DRILL-7082.
---
Resolution: Fixed

> Inconsistent results with implicit partition columns, multi scans
> -
>
> Key: DRILL-7082
> URL: https://issues.apache.org/jira/browse/DRILL-7082
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Priority: Minor
> Fix For: 1.17.0
>
>
> The runtime behavior of implicit partition columns is wildly inconsistent to 
> the point of being unusable. Consider the following query:
> {code:sql}
> SELECT * FROM `myTable`
> {code}
> Where {{myTable}} is a directory of CSV files, each with schema {{(a, b, c)}}:
> {noformat}
> myTable
> |- file1.csv
> |- nested
>|- file2.csv
> {noformat}
> Our test files are small. It turns out that, even if we write a test that scans a 
> few files, such as the above example, Drill will group all the reads into a 
> single fragment with a single scan operator. When that happens:
> * The partition columns appear before the data columns: (dir0, a, b, c).
> * The partition columns always appear in every row.
> We get the above result because a single scan operator sees both files and 
> knows the right number of partition columns to create for each.
> But, we know that, if two scans each read files at different depths, the 
> "shallower" one won't see as many partition directories as the "deeper" one. 
> To test this, I modified the text reader to accept a new session option that 
> sets the minimum parallelization. I set it to 2 (same as the number of 
> files.) One could probably also see this by creating large text files so that 
> the Drill parallelizer will choose to create two fragments.
> Then, I ran the above query 10 times. Now, I get these results:
> * Half the time, the first row has only the data columns (a, b, c), the other 
> half of the time the first row has a partition column. (Depending on which 
> file returned data first.)
> * Some of the time the partition column appears in the first position (dir0, 
> a, b, c) and some of the time in the last (a, b, c, dir0). (I have no idea 
> why.)
> The result is, from a two-file query, depending on random factors, your first 
> row schema could be:
> * (a, b, c)
> * (dir0, a, b, c)
> * (a, b, c, dir0)
> In many cases, the second row comes with a hard schema change to a different 
> format.
> The above is demonstrated in the (soon to be provided) {{TestPartitionRace}} 
> unit test.
> IMHO, the behavior is basically unusable as any JDBC/ODBC client will see an 
> inconsistent, changing schema. Instead, what a user would expect is:
> * The partition columns are in the same location in every row (preferably at 
> the end, so data columns remain in fixed positions regardless of the number 
> of partition columns.)
> * The same number of columns in every row. This means that all scan operators 
> must use a single uniform partition depth count, preferably set at plan time 
> in the group scan node that has visibility to all the files to scan.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (DRILL-5554) Wrong error type for "SELECT a" from a CSV file without headers

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-5554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy closed DRILL-5554.
---
Resolution: Fixed

> Wrong error type for "SELECT a" from a CSV file without headers
> ---
>
> Key: DRILL-5554
> URL: https://issues.apache.org/jira/browse/DRILL-5554
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Trivial
> Fix For: 1.17.0
>
>
> Create a CSV file without headers:
> {code}
> 10,foo,bar
> {code}
> Use a CSV storage plugin configured to not skip the first line and not read 
> headers.
> Then, issue the following query:
> {code}
> SELECT a FROM `dfs.data.example.csv`
> {code}
> The result is correct: an error:
> {code}
> org.apache.drill.common.exceptions.UserRemoteException: 
> DATA_READ ERROR: Selected column 'a' must have name 'columns' or must be 
> plain '*'
> {code}
> But, the type of error is wrong. This is not a data read error: the file read 
> just fine. The problem is a semantic error: a query form that is not 
> compatible with the storage plugin.
> Suggest using {{UserException.unsupportedError()}} instead since the user is 
> asking the plugin to do something that the plugin does not support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (DRILL-5554) Wrong error type for "SELECT a" from a CSV file without headers

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-5554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy reopened DRILL-5554:
-

> Wrong error type for "SELECT a" from a CSV file without headers
> ---
>
> Key: DRILL-5554
> URL: https://issues.apache.org/jira/browse/DRILL-5554
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Trivial
>
> Create a CSV file without headers:
> {code}
> 10,foo,bar
> {code}
> Use a CSV storage plugin configured to not skip the first line and not read 
> headers.
> Then, issue the following query:
> {code}
> SELECT a FROM `dfs.data.example.csv`
> {code}
> The result is correct: an error:
> {code}
> org.apache.drill.common.exceptions.UserRemoteException: 
> DATA_READ ERROR: Selected column 'a' must have name 'columns' or must be 
> plain '*'
> {code}
> But, the type of error is wrong. This is not a data read error: the file read 
> just fine. The problem is a semantic error: a query form that is not 
> compatible with the storage plugin.
> Suggest using {{UserException.unsupportedError()}} instead since the user is 
> asking the plugin to do something that the plugin does not support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-5554) Wrong error type for "SELECT a" from a CSV file without headers

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-5554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy updated DRILL-5554:

Fix Version/s: 1.17.0

> Wrong error type for "SELECT a" from a CSV file without headers
> ---
>
> Key: DRILL-5554
> URL: https://issues.apache.org/jira/browse/DRILL-5554
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Trivial
> Fix For: 1.17.0
>
>
> Create a CSV file without headers:
> {code}
> 10,foo,bar
> {code}
> Use a CSV storage plugin configured to not skip the first line and not read 
> headers.
> Then, issue the following query:
> {code}
> SELECT a FROM `dfs.data.example.csv`
> {code}
> The result is correct: an error:
> {code}
> org.apache.drill.common.exceptions.UserRemoteException: 
> DATA_READ ERROR: Selected column 'a' must have name 'columns' or must be 
> plain '*'
> {code}
> But, the type of error is wrong. This is not a data read error: the file read 
> just fine. The problem is a semantic error: a query form that is not 
> compatible with the storage plugin.
> Suggest using {{UserException.unsupportedError()}} instead since the user is 
> asking the plugin to do something that the plugin does not support.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-5487) Vector corruption in CSV with headers and truncated last row

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy updated DRILL-5487:

Fix Version/s: (was: Future)
   1.17.0

> Vector corruption in CSV with headers and truncated last row
> 
>
> Key: DRILL-5487
> URL: https://issues.apache.org/jira/browse/DRILL-5487
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Major
> Fix For: 1.17.0
>
>
> The CSV format plugin allows two ways of reading data:
> * As named columns
> * As a single array, called {{columns}}, that holds all columns for a row
> The named columns feature will corrupt the offset vectors if the last row of 
> the file is truncated, that is, it leaves off one or more columns.
> To illustrate the CSV data corruption, I created a CSV file, test4.csv, of 
> the following form:
> {code}
> h,u
> abc,def
> ghi
> {code}
> Note that the file is truncated: the comma and second field are missing on 
> the last line.
> Then, I created a simple test using the "cluster fixture" framework:
> {code}
>   @Test
>   public void readerTest() throws Exception {
> FixtureBuilder builder = ClusterFixture.builder()
> .maxParallelization(1);
> try (ClusterFixture cluster = builder.build();
>  ClientFixture client = cluster.clientFixture()) {
>   TextFormatConfig csvFormat = new TextFormatConfig();
>   csvFormat.fieldDelimiter = ',';
>   csvFormat.skipFirstLine = false;
>   csvFormat.extractHeader = true;
>   cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
>   String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
>   client.queryBuilder().sql(sql).printCsv();
> }
>   }
> {code}
> The results show we've got a problem:
> {code}
> Exception (no rows returned): 
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> IllegalArgumentException: length: -3 (expected: >= 0)
> {code}
> If the last line were:
> {code}
> efg,
> {code}
> Then the offset vector should look like this:
> {code}
> [0, 3, 3]
> {code}
> Very likely we have an offset vector that looks like this instead:
> {code}
> [0, 3, 0]
> {code}
> When we compute the second column of the second row, we should compute:
> {code}
> length = offset[2] - offset[1] = 3 - 3 = 0
> {code}
> Instead we get:
> {code}
> length = offset[2] - offset[1] = 0 - 3 = -3
> {code}
> The summary is that a premature EOF appears to cause the "missing" columns to 
> be skipped; they are not filled with a blank value to "bump" the offset 
> vectors to fill in the last row. Instead, they are left at 0, causing havoc 
> downstream in the query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Reopened] (DRILL-5487) Vector corruption in CSV with headers and truncated last row

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-5487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy reopened DRILL-5487:
-

> Vector corruption in CSV with headers and truncated last row
> 
>
> Key: DRILL-5487
> URL: https://issues.apache.org/jira/browse/DRILL-5487
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - Text & CSV
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Priority: Major
> Fix For: Future
>
>
> The CSV format plugin allows two ways of reading data:
> * As named columns
> * As a single array, called {{columns}}, that holds all columns for a row
> The named columns feature will corrupt the offset vectors if the last row of 
> the file is truncated, that is, it leaves off one or more columns.
> To illustrate the CSV data corruption, I created a CSV file, test4.csv, of 
> the following form:
> {code}
> h,u
> abc,def
> ghi
> {code}
> Note that the file is truncated: the comma and second field are missing on 
> the last line.
> Then, I created a simple test using the "cluster fixture" framework:
> {code}
>   @Test
>   public void readerTest() throws Exception {
> FixtureBuilder builder = ClusterFixture.builder()
> .maxParallelization(1);
> try (ClusterFixture cluster = builder.build();
>  ClientFixture client = cluster.clientFixture()) {
>   TextFormatConfig csvFormat = new TextFormatConfig();
>   csvFormat.fieldDelimiter = ',';
>   csvFormat.skipFirstLine = false;
>   csvFormat.extractHeader = true;
>   cluster.defineWorkspace("dfs", "data", "/tmp/data", "csv", csvFormat);
>   String sql = "SELECT * FROM `dfs.data`.`csv/test4.csv` LIMIT 10";
>   client.queryBuilder().sql(sql).printCsv();
> }
>   }
> {code}
> The results show we've got a problem:
> {code}
> Exception (no rows returned): 
> org.apache.drill.common.exceptions.UserRemoteException: SYSTEM ERROR:
> IllegalArgumentException: length: -3 (expected: >= 0)
> {code}
> If the last line were:
> {code}
> efg,
> {code}
> Then the offset vector should look like this:
> {code}
> [0, 3, 3]
> {code}
> Very likely we have an offset vector that looks like this instead:
> {code}
> [0, 3, 0]
> {code}
> When we compute the second column of the second row, we should compute:
> {code}
> length = offset[2] - offset[1] = 3 - 3 = 0
> {code}
> Instead we get:
> {code}
> length = offset[2] - offset[1] = 0 - 3 = -3
> {code}
> The summary is that a premature EOF appears to cause the "missing" columns to 
> be skipped; they are not filled with a blank value to "bump" the offset 
> vectors to fill in the last row. Instead, they are left at 0, causing havoc 
> downstream in the query.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7308) Incorrect Metadata from text file queries

2019-06-25 Thread Charles Givre (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7308?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872321#comment-16872321
 ] 

Charles Givre commented on DRILL-7308:
--

Hi [~Paul.Rogers]
I'm fine with Drill returning the width and precision; however, it doesn't seem 
to be doing so consistently, and it was breaking my SQLAlchemy driver, which is 
how I found this in the first place. For example, if you query a non-CSV file, 
you just get VARCHAR as the data type with no width or precision. Also, as 
noted, the width and precision seem to be wrong. When I submitted 
https://issues.apache.org/jira/browse/DRILL-6847, I did test VARCHAR fields 
with a specified width and it did work at the time, so I suspect something has 
changed along the way.

Separately, and I think this is related, I've been developing a series of UDFs 
that accept a VARCHAR as input and return complex fields. When I use a UDF in a 
query with data from either a CSV file or from VALUES(), I get errors, but if 
the data is from another source, it works. See the examples below:

{{apache drill> SELECT whois('gtkcyber.com') FROM (VALUES(1));
Error: FUNCTION ERROR: WHOIS does not support operand types (CHAR)


[Error Id: af4705e6-e4ef-461d-8866-f7ce3b9b5e09 ] (state=,code=0)
}}

This example is from an HTTPD web server log and the function works as intended.

{{apache drill> SELECT whois(connection_client_host )
. .semicolon> FROM dfs.test.`hackers-access.httpd` LIMIT 1;
No match for "195.154.46.135".

+--+
|  EXPR$0   
   |
+--+
| {"_last_update_of_whois_database":" 2019-06-25T12:46:22Z <<<","notice":" The 
expiration date displayed in this record is the date the","terms_of_use":" You 
are not authorized to access or query our 
Whois","by_the_following_terms_of_use":" You agree that you may use this Data 
only","to":" (1) allow, enable, or otherwise support the transmission of mass"} 
|
+--+
1 row selected (1.259 seconds)}}

Here is the same thing when querying the domains.csvh file, and you can see 
that it doesn't work.

{{apache drill> SELECT whois(domain) as domain_info from 
dfs.test.`domains.csvh`;
Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize 
incoming schema.  Errors:

Error in expression at index -1.  Error: Missing function implementation: 
[whois(VARCHAR-REQUIRED)].  Full expression: --UNKNOWN EXPRESSION--..

Fragment 0:0

Please, refer to logs for more information.

[Error Id: bd226bdd-9cef-439d-8c3e-bcdd972f52b1 on 192.168.1.33:31010] 
(state=,code=0)}}




> Incorrect Metadata from text file queries
> -
>
> Key: DRILL-7308
> URL: https://issues.apache.org/jira/browse/DRILL-7308
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Metadata
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Priority: Major
> Attachments: Screen Shot 2019-06-24 at 3.16.40 PM.png, domains.csvh
>
>
> I'm noticing some strange behavior with the newest version of Drill.  If you 
> query a CSV file, you get the following metadata:
> {code:sql}
> SELECT * FROM dfs.test.`domains.csvh` LIMIT 1
> {code}
> {code:json}
> {
>   "queryId": "22eee85f-c02c-5878-9735-091d18788061",
>   "columns": [
>     "domain"
>   ],
>   "rows": [}
>    {       "domain": "thedataist.com"     }  ],
>   "metadata": [
>     "VARCHAR(0, 0)",
>     "VARCHAR(0, 0)"
>   ],
>   "queryState": "COMPLETED",
>   "attemptedAutoLimit": 0
> }
> {code}
> There are two issues here:
> 1.  VARCHAR now has precision
> 2.  There are twice as many columns as there should be.
> Additionally, if you query a regular CSV, without the columns extracted, you 
> get the following:
> {code:json}
>  "rows": [
>  { 
>       "columns": "[\"ACCT_NUM\",\"PRODUCT\",\"MONTH\",\"REVENUE\"]"     }
>   ],
>    "metadata": [
>      "VARCHAR(0, 0)",
>      "VARCHAR(0, 0)"
>    ],
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Closed] (DRILL-6952) Merge row set based "compliant" text reader

2019-06-25 Thread Anton Gozhiy (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-6952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anton Gozhiy closed DRILL-6952.
---

Verified with Drill version 1.17.0-SNAPSHOT (commit 
f3f7dbd40f5e899f2aacba35db8f50ffedfa9d3d)
Cases checked:

Tested with different storage plugin parameters (extractHeader, delimiters, etc.)
The same with table functions.
Complex JSON files with nested maps and arrays.
Data with implicit columns (with the v3 reader, all such columns are moved to 
the end of each row)
Aggregate functions with specific columns and with the wildcard.
Large text fields (they were limited to 65536 characters; now fixed)
No significant changes in performance were discovered (compared test runs with 
the different readers).
Some bugs were fixed by the V3 reader:
DRILL-5487, DRILL-5554, DRILL-, DRILL-4814, DRILL-7034, DRILL-7082, 
DRILL-7083
Bugs that were introduced by the V3 reader and then fixed:
DRILL-7181, DRILL-7257, DRILL-7258

> Merge row set based "compliant" text reader
> ---
>
> Key: DRILL-6952
> URL: https://issues.apache.org/jira/browse/DRILL-6952
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.15.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.16.0
>
>
> The result set loader project created a revised version of the compliant text 
> reader that uses the result set loader framework (which includes the 
> schema-based projection framework.)
> This task merges that work into master:
> * Review the history of the compliant text reader for changes made in the 
> last year since the code was written.
> * Apply those changes to the row set-based code, as necessary.
> * Issue a PR for the new version of the compliant text reader
> * Work through any test issues that crop up in the pre-commit tests.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7174) Expose complex to Json control in the Drill C++ Client

2019-06-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7174?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872215#comment-16872215
 ] 

ASF GitHub Bot commented on DRILL-7174:
---

arjuntheprogrammer commented on pull request #1814: DRILL-7174: Expose complex 
to Json control in the Drill C++ Client
URL: https://github.com/apache/drill/pull/1814
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Expose complex to Json control in the Drill C++ Client
> --
>
> Key: DRILL-7174
> URL: https://issues.apache.org/jira/browse/DRILL-7174
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Rob Wu
>Priority: Minor
>
> Arjun Gupta will be supplying a patch for this
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7271) Refactor Metadata interfaces and classes to contain all needed information for the File based Metastore

2019-06-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7271?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872157#comment-16872157
 ] 

ASF GitHub Bot commented on DRILL-7271:
---

vvysotskyi commented on pull request #1810: DRILL-7271: Refactor Metadata 
interfaces and classes to contain all needed information for the File based 
Metastore
URL: https://github.com/apache/drill/pull/1810
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Refactor Metadata interfaces and classes to contain all needed information 
> for the File based Metastore
> ---
>
> Key: DRILL-7271
> URL: https://issues.apache.org/jira/browse/DRILL-7271
> Project: Apache Drill
>  Issue Type: Sub-task
>Reporter: Arina Ielchiieva
>Assignee: Volodymyr Vysotskyi
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
> 1. Merge info from metadataStatistics + statisticsKinds into one holder: 
> Map.
> 2. Rename hasStatistics to hasDescriptiveStatistics
> 3. Remove drill-file-metastore-plugin
> 4. Move  
> org.apache.drill.exec.physical.base.AbstractGroupScanWithMetadata.MetadataLevel
>  to metadata module, rename to MetadataType and add new value: SEGMENT.
> 5. Add JSON ser/de for ColumnStatistics, StatisticsHolder.
> 6. Add new info classes:
> {noformat}
> class TableInfo {
>   String storagePlugin;
>   String workspace;
>   String name;
>   String type;
>   String owner;
> }
> class MetadataInfo {
>   public static final String GENERAL_INFO_KEY = "GENERAL_INFO";
>   public static final String DEFAULT_SEGMENT_KEY = "DEFAULT_SEGMENT";
>   MetadataType type (enum);
>   String key;
>   String identifier;
> }
> {noformat}
> 7. Modify existing metadata classes:
> org.apache.drill.metastore.FileTableMetadata
> {noformat}
> missing fields
> --
> storagePlugin, workspace, tableType -> will be covered by TableInfo class
> metadataType, metadataKey -> will be covered by MetadataInfo class
> interestingColumns
> fields to modify
> 
> private final Map tableStatistics;
> private final Map statisticsKinds;
> private final Set partitionKeys; -> Map
> {noformat}
> org.apache.drill.metastore.PartitionMetadata
> {noformat}
> missing fields
> --
> storagePlugin, workspace -> will be covered by TableInfo class
> metadataType, metadataKey, metadataIdentifier -> will be covered by 
> MetadataInfo class
> partitionValues (List)
> location (String) (for directory level metadata) - directory location
> fields to modify
> 
> private final Map tableStatistics;
> private final Map statisticsKinds;
> private final Set location; -> locations
> {noformat}
> org.apache.drill.metastore.FileMetadata
> {noformat}
> missing fields
> --
> storagePlugin, workspace -> will be covered by TableInfo class
> metadataType, metadataKey, metadataIdentifier -> will be covered by 
> MetadataInfo class
> path - path to file 
> fields to modify
> 
> private final Map tableStatistics;
> private final Map statisticsKinds;
> private final Path location; - should contain directory to which file belongs
> {noformat}
> org.apache.drill.metastore.RowGroupMetadata
> {noformat}
> missing fields
> --
> storagePlugin, workspace -> will be covered by TableInfo class
> metadataType, metadataKey, metadataIdentifier -> will be covered by 
> MetadataInfo class
> path - path to file 
> fields to modify
> 
> private final Map tableStatistics;
> private final Map statisticsKinds;
> private final Path location; - should contain directory to which file belongs
> {noformat}
> 8. Remove org.apache.drill.exec package from metastore module.
> 9. Rename ColumnStatisticsImpl class.
> 10. Separate existing classes in org.apache.drill.metastore package into 
> sub-packages.
> 11. Rename FileTableMetadata -> BaseTableMetadata
> 12. TableMetadataProvider.getNonInterestingColumnsMeta() -> 
> getNonInterestingColumnsMetadata
> 13. Introduce segment-level metadata class:
> {noformat}
> class SegmentMetadata {
>   TableInfo tableInfo;
>   MetadataInfo metadataInfo;
>   SchemaPath column;
>   TupleMetadata schema;
>   String location;
>   Map columnsStatistics;
>   Map statistics;
>   List partitionValues;
>   List locations;
>   long lastModifiedTime;
> }
> {noformat}
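> As noted in item 4, MetadataLevel becomes MetadataType with a new SEGMENT 
> value. A hypothetical sketch of the resulting enum, matching the hierarchy 
> described below; the exact set and order of values is an assumption:
> {noformat}
> public enum MetadataType {
>   NONE,
>   TABLE,
>   SEGMENT,    // new: a directory-like grouping below table level
>   PARTITION,
>   FILE,
>   ROW_GROUP
> }
> {noformat}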
> h1. Segment metadata
> In the fix for this Jira, one of the changes is introducing segment level 
> metadata.
> For now, metadata hierarchy is the following:
> - Table
> - Segment

[jira] [Assigned] (DRILL-7306) Disable "fast schema" batch for new scan framework

2019-06-25 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva reassigned DRILL-7306:
---

Assignee: Arina Ielchiieva  (was: Paul Rogers)

> Disable "fast schema" batch for new scan framework
> --
>
> Key: DRILL-7306
> URL: https://issues.apache.org/jira/browse/DRILL-7306
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Arina Ielchiieva
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
>  The EVF framework is set up to return a "fast schema" empty batch with only 
> schema as its first batch because, when the code was written, it seemed 
> that's how we wanted operators to work. However, DRILL-7305 notes that many 
> operators cannot handle empty batches.
> Since the empty-batch bugs show that Drill does not, in fact, provide a "fast 
> schema" batch, this ticket asks to disable the feature in the new scan 
> framework. The feature is disabled with a config option; it can be re-enabled 
> if ever it is needed.
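> As a rough illustration of that option, a hypothetical sketch of how the 
> scan framework might expose the toggle; the class and flag names are 
> assumptions, not the actual Drill API:
> {noformat}
> // Sketch: the "fast schema" logic stays in place but is gated behind a
> // flag that now defaults to off.
> public class ScanFrameworkOptionsSketch {
>   private boolean enableSchemaBatch = false;  // disabled by default per this ticket
> 
>   // Re-enable the schema-only first batch if it is ever needed again.
>   public void setEnableSchemaBatch(boolean enable) {
>     this.enableSchemaBatch = enable;
>   }
> 
>   public boolean enableSchemaBatch() { return enableSchemaBatch; }
> }
> {noformat}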



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (DRILL-7306) Disable "fast schema" batch for new scan framework

2019-06-25 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva reassigned DRILL-7306:
---

Assignee: Paul Rogers  (was: Arina Ielchiieva)

> Disable "fast schema" batch for new scan framework
> --
>
> Key: DRILL-7306
> URL: https://issues.apache.org/jira/browse/DRILL-7306
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
>  The EVF framework is set up to return a "fast schema" empty batch with only 
> schema as its first batch because, when the code was written, it seemed 
> that's how we wanted operators to work. However, DRILL-7305 notes that many 
> operators cannot handle empty batches.
> Since the empty-batch bugs show that Drill does not, in fact, provide a "fast 
> schema" batch, this ticket asks to disable the feature in the new scan 
> framework. The feature is disabled with a config option; it can be re-enabled 
> if ever it is needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7306) Disable "fast schema" batch for new scan framework

2019-06-25 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-7306:

Labels: ready-to-commit  (was: )

> Disable "fast schema" batch for new scan framework
> --
>
> Key: DRILL-7306
> URL: https://issues.apache.org/jira/browse/DRILL-7306
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
>  The EVF framework is set up to return a "fast schema" empty batch with only 
> schema as its first batch because, when the code was written, it seemed 
> that's how we wanted operators to work. However, DRILL-7305 notes that many 
> operators cannot handle empty batches.
> Since the empty-batch bugs show that Drill does not, in fact, provide a "fast 
> schema" batch, this ticket asks to disable the feature in the new scan 
> framework. The feature is disabled with a config option; it can be re-enabled 
> if ever it is needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (DRILL-7306) Disable "fast schema" batch for new scan framework

2019-06-25 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/DRILL-7306?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872135#comment-16872135
 ] 

ASF GitHub Bot commented on DRILL-7306:
---

arina-ielchiieva commented on issue #1813: DRILL-7306: Disable schema-only 
batch for new scan framework
URL: https://github.com/apache/drill/pull/1813#issuecomment-505333500
 
 
   @paul-rogers thanks, now it's much better.
   +1, please squash the commits.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Disable "fast schema" batch for new scan framework
> --
>
> Key: DRILL-7306
> URL: https://issues.apache.org/jira/browse/DRILL-7306
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.17.0
>
>
>  The EVF framework is set up to return a "fast schema" empty batch with only 
> schema as its first batch because, when the code was written, it seemed 
> that's how we wanted operators to work. However, DRILL-7305 notes that many 
> operators cannot handle empty batches.
> Since the empty-batch bugs show that Drill does not, in fact, provide a "fast 
> schema" batch, this ticket asks to disable the feature in the new scan 
> framework. The feature is disabled with a config option; it can be re-enabled 
> if ever it is needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (DRILL-7306) Disable "fast schema" batch for new scan framework

2019-06-25 Thread Arina Ielchiieva (JIRA)


 [ 
https://issues.apache.org/jira/browse/DRILL-7306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Arina Ielchiieva updated DRILL-7306:

Reviewer: Arina Ielchiieva

> Disable "fast schema" batch for new scan framework
> --
>
> Key: DRILL-7306
> URL: https://issues.apache.org/jira/browse/DRILL-7306
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.17.0
>
>
>  The EVF framework is set up to return a "fast schema" empty batch with only 
> schema as its first batch because, when the code was written, it seemed 
> that's how we wanted operators to work. However, DRILL-7305 notes that many 
> operators cannot handle empty batches.
> Since the empty-batch bugs show that Drill does not, in fact, provide a "fast 
> schema" batch, this ticket asks to disable the feature in the new scan 
> framework. The feature is disabled with a config option; it can be re-enabled 
> if ever it is needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)