[jira] [Created] (DRILL-7555) Standardize Jackson ObjectMapper usage

2020-01-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7555:
--

 Summary: Standardize Jackson ObjectMapper usage
 Key: DRILL-7555
 URL: https://issues.apache.org/jira/browse/DRILL-7555
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers


Drill makes heavy use of Jackson to serialize Java objects to/from JSON. Drill 
has added multiple custom serializers; see the {{PhysicalPlanReader}} 
constructor for a list of these.

However, many modules in Drill declare their own {{ObjectMapper}} instances, 
often without some (or all) of the custom Drill mappers. This is tedious and 
error-prone.

We should:

* Define a standard Drill object mapper.
* Replace all ad-hoc instances of {{ObjectMapper}} with the Drill version (when 
reading/writing Drill-defined JSON).

Further, storage plugins need an {{ObjectMapper}} to convert a scan spec from 
JSON to Java. (It is not clear why we do this serialization, or if it is 
needed, but that is how things work at present.) Plugins don't have access to 
any of the "full feature" object mappers: each plugin would have to cobble 
together the serdes it needs.

So, after standardizing the object mappers, pass in an instance of that 
standard mapper to the storage plugin.
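
A minimal sketch of what such a shared factory might look like; the class and 
method names here are illustrative, not existing Drill APIs:

{code:java}
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.module.SimpleModule;

public final class DrillObjectMappers {
  private DrillObjectMappers() {}

  // One place to build the "standard" mapper. Callers, including storage
  // plugins (via injection), share this instead of calling new ObjectMapper().
  public static ObjectMapper newDrillMapper() {
    SimpleModule drillModule = new SimpleModule("drill-serde");
    // Register here the custom serializers currently listed in the
    // PhysicalPlanReader constructor.
    return new ObjectMapper().registerModule(drillModule);
  }
}
{code}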



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7553) Modernize type management

2020-01-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7553:
--

 Summary: Modernize type management
 Key: DRILL-7553
 URL: https://issues.apache.org/jira/browse/DRILL-7553
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers


This is a roll-up issue for our ongoing discussion around improving and 
modernizing Drill's runtime type system. At present, Drill approaches types 
vastly differently than most other DB and query tools:

 * Drill does little (or no) plan-time type checking and propagation. Instead, 
all type management is done at execution time, in each reader, in each 
operator, and ultimately in the client.
 * Drill allows structured types (Map, Dict, Arrays), but does not have the 
extended SQL statements to fully utilize these types.
* Drill supports varying types: two readers can both read column {{c}}, but can 
do so with different types. We've always hoped to discover some way to 
reconcile the types. But, at present, the functionality is buggy and 
incomplete. It is not clear that a viable solution exists. Drill also provides 
"formal" varying types: Union and List. These types are also not fully 
supported.

These three topics are closely related. "Schema-free" means we must infer types 
at read time, and so Drill cannot do the plan-time type analysis done in other 
engines. Because of schema-on-read (which is what "schema-free" really means), 
two readers can read different types for the same fields, and so we end up with 
varying or inconsistent types, and are forced to figure out some way to manage 
the conflicts.

The gist of the proposal explored in this ticket is to exploit the learning 
from other engines: to embrace types when available, and to impose tractable 
rules when types are discovered at run time.

h4. Proposal Summary

This is very much a discussion draft. Here are some suggestions to get started.

# Set as our goal to manage types at plan time. Runtime type discovery becomes 
a (limited) special case.
# Pull type resolution, propagation and checking into the planner where it can 
be done once per query. Move it out of execution where it must be done multiple 
times: once per operator per minor fragment. Implement the standard DB type 
checking and propagation rules. (These rules are currently implicitly 
implemented deep in the code gen code.)
# Generate operator code in the planner; send it to workers as part of the 
physical plan (to avoid the need to generate the code on each worker.)
# Provide schema-aware extensions for storage and format plugins so that they 
can advertise a schema when known. (Examples: Hive sources get schemas from 
HMS, JDBC sources get schemas from the underlying database, and Avro, Parquet 
and others obtain schemas from the target files.) This mechanism works with, 
but is in addition to, the Drill metastore.
# Separate the concepts of "schema-free" (no plan-time schema) from 
"schema-on-read" (schema is known in the planner, and data is read into that 
schema by readers; e.g. the Hive model.) Drill remains schema-on-read (for 
sources that need it), but does not attempt the impossible with schema-free 
(that is, we no longer read inconsistent data into a relational model and hope 
we can make it work.)
# For convenience, allow "schema-free" (no plan-time schema). The restriction 
is that all readers *must* produce the same schema; it is a fatal (to the 
query) error for an operator to receive batches with different schemas, as in 
the sketch after this list. (The reasons can be discussed separately.)
# Preserve the Map, Dict and Array types, but with tighter semantics: all 
elements must be of the same type.
# Replace the Union and List types with a new type: Java objects. Java objects 
can be anything and can vary from row-to-row. Java types are processed using 
UDFs (or Drill functions.)
# All "extended" types (complex: Map, Dict and Array, or Java objects) must be 
reduced to primitive types in a top-level tuple if the client is ODBC (which 
cannot handle non-relational types.) The same is true if the destination is a 
simple sink such as CSV or JDBC.
# Provide a light-weight way to resolve schema ambiguities that are identified 
by the new, stricter type rules. The light-weight solution is either a file or 
some kind of simple Drill-managed registry akin to the plugin registry. Users 
can run a query, see if there are conflicting types, and, if so, add a 
resolution rule to the registry. The user then reruns the query with a clean 
result.

In the past couple of years we have made progress in some of these areas. This 
ticket suggests we bring those threads together in a coherent strategy.

h4. Arrow/Java/Fixed Block/Something Else Storage

The ideas here are independent of choices we might make for our internal data 
representation format. The above design works equally well with Drill vectors, 
Arrow vectors, or something else.

[jira] [Comment Edited] (DRILL-7551) Improve Error Reporting

2020-01-26 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023322#comment-17023322
 ] 

Paul Rogers edited comment on DRILL-7551 at 1/27/20 1:05 AM:
-

Fixing errors has a number of dimensions:
 # Inconsistent use of exceptions at runtime. We have {{UserException}}, which 
creates some structure, but we also throw random other unchecked exceptions. 
{{UserException}}s do not, however, provide a mapping into SQL errors of the 
type understood by xDBC drivers.
 # Inconsistent error context. A low-level bit of code (a file open call, say) 
only knows that it failed, and that is what it tends to report ("IO Error 10"). 
At the next level up, the surrounding code might know a bit more ("Error 
reading HDFS:/foo/bar1234.parquet"). What we need is a bit of synthesis to say, 
"Too many network timeouts reading block 17 from the bar1234.parquet file of 
the `foo` table stored in the HDFS system `sales`." (See the sketch after this 
list.)
 # Errors are exceptions and we are overly generous in showing every last bit 
of stack trace on the client, the server and so on. Even those of us who live 
in the code find that the few lines we care about (NPE in such-and-such call 
stack) are lost in hundreds of lines that, frankly, I've never personally looked
at.
 # The client API is a bit of a mess in error reporting: returning unchecked 
{{UserException}}s rather than a well-structured {{DrillException}} (say) 
designed for client use. (This is probably because the Drill client was a quick 
short-term solution based on Drill's internal Drillbit-to-Drillbit RPC.)
# Catch errors as early as possible. Example: plan-time type checking 
(eventually), storage plugin validation in the UI (see the comment below).
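
As an illustration of item 2, each layer can attach what it knows through 
{{UserException}}'s context mechanism, so the final message synthesizes low- 
and high-level knowledge. A hedged sketch, reusing the illustrative file and 
table names from above ({{openFile}} and {{path}} are placeholders):

{code:java}
try {
  openFile(path);                       // low level: knows only that I/O failed
} catch (IOException e) {
  throw UserException.dataReadError(e)  // higher level: adds query-meaningful context
      .message("Too many network timeouts reading a Parquet block")
      .addContext("Block", "17")
      .addContext("File", "bar1234.parquet")
      .addContext("Table", "`foo`")
      .addContext("File system", "HDFS `sales`")
      .build(logger);
}
{code}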

In addition to the above execution-focused items, it would be good to look at 
the SQL parser/planner errors as well. Not sure that returning 20-30 lines of 
possible tokens is super-helpful when I make a SQL typo. Probably fine to say, 
"Didn't understand the SQL at line 10, position 3."

To clean up our error act, we must move forward on each of these fronts.

For my part, I've been chipping away at item 1: trying to convert all code to 
throw {{UserException}}. EVF provides an "error context" that helps (but does 
not solve) item 2. I've also made a pass on items 3 & 4, but have been hesitant 
to make any changes to the client API for fear of breaking the two JDBC drivers 
and our (currently unstaffed) C++ client.

Would be great to get some help. For example, how can we provide 
user-meaningful context in our errors (Item 2)? How can we map errors into
standard SQL error and warning codes (part of item 1)? Maybe someone can help 
us figure out how to achieve item 4 with minimal client impact. And, of course, 
once we set the pattern we want to use, everyone can help by improving each of 
the many places where we raise exceptions.
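
On the SQL-code question (part of item 1), here is a hedged sketch of what such 
a mapping might look like. It uses Drill's real {{DrillPBError.ErrorType}} 
enum, but the SQLSTATE choices shown are illustrative, not an agreed mapping:

{code:java}
import org.apache.drill.exec.proto.UserBitShared.DrillPBError.ErrorType;

public class SqlStateMapper {
  // Map Drill error categories onto standard SQLSTATE classes so the
  // xDBC drivers can report structured errors.
  public static String sqlStateFor(ErrorType type) {
    switch (type) {
      case PARSE:      return "42000";  // syntax error or access rule violation
      case PERMISSION: return "28000";  // invalid authorization specification
      case RESOURCE:   return "53000";  // insufficient resources
      case DATA_READ:  return "58030";  // I/O error
      default:         return "50000";  // generic vendor-defined error
    }
  }
}
{code}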

Item 5 can be done independently of other tasks.


was (Author: paul.rogers):
Fixing errors has a number of dimensions:
 # Inconsistent use of exceptions at runtime. We have {{UserException}} which 
creates some structure, but we also throw random other unchecked exceptions. 
{{UserException}}s do not, however, provide a mapping into SQL errors of the 
type understood by xDBC drivers.
 # Inconsistent error context. A low-level bit of code (a file open call, say) 
only knows that it failed, and that is what it tends to report ("IO Error 10"). 
At the next level up, the surrounding code might know a bit more ("Error 
reading HDFS:/foo/bar1234.parquet"). What we need is a bit of synthesis to say, 
"Too many network timeouts reading block 17 from the bar1234.parquet file of 
the `foo` table stored in the HDFS system `sales`."
 # Errors are exceptions and we are overly generous in showing every last bit 
of stack trace on the client, the server and so on. Even those of us who live 
in the code find that the few lines we care about (NPE in such-and-such call 
stack) are lost in hundreds of lines that, frankly, I've never personally looked
at.
 # The client API is a bit of a mess in error reporting: returning unchecked 
{{UserException}}s rather than a well-structured {{DrillException}} (say) 
designed for client use. (This is probably because the Drill client was a quick 
short-term solution based on Drill's internal Drillbit-to-Drillbit RPC.)

In addition to the above execution-focused items, it would be good to look at 
the SQL parser/planner errors as well. Not sure that returning 20-30 lines of 
possible tokens is super-helpful when I make a SQL typo. Probably fine to say, 
"Didn't understand the SQL at line 10, position 3.");

To clean up our error act, we must move forward on each of these fronts.

For my part, I've been chipping away at item 1: trying to convert all code to 
throw {{UserException}}. EVF provides an "error context" that helps (but does 
not solve) item 2. I've also made a pass on items 3 & 4, but have been hesitant 
to make any changes to the client API for fear of breaking the two JDBC drivers 
and our (currently unstaffed) C++ client.

[jira] [Commented] (DRILL-7551) Improve Error Reporting

2020-01-24 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023322#comment-17023322
 ] 

Paul Rogers commented on DRILL-7551:


Fixing errors has a number of dimensions:
 # Inconsistent use of exceptions at runtime. We have {{UserException}} which 
creates some structure, but we also throw random other unchecked exceptions. 
{{UserException}}s do not, however, provide a mapping into SQL errors of the 
type understood by xDBC drivers.
 # Inconsistent error context. A low-level bit of code (a file open call, say) 
only knows that it failed, and that is what it tends to report ("IO Error 10"). 
At the next level up, the surrounding code might know a bit more ("Error 
reading HDFS:/foo/bar1234.parquet"). What we need is a bit of synthesis to say, 
"Too many network timeouts reading block 17 from the bar1234.parquet file of 
the `foo` table stored in the HDFS system `sales`."
 # Errors are exceptions and we are overly generous in showing every last bit 
of stack trace on the client, the server and so on. Even those of us who live 
in the code find that the few lines we care about (NPE in such-and-such call 
stack) are lost in hundreds of lines that, frankly, I've never personally looked
at.
 # The client API is a bit of a mess in error reporting: returning unchecked 
{{UserException}}s rather than a well-structured {{DrillException}} (say) 
designed for client use. (This is probably because the Drill client was a quick 
short-term solution based on Drill's internal Drillbit-to-Drillbit RPC.)

In addition to the above execution-focused items, it would be good to look at 
the SQL parser/planner errors as well. Not sure that returning 20-30 lines of 
possible tokens is super-helpful when I make a SQL typo. Probably fine to say, 
"Didn't understand the SQL at line 10, position 3.");

To clean up our error act, we must move forward on each of these fronts.

For my part, I've been chipping away at item 1: trying to convert all code to 
throw {{UserException}}. EVF provides an "error context" that helps (but does 
not solve) item 2. I've also made a pass on items 3 & 4, but have been hesitant 
to make any changes to the client API for fear of breaking the two JDBC drivers 
and our (currently unstaffed) C++ client.

Would be great to get some help. For example, how can we provide 
user-meaningful context in our errors (Item 2)? How can we map errors into
standard SQL error and warning codes (part of item 1)? Maybe someone can help 
us figure out how to achieve item 4 with minimal client impact. And, of course, 
once we set the pattern we want to use, everyone can help by improving each of 
the many places where we raise exceptions.

> Improve Error Reporting
> ---
>
> Key: DRILL-7551
> URL: https://issues.apache.org/jira/browse/DRILL-7551
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Priority: Major
> Fix For: 1.18.0
>
>
> This Jira is to serve as a master Jira issue to improve the usability of 
> error messages. Instead of dumping stack traces, the overall goal is to give 
> the user something that can actually explain:
>  # What went wrong
>  # How to fix it
> Work that relates to this should be created as subtasks.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7545) Projection ambiguities in complex types

2020-01-21 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7545:
--

 Summary: Projection ambiguities in complex types
 Key: DRILL-7545
 URL: https://issues.apache.org/jira/browse/DRILL-7545
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.17.0
Reporter: Paul Rogers


Summarized from an e-mail chain on the dev mailing list:

We recently introduced the DICT type. We also added the EVF framework. We have 
a bit of code which parses the projection list, then checks if a column from a 
reader is consistent with projection. The idea is to ensure that the columns 
produced by a Scan will be valid when a Project later tries to use them with 
the given project list. And, if the Scan says it can support Project-push-down, 
then the Scan is obligated to do the full check.

First, I'll explain how to solve the projection problem given that 
explanation. Then I'll point out three potential ambiguities. Thanks to Bohdan 
for his explanations.

The problems here are not due to any one person. As explained below, they come 
from trying to add concepts to SQL that SQL is not well-suited to support.

h4. Projection for DICT Types

Queries go through two major steps: planning and execution. At the planning 
stage we use SQL syntax for the project list. For example:

{code:sql}
explain plan for SELECT a, e.`map`.`member`, `dict`['key'], `array`[10]  FROM 
cp.`employee.json` e
{code}

The planner sends an execution plan to operators. The project list appears in 
JSON. For the above:

{code:json}
   "columns" : [ "`a`", "`map`.`member`", "`dict`.`key`", "`array`[10]" ],
{code}

We see that the JSON works as Bohdan described:

* The SQL map "map.member" syntax is converted to "`map`.`member`" in the JSON 
plan.
* The SQL DICT "`dict`['key']" syntax is converted to a form identical to maps: 
"`dict`.`key`".
* The SQL DICT/array "`array`[10]" syntax is converted to "`array`[10]" in JSON.

That is, on the execution side, we can't tell the difference between a MAP and 
a DICT request. We also can't tell the difference between an Array and DICT 
request. Apparently, because of this, the Schema Path parser does not recognize 
DICT syntax.

Given the way projection works, "a.b" and "a['b']" are identical: either works 
for both a MAP and a DICT with VARCHAR keys. That is, shall we just say that 
map and array projection are both compatible with a DICT column?

h4. Projection Checking in Scan

Mentioned above is that a Scan that supports Project-push-down must ensure that 
the output columns match the projection list. Doing that check is quite easy 
when the projection is simple: `a`. The column `a` can match a data column `a` 
of any type.

The task is a bit harder when the projection is an array `a[0]`. Since this 
now means either an array or a DICT with an INT key, this projected column can 
match (a sketch of this check follows these lists):

* Any REPEATED type
* A LIST
* A non-REPEATED DICT with INT, BIGINT, SMALLINT or TINYINT keys (ignoring the 
UINTx types)
* A REPEATED DICT with any type of key
* A UNION (because a union might contain a repeated type)

We can also handle a map projection: `a.b` which matches:

* A (possibly repeated) map
* A (possibly repeated) DICT with VARCHAR keys
* A UNION (because a union might contain a possibly-repeated map)
* A LIST (because the list can contain a union which might contain a 
possibly-repeated map)

Things get very complex indeed when we have multiple qualifiers such as 
`a[0][1].b` which matches:

* A LIST that contains a repeated map
* A REPEATED LIST that contains a (possibly-repeated) map
* A DICT with an INT key that has a value of a repeated map
* A REPEATED DICT that contains an INT key that contains a MAP
* (If we had sufficient metadata) A LIST that contains a REPEATED DICT with a 
VARCHAR key.
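
A hedged sketch of how a scan-side check might encode the single-qualifier 
array rules above; the class and helper names are hypothetical, not actual EVF 
APIs, and the assumption that a DICT's key child is named "key" is mine:

{code:java}
import org.apache.drill.common.types.TypeProtos.DataMode;
import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.record.MaterializedField;

public class ProjectionChecker {

  // Can a projected `a[0]` match this column? Encodes the bullet list above.
  static boolean isArrayCompatible(MaterializedField col) {
    if (col.getDataMode() == DataMode.REPEATED) {
      return true;                 // any REPEATED type, including REPEATED DICT
    }
    switch (col.getType().getMinorType()) {
      case LIST:                   // a LIST
      case UNION:                  // a UNION may contain a repeated type
        return true;
      case DICT:                   // non-repeated DICT needs an integer key
        return hasIntegerKey(col);
      default:
        return false;
    }
  }

  // INT-family keys only (ignoring the UINTx types, as above).
  static boolean hasIntegerKey(MaterializedField col) {
    for (MaterializedField child : col.getChildren()) {
      if ("key".equalsIgnoreCase(child.getName())) {
        switch (child.getType().getMinorType()) {
          case INT:
          case BIGINT:
          case SMALLINT:
          case TINYINT:
            return true;
          default:
            return false;
        }
      }
    }
    return false;
  }
}
{code}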

h4. DICT Projection Ambiguities

The DICT type introduces an ambiguity. Note above that `a.b` can refer to 
either a REPEATED or non-REPEATED MAP. If non-repeated, `a.b` means to get the 
one value for member `b` of map `a`. But, if the map is REPEATED, this means to 
project an array of `b` values obtained from the array of maps.

For a DICT, there is an ambiguity with `a[0][1]` if the DICT is a repeated 
DICT of INT keys and REPEATED BIGINT values: that is, ARRAY<DICT<INT, 
ARRAY<BIGINT>>>. Does `a[0][1]` mean to pull out the 0th element of the 
REPEATED DICT, then look up where the key == 1? Or, does it mean to pull out 
all the DICT array values where the key == 0 and then pull out the 1st value 
of each BIGINT array? That is, because we have an implied (in all members of 
the array) syntax, one can interpret this case as:

{noformat}
repeatedDict[0].valueOf(1) --> ARRAY<BIGINT>
-- All the values in the key=1 array of element 0
{noformat}

or

{noformat}
repeatedDict.valueOf(0)[1] --> ARRAY<BIGINT>
-- All the values in the key=0, element 1 positions across all DICT elements
{noformat}

It would seem to make sense to prefer the first interpretation. Unfortunately, 
MAPs already use the 

[jira] [Commented] (DRILL-7542) Fix Drill-on-Yarn logger

2020-01-20 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019695#comment-17019695
 ] 

Paul Rogers commented on DRILL-7542:


[~arina], I can't recall this detail. I will speculate that I had to use the 
same logging as the YARN framework. DoY has two executables: the client and the 
App Master. Both make heavy use of the YARN and HDFS APIs. I may have found 
that things worked best if I used the same logger for my code as YARN and HDFS 
used.

That said, feel free to experiment; perhaps I missed something that would allow 
us to get YARN and HDFS to log to our logger; I'm a pure novice at the logging 
mechanisms.

> Fix Drill-on-Yarn logger
> 
>
> Key: DRILL-7542
> URL: https://issues.apache.org/jira/browse/DRILL-7542
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0, 1.17.0
>Reporter: Arina Ielchiieva
>Priority: Major
>
> The Drill project uses the SLF4J logging facade backed by Logback:
> {noformat}
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
> private static final Logger logger = 
> LoggerFactory.getLogger(ResultsListener.class);
> {noformat}
> The Drill-on-Yarn project uses commons logging:
> {noformat}
> import org.apache.commons.logging.Log;
> import org.apache.commons.logging.LogFactory;
> private static final Log LOG = LogFactory.getLog(AbstractScheduler.class);
> {noformat}
> It would be nice if all project components used the same approach for logging.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7522) JSON reader (v1) omits null columns in SELECT *

2020-01-12 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7522:
--

 Summary: JSON reader (v1) omits null columns in SELECT *
 Key: DRILL-7522
 URL: https://issues.apache.org/jira/browse/DRILL-7522
 Project: Apache Drill
  Issue Type: Bug
Affects Versions: 1.17.0
Reporter: Paul Rogers


Run the unit test {{TestStarQueries.testSelStarOrderBy}}, which runs the 
following query:

{code:sql}
select * from cp.`employee.json` order by last_name
{code}

The query reads a Foodmart file {{customer.json}} that has records like this:

{code:json}
{"employee_id":53,...","end_date":null,"salary":...}
{code}

The field {{end_date}} turns out to be null for all records in 
{{customer.json}}.

Then, look at the verification query. It carefully includes all fields *except* 
{{end_date}}. That is, the test was written to expect that the JSON reader will 
omit a column that has all NULL values.

While it might seem OK to omit all-NULL columns (they don't have any data), the 
problem is that Drill is a distributed system. Suppose we query a directory of 
50 such files, some of which have all-NULLs in one field, some of which have 
all-NULLs in another. Although the files have the same schema, {{SELECT *}} 
will return different schemas (depending on which file has which non-NULL 
columns.)

A downstream operator will have to merge these schemas. And, since Drill fills 
in a Nullable INT field for missing columns, we might end up with a schema 
change exception because the actual field type is VARCHAR when it appears.
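
A hypothetical two-file illustration of the conflict (the file names and 
values are made up):

{noformat}
file1.json: {"id": 1, "end_date": null}          -- V1 scan emits (id)
file2.json: {"id": 2, "end_date": "2020-05-01"}  -- V1 scan emits (id, end_date)

Merge: the fragment reading file1 fills in end_date as Nullable INT, while
the fragment reading file2 reads it as VARCHAR --> schema change exception.
{noformat}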

One can argue that {{SELECT *}} means "return all columns", not "return all 
columns except those that happen to be null in the first batch." Yes, we have 
the problem of not knowing the actual field type. Eventually, provided schemas 
will resolve such issues.

Note that in the "V2" JSON reader, {{end_date}} is included in the query.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7510) Incorrect String/number comparison with union types

2020-01-03 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7510:
--

 Summary: Incorrect String/number comparison with union types
 Key: DRILL-7510
 URL: https://issues.apache.org/jira/browse/DRILL-7510
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


Run the following test: {{TestTopNSchemaChanges.testUnionTypes()}}. It will 
pass. Look at the expected output:

{code:java}
builder.baselineValues(0l, 0l);
builder.baselineValues(1.0d, 1.0d);
builder.baselineValues(3l, 3l);
builder.baselineValues(4.0d, 4.0d);
builder.baselineValues(6l, 6l);
builder.baselineValues(7.0d, 7.0d);
builder.baselineValues(9l, 9l);
builder.baselineValues("2", "2");
{code}

The string values sort after the numbers.

After the fix for DRILL-7502, we get the following output:

{code:java}
builder.baselineValues(0l, 0l);
builder.baselineValues(1.0d, 1.0d);
builder.baselineValues("2", "2");
builder.baselineValues(3l, 3l);
builder.baselineValues(4.0d, 4.0d);
builder.baselineValues("5", "5");
builder.baselineValues(6l, 6l);
builder.baselineValues(7.0d, 7.0d);
{code}

This accidental fix suggests that the original design was to convert values to 
the same type, then compare them. Converting numbers to strings, say, would 
cause them to be lexically ordered, as in the second output.
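
A hedged sketch of the comparison rule this implies: compare within a common 
type when possible, otherwise fall back to a fixed type-precedence rank. The 
method is illustrative only, not Drill's actual generated comparator:

{code:java}
// Illustrative ordering for values drawn from a UNION column. Assumption:
// numbers compare numerically with each other, strings compare lexically,
// and numbers rank before strings (matching the first output above).
static int compareUnionValues(Object left, Object right) {
  if (left instanceof Number && right instanceof Number) {
    return Double.compare(((Number) left).doubleValue(),
                          ((Number) right).doubleValue());
  }
  if (left instanceof Number) {
    return -1;                         // numbers sort before strings
  }
  if (right instanceof Number) {
    return 1;
  }
  return left.toString().compareTo(right.toString());
}
{code}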

The {{UNION}} type is poorly supported, so it is likely that this bug does not 
affect actual users.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7507) Convert fragment interrupts to exceptions

2020-01-01 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7507:
--

 Summary: Convert fragment interrupts to exceptions
 Key: DRILL-7507
 URL: https://issues.apache.org/jira/browse/DRILL-7507
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Operators periodically check whether they should continue by calling the 
{{shouldContinue()}} method. If the method returns false, operators return a 
{{STOP}} status in some form.

This change modifies that handling to throw an exception instead, cancelling a 
fragment the same way that we handle errors.
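
A minimal sketch of the new pattern, assuming a dedicated unchecked exception 
(the {{QueryCancelledException}} name is illustrative):

{code:java}
// Before: callers checked the flag and plumbed a STOP status upward.
// After: a single helper throws, and the fragment unwinds as for any error.
public void checkContinue() {
  if (!executorState.shouldContinue()) {
    throw new QueryCancelledException();   // illustrative exception type
  }
}
{code}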



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7506) Simplify code gen error handling

2020-01-01 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7506:
--

 Summary: Simplify code gen error handling
 Key: DRILL-7506
 URL: https://issues.apache.org/jira/browse/DRILL-7506
 Project: Apache Drill
  Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
 Fix For: 1.18.0


Code generation can produce a variety of errors. Most operators bubble these 
exceptions up several layers in the code before catching them. This patch moves 
error handling closer to the code gen itself to allow a) simpler code, and b) 
clearer error messages.
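
A hedged sketch of the idea: wrap the failure at the generation site in a 
{{UserException}} rather than surfacing bare checked exceptions several layers 
up. The surrounding method is illustrative:

{code:java}
// Catch codegen failures where they occur, so the message can name the
// operation that failed instead of leaking ClassTransformationException
// through several layers of operator code.
private Projector createProjector(CodeGenerator<Projector> cg) {
  try {
    return context.getImplementationClass(cg);
  } catch (ClassTransformationException | IOException e) {
    throw UserException.internalError(e)
        .message("Code generation failed for projection expressions")
        .build(logger);
  }
}
{code}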



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7359) Add support for DICT type in RowSet Framework

2019-12-31 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7359:
---
Labels: ready-to-commit  (was: )

> Add support for DICT type in RowSet Framework
> -
>
> Key: DRILL-7359
> URL: https://issues.apache.org/jira/browse/DRILL-7359
> Project: Apache Drill
>  Issue Type: New Feature
>Reporter: Bohdan Kazydub
>Assignee: Bohdan Kazydub
>Priority: Major
>  Labels: ready-to-commit
> Fix For: 1.18.0
>
>
> Add support for new DICT data type (see DRILL-7096) in RowSet Framework



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-6360) Document the typeof() function

2019-12-31 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006252#comment-17006252
 ] 

Paul Rogers commented on DRILL-6360:


The PR for DRILL-7502 provides additional documentation about the updated 
behaviour of this function and two of the other type functions.

> Document the typeof() function
> --
>
> Key: DRILL-6360
> URL: https://issues.apache.org/jira/browse/DRILL-6360
> Project: Apache Drill
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Bridget Bevens
>Priority: Minor
>  Labels: doc-impacting
>
> Drill has a {{typeof()}} function that returns the data type (but not mode) 
> of a column. It was discussed on the dev list recently. However, a search of 
> the Drill web site, and a scan by hand, failed to turn up documentation about 
> the function.
> As a general suggestion, would be great to have an alphabetical list of all 
> functions so we don't have to hunt all over the site to find which functions 
> are available.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-6362) typeof() lies about types

2019-12-31 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006251#comment-17006251
 ] 

Paul Rogers commented on DRILL-6362:


DRILL-7502 fixes this issue, along with several related issues.

> typeof() lies about types
> -
>
> Key: DRILL-6362
> URL: https://issues.apache.org/jira/browse/DRILL-6362
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> Drill provides a {{typeof()}} function that returns the type of a column. 
> But, it seems to make up types. Consider the following input file:
> {noformat}
> {a: true}
> {a: false}
> {a: null}
> {noformat}
> Consider the following two queries:
> {noformat}
> SELECT a FROM `json/boolean.json`;
> ++
> |   a|
> ++
> | true   |
> | false  |
> | null   |
> ++
> > SELECT typeof(a) FROM `json/boolean.json`;
> +-+
> | EXPR$0  |
> +-+
> | BIT |
> | BIT |
> | NULL|
> +-+
> {noformat}
> Notice that the values are reported as BIT. But, I believe the actual type is 
> UInt1 (the bit vector is, I believe, deprecated.) Then, the function reports 
> NULL instead of the actual type for the null value.
> Since Drill has an {{isnull()}} function, there is no reason for {{typeof()}} 
> to muddle the type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-5189) There's no documentation for the typeof() function

2019-12-31 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006250#comment-17006250
 ] 

Paul Rogers commented on DRILL-5189:


The PR for DRILL-7502 provides additional documentation about the updated 
behaviour of this function and two of the other type functions.

> There's no documentation for the typeof() function
> --
>
> Key: DRILL-5189
> URL: https://issues.apache.org/jira/browse/DRILL-5189
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Chris Westin
>Assignee: Bridget Bevens
>Priority: Major
>
> I looked through the documentation at https://drill.apache.org/docs/ under 
> SQL Reference > SQL Functions > ... and could not find any reference to 
> typeof(). Google searches only turned up a reference to DRILL-4204.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (DRILL-7502) Incorrect/invalid codegen for typeof() with UNION

2019-12-31 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-7502:
--

Assignee: Paul Rogers

> Incorrect/invalid codegen for typeof() with UNION
> -
>
> Key: DRILL-7502
> URL: https://issues.apache.org/jira/browse/DRILL-7502
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> The {{typeof()}} function is defined as follows:
> {code:java}
>   @FunctionTemplate(names = {"typeOf"},
>   scope = FunctionTemplate.FunctionScope.SIMPLE,
>   nulls = NullHandling.INTERNAL)
>   public static class GetType implements DrillSimpleFunc {
> @Param
> FieldReader input;
> @Output
> VarCharHolder out;
> @Inject
> DrillBuf buf;
> @Override
> public void setup() {}
> @Override
> public void eval() {
>   String typeName = input.getTypeString();
>   byte[] type = typeName.getBytes();
>   buf = buf.reallocIfNeeded(type.length);
>   buf.setBytes(0, type);
>   out.buffer = buf;
>   out.start = 0;
>   out.end = type.length;
> }
>   }
> {code}
> Note that the {{input}} field is defined as {{FieldReader}} which has a 
> method called {{getTypeString()}}. As a result, the code works fine in all 
> existing tests in {{TestTypeFns}}.
> I tried to add a function to use {{typeof()}} on a column of type {{UNION}}. 
> When I did, the query failed with a compile error in generated code:
> {noformat}
> SYSTEM ERROR: CompileException: Line 42, Column 43: 
>   A method named "getTypeString" is not declared in any enclosing class nor 
> any supertype, nor through a static import
> {noformat}
> The stack trace shows the generated code; note that the type of {{input}} 
> changes from a reader to a holder, causing code to be invalid:
> {code:java}
> public class ProjectorGen0 {
> DrillBuf work0;
> UnionVector vv1;
> VarCharVector vv6;
> DrillBuf work9;
> VarCharVector vv11;
> DrillBuf work14;
> VarCharVector vv16;
> public void doEval(int inIndex, int outIndex)
> throws SchemaChangeException
> {
> {
> UnionHolder out4 = new UnionHolder();
> {
> out4 .isSet = vv1 .getAccessor().isSet((inIndex));
> if (out4 .isSet == 1) {
> vv1 .getAccessor().get((inIndex), out4);
> }
> }
> // start of eval portion of typeOf function. //
> VarCharHolder out5 = new VarCharHolder();
> {
> final VarCharHolder out = new VarCharHolder();
> UnionHolder input = out4;
> DrillBuf buf = work0;
> UnionFunctions$GetType_eval:
> {
> String typeName = input.getTypeString();
> byte[] type = typeName.getBytes();
> buf = buf.reallocIfNeeded(type.length);
> buf.setBytes(0, type);
> out.buffer = buf;
> out.start = 0;
> out.end = type.length;
> }
> {code}
> By contrast, here is the generated code for one of the existing 
> {{TestTypeFns}} tests where things work:
> {code:java}
> public class ProjectorGen0
> extends ProjectorTemplate
> {
> DrillBuf work0;
> NullableBigIntVector vv1;
> VarCharVector vv7;
> public ProjectorGen0() {
> try {
> __DRILL_INIT__();
> } catch (SchemaChangeException e) {
> throw new UnsupportedOperationException(e);
> }
> }
> public void doEval(int inIndex, int outIndex)
> throws SchemaChangeException
> {
> {
>..
> // start of eval portion of typeOf function. //
> VarCharHolder out6 = new VarCharHolder();
> {
> final VarCharHolder out = new VarCharHolder();
> FieldReader input = new NullableIntHolderReaderImpl(out5);
> DrillBuf buf = work0;
> UnionFunctions$GetType_eval:
> {
> String typeName = input.getTypeString();
> byte[] type = typeName.getBytes();
> buf = buf.reallocIfNeeded(type.length);
> buf.setBytes(0, type);
> out.buffer = buf;
> out.start = 0;
> out.end = type.length;
> }
> work0 = buf;
> out6 .start = out.start;
> out6 .end = out.end;
> out6 .buffer = out.buffer;
> }
> // end of eval portion of typeOf function. //
> {code}
> Notice that the {{input}} variable is of type {{FieldReader}} as expected.
> Queries that work:
> {code:java}
> String sql = "SELECT typeof(CAST(a AS " + castType + ")) FROM (VALUES 
> (1)) AS T(a)";
> sql = "SELECT typeof(CAST(a AS " + castType + ")) FROM 
> cp.`functions/null.json`";
> 

[jira] [Updated] (DRILL-7502) Incorrect/invalid codegen for typeof() with UNION

2019-12-31 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7502:
---
Fix Version/s: 1.18.0
Affects Version/s: 1.17.0

> Incorrect/invalid codegen for typeof() with UNION
> -
>
> Key: DRILL-7502
> URL: https://issues.apache.org/jira/browse/DRILL-7502
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.18.0
>
>
> The {{typeof()}} function is defined as follows:
> {code:java}
>   @FunctionTemplate(names = {"typeOf"},
>   scope = FunctionTemplate.FunctionScope.SIMPLE,
>   nulls = NullHandling.INTERNAL)
>   public static class GetType implements DrillSimpleFunc {
> @Param
> FieldReader input;
> @Output
> VarCharHolder out;
> @Inject
> DrillBuf buf;
> @Override
> public void setup() {}
> @Override
> public void eval() {
>   String typeName = input.getTypeString();
>   byte[] type = typeName.getBytes();
>   buf = buf.reallocIfNeeded(type.length);
>   buf.setBytes(0, type);
>   out.buffer = buf;
>   out.start = 0;
>   out.end = type.length;
> }
>   }
> {code}
> Note that the {{input}} field is defined as {{FieldReader}} which has a 
> method called {{getTypeString()}}. As a result, the code works fine in all 
> existing tests in {{TestTypeFns}}.
> I tried to add a function to use {{typeof()}} on a column of type {{UNION}}. 
> When I did, the query failed with a compile error in generated code:
> {noformat}
> SYSTEM ERROR: CompileException: Line 42, Column 43: 
>   A method named "getTypeString" is not declared in any enclosing class nor 
> any supertype, nor through a static import
> {noformat}
> The stack trace shows the generated code; note that the type of {{input}} 
> changes from a reader to a holder, causing code to be invalid:
> {code:java}
> public class ProjectorGen0 {
> DrillBuf work0;
> UnionVector vv1;
> VarCharVector vv6;
> DrillBuf work9;
> VarCharVector vv11;
> DrillBuf work14;
> VarCharVector vv16;
> public void doEval(int inIndex, int outIndex)
> throws SchemaChangeException
> {
> {
> UnionHolder out4 = new UnionHolder();
> {
> out4 .isSet = vv1 .getAccessor().isSet((inIndex));
> if (out4 .isSet == 1) {
> vv1 .getAccessor().get((inIndex), out4);
> }
> }
> // start of eval portion of typeOf function. //
> VarCharHolder out5 = new VarCharHolder();
> {
> final VarCharHolder out = new VarCharHolder();
> UnionHolder input = out4;
> DrillBuf buf = work0;
> UnionFunctions$GetType_eval:
> {
> String typeName = input.getTypeString();
> byte[] type = typeName.getBytes();
> buf = buf.reallocIfNeeded(type.length);
> buf.setBytes(0, type);
> out.buffer = buf;
> out.start = 0;
> out.end = type.length;
> }
> {code}
> By contrast, here is the generated code for one of the existing 
> {{TestTypeFns}} tests where things work:
> {code:java}
> public class ProjectorGen0
> extends ProjectorTemplate
> {
> DrillBuf work0;
> NullableBigIntVector vv1;
> VarCharVector vv7;
> public ProjectorGen0() {
> try {
> __DRILL_INIT__();
> } catch (SchemaChangeException e) {
> throw new UnsupportedOperationException(e);
> }
> }
> public void doEval(int inIndex, int outIndex)
> throws SchemaChangeException
> {
> {
>..
> // start of eval portion of typeOf function. //
> VarCharHolder out6 = new VarCharHolder();
> {
> final VarCharHolder out = new VarCharHolder();
> FieldReader input = new NullableIntHolderReaderImpl(out5);
> DrillBuf buf = work0;
> UnionFunctions$GetType_eval:
> {
> String typeName = input.getTypeString();
> byte[] type = typeName.getBytes();
> buf = buf.reallocIfNeeded(type.length);
> buf.setBytes(0, type);
> out.buffer = buf;
> out.start = 0;
> out.end = type.length;
> }
> work0 = buf;
> out6 .start = out.start;
> out6 .end = out.end;
> out6 .buffer = out.buffer;
> }
> // end of eval portion of typeOf function. //
> {code}
> Notice that the {{input}} variable is of type {{FieldReader}} as expected.
> Queries that work:
> {code:java}
> String sql = "SELECT typeof(CAST(a AS " + castType + ")) FROM (VALUES 
> (1)) AS T(a)";
> sql = 

[jira] [Created] (DRILL-7503) Refactor project operator

2019-12-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7503:
--

 Summary: Refactor project operator
 Key: DRILL-7503
 URL: https://issues.apache.org/jira/browse/DRILL-7503
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


Work on another ticket revealed that the Project operator ("record batch") has 
grown quite complex. The setup phase lives in the operator as one huge 
function. The function combines the "logical" tasks of working out the 
projection expressions and types, the code gen for those expressions, and the 
physical setup of vectors.

The refactoring breaks up the logic so that it is easier to focus on the 
specific bits of interest.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7502) Incorrect/invalid codegen for typeof() with UNION

2019-12-30 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7502:
---
Description: 
The {{typeof()}} function is defined as follows:

{code:java}
  @FunctionTemplate(names = {"typeOf"},
  scope = FunctionTemplate.FunctionScope.SIMPLE,
  nulls = NullHandling.INTERNAL)
  public static class GetType implements DrillSimpleFunc {

@Param
FieldReader input;
@Output
VarCharHolder out;
@Inject
DrillBuf buf;

@Override
public void setup() {}

@Override
public void eval() {
  String typeName = input.getTypeString();
  byte[] type = typeName.getBytes();
  buf = buf.reallocIfNeeded(type.length);
  buf.setBytes(0, type);
  out.buffer = buf;
  out.start = 0;
  out.end = type.length;
}
  }
{code}

Note that the {{input}} field is defined as {{FieldReader}} which has a method 
called {{getTypeString()}}. As a result, the code works fine in all existing 
tests in {{TestTypeFns}}.

I tried to add a function to use {{typeof()}} on a column of type {{UNION}}. 
When I did, the query failed with a compile error in generated code:

{noformat}
SYSTEM ERROR: CompileException: Line 42, Column 43: 
  A method named "getTypeString" is not declared in any enclosing class nor any 
supertype, nor through a static import
{noformat}

The stack trace shows the generated code; note that the type of {{input}} 
changes from a reader to a holder, causing code to be invalid:

{code:java}
public class ProjectorGen0 {

DrillBuf work0;
UnionVector vv1;
VarCharVector vv6;
DrillBuf work9;
VarCharVector vv11;
DrillBuf work14;
VarCharVector vv16;

public void doEval(int inIndex, int outIndex)
throws SchemaChangeException
{
{
UnionHolder out4 = new UnionHolder();
{
out4 .isSet = vv1 .getAccessor().isSet((inIndex));
if (out4 .isSet == 1) {
vv1 .getAccessor().get((inIndex), out4);
}
}
// start of eval portion of typeOf function. //
VarCharHolder out5 = new VarCharHolder();
{
final VarCharHolder out = new VarCharHolder();
UnionHolder input = out4;
DrillBuf buf = work0;
UnionFunctions$GetType_eval:
{
String typeName = input.getTypeString();
byte[] type = typeName.getBytes();

buf = buf.reallocIfNeeded(type.length);
buf.setBytes(0, type);
out.buffer = buf;
out.start = 0;
out.end = type.length;
}
{code}

By contrast, here is the generated code for one of the existing {{TestTypeFns}} 
tests where things work:

{code:java}
public class ProjectorGen0
extends ProjectorTemplate
{

DrillBuf work0;
NullableBigIntVector vv1;
VarCharVector vv7;

public ProjectorGen0() {
try {
__DRILL_INIT__();
} catch (SchemaChangeException e) {
throw new UnsupportedOperationException(e);
}
}

public void doEval(int inIndex, int outIndex)
throws SchemaChangeException
{
{
   ..
// start of eval portion of typeOf function. //
VarCharHolder out6 = new VarCharHolder();
{
final VarCharHolder out = new VarCharHolder();
FieldReader input = new NullableIntHolderReaderImpl(out5);
DrillBuf buf = work0;
UnionFunctions$GetType_eval:
{
String typeName = input.getTypeString();
byte[] type = typeName.getBytes();

buf = buf.reallocIfNeeded(type.length);
buf.setBytes(0, type);
out.buffer = buf;
out.start = 0;
out.end = type.length;
}
work0 = buf;
out6 .start = out.start;
out6 .end = out.end;
out6 .buffer = out.buffer;
}
// end of eval portion of typeOf function. //
{code}

Notice that the {{input}} variable is of type {{FieldReader}} as expected.

Queries that work:

{code:java}
String sql = "SELECT typeof(CAST(a AS " + castType + ")) FROM (VALUES (1)) 
AS T(a)";
sql = "SELECT typeof(CAST(a AS " + castType + ")) FROM 
cp.`functions/null.json`";
String sql = "SELECT typeof(" + expr + ") FROM (VALUES (" + value + ")) AS 
T(a)";
{code}

Query that fails:

{code:java}
String sql ="SELECT typeof(a) AS t, modeof(a) as m, drilltypeof(a) AS dt\n" 
+
"FROM cp.`jsoninput/union/c.json`";
{code}

The queries that work all include either a CAST or constant values. The query 
that fails reads data from a file. Also, the queries that work use scalar 
types; the query that fails uses the UNION type.



[jira] [Updated] (DRILL-7502) Incorrect/invalid codegen for typeof() with UNION

2019-12-30 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7502:
---
Description: 
The {{typeof()}} function is defined as follows:

{code:java}
  @FunctionTemplate(names = {"typeOf"},
  scope = FunctionTemplate.FunctionScope.SIMPLE,
  nulls = NullHandling.INTERNAL)
  public static class GetType implements DrillSimpleFunc {

@Param
FieldReader input;
@Output
VarCharHolder out;
@Inject
DrillBuf buf;

@Override
public void setup() {}

@Override
public void eval() {
  String typeName = input.getTypeString();
  byte[] type = typeName.getBytes();
  buf = buf.reallocIfNeeded(type.length);
  buf.setBytes(0, type);
  out.buffer = buf;
  out.start = 0;
  out.end = type.length;
}
  }
{code}

Note that the {{input}} field is defined as {{FieldReader}} which has a method 
called {{getTypeString()}}. As a result, the code works fine in all existing 
tests in {{TestTypeFns}}.

I tried to add a function to use {{typeof()}} on a column of type {{UNION}}. 
When I did, the query failed with a compile error in generated code:

{noformat}
SYSTEM ERROR: CompileException: Line 42, Column 43: 
  A method named "getTypeString" is not declared in any enclosing class nor any 
supertype, nor through a static import
{noformat}

The stack trace shows the generated code; note that the type of {{input}} 
changes from a reader to a holder, causing code to be invalid:

{code:java}
public class ProjectorGen0 {

DrillBuf work0;
UnionVector vv1;
VarCharVector vv6;
DrillBuf work9;
VarCharVector vv11;
DrillBuf work14;
VarCharVector vv16;

public void doEval(int inIndex, int outIndex)
throws SchemaChangeException
{
{
UnionHolder out4 = new UnionHolder();
{
out4 .isSet = vv1 .getAccessor().isSet((inIndex));
if (out4 .isSet == 1) {
vv1 .getAccessor().get((inIndex), out4);
}
}
// start of eval portion of typeOf function. //
VarCharHolder out5 = new VarCharHolder();
{
final VarCharHolder out = new VarCharHolder();
UnionHolder input = out4;
DrillBuf buf = work0;
UnionFunctions$GetType_eval:
{
String typeName = input.getTypeString();
byte[] type = typeName.getBytes();

buf = buf.reallocIfNeeded(type.length);
buf.setBytes(0, type);
out.buffer = buf;
out.start = 0;
out.end = type.length;
}
{code}

By contrast, here is the generated code for one of the existing {{TestTypeFns}} 
tests where things work:

{code:java}
public class ProjectorGen0
extends ProjectorTemplate
{

DrillBuf work0;
NullableBigIntVector vv1;
VarCharVector vv7;

public ProjectorGen0() {
try {
__DRILL_INIT__();
} catch (SchemaChangeException e) {
throw new UnsupportedOperationException(e);
}
}

public void doEval(int inIndex, int outIndex)
throws SchemaChangeException
{
{
   ..
// start of eval portion of typeOf function. //
VarCharHolder out6 = new VarCharHolder();
{
final VarCharHolder out = new VarCharHolder();
FieldReader input = new NullableIntHolderReaderImpl(out5);
DrillBuf buf = work0;
UnionFunctions$GetType_eval:
{
String typeName = input.getTypeString();
byte[] type = typeName.getBytes();

buf = buf.reallocIfNeeded(type.length);
buf.setBytes(0, type);
out.buffer = buf;
out.start = 0;
out.end = type.length;
}
work0 = buf;
out6 .start = out.start;
out6 .end = out.end;
out6 .buffer = out.buffer;
}
// end of eval portion of typeOf function. //
{code}

Notice that the {{input}} variable is of type {{FieldReader}} as expected.



[jira] [Created] (DRILL-7502) Incorrect/invalid codegen for typeof() with UNION

2019-12-30 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7502:
--

 Summary: Incorrect/invalid codegen for typeof() with UNION
 Key: DRILL-7502
 URL: https://issues.apache.org/jira/browse/DRILL-7502
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


The {{typeof()}} function is defined as follows:

{code:java}
  @FunctionTemplate(names = {"typeOf"},
  scope = FunctionTemplate.FunctionScope.SIMPLE,
  nulls = NullHandling.INTERNAL)
  public static class GetType implements DrillSimpleFunc {

@Param
FieldReader input;
@Output
VarCharHolder out;
@Inject
DrillBuf buf;

@Override
public void setup() {}

@Override
public void eval() {
  String typeName = input.getTypeString();
  byte[] type = typeName.getBytes();
  buf = buf.reallocIfNeeded(type.length);
  buf.setBytes(0, type);
  out.buffer = buf;
  out.start = 0;
  out.end = type.length;
}
  }
{code}

Note that the {{input}} field is defined as {{FieldReader}} which has a method 
called {{getTypeString()}}. As a result, the code works fine in all existing 
tests in {{TestTypeFns}}.

I tried to add a function to use {{typeof()}} on a column of type {{UNION}}. 
When I did, the query failed with a compile error in generated code:

{noformat}
SYSTEM ERROR: CompileException: Line 42, Column 43: 
  A method named "getTypeString" is not declared in any enclosing class nor any 
supertype, nor through a static import
{noformat}

The stack trace shows the generated code; note that the type of {{input}} 
changes from a reader to a holder, causing code to be invalid:

{code:java}
public class ProjectorGen0 {

DrillBuf work0;
UnionVector vv1;
VarCharVector vv6;
DrillBuf work9;
VarCharVector vv11;
DrillBuf work14;
VarCharVector vv16;

public void doEval(int inIndex, int outIndex)
throws SchemaChangeException
{
{
UnionHolder out4 = new UnionHolder();
{
out4 .isSet = vv1 .getAccessor().isSet((inIndex));
if (out4 .isSet == 1) {
vv1 .getAccessor().get((inIndex), out4);
}
}
// start of eval portion of typeOf function. //
VarCharHolder out5 = new VarCharHolder();
{
final VarCharHolder out = new VarCharHolder();
UnionHolder input = out4;
DrillBuf buf = work0;
UnionFunctions$GetType_eval:
{
String typeName = input.getTypeString();
byte[] type = typeName.getBytes();

buf = buf.reallocIfNeeded(type.length);
buf.setBytes(0, type);
out.buffer = buf;
out.start = 0;
out.end = type.length;
}
{code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type

2019-12-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-7499:
--

Assignee: Paul Rogers

> sqltypeof() function with an array returns "ARRAY", not type
> 
>
> Key: DRILL-7499
> URL: https://issues.apache.org/jira/browse/DRILL-7499
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>
> The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
> limitations of the original {{typeof()}} function. The function is mentioned 
> in _Learning Apache Drill_, Chapter 8, page 152:
> {noformat}
> SELECT sqlTypeOf(columns) AS cols_type,
>modeOf(columns) AS cols_mode
> FROM `csv/cust.csv` LIMIT 1;
> +++
> | cols_type  | cols_mode  |
> +++
> | CHARACTER VARYING  | ARRAY  |
> +++
> {noformat}
> When the same query is run against the just-released Drill 1.17, we get the 
> *wrong* results:
> {noformat}
> +---+---+
> | cols_type | cols_mode |
> +---+---+
> | ARRAY | ARRAY |
> +---+---+
> {noformat}
> The definition of {{sqlTypeOf()}} is that it should return the type portion 
> of the column's (type, mode) major type. Clearly, it is no longer doing so for 
> arrays. As a result, there is no function to obtain the data type for arrays.
> The problem also shows up in the query from page 158:
> {code:sql}
> SELECT a, b,
>sqlTypeOf(b) AS b_type, modeof(b) AS b_mode
> FROM `gen/70kmissing.json`
> WHERE mod(a, 7) = 1;
> {code}
> Expected (table from the book with Drill 1.14 results):
> {noformat}
> ++---+--+---+
> |   a|   b   |  b_type  |  b_mode   |
> ++---+--+---+
> | 1  | null  | INTEGER  | NULLABLE  |
> ++---+--+---+
> {noformat}
> Actual Drill 1.17 results:
> {noformat}
> +---+---+---+--+
> |   a   | b |  b_type   |  b_mode  |
> +---+---+---+--+
> | 1 | null  | NULL  | NULLABLE |
> +---+---+---+--+
> {noformat}
> (Second line of table is omitted because something else changed, not relevant 
> to this ticket.)
> The above might not actually be a bug if someone has changed the 
> type of missing columns from the old {{INT}} to a newer (untyped) {{NULL}}. 
> But an indirect test suggests that the column is still {{INT}} and the 
> function is wrong:
> {code:sql}
> SELECT a, b
> FROM `gen/70kdouble.json`
> WHERE b IS NOT NULL ORDER BY a;
> {code}
> Data:
> {noformat}
> {a: 1}
> ...
> {a: 6}
> {a: 70001, b: 10.5}
> {noformat}
> Error:
> {noformat}
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
> Sort. Please enable Union type.
> Previous schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` 
> (INT:OPTIONAL)]], selectionVector=NONE]
> Incoming schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` 
> (FLOAT8:OPTIONAL)]], selectionVector=NONE]
> {noformat}
> Oddly, however, the query on page 160 works as expected:
> {code:sql}
> SELECT sqlTypeOf(a) AS a_type, modeOf(a) AS a_mode 
> FROM `json/all-null.json` LIMIT 1;
> {code}
> {noformat}
> +-+--+
> | a_type  |  a_mode  |
> +-+--+
> | INTEGER | NULLABLE |
> +-+--+
> {noformat}
>  Someone will have to do some investigating to understand the current 
> behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type

2019-12-29 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005161#comment-17005161
 ] 

Paul Rogers commented on DRILL-7499:


I believe the described behavior is an unintended artifact of this bit of code 
in {{Types.java}}:

{code:java}
  public static String getSqlTypeName(final MajorType type) {
if (type.getMode() == DataMode.REPEATED || type.getMinorType() == 
MinorType.LIST) {
  return "ARRAY";
}
return getBaseSqlTypeName(type);
  }
{code}

Since we have {{modeOf()}} to report the mode ({{REPEATED}}), I will modify 
this function to not return "ARRAY" for the {{REPEATED}} mode.
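
A sketch of the proposed change, assuming {{LIST}} keeps the "ARRAY" name (it 
has no base SQL type to report) while {{REPEATED}} columns report their base 
type:

{code:java}
public static String getSqlTypeName(final MajorType type) {
  // REPEATED columns report their base type; modeOf() already reports
  // the ARRAY-ness. LIST still has no base type, so it stays "ARRAY".
  if (type.getMinorType() == MinorType.LIST) {
    return "ARRAY";
  }
  return getBaseSqlTypeName(type);
}
{code}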

> sqltypeof() function with an array returns "ARRAY", not type
> 
>
> Key: DRILL-7499
> URL: https://issues.apache.org/jira/browse/DRILL-7499
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Priority: Minor
>
> The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
> limitations of the original {{typeof()}} function. The function is mentioned 
> in _Learning Apache Drill_, Chapter 8, page 152:
> {noformat}
> SELECT sqlTypeOf(columns) AS cols_type,
>modeOf(columns) AS cols_mode
> FROM `csv/cust.csv` LIMIT 1;
> +++
> | cols_type  | cols_mode  |
> +++
> | CHARACTER VARYING  | ARRAY  |
> +++
> {noformat}
> When the same query is run against the just-released Drill 1.17, we get the 
> *wrong* results:
> {noformat}
> +---+---+
> | cols_type | cols_mode |
> +---+---+
> | ARRAY | ARRAY |
> +---+---+
> {noformat}
> The definition of {{sqlTypeOf()}} is that it should return the type portion 
> of the column's (type, mode) major type. Clearly, it is no longer doing so for 
> arrays. As a result, there is no function to obtain the data type for arrays.
> The problem also shows up in the query from page 158:
> {code:sql}
> SELECT a, b,
>sqlTypeOf(b) AS b_type, modeof(b) AS b_mode
> FROM `gen/70kmissing.json`
> WHERE mod(a, 7) = 1;
> {code}
> Expected (table from the book with Drill 1.14 results):
> {noformat}
> ++---+--+---+
> |   a|   b   |  b_type  |  b_mode   |
> ++---+--+---+
> | 1  | null  | INTEGER  | NULLABLE  |
> ++---+--+---+
> {noformat}
> Actual Drill 1.17 results:
> {noformat}
> +---+---+---+--+
> |   a   | b |  b_type   |  b_mode  |
> +---+---+---+--+
> | 1 | null  | NULL  | NULLABLE |
> +---+---+---+--+
> {noformat}
> (Second line of table is omitted because something else changed, not relevant 
> to this ticket.)
> The above might not actually be a bug if someone has intentionally changed 
> the type of missing columns from the old {{INT}} to a newer (untyped) 
> {{NULL}}. But an indirect test suggests that the column is still {{INT}} and 
> the function is wrong:
> {code:sql}
> SELECT a, b
> FROM `gen/70kdouble.json`
> WHERE b IS NOT NULL ORDER BY a;
> {code}
> Data:
> {noformat}
> {a: 1}
> ...
> {a: 6}
> {a: 70001, b: 10.5}
> {noformat}
> Error:
> {noformat}
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
> Sort. Please enable Union type.
> Previous schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` 
> (INT:OPTIONAL)]], selectionVector=NONE]
> Incoming schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` 
> (FLOAT8:OPTIONAL)]], selectionVector=NONE]
> {noformat}
> Oddly, however, the query on page 160 works as expected:
> {code:sql}
> SELECT sqlTypeOf(a) AS a_type, modeOf(a) AS a_mode 
> FROM `json/all-null.json` LIMIT 1;
> {code}
> {noformat}
> +-+--+
> | a_type  |  a_mode  |
> +-+--+
> | INTEGER | NULLABLE |
> +-+--+
> {noformat}
>  Someone will have to do some investigating to understand the current 
> behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-7501) Drill 1.17 sqlTypeOf for a Map now reports STRUCT

2019-12-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7501.

Resolution: Won't Fix

As explained on the dev list, the return value in this case was changed to 
match the preferred name {{STRUCT}} for what Drill has historically called a 
{{MAP}}. The name {{STRUCT}} is consistent with Hive.

> Drill 1.17 sqlTypeOf for a Map now reports STRUCT
> -
>
> Key: DRILL-7501
> URL: https://issues.apache.org/jira/browse/DRILL-7501
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>
> Drill 1.14 introduced the {{sqlTypeOf()}} function to work around limits of 
> the {{typeof()}} function. {{sqlTypeOf()}} should return the name of the SQL 
> type for a column, using the type name that Drill uses.
> A query from page 163 of _Learning Apache Drill_:
> {code:sql}
> SELECT sqlTypeOf(`name`) AS name_type FROM `json/nested.json`;
> {code}
> Drill 1.14 results (correct):
> {noformat}
> ++
> | name_type  |
> ++
> | MAP|
> ++
> {noformat}
> Drill 1.17 results (incorrect):
> {noformat}
> +---+
> | name_type |
> +---+
> | STRUCT|
> +---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-5189) There's no documentation for the typeof() function

2019-12-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-5189.

Resolution: Duplicate

> There's no documentation for the typeof() function
> --
>
> Key: DRILL-5189
> URL: https://issues.apache.org/jira/browse/DRILL-5189
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Documentation
>Reporter: Chris Westin
>Assignee: Bridget Bevens
>Priority: Major
>
> I looked through the documentation at https://drill.apache.org/docs/ under 
> SQL Reference > SQL Functions > ... and could not find any reference to 
> typeof(). Google searches only turned up a reference to DRILL-4204.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (DRILL-7501) Drill 1.17 sqlTypeOf for a Map now reports STRUCT

2019-12-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-7501:
--

Assignee: Paul Rogers

> Drill 1.17 sqlTypeOf for a Map now reports STRUCT
> -
>
> Key: DRILL-7501
> URL: https://issues.apache.org/jira/browse/DRILL-7501
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>
> Drill 1.14 introduced the {{sqlTypeOf()}} function to work around limits of 
> the {{typeof()}} function. {{sqlTypeOf()}} should return the name of the SQL 
> type for a column, using the type name that Drill uses.
> A query from page 163 of _Learning Apache Drill_:
> {code:sql}
> SELECT sqlTypeOf(`name`) AS name_type FROM `json/nested.json`;
> {code}
> Drill 1.14 results (correct):
> {noformat}
> ++
> | name_type  |
> ++
> | MAP|
> ++
> {noformat}
> Drill 1.17 results (incorrect):
> {noformat}
> +---+
> | name_type |
> +---+
> | STRUCT|
> +---+
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-6377) typeof() does not return DECIMAL scale, precision

2019-12-29 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005155#comment-17005155
 ] 

Paul Rogers commented on DRILL-6377:


See DRILL-6362. The primary purpose of {{typeof()}} is to allow a query to 
determine the type of a value in a {{UNION}} column. (It has also been useful 
to debug queries for non-{{UNION}} columns.) Since adding widths would 
interfere with the purpose of this function, we should continue to omit them. 
As [~arina] has shown, a user who wants that information can use the 
{{sqlTypeOf()}} function.
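
For comparison, a hypothetical sketch of how a width-reporting name could be 
assembled on top of the existing {{getBaseSqlTypeName()}} in {{Types.java}}; 
the method name and the {{VARDECIMAL}} check are assumptions for illustration:

{code:java}
// Hypothetical sketch, not an existing Drill API: append (precision, scale)
// for decimal columns, e.g. "DECIMAL(6, 3)", as sqlTypeOf() does.
public static String getSqlTypeNameWithWidth(final MajorType type) {
  final String name = getBaseSqlTypeName(type);
  if (type.getMinorType() == MinorType.VARDECIMAL) {
    return name + "(" + type.getPrecision() + ", " + type.getScale() + ")";
  }
  return name;
}
{code}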

> typeof() does not return DECIMAL scale, precision
> -
>
> Key: DRILL-6377
> URL: https://issues.apache.org/jira/browse/DRILL-6377
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Priority: Minor
> Fix For: 1.16.0
>
>
> The {{typeof()}} function returns the type of a column:
> {noformat}
> SELECT typeof(CAST(a AS DOUBLE)) FROM (VALUES (1)) AS T(a);
> +-+
> | EXPR$0  |
> +-+
> | FLOAT8  |
> +-+
> {noformat}
> In Drill, the {{DECIMAL}} type is parameterized with scale and precision. 
> However, {{typeof()}} does not return this information:
> {noformat}
> ALTER SESSION SET `planner.enable_decimal_data_type` = true;
> SELECT typeof(CAST(a AS DECIMAL)) FROM (VALUES (1)) AS T(a);
> +--+
> |  EXPR$0  |
> +--+
> | DECIMAL38SPARSE  |
> +--+
> SELECT typeof(CAST(a AS DECIMAL(6, 3))) FROM (VALUES (1)) AS T(a);
> +---+
> |  EXPR$0   |
> +---+
> | DECIMAL9  |
> +---+
> {noformat}
> Expected something of the form {{DECIMAL(precision, scale)}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (DRILL-6377) typeof() does not return DECIMAL scale, precision

2019-12-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-6377:
--

Assignee: Paul Rogers

> typeof() does not return DECIMAL scale, precision
> -
>
> Key: DRILL-6377
> URL: https://issues.apache.org/jira/browse/DRILL-6377
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
> Fix For: 1.16.0
>
>
> The {{typeof()}} function returns the type of a column:
> {noformat}
> SELECT typeof(CAST(a AS DOUBLE)) FROM (VALUES (1)) AS T(a);
> +-+
> | EXPR$0  |
> +-+
> | FLOAT8  |
> +-+
> {noformat}
> In Drill, the {{DECIMAL}} type is parameterized with scale and precision. 
> However, {{typeof()}} does not return this information:
> {noformat}
> ALTER SESSION SET `planner.enable_decimal_data_type` = true;
> SELECT typeof(CAST(a AS DECIMAL)) FROM (VALUES (1)) AS T(a);
> +--+
> |  EXPR$0  |
> +--+
> | DECIMAL38SPARSE  |
> +--+
> SELECT typeof(CAST(a AS DECIMAL(6, 3))) FROM (VALUES (1)) AS T(a);
> +---+
> |  EXPR$0   |
> +---+
> | DECIMAL9  |
> +---+
> {noformat}
> Expected something of the form {{DECIMAL(precision, scale)}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (DRILL-6360) Document the typeof() function

2019-12-29 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005125#comment-17005125
 ] 

Paul Rogers edited comment on DRILL-6360 at 12/30/19 6:34 AM:
--

The information should go 
[here|https://drill.apache.org/docs/data-type-functions/].

{{typeof()}} returns {{"NULL"}} if the value of a column is NULL, else it 
returns the internal Drill type name for a column as given by {{drillTypeOf()}}.

If we adopt the changes proposed in DRILL-6362, then the documentation becomes: 
Returns the type of the column using the internal Drill type name. If the 
column is the experimental {{UNION}} type, it returns the type of the specific 
column value, or "NULL" if that column is null. To determine if a column is a 
UNION, use the {{drillTypeOf()}} function.

Note that in Drill 1.17 and earlier, the {{typeof()}} function returns "NULL" 
if the column value is null. In Drill 1.18 and later, this is only true if the 
column is of type {{UNION}}.


was (Author: paul.rogers):
The information should go 
[here|https://drill.apache.org/docs/data-type-functions/].

{{typeof()}} returns {{"NULL"}} if the value of a column is NULL, else it 
returns the internal Drill type name for a column as given by {{drillTypeOf()}}.

> Document the typeof() function
> --
>
> Key: DRILL-6360
> URL: https://issues.apache.org/jira/browse/DRILL-6360
> Project: Apache Drill
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Bridget Bevens
>Priority: Minor
>  Labels: doc-impacting
>
> Drill has a {{typeof()}} function that returns the data type (but not mode) 
> of a column. It was discussed on the dev list recently. However, a search of 
> the Drill web site, and a scan by hand, failed to turn up documentation about 
> the function.
> As a general suggestion, it would be great to have an alphabetical list of 
> all functions so we don't have to hunt all over the site to find which 
> functions are available.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (DRILL-6362) typeof() lies about types

2019-12-29 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005138#comment-17005138
 ] 

Paul Rogers edited comment on DRILL-6362 at 12/30/19 6:22 AM:
--

It is likely that this function was meant to mimic the 
[{{typeof()}}|https://www.w3resource.com/sqlite/core-functions-typeof.php] 
function of SQLite, which also returns "NULL" if the actual value is NULL.

Snowflake has the concept of a "Variant" (like Drill's Union type). In this 
case 
[{{typeof()}}|https://docs.snowflake.net/manuals/sql-reference/functions/typeof.html]
 returns the type of the value. The documentation shows an example of a null 
value for which {{typeof()}} returns "NULL".

Given this, the Drill function should probably return the value type for a 
UNION type. At present, {{typeof()}} will return "UNION", which is not 
consistent with the Snowflake variant pattern.

Postgres has the 
[{{pg_typeof()}}|https://www.postgresql.org/docs/9.3/functions-info.html] 
function, which is a bit convoluted, but the examples show that it effectively 
returns the type name.

Given all this, the proposal is to modify {{typeof()}} as follows (a sketch 
appears after the lists below):

* For a {{UNION}} type, return the actual type of the specific column value.
* For a {{UNION}} type (only), return "NULL" if the UNION itself is NULL. (Such 
a column really does have no type.)
* For all other types, return the {{MinorType}} name.

To be clear, the two changes are:

* Modify handling of {{UNION}} columns.
* Modify handling of columns with values set to {{NULL}}.

These changes seem valid because:

* They make the Drill function closer to operation of other SQL engines.
* Other than for debugging, the most likely use of {{typeof()}} is to work with 
UNIONS, a task for which the function currently fails.
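
A rough sketch of these rules, written against Drill's generic value reader; 
{{concreteTypeOf()}} is a stand-in for whatever accessor the UDF actually uses, 
so treat it and its placeholder body as illustrative only:

{code:java}
import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.exec.vector.complex.reader.FieldReader;

public class TypeOfSketch {
  public static String typeName(FieldReader reader) {
    if (reader.getType().getMinorType() == MinorType.UNION) {
      if (!reader.isSet()) {
        return "NULL";  // a null UNION value really does have no type
      }
      return concreteTypeOf(reader);  // type of the specific value
    }
    // All other columns: always the MinorType name, even for null values.
    return reader.getType().getMinorType().name();
  }

  // Hypothetical helper: real code would inspect the union's current
  // value vector. Stubbed here purely so the sketch is complete.
  private static String concreteTypeOf(FieldReader reader) {
    return reader.getType().getMinorType().name();  // placeholder
  }
}
{code}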


was (Author: paul.rogers):
Closing this because we did create the new functions and we've elected to 
leave this function alone for now.

> typeof() lies about types
> -
>
> Key: DRILL-6362
> URL: https://issues.apache.org/jira/browse/DRILL-6362
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> Drill provides a {{typeof()}} function that returns the type of a column. 
> But, it seems to make up types. Consider the following input file:
> {noformat}
> {a: true}
> {a: false}
> {a: null}
> {noformat}
> Consider the following two queries:
> {noformat}
> SELECT a FROM `json/boolean.json`;
> ++
> |   a|
> ++
> | true   |
> | false  |
> | null   |
> ++
> > SELECT typeof(a) FROM `json/boolean.json`;
> +-+
> | EXPR$0  |
> +-+
> | BIT |
> | BIT |
> | NULL|
> +-+
> {noformat}
> Notice that the values are reported as BIT. But I believe the actual type is 
> UInt1 (the bit vector is, I believe, deprecated). Then, the function reports 
> NULL instead of the actual type for the null value.
> Since Drill has an {{isnull()}} function, there is no reason for {{typeof()}} 
> to muddle the type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Reopened] (DRILL-6362) typeof() lies about types

2019-12-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reopened DRILL-6362:


> typeof() lies about types
> -
>
> Key: DRILL-6362
> URL: https://issues.apache.org/jira/browse/DRILL-6362
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> Drill provides a {{typeof()}} function that returns the type of a column. 
> But, it seems to make up types. Consider the following input file:
> {noformat}
> {a: true}
> {a: false}
> {a: null}
> {noformat}
> Consider the following two queries:
> {noformat}
> SELECT a FROM `json/boolean.json`;
> ++
> |   a|
> ++
> | true   |
> | false  |
> | null   |
> ++
> > SELECT typeof(a) FROM `json/boolean.json`;
> +-+
> | EXPR$0  |
> +-+
> | BIT |
> | BIT |
> | NULL|
> +-+
> {noformat}
> Notice that the values are reported as BIT. But I believe the actual type is 
> UInt1 (the bit vector is, I believe, deprecated). Then, the function reports 
> NULL instead of the actual type for the null value.
> Since Drill has an {{isnull()}} function, there is no reason for {{typeof()}} 
> to muddle the type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-6362) typeof() lies about types

2019-12-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-6362.

Resolution: Won't Fix

> typeof() lies about types
> -
>
> Key: DRILL-6362
> URL: https://issues.apache.org/jira/browse/DRILL-6362
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> Drill provides a {{typeof()}} function that returns the type of a column. 
> But, it seems to make up types. Consider the following input file:
> {noformat}
> {a: true}
> {a: false}
> {a: null}
> {noformat}
> Consider the following two queries:
> {noformat}
> SELECT a FROM `json/boolean.json`;
> ++
> |   a|
> ++
> | true   |
> | false  |
> | null   |
> ++
> > SELECT typeof(a) FROM `json/boolean.json`;
> +-+
> | EXPR$0  |
> +-+
> | BIT |
> | BIT |
> | NULL|
> +-+
> {noformat}
> Notice that the values are reported as BIT. But I believe the actual type is 
> UInt1 (the bit vector is, I believe, deprecated). Then, the function reports 
> NULL instead of the actual type for the null value.
> Since Drill has an {{isnull()}} function, there is no reason for {{typeof()}} 
> to muddle the type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-6362) typeof() lies about types

2019-12-29 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005138#comment-17005138
 ] 

Paul Rogers commented on DRILL-6362:


Closing this because we did create the new functions and we've elected to 
leave this function alone for now.

> typeof() lies about types
> -
>
> Key: DRILL-6362
> URL: https://issues.apache.org/jira/browse/DRILL-6362
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Priority: Major
>
> Drill provides a {{typeof()}} function that returns the type of a column. 
> But, it seems to make up types. Consider the following input file:
> {noformat}
> {a: true}
> {a: false}
> {a: null}
> {noformat}
> Consider the following two queries:
> {noformat}
> SELECT a FROM `json/boolean.json`;
> ++
> |   a|
> ++
> | true   |
> | false  |
> | null   |
> ++
> > SELECT typeof(a) FROM `json/boolean.json`;
> +-+
> | EXPR$0  |
> +-+
> | BIT |
> | BIT |
> | NULL|
> +-+
> {noformat}
> Notice that the values are reported as BIT. But I believe the actual type is 
> UInt1 (the bit vector is, I believe, deprecated). Then, the function reports 
> NULL instead of the actual type for the null value.
> Since Drill has an {{isnull()}} function, there is no reason for {{typeof()}} 
> to muddle the type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (DRILL-6362) typeof() lies about types

2019-12-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-6362:
--

Assignee: Paul Rogers

> typeof() lies about types
> -
>
> Key: DRILL-6362
> URL: https://issues.apache.org/jira/browse/DRILL-6362
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> Drill provides a {{typeof()}} function that returns the type of a column. 
> But, it seems to make up types. Consider the following input file:
> {noformat}
> {a: true}
> {a: false}
> {a: null}
> {noformat}
> Consider the following two queries:
> {noformat}
> SELECT a FROM `json/boolean.json`;
> ++
> |   a|
> ++
> | true   |
> | false  |
> | null   |
> ++
> > SELECT typeof(a) FROM `json/boolean.json`;
> +-+
> | EXPR$0  |
> +-+
> | BIT |
> | BIT |
> | NULL|
> +-+
> {noformat}
> Notice that the values are reported as BIT. But I believe the actual type is 
> UInt1 (the bit vector is, I believe, deprecated). Then, the function reports 
> NULL instead of the actual type for the null value.
> Since Drill has an {{isnull()}} function, there is no reason for {{typeof()}} 
> to muddle the type.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-6360) Document the typeof() function

2019-12-29 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-6360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005125#comment-17005125
 ] 

Paul Rogers commented on DRILL-6360:


The information should go 
[here|https://drill.apache.org/docs/data-type-functions/].

{{typeof()}} returns {{"NULL"}} if the value of a column is NULL, else it 
returns the internal Drill type name for a column as given by {{drillTypeOf()}}.

> Document the typeof() function
> --
>
> Key: DRILL-6360
> URL: https://issues.apache.org/jira/browse/DRILL-6360
> Project: Apache Drill
>  Issue Type: Task
>  Components: Documentation
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Bridget Bevens
>Priority: Minor
>  Labels: doc-impacting
>
> Drill has a {{typeof()}} function that returns the data type (but not mode) 
> of a column. It was discussed on the dev list recently. However, a search of 
> the Drill web site, and a scan by hand, failed to turn up documentation about 
> the function.
> As a general suggestion, it would be great to have an alphabetical list of 
> all functions so we don't have to hunt all over the site to find which 
> functions are available.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7352) Introduce new checkstyle rules to make code style more consistent

2019-12-29 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005124#comment-17005124
 ] 

Paul Rogers commented on DRILL-7352:


In prior years, we followed the Sun coding conventions, as suggested 
[here|http://drill.apache.org/docs/apache-drill-contribution-guidelines/] and 
documented 
[here|https://www.oracle.com/technetwork/java/codeconvtoc-136057.html].

Obviously, the Sun conventions are 20 years old and do not address newer Java 
features or conventions.

It is a good idea to update Drill's standards. Google's standards seem fine. 
The cautious step would be to keep the original standards, adopting the Google 
standards only where they don't conflict (much) with existing code.

Then, let's be sure to document the standards on the web site.

> Introduce new checkstyle rules to make code style more consistent
> -
>
> Key: DRILL-7352
> URL: https://issues.apache.org/jira/browse/DRILL-7352
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Vova Vysotskyi
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Source - https://checkstyle.sourceforge.io/checks.html
> List of rules to be enabled:
> * [LeftCurly|https://checkstyle.sourceforge.io/config_blocks.html#LeftCurly] 
> - force placement of a left curly brace at the end of the line.
> * 
> [RightCurly|https://checkstyle.sourceforge.io/config_blocks.html#RightCurly] 
> - force placement of a right curly brace
> * 
> [NewlineAtEndOfFile|https://checkstyle.sourceforge.io/config_misc.html#NewlineAtEndOfFile]
> * 
> [UnnecessaryParentheses|https://checkstyle.sourceforge.io/config_coding.html#UnnecessaryParentheses]
> * 
> [MethodParamPad|https://checkstyle.sourceforge.io/config_whitespace.html#MethodParamPad]
> * [InnerTypeLast 
> |https://checkstyle.sourceforge.io/config_design.html#InnerTypeLast]
> * 
> [MissingOverride|https://checkstyle.sourceforge.io/config_annotation.html#MissingOverride]
> * 
> [InvalidJavadocPosition|https://checkstyle.sourceforge.io/config_javadoc.html#InvalidJavadocPosition]
> * 
> [ArrayTypeStyle|https://checkstyle.sourceforge.io/config_misc.html#ArrayTypeStyle]
> * [UpperEll|https://checkstyle.sourceforge.io/config_misc.html#UpperEll]
> and others



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7352) Introduce new checkstyle rules to make code style more consistent

2019-12-29 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005122#comment-17005122
 ] 

Paul Rogers commented on DRILL-7352:


Once we have candidate rules, I'll try implementing those rules in Eclipse to 
verify that they work. Then I can provide a new Eclipse setup file.

> Introduce new checkstyle rules to make code style more consistent
> -
>
> Key: DRILL-7352
> URL: https://issues.apache.org/jira/browse/DRILL-7352
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Vova Vysotskyi
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Source - https://checkstyle.sourceforge.io/checks.html
> List of rules to be enabled:
> * [LeftCurly|https://checkstyle.sourceforge.io/config_blocks.html#LeftCurly] 
> - force placement of a left curly brace at the end of the line.
> * 
> [RightCurly|https://checkstyle.sourceforge.io/config_blocks.html#RightCurly] 
> - force placement of a right curly brace
> * 
> [NewlineAtEndOfFile|https://checkstyle.sourceforge.io/config_misc.html#NewlineAtEndOfFile]
> * 
> [UnnecessaryParentheses|https://checkstyle.sourceforge.io/config_coding.html#UnnecessaryParentheses]
> * 
> [MethodParamPad|https://checkstyle.sourceforge.io/config_whitespace.html#MethodParamPad]
> * [InnerTypeLast 
> |https://checkstyle.sourceforge.io/config_design.html#InnerTypeLast]
> * 
> [MissingOverride|https://checkstyle.sourceforge.io/config_annotation.html#MissingOverride]
> * 
> [InvalidJavadocPosition|https://checkstyle.sourceforge.io/config_javadoc.html#InvalidJavadocPosition]
> * 
> [ArrayTypeStyle|https://checkstyle.sourceforge.io/config_misc.html#ArrayTypeStyle]
> * [UpperEll|https://checkstyle.sourceforge.io/config_misc.html#UpperEll]
> and others



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7352) Introduce new checkstyle rules to make code style more consistent

2019-12-29 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005121#comment-17005121
 ] 

Paul Rogers commented on DRILL-7352:


Eclipse can be made to organize imports similarly but generally sorts the 
imports so that, say, {{com}} comes before {{org}}, etc. Does IntelliJ do this?

We need a defined order, else when an IDE "organizes imports", the order will 
be unstable.
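
For illustration, one possible defined order (the grouping here is an 
assumption, not an agreed standard) that either IDE could be configured to 
produce:

{code:java}
// One possible stable ordering, illustrative only: static imports first,
// then java/javax, then org, then com, each group sorted alphabetically.
import static org.junit.Assert.assertEquals;

import java.util.List;

import org.apache.drill.common.types.TypeProtos.MajorType;

import com.fasterxml.jackson.databind.ObjectMapper;
{code}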

> Introduce new checkstyle rules to make code style more consistent
> -
>
> Key: DRILL-7352
> URL: https://issues.apache.org/jira/browse/DRILL-7352
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Vova Vysotskyi
>Priority: Major
> Attachments: screenshot-1.png
>
>
> Source - https://checkstyle.sourceforge.io/checks.html
> List of rules to be enabled:
> * [LeftCurly|https://checkstyle.sourceforge.io/config_blocks.html#LeftCurly] 
> - force placement of a left curly brace at the end of the line.
> * 
> [RightCurly|https://checkstyle.sourceforge.io/config_blocks.html#RightCurly] 
> - force placement of a right curly brace
> * 
> [NewlineAtEndOfFile|https://checkstyle.sourceforge.io/config_misc.html#NewlineAtEndOfFile]
> * 
> [UnnecessaryParentheses|https://checkstyle.sourceforge.io/config_coding.html#UnnecessaryParentheses]
> * 
> [MethodParamPad|https://checkstyle.sourceforge.io/config_whitespace.html#MethodParamPad]
> * [InnerTypeLast 
> |https://checkstyle.sourceforge.io/config_design.html#InnerTypeLast]
> * 
> [MissingOverride|https://checkstyle.sourceforge.io/config_annotation.html#MissingOverride]
> * 
> [InvalidJavadocPosition|https://checkstyle.sourceforge.io/config_javadoc.html#InvalidJavadocPosition]
> * 
> [ArrayTypeStyle|https://checkstyle.sourceforge.io/config_misc.html#ArrayTypeStyle]
> * [UpperEll|https://checkstyle.sourceforge.io/config_misc.html#UpperEll]
> and others



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7501) Drill 1.17 sqlTypeOf for a Map now reports STRUCT

2019-12-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7501:
--

 Summary: Drill 1.17 sqlTypeOf for a Map now reports STRUCT
 Key: DRILL-7501
 URL: https://issues.apache.org/jira/browse/DRILL-7501
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


Drill 1.14 introduced the {{sqlTypeOf()}} function to work around limits of the 
{{typeof()}} function. {{sqlTypeOf()}} should return the name of the SQL type 
for a column, using the type name that Drill uses.

A query from page 163 of _Learning Apache Drill_:

{code:sql}
SELECT sqlTypeOf(`name`) AS name_type FROM `json/nested.json`;
{code}

Drill 1.14 results (correct):

{noformat}
++
| name_type  |
++
| MAP|
++
{noformat}

Drill 1.17 results (incorrect):

{noformat}
+---+
| name_type |
+---+
| STRUCT|
+---+
{noformat}




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7500) CTAS to JSON omits the final newline

2019-12-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7500:
--

 Summary: CTAS to JSON omits the final newline
 Key: DRILL-7500
 URL: https://issues.apache.org/jira/browse/DRILL-7500
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


Try the query from page 160 of _Learning Apache Drill_:

{code:sql}
ALTER SESSION SET `store.format` = 'json';
CREATE TABLE `out/json-null` AS SELECT * FROM `json/null2.json`;
{code}

Then, {{cat}} the resulting file:

{noformat}
cat out/json-null/0_0_0.json 
{
  "custId" : 123,
  "name" : "Fred",
  "balance" : 123.45
} {
  "custId" : 125,
  "name" : "Barney"
}(base) paul@paul-linux:~/eclipse-workspace/drillbook/data$
{noformat}

Notice that the file is missing a final newline, and so the shell prompt is 
appended to the last closing brace.

Expected the line to be terminated with a newline.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type

2019-12-26 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7499:
---
Issue Type: Bug  (was: Improvement)

> sqltypeof() function with an array returns "ARRAY", not type
> 
>
> Key: DRILL-7499
> URL: https://issues.apache.org/jira/browse/DRILL-7499
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Paul Rogers
>Priority: Minor
>  Labels: regression
>
> The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
> limitations of the original {{typeof()}} function. The function is mentioned 
> in _Learning Apache Drill_, Chapter 8, page 152:
> {noformat}
> SELECT sqlTypeOf(columns) AS cols_type,
>modeOf(columns) AS cols_mode
> FROM `csv/cust.csv` LIMIT 1;
> +++
> | cols_type  | cols_mode  |
> +++
> | CHARACTER VARYING  | ARRAY  |
> +++
> {noformat}
> When the same query is run against the just-released Drill 1.17, we get the 
> *wrong* results:
> {noformat}
> +---+---+
> | cols_type | cols_mode |
> +---+---+
> | ARRAY | ARRAY |
> +---+---+
> {noformat}
> The definition of {{sqlTypeOf()}} is that it should return the type portion 
> of the column's (type, mode) major type. Clearly, it is no longer doing so for 
> arrays. As a result, there is no function to obtain the data type for arrays.
> The problem also shows up in the query from page 158:
> {code:sql}
> SELECT a, b,
>sqlTypeOf(b) AS b_type, modeof(b) AS b_mode
> FROM `gen/70kmissing.json`
> WHERE mod(a, 7) = 1;
> {code}
> Expected (table from the book with Drill 1.14 results):
> {noformat}
> ++---+--+---+
> |   a|   b   |  b_type  |  b_mode   |
> ++---+--+---+
> | 1  | null  | INTEGER  | NULLABLE  |
> ++---+--+---+
> {noformat}
> Actual Drill 1.17 results:
> {noformat}
> +---+---+---+--+
> |   a   | b |  b_type   |  b_mode  |
> +---+---+---+--+
> | 1 | null  | NULL  | NULLABLE |
> +---+---+---+--+
> {noformat}
> (Second line of table is omitted because something else changed, not relevant 
> to this ticket.)
> The above might not actually be a bug if someone has intentionally changed 
> the type of missing columns from the old {{INT}} to a newer (untyped) 
> {{NULL}}. But an indirect test suggests that the column is still {{INT}} and 
> the function is wrong:
> {code:sql}
> SELECT a, b
> FROM `gen/70kdouble.json`
> WHERE b IS NOT NULL ORDER BY a;
> {code}
> Data:
> {noformat}
> {a: 1}
> ...
> {a: 6}
> {a: 70001, b: 10.5}
> {noformat}
> Error:
> {noformat}
> Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
> Sort. Please enable Union type.
> Previous schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` 
> (INT:OPTIONAL)]], selectionVector=NONE]
> Incoming schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` 
> (FLOAT8:OPTIONAL)]], selectionVector=NONE]
> {noformat}
> Oddly, however, the query on page 160 works as expected:
> {code:sql}
> SELECT sqlTypeOf(a) AS a_type, modeOf(a) AS a_mode 
> FROM `json/all-null.json` LIMIT 1;
> {code}
> {noformat}
> +-+--+
> | a_type  |  a_mode  |
> +-+--+
> | INTEGER | NULLABLE |
> +-+--+
> {noformat}
>  Someone will have to do some investigating to understand the current 
> behaviour.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type

2019-12-26 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7499:
---
Description: 
The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
limitations of the original {{typeof()}} function. The function is mentioned in 
_Learning Apache Drill_, Chapter 8, page 152:


{noformat}
SELECT sqlTypeOf(columns) AS cols_type,
       modeOf(columns) AS cols_mode
FROM `csv/cust.csv` LIMIT 1;

+++
| cols_type  | cols_mode  |
+++
| CHARACTER VARYING  | ARRAY  |
+++
{noformat}

When the same query is run against the just-released Drill 1.17, we get the 
*wrong* results:

{noformat}
+---+---+
| cols_type | cols_mode |
+---+---+
| ARRAY | ARRAY |
+---+---+
{noformat}

The definition of {{sqlTypeOf()}} is that it should return the type portion of 
the column's (type, mode) major type. Clearly, it is no longer doing so for 
arrays. As a result, there is no function to obtain the data type for arrays.

The problem also shows up in the query from page 158:

{code:sql}
SELECT a, b,
       sqlTypeOf(b) AS b_type, modeof(b) AS b_mode
FROM `gen/70kmissing.json`
WHERE mod(a, 7) = 1;
{code}

Expected (table from the book with Drill 1.14 results):

{noformat}
++---+--+---+
|   a|   b   |  b_type  |  b_mode   |
++---+--+---+
| 1  | null  | INTEGER  | NULLABLE  |
++---+--+---+
{noformat}

Actual Drill 1.17 results:

{noformat}
+---+---+---+--+
|   a   | b |  b_type   |  b_mode  |
+---+---+---+--+
| 1 | null  | NULL  | NULLABLE |
+---+---+---+--+
{noformat}

(Second line of table is omitted because something else changed, not relevant 
to this ticket.)

The above might not actually be a bug if someone has intentionally changed the 
type of missing columns from the old {{INT}} to a newer (untyped) {{NULL}}. But 
an indirect test suggests that the column is still {{INT}} and the function is 
wrong:

{code:sql}
SELECT a, b
FROM `gen/70kdouble.json`
WHERE b IS NOT NULL ORDER BY a;
{code}

Data:

{noformat}
{a: 1}
...
{a: 6}
{a: 70001, b: 10.5}
{noformat}

Error:

{noformat}
Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
Sort. Please enable Union type.

Previous schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` 
(INT:OPTIONAL)]], selectionVector=NONE]
Incoming schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` 
(FLOAT8:OPTIONAL)]], selectionVector=NONE]

{noformat}

Oddly, however, the query on page 160 works as expected:

{code:sql}
SELECT sqlTypeOf(a) AS a_type, modeOf(a) AS a_mode 
FROM `json/all-null.json` LIMIT 1;
{code}

{noformat}
+-+--+
| a_type  |  a_mode  |
+-+--+
| INTEGER | NULLABLE |
+-+--+
{noformat}

 Someone will have to do some investigating to understand the current behaviour.

  was:
The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
limitations of the original {{typeof()}} function. The function is mentioned in 
_Learning Apache Drill_, Chapter 8, page 152:


{noformat}
SELECT sqlTypeOf(columns) AS cols_type,
       modeOf(columns) AS cols_mode
FROM `csv/cust.csv` LIMIT 1;

+++
| cols_type  | cols_mode  |
+++
| CHARACTER VARYING  | ARRAY  |
+++
{noformat}

When the same query is run against the just-released Drill 1.17, we get the 
*wrong* results:

{noformat}
+---+---+
| cols_type | cols_mode |
+---+---+
| ARRAY | ARRAY |
+---+---+
{noformat}

The definition of {{sqlTypeOf()}} is that it should return the type portion of 
the column's (type, mode) major type. Clearly, it is no longer doing so for 
arrays. As a result, there is no function to obtain the data type for arrays.

The problem also shows up in the query from page 158:

{code:sql}
SELECT a, b,
       sqlTypeOf(b) AS b_type, modeof(b) AS b_mode
FROM `gen/70kmissing.json`
WHERE mod(a, 7) = 1;
{code}

Expected (table from the book with Drill 1.14 results):

{noformat}
++---+--+---+
|   a|   b   |  b_type  |  b_mode   |
++---+--+---+
| 1  | null  | INTEGER  | NULLABLE  |
++---+--+---+
{noformat}

Actual Drill 1.17 results:

{noformat}
+---+---+---+--+
|   a   | b |  b_type   |  b_mode  |
+---+---+---+--+
| 1 | null  | NULL  | NULLABLE |

[jira] [Updated] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type

2019-12-26 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7499:
---
Description: 
The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
limitations of the original {{typeof()}} function. The function is mentioned in 
_Learning Apache Drill_, Chapter 8, page 152:


{noformat}
SELECT sqlTypeOf(columns) AS cols_type,
       modeOf(columns) AS cols_mode
FROM `csv/cust.csv` LIMIT 1;

+++
| cols_type  | cols_mode  |
+++
| CHARACTER VARYING  | ARRAY  |
+++
{noformat}

When the same query is run against the just-released Drill 1.17, we get the 
*wrong* results:

{noformat}
+---+---+
| cols_type | cols_mode |
+---+---+
| ARRAY | ARRAY |
+---+---+
{noformat}

The definition of {{sqlTypeOf()}} is that it should return the type portion of 
the column's (type, mode) major type. Clearly, it is no longer doing so for 
arrays. As a result, there is no function to obtain the data type for arrays.

The problem also shows up in the query from page 158:

{code:sql}
SELECT a, b,
       sqlTypeOf(b) AS b_type, modeof(b) AS b_mode
FROM `gen/70kmissing.json`
WHERE mod(a, 7) = 1;
{code}

Expected (table from the book with Drill 1.14 results):

{noformat}
++---+--+---+
|   a|   b   |  b_type  |  b_mode   |
++---+--+---+
| 1  | null  | INTEGER  | NULLABLE  |
++---+--+---+
{noformat}

Actual Drill 1.17 results:

{noformat}
+---+---+---+--+
|   a   | b |  b_type   |  b_mode  |
+---+---+---+--+
| 1 | null  | NULL  | NULLABLE |
+---+---+---+--+
{noformat}

(Second line of table is omitted because something else changed, not relevant 
to this ticket.)

The above might not actually be a bug if someone has intentionally changed the 
type of missing columns from the old {{INT}} to a newer (untyped) {{NULL}}. But 
an indirect test suggests that the column is still {{INT}} and the function is 
wrong:

{code:sql}
SELECT a, b
FROM `gen/70kdouble.json`
WHERE b IS NOT NULL ORDER BY a;
{code}

Data:

{noformat}
{a: 1}
...
{a: 6}
{a: 70001, b: 10.5}
{noformat}

Error:

{noformat}
Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External 
Sort. Please enable Union type.

Previous schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` 
(INT:OPTIONAL)]], selectionVector=NONE]
Incoming schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` 
(FLOAT8:OPTIONAL)]], selectionVector=NONE]

{noformat}

 

  was:
The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
limitations of the original {{typeof()}} function. The function is mentioned in 
_Learning Apache Drill_, Chapter 8, page 152:


{noformat}
SELECT sqlTypeOf(columns) AS cols_type,
       modeOf(columns) AS cols_mode
FROM `csv/cust.csv` LIMIT 1;

+++
| cols_type  | cols_mode  |
+++
| CHARACTER VARYING  | ARRAY  |
+++
{noformat}

When the same query is run against the just-released Drill 1.17, we get the 
*wrong* results:

{noformat}
+---+---+
| cols_type | cols_mode |
+---+---+
| ARRAY | ARRAY |
+---+---+
{noformat}

The definition of {{sqlTypeOf()}} is that it should return the type portion of 
the column's (type, mode) major type. Clearly, it is no longer doing so for 
arrays. As a result, there is no function to obtain the data type for arrays.

The problem also shows up in the query from page 158:

{code:sql}
SELECT a, b,
       sqlTypeOf(b) AS b_type, modeof(b) AS b_mode
FROM `gen/70kmissing.json`
WHERE mod(a, 7) = 1;
{code}

Expected (table from the book with Drill 1.14 results):

{noformat}
++---+--+---+
|   a|   b   |  b_type  |  b_mode   |
++---+--+---+
| 1  | null  | INTEGER  | NULLABLE  |
++---+--+---+
{noformat}

Actual Drill 1.17 results:

{noformat}
+---+---+---+--+
|   a   | b |  b_type   |  b_mode  |
+---+---+---+--+
| 1 | null  | NULL  | NULLABLE |
+---+---+---+--+
{noformat}

(Second line of table is omitted because something else changed, not relevant 
to this ticket.)

The above might not actually be a bug if someone has intentionally changed the 
type of missing columns from the old {{INT}} to a newer (untyped) {{NULL}}.


> sqltypeof() function with an array returns "ARRAY", not type
> 

[jira] [Updated] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type

2019-12-26 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7499:
---
Description: 
The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
limitations of the original {{typeof()}} function. The function is mentioned in 
_Learning Apache Drill_, Chapter 8, page 152:


{noformat}
SELECT sqlTypeOf(columns) AS cols_type,
       modeOf(columns) AS cols_mode
FROM `csv/cust.csv` LIMIT 1;

+++
| cols_type  | cols_mode  |
+++
| CHARACTER VARYING  | ARRAY  |
+++
{noformat}

When the same query is run against the just-released Drill 1.17, we get the 
*wrong* results:

{noformat}
+---+---+
| cols_type | cols_mode |
+---+---+
| ARRAY | ARRAY |
+---+---+
{noformat}

The definition of {{sqlTypeOf()}} is that it should return the type portion of 
the column's (type, mode) major type. Clearly, it is no longer doing so for 
arrays. As a result, there is no function to obtain the data type for arrays.

The problem also shows up in the query from page 158:

{code:sql}
SELECT a, b,
       sqlTypeOf(b) AS b_type, modeof(b) AS b_mode
FROM `gen/70kmissing.json`
WHERE mod(a, 7) = 1;
{code}

Expected (table from the book with Drill 1.14 results):

{noformat}
++---+--+---+
|   a|   b   |  b_type  |  b_mode   |
++---+--+---+
| 1  | null  | INTEGER  | NULLABLE  |
++---+--+---+
{noformat}

Actual Drill 1.17 results:

{noformat}
+---+---+---+--+
|   a   | b |  b_type   |  b_mode  |
+---+---+---+--+
| 1 | null  | NULL  | NULLABLE |
+---+---+---+--+
{noformat}

(Second line of table is omitted because something else changed, not relevant 
to this ticket.)

The above might not actually be a bug if someone has intentionally changed the 
type of missing columns from the old {{INT}} to a newer (untyped) {{NULL}}.

  was:
The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
limitations of the original {{typeof()}} function. The function is mentioned in 
_Learning Apache Drill_, Chapter 8, page 152:


{noformat}
SELECT sqlTypeOf(columns) AS cols_type,
       modeOf(columns) AS cols_mode
FROM `csv/cust.csv` LIMIT 1;

+++
| cols_type  | cols_mode  |
+++
| CHARACTER VARYING  | ARRAY  |
+++
{noformat}

When the same query is run against the just-released Drill 1.17, we get the 
*wrong* results:

{noformat}
+---+---+
| cols_type | cols_mode |
+---+---+
| ARRAY | ARRAY |
+---+---+
{noformat}

The definition of {{sqlTypeOf()}} is that it should return the type portion of 
the column's (type, mode) major type. Clearly, it is no longer doing so for 
arrays. As a result, there is no function to obtain the data type for arrays.

The problem also shows up in the query from page 158:

{code:sql}
SELECT a, b,
       sqlTypeOf(b) AS b_type, modeof(b) AS b_mode
FROM `gen/70kmissing.json`
WHERE mod(a, 7) = 1;
{code}

Expected (table from the book with Drill 1.14 results):

{noformat}
++---+--+---+
|   a|   b   |  b_type  |  b_mode   |
++---+--+---+
| 1  | null  | INTEGER  | NULLABLE  |
++---+--+---+
{noformat}

Actual Drill 1.17 results:

{noformat}
+---+---+---+--+
|   a   | b |  b_type   |  b_mode  |
+---+---+---+--+
| 1 | null  | NULL  | NULLABLE |
+---+---+---+--+
{noformat}

(Second line of table is omitted because something else changed, not relevant 
to this ticket.)
 


> sqltypeof() function with an array returns "ARRAY", not type
> 
>
> Key: DRILL-7499
> URL: https://issues.apache.org/jira/browse/DRILL-7499
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Paul Rogers
>Priority: Minor
>  Labels: regression
>
> The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
> limitations of the original {{typeof()}} function. The function is mentioned 
> in _Learning Apache Drill_, Chapter 8, page 152:
> {noformat}
> SELECT sqlTypeOf(columns) AS cols_type,
>modeOf(columns) AS cols_mode
> FROM `csv/cust.csv` LIMIT 1;
> +++
> | cols_type  | cols_mode  |
> +++
> 

[jira] [Updated] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type

2019-12-26 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7499:
---
Description: 
The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
limitations of the original {{typeof()}} function. The function is mentioned in 
_Learning Apache Drill_, Chapter 8, page 152:


{noformat}
SELECT sqlTypeOf(columns) AS cols_type,
       modeOf(columns) AS cols_mode
FROM `csv/cust.csv` LIMIT 1;

+++
| cols_type  | cols_mode  |
+++
| CHARACTER VARYING  | ARRAY  |
+++
{noformat}

When the same query is run against the just-released Drill 1.17, we get the 
*wrong* results:

{noformat}
+---+---+
| cols_type | cols_mode |
+---+---+
| ARRAY | ARRAY |
+---+---+
{noformat}

The definition of {{sqlTypeOf()}} is that it should return the type portion of 
the column's (type, mode) major type. Clearly, it is no longer doing so for 
arrays. As a result, there is no function to obtain the data type for arrays.

The problem also shows up in the query from page 158:

{code:sql}
SELECT a, b,
       sqlTypeOf(b) AS b_type, modeof(b) AS b_mode
FROM `gen/70kmissing.json`
WHERE mod(a, 7) = 1;
{code}

Expected (table from the book with Drill 1.14 results):

{noformat}
++---+--+---+
|   a|   b   |  b_type  |  b_mode   |
++---+--+---+
| 1  | null  | INTEGER  | NULLABLE  |
++---+--+---+
{noformat}

Actual Drill 1.17 results:

{noformat}
+---+---+---+--+
|   a   | b |  b_type   |  b_mode  |
+---+---+---+--+
| 1 | null  | NULL  | NULLABLE |
+---+---+---+--+
{noformat}

(Second line of table is omitted because something else changed, not relevant 
to this ticket.)
 

  was:
The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
limitations of the original {{typeof()}} function. The function is mentioned in 
_Learning Apache Drill_, Chapter 8, page 152:


{noformat}
SELECT sqlTypeOf(columns) AS cols_type,
       modeOf(columns) AS cols_mode
FROM `csv/cust.csv` LIMIT 1;

+++
| cols_type  | cols_mode  |
+++
| CHARACTER VARYING  | ARRAY  |
+++
{noformat}

When the same query is run against the just-released Drill 1.17, we get the 
*wrong* results:

{noformat}
+---+---+
| cols_type | cols_mode |
+---+---+
| ARRAY | ARRAY |
+---+---+
{noformat}

The definition of {{sqlTypeOf()}} is that it should return the type portion of 
the column's (type, mode) major type. Clearly, it is no longer doing so for 
arrays. As a result, there is no function to obtain the data type for arrays.

 


> sqltypeof() function with an array returns "ARRAY", not type
> 
>
> Key: DRILL-7499
> URL: https://issues.apache.org/jira/browse/DRILL-7499
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Paul Rogers
>Priority: Minor
>  Labels: regression
>
> The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
> limitations of the original {{typeof()}} function. The function is mentioned 
> in _Learning Apache Drill_, Chapter 8, page 152:
> {noformat}
> SELECT sqlTypeOf(columns) AS cols_type,
>modeOf(columns) AS cols_mode
> FROM `csv/cust.csv` LIMIT 1;
> +++
> | cols_type  | cols_mode  |
> +++
> | CHARACTER VARYING  | ARRAY  |
> +++
> {noformat}
> When the same query is run against the just-released Drill 1.17, we get the 
> *wrong* results:
> {noformat}
> +---+---+
> | cols_type | cols_mode |
> +---+---+
> | ARRAY | ARRAY |
> +---+---+
> {noformat}
> The definition of {{sqlTypeOf()}} is that it should return the type portion 
> of the column's (type, mode) major type. Clearly, it is no longer doing so for 
> arrays. As a result, there is no function to obtain the data type for arrays.
> The problem also shows up in the query from page 158:
> {code:sql}
> SELECT a, b,
>sqlTypeOf(b) AS b_type, modeof(b) AS b_mode
> FROM `gen/70kmissing.json`
> WHERE mod(a, 7) = 1;
> {code}
> Expected (table from the book with Drill 1.14 results):
> {noformat}
> ++---+--+---+
> |   a|   b   |  b_type  |  b_mode   |
> ++---+--+---+
> | 1  | null  | INTEGER  | NULLABLE  |
> 

[jira] [Created] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type

2019-12-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7499:
--

 Summary: sqltypeof() function with an array returns "ARRAY", not 
type
 Key: DRILL-7499
 URL: https://issues.apache.org/jira/browse/DRILL-7499
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers


The {{sqltypeof()}} function was introduced in Drill 1.14 to work around 
limitations of the original {{typeof()}} function. The function is mentioned in 
_Learning Apache Drill_, Chapter 8, page 152:


{noformat}
SELECT sqlTypeOf(columns) AS cols_type,
       modeOf(columns) AS cols_mode
FROM `csv/cust.csv` LIMIT 1;

+++
| cols_type  | cols_mode  |
+++
| CHARACTER VARYING  | ARRAY  |
+++
{noformat}

When the same query is run against the just-released Drill 1.17, we get the 
*wrong* results:

{noformat}
+---+---+
| cols_type | cols_mode |
+---+---+
| ARRAY | ARRAY |
+---+---+
{noformat}

The definition of {{sqlTypeOf()}} is that it should return the type portion of 
the column's (type, mode) major type. Clearly, it is no longer doing so for 
arrays. As a result, there is no function to obtain the data type for arrays.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7498) Allow the storage plugin editor window to be resizable

2019-12-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7498:
--

 Summary: Allow the storage plugin editor window to be resizable
 Key: DRILL-7498
 URL: https://issues.apache.org/jira/browse/DRILL-7498
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers


Open the Drill Web Console. Click on the Storage tab. Pick a Storage Plugin and 
click Update.

The JSON appears in a nicely formatted editor. On a typical-sized monitor, the 
edit box takes up only half the screen vertically. Since it really helps to see 
more of the JSON than this small window allows, it would be handy if the edit 
box offered a resizer, as this very Jira edit box does.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7487) Retire unused OUT_OF_MEMORY iterator status

2019-12-12 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7487:
---
Description: 
Drill has long supported the {{OUT_OF_MEMORY}} iterator status. The idea is 
that an operator can realize it has encountered memory pressure and ask its 
downstream operator to free up some memory. However, an inspection of the code 
shows that the status is actually sent in only one place 
({{UnorderedReceiverBatch}}), and then only in response to the operator hitting 
its allocator limit (which no other batch can do anything about).

If an operator did choose to try to use this status, there are two key problems:

# The operator must be able to suspend itself at any point that it might need 
memory. For example, an operator that allocates a dozen vectors must be able to 
stop on, say, the 9th vector, then resume at that point on the subsequent call 
to {{next()}}. The complexity of the state machine needed to do this is very 
high.
# The *downstream* operators (who may not yet have seen rows) are the least 
likely to be able to release memory. It is the *upstream* operators (such as 
spillable operators) that might be able to spill some of the rows they are 
holding.

Presto suggests a nice alternative:

* An operator which encounters memory pressure asks the fragment executor for 
more memory.
* The fragment executor asks all *other* operators in that fragment to release 
memory if possible.

This allows a very simple memory recovery strategy:

{noformat}
  try {
// allocate something
  } catch (OutOfMemoryException e) {
context.requestMemory(this);
// allocate something again, throwing OOM if it fails again
  }
{noformat}

Note that, since the fragment runs on a single thread, the above is simple to 
implement. Each operator is either idle (not executing) or in a call to 
{{next()}} on a child operator. These are both stable times to consider 
invoking spilling. Further, a sender could use this opportunity to write 
partially-filled batches to the network and release them rather than waiting 
for more data.

The only case that can't be handled is, say, having an interior node flush a 
batch to its downstream operator within the same call.

Proposed are two changes:

# Retire the OUT_OF_MEMORY status. Simply remove all references to it since it 
is never sent.
# Create a stub {{requestMemory()}} method in the operator context that does 
nothing now, but could be expanded to perform the work suggested above (see the 
sketch below).
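
As a discussion aid, a sketch of how the stub could later be expanded under the 
Presto-style design above; the operator list and the {{releaseMemory()}} 
callback are hypothetical names, not existing Drill APIs:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only. requestMemory() is the proposed stub; the
// SpillableOperator interface and releaseMemory() are hypothetical.
public class FragmentMemoryBroker {
  public interface SpillableOperator {
    void releaseMemory();  // e.g. a spillable sort writes rows to disk
  }

  private final List<SpillableOperator> operators = new ArrayList<>();

  public void requestMemory(SpillableOperator requestor) {
    // The fragment runs on a single thread, so every other operator is
    // at a stable point (idle or blocked in next()) when this runs.
    for (SpillableOperator op : operators) {
      if (op != requestor) {
        op.releaseMemory();
      }
    }
  }
}
{code}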


  was:
Drill has long supported the {{OUT_OF_MEMORY}} iterator status. The idea is 
that an operator can realize it has encountered memory pressure and ask its 
downstream operator to free up some memory. However, an inspection of the code 
shows that the status is actually sent in only one place 
({{UnorderedReceiverBatch}}), and then only in response to the operator hitting 
its allocator limit (which no other batch can do anything about).

If an operator did choose to try to use this status, there are two key problems:

1. The operator must be able to suspend itself at any point that it might need 
memory. For example, an operator that allocates a dozen vectors must be able to 
stop on, say, the 9th vector, then resume at that point on the subsequent call 
to {{next()}}. The complexity of the state machine needed to do this is very 
high.
2. The *downstream* operators (who may not yet have seen rows) are the least 
likely to be able to release memory. It is the *upstream* operators (such as 
spillable operators) that might be able to spill some of the rows they are 
holding.

Presto suggests a nice alternative:

* An operator which encounters memory pressure asks the fragment executor for 
more memory.
* The fragment executor asks all *other* operators in that fragment to release 
memory if possible.

This allows a very simple memory recovery strategy:

{noformat}
  try {
// allocate something
  } catch (OutOfMemoryException e) {
context.requestMemory(this);
// allocate something again, throwing OOM if it fails again
  }
{noformat}

Note that, since the fragment runs on a single thread, the above is simple to 
implement. Each operator is either idle (not executing) or in a call to 
{{next()}} on a child operator. These are both stable times to consider 
invoking spilling. Further, a sender could use this opportunity to write 
partially-filled batches to the network and release them rather than waiting 
for more data.

The only case that can't be handled is, say, having an interior node flush a 
batch to its downstream operator within the same call.

Proposed are two changes:

1. Retire the OUT_OF_MEMORY status. Simply remove all references to it since it 
is never sent.
2. Create a stub {{requestMemory()}} method in the operator context that does 
nothing now, but could be expanded to perform the work suggested above.



> Retire unused OUT_OF_MEMORY iterator status
> 

[jira] [Updated] (DRILL-7487) Retire unused OUT_OF_MEMORY iterator status

2019-12-12 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7487:
---
Description: 
Drill has long supported the {{OUT_OF_MEMORY}} iterator status. The idea is 
that an operator can realize it has encountered memory pressure and ask its 
downstream operator to free up some memory. However, an inspection of the code 
shows that the status is actually sent in only one place 
({{UnorderedReceiverBatch}}), and then only in response to the operator hitting 
its allocator limit (which no other batch can do anything about).

If an operator did choose to try to use this status, there are two key problems:

1. The operator must be able to suspend itself at any point that it might need 
memory. For example, an operator that allocates a dozen vectors must be able to 
stop on, say, the 9th vector, then resume at that point on the subsequent call 
to {{next()}}. The complexity of the state machine needed to do this is very 
high.
2. The *downstream* operators (who may not yet have seen rows) are the least 
likely to be able to release memory. It is the *upstream* operators (such as 
spillable operators) that might be able to spill some of the rows they are 
holding.

Presto suggests a nice alternative:

* An operator which encounters memory pressure asks the fragment executor for 
more memory.
* The fragment executor asks all *other* operators in that fragment to release 
memory if possible.

This allows a very simple memory recovery strategy:

{noformat}
  try {
// allocate something
  } catch (OutOfMemoryException e) {
context.requestMemory(this);
// allocate something again, throwing OOM if it fails again
  }
{noformat}

Note that, since the fragment runs on a single thread, the above is simple to 
implement. Each operator is either idle (not executing) or in a call to 
`next()` on a child operator. These are both stable times to consider invoking 
spilling. Further, a sender could use this opportunity to write 
partially-filled batches to the network and release them rather than waiting 
for more data.

The one case that cannot be handled is, say, an interior node needing to flush a 
batch to its downstream operator from within the same `next()` call.

Proposed are two changes:

1. Retire the OUT_OF_MEMORY status. Simply remove all references to it since it 
is never sent.
2. Create a stub {{requestMemory()}} method in the operator context that does 
nothing now, but could be expanded to perform the work suggested above.


  was:
Drill has long supported the {{OUT_OF_MEMORY}} iterator status. The idea is 
that an operator can realize it has encountered memory pressure and ask its 
downstream operator to free up some memory. However, an inspection of the code 
shows that the status is actually sent in only one place 
({{UnorderedReceiverBatch}}), and then only in response to the operator hitting 
its allocator limit (which no other batch can do anything about).

If an operator did try to use this status, it would face two key problems:

1. The operator must be able to suspend itself at any point that it might need 
memory. For example, an operator that allocates a dozen vectors must be able to 
stop on, say, the 9th vector, then resume at that point on the subsequent call 
to `next()`. The complexity of the state machine needed to do this is very high.
2. The *downstream* operators (who may not yet have seen rows) are the least 
likely to be able to release memory. It is the *upstream* operators (such as 
spillable operators) that might be able to spill some of the rows they are 
holding.

Presto suggests a nice alternative:

* An operator which encounters memory pressure asks the fragment executor for 
more memory.
* The fragment executor asks all *other* operators in that fragment to release 
memory if possible.

This allows a very simple memory recovery strategy:

{noformat}
  try {
// allocate something
  } catch (OutOfMemoryException e) {
context.requestMemory(this);
// allocate something again, throwing OOM if it fails again
  }
{noformat}

Note that, since the fragment runs on a single thread, the above is simple to 
implement. Each operator is either idle (not executing) or in a call to 
`next()` on a child operator. These are both stable times to consider invoking 
spilling. Further, a sender could use this opportunity to write 
partially-filled batches to the network and release them rather than waiting 
for more data.

The one case that cannot be handled is, say, an interior node needing to flush a 
batch to its downstream operator from within the same `next()` call.

Proposed are two changes:

1. Retire the OUT_OF_MEMORY status. Simply remove all references to it since it 
is never sent.
2. Create a stub {{requestMemory()}} method in the operator context that does 
nothing now, but could be expanded to perform the work suggested above.



> Retire unused OUT_OF_MEMORY iterator status
> 

[jira] [Updated] (DRILL-7487) Retire unused OUT_OF_MEMORY iterator status

2019-12-12 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7487:
---
Description: 
Drill has long supported the {{OUT_OF_MEMORY}} iterator status. The idea is 
that an operator can realize it has encountered memory pressure and ask its 
downstream operator to free up some memory. However, an inspection of the code 
shows that the status is actually sent in only one place 
({{UnorderedReceiverBatch}}), and then only in response to the operator hitting 
its allocator limit (which no other batch can do anything about).

If an operator did try to use this status, it would face two key problems:

1. The operator must be able to suspend itself at any point that it might need 
memory. For example, an operator that allocates a dozen vectors must be able to 
stop on, say, the 9th vector, then resume at that point on the subsequent call 
to {{next()}}. The complexity of the state machine needed to do this is very 
high.
2. The *downstream* operators (who may not yet have seen rows) are the least 
likely to be able to release memory. It is the *upstream* operators (such as 
spillable operators) that might be able to spill some of the rows they are 
holding.

Presto suggests a nice alternative:

* An operator which encounters memory pressure asks the fragment executor for 
more memory.
* The fragment executor asks all *other* operators in that fragment to release 
memory if possible.

This allows a very simple memory recovery strategy:

{noformat}
  try {
// allocate something
  } catch (OutOfMemoryException e) {
context.requestMemory(this);
// allocate something again, throwing OOM if it fails again
  }
{noformat}

Note that, since the fragment runs on a single thread, the above is simple to 
implement. Each operator is either idle (not executing) or in a call to 
{{next()}} on a child operator. These are both stable times to consider 
invoking spilling. Further, a sender could use this opportunity to write 
partially-filled batches to the network and release them rather than waiting 
for more data.

The one case that cannot be handled is, say, an interior node needing to flush a 
batch to its downstream operator from within the same {{next()}} call.

Proposed are two changes:

1. Retire the OUT_OF_MEMORY status. Simply remove all references to it since it 
is never sent.
2. Create a stub {{requestMemory()}} method in the operator context that does 
nothing now, but could be expanded to perform the work suggested above.


  was:
Drill has long supported the {{OUT_OF_MEMORY}} iterator status. The idea is 
that an operator can realize it has encountered memory pressure and ask its 
downstream operator to free up some memory. However, an inspection of the code 
shows that the status is actually sent in only one place 
({{UnorderedReceiverBatch}}), and then only in response to the operator hitting 
its allocator limit (which no other batch can do anything about).

If an operator did try to use this status, it would face two key problems:

1. The operator must be able to suspend itself at any point that it might need 
memory. For example, an operator that allocates a dozen vectors must be able to 
stop on, say, the 9th vector, then resume at that point on the subsequent call 
to {{next()}}. The complexity of the state machine needed to do this is very 
high.
2. The *downstream* operators (who may not yet have seen rows) are the least 
likely to be able to release memory. It is the *upstream* operators (such as 
spillable operators) that might be able to spill some of the rows they are 
holding.

Presto suggests a nice alternative:

* An operator which encounters memory pressure asks the fragment executor for 
more memory.
* The fragment executor asks all *other* operators in that fragment to release 
memory if possible.

This allows a very simple memory recovery strategy:

{noformat}
  try {
// allocate something
  } catch (OutOfMemoryException e) {
context.requestMemory(this);
// allocate something again, throwing OOM if it fails again
  }
{noformat}

Note that, since the fragment runs on a single thread, the above is simple to 
implement. Each operator is either idle (not executing) or in a call to 
`next()` on a child operator. These are both stable times to consider invoking 
spilling. Further, a sender could use this opportunity to write 
partially-filled batches to the network and release them rather than waiting 
for more data.

The one case that cannot be handled is, say, an interior node needing to flush a 
batch to its downstream operator from within the same `next()` call.

Proposed are two changes:

1. Retire the OUT_OF_MEMORY status. Simply remove all references to it since it 
is never sent.
2. Create a stub {{requestMemory()}} method in the operator context that does 
nothing now, but could be expanded to perform the work suggested above.



> Retire unused OUT_OF_MEMORY iterator status

[jira] [Updated] (DRILL-7487) Retire unused OUT_OF_MEMORY iterator status

2019-12-12 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7487:
---
Description: 
Drill has long supported the {{OUT_OF_MEMORY}} iterator status. The idea is 
that an operator can realize it has encountered memory pressure and ask its 
downstream operator to free up some memory. However, an inspection of the code 
shows that the status is actually sent in only one place 
({{UnorderedReceiverBatch}}), and then only in response to the operator hitting 
its allocator limit (which no other batch can do anything about).

If an operator did try to use this status, it would face two key problems:

1. The operator must be able to suspend itself at any point that it might need 
memory. For example, an operator that allocates a dozen vectors must be able to 
stop on, say, the 9th vector, then resume at that point on the subsequent call 
to `next()`. The complexity of the state machine needed to do this is very high.
2. The *downstream* operators (who may not yet have seen rows) are the least 
likely to be able to release memory. It is the *upstream* operators (such as 
spillable operators) that might be able to spill some of the rows they are 
holding.

Presto suggests a nice alternative:

* An operator which encounters memory pressure asks the fragment executor for 
more memory.
* The fragment executor asks all *other* operators in that fragment to release 
memory if possible.

This allows a very simple memory recovery strategy:

{noformat}
  try {
// allocate something
  } catch (OutOfMemoryException e) {
context.requestMemory(this);
// allocate something again, throwing OOM if it fails again
  }
{noformat}

Note that, since the fragment runs on a single thread, the above is simple to 
implement. Each operator is either idle (not executing) or in a call to 
`next()` on a child operator. These are both stable times to consider invoking 
spilling. Further, a sender could use this opportunity to write 
partially-filled batches to the network and release them rather than waiting 
for more data.

The one case that cannot be handled is, say, an interior node needing to flush a 
batch to its downstream operator from within the same `next()` call.

Proposed are two changes:

1. Retire the OUT_OF_MEMORY status. Simply remove all references to it since it 
is never sent.
2. Create a stub {{requestMemory()}} method in the operator context that does 
nothing now, but could be expanded to perform the work suggested above.


  was:
Drill has long supported the {{OUT_OF_MEMORY}} iterator status. The idea is 
that an operator can realize it has encountered memory pressure and ask its 
downstream operator to free up some memory. However, an inspection of the code 
shows that the status is actually sent in only one place 
({{UnorderedReceiverBatch}}), and then only in response to the operator hitting 
its allocator limit (which no other batch can do anything about).

If an operator did try to use this status, it would face two key problems:

1. The operator must be able to suspend itself at any point that it might need 
memory. For example, an operator that allocates a dozen vectors must be able to 
stop on, say, the 9th vector, then resume at that point on the subsequent call 
to `next()`. The complexity of the state machine needed to do this is very high.
2. The *downstream* operators (who may not yet have seen rows) are the least 
likely to be able to release memory. It is the *upstream* operators (such as 
spillable operators) that might be able to spill some of the rows they are 
holding.

Presto suggests a nice alternative:

* An operator which encounters memory pressure asks the fragment executor for 
more memory.
* The fragment executor asks all *other* operators in that fragment to release 
memory if possible.

This allows a very simple memory recovery strategy:

{noformat}
  try {
// allocate something
  } catch (OutOfMemoryException e) {
context.requestMemory(this);
// allocate something again, throwing OOM if it fails again
  }
{noformat}

Proposed are two changes:

1. Retire the OUT_OF_MEMORY status. Simply remove all references to it since it 
is never sent.
2. Create a stub {{requestMemory()}} method in the operator context that does 
nothing now, but could be expanded to perform the work suggested above.



> Retire unused OUT_OF_MEMORY iterator status
> ---
>
> Key: DRILL-7487
> URL: https://issues.apache.org/jira/browse/DRILL-7487
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>
> Drill has long supported the {{OUT_OF_MEMORY}} iterator status. The idea is 
> that an operator can realize it has encountered memory pressure and ask its 
> downstream operator to free up some memory. 

[jira] [Created] (DRILL-7487) Retire unused OUT_OF_MEMORY iterator status

2019-12-12 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7487:
--

 Summary: Retire unused OUT_OF_MEMORY iterator status
 Key: DRILL-7487
 URL: https://issues.apache.org/jira/browse/DRILL-7487
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


Drill has long supported the {{OUT_OF_MEMORY}} iterator status. The idea is 
that an operator can realize it has encountered memory pressure and ask its 
downstream operator to free up some memory. However, an inspection of the code 
shows that the status is actually sent in only one place 
({{UnorderedReceiverBatch}}), and then only in response to the operator hitting 
its allocator limit (which no other batch can do anything about).

If an operator did try to use this status, it would face two key problems:

1. The operator must be able to suspend itself at any point that it might need 
memory. For example, an operator that allocates a dozen vectors must be able to 
stop on, say, the 9th vector, then resume at that point on the subsequent call 
to `next()`. The complexity of the state machine needed to do this is very high.
2. The *downstream* operators (who may not yet have seen rows) are the least 
likely to be able to release memory. It is the *upstream* operators (such as 
spillable operators) that might be able to spill some of the rows they are 
holding.

Presto suggests a nice alternative:

* An operator which encounters memory pressure asks the fragment executor for 
more memory.
* The fragment executor asks all *other* operators in that fragment to release 
memory if possible.

This allows a very simple memory recovery strategy:

{noformat}
  try {
// allocate something
  } catch (OutOfMemoryException e) {
context.requestMemory(this);
// allocate something again, throwing OOM if it fails again
  }
{noformat}

Proposed are two changes:

1. Retire the OUT_OF_MEMORY status. Simply remove all references to it since it 
is never sent.
2. Create a stub {{requestMemory()}} method in the operator context that does 
nothing now, but could be expanded to perform the work suggested above.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-5272) Text file reader is inefficient

2019-12-12 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-5272.

Resolution: Fixed

This issue was fixed when converting the text readers to use the result set 
loader framework.

> Text file reader is inefficient
> ---
>
> Key: DRILL-5272
> URL: https://issues.apache.org/jira/browse/DRILL-5272
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>
> From inspection of the ScanBatch and CompliantTextReader.
> Every batch holds about five implicit vectors. These are repeated for every 
> row, which can greatly increase incoming data size.
> When populating the vectors, the allocation starts at 8 bytes and grows to 16 
> bytes, causing a (slow) memory reallocation for every vector:
> {code}
> [org.apache.drill.exec.vector.UInt4Vector] - 
> Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of bytes: [8] -> [16]
> {code}
> Whether due to the above or to a different issue, something is causing memory 
> growth in the scan batch:
> {code}
> Entry Memory: 6,456,448
> Exit Memory: 7,636,312
> Entry Memory: 7570560
> Exit Memory: 8750424
> ...
> {code}
> Evidently the implicit vectors are added in response to a "SELECT *" query. 
> Perhaps provide them only if actually requested.
> The vectors are populated for every row, making a copy of a potentially long 
> file name and path for every record. Since the values are common to every 
> record, perhaps we can use the same data copy for each, but have the offset 
> vector for each record just point to the single copy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (DRILL-5272) Text file reader is inefficient

2019-12-12 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-5272:
--

Assignee: Paul Rogers

> Text file reader is inefficient
> ---
>
> Key: DRILL-5272
> URL: https://issues.apache.org/jira/browse/DRILL-5272
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.10.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>
> From inspection of the ScanBatch and CompliantTextReader.
> Every batch holds about five implicit vectors. These are repeated for every 
> row, which can greatly increase incoming data size.
> When populating the vectors, the allocation starts at 8 bytes and grows to 16 
> bytes, causing a (slow) memory reallocation for every vector:
> {code}
> [org.apache.drill.exec.vector.UInt4Vector] - 
> Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of bytes: [8] -> [16]
> {code}
> Whether due to the above or to a different issue, something is causing memory 
> growth in the scan batch:
> {code}
> Entry Memory: 6,456,448
> Exit Memory: 7,636,312
> Entry Memory: 7570560
> Exit Memory: 8750424
> ...
> {code}
> Evidently the implicit vectors are added in response to a "SELECT *" query. 
> Perhaps provide them only if actually requested.
> The vectors are populated for every row, making a copy of a potentially long 
> file name and path for every record. Since the values are common to every 
> record, perhaps we can use the same data copy for each, but have the offset 
> vector for each record just point to the single copy.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (DRILL-6832) Remove old "unmanaged" sort implementation

2019-12-12 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-6832:
--

Assignee: Paul Rogers

> Remove old "unmanaged" sort implementation
> --
>
> Key: DRILL-6832
> URL: https://issues.apache.org/jira/browse/DRILL-6832
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.14.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Minor
>
> Several releases back Drill introduced a new "managed" external sort that 
> enhanced the sort operator's memory management. To be safe, at the time, the 
> new version was controlled by an option, with the ability to revert to the 
> old version.
> The new version has proven to be stable. The time has come to remove the old 
> version.
> * Remove the implementation in {{physical.impl.xsort}}.
> * Move the implementation from {{physical.impl.xsort.managed}} to the parent 
> package.
> * Remove the conditional code in the batch creator.
> * Remove the option that allowed disabling the new version.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7486) Restructure row set reader builder

2019-12-12 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7486:
--

 Summary: Restructure row set reader builder
 Key: DRILL-7486
 URL: https://issues.apache.org/jira/browse/DRILL-7486
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


The code to build a row set reader is located in several places, and is tied to 
the {{RowSet}} class for historical reasons. This restructuring pulls out the 
code so it can be used from a {{VectorContainer}} or other source.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7476) Info in some sys schema tables are missing if queried with limit clause

2019-12-10 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993203#comment-16993203
 ] 

Paul Rogers commented on DRILL-7476:


Added additional batch checks and uncovered the issue:

{noformat}
Found one or more vector errors from UnorderedReceiverBatch
user - NullableVarCharVector: Value count = 1, but last set = -1
{noformat}

Provided a patch. 

> Info in some sys schema tables are missing if queried with limit clause
> ---
>
> Key: DRILL-7476
> URL: https://issues.apache.org/jira/browse/DRILL-7476
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Arina Ielchiieva
>Assignee: Paul Rogers
>Priority: Blocker
> Fix For: 1.17.0
>
>
> Affected schema: sys
> Affected tables: connections, threads, memory
> If query is executed with limit clause, information for some fields is 
> missing:
> *Connections*
> {noformat}
> apache drill (sys)> select * from connections;
> +---+---++-+---+-+-+-+--+--+
> |   user|client |  drillbit  |   established   | 
> duration  | queries | isAuthenticated | isEncrypted | usingSSL |  
>  session|
> +---+---++-+---+-+-+-+--+--+
> | anonymous | xxx.xxx.x.xxx | xxx | 2019-12-10 13:45:01.766 | 59 min 42.393 
> sec | 27  | false   | false   | false| xxx |
> +---+---++-+---+-+-+-+--+--+
> 1 row selected (0.1 seconds)
> apache drill (sys)> select * from connections limit 1;
> +--++--+-+--+-+-+-+--+-+
> | user | client | drillbit |   established   | duration | queries | 
> isAuthenticated | isEncrypted | usingSSL | session |
> +--++--+-+--+-+-+-+--+-+
> |  ||  | 2019-12-10 13:45:01.766 |  | 28  | 
> false   | false   | false| |
> +--++--+-+--+-+-+-+--+-+
> {noformat}
> *Threads*
> {noformat}
> apache drill (sys)> select * from threads;
> ++---+---+--+
> |  hostname  | user_port | total_threads | busy_threads |
> ++---+---+--+
> | xxx | 31010 | 27| 23   |
> ++---+---+--+
> 1 row selected (0.119 seconds)
> apache drill (sys)> select * from threads limit 1; 
> +--+---+---+--+
> | hostname | user_port | total_threads | busy_threads |
> +--+---+---+--+
> |  | 31010 | 27| 24   |
> {noformat}
> *Memory*
> {noformat}
> apache drill (sys)> select * from memory;
> ++---+--+++++
> |  hostname  | user_port | heap_current |  heap_max  | direct_current | 
> jvm_direct_current | direct_max |
> ++---+--+++++
> | xxx | 31010 | 493974480| 4116185088 | 5048576| 122765   
>   | 8589934592 |
> ++---+--+++++
> 1 row selected (0.115 seconds)
> apache drill (sys)> select * from memory limit 1;
> +--+---+--+++++
> | hostname | user_port | heap_current |  heap_max  | direct_current | 
> jvm_direct_current | direct_max |
> +--+---+--+++++
> |  | 31010 | 499343272| 4116185088 | 9048576| 122765  
>| 8589934592 |
> +--+---+--+++++
> {noformat}
> When selecting data from *Drillbits* table which has similar fields (ex: 
> hostname), everything is fine:
> {noformat}
> apache drill (sys)> select * from drillbits;
> 

[jira] [Updated] (DRILL-7479) Short-term fixes for metadata API parameterized type issues

2019-12-10 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7479:
---
Description: 
See DRILL-7480 for a discussion of the issues with how we currently use 
parameterized types in the metadata API.

This ticket is for short-term fixes that convert unsafe raw generic types of the 
form {{StatisticsHolder}} to the parameterized form {{StatisticsHolder<T>}} so 
that the compiler does not complain with many warnings (and a few Eclipse-only 
errors).
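
For illustration, the kind of change involved, shown with a stand-in {{Holder}} 
class rather than the real metadata API:

{code:java}
// Holder stands in for parameterized classes such as StatisticsHolder.
class Holder<T> {
  private final T value;

  Holder(T value) { this.value = value; }

  T get() { return value; }
}

class RawTypeDemo {
  public static void main(String[] args) {
    Holder raw = new Holder("x");             // raw type: "unchecked" warning
    Holder<String> typed = new Holder<>("x"); // parameterized: compiles cleanly
    System.out.println(typed.get().length()); // no cast needed on access
  }
}
{code}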

The topic should be revisited later in the context of DRILL-7480.

  was:
See DRILL- for a discussion of the issues with how we currently use 
parameterized types in the metadata API.

This ticket is for short-term fixes that convert unsafe raw generic types of the 
form {{StatisticsHolder}} to the parameterized form {{StatisticsHolder<T>}} so 
that the compiler does not complain with many warnings (and a few Eclipse-only 
errors).

The topic should be revisited later in the context of DRILL-.


> Short-term fixes for metadata API parameterized type issues
> ---
>
> Key: DRILL-7479
> URL: https://issues.apache.org/jira/browse/DRILL-7479
> Project: Apache Drill
>  Issue Type: Improvement
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> See DRILL-7480 for a discussion of the issues with how we currently use 
> parameterized types in the metadata API.
> This ticket is for short-term fixes that convert unsafe raw generic types of the 
> form {{StatisticsHolder}} to the parameterized form {{StatisticsHolder<T>}} so 
> that the compiler does not complain with many warnings (and a few Eclipse-only 
> errors).
> The topic should be revisited later in the context of DRILL-7480.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7480) Revisit parameterized type design for Metadata API

2019-12-10 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7480:
--

 Summary: Revisit parameterized type design for Metadata API
 Key: DRILL-7480
 URL: https://issues.apache.org/jira/browse/DRILL-7480
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers


Grabbed latest master and found that the code will not build in Eclipse due to 
a type mismatch in the statistics code. Specifically, the problem is that we 
have several parameterized classes, but we often omit the parameters. 
Evidently, doing so is fine for some compilers, but is an error in Eclipse.

Then, while fixing the immediate issue, I found an opposite problem: code that 
would satisfy Eclipse, but which failed in the Maven build.

I spent time making another pass through the metadata code to add type 
parameters, remove "rawtypes" ignores and so on. See DRILL-7479.

Stepping back a bit, it seems that we are perhaps using the type parameters in 
a way that does not serve our needs in this particular case.

We have many classes that hold onto particular values of some type, such as 
{{StatisticsHolder}}, which can hold a String, a Double, etc. So, we 
parameterize.

But, after that, we treat the items generically. We don't care that {{foo}} is 
a {{StatisticsHolder<String>}} and {{bar}} is a {{StatisticsHolder<Double>}}; we 
just want to create, combine and work with lists of statistics.

The same is true in several other places, such as column type, comparator type, 
etc. For comparators, we don't really care what type they compare; we just 
want, given two generic {{StatisticsHolder}}s, to get the corresponding 
comparator.

This is very similar to the situation with the "column accessors" in EVF: each 
column is a {{VARCHAR}} or a {{FLOAT8}}, but most code just treats them 
generically. So, the type-ness of the value was treated as a runtime 
attribute, not a compile-time attribute.

This is a subtle point. Most code in Drill does not work with types directly in 
Java code. Instead, Drill is an interpreter: it works with generic objects 
which, at run time, resolve to actual typed objects. It is the difference 
between writing an application (directly uses types) and writing a language 
(generically works with all types).

For example, a {{StatisticsHolder}} probably only needs to be type-aware at the 
moment it is populated or used, but not in all the generic column-level and 
table-level code. (The same is true of properties in the column metadata class, 
as an example.)

IMHO, {{StatisticsHolder}} probably wants to be a non-parameterized class. It 
should have a declaration object that, say, provides the name, type, comparator 
and other metadata. When the actual value is needed, a typed getter can be 
provided:
{code:java}
 <T> T getValue();
{code}
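
As a rough sketch of that direction (the names {{StatisticKind}} and 
{{StatisticValue}} are illustrative, not the actual metadata API): the holder 
itself is untyped, type information lives in a descriptor, and typed access 
happens only at the point of use.

{code:java}
import java.util.Comparator;

// Describes a statistic: its name, value type and comparator.
class StatisticKind {
  final String name;
  final Class<?> type;
  final Comparator<Object> comparator;

  StatisticKind(String name, Class<?> type, Comparator<Object> comparator) {
    this.name = name;
    this.type = type;
    this.comparator = comparator;
  }
}

// Non-parameterized holder: generic code can create, combine and list
// these without ever naming a type parameter.
class StatisticValue {
  private final StatisticKind kind;
  private final Object value;

  StatisticValue(StatisticKind kind, Object value) {
    this.kind = kind;
    this.value = value;
  }

  // Typed access only where the caller actually needs the type.
  @SuppressWarnings("unchecked")
  <T> T getValue() {
    return (T) value;
  }
}
{code}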
As it is, the type system is very complex but we get no value. Since it is so 
complex, the code just punted and sprinkled raw types and ignores in many 
places, which defeats the purpose of parameterized types anyway.

Suggestion: let's revisit this work after the upcoming release and see if we 
can simplify it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7479) Short-term fixes for metadata API parameterized type issues

2019-12-10 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7479:
--

 Summary: Short-term fixes for metadata API parameterized type 
issues
 Key: DRILL-7479
 URL: https://issues.apache.org/jira/browse/DRILL-7479
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


See DRILL- for a discussion of the issues with how we currently use 
parameterized types in the metadata API.

This ticket is for short-term fixes that convert unsafe raw generic types of the 
form {{StatisticsHolder}} to the parameterized form {{StatisticsHolder<T>}} so 
that the compiler does not complain with many warnings (and a few Eclipse-only 
errors).

The topic should be revisited later in the context of DRILL-.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (DRILL-7476) Info in some sys schema tables are missing if queried with limit clause

2019-12-10 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-7476:
--

Assignee: Paul Rogers  (was: Paul Rogers)

> Info in some sys schema tables are missing if queried with limit clause
> ---
>
> Key: DRILL-7476
> URL: https://issues.apache.org/jira/browse/DRILL-7476
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Arina Ielchiieva
>Assignee: Paul Rogers
>Priority: Blocker
> Fix For: 1.17.0
>
>
> Affected schema: sys
> Affected tables: connections, threads, memory
> If query is executed with limit clause, information for some fields is 
> missing:
> *Connections*
> {noformat}
> apache drill (sys)> select * from connections;
> +---+---++-+---+-+-+-+--+--+
> |   user|client |  drillbit  |   established   | 
> duration  | queries | isAuthenticated | isEncrypted | usingSSL |  
>  session|
> +---+---++-+---+-+-+-+--+--+
> | anonymous | xxx.xxx.x.xxx | xxx | 2019-12-10 13:45:01.766 | 59 min 42.393 
> sec | 27  | false   | false   | false| xxx |
> +---+---++-+---+-+-+-+--+--+
> 1 row selected (0.1 seconds)
> apache drill (sys)> select * from connections limit 1;
> +--++--+-+--+-+-+-+--+-+
> | user | client | drillbit |   established   | duration | queries | 
> isAuthenticated | isEncrypted | usingSSL | session |
> +--++--+-+--+-+-+-+--+-+
> |  ||  | 2019-12-10 13:45:01.766 |  | 28  | 
> false   | false   | false| |
> +--++--+-+--+-+-+-+--+-+
> {noformat}
> *Threads*
> {noformat}
> apache drill (sys)> select * from threads;
> ++---+---+--+
> |  hostname  | user_port | total_threads | busy_threads |
> ++---+---+--+
> | xxx | 31010 | 27| 23   |
> ++---+---+--+
> 1 row selected (0.119 seconds)
> apache drill (sys)> select * from threads limit 1; 
> +--+---+---+--+
> | hostname | user_port | total_threads | busy_threads |
> +--+---+---+--+
> |  | 31010 | 27| 24   |
> {noformat}
> *Memory*
> {noformat}
> apache drill (sys)> select * from memory;
> ++---+--+++++
> |  hostname  | user_port | heap_current |  heap_max  | direct_current | 
> jvm_direct_current | direct_max |
> ++---+--+++++
> | xxx | 31010 | 493974480| 4116185088 | 5048576| 122765   
>   | 8589934592 |
> ++---+--+++++
> 1 row selected (0.115 seconds)
> apache drill (sys)> select * from memory limit 1;
> +--+---+--+++++
> | hostname | user_port | heap_current |  heap_max  | direct_current | 
> jvm_direct_current | direct_max |
> +--+---+--+++++
> |  | 31010 | 499343272| 4116185088 | 9048576| 122765  
>| 8589934592 |
> +--+---+--+++++
> {noformat}
> When selecting data from *Drillbits* table which has similar fields (ex: 
> hostname), everything is fine:
> {noformat}
> apache drill (sys)> select * from drillbits;
> ++---+--+---+---+-+-++
> |  hostname  | user_port | control_port | data_port | http_port | current | 
> version | state  |
> 

[jira] [Commented] (DRILL-7470) drill-yarn unit tests print stack traces with NoSuchMethodError

2019-12-06 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16990071#comment-16990071
 ] 

Paul Rogers commented on DRILL-7470:


I wrote those tests originally; some of them are rather fragile because of the 
kinds of things they test. I'll take a quick look to see whether this is 
something obvious.

> drill-yarn unit tests print stack traces with NoSuchMethodError
> ---
>
> Key: DRILL-7470
> URL: https://issues.apache.org/jira/browse/DRILL-7470
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.17.0
>Reporter: Vova Vysotskyi
>Assignee: Anton Gozhiy
>Priority: Minor
>
> Looks like it was caused by the Hadoop update.
> *Steps to reproduce:*
> 1. run {{mvn clean install}}
> 2. wait until drill-yarn unit tests are finished
> 3. check output
> *Expected output:*
> {noformat}
> [INFO] --- maven-surefire-plugin:3.0.0-M3:test (default-test) @ drill-yarn ---
> [INFO] 
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> [INFO] Running org.apache.drill.yarn.zk.TestAmRegistration
> [INFO] Running org.apache.drill.yarn.zk.TestZkRegistry
> [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.096 
> s - in org.apache.drill.yarn.zk.TestAmRegistration
> [INFO] Running org.apache.drill.yarn.client.TestCommandLineOptions
> [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 
> s - in org.apache.drill.yarn.client.TestCommandLineOptions
> [INFO] Running org.apache.drill.yarn.client.TestClient
> [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.057 
> s - in org.apache.drill.yarn.client.TestClient
> [INFO] Running org.apache.drill.yarn.scripts.TestScripts
> [WARNING] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 
> 0.001 s - in org.apache.drill.yarn.scripts.TestScripts
> [INFO] Running org.apache.drill.yarn.core.TestConfig
> [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.307 
> s - in org.apache.drill.yarn.core.TestConfig
> [INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.028 
> s - in org.apache.drill.yarn.zk.TestZkRegistry
> [INFO] 
> [INFO] Results:
> [INFO] 
> [WARNING] Tests run: 11, Failures: 0, Errors: 0, Skipped: 1
> [INFO] 
> [INFO] 
> [INFO] --- maven-surefire-plugin:3.0.0-M3:test (metastore-test) @ drill-yarn 
> ---
> {noformat}
> *Actual output*
> {noformat}
> [INFO] --- maven-surefire-plugin:3.0.0-M3:test (default-test) @ drill-yarn ---
> [INFO] 
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> Failed to instantiate [ch.qos.logback.classic.LoggerContext]
> Reported exception:
> java.lang.NoSuchMethodError: 
> ch.qos.logback.core.util.Loader.getResourceOccurrenceCount(Ljava/lang/String;Ljava/lang/ClassLoader;)Ljava/util/Set;
>   at 
> ch.qos.logback.classic.util.ContextInitializer.multiplicityWarning(ContextInitializer.java:158)
>   at 
> ch.qos.logback.classic.util.ContextInitializer.statusOnResourceSearch(ContextInitializer.java:181)
>   at 
> ch.qos.logback.classic.util.ContextInitializer.findConfigFileURLFromSystemProperties(ContextInitializer.java:109)
>   at 
> ch.qos.logback.classic.util.ContextInitializer.findURLOfDefaultConfigurationFile(ContextInitializer.java:118)
>   at 
> ch.qos.logback.classic.util.ContextInitializer.autoConfig(ContextInitializer.java:146)
>   at org.slf4j.impl.StaticLoggerBinder.init(StaticLoggerBinder.java:85)
>   at 
> org.slf4j.impl.StaticLoggerBinder.<clinit>(StaticLoggerBinder.java:55)
>   at org.slf4j.LoggerFactory.bind(LoggerFactory.java:150)
>   at org.slf4j.LoggerFactory.performInitialization(LoggerFactory.java:124)
>   at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:412)
>   at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:357)
>   at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:383)
>   at 
> org.apache.drill.common.util.ProtobufPatcher.<clinit>(ProtobufPatcher.java:33)
>   at org.apache.drill.test.BaseTest.<clinit>(BaseTest.java:35)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:217)
>   at 
> org.junit.runners.BlockJUnit4ClassRunner$1.runReflectiveCall(BlockJUnit4ClassRunner.java:266)
>   

[jira] [Resolved] (DRILL-7303) Filter record batch does not handle zero-length batches

2019-11-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7303.

Resolution: Duplicate

> Filter record batch does not handle zero-length batches
> ---
>
> Key: DRILL-7303
> URL: https://issues.apache.org/jira/browse/DRILL-7303
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> Testing of the row-set-based JSON reader revealed a limitation of the Filter 
> record batch: if an incoming batch has zero records, the length of the 
> associated SV2 is left at -1. In particular:
> {code:java}
> public class SelectionVector2 implements AutoCloseable {
>   // Indicates actual number of rows in the RecordBatch
>   // container which owns this SV2 instance
>   private int batchActualRecordCount = -1;
> {code}
> Then:
> {code:java}
> public abstract class FilterTemplate2 implements Filterer {
>   @Override
>   public void filterBatch(int recordCount) throws SchemaChangeException{
> if (recordCount == 0) {
>   outgoingSelectionVector.setRecordCount(0);
>   return;
> }
> {code}
> Notice there is no call to set the actual record count. The solution is to 
> insert one line of code:
> {code:java}
> if (recordCount == 0) {
>   outgoingSelectionVector.setRecordCount(0);
>   outgoingSelectionVector.setBatchActualRecordCount(0); // <-- Add this
>   return;
> }
> {code}
> Without this, the query fails with an error due to an invalid index of -1.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-7311) Partial fixes for empty batch bugs

2019-11-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7311.

Resolution: Duplicate

> Partial fixes for empty batch bugs
> --
>
> Key: DRILL-7311
> URL: https://issues.apache.org/jira/browse/DRILL-7311
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
> Fix For: 1.18.0
>
>
> DRILL-7305 explains that multiple operators have serious bugs when presented 
> with empty batches. DRILL-7306 explains that the EVF (AKA "new scan 
> framework") was originally coded to emit an empty "fast schema" batch, but 
> that the feature was disabled because of the many empty-batch operator 
> failures.
> This ticket covers a set of partial fixes for empty-batch issues. This is the 
> result of work done to get the converted JSON reader to work with a "fast 
> schema." The JSON work, in the end, revealed that Drill has too many bugs to 
> enable fast schema, and so DRILL-7306 was implemented instead.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Resolved] (DRILL-7305) Multiple operators do not handle empty batches

2019-11-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers resolved DRILL-7305.

Resolution: Duplicate

> Multiple operators do not handle empty batches
> --
>
> Key: DRILL-7305
> URL: https://issues.apache.org/jira/browse/DRILL-7305
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Priority: Major
>
> While testing the new "EVF" framework, it was found that multiple operators 
> incorrectly handle empty batches. The EVF framework is set up to return a 
> "fast schema" empty batch with only schema as its first batch. It turns out 
> that many operators fail with problems such as:
> * Failure to set the value counts in the output container
> * Failure to initialize the offset vector position 0 to 0 for variable-width or 
> repeated vectors
> And so on.
> Partial fixes are in the JSON reader PR.
> For now, the easiest work-around is to disable the "fast schema" path in the 
> EVF: DRILL-7306.
> To discover the remaining issues, enable the 
> {{ScanOrchestratorBuilder.enableSchemaBatch}} option and run unit tests. You 
> can use the {{VectorChecker}} and {{VectorAccessorUtilities.verify()}} 
> methods to check state. Insert a call to {{verify()}} in each "next" method: 
> verify the incoming and outgoing batches. The checker only verifies a few 
> vector types; but these are enough to show many problems.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (DRILL-7324) Many vector-validity errors from unit tests

2019-11-29 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-7324:
--

Assignee: Paul Rogers

> Many vector-validity errors from unit tests
> ---
>
> Key: DRILL-7324
> URL: https://issues.apache.org/jira/browse/DRILL-7324
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> Drill's value vectors contain many counts that must be maintained in sync. 
> Drill provides a utility, {{BatchValidator}} to check (a subset of) these 
> values for consistency.
> The {{IteratorValidatorBatchIterator}} class is used in tests to validate the 
> state of each operator (AKA "record batch") as Drill runs the Volcano 
> iterator. This class can also validate vectors by setting the 
> {{VALIDATE_VECTORS}} constant to {{true}}.
> This was done, then unit tests were run. Many tests failed. Examples:
> {noformat}
> [INFO] Running org.apache.drill.TestUnionDistinct
> 18:44:26.742 [22d42585-74c2-d418-6f59-9b1870d04770:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> LimitRecordBatch
> key - NullableBitVector: Row count = 0, but value count = 2
> 18:44:26.745 [22d42585-74c2-d418-6f59-9b1870d04770:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> LimitRecordBatch
> key - NullableBitVector: Row count = 0, but value count = 2
> [INFO] Running org.apache.drill.TestUnionDistinct
> 8:44:48.302 [22d4256e-c90b-847c-5104-02d6cdf5223e:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> LimitRecordBatch
> key - NullableBitVector: Row count = 0, but value count = 2
> 18:44:48.703 [22d4256e-ccf3-2af6-f56a-140e9c3e55bb:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> FilterRecordBatch
> n_nationkey - IntVector: Row count = 2, but value count = 25
> n_regionkey - IntVector: Row count = 2, but value count = 25
> 18:44:48.731 [22d4256e-ccf3-2af6-f56a-140e9c3e55bb:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> FilterRecordBatch
> n_nationkey - IntVector: Row count = 4, but value count = 25
> n_regionkey - IntVector: Row count = 4, but value count = 25
> 18:44:49.039 [22d4256f-6b39-d2ab-d145-4f2b0db315a3:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> FilterRecordBatch
> n_nationkey - IntVector: Row count = 2, but value count = 25
> 18:44:49.363 [22d4256e-3d91-850f-9ab4-5939219ac0d0:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> FilterRecordBatch
> c_custkey - IntVector: Row count = 4, but value count = 1500
> 18:44:49.597 [22d4256d-c113-ae5c-6f31-4dd1ec091365:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> FilterRecordBatch
> n_nationkey - IntVector: Row count = 5, but value count = 25
> n_regionkey - IntVector: Row count = 5, but value count = 25
> 18:44:49.610 [22d4256d-c113-ae5c-6f31-4dd1ec091365:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> FilterRecordBatch
> r_regionkey - IntVector: Row count = 1, but value count = 5
> 18:44:53.029 [22d4256a-8b70-5f3b-f79b-806e194c5ed2:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> LimitRecordBatch
> n_nationkey - IntVector: Row count = 0, but value count = 25
> n_name - VarCharVector: Row count = 0, but value count = 25
> n_regionkey - IntVector: Row count = 0, but value count = 25
> 18:44:53.033 [22d4256a-8b70-5f3b-f79b-806e194c5ed2:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> LimitRecordBatch
> n_regionkey - IntVector: Row count = 5, but value count = 25
> 18:44:53.331 [22d4256a-526c-7815-c216-8e45752a4a6c:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> LimitRecordBatch
> n_nationkey - IntVector: Row count = 5, but value count = 25
> n_name - VarCharVector: Row count = 5, but value count = 25
> n_regionkey - IntVector: Row count = 5, but value count = 25
> 18:44:53.337 [22d4256a-526c-7815-c216-8e45752a4a6c:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> LimitRecordBatch
> n_regionkey - IntVector: Row count = 0, but value count = 25
> 18:44:53.646 [22d42569-c293-ced0-c3d0-e9153cc4a70a:frag:0:0] ERROR 
> o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from 
> LimitRecordBatch
> key - NullableBitVector: Row count = 0, but value count = 2
> Running org.apache.drill.TestTpchSingleMode
> 18:45:01.299 

[jira] [Created] (DRILL-7458) Base storage plugin framework

2019-11-26 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7458:
--

 Summary: Base storage plugin framework
 Key: DRILL-7458
 URL: https://issues.apache.org/jira/browse/DRILL-7458
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


The "Easy" framework allows third-parties to add format plugins to Drill with 
moderate effort. (The process could be easier, but "Easy" makes it as simple as 
possible given the current structure.)

At present, no such "starter" framework exists for storage plugins. Further, 
multiple storage plugins have implemented filter push down, seemingly by 
copying large blocks of code.

This ticket offers a "base" framework for storage plugins and for filter 
push-downs. The framework builds on the EVF, allowing plugins to also support 
project push down.

The framework has a "test mule" storage plugin to verify functionality, and was 
used as the basis of a REST-like plugin.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7457) Join assignment is random when table costs are identical

2019-11-22 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7457:
---
Summary: Join assignment is random when table costs are identical  (was: 
Join assignment is random when table costa are identical)

> Join assignment is random when table costs are identical
> 
>
> Key: DRILL-7457
> URL: https://issues.apache.org/jira/browse/DRILL-7457
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Priority: Minor
>
> Create a simple test: a join between two identical scans, call them t1 and 
> t2. Ensure that the scans report the same cost. Capture the logical plan. 
> Repeat the exercise several times. You will see that Drill randomly assigns 
> t1 to the left side or right side.
> Operationally this might not make a difference. But, in tests, it means that 
> trying to compare an "actual" and "golden" plan is impossible as the plans 
> are unstable.
> Also, if the estimates are the same but the actual table sizes differ, then 
> runtime performance will randomly be better on some query runs than others.
> It would be better to fall back to the SQL statement's table order when the 
> two tables are otherwise identical in cost.
> This may be a Calcite issue rather than a Drill issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7457) Join assignment is random when table costa are identical

2019-11-22 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7457:
--

 Summary: Join assignment is random when table costa are identical
 Key: DRILL-7457
 URL: https://issues.apache.org/jira/browse/DRILL-7457
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


Create a simple test: a join between two identical scans, call them t1 and t2. 
Ensure that the scans report the same cost. Capture the logical plan. Repeat 
the exercise several times. You will see that Drill randomly assigns t1 to the 
left side or right side.

Operationally this might not make a difference. But, in tests, it means that 
trying to compare an "actual" and "golden" plan is impossible as the plans are 
unstable.

Also, if the estimates are the same but the actual table sizes differ, then 
runtime performance will randomly be better on some query runs than others.

It would be better to fall back to the SQL statement's table order when the two 
tables are otherwise identical in cost.

This may be a Calcite issue rather than a Drill issue.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7456) Batch count fixes for 12 additional operators

2019-11-22 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7456:
--

 Summary: Batch count fixes for 12 additional operators
 Key: DRILL-7456
 URL: https://issues.apache.org/jira/browse/DRILL-7456
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


Enables batch validation for 12 additional operators:

* MergingRecordBatch
* OrderedPartitionRecordBatch
* RangePartitionRecordBatch
* TraceRecordBatch
* UnionAllRecordBatch
* UnorderedReceiverBatch
* UnpivotMapsRecordBatch
* WindowFrameRecordBatch
* TopNBatch
* HashJoinBatch
* ExternalSortBatch
* WriterRecordBatch

Fixes issues found with those checks so that this set of operators passes all 
checks.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7455) "Renaming" projection operator to avoid physical copies

2019-11-22 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7455:
--

 Summary: "Renaming" projection operator to avoid physical copies
 Key: DRILL-7455
 URL: https://issues.apache.org/jira/browse/DRILL-7455
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers


Drill/Calcite inserts project operators for three main reasons:

1. To compute a new column: {{SELECT a + b AS c ...}}

2. To rename columns: {{SELECT a AS x ...}}

3. To remove columns: {{SELECT a ...}} when the data source provides columns 
{{a}} and {{b}}.

Example of case 2 (renaming):

{code:json}
"pop" : "project",
"@id" : 4,
"exprs" : [ {
  "ref" : "`a0`",
  "expr" : "`a`"
}, {
  "ref" : "`b0`",
  "expr" : "`b`"
} ],
{code}

Of these, only case 1 requires row-by-row computation of new values. Case 2 
simply creates a new vector with only the name changed but the same data. Case 
3 preserves some vectors and drops others.

In cases 2 and 3, a simple data transfer from input to output would be 
adequate. Yet, stepping through the code (with generated-code debugging 
enabled) shows that Drill iterates over each record in all three cases, even 
calling an empty per-record compute block.

A better-performing solution is to separate out the renames and drops (cases 2 
and 3) from the column computations (case 1). This can be done in either of two 
ways:

1. At plan time, identify that all columns are renames, and replace the 
row-by-row project with a column-level project.

2. At run time, identify the column-level projections (cases 2 and 3) and 
handle those with transfer pairs, doing row-by-row computes only when case 1 
expressions exist (see the sketch below).
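
As a rough sketch of option 2, with maps standing in for the vector container 
and arrays standing in for vector buffers (the real operator would use value 
vectors and transfer pairs):

{code:java}
import java.util.LinkedHashMap;
import java.util.Map;

class ColumnLevelProject {
  // Handles cases 2 (rename) and 3 (drop) with no per-row work: each
  // output column simply references the input column's existing data.
  static Map<String, int[]> project(Map<String, int[]> input,
                                    Map<String, String> outToIn) {
    Map<String, int[]> output = new LinkedHashMap<>();
    // outToIn maps output column name -> input column name.
    outToIn.forEach((outName, inName) ->
        output.put(outName, input.get(inName)));  // transfer, not copy
    return output;  // input columns absent from outToIn are dropped
  }
}
{code}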

Since row-by-row copies are among the most expensive operations in Drill, this 
optimization could improve performance by a decent amount.

Note that a further optimization is to remove "trivial" projects such as the 
following:

{code:json}
"pop" : "project",
"@id" : 2,
"exprs" : [ {
  "ref" : "`a`",
  "expr" : "`a`"
}, {
  "ref" : "`b`",
  "expr" : "`b`"
}, {
  "ref" : "`b0`",
  "expr" : "`b0`"
} ],
{code}

The only value of such a projection is to say, "remove all vectors except 
{{a}}, {{b}} and {{b0}}." In fact, the only time such a projection should be 
needed is:

1. On top of a data source that does not support projection push down.

2. When Calcite knows it wants to discard certain intermediate columns.

Otherwise, Calcite knows which columns emerge from operator x, and should not 
need to add a project to enforce that schema if it is already what the project 
will emit.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7451) Planner inserts "trivial" top project node for simple query

2019-11-19 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7451:
---
Summary: Planner inserts "trivial" top project node for simple query  (was: 
Planner inserts project node even if scan handles project push-down)

> Planner inserts "trivial" top project node for simple query
> ---
>
> Key: DRILL-7451
> URL: https://issues.apache.org/jira/browse/DRILL-7451
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Priority: Minor
>
> I created a "dummy" storage plugin for testing. The test does a simple query:
> {code:sql}
> SELECT a, b, c from dummy.myTable
> {code}
> The first test is to mark the plugin's group scan as supporting projection 
> push down. However, Drill still creates a projection node in the logical plan:
> {code:json}
>   "graph" : [ {
> "pop" : "DummyGroupScan",
> "@id" : 2,
> "columns" : [ "`**`" ],
> "userName" : "progers",
> "cost" : {
>   "memoryCost" : 1.6777216E7,
>   "outputRowCount" : 1.0
> }
>   }, {
> "pop" : "project",
> "@id" : 1,
> "exprs" : [ {
>   "ref" : "`a`",
>   "expr" : "`a`"
> }, {
>   "ref" : "`b`",
>   "expr" : "`b`"
> }, {
>   "ref" : "`c`",
>   "expr" : "`c`"
> } ],
> "child" : 2,
> "outputProj" : true,
> "initialAllocation" : 100,
> "maxAllocation" : 100,
> "cost" : {
>   "memoryCost" : 1.6777216E7,
>   "outputRowCount" : 1.0
> }
>   }, {
> "pop" : "screen",
> "@id" : 0,
> "child" : 1,
> "initialAllocation" : 100,
> "maxAllocation" : 100,
> "cost" : {
>   "memoryCost" : 1.6777216E7,
>   "outputRowCount" : 1.0
> }
>   } ]
> {code}
> There is [a comment in the 
> code|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjectIntoScanRule.java#L109]
>  that suggests the project should be removed:
> {code:java}
> // project above scan may be removed in ProjectRemoveRule for
> // the case when it is trivial
> {code}
> As shown in the example, the project is trivial. There is a subtlety: it may 
> be that the scan, unknown to the planner, produces additional columns, say 
> {{d}} and {{e}}, which the project operator is then needed to remove.
> If this is the reason the project remains, perhaps we can add a flag of some 
> kind where the group scan can insist that not only does it handle projection, 
> it will not insert additional columns. At that point, the project is 
> completely unnecessary in this case.
> This is not a functional bug; just a performance issue: we exercise the 
> machinery of the project operator to do exactly nothing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7451) Planner inserts project node even if scan handles project push-down

2019-11-19 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978120#comment-16978120
 ] 

Paul Rogers commented on DRILL-7451:


It appears that the actual behavior is a bit more complex. Run the same test as 
above, with the same query, but now mark the plugin as *not* supporting 
projection pushdown. In this case we get two projects. This suggests that the 
top project is added for a different reason, but it is still trivial and 
should be removed.

Logical plan with scan project pushdown disabled:

{code:json}
  "graph" : [ {
"pop" : "DummyGroupScan",
"@id" : 3,
"columns" : [ "`**`" ],
"userName" : "progers",
"cost" : {
  "memoryCost" : 1.6777216E7,
  "outputRowCount" : 1.0
}
  }, {
"pop" : "project",
"@id" : 2,
"exprs" : [ {
  "ref" : "`a`",
  "expr" : "`a`"
}, {
  "ref" : "`b`",
  "expr" : "`b`"
}, {
  "ref" : "`c`",
  "expr" : "`c`"
} ],
"child" : 3,
"outputProj" : true,
"initialAllocation" : 100,
"maxAllocation" : 100,
"cost" : {
  "memoryCost" : 1.6777216E7,
  "outputRowCount" : 1.0
}
  }, {
"pop" : "project",
"@id" : 1,
"exprs" : [ {
  "ref" : "`a`",
  "expr" : "`a`"
}, {
  "ref" : "`b`",
  "expr" : "`b`"
}, {
  "ref" : "`c`",
  "expr" : "`c`"
} ],
"child" : 2,
"outputProj" : true,
"initialAllocation" : 100,
"maxAllocation" : 100,
"cost" : {
  "memoryCost" : 1.6777216E7,
  "outputRowCount" : 1.0
}
  }, {
"pop" : "screen",
"@id" : 0,
"child" : 1,
"initialAllocation" : 100,
"maxAllocation" : 100,
"cost" : {
  "memoryCost" : 1.6777216E7,
  "outputRowCount" : 1.0
}
  } ]
{code}


> Planner inserts project node even if scan handles project push-down
> ---
>
> Key: DRILL-7451
> URL: https://issues.apache.org/jira/browse/DRILL-7451
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Priority: Minor
>
> I created a "dummy" storage plugin for testing. The test does a simple query:
> {code:sql}
> SELECT a, b, c from dummy.myTable
> {code}
> The first test is to mark the plugin's group scan as supporting projection 
> push down. However, Drill still creates a projection node in the logical plan:
> {code:json}
>   "graph" : [ {
> "pop" : "DummyGroupScan",
> "@id" : 2,
> "columns" : [ "`**`" ],
> "userName" : "progers",
> "cost" : {
>   "memoryCost" : 1.6777216E7,
>   "outputRowCount" : 1.0
> }
>   }, {
> "pop" : "project",
> "@id" : 1,
> "exprs" : [ {
>   "ref" : "`a`",
>   "expr" : "`a`"
> }, {
>   "ref" : "`b`",
>   "expr" : "`b`"
> }, {
>   "ref" : "`c`",
>   "expr" : "`c`"
> } ],
> "child" : 2,
> "outputProj" : true,
> "initialAllocation" : 100,
> "maxAllocation" : 100,
> "cost" : {
>   "memoryCost" : 1.6777216E7,
>   "outputRowCount" : 1.0
> }
>   }, {
> "pop" : "screen",
> "@id" : 0,
> "child" : 1,
> "initialAllocation" : 100,
> "maxAllocation" : 100,
> "cost" : {
>   "memoryCost" : 1.6777216E7,
>   "outputRowCount" : 1.0
> }
>   } ]
> {code}
> There is [a comment in the 
> code|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjectIntoScanRule.java#L109]
>  that suggests the project should be removed:
> {code:java}
> // project above scan may be removed in ProjectRemoveRule for
> // the case when it is trivial
> {code}
> As shown in the example, the project is trivial. There is a subtlety: it may 
> be that the scan, unknown to the planner, produces additional columns, say 
> {{d}} and {{e}}, which the project operator is then needed to remove.
> If this is the reason the project remains, perhaps we can add a flag of some 
> kind where the group scan can insist that not only does it handle projection, 
> it will not insert additional columns. At that point, the project is 
> completely unnecessary in this case.
> This is not a functional bug; just a performance issue: we exercise the 
> machinery of the project operator to do exactly nothing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7451) Planner inserts project node even if scan handles project push-down

2019-11-19 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7451:
--

 Summary: Planner inserts project node even if scan handles project 
push-down
 Key: DRILL-7451
 URL: https://issues.apache.org/jira/browse/DRILL-7451
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


I created a "dummy" storage plugin for testing. The test does a simple query:

{code:sql}
SELECT a, b, c from dummy.myTable
{code}

The first test is to mark the plugin's group scan as supporting projection push 
down. However, Drill still creates a projection node in the logical plan:

{code:json}
  "graph" : [ {
"pop" : "DummyGroupScan",
"@id" : 2,
"columns" : [ "`**`" ],
"userName" : "progers",
"cost" : {
  "memoryCost" : 1.6777216E7,
  "outputRowCount" : 1.0
}
  }, {
"pop" : "project",
"@id" : 1,
"exprs" : [ {
  "ref" : "`a`",
  "expr" : "`a`"
}, {
  "ref" : "`b`",
  "expr" : "`b`"
}, {
  "ref" : "`c`",
  "expr" : "`c`"
} ],
"child" : 2,
"outputProj" : true,
"initialAllocation" : 100,
"maxAllocation" : 100,
"cost" : {
  "memoryCost" : 1.6777216E7,
  "outputRowCount" : 1.0
}
  }, {
"pop" : "screen",
"@id" : 0,
"child" : 1,
"initialAllocation" : 100,
"maxAllocation" : 100,
"cost" : {
  "memoryCost" : 1.6777216E7,
  "outputRowCount" : 1.0
}
  } ]
{code}

There is [a comment in the 
code|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjectIntoScanRule.java#L109]
 that suggests the project should be removed:

{code:java}
// project above scan may be removed in ProjectRemoveRule for
// the case when it is trivial
{code}

As shown in the example, the project is trivial. There is a subtlety: it may be 
that the scan, unknown to the planner, produces additional columns, say {{d}} 
and {{e}}, which the project operator is then needed to remove.

If this is the reason the project remains, perhaps we can add a flag of some 
kind where the group scan can insist that not only does it handle projection, 
it will not insert additional columns. At that point, the project is completely 
unnecessary in this case.
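For concreteness, the flag might look something like the sketch below. The 
interface and method names here are hypothetical, not existing Drill APIs; the 
idea is only that the group scan makes a stronger promise than "handles 
projection":

{code:java}
// Hypothetical sketch, not an existing Drill interface. A group scan that
// returns true from both methods emits exactly the projected columns, so
// the planner could safely skip the trivial project.
public interface ExactProjectionGroupScan {

  // Existing notion: the scan prunes unprojected columns itself.
  boolean supportsProjectPushdown();

  // Proposed addition: the scan also promises never to emit extra
  // columns (such as d or e above) beyond the projected set.
  boolean emitsOnlyProjectedColumns();
}
{code}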

This is not a functional bug; just a performance issue: we exercise the 
machinery of the project operator to do exactly nothing.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7448) Fix warnings when running Drill memory tests

2019-11-17 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976215#comment-16976215
 ] 

Paul Rogers commented on DRILL-7448:


Occurs in the vector module tests also.

> Fix warnings when running Drill memory tests
> 
>
> Key: DRILL-7448
> URL: https://issues.apache.org/jira/browse/DRILL-7448
> Project: Apache Drill
>  Issue Type: Bug
>Affects Versions: 1.16.0
>Reporter: Arina Ielchiieva
>Assignee: Bohdan Kazydub
>Priority: Minor
> Fix For: 1.17.0
>
>
> {noformat}
> -- drill-memory-base 
> [INFO] ---
> [INFO]  T E S T S
> [INFO] ---
> [INFO] Running org.apache.drill.exec.memory.TestEndianess
> [INFO] Running org.apache.drill.exec.memory.TestAccountant
> 16:21:45,719 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could 
> NOT find resource [logback.groovy]
> 16:21:45,719 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found 
> resource [logback-test.xml] at 
> [jar:file:/Users/arina/Development/git_repo/drill/common/target/drill-common-1.17.0-SNAPSHOT-tests.jar!/logback-test.xml]
> 16:21:45,733 |-INFO in 
> ch.qos.logback.core.joran.spi.ConfigurationWatchList@dbd940d - URL 
> [jar:file:/Users/arina/Development/git_repo/drill/common/target/drill-common-1.17.0-SNAPSHOT-tests.jar!/logback-test.xml]
>  is not of type file
> 16:21:45,780 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 16:21:45,802 |-ERROR in ch.qos.logback.core.joran.conditional.IfAction - 
> Could not find Janino library on the class path. Skipping conditional 
> processing.
> 16:21:45,802 |-ERROR in ch.qos.logback.core.joran.conditional.IfAction - See 
> also http://logback.qos.ch/codes.html#ifJanino
> 16:21:45,803 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 16:21:45,811 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 16:21:45,826 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 16:21:45,866 |-INFO in ch.qos.logback.classic.joran.action.LevelAction - ROOT 
> level set to ERROR
> 16:21:45,866 |-ERROR in ch.qos.logback.core.joran.conditional.IfAction - 
> Could not find Janino library on the class path. Skipping conditional 
> processing.
> 16:21:45,866 |-ERROR in ch.qos.logback.core.joran.conditional.IfAction - See 
> also http://logback.qos.ch/codes.html#ifJanino
> 16:21:45,866 |-WARN in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> The object on the top the of the stack is not the root logger
> 16:21:45,866 |-WARN in ch.qos.logback.classic.joran.action.RootLoggerAction - 
> It is: ch.qos.logback.core.joran.conditional.IfAction
> 16:21:45,866 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - End of 
> configuration.
> 16:21:45,867 |-INFO in 
> ch.qos.logback.classic.joran.JoranConfigurator@71d15f18 - Registering current 
> configuration as safe fallback point
> 16:21:45,717 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could 
> NOT find resource [logback.groovy]
> 16:21:45,717 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found 
> resource [logback-test.xml] at 
> [jar:file:/Users/arina/Development/git_repo/drill/common/target/drill-common-1.17.0-SNAPSHOT-tests.jar!/logback-test.xml]
> 16:21:45,729 |-INFO in 
> ch.qos.logback.core.joran.spi.ConfigurationWatchList@2698dc7 - URL 
> [jar:file:/Users/arina/Development/git_repo/drill/common/target/drill-common-1.17.0-SNAPSHOT-tests.jar!/logback-test.xml]
>  is not of type file
> 16:21:45,778 |-INFO in 
> ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not 
> set
> 16:21:45,807 |-ERROR in ch.qos.logback.core.joran.conditional.IfAction - 
> Could not find Janino library on the class path. Skipping conditional 
> processing.
> 16:21:45,807 |-ERROR in ch.qos.logback.core.joran.conditional.IfAction - See 
> also http://logback.qos.ch/codes.html#ifJanino
> 16:21:45,808 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender]
> 16:21:45,814 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - 
> Naming appender as [STDOUT]
> 16:21:45,829 |-INFO in 
> ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default 
> type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] 
> property
> 16:21:45,868 |-INFO in ch.qos.logback.classic.joran.action.LevelAction - ROOT 
> level set to ERROR
> 16:21:45,868 |-ERROR in 

[jira] [Created] (DRILL-7447) Simplify the Mock reader

2019-11-16 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7447:
--

 Summary: Simplify the Mock reader
 Key: DRILL-7447
 URL: https://issues.apache.org/jira/browse/DRILL-7447
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


The mock reader is used to generate large volumes of data. It has evolved over 
time and has many crufty vestiges of prior implementations.

Also, the mock reader allows specifying that columns are nullable, and the rate 
of null values. This change adds to the existing "encoding" to allow specifying 
this property via SQL: add an "n" to the column name to mark it nullable, 
followed by a number giving the percentage of nulls. To specify INT columns 
with 10%, 50% and 90% nulls:

{noformat}
SELECT a_in10, b_in50, c_in90 FROM mock.dummy1000
{noformat}

The default is 25% nulls (which already existed in the code) if no numeric 
suffix is provided.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7446) Eclipse compilation issue in AbstractParquetGroupScan

2019-11-16 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7446:
--

 Summary: Eclipse compilation issue in AbstractParquetGroupScan
 Key: DRILL-7446
 URL: https://issues.apache.org/jira/browse/DRILL-7446
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


When the recent master branch is loaded in Eclipse, we get compiler errors in 
{{AbstractParquetGroupScan}}:

{noformat}
The method getFiltered(OptionManager, FilterPredicate) from the type
AbstractGroupScanWithMetadata.GroupScanWithMetadataFilterer is not visible
AbstractParquetGroupScan.java
/drill-java-exec/src/main/java/org/apache/drill/exec/store/parquet  line 242
Java Problem

Type mismatch: cannot convert from
AbstractGroupScanWithMetadata.GroupScanWithMetadataFilterer to
AbstractParquetGroupScan.RowGroupScanFilterer
AbstractParquetGroupScan.java
/drill-java-exec/src/main/java/org/apache/drill/exec/store/parquet  line 237
Java Problem
{noformat}

The issue appears to be due to using the raw type rather than parameterizing 
it.
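A minimal, generic illustration of the raw-type trap (not the Drill classes 
themselves; Eclipse's compiler is stricter than javac in some such cases):

{code:java}
// Self-recursive generic, similar in shape to GroupScanWithMetadataFilterer.
class Filterer<T extends Filterer<T>> {
  T getFiltered() { return null; }
}

class RowGroupFilterer extends Filterer<RowGroupFilterer> { }

class Demo {
  RowGroupFilterer typed() {
    return new RowGroupFilterer().getFiltered(); // fine: T = RowGroupFilterer
  }

  RowGroupFilterer viaRawType() {
    Filterer raw = new RowGroupFilterer(); // raw type: erases T everywhere
    // return raw.getFiltered();           // "Type mismatch: cannot convert
    //                                     //  from Filterer to RowGroupFilterer"
    return (RowGroupFilterer) raw.getFiltered(); // needs an explicit cast
  }
}
{code}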




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7233) Format Plugin for HDF5

2019-11-16 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7233:
---
Reviewer: Paul Rogers
  Labels: doc-impacting ready-to-commit  (was: doc-impacting)

> Format Plugin for HDF5
> --
>
> Key: DRILL-7233
> URL: https://issues.apache.org/jira/browse/DRILL-7233
> Project: Apache Drill
>  Issue Type: New Feature
>Affects Versions: 1.17.0
>Reporter: Charles Givre
>Assignee: Charles Givre
>Priority: Major
>  Labels: doc-impacting, ready-to-commit
> Fix For: 1.18.0
>
>
> h2. Drill HDF5 Format Plugin
> Per wikipedia, Hierarchical Data Format (HDF) is a set of file formats 
> designed to store and organize large amounts of data. Originally developed at 
> the National Center for Supercomputing Applications, it is supported by The 
> HDF Group, a non-profit corporation whose mission is to ensure continued 
> development of HDF5 technologies and the continued accessibility of data 
> stored in HDF.
> This plugin enables Apache Drill to query HDF5 files.
> h3. Configuration
> There are three configuration variables in this plugin:
> type: This should be set to hdf5.
> extensions: This is a list of the file extensions used to identify HDF5 
> files. Typically HDF5 uses .h5 or .hdf5 as file extensions. This defaults to 
> .h5.
> defaultPath:
> h3. Example Configuration
> For most uses, the configuration below will suffice to enable Drill to query 
> HDF5 files.
> {{"hdf5": {
>   "type": "hdf5",
>   "extensions": [
> "h5"
>   ],
>   "defaultPath": null
> }}}
> h3. Usage
> Since HDF5 can be viewed as a file system within a file, a single file can 
> contain many datasets. For instance, if you have a simple HDF5 file, a star 
> query will produce the following result:
> {{apache drill> select * from dfs.test.`dset.h5`;
> +---+---+---+--+
> | path  | data_type | file_name | int_data
>  |
> +---+---+---+--+
> | /dset | DATASET   | dset.h5   | 
> [[1,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18],[19,20,21,22,23,24]] |
> +---+---+---+--+}}
> The actual data in this file is mapped to a column called int_data. In order 
> to effectively access the data, you should use Drill's FLATTEN() function on 
> the int_data column, which produces the following result.
> {{apache drill> select flatten(int_data) as int_data from dfs.test.`dset.h5`;
> +-+
> |  int_data   |
> +-+
> | [1,2,3,4,5,6]   |
> | [7,8,9,10,11,12]|
> | [13,14,15,16,17,18] |
> | [19,20,21,22,23,24] |
> +-+}}
> Once you have the data in this form, you can access it similarly to how you 
> might access nested data in JSON or other files.
> {{apache drill> SELECT int_data[0] as col_0,
> . .semicolon> int_data[1] as col_1,
> . .semicolon> int_data[2] as col_2
> . .semicolon> FROM ( SELECT flatten(int_data) AS int_data
> . . . . . .)> FROM dfs.test.`dset.h5`
> . . . . . .)> );
> +---+---+---+
> | col_0 | col_1 | col_2 |
> +---+---+---+
> | 1 | 2 | 3 |
> | 7 | 8 | 9 |
> | 13| 14| 15|
> | 19| 20| 21|
> +---+---+---+}}
> Alternatively, a better way to query the actual data in an HDF5 file is to 
> use the defaultPath field in your query. If the defaultPath field is defined 
> in the query, or via the plugin configuration, Drill will only return the 
> data, rather than the file metadata.
> ** Note: Once you have determined which data set you are querying, it is 
> advisable to use this method to query HDF5 data. **
> You can set the defaultPath variable in either the plugin configuration, or 
> at query time using the table() function as shown in the example below:
> {{SELECT * 
> FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'))}}
> This query will return the result below:
> {{apache drill> SELECT * FROM table(dfs.test.`dset.h5` (type => 'hdf5', 
> defaultPath => '/dset'));
> +---+---+---+---+---+---+
> | int_col_0 | int_col_1 | int_col_2 | int_col_3 | int_col_4 | int_col_5 |
> +---+---+---+---+---+---+
> | 1 | 2 | 3 | 4 | 5 | 6 |
> | 7 | 8 | 9 | 10| 11| 12|
> | 13| 14| 15| 16| 17| 18|
> | 19| 20| 21   

[jira] [Comment Edited] (DRILL-7352) Introduce new checkstyle rules to make code style more consistent

2019-11-16 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956669#comment-16956669
 ] 

Paul Rogers edited comment on DRILL-7352 at 11/17/19 2:08 AM:
--

Start with the [existing set of 
rules|http://drill.apache.org/docs/apache-drill-contribution-guidelines/].

* Import order. Typical order: {{java}}, {{javax}}, {{org}}, {{com}}. Static 
imports at the top.
* Use {{final}} aggressively on fields, do not use it on local variables or 
parameters.
* {{case}} statements indent one level in from the {{switch}} statements.

Once decisions are finalized, update the format files for Eclipse and IntelliJ.


was (Author: paul.rogers):
Start with the [existing set of 
rules|http://drill.apache.org/docs/apache-drill-contribution-guidelines/].

* Import order. Typical order: `java`, `javax`, `org`, `com`. Static imports at 
the top.
* Use `final` aggressively on fields, do not use it on local variables or 
parameters.
* `case` statements indent one level in from the `switch` statements.

Once decisions are finalized, update the format files for Eclipse and IntelliJ.

> Introduce new checkstyle rules to make code style more consistent
> -
>
> Key: DRILL-7352
> URL: https://issues.apache.org/jira/browse/DRILL-7352
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Vova Vysotskyi
>Priority: Major
>
> Source - https://checkstyle.sourceforge.io/checks.html
> List of rules to be enabled:
> * [LeftCurly|https://checkstyle.sourceforge.io/config_blocks.html#LeftCurly] 
> - force placement of a left curly brace at the end of the line.
> * 
> [RightCurly|https://checkstyle.sourceforge.io/config_blocks.html#RightCurly] 
> - force placement of a right curly brace
> * 
> [NewlineAtEndOfFile|https://checkstyle.sourceforge.io/config_misc.html#NewlineAtEndOfFile]
> * 
> [UnnecessaryParentheses|https://checkstyle.sourceforge.io/config_coding.html#UnnecessaryParentheses]
> * 
> [MethodParamPad|https://checkstyle.sourceforge.io/config_whitespace.html#MethodParamPad]
> * [InnerTypeLast 
> |https://checkstyle.sourceforge.io/config_design.html#InnerTypeLast]
> * 
> [MissingOverride|https://checkstyle.sourceforge.io/config_annotation.html#MissingOverride]
> * 
> [InvalidJavadocPosition|https://checkstyle.sourceforge.io/config_javadoc.html#InvalidJavadocPosition]
> * 
> [ArrayTypeStyle|https://checkstyle.sourceforge.io/config_misc.html#ArrayTypeStyle]
> * [UpperEll|https://checkstyle.sourceforge.io/config_misc.html#UpperEll]
> and others



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Comment Edited] (DRILL-7352) Introduce new checkstyle rules to make code style more consistent

2019-11-16 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956669#comment-16956669
 ] 

Paul Rogers edited comment on DRILL-7352 at 11/17/19 2:07 AM:
--

Start with the [existing set of 
rules|http://drill.apache.org/docs/apache-drill-contribution-guidelines/].

* Import order. Typical order: `java`, `javax`, `org`, `com`. Static imports at 
the top.
* Use `final` aggressively on fields, do not use it on local variables or 
parameters.
* `case` statements indent one level in from the `switch` statements.

Once decisions are finalized, update the format files for Eclipse and IntelliJ.


was (Author: paul.rogers):
Start with the [existing set of 
rules|http://drill.apache.org/docs/apache-drill-contribution-guidelines/].

* Import order. Typical order: `java`, `javax`, `org`, `com`. Static imports at 
the top.
* Use `final` aggressively on fields, do not use it on local variables or 
parameters.

Once decisions are finalized, update the format files for Eclipse and IntelliJ.

> Introduce new checkstyle rules to make code style more consistent
> -
>
> Key: DRILL-7352
> URL: https://issues.apache.org/jira/browse/DRILL-7352
> Project: Apache Drill
>  Issue Type: Task
>Reporter: Vova Vysotskyi
>Priority: Major
>
> Source - https://checkstyle.sourceforge.io/checks.html
> List of rules to be enabled:
> * [LeftCurly|https://checkstyle.sourceforge.io/config_blocks.html#LeftCurly] 
> - force placement of a left curly brace at the end of the line.
> * 
> [RightCurly|https://checkstyle.sourceforge.io/config_blocks.html#RightCurly] 
> - force placement of a right curly brace
> * 
> [NewlineAtEndOfFile|https://checkstyle.sourceforge.io/config_misc.html#NewlineAtEndOfFile]
> * 
> [UnnecessaryParentheses|https://checkstyle.sourceforge.io/config_coding.html#UnnecessaryParentheses]
> * 
> [MethodParamPad|https://checkstyle.sourceforge.io/config_whitespace.html#MethodParamPad]
> * [InnerTypeLast 
> |https://checkstyle.sourceforge.io/config_design.html#InnerTypeLast]
> * 
> [MissingOverride|https://checkstyle.sourceforge.io/config_annotation.html#MissingOverride]
> * 
> [InvalidJavadocPosition|https://checkstyle.sourceforge.io/config_javadoc.html#InvalidJavadocPosition]
> * 
> [ArrayTypeStyle|https://checkstyle.sourceforge.io/config_misc.html#ArrayTypeStyle]
> * [UpperEll|https://checkstyle.sourceforge.io/config_misc.html#UpperEll]
> and others



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7445) Create batch copier based on result set framework

2019-11-14 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7445:
--

 Summary: Create batch copier based on result set framework
 Key: DRILL-7445
 URL: https://issues.apache.org/jira/browse/DRILL-7445
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


The result set framework now provides both a reader and writer. Provide a 
copier that copies batches using this framework. Such a copier can:

* Copy selected records
* Copy all records, such as for an SV2 or SV4
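
The heart of such a copier is a per-row loop over the row-set accessors, 
roughly as sketched below. The accessor names ({{start()}}, {{save()}}, 
{{scalar()}}, {{setObject()}}, {{getObject()}}) follow the row-set framework, 
though package locations have shifted during this work, so treat the imports 
as indicative; the copier method itself is what this ticket would add:

{code:java}
import org.apache.drill.exec.physical.rowSet.RowSetLoader;
import org.apache.drill.exec.physical.rowSet.RowSetReader;

public class CopierSketch {
  // Sketch only: copies the current row, scalar columns only, from a
  // positioned reader into a result-set-loader writer. An SV2/SV4 source
  // would simply position the reader through the selection vector.
  public static void copyRow(RowSetReader reader, RowSetLoader writer) {
    writer.start();
    for (int i = 0; i < reader.tupleSchema().size(); i++) {
      writer.scalar(i).setObject(reader.scalar(i).getObject());
    }
    writer.save();
  }
}
{code}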




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7444) JSON blank result on SELECT when too much byte in multiple files on Drill embedded

2019-11-14 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974607#comment-16974607
 ] 

Paul Rogers commented on DRILL-7444:


This is an odd one; there are none of the usual schema ambiguity issues that 
can affect JSON.

I'll take a look at this since I've got some JSON work pending.

> JSON blank result on SELECT when too much byte in multiple files on Drill 
> embedded
> --
>
> Key: DRILL-7444
> URL: https://issues.apache.org/jira/browse/DRILL-7444
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Storage - JSON
>Affects Versions: 1.17.0
>Reporter: benj
>Priority: Major
>
> Two files (a.json and b.json) and the concatenation of these two files 
> (ab.json) produce different results on a simple _SELECT_ when using +Drill 
> embedded+.
> The problem appears above a certain number of bytes (~102,400,000 in my case).
> {code:bash}
> #!/bin/bash
> # script gen.sh to reproduce the problem
> for ((i=1;i<=$1;++i));
> do
> echo -n '{"At":"'
> for j in {1..999};
> do
>   echo -n 'ab'
> done
> echo '"}'
> done
> {code}
> {noformat}
> == I ==
> $ gen.sh 1 > a.json
> $ gen.sh 239 > b.json
> $ wc -c *.json
> 1 a.json
>   239 b.json
> 10239 total
> $ bash drill-embedded
> apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
> ++
> |   At   |
> ++
> | aab... |
> ++
> => All is fine here
> == II ==
> $ gen.sh 1 > a.json
> $ gen.sh 240 > b.json
> $ wc -c *.json
> 1 a.json
>   240 b.json
> 10240 total
> $ bash drill-embedded
> apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1;
> ++
> |   At   |
> ++
> ||
> ++
> => In a surprising way field `At` is empty
> == III ==
> $ gen.sh 10240 > ab.json
> $ wc -c *.json 
> 10240 ab.json
> $ bash drill-embedded
> apache drill> SELECT * FROM dfs.tmp.`ab.json` LIMIT 1;
> ++ 
> |At  |
> ++
> | aab... |
> ++
> => All is fine here although the number of lines is equal to case II
>   {noformat}
> The version of Drill 1.17 tested here is the latest as of 2019-11-13.
> This problem doesn't appear with Drill embedded 1.16.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7442) Create multi-batch row set reader

2019-11-10 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7442:
--

 Summary: Create multi-batch row set reader
 Key: DRILL-7442
 URL: https://issues.apache.org/jira/browse/DRILL-7442
 Project: Apache Drill
  Issue Type: Improvement
Reporter: Paul Rogers
Assignee: Paul Rogers


The "row set" work provided a {{RowSetWriter}} and {{RowSetReader}} to write to 
and read from a single batch. The {{ResultSetLoader}} class provided a writer 
that spans multiple batches, handling schema changes across batches and so on.

This ticket introduces a reader equivalent, the {{ResultSetReader}} that reads 
an entire result set of multiple batches, handling schema changes along the way.
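Hypothetical usage, to make the intent concrete. Since this ticket is what 
introduces {{ResultSetReader}}, the method names below ({{next()}}, 
{{reader()}}) are illustrative rather than an existing API:

{code:java}
// Sketch: drain an entire result set, batch by batch. A schema change
// between batches would surface here rather than inside the row loop.
public static void readAll(ResultSetReader rsReader) {
  while (rsReader.next()) {          // advance to the next batch
    RowSetReader reader = rsReader.reader();
    while (reader.next()) {          // iterate the rows of that batch
      // access columns via reader.scalar(...), etc.
    }
  }
}
{code}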



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7441) Fix issues with fillEmpties, offset vectors

2019-11-10 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7441:
--

 Summary: Fix issues with fillEmpties, offset vectors
 Key: DRILL-7441
 URL: https://issues.apache.org/jira/browse/DRILL-7441
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


Enabling the vector validator with full testing of offset vectors causes a 
number of operators to trigger errors. Tracking down the issues, and adding 
detailed tests, shows that:

* Drill has an informal standard that zero-length batches should have 
zero-length offset vectors, while a batch of size 1 will have offset vectors of 
size 2. Thus, zero-length is a special case.
* Nullable, repeated and variable-width vectors have "fill empties" logic that 
is used in two places: when setting the value count and when preparing to write 
a new value. The current logic is not quite right for either case.

Detailed vector checks fail due to inconsistencies in how the above works. This 
PR fixes those issues.
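
For reference, the offset-vector invariant at issue, as a small self-contained 
illustration (plain Java, not the vector code itself):

{code:java}
public class OffsetInvariant {
  // Offsets bound variable-width values: row i spans bytes
  // [offsets[i], offsets[i+1]), so n rows need n + 1 offsets. The exception,
  // per Drill's informal convention, is that a zero-row batch carries a
  // zero-length offset vector rather than the single [0] the rule implies.
  public static int[] offsetsFor(String[] values) {
    if (values.length == 0) {
      return new int[0];                  // special case: empty batch
    }
    int[] offsets = new int[values.length + 1];
    for (int i = 0; i < values.length; i++) {
      offsets[i + 1] = offsets[i] + values[i].length();
    }
    return offsets;
  }
  // offsetsFor(new String[] {"a"}) -> [0, 1]  (size 2 for a 1-row batch)
  // offsetsFor(new String[] {})    -> []      (size 0, not 1)
}
{code}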



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7149) Kerberos Code Missing from Drill on YARN

2019-11-06 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968866#comment-16968866
 ] 

Paul Rogers commented on DRILL-7149:


I'm not a Kerberos expert, but I can perhaps provide a few hints.

Drill information for enabling Kerberos is 
[here|http://drill.apache.org/docs/configuring-kerberos-security/].

My advice is to get one Drillbit working on CDH using these instructions. Then, 
use that information to configure DoY.

The examples suggest putting the keytab file in the absolute location 
{{/etc/drill/conf}}. This is probably not the right choice on a CDH cluster.  

If the keytab is the same for all Drill nodes, then place the file in your 
{{$DRILL_SITE/conf}} directory. The site directory is copied from your DoY 
client machine to each Drill node ("localized" in YARN terminology.)

You will need to change the config file to point to that location. IIRC, the 
{{$DRILL_SITE}} environment variable is available to Drill.

The config file shown in the above-cited page is the one you create in your DoY 
client site directory. DoY will localize that file to every Drillbit running 
under YARN.

If the documentation is accurate, then you only need the config options and the 
keytab file. You should be able to pass these along to Drill using the "stock" 
DoY.

The trick would come in if you need to generate the keytab file per host. (Here 
my knowledge of Kerberos is very weak.) You will learn this as you try the step 
suggested above: running Drill on a CDH node by hand to learn what 
configuration is required.
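
For concreteness, the settings involved look roughly like the fragment below 
(adapted from the configuring-kerberos page cited above; verify the exact 
property names there, and substitute your own principal and keytab path):

{code}
# $DRILL_SITE/conf/drill-override.conf -- illustrative values only
drill.exec.security: {
  user.auth.enabled: true,
  auth.mechanisms: [ "KERBEROS" ],
  auth.principal: "drill/_HOST@EXAMPLE.COM",
  # Keeping the keytab in the site directory lets DoY localize it to
  # every node along with the rest of the site files.
  auth.keytab: "/path/to/drill-site/conf/drill.keytab"
}
{code}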

> Kerberos Code Missing from Drill on YARN
> 
>
> Key: DRILL-7149
> URL: https://issues.apache.org/jira/browse/DRILL-7149
> Project: Apache Drill
>  Issue Type: Bug
>  Components: Security
>Affects Versions: 1.14.0
>Reporter: Charles Givre
>Priority: Blocker
>
> My company is trying to deploy Drill using the Drill on Yarn (DoY) and we 
> have run into the issue that DoY does not seem to support passing Kerberos 
> credentials in order to interact with HDFS. 
> Upon checking the source code available in GIT 
> (https://github.com/apache/drill/blob/1.14.0/drill-yarn/src/main/java/org/apache/drill/yarn/core/)
>  and referring to Apache YARN documentation 
> (https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html)
>  , we saw no section for passing the security credentials needed by the 
> application to interact with any Hadoop cluster services and applications. 
> This we feel needs to be added to the source code so that delegation tokens 
> can be passed inside the container for the process to be able access Drill 
> archive on HDFS and start. It probably should be added to the 
> ContainerLaunchContext within the ApplicationSubmissionContext for DoY as 
> suggested under Apache documentation.
>  
> We tried the same DoY utility on a non-kerberised cluster and the process 
> started well. Although we ran into a different issue there of hosts getting 
> blacklisted
> We tested with the Single Principal per cluster option.
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7439) Batch count fixes for six additional operators

2019-11-05 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7439:
--

 Summary: Batch count fixes for six additional operators
 Key: DRILL-7439
 URL: https://issues.apache.org/jira/browse/DRILL-7439
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


Enables vector checks, and fixes batch count and vector issues for:

* StreamingAggBatch
* RuntimeFilterRecordBatch
* FlattenRecordBatch
* MergeJoinBatch
* NestedLoopJoinBatch
* LimitRecordBatch




--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (DRILL-7434) TopNBatch constructs Union vector incorrectly

2019-11-03 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-7434:
--

Assignee: (was: Paul Rogers)

> TopNBatch constructs Union vector incorrectly
> -
>
> Key: DRILL-7434
> URL: https://issues.apache.org/jira/browse/DRILL-7434
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Priority: Major
>
> The Union type is an "experimental" type that has never been completed. Yet, 
> we use it as if it works.
> Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this 
> with the new batch validator enabled. This test creates a union vector. Here 
> is how the schema looks:
> {noformat}
> (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
>   children=([`internal` (MAP:REQUIRED), children=([`types` 
> (UINT1:REQUIRED)])])
> {noformat}
> This is very hard to follow because the Union vector structure is complex 
> (and has many issues.) Let's work through it.
> We are looking at the {{MaterializedField}} for the union vector. It tells us 
> that this Union has two types: {{FLOAT8}} and {{INT}}. All good.
> The Union has a vector per type, stored in an "internal map". That map shows 
> up as a child; it is there on the {{children}} list as {{internal}}. However, 
> the metadata claims that only one vector exists in that map: the {{types}} 
> vector (the one that tells us what type to use for each row.)  The vectors 
> for {{FLOAT8}} and {{INT}} are missing.
> If, however, we use our debugger and inspect the actual contents of the 
> {{internal}} map, we get the following:
> {noformat}
> [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` 
> (FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])]
> {noformat}
> That is, the internal map has the correct schema, but the Union vector itself 
> has the wrong (incomplete) schema.
> This is an inherent design flaw with Union vector: it requires two copies of 
> the schema to be in sync. Further {{MaterializedField}} was designed to be 
> immutable, but the map and Union types require mutation. If the Union simply 
> points to the actual Map vector {{MaterializedField}}, it will drift out of 
> date since the map vector creates a new schema each time we add fields; the 
> Union vector ends up pointing to the old one.
> This is not a simple bug to fix, but the result of the bug is that the 
> vectors end up corrupted, as detected by the Batch Validator. In fact, the 
> bug itself is subtle.
> The TopNBatch does pass vector validation. However, because of the incorrect 
> metadata, the downstream {{RemovingRecordBatch}} creates the derived Union 
> vector incorrectly: it fails to set the value count for the {{INT}} type.
> {noformat}
> Found one or more vector errors from RemovingRecordBatch
> kl-type-INT - NullableIntVector: Row count = 3, but value count = 0
> {noformat}
> Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} 
> type vector for a Union named {{kl}}.
> The schema of Union out of the {{RemovingRecordBatch}} has been truncated. 
> The Union itself:
> {noformat}
> [`kl` (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
>   children=([`internal` (MAP:REQUIRED), children=([`types` 
> (UINT1:REQUIRED)])])]
> {noformat}
> The internal map:
> {noformat}
> [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`int` 
> (INT:OPTIONAL)])]
> {noformat}
> Notice that the {{FLOAT8}} vector has disappeared: the Union vector metadata 
> claims we have such a vector, but the internal map does not actually contain 
> the vector.
> The root cause is that the vector checker (indeed, any client) will call 
> {{UnionVector.getMember(type)}} to get a vector for a type. This method 
> includes a switch statement to call, say, {{getIntVector()}}. That method, in 
> turn, creates the vector if it does not exist.
> But, since we are reading, we have an existing data batch. When we create a 
> new vector, we create it as zero size. Thus, we think we have n records 
> (three in this case), but we actually have zero. This kinda-sorta works 
> because the type vector won't ever contain an entry for the "runt" vector, so 
> we won't actually access data. But, this is an inconsistent structure. It 
> breaks if we peer inside, as we are doing in the batch validator.
> If we check for this case, we now get:
> {noformat}
> Found one or more vector errors from RemovingRecordBatch
> kl - UnionVector: Union vector includes type INT, but the internal map has no 
> matching member
> {noformat}
> This is why Union is such a mess: is this a bug or just a very fragile 
> design? I claim bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7434) TopNBatch constructs Union vector incorrectly

2019-11-03 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966320#comment-16966320
 ] 

Paul Rogers commented on DRILL-7434:


See DRILL-7436 for a workaround (to materialize all type vectors.) Someone 
should look deeper for a longer-term fix, such as removing unused subtypes.

> TopNBatch constructs Union vector incorrectly
> -
>
> Key: DRILL-7434
> URL: https://issues.apache.org/jira/browse/DRILL-7434
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> The Union type is an "experimental" type that has never been completed. Yet, 
> we use it as if it works.
> Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this 
> with the new batch validator enabled. This test creates a union vector. Here 
> is how the schema looks:
> {noformat}
> (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
>   children=([`internal` (MAP:REQUIRED), children=([`types` 
> (UINT1:REQUIRED)])])
> {noformat}
> This is very hard to follow because the Union vector structure is complex 
> (and has many issues.) Let's work through it.
> We are looking at the {{MaterializedField}} for the union vector. It tells us 
> that this Union has two types: {{FLOAT8}} and {{INT}}. All good.
> The Union has a vector per type, stored in an "internal map". That map shows 
> up as a child; it is there on the {{children}} list as {{internal}}. However, 
> the metadata claims that only one vector exists in that map: the {{types}} 
> vector (the one that tells us what type to use for each row.)  The vectors 
> for {{FLOAT8}} and {{INT}} are missing.
> If, however, we use our debugger and inspect the actual contents of the 
> {{internal}} map, we get the following:
> {noformat}
> [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` 
> (FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])]
> {noformat}
> That is, the internal map has the correct schema, but the Union vector itself 
> has the wrong (incomplete) schema.
> This is an inherent design flaw with Union vector: it requires two copies of 
> the schema to be in sync. Further {{MaterializedField}} was designed to be 
> immutable, but the map and Union types require mutation. If the Union simply 
> points to the actual Map vector {{MaterializedField}}, it will drift out of 
> date since the map vector creates a new schema each time we add fields; the 
> Union vector ends up pointing to the old one.
> This is not a simple bug to fix, but the result of the bug is that the 
> vectors end up corrupted, as detected by the Batch Validator. In fact, the 
> bug itself is subtle.
> The TopNBatch does pass vector validation. However, because of the incorrect 
> metadata, the downstream {{RemovingRecordBatch}} creates the derived Union 
> vector incorrectly: it fails to set the value count for the {{INT}} type.
> {noformat}
> Found one or more vector errors from RemovingRecordBatch
> kl-type-INT - NullableIntVector: Row count = 3, but value count = 0
> {noformat}
> Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} 
> type vector for a Union named {{kl}}.
> The schema of Union out of the {{RemovingRecordBatch}} has been truncated. 
> The Union itself:
> {noformat}
> [`kl` (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
>   children=([`internal` (MAP:REQUIRED), children=([`types` 
> (UINT1:REQUIRED)])])]
> {noformat}
> The internal map:
> {noformat}
> [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`int` 
> (INT:OPTIONAL)])]
> {noformat}
> Notice that the {{FLOAT8}} vector has disappeared: the Union vector metadata 
> claims we have such a vector, but the internal map does not actually contain 
> the vector.
> The root cause is that the vector checker (indeed, any client) will call 
> {{UnionVector.getMember(type)}} to get a vector for a type. This method 
> includes a switch statement to call, say, {{getIntVector()}}. That method, in 
> turn, creates the vector if it does not exist.
> But, since we are reading, we have an existing data batch. When we create a 
> new vector, we create it as zero size. Thus, we think we have n records 
> (three in this case), but we actually have zero. This kinda-sorta works 
> because the type vector won't ever contain an entry for the "runt" vector, so 
> we won't actually access data. But, this is an inconsistent structure. It 
> breaks if we peer inside, as we are doing in the batch validator.
> If we check for this case, we now get:
> {noformat}
> Found one or more vector errors from RemovingRecordBatch
> kl - UnionVector: Union vector includes type INT, but the internal map has no 
> matching member
> {noformat}
> This is why Union is such a mess: is this a bug or just a very fragile 
> design? I claim bug.



--
This 

[jira] [Commented] (DRILL-7435) Project operator incorrectly adds a LATE type to union vector

2019-11-03 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966319#comment-16966319
 ] 

Paul Rogers commented on DRILL-7435:


DRILL-7436 provides a work-around fix. Someone should probably look carefully 
to work out the detailed semantics in this area: how should we handle `LATE` 
with the Union vector?

> Project operator incorrectly adds a LATE type to union vector
> -
>
> Key: DRILL-7435
> URL: https://issues.apache.org/jira/browse/DRILL-7435
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Priority: Major
>
> Run Drill with a fix for DRILL-7434. Now, another test fails: 
> {{TestJsonReader.testTypeCase()}} fails when it tries to set the value count. 
> Evidently the Project operator has added the {{LATE}} type to the Union 
> vector. However, there is no vector type associated with the {{LATE}} type. 
> An attempt to get the member for this type throws an exception.
> The simple work around is to special-case this type when setting the value 
> count. The longer-term fix is to not add the {{LATE}} type to a union vector.
> The problem appears to occur here:
> {noformat}
> Daemon Thread [2240a19e-344e-9a8b-f3d9-2a1550662b1b:frag:0:0] (Suspended 
> (breakpoint at line 2091 in TypeProtos$MajorType$Builder))   
>   TypeProtos$MajorType$Builder.addSubType(TypeProtos$MinorType) line: 
> 2091
>   DefaultReturnTypeInference.getType(List, 
> FunctionAttributes) line: 58
>   FunctionTemplate$ReturnType.getType(List, 
> FunctionAttributes) line: 195  
>   
> DrillSimpleFuncHolder(DrillFuncHolder).getReturnType(List) 
> line: 401 
>   DrillFuncHolderExpr.(String, DrillFuncHolder, 
> List, ExpressionPosition) line: 39   
>   DrillSimpleFuncHolder(DrillFuncHolder).getExpr(String, 
> List, ExpressionPosition) line: 113   
>   ExpressionTreeMaterializer.addCastExpression(LogicalExpression, 
> TypeProtos$MajorType, FunctionLookupContext, ErrorCollector, boolean) line: 
> 235 
>   
> ExpressionTreeMaterializer$MaterializeVisitor(ExpressionTreeMaterializer$AbstractMaterializeVisitor).visitIfExpression(IfExpression,
>  FunctionLookupContext) line: 638   
>   
> ExpressionTreeMaterializer$MaterializeVisitor(ExpressionTreeMaterializer$AbstractMaterializeVisitor).visitIfExpression(IfExpression,
>  Object) line: 332  
>   IfExpression.accept(ExprVisitor, V) line: 65 
>   ExpressionTreeMaterializer.materialize(LogicalExpression, 
> Map, ErrorCollector, FunctionLookupContext, 
> boolean, boolean) line: 165  
>   ExpressionTreeMaterializer.materialize(LogicalExpression, 
> VectorAccessible, ErrorCollector, FunctionLookupContext, boolean, boolean) 
> line: 143  
>   ProjectRecordBatch.setupNewSchemaFromInput(RecordBatch) line: 482   
>   ProjectRecordBatch.setupNewSchema() line: 571   
>   ProjectRecordBatch(AbstractUnaryRecordBatch).innerNext() line: 99
>   ProjectRecordBatch.innerNext() line: 144
>   ...
> {noformat}
> This appears to be processing the if statement in the following test query:
> {noformat}
>   .sqlQuery("select case when is_bigint(field1) " +
> "then field1 when is_list(field1) then field1[0] " +
> "when is_map(field1) then t.field1.inner1 end f1 from 
> cp.`jsoninput/union/a.json` t")
> {noformat}
> The problem appears to be that a function says it takes data of type LATE, 
> and then that data is added to the Union. Not sure of the exact solution, but 
> simply omitting the LATE value from the Union seems to work.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7436) Fix record count, vector structure issues in several operators

2019-11-03 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7436:
--

 Summary: Fix record count, vector structure issues in several 
operators
 Key: DRILL-7436
 URL: https://issues.apache.org/jira/browse/DRILL-7436
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers


This is the next in a continuing series of fixes to the container record count, 
batch record count, and vector structure in several operators. This batch 
represents the smallest change needed to add checking for the Filter operator.

In order to get Filter to pass checks, many of its upstream operators needed to 
be fixed.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Updated] (DRILL-7435) Project operator incorrectly adds a LATE type to union vector

2019-11-03 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7435:
---
Description: 
Run Drill with a fix for DRILL-7434. Now, another test fails: 
{{TestJsonReader.testTypeCase()}} fails when it tries to set the value count. 
Evidently the Project operator has added the {{LATE}} type to the Union vector. 
However, there is no vector type associated with the {{LATE}} type. An attempt 
to get the member for this type throws an exception.

The simple work around is to special-case this type when setting the value 
count. The longer-term fix is to not add the {{LATE}} type to a union vector.

The problem appears to occur here:

{noformat}
Daemon Thread [2240a19e-344e-9a8b-f3d9-2a1550662b1b:frag:0:0] (Suspended 
(breakpoint at line 2091 in TypeProtos$MajorType$Builder)) 
TypeProtos$MajorType$Builder.addSubType(TypeProtos$MinorType) line: 
2091
DefaultReturnTypeInference.getType(List, 
FunctionAttributes) line: 58
FunctionTemplate$ReturnType.getType(List, 
FunctionAttributes) line: 195  

DrillSimpleFuncHolder(DrillFuncHolder).getReturnType(List) 
line: 401 
DrillFuncHolderExpr.(String, DrillFuncHolder, 
List, ExpressionPosition) line: 39   
DrillSimpleFuncHolder(DrillFuncHolder).getExpr(String, 
List, ExpressionPosition) line: 113   
ExpressionTreeMaterializer.addCastExpression(LogicalExpression, 
TypeProtos$MajorType, FunctionLookupContext, ErrorCollector, boolean) line: 235 

ExpressionTreeMaterializer$MaterializeVisitor(ExpressionTreeMaterializer$AbstractMaterializeVisitor).visitIfExpression(IfExpression,
 FunctionLookupContext) line: 638   

ExpressionTreeMaterializer$MaterializeVisitor(ExpressionTreeMaterializer$AbstractMaterializeVisitor).visitIfExpression(IfExpression,
 Object) line: 332  
IfExpression.accept(ExprVisitor, V) line: 65 
ExpressionTreeMaterializer.materialize(LogicalExpression, 
Map, ErrorCollector, FunctionLookupContext, 
boolean, boolean) line: 165  
ExpressionTreeMaterializer.materialize(LogicalExpression, 
VectorAccessible, ErrorCollector, FunctionLookupContext, boolean, boolean) 
line: 143  
ProjectRecordBatch.setupNewSchemaFromInput(RecordBatch) line: 482   
ProjectRecordBatch.setupNewSchema() line: 571   
ProjectRecordBatch(AbstractUnaryRecordBatch).innerNext() line: 99
ProjectRecordBatch.innerNext() line: 144
...
{noformat}

This appears to be processing the if statement in the following test query:

{noformat}
  .sqlQuery("select case when is_bigint(field1) " +
"then field1 when is_list(field1) then field1[0] " +
"when is_map(field1) then t.field1.inner1 end f1 from 
cp.`jsoninput/union/a.json` t")
{noformat}

The problem appears to be that a function says it takes data of type LATE, and 
then that data is added to the Union. Not sure of the exact solution, but 
simply omitting the LATE value from the Union seems to work.



  was:
Run Drill with a fix for DRILL-7434. Now, another test fails: 
{{TestJsonReader.testTypeCase()}} fails when it tries to set the value count. 
Evidently the JSON reader has added the {{LATE}} type to the Union vector. 
However, there is no vector type associated with the {{LATE}} type. An attempt 
to get the member for this type throws an exception.

The simple work around is to special-case this type when setting the value 
count. The longer-term fix is to not add the {{LATE}} type to a union vector.


> Project operator incorrectly adds a LATE type to union vector
> -
>
> Key: DRILL-7435
> URL: https://issues.apache.org/jira/browse/DRILL-7435
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Priority: Major
>
> Run Drill with a fix for DRILL-7434. Now, another test fails: 
> {{TestJsonReader.testTypeCase()}} fails when it tries to set the value count. 
> Evidently the Project operator has added the {{LATE}} type to the Union 
> vector. However, there is no vector type associated with the {{LATE}} type. 
> An attempt to get the member for this type throws an exception.
> The simple work around is to special-case this type when setting the value 
> count. The longer-term fix is to not add the {{LATE}} type to a union vector.
> The problem appears to occur here:
> {noformat}
> Daemon Thread [2240a19e-344e-9a8b-f3d9-2a1550662b1b:frag:0:0] (Suspended 
> (breakpoint at line 2091 in TypeProtos$MajorType$Builder))   
>   TypeProtos$MajorType$Builder.addSubType(TypeProtos$MinorType) line: 
> 2091
>   DefaultReturnTypeInference.getType(List, 
> FunctionAttributes) line: 58
>   FunctionTemplate$ReturnType.getType(List, 
> 

[jira] [Updated] (DRILL-7435) Project operator incorrectly adds a LATE type to union vector

2019-11-03 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7435:
---
Summary: Project operator incorrectly adds a LATE type to union vector  
(was: JSON reader incorrectly adds a LATE type to union vector)

> Project operator incorrectly adds a LATE type to union vector
> -
>
> Key: DRILL-7435
> URL: https://issues.apache.org/jira/browse/DRILL-7435
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Priority: Major
>
> Run Drill with a fix for DRILL-7434. Now, another test fails: 
> {{TestJsonReader.testTypeCase()}} fails when it tries to set the value count. 
> Evidently the JSON reader has added the {{LATE}} type to the Union vector. 
> However, there is no vector type associated with the {{LATE}} type. An 
> attempt to get the member for this type throws an exception.
> The simple work around is to special-case this type when setting the value 
> count. The longer-term fix is to not add the {{LATE}} type to a union vector.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7435) JSON reader incorrectly adds a LATE type to union vector

2019-11-03 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7435:
--

 Summary: JSON reader incorrectly adds a LATE type to union vector
 Key: DRILL-7435
 URL: https://issues.apache.org/jira/browse/DRILL-7435
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


Run Drill with a fix for DRILL-7434. Now, another test fails: 
{{TestJsonReader.testTypeCase()}} fails when it tries to set the value count. 
Evidently the JSON reader has added the {{LATE}} type to the Union vector. 
However, there is no vector type associated with the {{LATE}} type. An attempt 
to get the member for this type throws an exception.

The simple work around is to special-case this type when setting the value 
count. The longer-term fix is to not add the {{LATE}} type to a union vector.
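
The special-case guard amounts to something like the sketch below. 
{{MajorType.Builder.addSubType()}} and {{MinorType}} are from Drill's 
{{TypeProtos}}; the helper method itself is illustrative, not existing code:

{code:java}
import org.apache.drill.common.types.TypeProtos.MajorType;
import org.apache.drill.common.types.TypeProtos.MinorType;

public class UnionSubtypeGuard {
  // Sketch of the longer-term fix: never record LATE as a union subtype,
  // since no vector type backs it.
  public static void addSubTypeIfConcrete(MajorType.Builder builder,
                                          MinorType type) {
    if (type != MinorType.LATE) {
      builder.addSubType(type);
    }
  }
}
{code}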



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (DRILL-7434) TopNBatch constructs Union vector incorrectly

2019-11-03 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers reassigned DRILL-7434:
--

Assignee: Paul Rogers

> TopNBatch constructs Union vector incorrectly
> -
>
> Key: DRILL-7434
> URL: https://issues.apache.org/jira/browse/DRILL-7434
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Assignee: Paul Rogers
>Priority: Major
>
> The Union type is an "experimental" type that has never been completed. Yet, 
> we use it as if it works.
> Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this 
> with the new batch validator enabled. This test creates a union vector. Here 
> is how the schema looks:
> {noformat}
> (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
>   children=([`internal` (MAP:REQUIRED), children=([`types` 
> (UINT1:REQUIRED)])])
> {noformat}
> This is very hard to follow because the Union vector structure is complex 
> (and has many issues.) Let's work through it.
> We are looking at the {{MaterializedField}} for the union vector. It tells us 
> that this Union has two types: {{FLOAT8}} and {{INT}}. All good.
> The Union has a vector per type, stored in an "internal map". That map shows 
> up as a child; it is there on the {{children}} list as {{internal}}. However, 
> the metadata claims that only one vector exists in that map: the {{types}} 
> vector (the one that tells us what type to use for each row.)  The vectors 
> for {{FLOAT8}} and {{INT}} are missing.
> If, however, we use our debugger and inspect the actual contents of the 
> {{internal}} map, we get the following:
> {noformat}
> [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` 
> (FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])]
> {noformat}
> That is, the internal map has the correct schema, but the Union vector itself 
> has the wrong (incomplete) schema.
> This is an inherent design flaw with Union vector: it requires two copies of 
> the schema to be in sync. Further {{MaterializedField}} was designed to be 
> immutable, but the map and Union types require mutation. If the Union simply 
> points to the actual Map vector {{MaterializedField}}, it will drift out of 
> date since the map vector creates a new schema each time we add fields; the 
> Union vector ends up pointing to the old one.
> This is not a simple bug to fix, but the result of the bug is that the 
> vectors end up corrupted, as detected by the Batch Validator. In fact, the 
> bug itself is subtle.
> The TopNBatch does pass vector validation. However, because of the incorrect 
> metadata, the downstream {{RemovingRecordBatch}} creates the derived Union 
> vector incorrectly: it fails to set the value count for the {{INT}} type.
> {noformat}
> Found one or more vector errors from RemovingRecordBatch
> kl-type-INT - NullableIntVector: Row count = 3, but value count = 0
> {noformat}
> Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} 
> type vector for a Union named {{kl}}.
> The schema of the Union out of the {{RemovingRecordBatch}} has been truncated. 
> The Union itself:
> {noformat}
> [`kl` (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
>   children=([`internal` (MAP:REQUIRED), children=([`types` 
> (UINT1:REQUIRED)])])]
> {noformat}
> The internal map:
> {noformat}
> [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`int` 
> (INT:OPTIONAL)])]
> {noformat}
> Notice that the {{FLOAT8}} vector has disappeared: the Union vector metadata 
> claims we have such a vector, but the internal map does not actually contain 
> the vector.
> The root cause is that the vector checker (indeed, any client) will call 
> {{UnionVector.getMember(type)}} to get a vector for a type. This method 
> includes a switch statement to call, say, {{getIntVector()}}. That method, in 
> turn, creates the vector if it does not exist.
> But, since we are reading, we have an existing data batch. When we create a 
> new vector, we create it as zero size. Thus, we think we have n records 
> (three in this case), but we actually have zero. This kinda-sorta works 
> because the type vector won't ever contain an entry for the "runt" vector, so 
> we won't actually access data. But, this is an inconsistent structure. It 
> breaks if we peer inside, as we are doing in the batch validator.
> If we check for this case, we now get:
> {noformat}
> Found one or more vector errors from RemovingRecordBatch
> kl - UnionVector: Union vector includes type INT, but the internal map has no 
> matching member
> {noformat}
> This is why Union is such a mess: is this a bug or just a very fragile 
> design? I claim bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7434) TopNBatch constructs Union vector incorrectly

2019-11-03 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966243#comment-16966243
 ] 

Paul Rogers commented on DRILL-7434:


A workaround is to force creation of the child type vectors in 
{{UnionVector.setValueCount()}}. This is only a workaround because, if there 
are no values for a given type, we should not need the child vector. A better 
long-term solution would be to simply remove child types for which there are 
no values; that is left as an exercise for another time.
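
A minimal sketch of that forced creation, assuming a {{setValueCount()}} that 
iterates the declared subtypes (names are illustrative, not the actual 
implementation):

{code:java}
// Force-create each declared subtype's member vector before setting its
// value count, so the union's metadata and contents stay in sync.
public void setValueCount(int valueCount) {
  for (MinorType type : getSubTypes()) {
    ValueVector member = getMember(type); // creates the vector if missing
    member.getMutator().setValueCount(valueCount);
  }
  typeVector.getMutator().setValueCount(valueCount);
}
{code}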

> TopNBatch constructs Union vector incorrectly
> -
>
> Key: DRILL-7434
> URL: https://issues.apache.org/jira/browse/DRILL-7434
> Project: Apache Drill
>  Issue Type: Bug
>Reporter: Paul Rogers
>Priority: Major
>
> The Union type is an "experimental" type that has never been completed. Yet, 
> we use it as if it works.
> Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this 
> with the new batch validator enabled. This test creates a union vector. Here 
> is how the schema looks:
> {noformat}
> (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
>   children=([`internal` (MAP:REQUIRED), children=([`types` 
> (UINT1:REQUIRED)])])
> {noformat}
> This is very hard to follow because the Union vector structure is complex 
> (and has many issues). Let's work through it.
> We are looking at the {{MaterializedField}} for the union vector. It tells us 
> that this Union has two types: {{FLOAT8}} and {{INT}}. All good.
> The Union has a vector per type, stored in an "internal map". That map shows 
> up as a child; it is there on the {{children}} list as {{internal}}. However, 
> the metadata claims that only one vector exists in that map: the {{types}} 
> vector (the one that tells us what type to use for each row.)  The vectors 
> for {{FLOAT8}} and {{INT}} are missing.
> If, however, we use our debugger and inspect the actual contents of the 
> {{internal}} map, we get the following:
> {noformat}
> [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` 
> (FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])]
> {noformat}
> That is, the internal map has the correct schema, but the Union vector itself 
> has the wrong (incomplete) schema.
> This is an inherent design flaw with Union vector: it requires two copies of 
> the schema to be in sync. Further, {{MaterializedField}} was designed to be 
> immutable, but the map and Union types require mutation. If the Union simply 
> points to the actual Map vector's {{MaterializedField}}, it will drift out of 
> date since the map vector creates a new schema each time we add fields; the 
> Union vector ends up pointing to the old one.
> This is not a simple bug to fix, but the result of the bug is that the 
> vectors end up corrupted, as detected by the Batch Validator. In fact, the 
> bug itself is subtle.
> The TopNBatch does pass vector validation. However, because of the incorrect 
> metadata, the downstream {{RemovingRecordBatch}} creates the derived Union 
> vector incorrectly: it fails to set the value count for the {{INT}} type.
> {noformat}
> Found one or more vector errors from RemovingRecordBatch
> kl-type-INT - NullableIntVector: Row count = 3, but value count = 0
> {noformat}
> Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} 
> type vector for a Union named {{kl}}.
> The schema of the Union out of the {{RemovingRecordBatch}} has been truncated. 
> The Union itself:
> {noformat}
> [`kl` (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
>   children=([`internal` (MAP:REQUIRED), children=([`types` 
> (UINT1:REQUIRED)])])]
> {noformat}
> The internal map:
> {noformat}
> [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`int` 
> (INT:OPTIONAL)])]
> {noformat}
> Notice that the {{FLOAT8}} vector has disappeared: the Union vector metadata 
> claims we have such a vector, but the internal map does not actually contain 
> the vector.
> The root cause is that the vector checker (indeed, any client) will call 
> {{UnionVector.getMember(type)}} to get a vector for a type. This method 
> includes a switch statement to call, say, {{getIntVector()}}. That method, in 
> turn, creates the vector if it does not exist.
> But, since we are reading, we have an existing data batch. When we create a 
> new vector, we create it as zero size. Thus, we think we have n records 
> (three in this case), but we actually have zero. This kinda-sorta works 
> because the type vector won't ever contain an entry for the "runt" vector, so 
> we won't actually access data. But, this is an inconsistent structure. It 
> breaks if we peer inside, as we are doing in the batch validator.
> If we check for this case, we now get:
> {noformat}
> Found one or more vector errors from RemovingRecordBatch
> kl - UnionVector: Union vector includes type INT, but the 

[jira] [Updated] (DRILL-7434) TopNBatch constructs Union vector incorrectly

2019-11-03 Thread Paul Rogers (Jira)


 [ 
https://issues.apache.org/jira/browse/DRILL-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Paul Rogers updated DRILL-7434:
---
Description: 
The Union type is an "experimental" type that has never been completed. Yet, we 
use it as if it works.

Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this with 
the new batch validator enabled. This test creates a union vector. Here is how 
the schema looks:

{noformat}
(UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
  children=([`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)])])
{noformat}

This is very hard to follow because the Union vector structure is complex (and 
has many issues). Let's work through it.

We are looking at the {{MaterializedField}} for the union vector. It tells us 
that this Union has two types: {{FLOAT8}} and {{INT}}. All good.

The Union has a vector per type, stored in an "internal map". That map shows 
up as a child; it is there on the {{children}} list as {{internal}}. However, the 
metadata claims that only one vector exists in that map: the {{types}} vector 
(the one that tells us what type to use for each row.)  The vectors for 
{{FLOAT8}} and {{INT}} are missing.

If, however, we use our debugger and inspect the actual contents of the 
{{internal}} map, we get the following:

{noformat}
[`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` 
(FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])]
{noformat}

That is, the internal map has the correct schema, but the Union vector itself 
has the wrong (incomplete) schema.

This is an inherent design flaw with Union vector: it requires two copies of 
the schema to be in sync. Further, {{MaterializedField}} was designed to be 
immutable, but the map and Union types require mutation. If the Union simply 
points to the actual Map vector's {{MaterializedField}}, it will drift out of 
date since the map vector creates a new schema each time we add fields; the 
Union vector ends up pointing to the old one.

This is not a simple bug to fix, but the result of the bug is that the vectors 
end up corrupted, as detected by the Batch Validator. In fact, the bug itself 
is subtle.

The TopNBatch does pass vector validation. However, because of the incorrect 
metadata, the downstream {{RemovingRecordBatch}} creates the derived Union 
vector incorrectly: it fails to set the value count for the {{INT}} type.

{noformat}
Found one or more vector errors from RemovingRecordBatch
kl-type-INT - NullableIntVector: Row count = 3, but value count = 0
{noformat}

Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} 
type vector for a Union named {{kl}}.

The schema of the Union out of the {{RemovingRecordBatch}} has been truncated. The 
Union itself:

{noformat}
[`kl` (UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
  children=([`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)])])]
{noformat}

The internal map:

{noformat}
[`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`int` 
(INT:OPTIONAL)])]
{noformat}

Notice that the {{FLOAT8}} vector has disappeared: the Union vector metadata 
claims we have such a vector, but the internal map does not actually contain 
the vector.

The root cause is that the vector checker (indeed, any client) will call 
{{UnionVector.getMember(type)}} to get a vector for a type. This method 
includes a switch statement to call, say, {{getIntVector()}}. That method, in 
turn, creates the vector if it does not exist.

But, since we are reading, we have an existing data batch. When we create a new 
vector, we create it as zero size. Thus, we think we have n records (three in 
this case), but we actually have zero. This kinda-sorta works because the type 
vector won't ever contain an entry for the "runt" vector, so we won't actually 
access data. But, this is an inconsistent structure. It breaks if we peer 
inside, as we are doing in the batch validator.
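
A simplified rendering of that lazy-creation pattern (this approximates, 
rather than quotes, the Drill source; helper and field names are illustrative):

{code:java}
// getMember() routes to a per-type getter, which silently allocates a
// zero-length vector on first use -- the "runt" vector described above.
public ValueVector getMember(MinorType type) {
  switch (type) {
    case INT:    return getIntVector();
    case FLOAT8: return getFloat8Vector();
    default:
      throw new UnsupportedOperationException("No member for " + type);
  }
}

private NullableIntVector getIntVector() {
  if (intVector == null) {
    // Created empty: value count is 0 even though the batch has n rows.
    intVector = internalMap.addOrGet("int",
        Types.optional(MinorType.INT), NullableIntVector.class);
  }
  return intVector;
}
{code}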

If we check for this case, we now get:

{noformat}
Found one or more vector errors from RemovingRecordBatch
kl - UnionVector: Union vector includes type INT, but the internal map has no 
matching member
{noformat}
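
A sketch of the consistency check involved (the traversal and helper names 
here are assumptions, not the batch validator's actual code):

{code:java}
// Every subtype declared in the union's metadata must have a matching
// member vector inside the internal map; report any that are missing.
for (MinorType type : union.getField().getType().getSubTypeList()) {
  if (internalMap.getChild(memberNameFor(type)) == null) { // hypothetical helper
    reportError("Union vector includes type " + type
        + ", but the internal map has no matching member");
  }
}
{code}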

This is why Union is such a mess: is this a bug or just a very fragile design? 
I claim bug.

  was:
The Union type is an "experimental" type that has never been completed. Yet, we 
use it as if it works.

Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this with 
the new batch validator enabled. This test creates a union vector. Here is how 
the schema looks:

{noformat}
(UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
  children=([`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)])])
{noformat}

This is very hard to follow because the Union vector structure is complex (and 
has many issues). Let's work through it.

We are looking at the {{MaterializedField}} for the union vector. It tells us 
that this Union has two types: {{FLOAT8}} and {{INT}}. 

[jira] [Created] (DRILL-7434) TopNBatch constructs Union vector incorrectly

2019-11-03 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7434:
--

 Summary: TopNBatch constructs Union vector incorrectly
 Key: DRILL-7434
 URL: https://issues.apache.org/jira/browse/DRILL-7434
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


The Union type is an "experimental" type that has never been completed. Yet, we 
use it as if it works.

Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this with 
the new batch validator enabled. This test creates a union vector. Here is how 
the schema looks:

{noformat}
(UNION:OPTIONAL), subtypes=([FLOAT8, INT]),
  children=([`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)])])
{noformat}

This is very hard to follow because the Union vector structure is complex (and 
has many issues). Let's work through it.

We are looking at the {{MaterializedField}} for the union vector. It tells us 
that this Union has two types: {{FLOAT8}} and {{INT}}. All good.

The Union has a vector per type, stored in an "internal map". That map shows 
up as a child; it is there on the {{children}} list as {{internal}}. However, the 
metadata claims that only one vector exists in that map: the {{types}} vector 
(the one that tells us what type to use for each row.)  The vectors for 
{{FLOAT8}} and {{INT}} are missing.

If, however, we use our debugger and inspect the actual contents of the 
{{internal}} map, we get the following:

{noformat}
[`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` 
(FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])]
{noformat}

That is, the internal map has the correct schema, but the Union vector itself 
has the wrong (incomplete) schema.

This is an inherent design flaw with Union vector: it requires two copies of 
the schema to be in sync. Further, {{MaterializedField}} was designed to be 
immutable, but the map and Union types require mutation. If the Union simply 
points to the actual Map vector's {{MaterializedField}}, it will drift out of 
date since the map vector creates a new schema each time we add fields; the 
Union vector ends up pointing to the old one.

This is not a simple bug to fix, but the result of the bug is that the vectors 
end up corrupted, as detected by the Batch Validator. In fact, the bug itself 
is subtle.

The TopNBatch does pass vector validation. However, because of the incorrect 
metadata, the downstream {{RemovingRecordBatch}} creates the derived Union 
vector incorrectly: it fails to set the value count for the {{INT}} type.

{noformat}
Found one or more vector errors from RemovingRecordBatch
kl-type-INT - NullableIntVector: Row count = 3, but value count = 0
{noformat}

Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} 
type vector for a Union named {{kl}}.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (DRILL-7428) Drill incorrectly allows a repeated map field to be projected to top level

2019-10-29 Thread Paul Rogers (Jira)
Paul Rogers created DRILL-7428:
--

 Summary: Drill incorrectly allows a repeated map field to be 
projected to top level
 Key: DRILL-7428
 URL: https://issues.apache.org/jira/browse/DRILL-7428
 Project: Apache Drill
  Issue Type: Bug
Reporter: Paul Rogers


Consider the following query from the [Mongo DB 
tests|https://github.com/apache/drill/blob/master/contrib/storage-mongo/src/test/java/org/apache/drill/exec/store/mongo/MongoTestConstants.java#L80]:

{noformat}
select t.name as name, t.topping.type as type 
  from mongo.%s.`%s` t where t.sales >= 150
{noformat}


The query is used in 
[{{TestMongoQueries.testUnShardedDBInShardedClusterWithProjectionAndFilter()}}|https://github.com/apache/drill/blob/master/contrib/storage-mongo/src/test/java/org/apache/drill/exec/store/mongo/TestMongoQueries.java#L89].
 
Here it turns out that {{topping}} is a repeated map, and the query projects a 
member of that map to the top level. The batch has five rows, but 24 values in 
the repeated map. The Project operator allows the projection, resulting in an 
output batch in which most vectors have 5 values, but the {{topping}} column, 
now at the top level and no longer in the map, has 24 values.

As a result, the first five values, formerly associated with the first record, 
are now associated with the first five top-level records, while the values 
formerly associated with records 1-4 are lost.

Thus, this is a data corruption bug.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7426) Json support lists of different types

2019-10-28 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961336#comment-16961336
 ] 

Paul Rogers commented on DRILL-7426:


[~cgivre], I should have seen that one coming...

But, seriously, a provided schema turns out to be the best way to predict the 
future.

> Json support lists of different types
> -
>
> Key: DRILL-7426
> URL: https://issues.apache.org/jira/browse/DRILL-7426
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.16.0
>Reporter: benj
>Priority: Trivial
>
> With a file.json like
> {code:json}
> {
> "name": "toto",
> "info": [["LOAD", []]],
> "response": 1
> }
> {code}
> A simple SELECT gives an error
> {code:sql}
> apache drill> SELECT * FROM dfs.test.`file.json`;
> Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a 
> value of type LIST. Drill does not support lists of different types.
> {code}
> But there is an option _exec.enable_union_type_ that allows this request
> {code:sql}
> apache drill> ALTER SESSION SET `exec.enable_union_type` = true;
> apache drill> SELECT * FROM dfs.test.`file.json`;
> +--+---+--+
> | name | info  | response |
> +--+---+--+
> | toto | [["LOAD",[]]] | 1|
> +--+---+--+
> 1 row selected (0.283 seconds)
> {code}
> The usage of this option is not obvious, so it would be useful for the error 
> message to mention the possibility of setting it.
> {noformat}
> Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a 
> value of type LIST. Drill does not support lists of different types.  SET 
> the option 'exec.enable_union_type' to true and try again;
> {noformat}
> This behaviour is already used for other errors, for example:
> {noformat}
> ...
> Error: UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due 
> to either a cartesian join or an inequality join. 
> If a cartesian or inequality join is used intentionally, set the option 
> 'planner.enable_nljoin_for_scalar_only' to false and try again.
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7426) Json support lists of different types

2019-10-28 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961322#comment-16961322
 ] 

Paul Rogers commented on DRILL-7426:


[~cgivre], the query in question used the wildcard, which asks to read all 
columns. In general, the reader cannot predict the future: it cannot tell that 
`info` will contain mixed data.

However, Drill should work if the query were `SELECT name, response FROM ...`. 
If not, then that is a bug that is fixable.

The issue is that the user seems to need the data. One workaround is to rewrite 
the JSON so that the array is represented as an object:

{noformat}
{
"name": "toto",
"info": { command: "LOAD", values: [] },
"response": 1
}
{noformat}

But, here we run into the empty-array issue: we don't know the type of the 
`values` array...

In general, JSON can represent a wider set of data structures than relational 
tuples. It has always been an open question how much of that variety Drill 
should handle. I think most users end up running an ETL step to convert the 
data into a relational format (and then store the data in Parquet for better 
performance). So one could debate whether it is worth adding more complexity 
to Drill.

> Json support lists of different types
> -
>
> Key: DRILL-7426
> URL: https://issues.apache.org/jira/browse/DRILL-7426
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.16.0
>Reporter: benj
>Priority: Trivial
>
> With a file.json like
> {code:json}
> {
> "name": "toto",
> "info": [["LOAD", []]],
> "response": 1
> }
> {code}
> A simple SELECT gives an error
> {code:sql}
> apache drill> SELECT * FROM dfs.test.`file.json`;
> Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a 
> value of type LIST. Drill does not support lists of different types.
> {code}
> But there is an option _exec.enable_union_type_ that allows this request
> {code:sql}
> apache drill> ALTER SESSION SET `exec.enable_union_type` = true;
> apache drill> SELECT * FROM dfs.test.`file.json`;
> +--+---+--+
> | name | info  | response |
> +--+---+--+
> | toto | [["LOAD",[]]] | 1|
> +--+---+--+
> 1 row selected (0.283 seconds)
> {code}
> The usage of this option is not obvious, so it would be useful for the error 
> message to mention the possibility of setting it.
> {noformat}
> Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a 
> value of type LIST. Drill does not support lists of different types.  SET 
> the option 'exec.enable_union_type' to true and try again;
> {noformat}
> This behaviour is already used for other errors, for example:
> {noformat}
> ...
> Error: UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due 
> to either a cartesian join or an inequality join. 
> If a cartesian or inequality join is used intentionally, set the option 
> 'planner.enable_nljoin_for_scalar_only' to false and try again.
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (DRILL-7426) Json support lists of different types

2019-10-28 Thread Paul Rogers (Jira)


[ 
https://issues.apache.org/jira/browse/DRILL-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961257#comment-16961257
 ] 

Paul Rogers commented on DRILL-7426:


As it turns out, this is a known limitation of Drill. Drill is a relational 
engine, designed to serve relational clients such as JDBC and ODBC. Although 
Drill has a Union data type, that type remains experimental and not fully 
supported.

At present, it seems that the Union type can be passed through the scan 
operator to a SqlLine client, where it is converted to a string for display, as 
shown in your example. However, it is not supported by most other operators, 
resulting in the failure you reported.

The fundamental problem is that it is not clear how the Union type should work 
with clients (JDBC, ODBC) that require a traditional relational schema. Drill 
does not support extended SQL syntax (such as SQL++), just traditional 
relational SQL.

We have seen cases in which JSON authors use arrays as a compact representation 
of a tuple:

{noformat}
[ 10, "fred", "flintstone", "male", 12.34 ]
{noformat}

Is this the case with your example, which seems to contain both a string and 
an array?

At present, Drill has no way to map such a tuple into a relational structure. 
One could imagine converting the array into, say, a Map with field names 
defined somehow.
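
For illustration only (these field names are invented here, not an actual 
proposal), that compact tuple might map to:

{noformat}
{ "id": 10, "first": "fred", "last": "flintstone", "gender": "male", "amount": 12.34 }
{noformat}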

Here, "all text mode" will not help as that mode can't handle array/string 
conflicts, only string/number conflicts.

> Json support lists of different types
> -
>
> Key: DRILL-7426
> URL: https://issues.apache.org/jira/browse/DRILL-7426
> Project: Apache Drill
>  Issue Type: Improvement
>  Components: Documentation
>Affects Versions: 1.16.0
>Reporter: benj
>Priority: Trivial
>
> With a file.json like
> {code:json}
> {
> "name": "toto",
> "info": [["LOAD", []]],
> "response": 1
> }
> {code}
> A simple SELECT gives an error
> {code:sql}
> apache drill> SELECT * FROM dfs.test.`file.json`;
> Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a 
> value of type LIST. Drill does not support lists of different types.
> {code}
> But there is an option _exec.enable_union_type_ that allows this request
> {code:sql}
> apache drill> ALTER SESSION SET `exec.enable_union_type` = true;
> apache drill> SELECT * FROM dfs.test.`file.json`;
> +--+---+--+
> | name | info  | response |
> +--+---+--+
> | toto | [["LOAD",[]]] | 1|
> +--+---+--+
> 1 row selected (0.283 seconds)
> {code}
> The usage of this option is not obvious, so it would be useful for the error 
> message to mention the possibility of setting it.
> {noformat}
> Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a 
> value of type LIST. Drill does not support lists of different types.  SET 
> the option 'exec.enable_union_type' to true and try again;
> {noformat}
> This behaviour is already used for other errors, for example:
> {noformat}
> ...
> Error: UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due 
> to either a cartesian join or an inequality join. 
> If a cartesian or inequality join is used intentionally, set the option 
> 'planner.enable_nljoin_for_scalar_only' to false and try again.
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

