[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-28 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305699#comment-16305699
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/28/17 8:07 PM:
--

h4. Solutions

Drill's JSON support is mature; it is used successfully by a wide range of 
users. Presumably, the issues discussed above are only theoretical; practical 
users do not encounter them. (If they did, we'd see many more open bugs in this 
area.)

Further, though JSON started as Drill's reference data source, real-world 
experience has shown that Parquet has assumed that role; JSON is not used only 
lightly. Hence, there may little interest or value in addressing the above 
issues.

Still, it is worth presenting some options if it ever becomes necessary to 
improve Drill's JSON support.

h4. Futures: Stats

Drill can address its inability to predict the future by converting the problem 
into one of learning from the past. Specifically, Drill can correctly pick the 
type of each column if Drill were to scan the entire collection of JSON files 
before executing the first query, and recording the inferred type of each 
column.

Drill already has (a prototype of) a statistics (stats) collection mechanism. 
Drill could extend this mechanism to scan the full set of JSON files. For each 
column, implement inference rules such as:

* Record the data types seen for each column. If a column is mostly nulls (or 
missing), even a single occurrence, even within terabytes of files, would be 
enough to tell Drill the type.
* When a column has a conflicting scalar type (both string and number, say), 
record the “most permissive” type. That is, do an automatic per-column “text 
mode” for scalar conflicts. (Or, for numbers, simply pick FLOAT8 if even a 
single value has a decimal point.)
* For numbers, use the value of the largest number to infer type. (INT, BITINT, 
FLOAT8 or DECIMAL).
* If a column has both primitive and structured values (both a string and map, 
say), then mark the column as needed an “enhanced” text mode: read the complex 
type as a JSON string.

The above simply extends existing schema inference beyond the first row of each 
file to the entire collection of files. Clearly there is a cost (the need to 
scan the entire set of JSON files, and keep the stats up to date via repeated 
scans.)

The solution is not perfect, but may reduce problems by some percentage, say 
80-90%. This solution has the advantage of living up to Drill’s core promise of 
being “schema-free”: the user does not provide a schema, though Drill infers a 
schema from actual data.

h4. Futures: Schema Metadata

The stats solution is still not complete: it cannot address the issue of 
intent: what does the JSON actually represent? It can only deal with the entire 
body of JSON at the syntax level.

The next level of solution allows the user to specify their intent. That is, to 
specify rules for mapping their application-specific JSON encoding into the 
relational model used by Drill.

Many of the difficulties in JSON come from ambiguities that limit Drill's 
ability to map JSON into a fixed set of tabular columns. One straightforward 
solution is to supply a schema (which may include mapping rules) that resolve 
the ambiguities for each application.

Advantages of a defined schema:

* Code generation vs. interpretation — performance
* Avoid ambiguities — fewer failures

The metadata might include simple information such as:

* The declared type of each column. (Drill uses that type even if values are 
null or missing, eliminating the need to guess.)
* Columns to be ignored in a `SELECT *` query. (Such as internal metadata 
columns.)

The metadata layer would be much more powerful if it also included mapping 
rules for cases such as:

* Identify that an array is represents a tuple, and provide column names and 
types for each array element.
* Identify that a map structure is used to encode the name and type of a value, 
along with rules to map that value to a Drill column.

An example of the first case above:

{noformat}
Metadata: Array as a row: [cust_id INT, name: VARCHAR….]
Data: [101, “fred”, …]
{noformat}

Example of the second case:

{noformat}
Mapping rule: for key {“$type:” type, “$value”: value},
Create a column named  of type  with value .
Data: cust_id {“$type”: “INT”, value: 101}
{noformat}

The need for the mapping (and the required set of features) will be made 
clearer by gathering many samples of JSON “in the wild” to learn how 
application developers actually encode their data into JSON.

Possible sources of a schema:

* Hive
* Schema file stored with the data
* Metadata system as in Impala
* Custom mapping code that is a "data format" plugin on top of the JSON format 
plugin

Drill uses views for this role today. But, views introduce a chicken-and-egg 
problem and cannot be the (whole) solution for JSON. Views are

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-28 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305699#comment-16305699
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:57 PM:
--

h4. Solutions

Drill's JSON support is mature; it is used successfully by a wide range of 
users. Presumably, the issues discussed above are only theoretical; practical 
users do not encounter them. (If they did, we'd see many more open bugs in this 
area.)

Further, though JSON started as Drill's reference data source, real-world 
experience has shown that Parquet has assumed that role; JSON is not used only 
lightly. Hence, there may little interest or value in addressing the above 
issues.

Still, it is worth presenting some options if it ever becomes necessary to 
improve Drill's JSON support.

h4. Futures: Stats

Drill can address its inability to predict the future by converting the problem 
into one of learning from the past. Specifically, Drill can correctly pick the 
type of each column if Drill were to scan the entire collection of JSON files 
before executing the first query, and recording the inferred type of each 
column.

Drill already has (a prototype of) a statistics (stats) collection mechanism. 
Drill could extend this mechanism to scan the full set of JSON files. For each 
column, implement inference rules such as:

* Record the data types seen for each column. If a column is mostly nulls (or 
missing), even a single occurrence, even within terabytes of files, would be 
enough to tell Drill the type.
* When a column has a conflicting scalar type (both string and number, say), 
record the “most permissive” type. That is, do an automatic per-column “text 
mode” for scalar conflicts. (Or, for numbers, simply pick FLOAT8 if even a 
single value has a decimal point.)
* For numbers, use the value of the largest number to infer type. (INT, BITINT, 
FLOAT8 or DECIMAL).
* If a column has both primitive and structured values (both a string and map, 
say), then mark the column as needed an “enhanced” text mode: read the complex 
type as a JSON string.

The above simply extends existing schema inference beyond the first row of each 
file to the entire collection of files. Clearly there is a cost (the need to 
scan the entire set of JSON files, and keep the stats up to date via repeated 
scans.)

The solution is not perfect, but may reduce problems by some percentage, say 
80-90%. This solution has the advantage of living up to Drill’s core promise of 
being “schema-free”: the user does not provide a schema, though Drill infers a 
schema from actual data.

h4. Futures: Schema Metadata

The stats solution is still not complete: it cannot address the issue of 
intent: what does the JSON actually represent? It can only deal with the entire 
body of JSON at the syntax level.

The next level of solution allows the user to specify their intent. That is, to 
specify rules for mapping their application-specific JSON encoding into the 
relational model used by Drill.

Many of the difficulties in JSON come from ambiguities that limit Drill's 
ability to map JSON into a fixed set of tabular columns. One straightforward 
solution is to supply a schema (which may include mapping rules) that resolve 
the ambiguities for each application.

Advantages of a defined schema:

* Code generation vs. interpretation — performance
* Avoid ambiguities — fewer failures

The metadata might include simple information such as:

* The declared type of each column. (Drill uses that type even if values are 
null or missing, eliminating the need to guess.)
* Columns to be ignored in a `SELECT *` query. (Such as internal metadata 
columns.)

The metadata layer would be much more powerful if it also included mapping 
rules for cases such as:

* Identify that an array is represents a tuple, and provide column names and 
types for each array element.
* Identify that a map structure is used to encode the name and type of a value, 
along with rules to map that value to a Drill column.

An example of the first case above:

{noformat}
Metadata: Array as a row: [cust_id INT, name: VARCHAR….]
Data: [101, “fred”, …]
{noformat}

Example of the second case:

{noformat}
Mapping rule: for key {“$type:” type, “$value”: value},
Create a column named  of type  with value .
Data: cust_id {“$type”: “INT”, value: 101}
{noformat}

The need for the mapping (and the required set of features) will be made 
clearer by gathering many samples of JSON “in the wild” to learn how 
application developers actually encode their data into JSON.

Possible sources of a schema:

* Hive
* Schema file stored with the data
* Metadata system as in Impala
* Custom mapping code that is a "data format" plugin on top of the JSON format 
plugin

Drill uses views for this role today. But, views introduce a chicken-and-egg 
problem and cannot be the (whole) solution for JSON. Views are

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-28 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305689#comment-16305689
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:42 PM:
--

h4. JSON Translation in Drill

The goal of the JSON reader is to load JSON data into Drill vectors. The 
challenge, as shown above, is that the two data models are different. The 
problems are two-fold:

* There exist many perfectly valid ways to translate relational data into JSON.
* There exist no universal ways to translate arbitrary JSON into a relational 
model.

That is, JSON can represent any relational table, and do so in a variety of 
ways. There is a one (table) to many (JSON formats) relationship. Further, 
there are many JSON structures that do not correspond to tables.

h4. Lack of JSON Support in JDBC and ODBC

The issue is further complicated by the fact that ODBC (Tableau) and JDBC are 
Drill’s primary client interfaces. These formats do not readily support 
non-tabular data. Thus, not only just Drill successfully consume JSON DAGs, 
Drill must then convert these structures into simple tables to be consumed by 
BI tools. (That is, Drill is not a document-oriented query engine, it is a 
classic tabular query engine.)

h4. Many-to-Many Mappings from JSON to Tables

The challenge of the JSON reader is to convert arbitrary JSON into a relational 
model, and to do so with no (or very little) information beyond a list of 
projected columns and the JSON file itself.

As we will see, the problem is fundamentally not solvable: Drill has too little 
information to correctly map from the many possible (and often conflicting) 
JSON formats into the proper relational format.

Consider a trivial example.  The following is a perfectly legal representation 
of a Customer in JSON:

{code}
[101, “Fred”, “Bedrock”, 123.45, “10-11-12”]
{code}

Applications sometimes use the above format to conserve space. It works because 
the writer and reader agree on the meaning of each array entry.

How is Drill to interpret the above. More to the point, how is Drill to 
interpret the above *without a schema*? Said another way, the above format 
works because the writer and reader agree on a format, but Drill is designed to 
work without that information. Clearly, without a schema, it is impossible to 
understand that the above is a terse representation of a row.

Without a schema, all Drill knows is that the above is a heterogeneous array. 
How is Drill to know that this array has a one-to-one correspondence to columns 
in a customer record vs. say, an arbitrary array? It can’t. All Drill can do is 
ask the user to enable “all-text mode”, read the array as text, then use SQL to 
project the array entries correctly:

{code}
SELECT CAST(cust[0] AS INTEGER) AS cust_id,
   cust[1] AS cust_name, cust[2] AS city,
   CAST(cust[3] AS FLOAT8) AS balance,
   TO_DATE(cust[4], ‘yy-mm-dd’) as start_date FROM …
{code}

Although this example is contrived, “JSON in the wild” has a wide variety of 
formats since JSON is a universal format and places no constraints on an 
application’s data model.

h4. Schema Inference

Translation of JSON to the relational model is simple when the JSON is designed 
to exactly fit what Drill expects. 

Since JSON is Drill’s reference model, Drill attempts to translate all valid 
JSON into the relational model. This is impossible in the general case. Here 
are just a few of the complexities:

* A run of null values without a non-null value.
* Different data types for an attribute name across objects. (That is, JSON 
does not enforce a schema.)
* Different data types (as parsed by Drill) for array elements. (Such as the 
customer row example above.)
* Nulls inside arrays.
* “Sophisticated” data models such as those described earlier

Drill implements a variety of special rules to handle some of the above in some 
special cases. These rules were presented in earlier sections. To summarize a 
few:

* All-text mode can overcome different primitive types in an object attribute 
or list (at the cost of extensive casts in the SQL expression.) But, all-text 
mode cannot overcome a change from primitive type to object or list. All-text 
mode applies to all queries within a session, not just to the one column with a 
conflict.
* Using a projection list to avoid projecting a column with conflicting types 
(but, then the value is unavailable to Drill queries.) A projection list, 
however, changes the data structure, moving columns nested inside maps to the 
top level, requiring changes elsewhere in the query to adjust.
* “Null-deferral” delays picking a type for a null column until a value is 
seen, but can’t see across a batch boundary.
* “All numbers as Float” handles the case of integer numbers followed by 
numbers with a decimal point, but is a session option so applies to all queries 
within a session, not 

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-28 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305682#comment-16305682
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:35 PM:
--

h4. Background

The above sections focus on the actual implementation of JSON in Drill. This 
section takes a step back to provide broader industry context.

h4. The JSON Data Model

[JSON is built on … universal data structures|https://www.json.org]. JSON is 
truly schema-free: it supports any arbitrary (DAG) data model. As shown on the 
[JSON site|https://www.json.org]. JSON is based on three simple concepts: 
objects (lists of name/value pairs), lists, and scalar values.

JSON is a newer format, but the concept is classic. XML was based on a similar 
model (but was much more verbose and complex.) XML itself was a rehash of an 
even older model, the [CODASYL|https://en.wikipedia.org/wiki/CODASYL] model 
that described hierarchical databases in the 1960s.

The power of a universal representation comes at a cost: queries are complex. 
CODASYL defined a model, as did XML with 
[XQuery|https://en.wikipedia.org/wiki/XQuery]. In browsers, the 
[DOM|https://en.wikipedia.org/wiki/Document_Object_Model] provides a 
programmatic query language for XML trees that define web pages. Each of these 
query systems is very complex; far too complex for everyday business use.

h4. The Relational Data Model

The relational model was invented by Edgar Codd as a [reaction to the 
complexity|http://history-computer.com/ModernComputer/Software/Codd.html] of 
the prior CODASYL model:

bq. Don Chamberlin, … coinventor of SQL, \[said of] Codd's ideas: "...since I'd 
been studying CODASYL (the language used to query navigational databases), I 
could imagine how those queries would have been represented in CODASYL by 
programs that were five pages long, that would navigate through this labyrinth 
of pointers and stuff. Codd would sort of write them down as one-liners. ... 
They weren't complicated at all. I said, 'Wow.' This was kind of a conversion 
experience for me. I understood what the relational thing was about after that."

h4. The Drill Data Model

Drill is a relational engine based on the SQL language (and hence the 
relational data model). Drill uses a columnar representation to realize the 
[basic relational 
contracts|https://en.wikibooks.org/wiki/Relational_Database_Design/Basic_Concepts]:

* Domain (set of values, often expressed as a specific data type)
* Column: a (name, domain) “attribute that describe an entity in the database 
model.”
* Row: a complete set of columns
* Table: a collection of rows

The key point is that relational theory is based on relations (tables) that 
consist of a set of rows, all of which have the same set of columns (the same 
schema). In fact, the restriction to a consistent schema is what allows Drill 
to store data as columns rather than rows. (The columnar model would make 
little sense for, say, an HTML document in which each element has a different 
set of attributes.)

The relational model is decidedly *not* universal: it is a highly restricted 
data model chosen because of the expressive power that relational theory 
provides when data is restricted to a tabular presentation.

The complexity of Drill's JSON implementation arises from the idea that Drill 
can automatically map from JSON's universal format to the relational tabular 
format. That is an assumption that turns out to be naive.


was (Author: paul.rogers):
h4. Background

The above sections focus on the actual implementation of JSON in Drill. This 
section takes a step back to provide broader industry context.

h4. The JSON Data Model

[JSON is built on … universal data structures|https://www.json.org]. JSON is 
truly schema-free: it supports any arbitrary (DAG) data model. As shown on the 
[JSON site|https://www.json.org]. JSON is based on three simple concepts: 
objects (lists of name/value pairs), lists, and scalar values.

JSON is a newer format, but the concept is classic. XML was based on a similar 
model (but was much more verbose and complex.) XML itself was a rehash of an 
even older model, the [CODASYL|https://en.wikipedia.org/wiki/CODASYL] model 
that described hierarchical databases in the 1960s.

The power of a universal representation comes at a cost: queries are complex. 
CODASYL defined a model, as did XML with 
[XQuery|https://en.wikipedia.org/wiki/XQuery]. In browsers, the 
[DOM|https://en.wikipedia.org/wiki/Document_Object_Model] provides a 
programmatic query language for XML trees that define web pages. Each of these 
query systems is very complex; far too complex for everyday business use.

h4. The Relational Data Model

The relational model was invented by Edgar Codd as a [reaction to the 
complexity|http://history-computer.com/ModernComputer/Software/Codd.html] of 
the prior CODASYL model:

bq. Don Chamb

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-28 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:11 PM:
--

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Multi-dimensional arrays can contain nulls at the outer level but not the 
inner level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type.
* An entire array can be null, which is represented as an empty array. (That 
is, an empty array and a {{null}} value are considered the same.)

Thus, the following is valid JSON, but is not supported by Drill:

{code}
{a: [10, null, 20] }
{code}

The general rule is that Drill does not support nulls in one-dimensional JSON 
arrays. (See below for the 2+D case.)

h4. "Deferred" Type Definition in Drill 1.13

Drill 1.13 (but likely not Drill 1.12) can handle some ambiguity in arrays:

{code}
{a: 1}
{a: 2, b: null}
{a: 3, b: []}
{a: 4, b: [10, 20]}
{code}

Here, the first three records are ambiguous. The first has no entry for {{b}}, 
the second says that {{b}} is {{null}}, but not the type of {{null}} (In JSON, 
{{null}} is untyped, while in Drill {{NULL}} must have a type.) In the third 
record we learn that {{b}} is an array, but do not yet know the type of the 
array. Finally, in record four we learn that {{b}} is an array of {{BIGINT}}.

This mechanism works only if the type is revealed in the first batch. If the 
type does not appear, Drill must guess a type, will guess array of text-mode. 
This works within the same file if the array turns out to be, say, integers. 
But, it will probably cause a schema conflict across files that see the 
integers within the first batch. (The general rule is that Drill cannot predict 
the future.)

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, 
inoperable) {{LIST}} type as described below that handles nulls. But, this 
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types 
which cannot handle nulls. (The {{LIST}} type is described in a separate note 
below.)

h4. Multi-Dimensional Arrays

JSON supports lists (arrays) of arbitrary dimensions. Drill supports this via 
two types:

* A base array in the form of a repeated cardinality of any primitive type or 
map. (The so-called “Repeated mode.”)
* A extended array of base or extended arrays. (The so-called “repeated list.”)

Drill supports multi-dimensional arrays of a single type (including maps). 
Unlike JSON:

* Drill arrays cannot hold nulls.
* Arrays themselves cannot be null (though they can be empty.)

Note that support for lists is via the {{RepeatedListVector}}. Note that this 
is not simply the {{REPEATED}} form of the {{ListVector}}; it is something else 
entirely. The {{ListVector}} is not supported at present (see below), but the 
{{RepeatedListVector}} is.

For example:

{code}
{a: [[10, 20], [30, 40]]}
{code}

Drill (at least in 1.13) allows nulls in the outer dimension of an array:

{code}
{a: [[10, 20], null, [30, 40]]}
{code}

A null is the same as an empty array:

{code}
{a: [[10, 20], [ ], [30, 40]]}
{code}

Drill 1.13 allows "ambiguity" in arrays before seeing an actual value:
{code}
{a: null}
{a: [ ]}
{a: [ [ ] ]}
{a: [[10, 20], null, [30, 40]]}
{code}

The caveat is that the value must appear within the first batch. Otherwise, the 
first batch must guess at a type, will guess text mode, and will conflict with 
later values if they are arrays or maps.

Note that Drill *cannot* handle leading nulls before the type is seen. The 
following is invalid:

{code}
{a: [null, [10, 20], [30, 40]]}
{code}

The reason is that Drill does not yet know the type at the time that the null 
appears, and so does not know whether to treat it as an empty array or as a 
null scalar (and, of course, nulls are not supported in scalar arrays.)


was (Author: paul.rogers):
h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Multi-dimensional arrays can contain nulls at the outer level but not the 
inner level.

(See

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-28 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296230#comment-16296230
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:02 PM:
--

h4. Lists

_*NOTE:* This section describes the Drill {{LIST}} type which turns out to be 
broken and not supported. The following is based on a prototype using the 
{{LIST}} type created after fixing some, but not all, {{LIST}} bugs._

JSON supports arrays of the form:

{code}
{a: ["I'm", "an", "array"] }
{code}

Drill has two very different ways to represent arrays:

1. As a {{REPEATED}} cardinality for most data types. This gives rise to a 
{{RepeatedFooVector}} for some type {{Foo}}.
2. As a {{LIST}} type with the {{ListVector}} implementation.

Here, Arrow has done a nice job. Arrow unified the {{REPEATED}} cardinality and 
the {{LIST}} vector type into a single concept. Drill, however, still has two 
systems.

h4. Repeated Cardinality

Drill's "go to" way to handle arrays is with the {{REPEATED}} cardinality (AKA 
"repeated data mode.") Most readers that handle arrays use the {{REPEATED}} 
form. To help understand the {{LIST}} type, we review {{REPEATED}} support here.

When working with a {{REPEATED} column, the rules for nulls  are:

* Arrays may not contain nulls. (Drill does not support nulls as array 
elements.)
* A null (or missing) array field is treated the same as an empty array.

If JSON were to use the {{REPEATED}} vectors, the following would be invalid:

{code}
[10, null, 20]
{code}

The following are all valid with {{REPEATED}} vectors:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

h4. Properties of Lists

The key properties of a List relative to a repeated type are:

* Each list can be a list of nothing, a list of a single type, or a list of a 
union of multiple types.
* Each list value can be null.
* Each list entry can be null (for primitive types.)

h4. Null Support

As explained below, lists support three kinds of nullability:

* The type itself can be null (a list of nulls)
* The list column value for a row can be null. (This is in contrast to repeated 
types in which an array can be empty, but the entire array value cannot be 
null.)
* When the list is of a primitive type, entries can be null. (The list is 
defined as list of nullable items, such as nullable BIGINT.)
* When the list is of maps, the map entries *cannot* be null. (Instead, the map 
columns are nullable and all columns for the "null" map are set to null.)
* When the list is of other lists, the list entry *cannot* be null. (Instead, 
the nested list is empty.)

The semantics are a bit confusing when seen from the outside. They make 
slightly more sense based on the implementation choices made in the code. 
(Though, generally we want the code to match our requirements, not the other 
way around.)
h4. Late Type Identification

As described earlier, Drill 1.13 will defer picking an array type if it sees 
null values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of 
nulls (such as a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the 
List section below.) It seems that {{SqlLine}} will correctly show the null 
values.

An interesting case occurs when Drill reads two files: one with an array with 
only nulls, another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

(The above condition can occur only if JSON uses the broken {{LIST}} type; it 
cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A 
contains:

{noformat}
{a: []}
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. 
At some point, the two arrays will come together. One fragment will have 
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will 
trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists 
are compatible and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, 
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a 
guess, a schema change will result if the guess is wrong.

h4. Lists are Obscure

The {{LIST}} type appears to be used only for JSON, and it is unclear how well 
supported it is in the rest of Drill. For example, it is not clear that 
functions that work with arrays correctly ha

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-28 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/28/17 6:42 PM:
--

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Multi-dimensional arrays can contain nulls at the outer level but not the 
inner level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

The general rule is that Drill does not support nulls in JSON arrays. However, 
there are many odd cases.

If the JSON array is 2+ dimensional, then Drill supports nulls in outer 
dimensions but not inner ones. That is, the following is legal:

{code}
{a: [["null", "is"], null, ["allowed"]]}
{code}

Here, {{null}} behaves the same as an empty array: {{[ ]}}.

Note, however, that the above is *not* valid if the nulls come before the first 
non-null value:

{code}
{a: [null, ["not", "allowed"]]}
{code}

The reason is that, in the previous example, Drill could determine that the 
outer list is a list of string arrays. But, in the second case, Drill cannot 
tell the array type when it sees the {{null}}: it might be an array of scalars, 
so {{null}} is not allowed.

That said, the following *is* allowed:

{code}
{a: [["null", "is"], null, ["allowed"]]}
{a: [null, ["is", "allowed"]]}
{code}

Here, the first array showed the type; Drill remembers that type for the second 
array.

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, 
inoperable) {{LIST}} type as described below that handles nulls. But, this 
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types 
which cannot handle nulls. (The {{LIST}} type is described in a separate note 
below.)

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select 
Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That 
is, an empty array and a {{null}} value are considered the same.)

h4. Late Type Identification

As described earlier, Drill 1.13 will defer picking an array type if it sees 
null values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of 
nulls (such as a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the 
List section below.) It seems that {{SqlLine}} will correctly show the null 
values.

An interesting case occurs when Drill reads two files: one with an array with 
only nulls, another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

(The above condition can occur only if JSON uses the broken {{LIST}} type; it 
cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A 
contains:

{noformat}
{a: []}
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. 
At some point, the two arrays will come together. One fragment will have 
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will 
trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists 
are compatible and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, 
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a 
guess, a schema change will result if the guess is wrong.

h4. Multi-Dimensional Arrays

Drill supports multi-dimensional arrays of a single type (including maps). For 
example:

{code}
{a: [[10, 20], [30, 40]]}
{code}

Drill (at least in 1.13) allows nulls in the outer dimension of an array:

{code}
{a: [[10, 20], null, [30, 40]]}
{code}

A null is the same as an empty array:

{code}
{a: [[10, 20], [ ], [30, 40]]}
{code}

Drill 1.13 allows "ambiguity" in arrays before seeing an actual value:
{code}
{a: null}
{a: [ ]}
{a: [ [ ] ]}
{a: [[10, 20], null, [30, 4

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-26 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/27/17 7:06 AM:
--

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Two-dimensional arrays can contain nulls at the outer level but not the inner 
level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

The general rule is that Drill does not support nulls in JSON arrays. However, 
there are many odd cases.

If the JSON array is 2+ dimensional, then Drill supports nulls in outer 
dimensions but not inner ones. That is, the following is legal:

{code}
{a: [["null", "is"], null, ["allowed"]]}
{code}

Here, {{null}} behaves the same as an empty array: {{[ ]}}.

Note, however, that the above is *not* valid if the nulls come before the first 
non-null value:

{code}
{a: [null, ["not", "allowed"]]}
{code}

The reason is that, in the previous example, Drill could determine that the 
outer list is a list of string arrays. But, in the second case, Drill cannot 
tell the array type when it sees the {{null}}: it might be an array of scalars, 
so {{null}} is not allowed.

That said, the following *is* allowed:

{code}
{a: [["null", "is"], null, ["allowed"]]}
{a: [null, ["is", "allowed"]]}
{code}

Here, the first array showed the type; Drill remembers that type for the second 
array.

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, 
inoperable) {{LIST}} type as described below that handles nulls. But, this 
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types 
which cannot handle nulls. (The {{LIST}} type is described in a separate note 
below.)

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select 
Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That 
is, an empty array and a {{null}} value are considered the same.)

h4. Late Type Identification

As described earlier, Drill 1.13 will defer picking an array type if it sees 
null values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of 
nulls (such as a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the 
List section below.) It seems that {{SqlLine}} will correctly show the null 
values.

An interesting case occurs when Drill reads two files: one with an array with 
only nulls, another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

(The above condition can occur only if JSON uses the broken {{LIST}} type; it 
cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A 
contains:

{noformat}
{a: []}
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. 
At some point, the two arrays will come together. One fragment will have 
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will 
trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists 
are compatible and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, 
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a 
guess, a schema change will result if the guess is wrong.


was (Author: paul.rogers):
h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Two-dimensional arrays can contain nulls at the outer level but not the inner 
level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{co

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-26 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/27/17 7:04 AM:
--

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Two-dimensional arrays can contain nulls at the outer level but not the inner 
level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

The general rule is that Drill does not support nulls in JSON arrays. However, 
there are many odd cases.

If the JSON array is 2+ dimensional, then Drill supports nulls in outer 
dimensions but not inner ones. That is, the following is legal:

{code}
{a: [["null", "is"], null, ["allowed"]]}
{code}

Here, {{null}} behaves the same as an empty array: {{[ ]}}.

Note, however, that the above is *not* valid if the nulls come before the first 
non-null value:

{code}
{a: [null, ["not", "allowed"]]}
{code}

The reason is that, in the previous example, Drill could determine that the 
outer list is a list of string arrays. But, in the second case, Drill cannot 
tell the array type when it sees the {{null}}: it might be an array of scalars, 
so {{null}} is not allowed.

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, 
inoperable) {{LIST}} type as described below that handles nulls. But, this 
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types 
which cannot handle nulls. (The {{LIST}} type is described in a separate note 
below.)

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select 
Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That 
is, an empty array and a {{null}} value are considered the same.)

h4. Late Type Identification

As described earlier, Drill 1.13 will defer picking an array type if it sees 
null values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of 
nulls (such as a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the 
List section below.) It seems that {{SqlLine}} will correctly show the null 
values.

An interesting case occurs when Drill reads two files: one with an array with 
only nulls, another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

(The above condition can occur only if JSON uses the broken {{LIST}} type; it 
cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A 
contains:

{noformat}
{a: []}
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. 
At some point, the two arrays will come together. One fragment will have 
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will 
trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists 
are compatible and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, 
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a 
guess, a schema change will result if the guess is wrong.


was (Author: paul.rogers):
h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Two-dimensional arrays can contain nulls at the outer level but not the inner 
level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-26 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/27/17 7:04 AM:
--

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Two-dimensional arrays can contain nulls at the outer level but not the inner 
level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

The general rule is that Drill does not support nulls in JSON arrays. However, 
there are many odd cases.

If the JSON array is 2+ dimensional, then Drill supports nulls in outer 
dimensions but not inner ones. That is, the following is legal:

{code}
{a: [["a", "string"], null, ["another"]]}
{code}

Here, {{null}} behaves the same as an empty array: {{[ ]}}.

Note, however, that the above is *not* valid if the nulls come before the first 
non-null value:

{code}
{a: [null, ["not", "allowed"]]}
{code}

The reason is that, in the previous example, Drill could determine that the 
outer list is a list of string arrays. But, in the second case, Drill cannot 
tell the array type when it sees the {{null}}: it might be an array of scalars, 
so {{null}} is not allowed.

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, 
inoperable) {{LIST}} type as described below that handles nulls. But, this 
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types 
which cannot handle nulls. (The {{LIST}} type is described in a separate note 
below.)

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select 
Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That 
is, an empty array and a {{null}} value are considered the same.)

h4. Late Type Identification

As described earlier, Drill 1.13 will defer picking an array type if it sees 
null values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of 
nulls (such as a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the 
List section below.) It seems that {{SqlLine}} will correctly show the null 
values.

An interesting case occurs when Drill reads two files: one with an array with 
only nulls, another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

(The above condition can occur only if JSON uses the broken {{LIST}} type; it 
cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A 
contains:

{noformat}
{a: []}
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. 
At some point, the two arrays will come together. One fragment will have 
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will 
trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists 
are compatible and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, 
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a 
guess, a schema change will result if the guess is wrong.


was (Author: paul.rogers):
h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Two-dimensional arrays can contain nulls at the outer level but not the inner 
level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-19 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296230#comment-16296230
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/20/17 6:41 AM:
--

h4. Lists

_*NOTE:* This section describes the Drill {{LIST}} type which turns out to be 
broken and not supported. The following is based on a prototype using the 
{{LIST}} type created after fixing some, but not all, {{LIST}} bugs._

JSON supports arrays of the form:

{code}
{a: ["I'm", "an", "array"] }
{code}

Drill has two very different ways to represent arrays:

1. As a {{REPEATED}} cardinality for most data types. This gives rise to a 
{{RepeatedFooVector}} for some type {{Foo}}.
2. As a {{LIST}} type with the {{ListVector}} implementation.

Here, Arrow has done a nice job. Arrow unified the {{REPEATED}} cardinality and 
the {{LIST}} vector type into a single concept. Drill, however, still has two 
systems.

h4. Repeated Cardinality

Drill's "go to" way to handle arrays is with the {{REPEATED}} cardinality (AKA 
"repeated data mode.") Most readers that handle arrays use the {{REPEATED}} 
form. To help understand the {{LIST}} type, we review {{REPEATED}} support here.

When working with a {{REPEATED} column, the rules for nulls  are:

* Arrays may not contain nulls. (Drill does not support nulls as array 
elements.)
* A null (or missing) array field is treated the same as an empty array.

If JSON were to use the {{REPEATED}} vectors, the following would be invalid:

{code}
[10, null, 20]
{code}

The following are all valid with {{REPEATED}} vectors:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

h4. Properties of Lists

The key properties of a List relative to a repeated type are:

* Each list can be a list of nothing, a list of a single type, or a list of a 
union of multiple types.
* Each list value can be null.
* Each list entry can be null (for primitive types.)

h4. Null Support

As explained below, lists support three kinds of nullability:

* The type itself can be null (a list of nulls)
* The list column value for a row can be null. (This is in contrast to repeated 
types in which an array can be empty, but the entire array value cannot be 
null.)
* When the list is of a primitive type, entries can be null. (The list is 
defined as list of nullable items, such as nullable BIGINT.)
* When the list is of maps, the map entries *cannot* be null. (Instead, the map 
columns are nullable and all columns for the "null" map are set to null.)
* When the list is of other lists, the list entry *cannot* be null. (Instead, 
the nested list is empty.)

The semantics are a bit confusing when seen from the outside. They make 
slightly more sense based on the implementation choices made in the code. 
(Though, generally we want the code to match our requirements, not the other 
way around.)

h4. Lists are Obscure

The {{LIST}} type appears to be used only for JSON, and it is unclear how well 
supported it is in the rest of Drill. For example, it is not clear that 
functions that work with arrays correctly handle null entries. (This needs to 
be tested.) 

JDBC supports array columns, but it is not clear if the Drill JDBC driver has 
implemented them. ODBC doesn't support arrays at all, so whether it supports 
arrays with nulls is a moot point.

h4. Lists in JSON

The {{LIST}} type appears to be used only for JSON where it is a better fit for 
JSON semantics than Drill's normal {{REPEATED}} cardinality. The list type 
allows list members to be null. All of the following are legal using lists:

{code}
{a: null}
{a: []}
{a: [null, null]}
{a: [null, 10, null]}
{a: [10, "foo"]}
{code}

We'll look at each of these in detail.

h4. Degenerate Lists

Consider the simplest possible list in JSON: a file that contains only an empty 
list:

{noformat}
{a: []}
{noformat}

What is the type of the list? In JSON, lists have no type, they are just lists. 
Drill requires a type, however when working with a {{REPEATED}} cardinality: 
the column must be an array of something.

Lists, however, can be a list of only nulls using the obscure {{LATE}} data 
type. That is, the list exists, but has no type. ({{LATE}} seems to suggest 
that the type will be assigned later.)

Next, consider another degenerate array:

{noformat}
{a: [null, null]}
{noformat}

Here we have an array of nulls. Again, we don't know what type these are a null 
of. Again, a LIST allows the JSON reader to produce a row with a single column 
{{`a`}} that is of type {{LIST}} that contains only the "dummy" {{LATE}} type. 
The list will indicate that we have two entries, both of which are null.

It is unclear, however, if the rest of Drill supports this concept. (DRILL-5970 
discusses a case in which an empty array, with a List of {{LATE}}, is exported 
to Parquet, producing results different than one might naively expect.)

h4. Single-type L

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-19 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/20/17 6:39 AM:
--

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.
* Single-dimensional arrays cannot contain null entries.
* Two-dimensional arrays can contain nulls at the outer level but not the inner 
level.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

h4. Missing {{LIST}} Support

JSON arrays can contain nulls. Drill provides a (partially completed, 
inoperable) {{LIST}} type as described below that handles nulls. But, this 
vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types 
which cannot handle nulls. (The {{LIST}} type is described in a separate note 
below.)

Using array types, the following rules apply to nulls:

* An array cannot contain nulls.
* An empty array at the start of the file has an unknown type. (Do we select 
Nullable {{INT}}?)
* An entire array can be null, which is represented as an empty array. (That 
is, an empty array and a {{null}} value are considered the same.)

h4. Late Type Identification

As described earlier, Drill 1.13 will defer picking an array type if it sees 
null values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of 
nulls (such as a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the 
List section below.) It seems that {{SqlLine}} will correctly show the null 
values.

An interesting case occurs when Drill reads two files: one with an array with 
only nulls, another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

(The above condition can occur only if JSON uses the broken {{LIST}} type; it 
cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A 
contains:

{noformat}
{a: []}
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. 
At some point, the two arrays will come together. One fragment will have 
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will 
trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists 
are compatible and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, 
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a 
guess, a schema change will result if the guess is wrong.


was (Author: paul.rogers):
h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

Drill handles nulls in arrays using the {{LIST}} type, described in a separate 
note below.

h4. Late Type Identification

As described earlier, Drill will defer picking an array type if it sees null 
values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of 
nulls (such as a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the 
List section below.) It seems that {{SqlLine}} will correctly show the null 
values.

An interesting case occurs when Dri

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-18 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/19/17 5:42 AM:
--

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

Drill handles nulls in arrays using the {{LIST}} type, described in a separate 
note below.

h4. Late Type Identification

As described earlier, Drill will defer picking an array type if it sees null 
values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

h4. Null-Only Arrays

A special case occurs if a JSON file contains only empty arrays or arrays of 
nulls (such as a file that contains only the first three records above.)

In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the 
List section below.) It seems that {{SqlLine}} will correctly show the null 
values.

An interesting case occurs when Drill reads two files: one with an array with 
only nulls, another with real values. For example:

{noformat}
File A: {a: [null, null] }
File B: {a: [10, 20] }
{noformat}

Drill is distributed: one fragment will read File A, another will read File B. 
At some point, the two arrays will come together. One fragment will have 
created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will 
trigger a schema change error in this case.

Interestingly, however, if the query is a simple {{SELECT *}}, then the lists 
are compatible and {{SqlLine}} will display the correct results.

In Drill 1.13, if the first batch contains only nulls and/or empty arrays, 
Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a 
guess, a schema change will result if the guess is wrong.


was (Author: paul.rogers):
h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

Drill handles nulls in arrays using the {{LIST}} type, described in a separate 
note below.

h4. Late Type Identification

As described earlier, Drill will defer picking an array type if it sees null 
values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.

> Specify Drill's JSON behavior
> -
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-18 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/19/17 5:33 AM:
--

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

Drill handles nulls in arrays using the {{LIST}} type, described in a separate 
note below.

h4. Late Type Identification

As described earlier, Drill will defer picking an array type if it sees null 
values. For example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

In the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{BIGINT}}.


was (Author: paul.rogers):
h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

Drill handles nulls in arrays using the {{LIST}} type, described in a separate 
note below.

> Specify Drill's JSON behavior
> -
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-18 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296230#comment-16296230
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/19/17 5:30 AM:
--

h4. Lists

JSON supports arrays of the form:

{code}
{a: ["I'm", "an", "array"] }
{code}

Drill has two very different ways to represent arrays:

1. As a {{REPEATED}} cardinality for most data types. This gives rise to a 
{{RepeatedFooVector}} for some type {{Foo}}.
2. As a {{LIST}} type with the {{ListVector}} implementation.

Here, Arrow has done a nice job. Arrow unified the {{REPEATED}} cardinality and 
the {{LIST}} vector type into a single concept. Drill, however, still has two 
systems.

h4. Repeated Cardinality

Drill's "go to" way to handle arrays is with the {{REPEATED}} cardinality (AKA 
"repeated data mode.") Most readers that handle arrays use the {{REPEATED}} 
form. To help understand the {{LIST}} type, we review {{REPEATED}} support here.

When working with a {{REPEATED} column, the rules for nulls  are:

* Arrays may not contain nulls. (Drill does not support nulls as array 
elements.)
* A null (or missing) array field is treated the same as an empty array.

If JSON were to use the {{REPEATED}} vectors, the following would be invalid:

{code}
[10, null, 20]
{code}

The following are all valid with {{REPEATED}} vectors:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

h4. Properties of Lists

The key properties of a List relative to a repeated type are:

* Each list can be a list of nothing, a list of a single type, or a list of a 
union of multiple types.
* Each list value can be null.
* Each list entry can be null (for primitive types.)

h4. Null Support

As explained below, lists support three kinds of nullability:

* The type itself can be null (a list of nulls)
* The list column value for a row can be null. (This is in contrast to repeated 
types in which an array can be empty, but the entire array value cannot be 
null.)
* When the list is of a primitive type, entries can be null. (The list is 
defined as list of nullable items, such as nullable BIGINT.)
* When the list is of maps, the map entries *cannot* be null. (Instead, the map 
columns are nullable and all columns for the "null" map are set to null.)
* When the list is of other lists, the list entry *cannot* be null. (Instead, 
the nested list is empty.)

The semantics are a bit confusing when seen from the outside. They make 
slightly more sense based on the implementation choices made in the code. 
(Though, generally we want the code to match our requirements, not the other 
way around.)

h4. Lists are Obscure

The {{LIST}} type appears to be used only for JSON, and it is unclear how well 
supported it is in the rest of Drill. For example, it is not clear that 
functions that work with arrays correctly handle null entries. (This needs to 
be tested.) 

JDBC supports array columns, but it is not clear if the Drill JDBC driver has 
implemented them. ODBC doesn't support arrays at all, so whether it supports 
arrays with nulls is a moot point.

h4. Lists in JSON

The {{LIST}} type appears to be used only for JSON where it is a better fit for 
JSON semantics than Drill's normal {{REPEATED}} cardinality. The list type 
allows list members to be null. All of the following are legal using lists:

{code}
{a: null}
{a: []}
{a: [null, null]}
{a: [null, 10, null]}
{a: [10, "foo"]}
{code}

We'll look at each of these in detail.

h4. Degenerate Lists

Consider the simplest possible list in JSON: a file that contains only an empty 
list:

{noformat}
{a: []}
{noformat}

What is the type of the list? In JSON, lists have no type, they are just lists. 
Drill requires a type, however when working with a {{REPEATED}} cardinality: 
the column must be an array of something.

Lists, however, can be a list of only nulls using the obscure {{LATE}} data 
type. That is, the list exists, but has no type. ({{LATE}} seems to suggest 
that the type will be assigned later.)

Next, consider another degenerate array:

{noformat}
{a: [null, null]}
{noformat}

Here we have an array of nulls. Again, we don't know what type these are a null 
of. Again, a LIST allows the JSON reader to produce a row with a single column 
{{`a`}} that is of type {{LIST}} that contains only the "dummy" {{LATE}} type. 
The list will indicate that we have two entries, both of which are null.

It is unclear, however, if the rest of Drill supports this concept. (DRILL-5970 
discusses a case in which an empty array, with a List of {{LATE}}, is exported 
to Parquet, producing results different than one might naively expect.)

h4. Single-type Lists

The typical JSON use case is create lists with a single type, optionally 
containing nulls:

{noformat}
{a: [10, null, 20] }
{noformat}

In this case, the List operates in its' second mode: as a list of a single 
type, 

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-18 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/19/17 5:25 AM:
--

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls in Arrays

Drill handles nulls in arrays using the {{LIST}} type, described in a separate 
note below.


was (Author: paul.rogers):
h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls with Arrays

Rules for nulls  are:

* Arrays may not contain nulls. (Drill does not support nulls as array 
elements.)
* A null (or missing) array field is treated the same as an empty array.

The following is invalid:
{code}
[10, null, 20]
{code}

The following are all valid:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

As described, Drill will defer picking an array type if it sees null values. In 
the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{Repeated BIGINT}}. (This is 
the behavior for Drill 1.13, earlier versions may differ and require 
investigation.)

As usual, if the first file or batch contains only nulls, Drill will guess 
{{Nullable VARCHAR}} which will cause a schema change error if later records 
reveal the type to be an array (of any type.)

If the first batch contains only nulls and/or empty arrays, Drill guesses that 
the type is {{Repeated VARCHAR}}. (Again, this is specific to Drill 1.13.) For 
example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{code}

> Specify Drill's JSON behavior
> -
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-15 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/16/17 4:08 AM:
--

h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{a: [10, 12.5]} // Conflicting types: integer and float
{code}

h4. Nulls with Arrays

Rules for nulls  are:

* Arrays may not contain nulls. (Drill does not support nulls as array 
elements.)
* A null (or missing) array field is treated the same as an empty array.

The following is invalid:
{code}
[10, null, 20]
{code}

The following are all valid:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

As described, Drill will defer picking an array type if it sees null values. In 
the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{Repeated BIGINT}}. (This is 
the behavior for Drill 1.13, earlier versions may differ and require 
investigation.)

As usual, if the first file or batch contains only nulls, Drill will guess 
{{Nullable VARCHAR}} which will cause a schema change error if later records 
reveal the type to be an array (of any type.)

If the first batch contains only nulls and/or empty arrays, Drill guesses that 
the type is {{Repeated VARCHAR}}. (Again, this is specific to Drill 1.13.) For 
example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{code}


was (Author: paul.rogers):
h4. JSON Arrays

Drill supports simple arrays in JSON using the following rules:

* Arrays must contain hetrogeneous elements: any of the scalars described 
above, or a JSON object.

(See a later comment for nested arrays.)

For example, the following are scalar arrays:
{code}
[10, 20]
[10.30, 10.45]
["foo", "bar"]
[true, false]
{code}

h4. Schema Change in Arrays

The following will trigger errors:

{code}
{a: [10, "foo"]}  // Mixed types
{a: [10]} {a: ["foo"]} // Schema change
{code}

h4. Nulls with Arrays

Rules for nulls  are:

* Arrays may not contain nulls. (Drill does not support nulls as array 
elements.)
* A null (or missing) array field is treated the same as an empty array.

The following is invalid:
{code}
[10, null, 20]
{code}

The following are all valid:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{id: 4, a: [10, 20, 30]}
{code}

As described, Drill will defer picking an array type if it sees null values. In 
the above example, for id=2, Drill sees column `a` but does not pick a type. 
For id=3, Drill identifies that `a` is an array, but does not know the type. 
Finally, for id=4, Drill identifies the array as {{Repeated BIGINT}}. (This is 
the behavior for Drill 1.13, earlier versions may differ and require 
investigation.)

As usual, if the first file or batch contains only nulls, Drill will guess 
{{Nullable VARCHAR}} which will cause a schema change error if later records 
reveal the type to be an array (of any type.)

If the first batch contains only nulls and/or empty arrays, Drill guesses that 
the type is {{Repeated VARCHAR}}. (Again, this is specific to Drill 1.13.) For 
example:

{code}
{id: 1}
{id: 2, a: null}
{id: 3, a: []}
{code}

> Specify Drill's JSON behavior
> -
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-15 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293557#comment-16293557
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/16/17 4:07 AM:
--

h4. All-Text Mode

Drill provides the ability to read scalar values as text:

{code}
ALTER SESSION SET `store.json.all_text_mode` = true
{code}

In this mode, JSON scalars are read as follows:

|| JSON Type || As Member Value | As Array Value ||
| Missing | NULL (VARCHAR) | N/A |
| null | NULL (VARCHAR) |  String value "null" |
| true/false | "true"/"false" | Same |
| Number | Number text | Same |
| String | The string value (without quotes) | Same |

All-text mode can overcome some schema change exceptions such as:

* Long string of missing or null values before the first non-null value.
* Different scalar types in different records.
* Hetrogeneous arrays.
* Arrays that contain nulls. (The null values are stored as empty strings.)

In Drill 1.13, in all-text mode, missing columns are presumed to be Nullable 
VARCHAR. (Prior versions may have assumed Nullable INT.) As a result, if 
file1.json has column `x`, but file2.json does not, then no schema change will 
occur when combining the results since both files will assume that `x` is a 
Nullable VARCHAR. (Note that this works only if the query explicitly projects 
column `x`. It won't necessarily work for queries with the wildcard.)

Note that all-text mode cannot overcome schema changes due to mixes of scalar 
and structured (object or list) types.


was (Author: paul.rogers):
h4. All-Text Mode

Drill provides the ability to read scalar values as text:

{code}
ALTER SESSION SET `store.json.all_text_mode` = true
{code}

In this mode, JSON scalars are read as follows:

|| JSON Type || As Text ||
| Missing | NULL (VARCHAR) |
| null | NULL (VARCHAR) | 
| true/false | "true"/"false" |
| Number | Number text |
| String | The string value (without quotes) |

All-text mode can overcome some schema change exceptions such as:

* Long string of missing or null values before the first non-null value.
* Different scalar types in different records.
* Hetrogeneous arrays.
* Arrays that contain nulls. (The null values are stored as empty strings.)

In Drill 1.13, in all-text mode, missing columns are presumed to be Nullable 
VARCHAR. (Prior versions may have assumed Nullable INT.) As a result, if 
file1.json has column `x`, but file2.json does not, then no schema change will 
occur when combining the results since both files will assume that `x` is a 
Nullable VARCHAR. (Note that this works only if the query explicitly projects 
column `x`. It won't necessarily work for queries with the wildcard.)

Note that all-text mode cannot overcome schema changes due to mixes of scalar 
and structured (object or list) types.

> Specify Drill's JSON behavior
> -
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-15 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293265#comment-16293265
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/16/17 2:04 AM:
--

h4. JSON Scalars

Drill uses simple rules to infer scalar data types from JSON input:

* If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: 
"varchar"}}
* If the value is numeric, and has no decimal point, it is a Nullable 
{{BIGINT}}. Example: {{a: 10}}
* If the value is numeric, and has a decimal point, it is a Nullable 
{{FLOAT8}}. Eample: {{a: 10.5}}
* If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as 
Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: 
true}}

A special case occurs if the user enables {{store.json.all_text_mode}}. In this 
case, all cases above give rise to a nullable {{VARCHAR}} column type. 

h4. Nulls with Scalar Values

The rules for {{null}} are a bit more complex. The rules below apply to the 
Drill 1.13 JSON reader, earlier versions require investigation.

* If the {{null}} occurs after seeing one of the above types, then the value is 
a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.)
* If the {{null}} occurs at the start of the file, Drill defers identifying the 
data type until it sees the first non-null value.
* If the entire file (or first batch of a file: 4K records before Drill 1.13, 
variable number in Drill 1.13 or later) contains only {{null}} values, the 
column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has 
different behavior.)

For example, in the following example, Drill 1.13 identifies the type as 
nullable {{BIGINT}}:

{code}
{a: null} {a: null} {a: null} {a: 10}
{code}

In the following example, Drill must guess that the data type is nullable 
{{VARCHAR}}:

{code}
{a: null} {a: null} {a: null}
{code}

h4. BIGINT vs. FLOAT8 Types

JSON has a single numeric type which can hold integers, floats or decimals. 
Drill attemps to infer the type of the field from the format of the first 
number encountered for a field using the following rules:

* If the number consists entirely of digits, it is an integer, stored as 
{{BIGINT}}.
* Else, it is a floating point number stored as {{FLOAT8}}.

Drill 1.13 adds the following rule:

* If a number is {{FLOAT8}}, accept an integer, but promote it to {{FLOAT8}}.

That is, the following array is stored as {{FLOAT8}}:

{code}
[1.2, 3, 5.0]
{code}

Drill is easily fooled, however, rearrange the values and Drill will raise an 
error:

{code}
[3, 1.2, 5.0]
{code}

In the second case, Drill sees 3, guesses {{BIGINT}}, then immediately fails 
when trying to store the float value 1.2.

To work around this problem, a user may set the 
{{`store.json.read_numbers_as_double`}} system/session property to read all 
numbers as {{FLOAT8}}, even if they appear to be integers.




was (Author: paul.rogers):
h4. JSON Scalars

Drill uses simple rules to infer scalar data types from JSON input:

* If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: 
"varchar"}}
* If the value is numeric, and has no decimal point, it is a Nullable 
{{BIGINT}}. Example: {{a: 10}}
* If the value is numeric, and has a decimal point, it is a Nullable 
{{FLOAT8}}. Eample: {{a: 10.5}}
* If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as 
Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: 
true}}

A special case occurs if the user enables {{store.json.all_text_mode}}. In this 
case, all cases above give rise to a nullable {{VARCHAR}} column type. 

h4. Nulls with Scalar Values

The rules for {{null}} are a bit more complex. The rules below apply to the 
Drill 1.13 JSON reader, earlier versions require investigation.

* If the {{null}} occurs after seeing one of the above types, then the value is 
a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.)
* If the {{null}} occurs at the start of the file, Drill defers identifying the 
data type until it sees the first non-null value.
* If the entire file (or first batch of a file: 4K records before Drill 1.13, 
variable number in Drill 1.13 or later) contains only {{null}} values, the 
column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has 
different behavior.)

For example, in the following example, Drill 1.13 identifies the type as 
nullable {{BIGINT}}:

{code}
{a: null} {a: null} {a: null} {a: 10}
{code}

In the following example, Drill must guess that the data type is nullable 
{{VARCHAR}}:

{code}
{a: null} {a: null} {a: null}
{code}

h4. BIGINT vs. FLOAT8 Types

JSON has a single numeric type which can hold integers, floats or decimals. 
Drill attemps to infer the type of the field from the format of the first 
number encountered for a field using the following rules:

* If the number consists entirely of digits, it is an integer, stored as

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-15 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293265#comment-16293265
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/16/17 2:01 AM:
--

h4. JSON Scalars

Drill uses simple rules to infer scalar data types from JSON input:

* If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: 
"varchar"}}
* If the value is numeric, and has no decimal point, it is a Nullable 
{{BIGINT}}. Example: {{a: 10}}
* If the value is numeric, and has a decimal point, it is a Nullable 
{{FLOAT8}}. Eample: {{a: 10.5}}
* If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as 
Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: 
true}}

A special case occurs if the user enables {{store.json.all_text_mode}}. In this 
case, all cases above give rise to a nullable {{VARCHAR}} column type. 

h4. Nulls with Scalar Values

The rules for {{null}} are a bit more complex. The rules below apply to the 
Drill 1.13 JSON reader, earlier versions require investigation.

* If the {{null}} occurs after seeing one of the above types, then the value is 
a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.)
* If the {{null}} occurs at the start of the file, Drill defers identifying the 
data type until it sees the first non-null value.
* If the entire file (or first batch of a file: 4K records before Drill 1.13, 
variable number in Drill 1.13 or later) contains only {{null}} values, the 
column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has 
different behavior.)

For example, in the following example, Drill 1.13 identifies the type as 
nullable {{BIGINT}}:

{code}
{a: null} {a: null} {a: null} {a: 10}
{code}

In the following example, Drill must guess that the data type is nullable 
{{VARCHAR}}:

{code}
{a: null} {a: null} {a: null}
{code}

h4. BIGINT vs. FLOAT8 Types

JSON has a single numeric type which can hold integers, floats or decimals. 
Drill attemps to infer the type of the field from the format of the first 
number encountered for a field using the following rules:

* If the number consists entirely of digits, it is an integer, stored as 
{{BIGINT}}.
* Else, it is a floating point number stored as {{FLOAT8}}.

Drill 1.13 adds the following rule:

* If a number is {{FLOAT8}}, accept an integer, but promote it to {{FLOAT8}}.

That is, the following array is stored as {{FLOAT8}}:

{code}
[1.2, 3, 5.0]
{code}

Drill is easily fooled, however, rearrange the values and Drill will raise an 
error:

{code}
[3, 1.2, 5.0]
{code}

In the second case, Drill sees 3, guesses {{BIGINT}}, then immediately fails 
when trying to store the float value 1.2.




was (Author: paul.rogers):
h4. JSON Scalars

Drill uses simple rules to infer scalar data types from JSON input:

* If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: 
"varchar"}}
* If the value is numeric, and has no decimal point, it is a Nullable 
{{BIGINT}}. Example: {{a: 10}}
* If the value is numeric, and has a decimal point, it is a Nullable 
{{FLOAT8}}. Eample: {{a: 10.5}}
* If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as 
Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: 
true}}

A special case occurs if the user enables {{store.json.all_text_mode}}. In this 
case, all cases above give rise to a nullable {{VARCHAR}} column type. 

h4. Nulls with Scalar Values

The rules for {{null}} are a bit more complex. The rules below apply to the 
Drill 1.13 JSON reader, earlier versions require investigation.

* If the {{null}} occurs after seeing one of the above types, then the value is 
a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.)
* If the {{null}} occurs at the start of the file, Drill defers identifying the 
data type until it sees the first non-null value.
* If the entire file (or first batch of a file: 4K records before Drill 1.13, 
variable number in Drill 1.13 or later) contains only {{null}} values, the 
column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has 
different behavior.)

For example, in the following example, Drill 1.13 identifies the type as 
nullable {{BIGINT}}:

{code}
{a: null} {a: null} {a: null} {a: 10}
{code}

In the following example, Drill must guess that the data type is nullable 
{{VARCHAR}}:

{code}
{a: null} {a: null} {a: null}
{code}

> Specify Drill's JSON behavior
> -
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This 

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-15 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293362#comment-16293362
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/16/17 1:19 AM:
--

h4. JSON Projection Pushdown

The JSON reader supports "projection push-down." The simple rules are simple in 
concept, but complex in details.

The project list comes from the query. In its simplest form, it is the list of 
columns following the {{SELECT}} keyword:

{code}
SELECT a, b.c, d[0] FROM ...
{code}

|| Projection || JSON Value of "a" || Drill Result ||
| `a` | Scalar | Projects `a` |
| | Array | Projects all elements of `a` |
| | Object | Projects all members of `a` |
| | Missing | Creates a {{Nullable INT}} (Drill 1.12) or {{Nullable VARCHAR}} 
(Drill 1.13) column |
| | {{null}} | As above |
| `a`.`b` | Scalar | Error (`a` must be an object) |
| | Scalar array | Error (`a` must be a map or an array of maps) |
| | Object that contains `b` | Projects just `b` from object `a` |
| | Object that does not contain `b` | Projects a nullable column `b` within 
map `a` |
| | Object array that contains `b` | Projects just be from the objects within 
array `a` |
| | Object array that does not contain `b` | Projects a nullable column `b` 
within the array of maps |
| | Missing | Projects a map `a` that contains a nullable column `b` |
| | {{null}} | As above |
| a\[0] | Scalar | Error (`a` must be an array) |
| | Scalar array | Projects just `a\[0]` as a scalar (the reader projects the 
entire array, a project operator pulls out the `a\[0]` element) |
| | Object | Error (`a` must e an array) |
| | Object array | Projects just object (map) `a\[0]` as described above |
| | {{null}} | JSON creates an array of null values, project pulls out `a\[0]` |
| | Missing | As above |

Notes:

* The rules above are for Drill 1.13. Drill 1.12 and earlier is different, and 
requires investigation.
* The rules for null values are suble. The type of the null is inferred from 
the project list in the case of a map (`a`.`b`) or an array (`a\[0]). Previous 
sections described null handling for the {{SELECT *}} and {{SELECT `a`}} cases.
* The rules for projecting map columns apply to both arrays and single maps. 
(In Drill 1.12 and earlier, the two cases appear to have behaved differently.)


was (Author: paul.rogers):
h4. JSON Projection Pushdown

The JSON reader supports "projection push-down." The simple rules are simple in 
concept, but complex in details.

The project list comes from the query. In its simplest form, it is the list of 
columns following the {{SELECT}} keyword:

{code}
SELECT a, b.c, d[0] FROM ...
{code}

|| Projection || JSON Value of `a` || Drill Result ||
| `a` | Scalar | Projects `a` |
| | Array | Projects all elements of `a` |
| | Object | Projects all members of `a` |
| | Missing | Creates a {{Nullable INT}} (Drill 1.12) or {{Nullable VARCHAR}} 
(Drill 1.13) column |
| | {{null}} | As above |
| `a`.`b` | Scalar | Error (`a` must be an object) |
| | Scalar array | Error (`a` must be a map or an array of maps) |
| | Object that contains `b` | Projects just `b` from object `a` |
| | Object that does not contain `b` | Projects a nullable column `b` within 
map `a` |
| | Object array that contains `b` | Projects just be from the objects within 
array `a` |
| | Object array that does not contain `b` | Projects a nullable column `b` 
within the array of maps |
| | Missing | Projects a map `a` that contains a nullable column `b` |
| | {{null}} | As above |
| a\[0] | Scalar | Error (`a` must be an array) |
| | Scalar array | Projects just `a\[0]` as a scalar (the reader projects the 
entire array, a project operator pulls out the `a\[0]` element) |
| | Object | Error (`a` must e an array) |
| | Object array | Projects just object (map) `a\[0]` as described above |
| | {{null}} | JSON creates an array of null values, project pulls out `a\[0]` |
| | Missing | As above |

Notes:

* The rules above are for Drill 1.13. Drill 1.12 and earlier is different, and 
requires investigation.
* The rules for null values are suble. The type of the null is inferred from 
the project list in the case of a map (`a`.`b`) or an array (`a\[0]). Previous 
sections described null handling for the {{SELECT *}} and {{SELECT `a`}} cases.
* The rules for projecting map columns apply to both arrays and single maps. 
(In Drill 1.12 and earlier, the two cases appear to have behaved differently.)

> Specify Drill's JSON behavior
> -
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitation

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-15 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293477#comment-16293477
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/16/17 1:07 AM:
--

h4. JSON Structure

The [JSON standard|https://tools.ietf.org/html/rfc7159] as [described more 
clearly here|https://www.json.org] states that a JSON document is a single 
value (null, scalar, object or list). Drill support a non-standard (but common) 
extension that allows a list of objects.

|| Document Structure || JSON Standard || Drill Support ||
| Empty | Invalid | Empty list of records |
| null | Valid | Invalid |
| Scalar | Valid | Invalid |
| Array | Valid | Valid (in Drill 1.13) as long as the value is an array of 
objects |
| Object | Valid | Single record |
| List of objects | Invalid | List of records |

In Drill, there must be no commas between top-level objects. This is a clear 
difference compared to the JSON standard which requires commas to separate 
items in a list or object. (This difference is because Drill's JSON file 
structure is not JSON. Think of it instead as a serialized set of JSON objects.)

h4. Drill JSON Document Structure

Thus, a typical JSON input file in Drill is:

{code}
{a: 10, b: "foo"}
{a: 20, b: "bar"}
{code}

h4. Top-Level Array

As noted above, for JSON compatibility, Drill also supports a top-level array 
of objects:

{code}
[
 {a: 10, b: "foo"},
 {a: 20, b: "bar"}
]
{code}

Note that, when the objects are in an array, a comma must separate objects.

The above applies to a JSON text file. No separator is implied (or needed) if 
the data comes from a document database, a Kafka stream or other non-file 
sources.


was (Author: paul.rogers):
h4. JSON Structure

The [JSON standard|https://tools.ietf.org/html/rfc7159] as [described more 
clearly here|https://www.json.org] states that a JSON document is a single 
value (null, scalar, object or list). Drill support a non-standard (but common) 
extension that allows a list of objects.

|| Document Structure || JSON Standard || Drill Support ||
| Empty | Invalid | Empty list of records |
| null | Valid | Invalid |
| Scalar | Valid | Invalid |
| Array | Valid | Valid (in Drill 1.13) as long as the value is an array of 
objects |
| Object | Valid | Single record |
| List of objects | Invalid | List of records |

In Drill, there must be no commas between top-level objects. This is a clear 
difference compared to the JSON standard which requires commas to separate 
items in a list or object. (This difference is because Drill's JSON file 
structure is not JSON. Think of it instead as a serialized set of JSON objects.)

h4. Drill JSON Document Structure

Thus, a typical JSON input file in Drill is:

{code}
{a: 10, b: "foo"}
{a: 20, b: "bar"}
{code}

As noted above, for JSON compatibility, Drill also supports a top-level array 
of objects:

{code}
[
 {a: 10, b: "foo"},
 {a: 20, b: "bar"}
]
{code}

Note that, when the objects are in an array, a comma must separate objects.

The above applies to a JSON text file. No separator is implied (or needed) if 
the data comes from a document database, a Kafka stream or other non-file 
sources.

> Specify Drill's JSON behavior
> -
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-15 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293265#comment-16293265
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/15/17 9:59 PM:
--

h4. JSON Scalars

Drill uses simple rules to infer scalar data types from JSON input:

* If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: 
"varchar"}}
* If the value is numeric, and has no decimal point, it is a Nullable 
{{BIGINT}}. Example: {{a: 10}}
* If the value is numeric, and has a decimal point, it is a Nullable 
{{FLOAT8}}. Eample: {{a: 10.5}}
* If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as 
Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: 
true}}

A special case occurs if the user enables {{store.json.all_text_mode}}. In this 
case, all cases above give rise to a nullable {{VARCHAR}} column type. 

h4. Nulls with Scalar Values

The rules for {{null}} are a bit more complex. The rules below apply to the 
Drill 1.13 JSON reader, earlier versions require investigation.

* If the {{null}} occurs after seeing one of the above types, then the value is 
a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.)
* If the {{null}} occurs at the start of the file, Drill defers identifying the 
data type until it sees the first non-null value.
* If the entire file (or first batch of a file: 4K records before Drill 1.13, 
variable number in Drill 1.13 or later) contains only {{null}} values, the 
column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has 
different behavior.)

For example, in the following example, Drill 1.13 identifies the type as 
nullable {{BIGINT}}:

{code}
{a: null} {a: null} {a: null} {a: 10}
{code}

In the following example, Drill must guess that the data type is nullable 
{{VARCHAR}}:

{code}
{a: null} {a: null} {a: null}
{code}


was (Author: paul.rogers):
Drill uses simple rules to infer scalar data types from JSON input:

* If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: 
"varchar"}}
* If the value is numeric, and has no decimal point, it is a Nullable 
{{BIGINT}}. Example: {{a: 10}}
* If the value is numeric, and has a decimal point, it is a Nullable 
{{FLOAT8}}. Eample: {{a: 10.5}}
* If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as 
Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: 
true}}

A special case occurs if the user enables {{store.json.all_text_mode}}. In this 
case, all cases above give rise to a nullable {{VARCHAR}} column type. 

h4. Nulls with Scalar Values

The rules for {{null}} are a bit more complex. The rules below apply to the 
Drill 1.13 JSON reader, earlier versions require investigation.

* If the {{null}} occurs after seeing one of the above types, then the value is 
a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.)
* If the {{null}} occurs at the start of the file, Drill defers identifying the 
data type until it sees the first non-null value.
* If the entire file (or first batch of a file: 4K records before Drill 1.13, 
variable number in Drill 1.13 or later) contains only {{null}} values, the 
column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has 
different behavior.)

For example, in the following example, Drill 1.13 identifies the type as 
nullable {{BIGINT}}:

{code}
{a: null} {a: null} {a: null} {a: 10}
{code}

In the following example, Drill must guess that the data type is nullable 
{{VARCHAR}}:

{code}
{a: null} {a: null} {a: null}
{code}

> Specify Drill's JSON behavior
> -
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-15 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293298#comment-16293298
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/15/17 9:59 PM:
--

h4. JSON Objects

Drill support for JSON objects consists of three parts.

* Drill supports a non-standard serialized JSON format in which an input file 
is a sequence of JSON objects.
* Values within the top-level JSON object give rise to columns within the Drill 
row.
* Nested objects in JSON give rise to nested {{MAP}} columns in Drill.
* Arrays of JSON objects give rise to {{REPEATED MAP}} columns in Drill.

h4. Top-Level JSON Objects

Example of the expected JSON input format:

{code}
{a: 10}
{a: 20}
{a: 30}
{code}

Drill allows any amount of white-space between objects. It is common to place 
each object on a new line, though this is not required.

Note that Drill allows value names to be quoted or unquoted. The following are 
both valid:

{code}
{"a": 10}
{a: 20}
{code}

h4. Drill MAP type

Drill uses the term "Map" to describe JSON objects. However, the {{MAP}} type 
in Drill is closer to the {{STRUCT}} type in Impala and Hive. That is, like a 
{{STRUCT}}, the schema of all map instances is identical across rows. (This is 
unlike, say, a JSON object or Python map in which the members of one instance 
are independent of those in any other instance.)

As a result, the following example:

{code}
{a: {x: 10}}
{a: {y: 20}}
{code}

Gives rise to records in Drill with data similar to the following:

{code}
{a: {x: 10, y: null}}
{a: {x: null, y: 20}}
{code}

h4. JSON Object Arrays and Drill Repeated MAPs

The example below shows a repeated object which gives rise to a {{REPEATED 
MAP}}:

{code}
{a: [{b: 10}, {b: 20}]}
{code}

h4. Nulls with JSON Objects

JSON allows null values for a map:

{code}
{id: 1, a: {x: 10}}
{id: 2, a: null}
{id: 3}
{code}

Drill does not support the concept of a "nullable map". Instead, Drill defines 
all map members to be nullable. If the entire object is null (or missing) in 
JSON, Drill treats this the same as if every member were null. Thus, in Drill, 
the following are all equivalent:

{code}
{id: 1, a: {x: null, y: null}}
{id: 2, a: {}}
{id: 3, a: {x: null}}
{id: 4, a: null}
{id: 5}
{code}

As a side note, when exporting the above data to a JSON file, Drill cannot 
recreate the original structure. Instead, it writes all of the above in a 
common format. (The format has evolved based on previous bugs, need to 
investigate the current choice.)

As described for scalars, Drill will defer selecting a type for a column if the 
initial records consist only of null values. If a later value is revealed to be 
a map, Drill will choose the map type. If the file (or first batch) consists 
only of nulls, then Drill cannot know the type and guesses {{VARCHAR}}. This 
will lead to a schema change error if a later file (or batch) reveals the type 
to actually be a map (since {{VARCHAR}} and {{MAP}} are not compatible.)

{code}
{id: 1} {id: 2, a: null} {id: 3, a: null}
{id: 4, a: {x: 10, y: 20}}
{code}


was (Author: paul.rogers):
Drill support for JSON objects consists of three parts.

* Drill supports a non-standard serialized JSON format in which an input file 
is a sequence of JSON objects.
* Values within the top-level JSON object give rise to columns within the Drill 
row.
* Nested objects in JSON give rise to nested {{MAP}} columns in Drill.
* Arrays of JSON objects give rise to {{REPEATED MAP}} columns in Drill.

h4. Top-Level JSON Objects

Example of the expected JSON input format:

{code}
{a: 10}
{a: 20}
{a: 30}
{code}

Drill allows any amount of white-space between objects. It is common to place 
each object on a new line, though this is not required.

Note that Drill allows value names to be quoted or unquoted. The following are 
both valid:

{code}
{"a": 10}
{a: 20}
{code}

h4. Drill MAP type

Drill uses the term "Map" to describe JSON objects. However, the {{MAP}} type 
in Drill is closer to the {{STRUCT}} type in Impala and Hive. That is, like a 
{{STRUCT}}, the schema of all map instances is identical across rows. (This is 
unlike, say, a JSON object or Python map in which the members of one instance 
are independent of those in any other instance.)

As a result, the following example:

{code}
{a: {x: 10}}
{a: {y: 20}}
{code}

Gives rise to records in Drill with data similar to the following:

{code}
{a: {x: 10, y: null}}
{a: {x: null, y: 20}}
{code}

h4. JSON Object Arrays and Drill Repeated MAPs

The example below shows a repeated object which gives rise to a {{REPEATED 
MAP}}:

{code}
{a: [{b: 10}, {b: 20}]}
{code}

h4. Nulls with JSON Objects

JSON allows null values for a map:

{code}
{id: 1, a: {x: 10}}
{id: 2, a: null}
{id: 3}
{code}

Drill does not support the concept of a "nullable map". Instead, Drill defines 
all map members to be nullable. If the entire object is null

[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior

2017-12-15 Thread Paul Rogers (JIRA)

[ 
https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293265#comment-16293265
 ] 

Paul Rogers edited comment on DRILL-6035 at 12/15/17 9:46 PM:
--

Drill uses simple rules to infer scalar data types from JSON input:

* If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: 
"varchar"}}
* If the value is numeric, and has no decimal point, it is a Nullable 
{{BIGINT}}. Example: {{a: 10}}
* If the value is numeric, and has a decimal point, it is a Nullable 
{{FLOAT8}}. Eample: {{a: 10.5}}
* If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as 
Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: 
true}}

A special case occurs if the user enables {{store.json.all_text_mode}}. In this 
case, all cases above give rise to a nullable {{VARCHAR}} column type. 

h4. Nulls with Scalar Values

The rules for {{null}} are a bit more complex. The rules below apply to the 
Drill 1.13 JSON reader, earlier versions require investigation.

* If the {{null}} occurs after seeing one of the above types, then the value is 
a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.)
* If the {{null}} occurs at the start of the file, Drill defers identifying the 
data type until it sees the first non-null value.
* If the entire file (or first batch of a file: 4K records before Drill 1.13, 
variable number in Drill 1.13 or later) contains only {{null}} values, the 
column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has 
different behavior.)

For example, in the following example, Drill 1.13 identifies the type as 
nullable {{BIGINT}}:

{code}
{a: null} {a: null} {a: null} {a: 10}
{code}

In the following example, Drill must guess that the data type is nullable 
{{VARCHAR}}:

{code}
{a: null} {a: null} {a: null}
{code}


was (Author: paul.rogers):
Drill uses simple rules to infer scalar data types from JSON input:

* If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: 
"varchar"}}
* If the value is numeric, and has no decimal point, it is a Nullable 
{{BIGINT}}. Example: {{a: 10}}
* If the value is numeric, and has a decimal point, it is a Nullable 
{{FLOAT8}}. Eample: {{a: 10.5}}
* If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as 
Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: 
true}}

A special case occurs if the user enables {{store.json.all_text_mode}}. In this 
case, all cases above give rise to a nullable {{VARCHAR}} column type. 

The rules for {{null}} are a bit more complex. The rules below apply to the 
Drill 1.13 JSON reader, earlier versions require investigation.

* If the {{null}} occurs after seeing one of the above types, then the value is 
a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.)
* If the {{null}} occurs at the start of the file, Drill defers identifying the 
data type until it sees the first non-null value.
* If the entire file (or first batch of a file: 4K records before Drill 1.13, 
variable number in Drill 1.13 or later) contains only {{null}} values, the 
column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has 
different behavior.)

For example, in the following example, Drill 1.13 identifies the type as 
nullable {{BIGINT}}:

{code}
{a: null} {a: null} {a: null} {a: 10}
{code}

In the following example, Drill must guess that the data type is nullable 
{{VARCHAR}}:

{code}
{a: null} {a: null} {a: null}
{code}

> Specify Drill's JSON behavior
> -
>
> Key: DRILL-6035
> URL: https://issues.apache.org/jira/browse/DRILL-6035
> Project: Apache Drill
>  Issue Type: Improvement
>Affects Versions: 1.13.0
>Reporter: Paul Rogers
>Assignee: Pritesh Maker
>
> Drill supports JSON as its native data format. However, experience suggests 
> that Drill may have limitations in the JSON that Drill supports. This ticket 
> asks to clarify Drill's expected behavior on various kinds of JSON.
> Topics to be addressed:
> * Relational vs. non-relational structures
> * JSON structures used in practice and how they map to Drill
> * Support for varying data types
> * Support for missing values, especially across files
> These topics are complex, hence the request to provide a detailed 
> specifications that clarifies what Drill does and does not support (or what 
> is should and should not support.)



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)