[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305699#comment-16305699 ] Paul Rogers edited comment on DRILL-6035 at 12/28/17 8:07 PM: -- h4. Solutions Drill's JSON support is mature; it is used successfully by a wide range of users. Presumably, the issues discussed above are only theoretical; practical users do not encounter them. (If they did, we'd see many more open bugs in this area.) Further, though JSON started as Drill's reference data source, real-world experience has shown that Parquet has assumed that role; JSON is not used only lightly. Hence, there may little interest or value in addressing the above issues. Still, it is worth presenting some options if it ever becomes necessary to improve Drill's JSON support. h4. Futures: Stats Drill can address its inability to predict the future by converting the problem into one of learning from the past. Specifically, Drill can correctly pick the type of each column if Drill were to scan the entire collection of JSON files before executing the first query, and recording the inferred type of each column. Drill already has (a prototype of) a statistics (stats) collection mechanism. Drill could extend this mechanism to scan the full set of JSON files. For each column, implement inference rules such as: * Record the data types seen for each column. If a column is mostly nulls (or missing), even a single occurrence, even within terabytes of files, would be enough to tell Drill the type. * When a column has a conflicting scalar type (both string and number, say), record the “most permissive” type. That is, do an automatic per-column “text mode” for scalar conflicts. (Or, for numbers, simply pick FLOAT8 if even a single value has a decimal point.) * For numbers, use the value of the largest number to infer type. (INT, BITINT, FLOAT8 or DECIMAL). * If a column has both primitive and structured values (both a string and map, say), then mark the column as needed an “enhanced” text mode: read the complex type as a JSON string. The above simply extends existing schema inference beyond the first row of each file to the entire collection of files. Clearly there is a cost (the need to scan the entire set of JSON files, and keep the stats up to date via repeated scans.) The solution is not perfect, but may reduce problems by some percentage, say 80-90%. This solution has the advantage of living up to Drill’s core promise of being “schema-free”: the user does not provide a schema, though Drill infers a schema from actual data. h4. Futures: Schema Metadata The stats solution is still not complete: it cannot address the issue of intent: what does the JSON actually represent? It can only deal with the entire body of JSON at the syntax level. The next level of solution allows the user to specify their intent. That is, to specify rules for mapping their application-specific JSON encoding into the relational model used by Drill. Many of the difficulties in JSON come from ambiguities that limit Drill's ability to map JSON into a fixed set of tabular columns. One straightforward solution is to supply a schema (which may include mapping rules) that resolve the ambiguities for each application. Advantages of a defined schema: * Code generation vs. interpretation — performance * Avoid ambiguities — fewer failures The metadata might include simple information such as: * The declared type of each column. (Drill uses that type even if values are null or missing, eliminating the need to guess.) * Columns to be ignored in a `SELECT *` query. (Such as internal metadata columns.) The metadata layer would be much more powerful if it also included mapping rules for cases such as: * Identify that an array is represents a tuple, and provide column names and types for each array element. * Identify that a map structure is used to encode the name and type of a value, along with rules to map that value to a Drill column. An example of the first case above: {noformat} Metadata: Array as a row: [cust_id INT, name: VARCHAR….] Data: [101, “fred”, …] {noformat} Example of the second case: {noformat} Mapping rule: for key {“$type:” type, “$value”: value}, Create a column named of type with value . Data: cust_id {“$type”: “INT”, value: 101} {noformat} The need for the mapping (and the required set of features) will be made clearer by gathering many samples of JSON “in the wild” to learn how application developers actually encode their data into JSON. Possible sources of a schema: * Hive * Schema file stored with the data * Metadata system as in Impala * Custom mapping code that is a "data format" plugin on top of the JSON format plugin Drill uses views for this role today. But, views introduce a chicken-and-egg problem and cannot be the (whole) solution for JSON. Views are
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305699#comment-16305699 ] Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:57 PM: -- h4. Solutions Drill's JSON support is mature; it is used successfully by a wide range of users. Presumably, the issues discussed above are only theoretical; practical users do not encounter them. (If they did, we'd see many more open bugs in this area.) Further, though JSON started as Drill's reference data source, real-world experience has shown that Parquet has assumed that role; JSON is not used only lightly. Hence, there may little interest or value in addressing the above issues. Still, it is worth presenting some options if it ever becomes necessary to improve Drill's JSON support. h4. Futures: Stats Drill can address its inability to predict the future by converting the problem into one of learning from the past. Specifically, Drill can correctly pick the type of each column if Drill were to scan the entire collection of JSON files before executing the first query, and recording the inferred type of each column. Drill already has (a prototype of) a statistics (stats) collection mechanism. Drill could extend this mechanism to scan the full set of JSON files. For each column, implement inference rules such as: * Record the data types seen for each column. If a column is mostly nulls (or missing), even a single occurrence, even within terabytes of files, would be enough to tell Drill the type. * When a column has a conflicting scalar type (both string and number, say), record the “most permissive” type. That is, do an automatic per-column “text mode” for scalar conflicts. (Or, for numbers, simply pick FLOAT8 if even a single value has a decimal point.) * For numbers, use the value of the largest number to infer type. (INT, BITINT, FLOAT8 or DECIMAL). * If a column has both primitive and structured values (both a string and map, say), then mark the column as needed an “enhanced” text mode: read the complex type as a JSON string. The above simply extends existing schema inference beyond the first row of each file to the entire collection of files. Clearly there is a cost (the need to scan the entire set of JSON files, and keep the stats up to date via repeated scans.) The solution is not perfect, but may reduce problems by some percentage, say 80-90%. This solution has the advantage of living up to Drill’s core promise of being “schema-free”: the user does not provide a schema, though Drill infers a schema from actual data. h4. Futures: Schema Metadata The stats solution is still not complete: it cannot address the issue of intent: what does the JSON actually represent? It can only deal with the entire body of JSON at the syntax level. The next level of solution allows the user to specify their intent. That is, to specify rules for mapping their application-specific JSON encoding into the relational model used by Drill. Many of the difficulties in JSON come from ambiguities that limit Drill's ability to map JSON into a fixed set of tabular columns. One straightforward solution is to supply a schema (which may include mapping rules) that resolve the ambiguities for each application. Advantages of a defined schema: * Code generation vs. interpretation — performance * Avoid ambiguities — fewer failures The metadata might include simple information such as: * The declared type of each column. (Drill uses that type even if values are null or missing, eliminating the need to guess.) * Columns to be ignored in a `SELECT *` query. (Such as internal metadata columns.) The metadata layer would be much more powerful if it also included mapping rules for cases such as: * Identify that an array is represents a tuple, and provide column names and types for each array element. * Identify that a map structure is used to encode the name and type of a value, along with rules to map that value to a Drill column. An example of the first case above: {noformat} Metadata: Array as a row: [cust_id INT, name: VARCHAR….] Data: [101, “fred”, …] {noformat} Example of the second case: {noformat} Mapping rule: for key {“$type:” type, “$value”: value}, Create a column named of type with value . Data: cust_id {“$type”: “INT”, value: 101} {noformat} The need for the mapping (and the required set of features) will be made clearer by gathering many samples of JSON “in the wild” to learn how application developers actually encode their data into JSON. Possible sources of a schema: * Hive * Schema file stored with the data * Metadata system as in Impala * Custom mapping code that is a "data format" plugin on top of the JSON format plugin Drill uses views for this role today. But, views introduce a chicken-and-egg problem and cannot be the (whole) solution for JSON. Views are
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305689#comment-16305689 ] Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:42 PM: -- h4. JSON Translation in Drill The goal of the JSON reader is to load JSON data into Drill vectors. The challenge, as shown above, is that the two data models are different. The problems are two-fold: * There exist many perfectly valid ways to translate relational data into JSON. * There exist no universal ways to translate arbitrary JSON into a relational model. That is, JSON can represent any relational table, and do so in a variety of ways. There is a one (table) to many (JSON formats) relationship. Further, there are many JSON structures that do not correspond to tables. h4. Lack of JSON Support in JDBC and ODBC The issue is further complicated by the fact that ODBC (Tableau) and JDBC are Drill’s primary client interfaces. These formats do not readily support non-tabular data. Thus, not only just Drill successfully consume JSON DAGs, Drill must then convert these structures into simple tables to be consumed by BI tools. (That is, Drill is not a document-oriented query engine, it is a classic tabular query engine.) h4. Many-to-Many Mappings from JSON to Tables The challenge of the JSON reader is to convert arbitrary JSON into a relational model, and to do so with no (or very little) information beyond a list of projected columns and the JSON file itself. As we will see, the problem is fundamentally not solvable: Drill has too little information to correctly map from the many possible (and often conflicting) JSON formats into the proper relational format. Consider a trivial example. The following is a perfectly legal representation of a Customer in JSON: {code} [101, “Fred”, “Bedrock”, 123.45, “10-11-12”] {code} Applications sometimes use the above format to conserve space. It works because the writer and reader agree on the meaning of each array entry. How is Drill to interpret the above. More to the point, how is Drill to interpret the above *without a schema*? Said another way, the above format works because the writer and reader agree on a format, but Drill is designed to work without that information. Clearly, without a schema, it is impossible to understand that the above is a terse representation of a row. Without a schema, all Drill knows is that the above is a heterogeneous array. How is Drill to know that this array has a one-to-one correspondence to columns in a customer record vs. say, an arbitrary array? It can’t. All Drill can do is ask the user to enable “all-text mode”, read the array as text, then use SQL to project the array entries correctly: {code} SELECT CAST(cust[0] AS INTEGER) AS cust_id, cust[1] AS cust_name, cust[2] AS city, CAST(cust[3] AS FLOAT8) AS balance, TO_DATE(cust[4], ‘yy-mm-dd’) as start_date FROM … {code} Although this example is contrived, “JSON in the wild” has a wide variety of formats since JSON is a universal format and places no constraints on an application’s data model. h4. Schema Inference Translation of JSON to the relational model is simple when the JSON is designed to exactly fit what Drill expects. Since JSON is Drill’s reference model, Drill attempts to translate all valid JSON into the relational model. This is impossible in the general case. Here are just a few of the complexities: * A run of null values without a non-null value. * Different data types for an attribute name across objects. (That is, JSON does not enforce a schema.) * Different data types (as parsed by Drill) for array elements. (Such as the customer row example above.) * Nulls inside arrays. * “Sophisticated” data models such as those described earlier Drill implements a variety of special rules to handle some of the above in some special cases. These rules were presented in earlier sections. To summarize a few: * All-text mode can overcome different primitive types in an object attribute or list (at the cost of extensive casts in the SQL expression.) But, all-text mode cannot overcome a change from primitive type to object or list. All-text mode applies to all queries within a session, not just to the one column with a conflict. * Using a projection list to avoid projecting a column with conflicting types (but, then the value is unavailable to Drill queries.) A projection list, however, changes the data structure, moving columns nested inside maps to the top level, requiring changes elsewhere in the query to adjust. * “Null-deferral” delays picking a type for a null column until a value is seen, but can’t see across a batch boundary. * “All numbers as Float” handles the case of integer numbers followed by numbers with a decimal point, but is a session option so applies to all queries within a session, not
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16305682#comment-16305682 ] Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:35 PM: -- h4. Background The above sections focus on the actual implementation of JSON in Drill. This section takes a step back to provide broader industry context. h4. The JSON Data Model [JSON is built on … universal data structures|https://www.json.org]. JSON is truly schema-free: it supports any arbitrary (DAG) data model. As shown on the [JSON site|https://www.json.org]. JSON is based on three simple concepts: objects (lists of name/value pairs), lists, and scalar values. JSON is a newer format, but the concept is classic. XML was based on a similar model (but was much more verbose and complex.) XML itself was a rehash of an even older model, the [CODASYL|https://en.wikipedia.org/wiki/CODASYL] model that described hierarchical databases in the 1960s. The power of a universal representation comes at a cost: queries are complex. CODASYL defined a model, as did XML with [XQuery|https://en.wikipedia.org/wiki/XQuery]. In browsers, the [DOM|https://en.wikipedia.org/wiki/Document_Object_Model] provides a programmatic query language for XML trees that define web pages. Each of these query systems is very complex; far too complex for everyday business use. h4. The Relational Data Model The relational model was invented by Edgar Codd as a [reaction to the complexity|http://history-computer.com/ModernComputer/Software/Codd.html] of the prior CODASYL model: bq. Don Chamberlin, … coinventor of SQL, \[said of] Codd's ideas: "...since I'd been studying CODASYL (the language used to query navigational databases), I could imagine how those queries would have been represented in CODASYL by programs that were five pages long, that would navigate through this labyrinth of pointers and stuff. Codd would sort of write them down as one-liners. ... They weren't complicated at all. I said, 'Wow.' This was kind of a conversion experience for me. I understood what the relational thing was about after that." h4. The Drill Data Model Drill is a relational engine based on the SQL language (and hence the relational data model). Drill uses a columnar representation to realize the [basic relational contracts|https://en.wikibooks.org/wiki/Relational_Database_Design/Basic_Concepts]: * Domain (set of values, often expressed as a specific data type) * Column: a (name, domain) “attribute that describe an entity in the database model.” * Row: a complete set of columns * Table: a collection of rows The key point is that relational theory is based on relations (tables) that consist of a set of rows, all of which have the same set of columns (the same schema). In fact, the restriction to a consistent schema is what allows Drill to store data as columns rather than rows. (The columnar model would make little sense for, say, an HTML document in which each element has a different set of attributes.) The relational model is decidedly *not* universal: it is a highly restricted data model chosen because of the expressive power that relational theory provides when data is restricted to a tabular presentation. The complexity of Drill's JSON implementation arises from the idea that Drill can automatically map from JSON's universal format to the relational tabular format. That is an assumption that turns out to be naive. was (Author: paul.rogers): h4. Background The above sections focus on the actual implementation of JSON in Drill. This section takes a step back to provide broader industry context. h4. The JSON Data Model [JSON is built on … universal data structures|https://www.json.org]. JSON is truly schema-free: it supports any arbitrary (DAG) data model. As shown on the [JSON site|https://www.json.org]. JSON is based on three simple concepts: objects (lists of name/value pairs), lists, and scalar values. JSON is a newer format, but the concept is classic. XML was based on a similar model (but was much more verbose and complex.) XML itself was a rehash of an even older model, the [CODASYL|https://en.wikipedia.org/wiki/CODASYL] model that described hierarchical databases in the 1960s. The power of a universal representation comes at a cost: queries are complex. CODASYL defined a model, as did XML with [XQuery|https://en.wikipedia.org/wiki/XQuery]. In browsers, the [DOM|https://en.wikipedia.org/wiki/Document_Object_Model] provides a programmatic query language for XML trees that define web pages. Each of these query systems is very complex; far too complex for everyday business use. h4. The Relational Data Model The relational model was invented by Edgar Codd as a [reaction to the complexity|http://history-computer.com/ModernComputer/Software/Codd.html] of the prior CODASYL model: bq. Don Chamb
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316 ] Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:11 PM: -- h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. * Single-dimensional arrays cannot contain null entries. * Multi-dimensional arrays can contain nulls at the outer level but not the inner level. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays Using array types, the following rules apply to nulls: * An array cannot contain nulls. * An empty array at the start of the file has an unknown type. * An entire array can be null, which is represented as an empty array. (That is, an empty array and a {{null}} value are considered the same.) Thus, the following is valid JSON, but is not supported by Drill: {code} {a: [10, null, 20] } {code} The general rule is that Drill does not support nulls in one-dimensional JSON arrays. (See below for the 2+D case.) h4. "Deferred" Type Definition in Drill 1.13 Drill 1.13 (but likely not Drill 1.12) can handle some ambiguity in arrays: {code} {a: 1} {a: 2, b: null} {a: 3, b: []} {a: 4, b: [10, 20]} {code} Here, the first three records are ambiguous. The first has no entry for {{b}}, the second says that {{b}} is {{null}}, but not the type of {{null}} (In JSON, {{null}} is untyped, while in Drill {{NULL}} must have a type.) In the third record we learn that {{b}} is an array, but do not yet know the type of the array. Finally, in record four we learn that {{b}} is an array of {{BIGINT}}. This mechanism works only if the type is revealed in the first batch. If the type does not appear, Drill must guess a type, will guess array of text-mode. This works within the same file if the array turns out to be, say, integers. But, it will probably cause a schema conflict across files that see the integers within the first batch. (The general rule is that Drill cannot predict the future.) h4. Missing {{LIST}} Support JSON arrays can contain nulls. Drill provides a (partially completed, inoperable) {{LIST}} type as described below that handles nulls. But, this vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types which cannot handle nulls. (The {{LIST}} type is described in a separate note below.) h4. Multi-Dimensional Arrays JSON supports lists (arrays) of arbitrary dimensions. Drill supports this via two types: * A base array in the form of a repeated cardinality of any primitive type or map. (The so-called “Repeated mode.”) * A extended array of base or extended arrays. (The so-called “repeated list.”) Drill supports multi-dimensional arrays of a single type (including maps). Unlike JSON: * Drill arrays cannot hold nulls. * Arrays themselves cannot be null (though they can be empty.) Note that support for lists is via the {{RepeatedListVector}}. Note that this is not simply the {{REPEATED}} form of the {{ListVector}}; it is something else entirely. The {{ListVector}} is not supported at present (see below), but the {{RepeatedListVector}} is. For example: {code} {a: [[10, 20], [30, 40]]} {code} Drill (at least in 1.13) allows nulls in the outer dimension of an array: {code} {a: [[10, 20], null, [30, 40]]} {code} A null is the same as an empty array: {code} {a: [[10, 20], [ ], [30, 40]]} {code} Drill 1.13 allows "ambiguity" in arrays before seeing an actual value: {code} {a: null} {a: [ ]} {a: [ [ ] ]} {a: [[10, 20], null, [30, 40]]} {code} The caveat is that the value must appear within the first batch. Otherwise, the first batch must guess at a type, will guess text mode, and will conflict with later values if they are arrays or maps. Note that Drill *cannot* handle leading nulls before the type is seen. The following is invalid: {code} {a: [null, [10, 20], [30, 40]]} {code} The reason is that Drill does not yet know the type at the time that the null appears, and so does not know whether to treat it as an empty array or as a null scalar (and, of course, nulls are not supported in scalar arrays.) was (Author: paul.rogers): h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. * Single-dimensional arrays cannot contain null entries. * Multi-dimensional arrays can contain nulls at the outer level but not the inner level. (See
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296230#comment-16296230 ] Paul Rogers edited comment on DRILL-6035 at 12/28/17 7:02 PM: -- h4. Lists _*NOTE:* This section describes the Drill {{LIST}} type which turns out to be broken and not supported. The following is based on a prototype using the {{LIST}} type created after fixing some, but not all, {{LIST}} bugs._ JSON supports arrays of the form: {code} {a: ["I'm", "an", "array"] } {code} Drill has two very different ways to represent arrays: 1. As a {{REPEATED}} cardinality for most data types. This gives rise to a {{RepeatedFooVector}} for some type {{Foo}}. 2. As a {{LIST}} type with the {{ListVector}} implementation. Here, Arrow has done a nice job. Arrow unified the {{REPEATED}} cardinality and the {{LIST}} vector type into a single concept. Drill, however, still has two systems. h4. Repeated Cardinality Drill's "go to" way to handle arrays is with the {{REPEATED}} cardinality (AKA "repeated data mode.") Most readers that handle arrays use the {{REPEATED}} form. To help understand the {{LIST}} type, we review {{REPEATED}} support here. When working with a {{REPEATED} column, the rules for nulls are: * Arrays may not contain nulls. (Drill does not support nulls as array elements.) * A null (or missing) array field is treated the same as an empty array. If JSON were to use the {{REPEATED}} vectors, the following would be invalid: {code} [10, null, 20] {code} The following are all valid with {{REPEATED}} vectors: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} h4. Properties of Lists The key properties of a List relative to a repeated type are: * Each list can be a list of nothing, a list of a single type, or a list of a union of multiple types. * Each list value can be null. * Each list entry can be null (for primitive types.) h4. Null Support As explained below, lists support three kinds of nullability: * The type itself can be null (a list of nulls) * The list column value for a row can be null. (This is in contrast to repeated types in which an array can be empty, but the entire array value cannot be null.) * When the list is of a primitive type, entries can be null. (The list is defined as list of nullable items, such as nullable BIGINT.) * When the list is of maps, the map entries *cannot* be null. (Instead, the map columns are nullable and all columns for the "null" map are set to null.) * When the list is of other lists, the list entry *cannot* be null. (Instead, the nested list is empty.) The semantics are a bit confusing when seen from the outside. They make slightly more sense based on the implementation choices made in the code. (Though, generally we want the code to match our requirements, not the other way around.) h4. Late Type Identification As described earlier, Drill 1.13 will defer picking an array type if it sees null values. For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{BIGINT}}. h4. Null-Only Arrays A special case occurs if a JSON file contains only empty arrays or arrays of nulls (such as a file that contains only the first three records above.) In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the List section below.) It seems that {{SqlLine}} will correctly show the null values. An interesting case occurs when Drill reads two files: one with an array with only nulls, another with real values. For example: {noformat} File A: {a: [null, null] } File B: {a: [10, 20] } {noformat} (The above condition can occur only if JSON uses the broken {{LIST}} type; it cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A contains: {noformat} {a: []} {noformat} Drill is distributed: one fragment will read File A, another will read File B. At some point, the two arrays will come together. One fragment will have created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will trigger a schema change error in this case. Interestingly, however, if the query is a simple {{SELECT *}}, then the lists are compatible and {{SqlLine}} will display the correct results. In Drill 1.13, if the first batch contains only nulls and/or empty arrays, Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a guess, a schema change will result if the guess is wrong. h4. Lists are Obscure The {{LIST}} type appears to be used only for JSON, and it is unclear how well supported it is in the rest of Drill. For example, it is not clear that functions that work with arrays correctly ha
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316 ] Paul Rogers edited comment on DRILL-6035 at 12/28/17 6:42 PM: -- h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. * Single-dimensional arrays cannot contain null entries. * Multi-dimensional arrays can contain nulls at the outer level but not the inner level. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays The general rule is that Drill does not support nulls in JSON arrays. However, there are many odd cases. If the JSON array is 2+ dimensional, then Drill supports nulls in outer dimensions but not inner ones. That is, the following is legal: {code} {a: [["null", "is"], null, ["allowed"]]} {code} Here, {{null}} behaves the same as an empty array: {{[ ]}}. Note, however, that the above is *not* valid if the nulls come before the first non-null value: {code} {a: [null, ["not", "allowed"]]} {code} The reason is that, in the previous example, Drill could determine that the outer list is a list of string arrays. But, in the second case, Drill cannot tell the array type when it sees the {{null}}: it might be an array of scalars, so {{null}} is not allowed. That said, the following *is* allowed: {code} {a: [["null", "is"], null, ["allowed"]]} {a: [null, ["is", "allowed"]]} {code} Here, the first array showed the type; Drill remembers that type for the second array. h4. Missing {{LIST}} Support JSON arrays can contain nulls. Drill provides a (partially completed, inoperable) {{LIST}} type as described below that handles nulls. But, this vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types which cannot handle nulls. (The {{LIST}} type is described in a separate note below.) Using array types, the following rules apply to nulls: * An array cannot contain nulls. * An empty array at the start of the file has an unknown type. (Do we select Nullable {{INT}}?) * An entire array can be null, which is represented as an empty array. (That is, an empty array and a {{null}} value are considered the same.) h4. Late Type Identification As described earlier, Drill 1.13 will defer picking an array type if it sees null values. For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{BIGINT}}. h4. Null-Only Arrays A special case occurs if a JSON file contains only empty arrays or arrays of nulls (such as a file that contains only the first three records above.) In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the List section below.) It seems that {{SqlLine}} will correctly show the null values. An interesting case occurs when Drill reads two files: one with an array with only nulls, another with real values. For example: {noformat} File A: {a: [null, null] } File B: {a: [10, 20] } {noformat} (The above condition can occur only if JSON uses the broken {{LIST}} type; it cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A contains: {noformat} {a: []} {noformat} Drill is distributed: one fragment will read File A, another will read File B. At some point, the two arrays will come together. One fragment will have created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will trigger a schema change error in this case. Interestingly, however, if the query is a simple {{SELECT *}}, then the lists are compatible and {{SqlLine}} will display the correct results. In Drill 1.13, if the first batch contains only nulls and/or empty arrays, Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a guess, a schema change will result if the guess is wrong. h4. Multi-Dimensional Arrays Drill supports multi-dimensional arrays of a single type (including maps). For example: {code} {a: [[10, 20], [30, 40]]} {code} Drill (at least in 1.13) allows nulls in the outer dimension of an array: {code} {a: [[10, 20], null, [30, 40]]} {code} A null is the same as an empty array: {code} {a: [[10, 20], [ ], [30, 40]]} {code} Drill 1.13 allows "ambiguity" in arrays before seeing an actual value: {code} {a: null} {a: [ ]} {a: [ [ ] ]} {a: [[10, 20], null, [30, 4
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316 ] Paul Rogers edited comment on DRILL-6035 at 12/27/17 7:06 AM: -- h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. * Single-dimensional arrays cannot contain null entries. * Two-dimensional arrays can contain nulls at the outer level but not the inner level. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays The general rule is that Drill does not support nulls in JSON arrays. However, there are many odd cases. If the JSON array is 2+ dimensional, then Drill supports nulls in outer dimensions but not inner ones. That is, the following is legal: {code} {a: [["null", "is"], null, ["allowed"]]} {code} Here, {{null}} behaves the same as an empty array: {{[ ]}}. Note, however, that the above is *not* valid if the nulls come before the first non-null value: {code} {a: [null, ["not", "allowed"]]} {code} The reason is that, in the previous example, Drill could determine that the outer list is a list of string arrays. But, in the second case, Drill cannot tell the array type when it sees the {{null}}: it might be an array of scalars, so {{null}} is not allowed. That said, the following *is* allowed: {code} {a: [["null", "is"], null, ["allowed"]]} {a: [null, ["is", "allowed"]]} {code} Here, the first array showed the type; Drill remembers that type for the second array. h4. Missing {{LIST}} Support JSON arrays can contain nulls. Drill provides a (partially completed, inoperable) {{LIST}} type as described below that handles nulls. But, this vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types which cannot handle nulls. (The {{LIST}} type is described in a separate note below.) Using array types, the following rules apply to nulls: * An array cannot contain nulls. * An empty array at the start of the file has an unknown type. (Do we select Nullable {{INT}}?) * An entire array can be null, which is represented as an empty array. (That is, an empty array and a {{null}} value are considered the same.) h4. Late Type Identification As described earlier, Drill 1.13 will defer picking an array type if it sees null values. For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{BIGINT}}. h4. Null-Only Arrays A special case occurs if a JSON file contains only empty arrays or arrays of nulls (such as a file that contains only the first three records above.) In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the List section below.) It seems that {{SqlLine}} will correctly show the null values. An interesting case occurs when Drill reads two files: one with an array with only nulls, another with real values. For example: {noformat} File A: {a: [null, null] } File B: {a: [10, 20] } {noformat} (The above condition can occur only if JSON uses the broken {{LIST}} type; it cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A contains: {noformat} {a: []} {noformat} Drill is distributed: one fragment will read File A, another will read File B. At some point, the two arrays will come together. One fragment will have created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will trigger a schema change error in this case. Interestingly, however, if the query is a simple {{SELECT *}}, then the lists are compatible and {{SqlLine}} will display the correct results. In Drill 1.13, if the first batch contains only nulls and/or empty arrays, Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a guess, a schema change will result if the guess is wrong. was (Author: paul.rogers): h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. * Single-dimensional arrays cannot contain null entries. * Two-dimensional arrays can contain nulls at the outer level but not the inner level. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {co
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316 ] Paul Rogers edited comment on DRILL-6035 at 12/27/17 7:04 AM: -- h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. * Single-dimensional arrays cannot contain null entries. * Two-dimensional arrays can contain nulls at the outer level but not the inner level. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays The general rule is that Drill does not support nulls in JSON arrays. However, there are many odd cases. If the JSON array is 2+ dimensional, then Drill supports nulls in outer dimensions but not inner ones. That is, the following is legal: {code} {a: [["null", "is"], null, ["allowed"]]} {code} Here, {{null}} behaves the same as an empty array: {{[ ]}}. Note, however, that the above is *not* valid if the nulls come before the first non-null value: {code} {a: [null, ["not", "allowed"]]} {code} The reason is that, in the previous example, Drill could determine that the outer list is a list of string arrays. But, in the second case, Drill cannot tell the array type when it sees the {{null}}: it might be an array of scalars, so {{null}} is not allowed. h4. Missing {{LIST}} Support JSON arrays can contain nulls. Drill provides a (partially completed, inoperable) {{LIST}} type as described below that handles nulls. But, this vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types which cannot handle nulls. (The {{LIST}} type is described in a separate note below.) Using array types, the following rules apply to nulls: * An array cannot contain nulls. * An empty array at the start of the file has an unknown type. (Do we select Nullable {{INT}}?) * An entire array can be null, which is represented as an empty array. (That is, an empty array and a {{null}} value are considered the same.) h4. Late Type Identification As described earlier, Drill 1.13 will defer picking an array type if it sees null values. For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{BIGINT}}. h4. Null-Only Arrays A special case occurs if a JSON file contains only empty arrays or arrays of nulls (such as a file that contains only the first three records above.) In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the List section below.) It seems that {{SqlLine}} will correctly show the null values. An interesting case occurs when Drill reads two files: one with an array with only nulls, another with real values. For example: {noformat} File A: {a: [null, null] } File B: {a: [10, 20] } {noformat} (The above condition can occur only if JSON uses the broken {{LIST}} type; it cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A contains: {noformat} {a: []} {noformat} Drill is distributed: one fragment will read File A, another will read File B. At some point, the two arrays will come together. One fragment will have created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will trigger a schema change error in this case. Interestingly, however, if the query is a simple {{SELECT *}}, then the lists are compatible and {{SqlLine}} will display the correct results. In Drill 1.13, if the first batch contains only nulls and/or empty arrays, Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a guess, a schema change will result if the guess is wrong. was (Author: paul.rogers): h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. * Single-dimensional arrays cannot contain null entries. * Two-dimensional arrays can contain nulls at the outer level but not the inner level. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316 ] Paul Rogers edited comment on DRILL-6035 at 12/27/17 7:04 AM: -- h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. * Single-dimensional arrays cannot contain null entries. * Two-dimensional arrays can contain nulls at the outer level but not the inner level. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays The general rule is that Drill does not support nulls in JSON arrays. However, there are many odd cases. If the JSON array is 2+ dimensional, then Drill supports nulls in outer dimensions but not inner ones. That is, the following is legal: {code} {a: [["a", "string"], null, ["another"]]} {code} Here, {{null}} behaves the same as an empty array: {{[ ]}}. Note, however, that the above is *not* valid if the nulls come before the first non-null value: {code} {a: [null, ["not", "allowed"]]} {code} The reason is that, in the previous example, Drill could determine that the outer list is a list of string arrays. But, in the second case, Drill cannot tell the array type when it sees the {{null}}: it might be an array of scalars, so {{null}} is not allowed. h4. Missing {{LIST}} Support JSON arrays can contain nulls. Drill provides a (partially completed, inoperable) {{LIST}} type as described below that handles nulls. But, this vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types which cannot handle nulls. (The {{LIST}} type is described in a separate note below.) Using array types, the following rules apply to nulls: * An array cannot contain nulls. * An empty array at the start of the file has an unknown type. (Do we select Nullable {{INT}}?) * An entire array can be null, which is represented as an empty array. (That is, an empty array and a {{null}} value are considered the same.) h4. Late Type Identification As described earlier, Drill 1.13 will defer picking an array type if it sees null values. For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{BIGINT}}. h4. Null-Only Arrays A special case occurs if a JSON file contains only empty arrays or arrays of nulls (such as a file that contains only the first three records above.) In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the List section below.) It seems that {{SqlLine}} will correctly show the null values. An interesting case occurs when Drill reads two files: one with an array with only nulls, another with real values. For example: {noformat} File A: {a: [null, null] } File B: {a: [10, 20] } {noformat} (The above condition can occur only if JSON uses the broken {{LIST}} type; it cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A contains: {noformat} {a: []} {noformat} Drill is distributed: one fragment will read File A, another will read File B. At some point, the two arrays will come together. One fragment will have created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will trigger a schema change error in this case. Interestingly, however, if the query is a simple {{SELECT *}}, then the lists are compatible and {{SqlLine}} will display the correct results. In Drill 1.13, if the first batch contains only nulls and/or empty arrays, Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a guess, a schema change will result if the guess is wrong. was (Author: paul.rogers): h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. * Single-dimensional arrays cannot contain null entries. * Two-dimensional arrays can contain nulls at the outer level but not the inner level. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296230#comment-16296230 ] Paul Rogers edited comment on DRILL-6035 at 12/20/17 6:41 AM: -- h4. Lists _*NOTE:* This section describes the Drill {{LIST}} type which turns out to be broken and not supported. The following is based on a prototype using the {{LIST}} type created after fixing some, but not all, {{LIST}} bugs._ JSON supports arrays of the form: {code} {a: ["I'm", "an", "array"] } {code} Drill has two very different ways to represent arrays: 1. As a {{REPEATED}} cardinality for most data types. This gives rise to a {{RepeatedFooVector}} for some type {{Foo}}. 2. As a {{LIST}} type with the {{ListVector}} implementation. Here, Arrow has done a nice job. Arrow unified the {{REPEATED}} cardinality and the {{LIST}} vector type into a single concept. Drill, however, still has two systems. h4. Repeated Cardinality Drill's "go to" way to handle arrays is with the {{REPEATED}} cardinality (AKA "repeated data mode.") Most readers that handle arrays use the {{REPEATED}} form. To help understand the {{LIST}} type, we review {{REPEATED}} support here. When working with a {{REPEATED} column, the rules for nulls are: * Arrays may not contain nulls. (Drill does not support nulls as array elements.) * A null (or missing) array field is treated the same as an empty array. If JSON were to use the {{REPEATED}} vectors, the following would be invalid: {code} [10, null, 20] {code} The following are all valid with {{REPEATED}} vectors: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} h4. Properties of Lists The key properties of a List relative to a repeated type are: * Each list can be a list of nothing, a list of a single type, or a list of a union of multiple types. * Each list value can be null. * Each list entry can be null (for primitive types.) h4. Null Support As explained below, lists support three kinds of nullability: * The type itself can be null (a list of nulls) * The list column value for a row can be null. (This is in contrast to repeated types in which an array can be empty, but the entire array value cannot be null.) * When the list is of a primitive type, entries can be null. (The list is defined as list of nullable items, such as nullable BIGINT.) * When the list is of maps, the map entries *cannot* be null. (Instead, the map columns are nullable and all columns for the "null" map are set to null.) * When the list is of other lists, the list entry *cannot* be null. (Instead, the nested list is empty.) The semantics are a bit confusing when seen from the outside. They make slightly more sense based on the implementation choices made in the code. (Though, generally we want the code to match our requirements, not the other way around.) h4. Lists are Obscure The {{LIST}} type appears to be used only for JSON, and it is unclear how well supported it is in the rest of Drill. For example, it is not clear that functions that work with arrays correctly handle null entries. (This needs to be tested.) JDBC supports array columns, but it is not clear if the Drill JDBC driver has implemented them. ODBC doesn't support arrays at all, so whether it supports arrays with nulls is a moot point. h4. Lists in JSON The {{LIST}} type appears to be used only for JSON where it is a better fit for JSON semantics than Drill's normal {{REPEATED}} cardinality. The list type allows list members to be null. All of the following are legal using lists: {code} {a: null} {a: []} {a: [null, null]} {a: [null, 10, null]} {a: [10, "foo"]} {code} We'll look at each of these in detail. h4. Degenerate Lists Consider the simplest possible list in JSON: a file that contains only an empty list: {noformat} {a: []} {noformat} What is the type of the list? In JSON, lists have no type, they are just lists. Drill requires a type, however when working with a {{REPEATED}} cardinality: the column must be an array of something. Lists, however, can be a list of only nulls using the obscure {{LATE}} data type. That is, the list exists, but has no type. ({{LATE}} seems to suggest that the type will be assigned later.) Next, consider another degenerate array: {noformat} {a: [null, null]} {noformat} Here we have an array of nulls. Again, we don't know what type these are a null of. Again, a LIST allows the JSON reader to produce a row with a single column {{`a`}} that is of type {{LIST}} that contains only the "dummy" {{LATE}} type. The list will indicate that we have two entries, both of which are null. It is unclear, however, if the rest of Drill supports this concept. (DRILL-5970 discusses a case in which an empty array, with a List of {{LATE}}, is exported to Parquet, producing results different than one might naively expect.) h4. Single-type L
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316 ] Paul Rogers edited comment on DRILL-6035 at 12/20/17 6:39 AM: -- h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. * Single-dimensional arrays cannot contain null entries. * Two-dimensional arrays can contain nulls at the outer level but not the inner level. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays h4. Missing {{LIST}} Support JSON arrays can contain nulls. Drill provides a (partially completed, inoperable) {{LIST}} type as described below that handles nulls. But, this vector is not used in Drill 1.12 or earlier. Instead, Drill uses repeated types which cannot handle nulls. (The {{LIST}} type is described in a separate note below.) Using array types, the following rules apply to nulls: * An array cannot contain nulls. * An empty array at the start of the file has an unknown type. (Do we select Nullable {{INT}}?) * An entire array can be null, which is represented as an empty array. (That is, an empty array and a {{null}} value are considered the same.) h4. Late Type Identification As described earlier, Drill 1.13 will defer picking an array type if it sees null values. For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{BIGINT}}. h4. Null-Only Arrays A special case occurs if a JSON file contains only empty arrays or arrays of nulls (such as a file that contains only the first three records above.) In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the List section below.) It seems that {{SqlLine}} will correctly show the null values. An interesting case occurs when Drill reads two files: one with an array with only nulls, another with real values. For example: {noformat} File A: {a: [null, null] } File B: {a: [10, 20] } {noformat} (The above condition can occur only if JSON uses the broken {{LIST}} type; it cannot occur in Drill 1.12. In 1.12, the equivalent condition is if File A contains: {noformat} {a: []} {noformat} Drill is distributed: one fragment will read File A, another will read File B. At some point, the two arrays will come together. One fragment will have created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will trigger a schema change error in this case. Interestingly, however, if the query is a simple {{SELECT *}}, then the lists are compatible and {{SqlLine}} will display the correct results. In Drill 1.13, if the first batch contains only nulls and/or empty arrays, Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a guess, a schema change will result if the guess is wrong. was (Author: paul.rogers): h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays Drill handles nulls in arrays using the {{LIST}} type, described in a separate note below. h4. Late Type Identification As described earlier, Drill will defer picking an array type if it sees null values. For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{BIGINT}}. h4. Null-Only Arrays A special case occurs if a JSON file contains only empty arrays or arrays of nulls (such as a file that contains only the first three records above.) In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the List section below.) It seems that {{SqlLine}} will correctly show the null values. An interesting case occurs when Dri
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316 ] Paul Rogers edited comment on DRILL-6035 at 12/19/17 5:42 AM: -- h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays Drill handles nulls in arrays using the {{LIST}} type, described in a separate note below. h4. Late Type Identification As described earlier, Drill will defer picking an array type if it sees null values. For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{BIGINT}}. h4. Null-Only Arrays A special case occurs if a JSON file contains only empty arrays or arrays of nulls (such as a file that contains only the first three records above.) In Drill 1.12 and earlier, the result is a list of {{LATE}} elements (See the List section below.) It seems that {{SqlLine}} will correctly show the null values. An interesting case occurs when Drill reads two files: one with an array with only nulls, another with real values. For example: {noformat} File A: {a: [null, null] } File B: {a: [10, 20] } {noformat} Drill is distributed: one fragment will read File A, another will read File B. At some point, the two arrays will come together. One fragment will have created a list of {{LATE}}, another a list of {{BIGINT}}. Most operators will trigger a schema change error in this case. Interestingly, however, if the query is a simple {{SELECT *}}, then the lists are compatible and {{SqlLine}} will display the correct results. In Drill 1.13, if the first batch contains only nulls and/or empty arrays, Drill guesses that the type is an array of {{VARCHAR}}. Since this is only a guess, a schema change will result if the guess is wrong. was (Author: paul.rogers): h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays Drill handles nulls in arrays using the {{LIST}} type, described in a separate note below. h4. Late Type Identification As described earlier, Drill will defer picking an array type if it sees null values. For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{BIGINT}}. > Specify Drill's JSON behavior > - > > Key: DRILL-6035 > URL: https://issues.apache.org/jira/browse/DRILL-6035 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Pritesh Maker > > Drill supports JSON as its native data format. However, experience suggests > that Drill may have limitations in the JSON that Drill supports. This ticket > asks to clarify Drill's expected behavior on various kinds of JSON. > Topics to be addressed: > * Relational vs. non-relational structures > * JSON structures used in practice and how they map to Drill > * Support for varying data types > * Support for missing values, especially across files > These topics are complex, hence the request to provide a detailed > specifications that clarifies what Drill does and does not support (or what > is should and should not support.) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316 ] Paul Rogers edited comment on DRILL-6035 at 12/19/17 5:33 AM: -- h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays Drill handles nulls in arrays using the {{LIST}} type, described in a separate note below. h4. Late Type Identification As described earlier, Drill will defer picking an array type if it sees null values. For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{BIGINT}}. was (Author: paul.rogers): h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays Drill handles nulls in arrays using the {{LIST}} type, described in a separate note below. > Specify Drill's JSON behavior > - > > Key: DRILL-6035 > URL: https://issues.apache.org/jira/browse/DRILL-6035 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Pritesh Maker > > Drill supports JSON as its native data format. However, experience suggests > that Drill may have limitations in the JSON that Drill supports. This ticket > asks to clarify Drill's expected behavior on various kinds of JSON. > Topics to be addressed: > * Relational vs. non-relational structures > * JSON structures used in practice and how they map to Drill > * Support for varying data types > * Support for missing values, especially across files > These topics are complex, hence the request to provide a detailed > specifications that clarifies what Drill does and does not support (or what > is should and should not support.) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296230#comment-16296230 ] Paul Rogers edited comment on DRILL-6035 at 12/19/17 5:30 AM: -- h4. Lists JSON supports arrays of the form: {code} {a: ["I'm", "an", "array"] } {code} Drill has two very different ways to represent arrays: 1. As a {{REPEATED}} cardinality for most data types. This gives rise to a {{RepeatedFooVector}} for some type {{Foo}}. 2. As a {{LIST}} type with the {{ListVector}} implementation. Here, Arrow has done a nice job. Arrow unified the {{REPEATED}} cardinality and the {{LIST}} vector type into a single concept. Drill, however, still has two systems. h4. Repeated Cardinality Drill's "go to" way to handle arrays is with the {{REPEATED}} cardinality (AKA "repeated data mode.") Most readers that handle arrays use the {{REPEATED}} form. To help understand the {{LIST}} type, we review {{REPEATED}} support here. When working with a {{REPEATED} column, the rules for nulls are: * Arrays may not contain nulls. (Drill does not support nulls as array elements.) * A null (or missing) array field is treated the same as an empty array. If JSON were to use the {{REPEATED}} vectors, the following would be invalid: {code} [10, null, 20] {code} The following are all valid with {{REPEATED}} vectors: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} h4. Properties of Lists The key properties of a List relative to a repeated type are: * Each list can be a list of nothing, a list of a single type, or a list of a union of multiple types. * Each list value can be null. * Each list entry can be null (for primitive types.) h4. Null Support As explained below, lists support three kinds of nullability: * The type itself can be null (a list of nulls) * The list column value for a row can be null. (This is in contrast to repeated types in which an array can be empty, but the entire array value cannot be null.) * When the list is of a primitive type, entries can be null. (The list is defined as list of nullable items, such as nullable BIGINT.) * When the list is of maps, the map entries *cannot* be null. (Instead, the map columns are nullable and all columns for the "null" map are set to null.) * When the list is of other lists, the list entry *cannot* be null. (Instead, the nested list is empty.) The semantics are a bit confusing when seen from the outside. They make slightly more sense based on the implementation choices made in the code. (Though, generally we want the code to match our requirements, not the other way around.) h4. Lists are Obscure The {{LIST}} type appears to be used only for JSON, and it is unclear how well supported it is in the rest of Drill. For example, it is not clear that functions that work with arrays correctly handle null entries. (This needs to be tested.) JDBC supports array columns, but it is not clear if the Drill JDBC driver has implemented them. ODBC doesn't support arrays at all, so whether it supports arrays with nulls is a moot point. h4. Lists in JSON The {{LIST}} type appears to be used only for JSON where it is a better fit for JSON semantics than Drill's normal {{REPEATED}} cardinality. The list type allows list members to be null. All of the following are legal using lists: {code} {a: null} {a: []} {a: [null, null]} {a: [null, 10, null]} {a: [10, "foo"]} {code} We'll look at each of these in detail. h4. Degenerate Lists Consider the simplest possible list in JSON: a file that contains only an empty list: {noformat} {a: []} {noformat} What is the type of the list? In JSON, lists have no type, they are just lists. Drill requires a type, however when working with a {{REPEATED}} cardinality: the column must be an array of something. Lists, however, can be a list of only nulls using the obscure {{LATE}} data type. That is, the list exists, but has no type. ({{LATE}} seems to suggest that the type will be assigned later.) Next, consider another degenerate array: {noformat} {a: [null, null]} {noformat} Here we have an array of nulls. Again, we don't know what type these are a null of. Again, a LIST allows the JSON reader to produce a row with a single column {{`a`}} that is of type {{LIST}} that contains only the "dummy" {{LATE}} type. The list will indicate that we have two entries, both of which are null. It is unclear, however, if the rest of Drill supports this concept. (DRILL-5970 discusses a case in which an empty array, with a List of {{LATE}}, is exported to Parquet, producing results different than one might naively expect.) h4. Single-type Lists The typical JSON use case is create lists with a single type, optionally containing nulls: {noformat} {a: [10, null, 20] } {noformat} In this case, the List operates in its' second mode: as a list of a single type,
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316 ] Paul Rogers edited comment on DRILL-6035 at 12/19/17 5:25 AM: -- h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls in Arrays Drill handles nulls in arrays using the {{LIST}} type, described in a separate note below. was (Author: paul.rogers): h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls with Arrays Rules for nulls are: * Arrays may not contain nulls. (Drill does not support nulls as array elements.) * A null (or missing) array field is treated the same as an empty array. The following is invalid: {code} [10, null, 20] {code} The following are all valid: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} As described, Drill will defer picking an array type if it sees null values. In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{Repeated BIGINT}}. (This is the behavior for Drill 1.13, earlier versions may differ and require investigation.) As usual, if the first file or batch contains only nulls, Drill will guess {{Nullable VARCHAR}} which will cause a schema change error if later records reveal the type to be an array (of any type.) If the first batch contains only nulls and/or empty arrays, Drill guesses that the type is {{Repeated VARCHAR}}. (Again, this is specific to Drill 1.13.) For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {code} > Specify Drill's JSON behavior > - > > Key: DRILL-6035 > URL: https://issues.apache.org/jira/browse/DRILL-6035 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Pritesh Maker > > Drill supports JSON as its native data format. However, experience suggests > that Drill may have limitations in the JSON that Drill supports. This ticket > asks to clarify Drill's expected behavior on various kinds of JSON. > Topics to be addressed: > * Relational vs. non-relational structures > * JSON structures used in practice and how they map to Drill > * Support for varying data types > * Support for missing values, especially across files > These topics are complex, hence the request to provide a detailed > specifications that clarifies what Drill does and does not support (or what > is should and should not support.) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293316#comment-16293316 ] Paul Rogers edited comment on DRILL-6035 at 12/16/17 4:08 AM: -- h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {a: [10, 12.5]} // Conflicting types: integer and float {code} h4. Nulls with Arrays Rules for nulls are: * Arrays may not contain nulls. (Drill does not support nulls as array elements.) * A null (or missing) array field is treated the same as an empty array. The following is invalid: {code} [10, null, 20] {code} The following are all valid: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} As described, Drill will defer picking an array type if it sees null values. In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{Repeated BIGINT}}. (This is the behavior for Drill 1.13, earlier versions may differ and require investigation.) As usual, if the first file or batch contains only nulls, Drill will guess {{Nullable VARCHAR}} which will cause a schema change error if later records reveal the type to be an array (of any type.) If the first batch contains only nulls and/or empty arrays, Drill guesses that the type is {{Repeated VARCHAR}}. (Again, this is specific to Drill 1.13.) For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {code} was (Author: paul.rogers): h4. JSON Arrays Drill supports simple arrays in JSON using the following rules: * Arrays must contain hetrogeneous elements: any of the scalars described above, or a JSON object. (See a later comment for nested arrays.) For example, the following are scalar arrays: {code} [10, 20] [10.30, 10.45] ["foo", "bar"] [true, false] {code} h4. Schema Change in Arrays The following will trigger errors: {code} {a: [10, "foo"]} // Mixed types {a: [10]} {a: ["foo"]} // Schema change {code} h4. Nulls with Arrays Rules for nulls are: * Arrays may not contain nulls. (Drill does not support nulls as array elements.) * A null (or missing) array field is treated the same as an empty array. The following is invalid: {code} [10, null, 20] {code} The following are all valid: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {id: 4, a: [10, 20, 30]} {code} As described, Drill will defer picking an array type if it sees null values. In the above example, for id=2, Drill sees column `a` but does not pick a type. For id=3, Drill identifies that `a` is an array, but does not know the type. Finally, for id=4, Drill identifies the array as {{Repeated BIGINT}}. (This is the behavior for Drill 1.13, earlier versions may differ and require investigation.) As usual, if the first file or batch contains only nulls, Drill will guess {{Nullable VARCHAR}} which will cause a schema change error if later records reveal the type to be an array (of any type.) If the first batch contains only nulls and/or empty arrays, Drill guesses that the type is {{Repeated VARCHAR}}. (Again, this is specific to Drill 1.13.) For example: {code} {id: 1} {id: 2, a: null} {id: 3, a: []} {code} > Specify Drill's JSON behavior > - > > Key: DRILL-6035 > URL: https://issues.apache.org/jira/browse/DRILL-6035 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Pritesh Maker > > Drill supports JSON as its native data format. However, experience suggests > that Drill may have limitations in the JSON that Drill supports. This ticket > asks to clarify Drill's expected behavior on various kinds of JSON. > Topics to be addressed: > * Relational vs. non-relational structures > * JSON structures used in practice and how they map to Drill > * Support for varying data types > * Support for missing values, especially across files > These topics are complex, hence the request to provide a detailed > specifications that clarifies what Drill does and does not support (or what > is should and should not support.) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293557#comment-16293557 ] Paul Rogers edited comment on DRILL-6035 at 12/16/17 4:07 AM: -- h4. All-Text Mode Drill provides the ability to read scalar values as text: {code} ALTER SESSION SET `store.json.all_text_mode` = true {code} In this mode, JSON scalars are read as follows: || JSON Type || As Member Value | As Array Value || | Missing | NULL (VARCHAR) | N/A | | null | NULL (VARCHAR) | String value "null" | | true/false | "true"/"false" | Same | | Number | Number text | Same | | String | The string value (without quotes) | Same | All-text mode can overcome some schema change exceptions such as: * Long string of missing or null values before the first non-null value. * Different scalar types in different records. * Hetrogeneous arrays. * Arrays that contain nulls. (The null values are stored as empty strings.) In Drill 1.13, in all-text mode, missing columns are presumed to be Nullable VARCHAR. (Prior versions may have assumed Nullable INT.) As a result, if file1.json has column `x`, but file2.json does not, then no schema change will occur when combining the results since both files will assume that `x` is a Nullable VARCHAR. (Note that this works only if the query explicitly projects column `x`. It won't necessarily work for queries with the wildcard.) Note that all-text mode cannot overcome schema changes due to mixes of scalar and structured (object or list) types. was (Author: paul.rogers): h4. All-Text Mode Drill provides the ability to read scalar values as text: {code} ALTER SESSION SET `store.json.all_text_mode` = true {code} In this mode, JSON scalars are read as follows: || JSON Type || As Text || | Missing | NULL (VARCHAR) | | null | NULL (VARCHAR) | | true/false | "true"/"false" | | Number | Number text | | String | The string value (without quotes) | All-text mode can overcome some schema change exceptions such as: * Long string of missing or null values before the first non-null value. * Different scalar types in different records. * Hetrogeneous arrays. * Arrays that contain nulls. (The null values are stored as empty strings.) In Drill 1.13, in all-text mode, missing columns are presumed to be Nullable VARCHAR. (Prior versions may have assumed Nullable INT.) As a result, if file1.json has column `x`, but file2.json does not, then no schema change will occur when combining the results since both files will assume that `x` is a Nullable VARCHAR. (Note that this works only if the query explicitly projects column `x`. It won't necessarily work for queries with the wildcard.) Note that all-text mode cannot overcome schema changes due to mixes of scalar and structured (object or list) types. > Specify Drill's JSON behavior > - > > Key: DRILL-6035 > URL: https://issues.apache.org/jira/browse/DRILL-6035 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Pritesh Maker > > Drill supports JSON as its native data format. However, experience suggests > that Drill may have limitations in the JSON that Drill supports. This ticket > asks to clarify Drill's expected behavior on various kinds of JSON. > Topics to be addressed: > * Relational vs. non-relational structures > * JSON structures used in practice and how they map to Drill > * Support for varying data types > * Support for missing values, especially across files > These topics are complex, hence the request to provide a detailed > specifications that clarifies what Drill does and does not support (or what > is should and should not support.) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293265#comment-16293265 ] Paul Rogers edited comment on DRILL-6035 at 12/16/17 2:04 AM: -- h4. JSON Scalars Drill uses simple rules to infer scalar data types from JSON input: * If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: "varchar"}} * If the value is numeric, and has no decimal point, it is a Nullable {{BIGINT}}. Example: {{a: 10}} * If the value is numeric, and has a decimal point, it is a Nullable {{FLOAT8}}. Eample: {{a: 10.5}} * If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: true}} A special case occurs if the user enables {{store.json.all_text_mode}}. In this case, all cases above give rise to a nullable {{VARCHAR}} column type. h4. Nulls with Scalar Values The rules for {{null}} are a bit more complex. The rules below apply to the Drill 1.13 JSON reader, earlier versions require investigation. * If the {{null}} occurs after seeing one of the above types, then the value is a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.) * If the {{null}} occurs at the start of the file, Drill defers identifying the data type until it sees the first non-null value. * If the entire file (or first batch of a file: 4K records before Drill 1.13, variable number in Drill 1.13 or later) contains only {{null}} values, the column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has different behavior.) For example, in the following example, Drill 1.13 identifies the type as nullable {{BIGINT}}: {code} {a: null} {a: null} {a: null} {a: 10} {code} In the following example, Drill must guess that the data type is nullable {{VARCHAR}}: {code} {a: null} {a: null} {a: null} {code} h4. BIGINT vs. FLOAT8 Types JSON has a single numeric type which can hold integers, floats or decimals. Drill attemps to infer the type of the field from the format of the first number encountered for a field using the following rules: * If the number consists entirely of digits, it is an integer, stored as {{BIGINT}}. * Else, it is a floating point number stored as {{FLOAT8}}. Drill 1.13 adds the following rule: * If a number is {{FLOAT8}}, accept an integer, but promote it to {{FLOAT8}}. That is, the following array is stored as {{FLOAT8}}: {code} [1.2, 3, 5.0] {code} Drill is easily fooled, however, rearrange the values and Drill will raise an error: {code} [3, 1.2, 5.0] {code} In the second case, Drill sees 3, guesses {{BIGINT}}, then immediately fails when trying to store the float value 1.2. To work around this problem, a user may set the {{`store.json.read_numbers_as_double`}} system/session property to read all numbers as {{FLOAT8}}, even if they appear to be integers. was (Author: paul.rogers): h4. JSON Scalars Drill uses simple rules to infer scalar data types from JSON input: * If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: "varchar"}} * If the value is numeric, and has no decimal point, it is a Nullable {{BIGINT}}. Example: {{a: 10}} * If the value is numeric, and has a decimal point, it is a Nullable {{FLOAT8}}. Eample: {{a: 10.5}} * If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: true}} A special case occurs if the user enables {{store.json.all_text_mode}}. In this case, all cases above give rise to a nullable {{VARCHAR}} column type. h4. Nulls with Scalar Values The rules for {{null}} are a bit more complex. The rules below apply to the Drill 1.13 JSON reader, earlier versions require investigation. * If the {{null}} occurs after seeing one of the above types, then the value is a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.) * If the {{null}} occurs at the start of the file, Drill defers identifying the data type until it sees the first non-null value. * If the entire file (or first batch of a file: 4K records before Drill 1.13, variable number in Drill 1.13 or later) contains only {{null}} values, the column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has different behavior.) For example, in the following example, Drill 1.13 identifies the type as nullable {{BIGINT}}: {code} {a: null} {a: null} {a: null} {a: 10} {code} In the following example, Drill must guess that the data type is nullable {{VARCHAR}}: {code} {a: null} {a: null} {a: null} {code} h4. BIGINT vs. FLOAT8 Types JSON has a single numeric type which can hold integers, floats or decimals. Drill attemps to infer the type of the field from the format of the first number encountered for a field using the following rules: * If the number consists entirely of digits, it is an integer, stored as
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293265#comment-16293265 ] Paul Rogers edited comment on DRILL-6035 at 12/16/17 2:01 AM: -- h4. JSON Scalars Drill uses simple rules to infer scalar data types from JSON input: * If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: "varchar"}} * If the value is numeric, and has no decimal point, it is a Nullable {{BIGINT}}. Example: {{a: 10}} * If the value is numeric, and has a decimal point, it is a Nullable {{FLOAT8}}. Eample: {{a: 10.5}} * If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: true}} A special case occurs if the user enables {{store.json.all_text_mode}}. In this case, all cases above give rise to a nullable {{VARCHAR}} column type. h4. Nulls with Scalar Values The rules for {{null}} are a bit more complex. The rules below apply to the Drill 1.13 JSON reader, earlier versions require investigation. * If the {{null}} occurs after seeing one of the above types, then the value is a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.) * If the {{null}} occurs at the start of the file, Drill defers identifying the data type until it sees the first non-null value. * If the entire file (or first batch of a file: 4K records before Drill 1.13, variable number in Drill 1.13 or later) contains only {{null}} values, the column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has different behavior.) For example, in the following example, Drill 1.13 identifies the type as nullable {{BIGINT}}: {code} {a: null} {a: null} {a: null} {a: 10} {code} In the following example, Drill must guess that the data type is nullable {{VARCHAR}}: {code} {a: null} {a: null} {a: null} {code} h4. BIGINT vs. FLOAT8 Types JSON has a single numeric type which can hold integers, floats or decimals. Drill attemps to infer the type of the field from the format of the first number encountered for a field using the following rules: * If the number consists entirely of digits, it is an integer, stored as {{BIGINT}}. * Else, it is a floating point number stored as {{FLOAT8}}. Drill 1.13 adds the following rule: * If a number is {{FLOAT8}}, accept an integer, but promote it to {{FLOAT8}}. That is, the following array is stored as {{FLOAT8}}: {code} [1.2, 3, 5.0] {code} Drill is easily fooled, however, rearrange the values and Drill will raise an error: {code} [3, 1.2, 5.0] {code} In the second case, Drill sees 3, guesses {{BIGINT}}, then immediately fails when trying to store the float value 1.2. was (Author: paul.rogers): h4. JSON Scalars Drill uses simple rules to infer scalar data types from JSON input: * If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: "varchar"}} * If the value is numeric, and has no decimal point, it is a Nullable {{BIGINT}}. Example: {{a: 10}} * If the value is numeric, and has a decimal point, it is a Nullable {{FLOAT8}}. Eample: {{a: 10.5}} * If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: true}} A special case occurs if the user enables {{store.json.all_text_mode}}. In this case, all cases above give rise to a nullable {{VARCHAR}} column type. h4. Nulls with Scalar Values The rules for {{null}} are a bit more complex. The rules below apply to the Drill 1.13 JSON reader, earlier versions require investigation. * If the {{null}} occurs after seeing one of the above types, then the value is a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.) * If the {{null}} occurs at the start of the file, Drill defers identifying the data type until it sees the first non-null value. * If the entire file (or first batch of a file: 4K records before Drill 1.13, variable number in Drill 1.13 or later) contains only {{null}} values, the column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has different behavior.) For example, in the following example, Drill 1.13 identifies the type as nullable {{BIGINT}}: {code} {a: null} {a: null} {a: null} {a: 10} {code} In the following example, Drill must guess that the data type is nullable {{VARCHAR}}: {code} {a: null} {a: null} {a: null} {code} > Specify Drill's JSON behavior > - > > Key: DRILL-6035 > URL: https://issues.apache.org/jira/browse/DRILL-6035 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Pritesh Maker > > Drill supports JSON as its native data format. However, experience suggests > that Drill may have limitations in the JSON that Drill supports. This
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293362#comment-16293362 ] Paul Rogers edited comment on DRILL-6035 at 12/16/17 1:19 AM: -- h4. JSON Projection Pushdown The JSON reader supports "projection push-down." The simple rules are simple in concept, but complex in details. The project list comes from the query. In its simplest form, it is the list of columns following the {{SELECT}} keyword: {code} SELECT a, b.c, d[0] FROM ... {code} || Projection || JSON Value of "a" || Drill Result || | `a` | Scalar | Projects `a` | | | Array | Projects all elements of `a` | | | Object | Projects all members of `a` | | | Missing | Creates a {{Nullable INT}} (Drill 1.12) or {{Nullable VARCHAR}} (Drill 1.13) column | | | {{null}} | As above | | `a`.`b` | Scalar | Error (`a` must be an object) | | | Scalar array | Error (`a` must be a map or an array of maps) | | | Object that contains `b` | Projects just `b` from object `a` | | | Object that does not contain `b` | Projects a nullable column `b` within map `a` | | | Object array that contains `b` | Projects just be from the objects within array `a` | | | Object array that does not contain `b` | Projects a nullable column `b` within the array of maps | | | Missing | Projects a map `a` that contains a nullable column `b` | | | {{null}} | As above | | a\[0] | Scalar | Error (`a` must be an array) | | | Scalar array | Projects just `a\[0]` as a scalar (the reader projects the entire array, a project operator pulls out the `a\[0]` element) | | | Object | Error (`a` must e an array) | | | Object array | Projects just object (map) `a\[0]` as described above | | | {{null}} | JSON creates an array of null values, project pulls out `a\[0]` | | | Missing | As above | Notes: * The rules above are for Drill 1.13. Drill 1.12 and earlier is different, and requires investigation. * The rules for null values are suble. The type of the null is inferred from the project list in the case of a map (`a`.`b`) or an array (`a\[0]). Previous sections described null handling for the {{SELECT *}} and {{SELECT `a`}} cases. * The rules for projecting map columns apply to both arrays and single maps. (In Drill 1.12 and earlier, the two cases appear to have behaved differently.) was (Author: paul.rogers): h4. JSON Projection Pushdown The JSON reader supports "projection push-down." The simple rules are simple in concept, but complex in details. The project list comes from the query. In its simplest form, it is the list of columns following the {{SELECT}} keyword: {code} SELECT a, b.c, d[0] FROM ... {code} || Projection || JSON Value of `a` || Drill Result || | `a` | Scalar | Projects `a` | | | Array | Projects all elements of `a` | | | Object | Projects all members of `a` | | | Missing | Creates a {{Nullable INT}} (Drill 1.12) or {{Nullable VARCHAR}} (Drill 1.13) column | | | {{null}} | As above | | `a`.`b` | Scalar | Error (`a` must be an object) | | | Scalar array | Error (`a` must be a map or an array of maps) | | | Object that contains `b` | Projects just `b` from object `a` | | | Object that does not contain `b` | Projects a nullable column `b` within map `a` | | | Object array that contains `b` | Projects just be from the objects within array `a` | | | Object array that does not contain `b` | Projects a nullable column `b` within the array of maps | | | Missing | Projects a map `a` that contains a nullable column `b` | | | {{null}} | As above | | a\[0] | Scalar | Error (`a` must be an array) | | | Scalar array | Projects just `a\[0]` as a scalar (the reader projects the entire array, a project operator pulls out the `a\[0]` element) | | | Object | Error (`a` must e an array) | | | Object array | Projects just object (map) `a\[0]` as described above | | | {{null}} | JSON creates an array of null values, project pulls out `a\[0]` | | | Missing | As above | Notes: * The rules above are for Drill 1.13. Drill 1.12 and earlier is different, and requires investigation. * The rules for null values are suble. The type of the null is inferred from the project list in the case of a map (`a`.`b`) or an array (`a\[0]). Previous sections described null handling for the {{SELECT *}} and {{SELECT `a`}} cases. * The rules for projecting map columns apply to both arrays and single maps. (In Drill 1.12 and earlier, the two cases appear to have behaved differently.) > Specify Drill's JSON behavior > - > > Key: DRILL-6035 > URL: https://issues.apache.org/jira/browse/DRILL-6035 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Pritesh Maker > > Drill supports JSON as its native data format. However, experience suggests > that Drill may have limitation
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293477#comment-16293477 ] Paul Rogers edited comment on DRILL-6035 at 12/16/17 1:07 AM: -- h4. JSON Structure The [JSON standard|https://tools.ietf.org/html/rfc7159] as [described more clearly here|https://www.json.org] states that a JSON document is a single value (null, scalar, object or list). Drill support a non-standard (but common) extension that allows a list of objects. || Document Structure || JSON Standard || Drill Support || | Empty | Invalid | Empty list of records | | null | Valid | Invalid | | Scalar | Valid | Invalid | | Array | Valid | Valid (in Drill 1.13) as long as the value is an array of objects | | Object | Valid | Single record | | List of objects | Invalid | List of records | In Drill, there must be no commas between top-level objects. This is a clear difference compared to the JSON standard which requires commas to separate items in a list or object. (This difference is because Drill's JSON file structure is not JSON. Think of it instead as a serialized set of JSON objects.) h4. Drill JSON Document Structure Thus, a typical JSON input file in Drill is: {code} {a: 10, b: "foo"} {a: 20, b: "bar"} {code} h4. Top-Level Array As noted above, for JSON compatibility, Drill also supports a top-level array of objects: {code} [ {a: 10, b: "foo"}, {a: 20, b: "bar"} ] {code} Note that, when the objects are in an array, a comma must separate objects. The above applies to a JSON text file. No separator is implied (or needed) if the data comes from a document database, a Kafka stream or other non-file sources. was (Author: paul.rogers): h4. JSON Structure The [JSON standard|https://tools.ietf.org/html/rfc7159] as [described more clearly here|https://www.json.org] states that a JSON document is a single value (null, scalar, object or list). Drill support a non-standard (but common) extension that allows a list of objects. || Document Structure || JSON Standard || Drill Support || | Empty | Invalid | Empty list of records | | null | Valid | Invalid | | Scalar | Valid | Invalid | | Array | Valid | Valid (in Drill 1.13) as long as the value is an array of objects | | Object | Valid | Single record | | List of objects | Invalid | List of records | In Drill, there must be no commas between top-level objects. This is a clear difference compared to the JSON standard which requires commas to separate items in a list or object. (This difference is because Drill's JSON file structure is not JSON. Think of it instead as a serialized set of JSON objects.) h4. Drill JSON Document Structure Thus, a typical JSON input file in Drill is: {code} {a: 10, b: "foo"} {a: 20, b: "bar"} {code} As noted above, for JSON compatibility, Drill also supports a top-level array of objects: {code} [ {a: 10, b: "foo"}, {a: 20, b: "bar"} ] {code} Note that, when the objects are in an array, a comma must separate objects. The above applies to a JSON text file. No separator is implied (or needed) if the data comes from a document database, a Kafka stream or other non-file sources. > Specify Drill's JSON behavior > - > > Key: DRILL-6035 > URL: https://issues.apache.org/jira/browse/DRILL-6035 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Pritesh Maker > > Drill supports JSON as its native data format. However, experience suggests > that Drill may have limitations in the JSON that Drill supports. This ticket > asks to clarify Drill's expected behavior on various kinds of JSON. > Topics to be addressed: > * Relational vs. non-relational structures > * JSON structures used in practice and how they map to Drill > * Support for varying data types > * Support for missing values, especially across files > These topics are complex, hence the request to provide a detailed > specifications that clarifies what Drill does and does not support (or what > is should and should not support.) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293265#comment-16293265 ] Paul Rogers edited comment on DRILL-6035 at 12/15/17 9:59 PM: -- h4. JSON Scalars Drill uses simple rules to infer scalar data types from JSON input: * If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: "varchar"}} * If the value is numeric, and has no decimal point, it is a Nullable {{BIGINT}}. Example: {{a: 10}} * If the value is numeric, and has a decimal point, it is a Nullable {{FLOAT8}}. Eample: {{a: 10.5}} * If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: true}} A special case occurs if the user enables {{store.json.all_text_mode}}. In this case, all cases above give rise to a nullable {{VARCHAR}} column type. h4. Nulls with Scalar Values The rules for {{null}} are a bit more complex. The rules below apply to the Drill 1.13 JSON reader, earlier versions require investigation. * If the {{null}} occurs after seeing one of the above types, then the value is a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.) * If the {{null}} occurs at the start of the file, Drill defers identifying the data type until it sees the first non-null value. * If the entire file (or first batch of a file: 4K records before Drill 1.13, variable number in Drill 1.13 or later) contains only {{null}} values, the column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has different behavior.) For example, in the following example, Drill 1.13 identifies the type as nullable {{BIGINT}}: {code} {a: null} {a: null} {a: null} {a: 10} {code} In the following example, Drill must guess that the data type is nullable {{VARCHAR}}: {code} {a: null} {a: null} {a: null} {code} was (Author: paul.rogers): Drill uses simple rules to infer scalar data types from JSON input: * If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: "varchar"}} * If the value is numeric, and has no decimal point, it is a Nullable {{BIGINT}}. Example: {{a: 10}} * If the value is numeric, and has a decimal point, it is a Nullable {{FLOAT8}}. Eample: {{a: 10.5}} * If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: true}} A special case occurs if the user enables {{store.json.all_text_mode}}. In this case, all cases above give rise to a nullable {{VARCHAR}} column type. h4. Nulls with Scalar Values The rules for {{null}} are a bit more complex. The rules below apply to the Drill 1.13 JSON reader, earlier versions require investigation. * If the {{null}} occurs after seeing one of the above types, then the value is a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.) * If the {{null}} occurs at the start of the file, Drill defers identifying the data type until it sees the first non-null value. * If the entire file (or first batch of a file: 4K records before Drill 1.13, variable number in Drill 1.13 or later) contains only {{null}} values, the column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has different behavior.) For example, in the following example, Drill 1.13 identifies the type as nullable {{BIGINT}}: {code} {a: null} {a: null} {a: null} {a: 10} {code} In the following example, Drill must guess that the data type is nullable {{VARCHAR}}: {code} {a: null} {a: null} {a: null} {code} > Specify Drill's JSON behavior > - > > Key: DRILL-6035 > URL: https://issues.apache.org/jira/browse/DRILL-6035 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Pritesh Maker > > Drill supports JSON as its native data format. However, experience suggests > that Drill may have limitations in the JSON that Drill supports. This ticket > asks to clarify Drill's expected behavior on various kinds of JSON. > Topics to be addressed: > * Relational vs. non-relational structures > * JSON structures used in practice and how they map to Drill > * Support for varying data types > * Support for missing values, especially across files > These topics are complex, hence the request to provide a detailed > specifications that clarifies what Drill does and does not support (or what > is should and should not support.) -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293298#comment-16293298 ] Paul Rogers edited comment on DRILL-6035 at 12/15/17 9:59 PM: -- h4. JSON Objects Drill support for JSON objects consists of three parts. * Drill supports a non-standard serialized JSON format in which an input file is a sequence of JSON objects. * Values within the top-level JSON object give rise to columns within the Drill row. * Nested objects in JSON give rise to nested {{MAP}} columns in Drill. * Arrays of JSON objects give rise to {{REPEATED MAP}} columns in Drill. h4. Top-Level JSON Objects Example of the expected JSON input format: {code} {a: 10} {a: 20} {a: 30} {code} Drill allows any amount of white-space between objects. It is common to place each object on a new line, though this is not required. Note that Drill allows value names to be quoted or unquoted. The following are both valid: {code} {"a": 10} {a: 20} {code} h4. Drill MAP type Drill uses the term "Map" to describe JSON objects. However, the {{MAP}} type in Drill is closer to the {{STRUCT}} type in Impala and Hive. That is, like a {{STRUCT}}, the schema of all map instances is identical across rows. (This is unlike, say, a JSON object or Python map in which the members of one instance are independent of those in any other instance.) As a result, the following example: {code} {a: {x: 10}} {a: {y: 20}} {code} Gives rise to records in Drill with data similar to the following: {code} {a: {x: 10, y: null}} {a: {x: null, y: 20}} {code} h4. JSON Object Arrays and Drill Repeated MAPs The example below shows a repeated object which gives rise to a {{REPEATED MAP}}: {code} {a: [{b: 10}, {b: 20}]} {code} h4. Nulls with JSON Objects JSON allows null values for a map: {code} {id: 1, a: {x: 10}} {id: 2, a: null} {id: 3} {code} Drill does not support the concept of a "nullable map". Instead, Drill defines all map members to be nullable. If the entire object is null (or missing) in JSON, Drill treats this the same as if every member were null. Thus, in Drill, the following are all equivalent: {code} {id: 1, a: {x: null, y: null}} {id: 2, a: {}} {id: 3, a: {x: null}} {id: 4, a: null} {id: 5} {code} As a side note, when exporting the above data to a JSON file, Drill cannot recreate the original structure. Instead, it writes all of the above in a common format. (The format has evolved based on previous bugs, need to investigate the current choice.) As described for scalars, Drill will defer selecting a type for a column if the initial records consist only of null values. If a later value is revealed to be a map, Drill will choose the map type. If the file (or first batch) consists only of nulls, then Drill cannot know the type and guesses {{VARCHAR}}. This will lead to a schema change error if a later file (or batch) reveals the type to actually be a map (since {{VARCHAR}} and {{MAP}} are not compatible.) {code} {id: 1} {id: 2, a: null} {id: 3, a: null} {id: 4, a: {x: 10, y: 20}} {code} was (Author: paul.rogers): Drill support for JSON objects consists of three parts. * Drill supports a non-standard serialized JSON format in which an input file is a sequence of JSON objects. * Values within the top-level JSON object give rise to columns within the Drill row. * Nested objects in JSON give rise to nested {{MAP}} columns in Drill. * Arrays of JSON objects give rise to {{REPEATED MAP}} columns in Drill. h4. Top-Level JSON Objects Example of the expected JSON input format: {code} {a: 10} {a: 20} {a: 30} {code} Drill allows any amount of white-space between objects. It is common to place each object on a new line, though this is not required. Note that Drill allows value names to be quoted or unquoted. The following are both valid: {code} {"a": 10} {a: 20} {code} h4. Drill MAP type Drill uses the term "Map" to describe JSON objects. However, the {{MAP}} type in Drill is closer to the {{STRUCT}} type in Impala and Hive. That is, like a {{STRUCT}}, the schema of all map instances is identical across rows. (This is unlike, say, a JSON object or Python map in which the members of one instance are independent of those in any other instance.) As a result, the following example: {code} {a: {x: 10}} {a: {y: 20}} {code} Gives rise to records in Drill with data similar to the following: {code} {a: {x: 10, y: null}} {a: {x: null, y: 20}} {code} h4. JSON Object Arrays and Drill Repeated MAPs The example below shows a repeated object which gives rise to a {{REPEATED MAP}}: {code} {a: [{b: 10}, {b: 20}]} {code} h4. Nulls with JSON Objects JSON allows null values for a map: {code} {id: 1, a: {x: 10}} {id: 2, a: null} {id: 3} {code} Drill does not support the concept of a "nullable map". Instead, Drill defines all map members to be nullable. If the entire object is null
[jira] [Comment Edited] (DRILL-6035) Specify Drill's JSON behavior
[ https://issues.apache.org/jira/browse/DRILL-6035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16293265#comment-16293265 ] Paul Rogers edited comment on DRILL-6035 at 12/15/17 9:46 PM: -- Drill uses simple rules to infer scalar data types from JSON input: * If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: "varchar"}} * If the value is numeric, and has no decimal point, it is a Nullable {{BIGINT}}. Example: {{a: 10}} * If the value is numeric, and has a decimal point, it is a Nullable {{FLOAT8}}. Eample: {{a: 10.5}} * If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: true}} A special case occurs if the user enables {{store.json.all_text_mode}}. In this case, all cases above give rise to a nullable {{VARCHAR}} column type. h4. Nulls with Scalar Values The rules for {{null}} are a bit more complex. The rules below apply to the Drill 1.13 JSON reader, earlier versions require investigation. * If the {{null}} occurs after seeing one of the above types, then the value is a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.) * If the {{null}} occurs at the start of the file, Drill defers identifying the data type until it sees the first non-null value. * If the entire file (or first batch of a file: 4K records before Drill 1.13, variable number in Drill 1.13 or later) contains only {{null}} values, the column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has different behavior.) For example, in the following example, Drill 1.13 identifies the type as nullable {{BIGINT}}: {code} {a: null} {a: null} {a: null} {a: 10} {code} In the following example, Drill must guess that the data type is nullable {{VARCHAR}}: {code} {a: null} {a: null} {a: null} {code} was (Author: paul.rogers): Drill uses simple rules to infer scalar data types from JSON input: * If the value is quoted, it is a Nullable {{VARCHAR}}. Example: {{a: "varchar"}} * If the value is numeric, and has no decimal point, it is a Nullable {{BIGINT}}. Example: {{a: 10}} * If the value is numeric, and has a decimal point, it is a Nullable {{FLOAT8}}. Eample: {{a: 10.5}} * If the value is {{true}} or {{false}}, it is a Nullable {{BIT}}, stored as Nullable {{UINT1}}, with {{true}} stored as 1, {{false}} as 0. Example: {{a: true}} A special case occurs if the user enables {{store.json.all_text_mode}}. In this case, all cases above give rise to a nullable {{VARCHAR}} column type. The rules for {{null}} are a bit more complex. The rules below apply to the Drill 1.13 JSON reader, earlier versions require investigation. * If the {{null}} occurs after seeing one of the above types, then the value is a {{NULL}} of that type (such as a {{NULL}} {{VARCHAR}}.) * If the {{null}} occurs at the start of the file, Drill defers identifying the data type until it sees the first non-null value. * If the entire file (or first batch of a file: 4K records before Drill 1.13, variable number in Drill 1.13 or later) contains only {{null}} values, the column is defined as nullable {{VARCHAR}}. (Drill 1.12 and earlier likely has different behavior.) For example, in the following example, Drill 1.13 identifies the type as nullable {{BIGINT}}: {code} {a: null} {a: null} {a: null} {a: 10} {code} In the following example, Drill must guess that the data type is nullable {{VARCHAR}}: {code} {a: null} {a: null} {a: null} {code} > Specify Drill's JSON behavior > - > > Key: DRILL-6035 > URL: https://issues.apache.org/jira/browse/DRILL-6035 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Pritesh Maker > > Drill supports JSON as its native data format. However, experience suggests > that Drill may have limitations in the JSON that Drill supports. This ticket > asks to clarify Drill's expected behavior on various kinds of JSON. > Topics to be addressed: > * Relational vs. non-relational structures > * JSON structures used in practice and how they map to Drill > * Support for varying data types > * Support for missing values, especially across files > These topics are complex, hence the request to provide a detailed > specifications that clarifies what Drill does and does not support (or what > is should and should not support.) -- This message was sent by Atlassian JIRA (v6.4.14#64029)