[jira] [Created] (DRILL-7555) Standardize Jackson ObjectMapper usage
Paul Rogers created DRILL-7555:
--
Summary: Standardize Jackson ObjectMapper usage
Key: DRILL-7555
URL: https://issues.apache.org/jira/browse/DRILL-7555
Project: Apache Drill
Issue Type: Improvement
Reporter: Paul Rogers

Drill makes heavy use of Jackson to serialize Java objects to/from JSON, and has added multiple custom serializers; see the {{PhysicalPlanReader}} constructor for a list of these. However, many modules in Drill declare their own {{ObjectMapper}} instances, often without some (or all) of the custom Drill serializers. This is tedious and error-prone. We should:

* Define a standard Drill object mapper.
* Replace all ad-hoc instances of {{ObjectMapper}} with the Drill version (when reading/writing Drill-defined JSON).

Further, storage plugins need an {{ObjectMapper}} to convert a scan spec from JSON to Java. (It is not clear why we do this serialization, or whether it is needed, but that is how things work at present.) Plugins don't have access to any of the "full feature" object mappers: each plugin would have to cobble together the serdes it needs. So, after standardizing the object mappers, pass an instance of the standard mapper to each storage plugin.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
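As a minimal sketch of the kind of standard mapper this proposes: the factory class and module registration below are hypothetical illustrations, not existing Drill code; the actual list of serializers would come from the {{PhysicalPlanReader}} constructor.

```java
import com.fasterxml.jackson.databind.ObjectMapper;
import com.fasterxml.jackson.databind.module.SimpleModule;

// Hypothetical sketch of a single, shared "Drill standard" mapper.
// Class and module names are illustrative, not actual Drill APIs.
public class DrillObjectMapperFactory {
  private static final ObjectMapper INSTANCE = create();

  private static ObjectMapper create() {
    ObjectMapper mapper = new ObjectMapper();
    SimpleModule module = new SimpleModule("drill-serdes");
    // Register Drill's custom serializers here, in one place, e.g.:
    // module.addSerializer(MajorType.class, new MajorTypeSerDe.Se());
    mapper.registerModule(module);
    return mapper;
  }

  // Every module (and each storage plugin) would receive or call this
  // instead of constructing its own ObjectMapper.
  public static ObjectMapper standardMapper() {
    return INSTANCE;
  }
}
```

Ad-hoc {{new ObjectMapper()}} calls would then be replaced by {{DrillObjectMapperFactory.standardMapper()}}, and the same instance could be passed into storage plugins for scan-spec serdes.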
[jira] [Created] (DRILL-7553) Modernize type management
Paul Rogers created DRILL-7553:
--
Summary: Modernize type management
Key: DRILL-7553
URL: https://issues.apache.org/jira/browse/DRILL-7553
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers

This is a roll-up issue for our ongoing discussion around improving and modernizing Drill's runtime type system. At present, Drill approaches types very differently from most other DB and query tools:

* Drill does little (or no) plan-time type checking and propagation. Instead, all type management is done at execution time: in each reader, in each operator, and ultimately in the client.
* Drill allows structured types (Map, Dict, arrays), but does not have the extended SQL statements to fully utilize these types.
* Drill supports varying types: two readers can both read column {{c}}, but can do so with different types. We've always hoped to discover some way to reconcile the types but, at present, the functionality is buggy and incomplete, and it is not clear that a viable solution exists. Drill also provides "formal" varying types, Union and List; these types are also not fully supported.

These three topics are closely related. "Schema-free" means we must infer types at read time, so Drill cannot do plan-time type analysis of the kind done in other engines. Because of schema-on-read (which is what "schema-free" really means), two readers can read different types for the same fields, so we end up with varying or inconsistent types and are forced to figure out some way to manage the conflicts.

The gist of the proposal explored in this ticket is to exploit the learning from other engines: to embrace types when available, and to impose tractable rules when types are discovered at run time.

h4. Proposal Summary

This is very much a discussion draft. Here are some suggestions to get started.

# Set as our goal to manage types at plan time. Runtime type discovery becomes a (limited) special case.
# Pull type resolution, propagation and checking into the planner, where it can be done once per query. Move it out of execution, where it must be done multiple times: once per operator per minor fragment. Implement the standard DB type checking and propagation rules. (These rules are currently implemented implicitly, deep in the code-gen code.)
# Generate operator code in the planner; send it to workers as part of the physical plan (to avoid the need to generate the code on each worker).
# Provide schema-aware extensions for storage and format plugins so that they can advertise a schema when known. (Examples: Hive sources get schemas from HMS, JDBC sources get schemas from the underlying database, Avro, Parquet and others obtain schemas from the target files, etc.) This mechanism works with, but is in addition to, the Drill metastore.
# Separate the concept of "schema-free" (no plan-time schema) from "schema-on-read" (schema is known in the planner, and data is read into that schema by readers; e.g. the Hive model). Drill remains schema-on-read (for sources that need it), but does not attempt the impossible with schema-free (that is, we no longer read inconsistent data into a relational model and hope we can make it work).
# For convenience, allow "schema-free" (no plan-time schema). The restriction is that all readers *must* produce the same schema. It is a fatal (to the query) error for an operator to receive batches with different schemas. (The reasons can be discussed separately.)
# Preserve the Map, Dict and Array types, but with tighter semantics: all elements must be of the same type.
# Replace the Union and List types with a new type: Java objects. Java objects can be anything and can vary from row to row. Java types are processed using UDFs (or Drill functions).
# All "extended" types (complex: Map, Dict and Array, or Java objects) must be reduced to primitive types in a top-level tuple if the client is ODBC (which cannot handle non-relational types).
The same is true if the destination is a simple sink such as CSV or JDBC.
# Provide a light-weight way to resolve schema ambiguities that are identified by the new, stricter type rules. The light-weight solution is either a file or some kind of simple Drill-managed registry, akin to the plugin registry. Users can run a query, see if there are conflicting types and, if so, add a resolution rule to the registry. The user then reruns the query with a clean result.

In the past couple of years we have made progress in some of these areas. This ticket suggests we bring those threads together in a coherent strategy.

h4. Arrow/Java/Fixed Block/Something Else Storage

The ideas here are independent of choices we might make for our internal data representation format. The above design works equally well with either Drill or Arrow vectors, or with something else.
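To illustrate the kind of plan-time type rule the proposal describes, here is a toy sketch (not Drill code; the enum and the widening rule are invented for illustration) that resolves the types reported by two readers for the same column, widening numerics and reporting a conflict at plan time instead of deferring it to execution:

```java
import java.util.Optional;

// Toy subset of a type system, for illustration only.
enum SimpleType { BIGINT, FLOAT8, VARCHAR }

class TypeResolver {
  // Resolve the common output type for a column that two readers report
  // with (possibly) different types. Numerics widen toward FLOAT8;
  // anything else must match exactly, or the query fails at plan time.
  static Optional<SimpleType> resolve(SimpleType a, SimpleType b) {
    if (a == b) {
      return Optional.of(a);
    }
    if (isNumeric(a) && isNumeric(b)) {
      return Optional.of(SimpleType.FLOAT8);
    }
    return Optional.empty(); // conflict: report before execution starts
  }

  private static boolean isNumeric(SimpleType t) {
    return t == SimpleType.BIGINT || t == SimpleType.FLOAT8;
  }
}
```

An empty result here would correspond to the "fatal to the query" case above, which a user could then resolve with a rule in the proposed registry.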
[jira] [Comment Edited] (DRILL-7551) Improve Error Reporting
[ https://issues.apache.org/jira/browse/DRILL-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023322#comment-17023322 ]

Paul Rogers edited comment on DRILL-7551 at 1/27/20 1:05 AM:
-

Fixing errors has a number of dimensions:

# Inconsistent use of exceptions at runtime. We have {{UserException}}, which creates some structure, but we also throw random other unchecked exceptions. {{UserException}}s do not, however, provide a mapping into SQL errors of the type understood by xDBC drivers.
# Inconsistent error context. A low-level bit of code (a file open call, say) only knows that it failed, and that is what it tends to report ("IO Error 10"). At the next level up, the surrounding code might know a bit more ("Error reading HDFS:/foo/bar1234.parquet"). What we need is a bit of synthesis to say, "Too many network timeouts reading block 17 from bar1234.parquet of the `foo` table stored in the HDFS system `sales`".
# Errors are exceptions, and we are overly generous in showing every last bit of stack trace on the client, the server and so on. Even those of us who live in the code find that the few lines we care about (an NPE in such-and-such call stack) are lost in hundreds of lines that, frankly, I've never personally looked at.
# The client API is a bit of a mess in error reporting: returning unchecked {{UserException}}s rather than a well-structured {{DrillException}} (say) designed for client use. (This is probably because the Drill client was a quick short-term solution based on Drill's internal Drillbit-to-Drillbit RPC.)
# Catch errors as early as possible. Examples: plan-time type checking (eventually), storage plugin validation in the UI (see comment below).

In addition to the above execution-focused items, it would be good to look at the SQL parser/planner errors as well. Not sure that returning 20-30 lines of possible tokens is super-helpful when I make a SQL typo.
Probably fine to say, "Didn't understand the SQL at line 10, position 3."

To clean up our error act, we must move forward on each of these fronts. For my part, I've been chipping away at item 1: trying to convert all code to throw {{UserException}}. EVF provides an "error context" that helps (but does not solve) item 2. I've also made a pass on items 3 & 4, but have been hesitant to make any changes to the client API for fear of breaking the two JDBC drivers and our (currently unstaffed) C++ client.

Would be great to get some help. For example, how can we provide user-meaningful context in our errors (item 2)? How can we map errors into standard SQL error and warning codes (part of item 1)? Maybe someone can help us figure out how to achieve item 4 with minimal client impact. And, of course, once we set the pattern we want to use, everyone can help by improving each of the many places where we raise exceptions. Item 5 can be done independently of other tasks.
[jira] [Commented] (DRILL-7551) Improve Error Reporting
[ https://issues.apache.org/jira/browse/DRILL-7551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023322#comment-17023322 ]

Paul Rogers commented on DRILL-7551:

Fixing errors has a number of dimensions:

# Inconsistent use of exceptions at runtime. We have {{UserException}}, which creates some structure, but we also throw random other unchecked exceptions. {{UserException}}s do not, however, provide a mapping into SQL errors of the type understood by xDBC drivers.
# Inconsistent error context. A low-level bit of code (a file open call, say) only knows that it failed, and that is what it tends to report ("IO Error 10"). At the next level up, the surrounding code might know a bit more ("Error reading HDFS:/foo/bar1234.parquet"). What we need is a bit of synthesis to say, "Too many network timeouts reading block 17 from bar1234.parquet of the `foo` table stored in the HDFS system `sales`".
# Errors are exceptions, and we are overly generous in showing every last bit of stack trace on the client, the server and so on. Even those of us who live in the code find that the few lines we care about (an NPE in such-and-such call stack) are lost in hundreds of lines that, frankly, I've never personally looked at.
# The client API is a bit of a mess in error reporting: returning unchecked {{UserException}}s rather than a well-structured {{DrillException}} (say) designed for client use. (This is probably because the Drill client was a quick short-term solution based on Drill's internal Drillbit-to-Drillbit RPC.)

In addition to the above execution-focused items, it would be good to look at the SQL parser/planner errors as well. Not sure that returning 20-30 lines of possible tokens is super-helpful when I make a SQL typo. Probably fine to say, "Didn't understand the SQL at line 10, position 3."

To clean up our error act, we must move forward on each of these fronts.
For my part, I've been chipping away at item 1: trying to convert all code to throw {{UserException}}. EVF provides an "error context" that helps (but does not solve) item 2. I've also made a pass on items 3 & 4, but have been hesitant to make any changes to the client API for fear of breaking the two JDBC drivers and our (currently unstaffed) C++ client.

Would be great to get some help. For example, how can we provide user-meaningful context in our errors (item 2)? How can we map errors into standard SQL error and warning codes (part of item 1)? Maybe someone can help us figure out how to achieve item 4 with minimal client impact. And, of course, once we set the pattern we want to use, everyone can help by improving each of the many places where we raise exceptions.

> Improve Error Reporting
> ---
>
> Key: DRILL-7551
> URL: https://issues.apache.org/jira/browse/DRILL-7551
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.17.0
> Reporter: Charles Givre
> Priority: Major
> Fix For: 1.18.0
>
> This Jira is to serve as a master Jira issue to improve the usability of
> error messages. Instead of dumping stack traces, the overall goal is to give
> the user something that can actually explain:
> # What went wrong
> # How to fix it
> Work that relates to this should be created as subtasks.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
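As a sketch of the layered context item 2 calls for, here is a minimal, hypothetical exception class (the names are illustrative; this is not Drill's actual {{UserException}} API) that lets each layer attach what it knows as the error propagates upward, so the final message synthesizes all levels:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an unchecked exception that accumulates one line
// of context per layer it passes through. Not Drill's real API.
public class ContextualException extends RuntimeException {
  private final List<String> context = new ArrayList<>();

  public ContextualException(String message, Throwable cause) {
    super(message, cause);
  }

  // Each layer that catches (or observes) the error adds what it knows.
  public ContextualException addContext(String line) {
    context.add(line);
    return this;
  }

  @Override
  public String getMessage() {
    StringBuilder sb = new StringBuilder(super.getMessage());
    for (String line : context) {
      sb.append("\n  ").append(line);
    }
    return sb.toString();
  }
}
```

The low-level file code would throw with "IO Error 10"; the reader layer would add the file path; the scan operator would add the table and storage-system names, producing the synthesized message this comment asks for.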
[jira] [Created] (DRILL-7545) Projection ambiguities in complex types
Paul Rogers created DRILL-7545:
--
Summary: Projection ambiguities in complex types
Key: DRILL-7545
URL: https://issues.apache.org/jira/browse/DRILL-7545
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.17.0
Reporter: Paul Rogers

Summarized from an e-mail chain on the dev mailing list:

We recently introduced the DICT type. We also added the EVF framework. We have a bit of code which parses the projection list, then checks if a column from a reader is consistent with the projection. The idea is to ensure that the columns produced by a Scan will be valid when a Project later tries to use them with the given project list. And, if the Scan says it can support Project push-down, then the Scan is obligated to do the full check.

First we'll explain how I'll solve the projection problem given your explanation. Then we'll point out three potential ambiguities. Thanks to Bohdan for his explanations. The problems here are not due to any one person. As explained below, they are due to trying to add concepts into SQL which SQL is not well-suited to support.

h4. Projection for DICT Types

Queries go through two major steps: planning and execution. At the planning stage we use SQL syntax for the project list. For example:

{code:sql}
explain plan for SELECT a, e.`map`.`member`, `dict`['key'], `array`[10]
FROM cp.`employee.json` e
{code}

The planner sends an execution plan to operators. The project list appears in JSON. For the above:

{code:json}
"columns" : [ "`a`", "`map`.`member`", "`dict`.`key`", "`array`[10]" ],
{code}

We see that the JSON works as Bohdan described:

* The SQL map "map.member" syntax is converted to "`map`.`member`" in the JSON plan.
* The SQL DICT "`dict`['key']" syntax is converted to a form identical to maps: "`dict`.`key`".
* The SQL DICT/array "`array`[10]" syntax is converted to "`array`[10]" in JSON.

That is, on the execution side, we can't tell the difference between a MAP and a DICT request.
We also can't tell the difference between an array and a DICT request. Apparently, because of this, the Schema Path parser does not recognize DICT syntax. Given the way projection works, "a.b" and "a['b']" are identical: either works for both a map or a DICT with VARCHAR keys. That is, we just say that map and array projection are both compatible with a DICT column.

h4. Projection Checking in Scan

Mentioned above is that a Scan that supports Project push-down must ensure that the output columns match the projection list. Doing that check is quite easy when the projection is simple: `a`. The column `a` can match a data column `a` of any type. The task is a bit harder when the projection is an array: `a[0]`. Since this now means either an array or a DICT with an INT key, this projected column can match:

* Any REPEATED type
* A LIST
* A non-REPEATED DICT with INT, BIGINT, SMALLINT or TINYINT keys (ignoring the UINTx types)
* A REPEATED DICT with any type of key
* A UNION (because a union might contain a repeated type)

We can also handle a map projection, `a.b`, which matches:

* A (possibly repeated) map
* A (possibly repeated) DICT with VARCHAR keys
* A UNION (because a union might contain a possibly-repeated map)
* A LIST (because the list can contain a union which might contain a possibly-repeated map)

Things get very complex indeed when we have multiple qualifiers, such as `a[0][1].b`, which matches:

* A LIST that contains a repeated map
* A REPEATED LIST that contains a (possibly-repeated) map
* A DICT with an INT key that has a value of a repeated map
* A REPEATED DICT that contains an INT key that contains a MAP
* (If we had sufficient metadata) A LIST that contains a REPEATED DICT with a VARCHAR key

h4. DICT Projection Ambiguities

The DICT type introduces an ambiguity. Note above that `a.b` can refer to either a REPEATED or non-REPEATED MAP. If non-repeated, `a.b` means to get the one value for member `b` of map `a`.
But, if the map is REPEATED, this means to project an array of `b` values obtained from the array of maps.

For a DICT, there is an ambiguity with `a[0][1]` if the DICT is a repeated DICT with INT keys and REPEATED BIGINT values: that is, ARRAY<DICT<INT, ARRAY<BIGINT>>>. Does `a[0][1]` mean to pull out the 0th element of the REPEATED DICT, then look up where the key == 1? Or does it mean to pull out all the DICT array values where the key == 0 and then pull out the 1st value of the INT array? That is, because we have an implied "in all members of the array" syntax, one can interpret this case as:

{noformat}
repeatedDict[0].valueOf(1) --> ARRAY<BIGINT> -- All the values in the key=1 array of element 0
{noformat}

or

{noformat}
repeatedDict.valueOf(0)[1] --> ARRAY<BIGINT> -- All the values in the key=0, element 1 positions across all DICT elements
{noformat}

It would seem to make sense to prefer the first interpretation. Unfortunately, MAPs already use the
[jira] [Commented] (DRILL-7542) Fix Drill-on-Yarn logger
[ https://issues.apache.org/jira/browse/DRILL-7542?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17019695#comment-17019695 ]

Paul Rogers commented on DRILL-7542:

[~arina], I can't recall this detail. I will speculate that I had to use the same logging as the YARN framework. DoY has two executables: the client and the App Master. Both make heavy use of the YARN and HDFS APIs. I may have found that things worked best if I used the same logger for my code as YARN and HDFS used. That said, feel free to experiment; perhaps I missed something that would allow us to get YARN and HDFS to log to our logger. I'm a pure novice at the logging mechanisms.

> Fix Drill-on-Yarn logger
>
> Key: DRILL-7542
> URL: https://issues.apache.org/jira/browse/DRILL-7542
> Project: Apache Drill
> Issue Type: Bug
> Affects Versions: 1.16.0, 1.17.0
> Reporter: Arina Ielchiieva
> Priority: Major
>
> The Drill project uses the Logback logger backed by SLF4J:
> {noformat}
> import org.slf4j.Logger;
> import org.slf4j.LoggerFactory;
>
> private static final Logger logger = LoggerFactory.getLogger(ResultsListener.class);
> {noformat}
> The Drill-on-Yarn project uses commons logging:
> {noformat}
> import org.apache.commons.logging.Log;
> import org.apache.commons.logging.LogFactory;
>
> private static final Log LOG = LogFactory.getLog(AbstractScheduler.class);
> {noformat}
> It would be nice if all project components used the same approach for logging.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
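If we do want DoY's commons-logging calls to land in Logback like the rest of Drill, one standard (though untested here for DoY) approach is SLF4J's jcl-over-slf4j bridge, which replaces commons-logging with an implementation that forwards every call to SLF4J:

```xml
<!-- Sketch of a possible dependency change; version deliberately omitted. -->
<dependency>
  <groupId>org.slf4j</groupId>
  <artifactId>jcl-over-slf4j</artifactId>
</dependency>
```

The original commons-logging jar would then need to be excluded from the YARN and HDFS dependencies so the bridge is the only commons-logging implementation on the classpath; whether YARN and HDFS behave well under that arrangement is exactly the experiment suggested above.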
[jira] [Created] (DRILL-7522) JSON reader (v1) omits null columns in SELECT *
Paul Rogers created DRILL-7522:
--
Summary: JSON reader (v1) omits null columns in SELECT *
Key: DRILL-7522
URL: https://issues.apache.org/jira/browse/DRILL-7522
Project: Apache Drill
Issue Type: Bug
Affects Versions: 1.17.0
Reporter: Paul Rogers

The unit test {{TestStarQueries.testSelStarOrderBy}} runs the following query:

{code:sql}
select * from cp.`employee.json` order by last_name
{code}

The query reads a Foodmart file, {{employee.json}}, that has records like this:

{code:json}
{"employee_id":53, ..., "end_date":null, "salary":...}
{code}

The field {{end_date}} turns out to be null for all records in the file. Now look at the verification query: it carefully includes all fields *except* {{end_date}}. That is, the test was written to expect that the JSON reader will omit a column whose values are all NULL.

While it might seem OK to omit all-NULL columns (they don't have any data), the problem is that Drill is a distributed system. Suppose we query a directory of 50 such files, some of which have all NULLs in one field, some of which have all NULLs in another. Although the files have the same schema, {{SELECT *}} will return different schemas (depending on which file has which non-NULL columns). A downstream operator will have to merge these schemas. And, since Drill fills in a Nullable INT field for missing columns, we might end up with a schema change exception when the actual field type, say VARCHAR, appears.

One can argue that {{SELECT *}} means "return all columns", not "return all columns except those that happen to be null in the first batch." Yes, we have the problem of not knowing the actual field type; eventually, provided schemas will resolve such issues.

Note that in the "V2" JSON reader, {{end_date}} is included in the query.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7510) Incorrect String/number comparison with union types
Paul Rogers created DRILL-7510:
--
Summary: Incorrect String/number comparison with union types
Key: DRILL-7510
URL: https://issues.apache.org/jira/browse/DRILL-7510
Project: Apache Drill
Issue Type: Bug
Reporter: Paul Rogers
Assignee: Paul Rogers

Run the test {{TestTopNSchemaChanges.testUnionTypes()}}. It will pass. Look at the expected output:

{code:java}
builder.baselineValues(0l, 0l);
builder.baselineValues(1.0d, 1.0d);
builder.baselineValues(3l, 3l);
builder.baselineValues(4.0d, 4.0d);
builder.baselineValues(6l, 6l);
builder.baselineValues(7.0d, 7.0d);
builder.baselineValues(9l, 9l);
builder.baselineValues("2", "2");
{code}

The string values sort after the numbers. After the fix for DRILL-7502, we get the following output:

{code:java}
builder.baselineValues(0l, 0l);
builder.baselineValues(1.0d, 1.0d);
builder.baselineValues("2", "2");
builder.baselineValues(3l, 3l);
builder.baselineValues(4.0d, 4.0d);
builder.baselineValues("5", "5");
builder.baselineValues(6l, 6l);
builder.baselineValues(7.0d, 7.0d);
{code}

This accidental fix suggests that the original design was to convert values to the same type, then compare them. Converting numbers to strings, say, would cause them to be lexically ordered, as in the second output.

The {{UNION}} type is poorly supported, so it is likely that this bug does not affect actual users.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7507) Convert fragment interrupts to exceptions
Paul Rogers created DRILL-7507:
--
Summary: Convert fragment interrupts to exceptions
Key: DRILL-7507
URL: https://issues.apache.org/jira/browse/DRILL-7507
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
Fix For: 1.18.0

Operators periodically check if they should continue by calling the {{shouldContinue()}} method. If the method returns false, operators return a {{STOP}} status in some form. This change modifies the handling to throw an exception instead, so that cancelling a fragment follows the same path as error handling.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
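The change can be sketched as follows; the class and method names below are illustrative stand-ins, not Drill's actual fragment-context API:

```java
// Hypothetical unchecked exception representing query cancellation,
// handled by the same path as any other execution error.
public class QueryCancelledException extends RuntimeException {
  public QueryCancelledException() {
    super("Query cancelled");
  }
}

// Illustrative stand-in for a fragment context.
class FragmentContextSketch {
  private volatile boolean shouldContinue = true;

  public void cancel() {
    shouldContinue = false;
  }

  // Before: operators test this flag and return a STOP status upward.
  public boolean shouldContinue() {
    return shouldContinue;
  }

  // After: operators simply call checkContinue(); cancellation unwinds
  // the operator stack like an error, with no STOP status to propagate.
  public void checkContinue() {
    if (!shouldContinue) {
      throw new QueryCancelledException();
    }
  }
}
```

The benefit is that operators no longer need to check and forward a status code at every level; one try/catch at the fragment root handles both errors and cancellation.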
[jira] [Created] (DRILL-7506) Simplify code gen error handling
Paul Rogers created DRILL-7506:
--
Summary: Simplify code gen error handling
Key: DRILL-7506
URL: https://issues.apache.org/jira/browse/DRILL-7506
Project: Apache Drill
Issue Type: Improvement
Affects Versions: 1.17.0
Reporter: Paul Rogers
Assignee: Paul Rogers
Fix For: 1.18.0

Code generation can generate a variety of errors. Most operators bubble these exceptions up several layers in the code before catching them. This patch moves error handling closer to the code gen itself to allow a) simpler code, and b) clearer error messages.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
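A minimal sketch of the idea (the classes below are illustrative, not Drill's actual code-gen classes): catch failures at the point of generation and wrap them in one descriptive unchecked exception, instead of letting checked exceptions bubble up through several operator layers.

```java
// Hypothetical unchecked wrapper for any code-generation failure.
public class CodeGenException extends RuntimeException {
  public CodeGenException(String message, Throwable cause) {
    super(message, cause);
  }
}

class CodeGeneratorSketch {
  // Stand-in for the real compile step, which can throw checked exceptions.
  interface Compiler {
    Object compile(String source) throws Exception;
  }

  static Object generate(Compiler compiler, String source) {
    try {
      return compiler.compile(source);
    } catch (Exception e) {
      // Handled here, next to the failing code gen, so the message can
      // describe exactly what was being generated when it failed.
      throw new CodeGenException("Failed to compile generated code", e);
    }
  }
}
```

Callers several layers up then see a single, well-described runtime exception rather than having to declare and re-throw checked exceptions at each level.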
[jira] [Updated] (DRILL-7359) Add support for DICT type in RowSet Framework
[ https://issues.apache.org/jira/browse/DRILL-7359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-7359:
---
Labels: ready-to-commit (was: )

> Add support for DICT type in RowSet Framework
> -
>
> Key: DRILL-7359
> URL: https://issues.apache.org/jira/browse/DRILL-7359
> Project: Apache Drill
> Issue Type: New Feature
> Reporter: Bohdan Kazydub
> Assignee: Bohdan Kazydub
> Priority: Major
> Labels: ready-to-commit
> Fix For: 1.18.0
>
> Add support for the new DICT data type (see DRILL-7096) in the RowSet Framework

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-6360) Document the typeof() function
[ https://issues.apache.org/jira/browse/DRILL-6360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006252#comment-17006252 ]

Paul Rogers commented on DRILL-6360:

The PR for DRILL-7502 provides additional documentation about the updated behaviour of this function and two of the other type functions.

> Document the typeof() function
> --
>
> Key: DRILL-6360
> URL: https://issues.apache.org/jira/browse/DRILL-6360
> Project: Apache Drill
> Issue Type: Task
> Components: Documentation
> Affects Versions: 1.13.0
> Reporter: Paul Rogers
> Assignee: Bridget Bevens
> Priority: Minor
> Labels: doc-impacting
>
> Drill has a {{typeof()}} function that returns the data type (but not mode)
> of a column. It was discussed on the dev list recently. However, a search of
> the Drill web site, and a scan by hand, failed to turn up documentation about
> the function.
> As a general suggestion, it would be great to have an alphabetical list of all
> functions so we don't have to hunt all over the site to find which functions
> are available.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-6362) typeof() lies about types
[ https://issues.apache.org/jira/browse/DRILL-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006251#comment-17006251 ]

Paul Rogers commented on DRILL-6362:

DRILL-7502 fixes this issue, along with several related issues.

> typeof() lies about types
> -
>
> Key: DRILL-6362
> URL: https://issues.apache.org/jira/browse/DRILL-6362
> Project: Apache Drill
> Issue Type: Improvement
> Affects Versions: 1.13.0
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
>
> Drill provides a {{typeof()}} function that returns the type of a column.
> But, it seems to make up types. Consider the following input file:
> {noformat}
> {a: true}
> {a: false}
> {a: null}
> {noformat}
> Consider the following two queries:
> {noformat}
> SELECT a FROM `json/boolean.json`;
> +--------+
> |   a    |
> +--------+
> | true   |
> | false  |
> | null   |
> +--------+
>
> SELECT typeof(a) FROM `json/boolean.json`;
> +---------+
> | EXPR$0  |
> +---------+
> | BIT     |
> | BIT     |
> | NULL    |
> +---------+
> {noformat}
> Notice that the values are reported as BIT. But, I believe the actual type is
> UInt1 (the bit vector is, I believe, deprecated). Then, the function reports
> NULL instead of the actual type for the null value.
> Since Drill has an {{isnull()}} function, there is no reason for {{typeof()}}
> to muddle the type.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-5189) There's no documentation for the typeof() function
[ https://issues.apache.org/jira/browse/DRILL-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17006250#comment-17006250 ]

Paul Rogers commented on DRILL-5189:

The PR for DRILL-7502 provides additional documentation about the updated behaviour of this function and two of the other type functions.

> There's no documentation for the typeof() function
> --
>
> Key: DRILL-5189
> URL: https://issues.apache.org/jira/browse/DRILL-5189
> Project: Apache Drill
> Issue Type: Bug
> Components: Documentation
> Reporter: Chris Westin
> Assignee: Bridget Bevens
> Priority: Major
>
> I looked through the documentation at https://drill.apache.org/docs/ under
> SQL Reference > SQL Functions > ... and could not find any reference to
> typeof(). Google searches only turned up a reference to DRILL-4204.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (DRILL-7502) Incorrect/invalid codegen for typeof() with UNION
[ https://issues.apache.org/jira/browse/DRILL-7502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers reassigned DRILL-7502:
--
Assignee: Paul Rogers

> Incorrect/invalid codegen for typeof() with UNION
> -
>
> Key: DRILL-7502
> URL: https://issues.apache.org/jira/browse/DRILL-7502
> Project: Apache Drill
> Issue Type: Bug
> Reporter: Paul Rogers
> Assignee: Paul Rogers
> Priority: Major
>
> The {{typeof()}} function is defined as follows:
> {code:java}
> @FunctionTemplate(names = {"typeOf"},
>   scope = FunctionTemplate.FunctionScope.SIMPLE,
>   nulls = NullHandling.INTERNAL)
> public static class GetType implements DrillSimpleFunc {
>   @Param
>   FieldReader input;
>   @Output
>   VarCharHolder out;
>   @Inject
>   DrillBuf buf;
>
>   @Override
>   public void setup() {}
>
>   @Override
>   public void eval() {
>     String typeName = input.getTypeString();
>     byte[] type = typeName.getBytes();
>     buf = buf.reallocIfNeeded(type.length);
>     buf.setBytes(0, type);
>     out.buffer = buf;
>     out.start = 0;
>     out.end = type.length;
>   }
> }
> {code}
> Note that the {{input}} field is defined as {{FieldReader}}, which has a
> method called {{getTypeString()}}. As a result, the code works fine in all
> existing tests in {{TestTypeFns}}.
> I tried to add a function to use {{typeof()}} on a column of type {{UNION}}.
> When I did, the query failed with a compile error in generated code: > {noformat} > SYSTEM ERROR: CompileException: Line 42, Column 43: > A method named "getTypeString" is not declared in any enclosing class nor > any supertype, nor through a static import > {noformat} > The stack trace shows the generated code; Note that the type of {{input}} > changes from a reader to a holder, causing code to be invalid: > {code:java} > public class ProjectorGen0 { > DrillBuf work0; > UnionVector vv1; > VarCharVector vv6; > DrillBuf work9; > VarCharVector vv11; > DrillBuf work14; > VarCharVector vv16; > public void doEval(int inIndex, int outIndex) > throws SchemaChangeException > { > { > UnionHolder out4 = new UnionHolder(); > { > out4 .isSet = vv1 .getAccessor().isSet((inIndex)); > if (out4 .isSet == 1) { > vv1 .getAccessor().get((inIndex), out4); > } > } > // start of eval portion of typeOf function. // > VarCharHolder out5 = new VarCharHolder(); > { > final VarCharHolder out = new VarCharHolder(); > UnionHolder input = out4; > DrillBuf buf = work0; > UnionFunctions$GetType_eval: > { > String typeName = input.getTypeString(); > byte[] type = typeName.getBytes(); > buf = buf.reallocIfNeeded(type.length); > buf.setBytes(0, type); > out.buffer = buf; > out.start = 0; > out.end = type.length; > } > {code} > By contrast, here is the generated code for one of the existing > {{TestTypeFns}} tests where things work: > {code:java} > public class ProjectorGen0 > extends ProjectorTemplate > { > DrillBuf work0; > NullableBigIntVector vv1; > VarCharVector vv7; > public ProjectorGen0() { > try { > __DRILL_INIT__(); > } catch (SchemaChangeException e) { > throw new UnsupportedOperationException(e); > } > } > public void doEval(int inIndex, int outIndex) > throws SchemaChangeException > { > { >.. > // start of eval portion of typeOf function. 
// > VarCharHolder out6 = new VarCharHolder(); > { > final VarCharHolder out = new VarCharHolder(); > FieldReader input = new NullableIntHolderReaderImpl(out5); > DrillBuf buf = work0; > UnionFunctions$GetType_eval: > { > String typeName = input.getTypeString(); > byte[] type = typeName.getBytes(); > buf = buf.reallocIfNeeded(type.length); > buf.setBytes(0, type); > out.buffer = buf; > out.start = 0; > out.end = type.length; > } > work0 = buf; > out6 .start = out.start; > out6 .end = out.end; > out6 .buffer = out.buffer; > } > // end of eval portion of typeOf function. // > {code} > Notice that the {{input}} variable is of type {{FieldReader}} as expected. > Queries that work: > {code:java} > String sql = "SELECT typeof(CAST(a AS " + castType + ")) FROM (VALUES > (1)) AS T(a)"; > sql = "SELECT typeof(CAST(a AS " + castType + ")) FROM > cp.`functions/null.json`"; >
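The failure above comes down to a plain Java typing rule: only readers declare {{getTypeString()}}, so generated code that binds {{input}} to a {{UnionHolder}} cannot compile, while the working path wraps a holder in a reader implementation. A stripped-down mock (the class names below are simplified stand-ins, not Drill's real vector classes) illustrates the distinction:

```java
// Simplified stand-ins for Drill's generated-code types (hypothetical names).
public class ReaderVsHolder {
  // Holders are plain value carriers: no type-introspection method.
  static class UnionHolder {
    int isSet;
    String valueTypeName; // stand-in for the union's current value type
  }

  // Readers expose metadata such as getTypeString().
  interface FieldReader {
    String getTypeString();
  }

  // The working codegen path wraps a holder in a reader implementation.
  static class UnionHolderReaderImpl implements FieldReader {
    private final UnionHolder holder;
    UnionHolderReaderImpl(UnionHolder holder) { this.holder = holder; }
    @Override
    public String getTypeString() {
      return holder.isSet == 1 ? holder.valueTypeName : "NULL";
    }
  }

  static String typeOf(UnionHolder h) {
    // UnionHolder input = h; input.getTypeString();  // would not compile:
    // "A method named getTypeString is not declared in any enclosing class"
    FieldReader input = new UnionHolderReaderImpl(h); // bind a reader instead
    return input.getTypeString();
  }

  public static void main(String[] args) {
    UnionHolder h = new UnionHolder();
    h.isSet = 1;
    h.valueTypeName = "VARCHAR";
    System.out.println(typeOf(h)); // prints VARCHAR
  }
}
```

The fix direction the broken codegen suggests is exactly this wrapping step: emit a reader over the union holder rather than assigning the holder to {{input}} directly.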
[jira] [Updated] (DRILL-7502) Incorrect/invalid codegen for typeof() with UNION
[ https://issues.apache.org/jira/browse/DRILL-7502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7502: --- Fix Version/s: 1.18.0 Affects Version/s: 1.17.0 > Incorrect/invalid codegen for typeof() with UNION > - > > Key: DRILL-7502 > URL: https://issues.apache.org/jira/browse/DRILL-7502 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.17.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Major > Fix For: 1.18.0 > > > The {{typeof()}} function is defined as follows: > {code:java} > @FunctionTemplate(names = {"typeOf"}, > scope = FunctionTemplate.FunctionScope.SIMPLE, > nulls = NullHandling.INTERNAL) > public static class GetType implements DrillSimpleFunc { > @Param > FieldReader input; > @Output > VarCharHolder out; > @Inject > DrillBuf buf; > @Override > public void setup() {} > @Override > public void eval() { > String typeName = input.getTypeString(); > byte[] type = typeName.getBytes(); > buf = buf.reallocIfNeeded(type.length); > buf.setBytes(0, type); > out.buffer = buf; > out.start = 0; > out.end = type.length; > } > } > {code} > Note that the {{input}} field is defined as {{FieldReader}} which has a > method called {{getTypeString()}}. As a result, the code works fine in all > existing tests in {{TestTypeFns}}. > I tried to add a function to use {{typeof()}} on a column of type {{UNION}}. 
> When I did, the query failed with a compile error in generated code: > {noformat} > SYSTEM ERROR: CompileException: Line 42, Column 43: > A method named "getTypeString" is not declared in any enclosing class nor > any supertype, nor through a static import > {noformat} > The stack trace shows the generated code; Note that the type of {{input}} > changes from a reader to a holder, causing code to be invalid: > {code:java} > public class ProjectorGen0 { > DrillBuf work0; > UnionVector vv1; > VarCharVector vv6; > DrillBuf work9; > VarCharVector vv11; > DrillBuf work14; > VarCharVector vv16; > public void doEval(int inIndex, int outIndex) > throws SchemaChangeException > { > { > UnionHolder out4 = new UnionHolder(); > { > out4 .isSet = vv1 .getAccessor().isSet((inIndex)); > if (out4 .isSet == 1) { > vv1 .getAccessor().get((inIndex), out4); > } > } > // start of eval portion of typeOf function. // > VarCharHolder out5 = new VarCharHolder(); > { > final VarCharHolder out = new VarCharHolder(); > UnionHolder input = out4; > DrillBuf buf = work0; > UnionFunctions$GetType_eval: > { > String typeName = input.getTypeString(); > byte[] type = typeName.getBytes(); > buf = buf.reallocIfNeeded(type.length); > buf.setBytes(0, type); > out.buffer = buf; > out.start = 0; > out.end = type.length; > } > {code} > By contrast, here is the generated code for one of the existing > {{TestTypeFns}} tests where things work: > {code:java} > public class ProjectorGen0 > extends ProjectorTemplate > { > DrillBuf work0; > NullableBigIntVector vv1; > VarCharVector vv7; > public ProjectorGen0() { > try { > __DRILL_INIT__(); > } catch (SchemaChangeException e) { > throw new UnsupportedOperationException(e); > } > } > public void doEval(int inIndex, int outIndex) > throws SchemaChangeException > { > { >.. > // start of eval portion of typeOf function. 
// > VarCharHolder out6 = new VarCharHolder(); > { > final VarCharHolder out = new VarCharHolder(); > FieldReader input = new NullableIntHolderReaderImpl(out5); > DrillBuf buf = work0; > UnionFunctions$GetType_eval: > { > String typeName = input.getTypeString(); > byte[] type = typeName.getBytes(); > buf = buf.reallocIfNeeded(type.length); > buf.setBytes(0, type); > out.buffer = buf; > out.start = 0; > out.end = type.length; > } > work0 = buf; > out6 .start = out.start; > out6 .end = out.end; > out6 .buffer = out.buffer; > } > // end of eval portion of typeOf function. // > {code} > Notice that the {{input}} variable is of type {{FieldReader}} as expected. > Queries that work: > {code:java} > String sql = "SELECT typeof(CAST(a AS " + castType + ")) FROM (VALUES > (1)) AS T(a)"; > sql =
[jira] [Created] (DRILL-7503) Refactor project operator
Paul Rogers created DRILL-7503: -- Summary: Refactor project operator Key: DRILL-7503 URL: https://issues.apache.org/jira/browse/DRILL-7503 Project: Apache Drill Issue Type: Improvement Reporter: Paul Rogers Assignee: Paul Rogers Work on another ticket revealed that the Project operator ("record batch") has grown quite complex. The setup phase lives in the operator as one huge function. The function combines the "logical" tasks of working out the projection expressions and types, the code gen for those expressions, and the physical setup of vectors. The refactoring breaks up the logic so that it is easier to focus on the specific bits of interest. -- This message was sent by Atlassian Jira (v8.3.4#803005)
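The split described above can be sketched abstractly. The class and method names below are illustrative only (not the actual Project operator API): each of the three concerns named in the ticket becomes its own testable step.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative-only sketch of splitting one monolithic setup() into phases.
public class ProjectSetupSketch {
  static class Projection {
    final String column;
    final String expr;
    Projection(String column, String expr) { this.column = column; this.expr = expr; }
  }

  // Phase 1: the "logical" task - work out the projection expressions.
  static List<Projection> resolveProjections(List<String> columns) {
    List<Projection> out = new ArrayList<>();
    for (String c : columns) {
      out.add(new Projection(c, "read(" + c + ")"));
    }
    return out;
  }

  // Phase 2: code generation for the resolved expressions.
  static String generateCode(List<Projection> projections) {
    StringBuilder sb = new StringBuilder();
    for (Projection p : projections) {
      sb.append(p.column).append(" = ").append(p.expr).append(";\n");
    }
    return sb.toString();
  }

  // Phase 3: physical setup - in the real operator, output vector creation.
  static int allocateOutputs(List<Projection> projections) {
    return projections.size(); // stand-in for per-column vector allocation
  }

  public static void main(String[] args) {
    List<Projection> p = resolveProjections(List.of("a", "b"));
    System.out.print(generateCode(p));
  }
}
```

With the phases separated this way, each can be exercised (and debugged) in isolation, which is the point of the refactoring.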
[jira] [Updated] (DRILL-7502) Incorrect/invalid codegen for typeof() with UNION
[ https://issues.apache.org/jira/browse/DRILL-7502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7502: --- Description: The {{typeof()}} function is defined as follows: {code:java} @FunctionTemplate(names = {"typeOf"}, scope = FunctionTemplate.FunctionScope.SIMPLE, nulls = NullHandling.INTERNAL) public static class GetType implements DrillSimpleFunc { @Param FieldReader input; @Output VarCharHolder out; @Inject DrillBuf buf; @Override public void setup() {} @Override public void eval() { String typeName = input.getTypeString(); byte[] type = typeName.getBytes(); buf = buf.reallocIfNeeded(type.length); buf.setBytes(0, type); out.buffer = buf; out.start = 0; out.end = type.length; } } {code} Note that the {{input}} field is defined as {{FieldReader}} which has a method called {{getTypeString()}}. As a result, the code works fine in all existing tests in {{TestTypeFns}}. I tried to add a function to use {{typeof()}} on a column of type {{UNION}}. When I did, the query failed with a compile error in generated code: {noformat} SYSTEM ERROR: CompileException: Line 42, Column 43: A method named "getTypeString" is not declared in any enclosing class nor any supertype, nor through a static import {noformat} The stack trace shows the generated code; Note that the type of {{input}} changes from a reader to a holder, causing code to be invalid: {code:java} public class ProjectorGen0 { DrillBuf work0; UnionVector vv1; VarCharVector vv6; DrillBuf work9; VarCharVector vv11; DrillBuf work14; VarCharVector vv16; public void doEval(int inIndex, int outIndex) throws SchemaChangeException { { UnionHolder out4 = new UnionHolder(); { out4 .isSet = vv1 .getAccessor().isSet((inIndex)); if (out4 .isSet == 1) { vv1 .getAccessor().get((inIndex), out4); } } // start of eval portion of typeOf function. 
// VarCharHolder out5 = new VarCharHolder(); { final VarCharHolder out = new VarCharHolder(); UnionHolder input = out4; DrillBuf buf = work0; UnionFunctions$GetType_eval: { String typeName = input.getTypeString(); byte[] type = typeName.getBytes(); buf = buf.reallocIfNeeded(type.length); buf.setBytes(0, type); out.buffer = buf; out.start = 0; out.end = type.length; } {code} By contrast, here is the generated code for one of the existing {{TestTypeFns}} tests where things work: {code:java} public class ProjectorGen0 extends ProjectorTemplate { DrillBuf work0; NullableBigIntVector vv1; VarCharVector vv7; public ProjectorGen0() { try { __DRILL_INIT__(); } catch (SchemaChangeException e) { throw new UnsupportedOperationException(e); } } public void doEval(int inIndex, int outIndex) throws SchemaChangeException { { .. // start of eval portion of typeOf function. // VarCharHolder out6 = new VarCharHolder(); { final VarCharHolder out = new VarCharHolder(); FieldReader input = new NullableIntHolderReaderImpl(out5); DrillBuf buf = work0; UnionFunctions$GetType_eval: { String typeName = input.getTypeString(); byte[] type = typeName.getBytes(); buf = buf.reallocIfNeeded(type.length); buf.setBytes(0, type); out.buffer = buf; out.start = 0; out.end = type.length; } work0 = buf; out6 .start = out.start; out6 .end = out.end; out6 .buffer = out.buffer; } // end of eval portion of typeOf function. // {code} Notice that the {{input}} variable is of type {{FieldReader}} as expected. Queries that work: {code:java} String sql = "SELECT typeof(CAST(a AS " + castType + ")) FROM (VALUES (1)) AS T(a)"; sql = "SELECT typeof(CAST(a AS " + castType + ")) FROM cp.`functions/null.json`"; String sql = "SELECT typeof(" + expr + ") FROM (VALUES (" + value + ")) AS T(a)"; {code} Query that fails: {code:java} String sql ="SELECT typeof(a) AS t, modeof(a) as m, drilltypeof(a) AS dt\n" + "FROM cp.`jsoninput/union/c.json`"; {code} The queries that work all include either a CAST or constant values. 
The query that fails, by contrast, reads its data from a file. Also, the queries that work use scalar types, while the query that fails uses the UNION type. was: The {{typeof()}} function is defined as follows: {code:java} @FunctionTemplate(names = {"typeOf"}, scope =
[jira] [Updated] (DRILL-7502) Incorrect/invalid codegen for typeof() with UNION
[ https://issues.apache.org/jira/browse/DRILL-7502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7502: --- Description: The {{typeof()}} function is defined as follows: {code:java} @FunctionTemplate(names = {"typeOf"}, scope = FunctionTemplate.FunctionScope.SIMPLE, nulls = NullHandling.INTERNAL) public static class GetType implements DrillSimpleFunc { @Param FieldReader input; @Output VarCharHolder out; @Inject DrillBuf buf; @Override public void setup() {} @Override public void eval() { String typeName = input.getTypeString(); byte[] type = typeName.getBytes(); buf = buf.reallocIfNeeded(type.length); buf.setBytes(0, type); out.buffer = buf; out.start = 0; out.end = type.length; } } {code} Note that the {{input}} field is defined as {{FieldReader}} which has a method called {{getTypeString()}}. As a result, the code works fine in all existing tests in {{TestTypeFns}}. I tried to add a function to use {{typeof()}} on a column of type {{UNION}}. When I did, the query failed with a compile error in generated code: {noformat} SYSTEM ERROR: CompileException: Line 42, Column 43: A method named "getTypeString" is not declared in any enclosing class nor any supertype, nor through a static import {noformat} The stack trace shows the generated code; Note that the type of {{input}} changes from a reader to a holder, causing code to be invalid: {code:java} public class ProjectorGen0 { DrillBuf work0; UnionVector vv1; VarCharVector vv6; DrillBuf work9; VarCharVector vv11; DrillBuf work14; VarCharVector vv16; public void doEval(int inIndex, int outIndex) throws SchemaChangeException { { UnionHolder out4 = new UnionHolder(); { out4 .isSet = vv1 .getAccessor().isSet((inIndex)); if (out4 .isSet == 1) { vv1 .getAccessor().get((inIndex), out4); } } // start of eval portion of typeOf function. 
// VarCharHolder out5 = new VarCharHolder(); { final VarCharHolder out = new VarCharHolder(); UnionHolder input = out4; DrillBuf buf = work0; UnionFunctions$GetType_eval: { String typeName = input.getTypeString(); byte[] type = typeName.getBytes(); buf = buf.reallocIfNeeded(type.length); buf.setBytes(0, type); out.buffer = buf; out.start = 0; out.end = type.length; } {code} By contrast, here is the generated code for one of the existing {{TestTypeFns}} tests where things work: {code:java} public class ProjectorGen0 extends ProjectorTemplate { DrillBuf work0; NullableBigIntVector vv1; VarCharVector vv7; public ProjectorGen0() { try { __DRILL_INIT__(); } catch (SchemaChangeException e) { throw new UnsupportedOperationException(e); } } public void doEval(int inIndex, int outIndex) throws SchemaChangeException { { .. // start of eval portion of typeOf function. // VarCharHolder out6 = new VarCharHolder(); { final VarCharHolder out = new VarCharHolder(); FieldReader input = new NullableIntHolderReaderImpl(out5); DrillBuf buf = work0; UnionFunctions$GetType_eval: { String typeName = input.getTypeString(); byte[] type = typeName.getBytes(); buf = buf.reallocIfNeeded(type.length); buf.setBytes(0, type); out.buffer = buf; out.start = 0; out.end = type.length; } work0 = buf; out6 .start = out.start; out6 .end = out.end; out6 .buffer = out.buffer; } // end of eval portion of typeOf function. // {code} Notice that the {{input}} variable is of type {{FieldReader}} as expected. 
was: The {{typeof()}} function is defined as follows: {code:java} @FunctionTemplate(names = {"typeOf"}, scope = FunctionTemplate.FunctionScope.SIMPLE, nulls = NullHandling.INTERNAL) public static class GetType implements DrillSimpleFunc { @Param FieldReader input; @Output VarCharHolder out; @Inject DrillBuf buf; @Override public void setup() {} @Override public void eval() { String typeName = input.getTypeString(); byte[] type = typeName.getBytes(); buf = buf.reallocIfNeeded(type.length); buf.setBytes(0, type); out.buffer = buf; out.start = 0; out.end = type.length; } } {code} Note that the {{input}} field is defined as {{FieldReader}} which has a method called {{getTypeString()}}. As a result, the code works
[jira] [Created] (DRILL-7502) Incorrect/invalid codegen for typeof() with UNION
Paul Rogers created DRILL-7502: -- Summary: Incorrect/invalid codegen for typeof() with UNION Key: DRILL-7502 URL: https://issues.apache.org/jira/browse/DRILL-7502 Project: Apache Drill Issue Type: Bug Reporter: Paul Rogers The {{typeof()}} function is defined as follows: {code:java} @FunctionTemplate(names = {"typeOf"}, scope = FunctionTemplate.FunctionScope.SIMPLE, nulls = NullHandling.INTERNAL) public static class GetType implements DrillSimpleFunc { @Param FieldReader input; @Output VarCharHolder out; @Inject DrillBuf buf; @Override public void setup() {} @Override public void eval() { String typeName = input.getTypeString(); byte[] type = typeName.getBytes(); buf = buf.reallocIfNeeded(type.length); buf.setBytes(0, type); out.buffer = buf; out.start = 0; out.end = type.length; } } {code} Note that the {{input}} field is defined as {{FieldReader}} which has a method called {{getTypeString()}}. As a result, the code works fine in all existing tests in {{TestTypeFns}}. I tried to add a function to use {{typeof()}} on a column of type {{UNION}}. When I did, the query failed with a compile error in generated code: {noformat} SYSTEM ERROR: CompileException: Line 42, Column 43: A method named "getTypeString" is not declared in any enclosing class nor any supertype, nor through a static import {noformat} The stack trace shows the generated code; Note that the type of {{input}} changes from a reader to a holder, causing code to be invalid: {code:java} public class ProjectorGen0 { DrillBuf work0; UnionVector vv1; VarCharVector vv6; DrillBuf work9; VarCharVector vv11; DrillBuf work14; VarCharVector vv16; public void doEval(int inIndex, int outIndex) throws SchemaChangeException { { UnionHolder out4 = new UnionHolder(); { out4 .isSet = vv1 .getAccessor().isSet((inIndex)); if (out4 .isSet == 1) { vv1 .getAccessor().get((inIndex), out4); } } // start of eval portion of typeOf function. 
// VarCharHolder out5 = new VarCharHolder(); { final VarCharHolder out = new VarCharHolder(); UnionHolder input = out4; DrillBuf buf = work0; UnionFunctions$GetType_eval: { String typeName = input.getTypeString(); byte[] type = typeName.getBytes(); buf = buf.reallocIfNeeded(type.length); buf.setBytes(0, type); out.buffer = buf; out.start = 0; out.end = type.length; } {code} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type
[ https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers reassigned DRILL-7499: -- Assignee: Paul Rogers > sqltypeof() function with an array returns "ARRAY", not type > > > Key: DRILL-7499 > URL: https://issues.apache.org/jira/browse/DRILL-7499 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.16.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > > The {{sqltypeof()}} function was introduced in Drill 1.14 to work around > limitations of the original {{typeof()}} function. The function is mentioned > in _Learning Apache Drill_, Chapter 8, page 152: > {noformat} > SELECT sqlTypeOf(columns) AS cols_type, > modeOf(columns) AS cols_mode > FROM `csv/cust.csv` LIMIT 1; > +-------------------+-----------+ > | cols_type         | cols_mode | > +-------------------+-----------+ > | CHARACTER VARYING | ARRAY     | > +-------------------+-----------+ > {noformat} > When the same query is run against the just-released Drill 1.17, we get the > *wrong* results: > {noformat} > +-----------+-----------+ > | cols_type | cols_mode | > +-----------+-----------+ > | ARRAY     | ARRAY     | > +-----------+-----------+ > {noformat} > The definition of {{sqlTypeOf()}} is that it should return the type portion > of the column's (type, mode) major type. Clearly, it is no longer doing so for > arrays. As a result, there is no function to obtain the data type for arrays. > The problem also shows up in the query from page 158: > {code:sql} > SELECT a, b, > sqlTypeOf(b) AS b_type, modeof(b) AS b_mode > FROM `gen/70kmissing.json` > WHERE mod(a, 7) = 1; > {code} > Expected (table from the book with Drill 1.14 results): > {noformat} > +---+------+---------+----------+ > | a | b    | b_type  | b_mode   | > +---+------+---------+----------+ > | 1 | null | INTEGER | NULLABLE | > +---+------+---------+----------+ > {noformat} > Actual Drill 1.17 results: > {noformat} > +---+------+--------+----------+ > | a | b    | b_type | b_mode   | > +---+------+--------+----------+ > | 1 | null | NULL   | NULLABLE | > +---+------+--------+----------+ > {noformat} > (Second line of table is omitted because something else changed, not relevant > to this ticket.) 
> The above might not actually be a bug if someone has changed the > type of missing columns from the old {{INT}} to a newer (untyped) {{NULL}}. > But an indirect test suggests that the column is still {{INT}} and the > function is wrong: > {code:sql} > SELECT a, b > FROM `gen/70kdouble.json` > WHERE b IS NOT NULL ORDER BY a; > {code} > Data: > {noformat} > {a: 1} > ... > {a: 6} > {a: 70001, b: 10.5} > {noformat} > Error: > {noformat} > Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External > Sort. Please enable Union type. > Previous schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` > (INT:OPTIONAL)]], selectionVector=NONE] > Incoming schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` > (FLOAT8:OPTIONAL)]], selectionVector=NONE] > {noformat} > Oddly, however, the query on page 160 works as expected: > {code:sql} > SELECT sqlTypeOf(a) AS a_type, modeOf(a) AS a_mode > FROM `json/all-null.json` LIMIT 1; > {code} > {noformat} > +---------+----------+ > | a_type  | a_mode   | > +---------+----------+ > | INTEGER | NULLABLE | > +---------+----------+ > {noformat} > Someone will have to do some investigating to understand the current > behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type
[ https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005161#comment-17005161 ] Paul Rogers commented on DRILL-7499: I believe the described behavior is an unintended artifact of this bit of code in {{Types.java}}: {code:java} public static String getSqlTypeName(final MajorType type) { if (type.getMode() == DataMode.REPEATED || type.getMinorType() == MinorType.LIST) { return "ARRAY"; } return getBaseSqlTypeName(type); } {code} Since we have {{modeOf()}} to report the mode ({{REPEATED}}), I will modify this function so that it does not return "ARRAY" for the {{REPEATED}} mode. > sqltypeof() function with an array returns "ARRAY", not type > > > Key: DRILL-7499 > URL: https://issues.apache.org/jira/browse/DRILL-7499 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.16.0 >Reporter: Paul Rogers >Priority: Minor > > The {{sqltypeof()}} function was introduced in Drill 1.14 to work around > limitations of the original {{typeof()}} function. The function is mentioned > in _Learning Apache Drill_, Chapter 8, page 152: > {noformat} > SELECT sqlTypeOf(columns) AS cols_type, > modeOf(columns) AS cols_mode > FROM `csv/cust.csv` LIMIT 1; > +-------------------+-----------+ > | cols_type         | cols_mode | > +-------------------+-----------+ > | CHARACTER VARYING | ARRAY     | > +-------------------+-----------+ > {noformat} > When the same query is run against the just-released Drill 1.17, we get the > *wrong* results: > {noformat} > +-----------+-----------+ > | cols_type | cols_mode | > +-----------+-----------+ > | ARRAY     | ARRAY     | > +-----------+-----------+ > {noformat} > The definition of {{sqlTypeOf()}} is that it should return the type portion > of the column's (type, mode) major type. Clearly, it is no longer doing so for > arrays. As a result, there is no function to obtain the data type for arrays. 
> The problem also shows up in the query from page 158: > {code:sql} > SELECT a, b, > sqlTypeOf(b) AS b_type, modeof(b) AS b_mode > FROM `gen/70kmissing.json` > WHERE mod(a, 7) = 1; > {code} > Expected (table from the book with Drill 1.14 results): > {noformat} > +---+------+---------+----------+ > | a | b    | b_type  | b_mode   | > +---+------+---------+----------+ > | 1 | null | INTEGER | NULLABLE | > +---+------+---------+----------+ > {noformat} > Actual Drill 1.17 results: > {noformat} > +---+------+--------+----------+ > | a | b    | b_type | b_mode   | > +---+------+--------+----------+ > | 1 | null | NULL   | NULLABLE | > +---+------+--------+----------+ > {noformat} > (Second line of table is omitted because something else changed, not relevant > to this ticket.) > The above might not actually be a bug if someone has changed the > type of missing columns from the old {{INT}} to a newer (untyped) {{NULL}}. > But an indirect test suggests that the column is still {{INT}} and the > function is wrong: > {code:sql} > SELECT a, b > FROM `gen/70kdouble.json` > WHERE b IS NOT NULL ORDER BY a; > {code} > Data: > {noformat} > {a: 1} > ... > {a: 6} > {a: 70001, b: 10.5} > {noformat} > Error: > {noformat} > Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External > Sort. Please enable Union type. > Previous schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` > (INT:OPTIONAL)]], selectionVector=NONE] > Incoming schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` > (FLOAT8:OPTIONAL)]], selectionVector=NONE] > {noformat} > Oddly, however, the query on page 160 works as expected: > {code:sql} > SELECT sqlTypeOf(a) AS a_type, modeOf(a) AS a_mode > FROM `json/all-null.json` LIMIT 1; > {code} > {noformat} > +---------+----------+ > | a_type  | a_mode   | > +---------+----------+ > | INTEGER | NULLABLE | > +---------+----------+ > {noformat} > Someone will have to do some investigating to understand the current > behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005)
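The fix proposed in the comment above can be sketched with a string-based stand-in (the real {{getSqlTypeName()}} takes a {{MajorType}}): only {{LIST}} keeps the "ARRAY" short-circuit, while {{REPEATED}} columns fall through to the base SQL type name, leaving mode reporting to {{modeOf()}}.

```java
// String-based mock of the proposed Types.getSqlTypeName() change;
// the real function operates on Drill's MajorType, not strings.
public class SqlTypeNameSketch {
  static String getSqlTypeName(String minorType, String mode) {
    if ("LIST".equals(minorType)) {
      return "ARRAY"; // a LIST has no scalar base type to report
    }
    // Proposed: a REPEATED mode no longer short-circuits to "ARRAY";
    // callers use modeOf() to learn that the column is an array.
    return baseSqlTypeName(minorType);
  }

  static String baseSqlTypeName(String minorType) {
    switch (minorType) {
      case "VARCHAR": return "CHARACTER VARYING";
      case "INT":     return "INTEGER";
      default:        return minorType;
    }
  }

  public static void main(String[] args) {
    // A repeated VARCHAR column now reports its data type, as on page 152.
    System.out.println(getSqlTypeName("VARCHAR", "REPEATED"));
  }
}
```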
[jira] [Resolved] (DRILL-7501) Drill 1.17 sqlTypeOf for a Map now reports STRUCT
[ https://issues.apache.org/jira/browse/DRILL-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers resolved DRILL-7501. Resolution: Won't Fix As explained on the dev list, the return value in this case was changed to match the preferred name {{STRUCT}} for what Drill has historically called a {{MAP}}. The name {{STRUCT}} is consistent with Hive. > Drill 1.17 sqlTypeOf for a Map now reports STRUCT > - > > Key: DRILL-7501 > URL: https://issues.apache.org/jira/browse/DRILL-7501 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.16.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > > Drill 1.14 introduced the {{sqlTypeOf()}} function to work around limits of > the {{typeof()}} function. {{sqlTypeOf()}} should return the name of the SQL > type for a column, using the type name that Drill uses. > A query from page 163 of _Learning Apache Drill_: > {code:sql} > SELECT sqlTypeOf(`name`) AS name_type FROM `json/nested.json`; > {code} > Drill 1.14 results (correct): > {noformat} > +-----------+ > | name_type | > +-----------+ > | MAP       | > +-----------+ > {noformat} > Drill 1.17 results (incorrect): > {noformat} > +-----------+ > | name_type | > +-----------+ > | STRUCT    | > +-----------+ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (DRILL-5189) There's no documentation for the typeof() function
[ https://issues.apache.org/jira/browse/DRILL-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers resolved DRILL-5189. Resolution: Duplicate > There's no documentation for the typeof() function > -- > > Key: DRILL-5189 > URL: https://issues.apache.org/jira/browse/DRILL-5189 > Project: Apache Drill > Issue Type: Bug > Components: Documentation >Reporter: Chris Westin >Assignee: Bridget Bevens >Priority: Major > > I looked through the documentation at https://drill.apache.org/docs/ under > SQL Reference > SQL Functions > ... and could not find any reference to > typeof(). Google searches only turned up a reference to DRILL-4204. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (DRILL-7501) Drill 1.17 sqlTypeOf for a Map now reports STRUCT
[ https://issues.apache.org/jira/browse/DRILL-7501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers reassigned DRILL-7501: -- Assignee: Paul Rogers > Drill 1.17 sqlTypeOf for a Map now reports STRUCT > - > > Key: DRILL-7501 > URL: https://issues.apache.org/jira/browse/DRILL-7501 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.16.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > > Drill 1.14 introduced the {{sqlTypeOf()}} function to work around limits of > the {{typeof()}} function. {{sqlTypeOf()}} should return the name of the SQL > type for a column, using the type name that Drill uses. > A query from page 163 of _Learning Apache Drill_: > {code:sql} > SELECT sqlTypeOf(`name`) AS name_type FROM `json/nested.json`; > {code} > Drill 1.14 results (correct): > {noformat} > +-----------+ > | name_type | > +-----------+ > | MAP       | > +-----------+ > {noformat} > Drill 1.17 results (incorrect): > {noformat} > +-----------+ > | name_type | > +-----------+ > | STRUCT    | > +-----------+ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-6377) typeof() does not return DECIMAL scale, precision
[ https://issues.apache.org/jira/browse/DRILL-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005155#comment-17005155 ] Paul Rogers commented on DRILL-6377: See DRILL-6362. The primary purpose of {{typeof()}} is to allow a query to determine the type of a value in a {{UNION}} column. (It has also been useful to debug queries for non-{{UNION}} columns.) Since adding widths would interfere with the purpose of this function, we should continue to omit them. As [~arina] has shown, a user who wants that information can use the {{sqlTypeOf()}} function. > typeof() does not return DECIMAL scale, precision > - > > Key: DRILL-6377 > URL: https://issues.apache.org/jira/browse/DRILL-6377 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Priority: Minor > Fix For: 1.16.0 > > > The {{typeof()}} function returns the type of a column: > {noformat} > SELECT typeof(CAST(a AS DOUBLE)) FROM (VALUES (1)) AS T(a); > +--------+ > | EXPR$0 | > +--------+ > | FLOAT8 | > +--------+ > {noformat} > In Drill, the {{DECIMAL}} type is parameterized with scale and precision. > However, {{typeof()}} does not return this information: > {noformat} > ALTER SESSION SET `planner.enable_decimal_data_type` = true; > SELECT typeof(CAST(a AS DECIMAL)) FROM (VALUES (1)) AS T(a); > +-----------------+ > | EXPR$0          | > +-----------------+ > | DECIMAL38SPARSE | > +-----------------+ > SELECT typeof(CAST(a AS DECIMAL(6, 3))) FROM (VALUES (1)) AS T(a); > +----------+ > | EXPR$0   | > +----------+ > | DECIMAL9 | > +----------+ > {noformat} > Expected something of the form {{DECIMAL(<precision>, <scale>)}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
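Producing the expected form is mostly a matter of formatting the type's parameters. A minimal sketch (a hypothetical helper, not Drill's actual {{Types}} API), assuming precision and scale are available from the column's major type:

```java
// Hypothetical helper: format a parameterized DECIMAL type name,
// e.g. precision 6, scale 3 -> "DECIMAL(6, 3)".
public class DecimalTypeName {
  static String decimalTypeName(int precision, int scale) {
    return "DECIMAL(" + precision + ", " + scale + ")";
  }

  public static void main(String[] args) {
    System.out.println(decimalTypeName(6, 3)); // prints DECIMAL(6, 3)
  }
}
```

This is the output shape {{sqlTypeOf()}} ended up providing; the comment above argues {{typeof()}} itself should keep omitting the widths.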
[jira] [Assigned] (DRILL-6377) typeof() does not return DECIMAL scale, precision
[ https://issues.apache.org/jira/browse/DRILL-6377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers reassigned DRILL-6377: -- Assignee: Paul Rogers > typeof() does not return DECIMAL scale, precision > - > > Key: DRILL-6377 > URL: https://issues.apache.org/jira/browse/DRILL-6377 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Minor > Fix For: 1.16.0 > > > The {{typeof()}} function returns the type of a column: > {noformat} > SELECT typeof(CAST(a AS DOUBLE)) FROM (VALUES (1)) AS T(a); > +--------+ > | EXPR$0 | > +--------+ > | FLOAT8 | > +--------+ > {noformat} > In Drill, the {{DECIMAL}} type is parameterized with scale and precision. > However, {{typeof()}} does not return this information: > {noformat} > ALTER SESSION SET `planner.enable_decimal_data_type` = true; > SELECT typeof(CAST(a AS DECIMAL)) FROM (VALUES (1)) AS T(a); > +-----------------+ > | EXPR$0          | > +-----------------+ > | DECIMAL38SPARSE | > +-----------------+ > SELECT typeof(CAST(a AS DECIMAL(6, 3))) FROM (VALUES (1)) AS T(a); > +----------+ > | EXPR$0   | > +----------+ > | DECIMAL9 | > +----------+ > {noformat} > Expected something of the form {{DECIMAL(<precision>, <scale>)}}. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (DRILL-6360) Document the typeof() function
[ https://issues.apache.org/jira/browse/DRILL-6360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005125#comment-17005125 ] Paul Rogers edited comment on DRILL-6360 at 12/30/19 6:34 AM: -- The information should go [here|https://drill.apache.org/docs/data-type-functions/]. {{typeof()}} returns {{"NULL"}} if the value of a column is NULL, else it returns the internal Drill type name for a column as given by {{drillTypeOf()}}. If we adopt the changes proposed in DRILL-6362, then the documentation becomes: Returns the type of the column using the internal Drill type name. If the column is the experimental {{UNION}} type, then returns the type of the specific column value, or "NULL" if that column is null. To determine if a column is a UNION, use the {{drillTypeOf()}} function. Note that in Drill 1.17 and before, the {{typeof()}} function returned "NULL" if the column value is null. From Drill 1.18 and later, this is only true if the column is of type {{UNION}}. was (Author: paul.rogers): The information should go [here|https://drill.apache.org/docs/data-type-functions/]. {{typeof()}} returns {{"NULL"}} if the value of a column is NULL, else it returns the internal Drill type name for a column as given by {{drillTypeOf()}}. > Document the typeof() function > -- > > Key: DRILL-6360 > URL: https://issues.apache.org/jira/browse/DRILL-6360 > Project: Apache Drill > Issue Type: Task > Components: Documentation >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Bridget Bevens >Priority: Minor > Labels: doc-impacting > > Drill has a {{typeof()}} function that returns the data type (but not mode) > of a column. It was discussed on the dev list recently. However, a search of > the Drill web site, and a scan by hand, failed to turn up documentation about > the function. > As a general suggestion, would be great to have an alphabetical list of all > functions so we don't have to hunt all over the site to find which functions > are available. 
-- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (DRILL-6362) typeof() lies about types
[ https://issues.apache.org/jira/browse/DRILL-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005138#comment-17005138 ] Paul Rogers edited comment on DRILL-6362 at 12/30/19 6:22 AM: -- It is likely that this function was meant to mimic the [{{typeof()}}|https://www.w3resource.com/sqlite/core-functions-typeof.php] function of SQLite, which also returns "NULL" if the actual value is NULL. Snowflake has the concept of a "Variant" (like Drill's Union type). In this case [{{typeof()}}|https://docs.snowflake.net/manuals/sql-reference/functions/typeof.html] returns the type of the value. The documentation shows an example for a null value, for which {{typeof()}} returns "NULL". Given this, the Drill function should probably return the value type for a UNION type. At present, {{typeof()}} will return "UNION", which is not consistent with the Snowflake variant pattern. Postgres has the [{{pg_typeof()}}|https://www.postgresql.org/docs/9.3/functions-info.html] function, which is a bit convoluted, but the examples show that it effectively returns the type name. Given all this, the proposal is to modify {{typeof()}} as follows: * For a {{UNION}} type, return the actual type of the specific column value. * For a {{UNION}} type (only), return "NULL" if the UNION itself is NULL. (Such a column really does have no type.) * For all other types, return the {{MinorType}} name. To be clear, the two changes are: * Modify handling of {{UNION}} columns. * Modify handling of columns with values set to {{NULL}}. These changes seem valid because: * They make the Drill function closer to the operation of other SQL engines. * Other than for debugging, the most likely use of {{typeof()}} is to work with UNIONs, a task for which the function currently fails. was (Author: paul.rogers): Closing this because we did create the new functions and we've elected to leave this function alone for now. 
> typeof() lies about types > - > > Key: DRILL-6362 > URL: https://issues.apache.org/jira/browse/DRILL-6362 > Project: Apache Drill > Issue Type: Improvement >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Major > > Drill provides a {{typeof()}} function that returns the type of a column. > But, it seems to make up types. Consider the following input file: > {noformat} > {a: true} > {a: false} > {a: null} > {noformat} > Consider the following two queries: > {noformat}
> SELECT a FROM `json/boolean.json`;
> +-------+
> | a     |
> +-------+
> | true  |
> | false |
> | null  |
> +-------+
>
> SELECT typeof(a) FROM `json/boolean.json`;
> +--------+
> | EXPR$0 |
> +--------+
> | BIT    |
> | BIT    |
> | NULL   |
> +--------+
> {noformat} > Notice that the values are reported as BIT. But, I believe the actual type is > UInt1 (the bit vector is, I believe, deprecated.) Then, the function reports > NULL instead of the actual type for the null value. > Since Drill has an {{isnull()}} function, there is no reason for {{typeof()}} > to muddle the type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Reopened] (DRILL-6362) typeof() lies about types
[ https://issues.apache.org/jira/browse/DRILL-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers reopened DRILL-6362: -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (DRILL-6362) typeof() lies about types
[ https://issues.apache.org/jira/browse/DRILL-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers resolved DRILL-6362. Resolution: Won't Fix -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-6362) typeof() lies about types
[ https://issues.apache.org/jira/browse/DRILL-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005138#comment-17005138 ] Paul Rogers commented on DRILL-6362: Closing this because we created the new functions and we've elected to leave this function alone for now. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (DRILL-6362) typeof() lies about types
[ https://issues.apache.org/jira/browse/DRILL-6362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers reassigned DRILL-6362: -- Assignee: Paul Rogers -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-6360) Document the typeof() function
[ https://issues.apache.org/jira/browse/DRILL-6360?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005125#comment-17005125 ] Paul Rogers commented on DRILL-6360: The information should go [here|https://drill.apache.org/docs/data-type-functions/]. {{typeof()}} returns {{"NULL"}} if the value of a column is NULL; otherwise it returns the internal Drill type name for the column, as given by {{drillTypeOf()}}. > Document the typeof() function > -- > > Key: DRILL-6360 > URL: https://issues.apache.org/jira/browse/DRILL-6360 > Project: Apache Drill > Issue Type: Task > Components: Documentation >Affects Versions: 1.13.0 >Reporter: Paul Rogers >Assignee: Bridget Bevens >Priority: Minor > Labels: doc-impacting > > Drill has a {{typeof()}} function that returns the data type (but not the mode) > of a column. It was discussed on the dev list recently. However, a search of > the Drill web site, and a scan by hand, failed to turn up documentation about > the function. > As a general suggestion, it would be great to have an alphabetical list of all > functions so we don't have to hunt all over the site to find which functions > are available. -- This message was sent by Atlassian Jira (v8.3.4#803005)
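A minimal usage sketch for that documentation page (the data file here is hypothetical):

{code:sql}
-- Assume `json/example.json` contains: {a: 1} {a: null}
SELECT a, typeof(a) AS a_type FROM `json/example.json`;
{code}

For the non-null row, {{a_type}} is the internal Drill type name (JSON integers read as {{BIGINT}}); for the null row, {{typeof()}} returns {{"NULL"}}.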
[jira] [Commented] (DRILL-7352) Introduce new checkstyle rules to make code style more consistent
[ https://issues.apache.org/jira/browse/DRILL-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005124#comment-17005124 ] Paul Rogers commented on DRILL-7352: In prior years, we followed the Sun coding conventions, as suggested [here|http://drill.apache.org/docs/apache-drill-contribution-guidelines/] and documented [here|https://www.oracle.com/technetwork/java/codeconvtoc-136057.html]. Obviously, the Sun conventions are 20 years old and do not address newer Java features or conventions. It is a good idea to update Drill's standards. Google's standards seem fine. The cautious step would be to keep the original standards, adopting the Google standards only when they don't conflict (much) with existing code. Then, let's be sure to document the standards on the web site. > Introduce new checkstyle rules to make code style more consistent > - > > Key: DRILL-7352 > URL: https://issues.apache.org/jira/browse/DRILL-7352 > Project: Apache Drill > Issue Type: Task >Reporter: Vova Vysotskyi >Priority: Major > Attachments: screenshot-1.png > > > Source - https://checkstyle.sourceforge.io/checks.html > List of rules to be enabled: > * [LeftCurly|https://checkstyle.sourceforge.io/config_blocks.html#LeftCurly] > - force placement of a left curly brace at the end of the line. 
> * > [RightCurly|https://checkstyle.sourceforge.io/config_blocks.html#RightCurly] > - force placement of a right curly brace > * > [NewlineAtEndOfFile|https://checkstyle.sourceforge.io/config_misc.html#NewlineAtEndOfFile] > * > [UnnecessaryParentheses|https://checkstyle.sourceforge.io/config_coding.html#UnnecessaryParentheses] > * > [MethodParamPad|https://checkstyle.sourceforge.io/config_whitespace.html#MethodParamPad] > * [InnerTypeLast > |https://checkstyle.sourceforge.io/config_design.html#InnerTypeLast] > * > [MissingOverride|https://checkstyle.sourceforge.io/config_annotation.html#MissingOverride] > * > [InvalidJavadocPosition|https://checkstyle.sourceforge.io/config_javadoc.html#InvalidJavadocPosition] > * > [ArrayTypeStyle|https://checkstyle.sourceforge.io/config_misc.html#ArrayTypeStyle] > * [UpperEll|https://checkstyle.sourceforge.io/config_misc.html#UpperEll] > and others -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7352) Introduce new checkstyle rules to make code style more consistent
[ https://issues.apache.org/jira/browse/DRILL-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005122#comment-17005122 ] Paul Rogers commented on DRILL-7352: Once we have candidate rules, I'll try implementing those rules in Eclipse to verify that they work. Then I can provide a new Eclipse setup file. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7352) Introduce new checkstyle rules to make code style more consistent
[ https://issues.apache.org/jira/browse/DRILL-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17005121#comment-17005121 ] Paul Rogers commented on DRILL-7352: Eclipse can be made to organize imports similarly, but generally sorts the imports so that, say, {{com}} comes before {{org}}, etc. Does IntelliJ do this? We need a defined order, else when an IDE "organizes imports", the order will be unstable. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7501) Drill 1.17 sqlTypeOf for a Map now reports STRUCT
Paul Rogers created DRILL-7501: -- Summary: Drill 1.17 sqlTypeOf for a Map now reports STRUCT Key: DRILL-7501 URL: https://issues.apache.org/jira/browse/DRILL-7501 Project: Apache Drill Issue Type: Bug Reporter: Paul Rogers Drill 1.14 introduced the {{sqlTypeOf()}} function to work around limits of the {{typeof()}} function. {{sqlTypeOf()}} should return the name of the SQL type for a column, using the type name that Drill uses. A query from page 163 of _Learning Apache Drill_: {code:sql} SELECT sqlTypeOf(`name`) AS name_type FROM `json/nested.json`; {code} Drill 1.14 results (correct): {noformat}
+-----------+
| name_type |
+-----------+
| MAP       |
+-----------+
{noformat} Drill 1.17 results (incorrect): {noformat}
+-----------+
| name_type |
+-----------+
| STRUCT    |
+-----------+
{noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7500) CTAS to JSON omits the final newline
Paul Rogers created DRILL-7500: -- Summary: CTAS to JSON omits the final newline Key: DRILL-7500 URL: https://issues.apache.org/jira/browse/DRILL-7500 Project: Apache Drill Issue Type: Bug Reporter: Paul Rogers Try the query from page 160 of _Learning Apache Drill_: {code:sql} ALTER SESSION SET `store.format` = 'json'; CREATE TABLE `out/json-null` AS SELECT * FROM `json/null2.json`; {code} Then, {{cat}} the resulting file: {noformat} cat out/json-null/0_0_0.json { "custId" : 123, "name" : "Fred", "balance" : 123.45 } { "custId" : 125, "name" : "Barney" }(base) paul@paul-linux:~/eclipse-workspace/drillbook/data$ {noformat} Notice that the file is missing a final newline, and so the shell prompt is appended to the last closing bracket. Expected the line to be terminated with a newline. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type
[ https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7499: --- Issue Type: Bug (was: Improvement) > sqltypeof() function with an array returns "ARRAY", not type > > > Key: DRILL-7499 > URL: https://issues.apache.org/jira/browse/DRILL-7499 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.17.0 >Reporter: Paul Rogers >Priority: Minor > Labels: regresion > > The {{sqltypeof()}} function was introduced in Drill 1.14 to work around > limitations of the original {{typeof()}} function. The function is mentioned > in _Learning Apache Drill_, Chapter 8, page 152: > {noformat}
> SELECT sqlTypeOf(columns) AS cols_type,
>        modeOf(columns) AS cols_mode
> FROM `csv/cust.csv` LIMIT 1;
> +-------------------+-----------+
> | cols_type         | cols_mode |
> +-------------------+-----------+
> | CHARACTER VARYING | ARRAY     |
> +-------------------+-----------+
> {noformat} > When the same query is run against the just-released Drill 1.17, we get the > *wrong* results: > {noformat}
> +-----------+-----------+
> | cols_type | cols_mode |
> +-----------+-----------+
> | ARRAY     | ARRAY     |
> +-----------+-----------+
> {noformat} > The definition of {{sqlTypeOf()}} is that it should return the type portion > of the column's (type, mode) major type. Clearly, it is no longer doing so for > arrays. As a result, there is no function to obtain the data type for arrays. > The problem also shows up in the query from page 158: > {code:sql} > SELECT a, b, >sqlTypeOf(b) AS b_type, modeof(b) AS b_mode > FROM `gen/70kmissing.json` > WHERE mod(a, 7) = 1; > {code} > Expected (table from the book with Drill 1.14 results): > {noformat}
> +---+------+---------+----------+
> | a | b    | b_type  | b_mode   |
> +---+------+---------+----------+
> | 1 | null | INTEGER | NULLABLE |
> +---+------+---------+----------+
> {noformat} > Actual Drill 1.17 results: > {noformat}
> +---+------+--------+----------+
> | a | b    | b_type | b_mode   |
> +---+------+--------+----------+
> | 1 | null | NULL   | NULLABLE |
> +---+------+--------+----------+
> {noformat} > (The second line of the table is omitted because something else changed, not relevant > to this ticket.) 
> The above might not actually be a bug if someone has changed the > type of missing columns from the old {{INT}} to a newer (untyped) {{NULL}}. > But an indirect test suggests that the column is still {{INT}} and the > function is wrong: > {code:sql} > SELECT a, b > FROM `gen/70kdouble.json` > WHERE b IS NOT NULL ORDER BY a; > {code} > Data: > {noformat} > {a: 1} > ... > {a: 6} > {a: 70001, b: 10.5} > {noformat} > Error: > {noformat} > Error: UNSUPPORTED_OPERATION ERROR: Schema changes not supported in External > Sort. Please enable Union type. > Previous schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` > (INT:OPTIONAL)]], selectionVector=NONE] > Incoming schema BatchSchema [fields=[[`a` (BIGINT:OPTIONAL)], [`b` > (FLOAT8:OPTIONAL)]], selectionVector=NONE] > {noformat} > Oddly, however, the query on page 160 works as expected: > {code:sql} > SELECT sqlTypeOf(a) AS a_type, modeOf(a) AS a_mode > FROM `json/all-null.json` LIMIT 1; > {code} > {noformat}
> +---------+----------+
> | a_type  | a_mode   |
> +---------+----------+
> | INTEGER | NULLABLE |
> +---------+----------+
> {noformat} > Someone will have to do some investigating to understand the current > behaviour. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type
[ https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7499: --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type
[ https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7499: --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type
[ https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7499: --- -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type
[ https://issues.apache.org/jira/browse/DRILL-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-7499:
-------------------------------
[jira] [Created] (DRILL-7499) sqltypeof() function with an array returns "ARRAY", not type
Paul Rogers created DRILL-7499:
----------------------------------

             Summary: sqltypeof() function with an array returns "ARRAY", not type
                 Key: DRILL-7499
                 URL: https://issues.apache.org/jira/browse/DRILL-7499
             Project: Apache Drill
          Issue Type: Improvement
            Reporter: Paul Rogers

The {{sqltypeof()}} function was introduced in Drill 1.14 to work around limitations of the original {{typeof()}} function. The function is mentioned in _Learning Apache Drill_, Chapter 8, page 152:
{noformat}
SELECT sqlTypeOf(columns) AS cols_type,
       modeOf(columns) AS cols_mode
FROM `csv/cust.csv` LIMIT 1;
+-------------------+-----------+
| cols_type         | cols_mode |
+-------------------+-----------+
| CHARACTER VARYING | ARRAY     |
+-------------------+-----------+
{noformat}
When the same query is run against the just-released Drill 1.17, we get the *wrong* results:
{noformat}
+-----------+-----------+
| cols_type | cols_mode |
+-----------+-----------+
| ARRAY     | ARRAY     |
+-----------+-----------+
{noformat}
The definition of {{sqlTypeOf()}} is that it should return the type portion of the column's (type, mode) major type. Clearly, it is no longer doing so for arrays. As a result, there is no function to obtain the data type for arrays.
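The intended contract can be sketched with stand-in types. Note that {{MajorType}}, {{DataMode}}, and the name mapping below are simplified assumptions for illustration, not Drill's actual classes: the point is only that {{sqlTypeOf()}} should report the data type while {{modeOf()}} alone reports cardinality.

```java
// Simplified stand-ins for Drill's (type, mode) major-type pair.
enum DataMode { REQUIRED, OPTIONAL, REPEATED }

final class MajorType {
  final String minorType;   // e.g. "VARCHAR", "INT"
  final DataMode mode;
  MajorType(String minorType, DataMode mode) {
    this.minorType = minorType;
    this.mode = mode;
  }
}

final class TypeFunctions {
  // sqlTypeOf(): return the SQL name of the data type, independent of mode.
  static String sqlTypeOf(MajorType t) {
    switch (t.minorType) {
      case "VARCHAR": return "CHARACTER VARYING";
      case "INT":     return "INTEGER";
      default:        return t.minorType;
    }
  }

  // modeOf(): mode is reported separately; REPEATED renders as ARRAY.
  static String modeOf(MajorType t) {
    switch (t.mode) {
      case REPEATED: return "ARRAY";
      case OPTIONAL: return "NULLABLE";
      default:       return "NOT NULL";
    }
  }
}
```

Under this contract, a repeated VARCHAR column yields ("CHARACTER VARYING", "ARRAY"), matching the book's Drill 1.14 output; the 1.17 regression is equivalent to {{sqlTypeOf()}} returning the mode instead of the type.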
[jira] [Created] (DRILL-7498) Allow the storage plugin editor window to be resizable
Paul Rogers created DRILL-7498:
----------------------------------

             Summary: Allow the storage plugin editor window to be resizable
                 Key: DRILL-7498
                 URL: https://issues.apache.org/jira/browse/DRILL-7498
             Project: Apache Drill
          Issue Type: Improvement
            Reporter: Paul Rogers

Open the Drill Web Console and click on the Storage tab. Pick a storage plugin and click Update. The JSON appears in a nicely formatted editor. On a typical-sized monitor, the edit box takes up only half the screen vertically. Since it really helps to see more of the JSON than this small window shows, it would be handy if the edit box offered a resizer, such as the one this very Jira edit box offers.
[jira] [Updated] (DRILL-7487) Retire unused OUT_OF_MEMORY iterator status
[ https://issues.apache.org/jira/browse/DRILL-7487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-7487:
-------------------------------
    Description: 

Drill has long supported the {{OUT_OF_MEMORY}} iterator status. The idea is that an operator can realize it has encountered memory pressure and ask its downstream operator to free up some memory.

However, an inspection of the code shows that the status is actually sent in only one place ({{UnorderedReceiverBatch}}), and then only in response to the operator hitting its allocator limit (which no other batch can do anything about). If an operator did choose to try to use this status, there are two key problems:
# The operator must be able to suspend itself at any point that it might need memory. For example, an operator that allocates a dozen vectors must be able to stop on, say, the 9th vector, then resume at that point on the subsequent call to {{next()}}. The complexity of the state machine needed to do this is very high.
# The *downstream* operators (which may not yet have seen rows) are the least likely to be able to release memory. It is the *upstream* operators (such as spillable operators) that might be able to spill some of the rows they are holding.

Presto suggests a nice alternative:
* An operator which encounters memory pressure asks the fragment executor for more memory.
* The fragment executor asks all *other* operators in that fragment to release memory if possible.

This allows a very simple memory recovery strategy:
{noformat}
try {
  // allocate something
} catch (OutOfMemoryException e) {
  context.requestMemory(this);
  // allocate something again, throwing OOM if it fails again
}
{noformat}
Note that, since the fragment runs on a single thread, the above is simple to implement. Each operator is either idle (not executing) or in a call to {{next()}} on a child operator. Both are stable times to consider invoking spilling.

Further, a sender could use this opportunity to write partially-filled batches to the network and release them rather than waiting for more data. The only case that can't be handled is, say, having an interior node flush a batch to its downstream operator within the same batch.

Proposed are two changes:
# Retire the {{OUT_OF_MEMORY}} status. Simply remove all references to it, since it is never sent.
# Create a stub {{requestMemory()}} method in the operator context that does nothing now, but could be expanded to perform the work suggested above.
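The Presto-style protocol described in the update can be sketched as follows. The interface and method names here are assumptions for illustration, not Drill's or Presto's actual APIs; the key property is that the fragment runs single-threaded, so asking the *other* operators to release memory happens at a stable point:

```java
import java.util.ArrayList;
import java.util.List;

// An operator that can shed memory on request (e.g. by spilling or by
// flushing a partially-filled batch to the network).
interface MemoryReleasable {
  long tryRelease(long bytesRequested);  // return the number of bytes freed
}

final class FragmentExecutor {
  private final List<MemoryReleasable> operators = new ArrayList<>();

  void register(MemoryReleasable op) { operators.add(op); }

  // Ask every *other* operator in this single-threaded fragment to release
  // memory. Safe because no operator is mid-allocation except the requester:
  // each is either idle or blocked in a call to next() on its child.
  long requestMemory(MemoryReleasable requester, long bytesNeeded) {
    long freed = 0;
    for (MemoryReleasable op : operators) {
      if (op != requester && freed < bytesNeeded) {
        freed += op.tryRelease(bytesNeeded - freed);
      }
    }
    return freed;
  }
}
```

A caller would wrap an allocation in try/catch as in the {noformat} fragment above, invoke {{requestMemory()}} on failure, and retry the allocation once before giving up.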
[jira] [Created] (DRILL-7487) Retire unused OUT_OF_MEMORY iterator status
Paul Rogers created DRILL-7487:
----------------------------------

             Summary: Retire unused OUT_OF_MEMORY iterator status
                 Key: DRILL-7487
                 URL: https://issues.apache.org/jira/browse/DRILL-7487
             Project: Apache Drill
          Issue Type: Improvement
            Reporter: Paul Rogers
            Assignee: Paul Rogers

Drill has long supported the {{OUT_OF_MEMORY}} iterator status. The idea is that an operator can realize it has encountered memory pressure and ask its downstream operator to free up some memory. However, the status is actually sent in only one place ({{UnorderedReceiverBatch}}), and then only in response to the operator hitting its allocator limit (which no other batch can do anything about).

Proposed are two changes:
# Retire the {{OUT_OF_MEMORY}} status. Simply remove all references to it, since it is never sent.
# Create a stub {{requestMemory()}} method in the operator context that does nothing now, but could be expanded to request memory from the other operators in the fragment.
[jira] [Resolved] (DRILL-5272) Text file reader is inefficient
[ https://issues.apache.org/jira/browse/DRILL-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers resolved DRILL-5272.
--------------------------------
    Resolution: Fixed

This issue was fixed when converting the text readers to use the result set loader framework.

> Text file reader is inefficient
> -------------------------------
>
>                 Key: DRILL-5272
>                 URL: https://issues.apache.org/jira/browse/DRILL-5272
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.10.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>
> From inspection of the ScanBatch and CompliantTextReader:
> Every batch holds about five implicit vectors. These are repeated for every
> row, which can greatly increase incoming data size.
> When populating the vectors, the allocation starts at 8 bytes and grows to 16
> bytes, causing a (slow) memory reallocation for every vector:
> {code}
> [org.apache.drill.exec.vector.UInt4Vector] -
> Reallocating vector [$offsets$(UINT4:REQUIRED)]. # of bytes: [8] -> [16]
> {code}
> Whether due to the above or a different issue, something is causing memory
> growth in the scan batch:
> {code}
> Entry Memory: 6,456,448
> Exit Memory:  7,636,312
> Entry Memory: 7,570,560
> Exit Memory:  8,750,424
> ...
> {code}
> Evidently the implicit vectors are added in response to a "SELECT *" query.
> Perhaps provide them only if actually requested.
> The vectors are populated for every row, making a copy of a potentially long
> file name and path for every record. Since the values are common to every
> record, perhaps we could use the same data copy for each, and have the offset
> vector for each record just point to the single copy.
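The final suggestion in the quoted description, storing one copy of the constant file-name value rather than one copy per record, can be sketched as below. The class and method names are illustrative only, not Drill's vector API; the point is that storage cost becomes independent of row count:

```java
import java.nio.charset.StandardCharsets;

// A column whose value is the same for every row (e.g. an implicit
// file-name column): keep one shared copy of the bytes and resolve every
// row against it, instead of copying the value once per record.
final class ConstantVarCharColumn {
  private final byte[] data;   // the single shared copy of the value bytes
  private final int rowCount;

  ConstantVarCharColumn(String value, int rowCount) {
    this.data = value.getBytes(StandardCharsets.UTF_8);
    this.rowCount = rowCount;
  }

  // Every row resolves to the same backing bytes; no per-row copy exists.
  String get(int row) {
    if (row < 0 || row >= rowCount) {
      throw new IndexOutOfBoundsException("row " + row);
    }
    return new String(data, StandardCharsets.UTF_8);
  }

  // Bytes held, independent of rowCount (vs. rowCount * value length today).
  long bytesStored() { return data.length; }
}
```

For a 40-byte path repeated over a 64K-row batch, this stores 40 bytes instead of roughly 2.5 MB, which is the saving the ticket is pointing at.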
[jira] [Assigned] (DRILL-5272) Text file reader is inefficient
[ https://issues.apache.org/jira/browse/DRILL-5272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers reassigned DRILL-5272:
----------------------------------
    Assignee: Paul Rogers
[jira] [Assigned] (DRILL-6832) Remove old "unmanaged" sort implementation
[ https://issues.apache.org/jira/browse/DRILL-6832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers reassigned DRILL-6832:
----------------------------------
    Assignee: Paul Rogers

> Remove old "unmanaged" sort implementation
> ------------------------------------------
>
>                 Key: DRILL-6832
>                 URL: https://issues.apache.org/jira/browse/DRILL-6832
>             Project: Apache Drill
>          Issue Type: Improvement
>    Affects Versions: 1.14.0
>            Reporter: Paul Rogers
>            Assignee: Paul Rogers
>            Priority: Minor
>
> Several releases back, Drill introduced a new "managed" external sort that
> enhanced the sort operator's memory management. To be safe, at the time, the
> new version was controlled by an option, with the ability to revert to the
> old version.
> The new version has proven to be stable. The time has come to remove the old
> version:
> * Remove the implementation in {{physical.impl.xsort}}.
> * Move the implementation from {{physical.impl.xsort.managed}} to the parent
> package.
> * Remove the conditional code in the batch creator.
> * Remove the option that allowed disabling the new version.
[jira] [Created] (DRILL-7486) Restructure row set reader builder
Paul Rogers created DRILL-7486:
----------------------------------

             Summary: Restructure row set reader builder
                 Key: DRILL-7486
                 URL: https://issues.apache.org/jira/browse/DRILL-7486
             Project: Apache Drill
          Issue Type: Improvement
            Reporter: Paul Rogers
            Assignee: Paul Rogers

The code to build a row set reader is located in several places and is tied to the {{RowSet}} class for historical reasons. This restructuring pulls the code out so that it can be used from a {{VectorContainer}} or other source.
[jira] [Commented] (DRILL-7476) Info in some sys schema tables are missing if queried with limit clause
[ https://issues.apache.org/jira/browse/DRILL-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16993203#comment-16993203 ] Paul Rogers commented on DRILL-7476: Added additional batch checks and uncovered the issue: {noformat} Found one or more vector errors from UnorderedReceiverBatch user - NullableVarCharVector: Value count = 1, but last set = -1 {noformat} Provided a patch. > Info in some sys schema tables are missing if queried with limit clause > --- > > Key: DRILL-7476 > URL: https://issues.apache.org/jira/browse/DRILL-7476 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.17.0 >Reporter: Arina Ielchiieva >Assignee: Paul Rogers >Priority: Blocker > Fix For: 1.17.0 > > > Affected schema: sys > Affected tables: connections, threads, memory > If query is executed with limit clause, information for some fields is > missing: > *Connections* > {noformat} > apache drill (sys)> select * from connections; > +---+---++-+---+-+-+-+--+--+ > | user|client | drillbit | established | > duration | queries | isAuthenticated | isEncrypted | usingSSL | > session| > +---+---++-+---+-+-+-+--+--+ > | anonymous | xxx.xxx.x.xxx | xxx | 2019-12-10 13:45:01.766 | 59 min 42.393 > sec | 27 | false | false | false| xxx | > +---+---++-+---+-+-+-+--+--+ > 1 row selected (0.1 seconds) > apache drill (sys)> select * from connections limit 1; > +--++--+-+--+-+-+-+--+-+ > | user | client | drillbit | established | duration | queries | > isAuthenticated | isEncrypted | usingSSL | session | > +--++--+-+--+-+-+-+--+-+ > | || | 2019-12-10 13:45:01.766 | | 28 | > false | false | false| | > +--++--+-+--+-+-+-+--+-+ > {noformat} > *Threads* > {noformat} > apache drill (sys)> select * from threads; > ++---+---+--+ > | hostname | user_port | total_threads | busy_threads | > ++---+---+--+ > | xxx | 31010 | 27| 23 | > ++---+---+--+ > 1 row selected (0.119 seconds) > apache drill (sys)> select * from threads limit 1; > +--+---+---+--+ > | hostname | user_port | 
total_threads | busy_threads | > +--+---+---+--+ > | | 31010 | 27| 24 | > {noformat} > *Memory* > {noformat} > apache drill (sys)> select * from memory; > ++---+--+++++ > | hostname | user_port | heap_current | heap_max | direct_current | > jvm_direct_current | direct_max | > ++---+--+++++ > | xxx | 31010 | 493974480| 4116185088 | 5048576| 122765 > | 8589934592 | > ++---+--+++++ > 1 row selected (0.115 seconds) > apache drill (sys)> select * from memory limit 1; > +--+---+--+++++ > | hostname | user_port | heap_current | heap_max | direct_current | > jvm_direct_current | direct_max | > +--+---+--+++++ > | | 31010 | 499343272| 4116185088 | 9048576| 122765 >| 8589934592 | > +--+---+--+++++ > {noformat} > When selecting data from *Drillbits* table which has similar fields (ex: > hostname), everything is fine: > {noformat} > apache drill (sys)> select * from drillbits; >
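The batch check that caught this bug can be illustrated with a simplified sketch of the invariant involved (the class and method below are hypothetical, not Drill's actual {{BatchValidator}} code): for a nullable variable-width vector holding {{valueCount}} values, the last-set index is expected to be {{valueCount - 1}}, so a value count of 1 with a last-set index of -1 is flagged as an error.

```java
// Simplified illustration of the invariant behind the reported error;
// not Drill's actual BatchValidator implementation.
final class NullableVarWidthCheck {

  /**
   * For a nullable variable-width vector, every slot up to valueCount
   * must have been "set", so lastSet should equal valueCount - 1.
   * Returns an error message, or null when the vector is consistent.
   */
  static String check(String vectorName, int valueCount, int lastSet) {
    if (lastSet != valueCount - 1) {
      return vectorName + ": Value count = " + valueCount
          + ", but last set = " + lastSet;
    }
    return null;
  }
}
```

Under this rule, the {{UnorderedReceiverBatch}} error above corresponds to a vector whose values were counted but never marked as set.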
[jira] [Updated] (DRILL-7479) Short-term fixes for metadata API parameterized type issues
[ https://issues.apache.org/jira/browse/DRILL-7479?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7479: --- Description: See DRILL-7480 for a discussion of the issues with how we currently use parameterized types in the metadata API. This ticket is for short-term fixes that convert unsafe generic types of the form {{StatisticsHolder}} to the form {{StatisticsHolder}} so that the compiler does not complain with many warnings (and a few Eclipse-only errors.) The topic should be revisited later in the context of DRILL-7480. was: See DRILL- for a discussion of the issues with how we currently use parameterized types in the metadata API. This ticket is for short-term fixes that convert unsafe generic types of the form {{StatisticsHolder}} to the form {{StatisticsHolder}} so that the compiler does not complain with many warnings (and a few Eclipse-only errors.) The topic should be revisited later in the context of DRILL-. > Short-term fixes for metadata API parameterized type issues > --- > > Key: DRILL-7479 > URL: https://issues.apache.org/jira/browse/DRILL-7479 > Project: Apache Drill > Issue Type: Improvement >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Major > > See DRILL-7480 for a discussion of the issues with how we currently use > parameterized types in the metadata API. > This ticket is for short-term fixes that convert unsafe generic types of the > form {{StatisticsHolder}} to the form {{StatisticsHolder}} so that the > compiler does not complain with many warnings (and a few Eclipse-only errors.) > The topic should be revisited later in the context of DRILL-7480. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7480) Revisit parameterized type design for Metadata API
Paul Rogers created DRILL-7480: -- Summary: Revisit parameterized type design for Metadata API Key: DRILL-7480 URL: https://issues.apache.org/jira/browse/DRILL-7480 Project: Apache Drill Issue Type: Improvement Reporter: Paul Rogers Grabbed latest master and found that the code will not build in Eclipse due to a type mismatch in the statistics code. Specifically, the problem is that we have several parameterized classes, but we often omit the parameters. Evidently, doing so is fine for some compilers, but is an error in Eclipse. Then, while fixing the immediate issue, I found the opposite problem: code that would satisfy Eclipse, but which failed in the Maven build. I spent time making another pass through the metadata code to add type parameters, remove "rawtypes" ignores and so on. See DRILL-7479. Stepping back a bit, it seems that we are perhaps using type parameters in a way that does not serve our needs in this particular case. We have many classes that hold onto particular values of some type, such as {{StatisticsHolder}}, which can hold a String, a Double, etc. So, we parameterize. But, after that, we treat the items generically. We don't care that {{foo}} is a {{StatisticsHolder<String>}} and {{bar}} is a {{StatisticsHolder<Double>}}; we just want to create, combine and work with lists of statistics. The same is true in several other places such as column type, comparator type, etc. For comparators, we don't really care what type they compare; we just want, given two generic {{StatisticsHolder}}s, to get the corresponding comparator. This is very similar to the situation with the "column accessors" in EVF: each column is a {{VARCHAR}} or a {{FLOAT8}}, but most code just treats them generically. So, the type-ness of the value was treated as a runtime attribute, not a compile-time attribute. This is a subtle point. Most code in Drill does not work with types directly in Java code.
Instead, Drill is an interpreter: it works with generic objects which, at run time, resolve to actual typed objects. It is the difference between writing an application (which directly uses types) and writing a language (which works generically with all types). For example, a {{StatisticsHolder}} probably only needs to be type-aware at the moment it is populated or used, but not in all the generic column-level and table-level code. (The same is true of properties in the column metadata class, as an example.) IMHO, {{StatisticsHolder}} probably wants to be a non-parameterized class. It should have a declaration object that, say, provides the name, type, comparator and other metadata. When the actual value is needed, a typed getter can be provided: {code:java} <T> T getValue(); {code} As it is, the type system is very complex, but we get no value from it. Since it is so complex, the code just punted and sprinkled raw types and ignores in many places, which defeats the purpose of parameterized types anyway. Suggestion: let's revisit this work after the upcoming release and see if we can simplify it. -- This message was sent by Atlassian Jira (v8.3.4#803005)
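The proposal above can be sketched in plain Java. This is a discussion-draft illustration only: {{StatisticDefn}}, its fields, and the holder shown here are invented names, not Drill's actual metadata classes.

```java
import java.util.Comparator;

/** Hypothetical declaration object: describes a statistic's name,
 *  runtime type, and comparator (invented for illustration). */
final class StatisticDefn {
  final String name;
  final Class<?> type;
  final Comparator<Object> comparator;

  StatisticDefn(String name, Class<?> type, Comparator<Object> comparator) {
    this.name = name;
    this.type = type;
    this.comparator = comparator;
  }
}

/** Non-parameterized holder: the value's type-ness is a runtime
 *  attribute carried by the declaration, not a compile-time parameter. */
final class StatisticsHolder {
  private final StatisticDefn defn;
  private final Object value;

  StatisticsHolder(StatisticDefn defn, Object value) {
    this.defn = defn;
    this.value = value;
  }

  /** Typed getter, used only at the moment the value is consumed. */
  @SuppressWarnings("unchecked")
  <T> T getValue() { return (T) value; }

  StatisticDefn defn() { return defn; }
}
```

Generic code can then create and combine lists of {{StatisticsHolder}} without any type parameters; only the code that produces or consumes a value invokes the typed getter.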
[jira] [Created] (DRILL-7479) Short-term fixes for metadata API parameterized type issues
Paul Rogers created DRILL-7479: -- Summary: Short-term fixes for metadata API parameterized type issues Key: DRILL-7479 URL: https://issues.apache.org/jira/browse/DRILL-7479 Project: Apache Drill Issue Type: Improvement Reporter: Paul Rogers Assignee: Paul Rogers See DRILL- for a discussion of the issues with how we currently use parameterized types in the metadata API. This ticket is for short-term fixes that convert unsafe generic types of the form {{StatisticsHolder}} to the form {{StatisticsHolder}} so that the compiler does not complain with many warnings (and a few Eclipse-only errors.) The topic should be revisited later in the context of DRILL-. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (DRILL-7476) Info in some sys schema tables are missing if queried with limit clause
[ https://issues.apache.org/jira/browse/DRILL-7476?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers reassigned DRILL-7476: -- Assignee: Paul Rogers (was: Paul Rogers) > Info in some sys schema tables are missing if queried with limit clause > --- > > Key: DRILL-7476 > URL: https://issues.apache.org/jira/browse/DRILL-7476 > Project: Apache Drill > Issue Type: Bug > Affects Versions: 1.17.0 > Reporter: Arina Ielchiieva > Assignee: Paul Rogers > Priority: Blocker > Fix For: 1.17.0
[jira] [Commented] (DRILL-7470) drill-yarn unit tests print stack traces with NoSuchMethodError
[ https://issues.apache.org/jira/browse/DRILL-7470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16990071#comment-16990071 ] Paul Rogers commented on DRILL-7470: I wrote those tests originally, some of them are rather fragile because of the kinds of things they test. I'll take a quick look to see if this is something obvious. > drill-yarn unit tests print stack traces with NoSuchMethodError > --- > > Key: DRILL-7470 > URL: https://issues.apache.org/jira/browse/DRILL-7470 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.17.0 >Reporter: Vova Vysotskyi >Assignee: Anton Gozhiy >Priority: Minor > > Looks like it was caused by the Hadoop update. > *Steps to reproduce:* > 1. run {{mvn clean install}} > 2. wait until drill-yarn unit tests are finished > 3. check output > *Expected output:* > {noformat} > [INFO] --- maven-surefire-plugin:3.0.0-M3:test (default-test) @ drill-yarn --- > [INFO] > [INFO] --- > [INFO] T E S T S > [INFO] --- > [INFO] Running org.apache.drill.yarn.zk.TestAmRegistration > [INFO] Running org.apache.drill.yarn.zk.TestZkRegistry > [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.096 > s - in org.apache.drill.yarn.zk.TestAmRegistration > [INFO] Running org.apache.drill.yarn.client.TestCommandLineOptions > [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.002 > s - in org.apache.drill.yarn.client.TestCommandLineOptions > [INFO] Running org.apache.drill.yarn.client.TestClient > [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.057 > s - in org.apache.drill.yarn.client.TestClient > [INFO] Running org.apache.drill.yarn.scripts.TestScripts > [WARNING] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: > 0.001 s - in org.apache.drill.yarn.scripts.TestScripts > [INFO] Running org.apache.drill.yarn.core.TestConfig > [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.307 > s - in org.apache.drill.yarn.core.TestConfig > 
[INFO] Tests run: 4, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 5.028 > s - in org.apache.drill.yarn.zk.TestZkRegistry > [INFO] > [INFO] Results: > [INFO] > [WARNING] Tests run: 11, Failures: 0, Errors: 0, Skipped: 1 > [INFO] > [INFO] > [INFO] --- maven-surefire-plugin:3.0.0-M3:test (metastore-test) @ drill-yarn > --- > {noformat} > *Actual output* > {noformat} > [INFO] --- maven-surefire-plugin:3.0.0-M3:test (default-test) @ drill-yarn --- > [INFO] > [INFO] --- > [INFO] T E S T S > [INFO] --- > Failed to instantiate [ch.qos.logback.classic.LoggerContext] > Reported exception: > java.lang.NoSuchMethodError: > ch.qos.logback.core.util.Loader.getResourceOccurrenceCount(Ljava/lang/String;Ljava/lang/ClassLoader;)Ljava/util/Set; > at > ch.qos.logback.classic.util.ContextInitializer.multiplicityWarning(ContextInitializer.java:158) > at > ch.qos.logback.classic.util.ContextInitializer.statusOnResourceSearch(ContextInitializer.java:181) > at > ch.qos.logback.classic.util.ContextInitializer.findConfigFileURLFromSystemProperties(ContextInitializer.java:109) > at > ch.qos.logback.classic.util.ContextInitializer.findURLOfDefaultConfigurationFile(ContextInitializer.java:118) > at > ch.qos.logback.classic.util.ContextInitializer.autoConfig(ContextInitializer.java:146) > at org.slf4j.impl.StaticLoggerBinder.init(StaticLoggerBinder.java:85) > at > org.slf4j.impl.StaticLoggerBinder.(StaticLoggerBinder.java:55) > at org.slf4j.LoggerFactory.bind(LoggerFactory.java:150) > at org.slf4j.LoggerFactory.performInitialization(LoggerFactory.java:124) > at org.slf4j.LoggerFactory.getILoggerFactory(LoggerFactory.java:412) > at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:357) > at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:383) > at > org.apache.drill.common.util.ProtobufPatcher.(ProtobufPatcher.java:33) > at org.apache.drill.test.BaseTest.(BaseTest.java:35) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:423) > at > org.junit.runners.BlockJUnit4ClassRunner.createTest(BlockJUnit4ClassRunner.java:217) > at > org.junit.runners.BlockJUnit4ClassRunner$1.runReflectiveCall(BlockJUnit4ClassRunner.java:266) >
[jira] [Resolved] (DRILL-7303) Filter record batch does not handle zero-length batches
[ https://issues.apache.org/jira/browse/DRILL-7303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers resolved DRILL-7303. Resolution: Duplicate > Filter record batch does not handle zero-length batches > --- > > Key: DRILL-7303 > URL: https://issues.apache.org/jira/browse/DRILL-7303 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.16.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Major > > Testing of the row-set-based JSON reader revealed a limitation of the Filter > record batch: if an incoming batch has zero records, the length of the > associated SV2 is left at -1. In particular: > {code:java} > public class SelectionVector2 implements AutoCloseable { > // Indicates actual number of rows in the RecordBatch > // container which owns this SV2 instance > private int batchActualRecordCount = -1; > {code} > Then: > {code:java} > public abstract class FilterTemplate2 implements Filterer { > @Override > public void filterBatch(int recordCount) throws SchemaChangeException{ > if (recordCount == 0) { > outgoingSelectionVector.setRecordCount(0); > return; > } > {code} > Notice there is no call to set the actual record count. The solution is to > insert one line of code: > {code:java} > if (recordCount == 0) { > outgoingSelectionVector.setRecordCount(0); > outgoingSelectionVector.setBatchActualRecordCount(0); // <-- Add this > return; > } > {code} > Without this, the query fails with an error due to an invalid index of -1. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (DRILL-7311) Partial fixes for empty batch bugs
[ https://issues.apache.org/jira/browse/DRILL-7311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers resolved DRILL-7311. Resolution: Duplicate > Partial fixes for empty batch bugs > -- > > Key: DRILL-7311 > URL: https://issues.apache.org/jira/browse/DRILL-7311 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.16.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Major > Fix For: 1.18.0 > > > DRILL-7305 explains that multiple operators have serious bugs when presented > with empty batches. DRILL-7306 explains that the EVF (AKA "new scan > framework") was originally coded to emit an empty "fast schema" batch, but > that the feature was disabled because of the many empty-batch operator > failures. > This ticket covers a set of partial fixes for empty-batch issues. This is the > result of work done to get the converted JSON reader to work with a "fast > schema." The JSON work, in the end, revealed that Drill has too many bugs to > enable fast schema, and so the DRILL-7306 was implemented instead. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (DRILL-7305) Multiple operators do not handle empty batches
[ https://issues.apache.org/jira/browse/DRILL-7305?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers resolved DRILL-7305. Resolution: Duplicate > Multiple operators do not handle empty batches > -- > > Key: DRILL-7305 > URL: https://issues.apache.org/jira/browse/DRILL-7305 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.16.0 >Reporter: Paul Rogers >Priority: Major > > While testing the new "EVF" framework, it was found that multiple operators > incorrectly handle empty batches. The EVF framework is set up to return a > "fast schema" empty batch with only schema as its first batch. It turns out > that many operators fail with problems such as: > * Failure to set the value counts in the output container > * Fail to initialize the offset vector position 0 to 0 for variable-width or > repeated vectors > And so on. > Partial fixes are in the JSON reader PR. > For now, the easiest work-around is to disable the "fast schema" path in the > EVF: DRILL-7306. > To discover the remaining issues, enable the > {{ScanOrchestratorBuilder.enableSchemaBatch}} option and run unit tests. You > can use the {{VectorChecker}} and {{VectorAccessorUtilities.verify()}} > methods to check state. Insert a call to {{verify()}} in each "next" method: > verify the incoming and outgoing batches. The checker only verifies a few > vector types; but these are enough to show many problems. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (DRILL-7324) Many vector-validity errors from unit tests
[ https://issues.apache.org/jira/browse/DRILL-7324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers reassigned DRILL-7324: -- Assignee: Paul Rogers > Many vector-validity errors from unit tests > --- > > Key: DRILL-7324 > URL: https://issues.apache.org/jira/browse/DRILL-7324 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.16.0 >Reporter: Paul Rogers >Assignee: Paul Rogers >Priority: Major > > Drill's value vectors contain many counts that must be maintained in sync. > Drill provides a utility, {{BatchValidator}} to check (a subset of) these > values for consistency. > The {{IteratorValidatorBatchIterator}} class is used in tests to validate the > state of each operator (AKA "record batch") as Drill runs the Volcano > iterator. This class can also validate vectors by setting the > {{VALIDATE_VECTORS}} constant to `true`. > This was done, then unit tests were run. Many tests failed. Examples: > {noformat} > [INFO] Running org.apache.drill.TestUnionDistinct > 18:44:26.742 [22d42585-74c2-d418-6f59-9b1870d04770:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > LimitRecordBatch > key - NullableBitVector: Row count = 0, but value count = 2 > 18:44:26.745 [22d42585-74c2-d418-6f59-9b1870d04770:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > LimitRecordBatch > key - NullableBitVector: Row count = 0, but value count = 2 > [INFO] Running org.apache.drill.TestUnionDistinct > 8:44:48.302 [22d4256e-c90b-847c-5104-02d6cdf5223e:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > LimitRecordBatch > key - NullableBitVector: Row count = 0, but value count = 2 > 18:44:48.703 [22d4256e-ccf3-2af6-f56a-140e9c3e55bb:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > FilterRecordBatch > n_nationkey - IntVector: Row count = 2, but value count = 25 > n_regionkey - 
IntVector: Row count = 2, but value count = 25 > 18:44:48.731 [22d4256e-ccf3-2af6-f56a-140e9c3e55bb:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > FilterRecordBatch > n_nationkey - IntVector: Row count = 4, but value count = 25 > n_regionkey - IntVector: Row count = 4, but value count = 25 > 18:44:49.039 [22d4256f-6b39-d2ab-d145-4f2b0db315a3:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > FilterRecordBatch > n_nationkey - IntVector: Row count = 2, but value count = 25 > 18:44:49.363 [22d4256e-3d91-850f-9ab4-5939219ac0d0:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > FilterRecordBatch > c_custkey - IntVector: Row count = 4, but value count = 1500 > 18:44:49.597 [22d4256d-c113-ae5c-6f31-4dd1ec091365:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > FilterRecordBatch > n_nationkey - IntVector: Row count = 5, but value count = 25 > n_regionkey - IntVector: Row count = 5, but value count = 25 > 18:44:49.610 [22d4256d-c113-ae5c-6f31-4dd1ec091365:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > FilterRecordBatch > r_regionkey - IntVector: Row count = 1, but value count = 5 > 18:44:53.029 [22d4256a-8b70-5f3b-f79b-806e194c5ed2:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > LimitRecordBatch > n_nationkey - IntVector: Row count = 0, but value count = 25 > n_name - VarCharVector: Row count = 0, but value count = 25 > n_regionkey - IntVector: Row count = 0, but value count = 25 > 18:44:53.033 [22d4256a-8b70-5f3b-f79b-806e194c5ed2:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > LimitRecordBatch > n_regionkey - IntVector: Row count = 5, but value count = 25 > 18:44:53.331 [22d4256a-526c-7815-c216-8e45752a4a6c:frag:0:0] ERROR > 
o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > LimitRecordBatch > n_nationkey - IntVector: Row count = 5, but value count = 25 > n_name - VarCharVector: Row count = 5, but value count = 25 > n_regionkey - IntVector: Row count = 5, but value count = 25 > 18:44:53.337 [22d4256a-526c-7815-c216-8e45752a4a6c:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > LimitRecordBatch > n_regionkey - IntVector: Row count = 0, but value count = 25 > 18:44:53.646 [22d42569-c293-ced0-c3d0-e9153cc4a70a:frag:0:0] ERROR > o.a.d.e.p.i.validate.BatchValidator - Found one or more vector errors from > LimitRecordBatch > key - NullableBitVector: Row count = 0, but value count = 2 > Running org.apache.drill.TestTpchSingleMode > 18:45:01.299
[jira] [Created] (DRILL-7458) Base storage plugin framework
Paul Rogers created DRILL-7458: -- Summary: Base storage plugin framework Key: DRILL-7458 URL: https://issues.apache.org/jira/browse/DRILL-7458 Project: Apache Drill Issue Type: Improvement Reporter: Paul Rogers Assignee: Paul Rogers The "Easy" framework allows third parties to add format plugins to Drill with moderate effort. (The process could be easier, but "Easy" makes it as simple as possible given the current structure.) At present, no such "starter" framework exists for storage plugins. Further, multiple storage plugins have implemented filter push-down, seemingly by copying large blocks of code. This ticket offers a "base" framework for storage plugins and for filter push-downs. The framework builds on the EVF, allowing plugins to also support project push-down. The framework includes a "test mule" storage plugin to verify functionality, and was used as the basis of a REST-like plugin. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7457) Join assignment is random when table costs are identical
[ https://issues.apache.org/jira/browse/DRILL-7457?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7457: --- Summary: Join assignment is random when table costs are identical (was: Join assignment is random when table costa are identical) > Join assignment is random when table costs are identical > > > Key: DRILL-7457 > URL: https://issues.apache.org/jira/browse/DRILL-7457 > Project: Apache Drill > Issue Type: Bug >Reporter: Paul Rogers >Priority: Minor > > Create a simple test: a join between two identical scans, call them t1 and > t2. Ensure that the scans report the same cost. Capture the logical plan. > Repeat the exercise several times. You will see that Drill randomly assigns > t1 to the left side or right side. > Operationally this might not make a difference. But, in tests, it means that > trying to compare an "actual" and "golden" plan is impossible as the plans > are unstable. > Also, if only the estimates are the same, but the table size differs, then > runtime performance will randomly be better on some query runs than others. > Better is to fall back to SQL statement table order if the two tables are > otherwise identical in cost. > This may be a Calcite issue rather than a Drill issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7457) Join assignment is random when table costa are identical
Paul Rogers created DRILL-7457: -- Summary: Join assignment is random when table costa are identical Key: DRILL-7457 URL: https://issues.apache.org/jira/browse/DRILL-7457 Project: Apache Drill Issue Type: Bug Reporter: Paul Rogers Create a simple test: a join between two identical scans, call them t1 and t2. Ensure that the scans report the same cost. Capture the logical plan. Repeat the exercise several times. You will see that Drill randomly assigns t1 to the left side or right side. Operationally this might not make a difference. But, in tests, it means that trying to compare an "actual" and "golden" plan is impossible as the plans are unstable. Also, if only the estimates are the same, but the table size differs, then runtime performance will randomly be better on some query runs than others. Better is to fall back to SQL statement table order if the two tables are otherwise identical in cost. This may be a Calcite issue rather than a Drill issue. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7456) Batch count fixes for 12 additional operators
Paul Rogers created DRILL-7456: -- Summary: Batch count fixes for 12 additional operators Key: DRILL-7456 URL: https://issues.apache.org/jira/browse/DRILL-7456 Project: Apache Drill Issue Type: Bug Reporter: Paul Rogers Assignee: Paul Rogers Enables batch validation for 12 additional operators: * MergingRecordBatch * OrderedPartitionRecordBatch * RangePartitionRecordBatch * TraceRecordBatch * UnionAllRecordBatch * UnorderedReceiverBatch * UnpivotMapsRecordBatch * WindowFrameRecordBatch * TopNBatch * HashJoinBatch * ExternalSortBatch * WriterRecordBatch Fixes issues found with those checks so that this set of operators passes all checks. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7455) "Renaming" projection operator to avoid physical copies
Paul Rogers created DRILL-7455: -- Summary: "Renaming" projection operator to avoid physical copies Key: DRILL-7455 URL: https://issues.apache.org/jira/browse/DRILL-7455 Project: Apache Drill Issue Type: Improvement Reporter: Paul Rogers Drill/Calcite inserts project operators for three main reasons: 1. To rename columns: {{SELECT a AS x ...}} 2. To compute a new column: {{SELECT a + b AS c ...}} 3. To remove columns: {{SELECT a ...}} when the data source provides columns {{a}} and {{b}}. Example of case 1: {code:json} "pop" : "project", "@id" : 4, "exprs" : [ { "ref" : "`a0`", "expr" : "`a`" }, { "ref" : "`b0`", "expr" : "`b`" } ], {code} Of these, only case 2 requires row-by-row computation of new values. Case 1 simply creates a new vector with only the name changed, but the same data. Case 3 preserves some vectors and drops others. In cases 1 and 3, a simple data transfer from input to output would be adequate. Yet, if one steps through the code with code generation enabled, one will see that Drill steps through each record in all three cases, even calling an empty per-record compute block. A better-performing solution is to separate the renames/drops (cases 1 and 3) from the column computations (case 2). This can be done either: 1. At plan time: identify that all columns are renames, and replace the row-by-row project with a column-level project. 2. At run time: identify the column-level projections (cases 1 and 3) and handle them with transfer pairs, doing row-by-row computes only when case 2 exists. Since row-by-row copies are among the most expensive operations in Drill, this optimization could improve performance by a decent amount. 
Note that a further optimization is to remove "trivial" projects such as the following: {code:json} "pop" : "project", "@id" : 2, "exprs" : [ { "ref" : "`a`", "expr" : "`a`" }, { "ref" : "`b`", "expr" : "`b`" }, { "ref" : "`b0`", "expr" : "`b0`" } ], {code} The only value of such a projection is to say, "remove all vectors except {{a}}, {{b}} and {{b0}}." In fact, the only time such a projection should be needed is: 1. On top of a data source that does not support projection push down. 2. When Calcite knows it wants to discard certain intermediate columns. Otherwise, Calcite knows which columns emerge from operator x, and should not need to add a project to enforce a schema that is already what the operator emits. -- This message was sent by Atlassian Jira (v8.3.4#803005)
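The transfer-versus-compute distinction above can be sketched with plain Java arrays standing in for value vectors. This is a conceptual sketch only: {{Batch}}, {{transferProject}} and {{computeSum}} are invented names, not Drill's transfer-pair API. A rename or drop moves a buffer by reference with no per-row loop; only a computed column needs row-by-row work.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy batch: column name -> data array (stand-in for value vectors). */
final class Batch {
  final Map<String, int[]> columns = new LinkedHashMap<>();
}

final class ProjectSketch {

  /** Rename/drop: transfer the buffer under a new name; columns absent
   *  from the mapping are dropped. No per-row copies occur. */
  static Batch transferProject(Batch in, Map<String, String> outToIn) {
    Batch out = new Batch();
    outToIn.forEach((outName, inName) ->
        out.columns.put(outName, in.columns.get(inName))); // same array, new name
    return out;
  }

  /** Computed column: per-row work is unavoidable here. */
  static int[] computeSum(Batch in, String a, String b) {
    int[] x = in.columns.get(a);
    int[] y = in.columns.get(b);
    int[] c = new int[x.length];
    for (int i = 0; i < x.length; i++) {
      c[i] = x[i] + y[i];
    }
    return c;
  }
}
```

A project batch structured this way would apply {{transferProject}} for pure renames and drops, and fall back to generated per-row code only when a computed expression is present.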
[jira] [Updated] (DRILL-7451) Planner inserts "trivial" top project node for simple query
[ https://issues.apache.org/jira/browse/DRILL-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7451: --- Summary: Planner inserts "trivial" top project node for simple query (was: Planner inserts project node even if scan handles project push-down) > Planner inserts "trivial" top project node for simple query > --- > > Key: DRILL-7451 > URL: https://issues.apache.org/jira/browse/DRILL-7451 > Project: Apache Drill > Issue Type: Bug >Reporter: Paul Rogers >Priority: Minor > > I created a "dummy" storage plugin for testing. The test does a simple query: > {code:sql} > SELECT a, b, c from dummy.myTable > {code} > The first test is to mark the plugin's group scan as supporting projection > push down. However, Drill still creates a projection node in the logical plan: > {code:json} > "graph" : [ { > "pop" : "DummyGroupScan", > "@id" : 2, > "columns" : [ "`**`" ], > "userName" : "progers", > "cost" : { > "memoryCost" : 1.6777216E7, > "outputRowCount" : 1.0 > } > }, { > "pop" : "project", > "@id" : 1, > "exprs" : [ { > "ref" : "`a`", > "expr" : "`a`" > }, { > "ref" : "`b`", > "expr" : "`b`" > }, { > "ref" : "`c`", > "expr" : "`c`" > } ], > "child" : 2, > "outputProj" : true, > "initialAllocation" : 100, > "maxAllocation" : 100, > "cost" : { > "memoryCost" : 1.6777216E7, > "outputRowCount" : 1.0 > } > }, { > "pop" : "screen", > "@id" : 0, > "child" : 1, > "initialAllocation" : 100, > "maxAllocation" : 100, > "cost" : { > "memoryCost" : 1.6777216E7, > "outputRowCount" : 1.0 > } > } ] > {code} > There is [a comment in the > code|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjectIntoScanRule.java#L109] > that suggests the project should be removed: > {code:java} > // project above scan may be removed in ProjectRemoveRule for > // the case when it is trivial > {code} > As shown in the example, the project is trivial. 
There is a subtlety: it may > be that the scan, unknown to the planner, produces additional columns, say > {{d}} and {{e}}, which the project operator is needed to remove. > If this is the reason the project remains, perhaps we can add a flag of some > kind where the group scan can insist that not only does it handle projection, > it will not insert additional columns. At that point, the project is > completely unnecessary in this case. > This is not a functional bug, just a performance issue: we exercise the > machinery of the project operator to do exactly nothing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7451) Planner inserts project node even if scan handles project push-down
[ https://issues.apache.org/jira/browse/DRILL-7451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16978120#comment-16978120 ] Paul Rogers commented on DRILL-7451: It appears that the actual behavior is a bit more complex. Run the same test as above, with the same query, but now mark the plugin's group scan as *not* supporting projection pushdown. In this case we get two projects. This suggests that the project above is added for a different reason, but it is still trivial and should be removed. Logical plan with scan project pushdown disabled: {code:json} "graph" : [ { "pop" : "DummyGroupScan", "@id" : 3, "columns" : [ "`**`" ], "userName" : "progers", "cost" : { "memoryCost" : 1.6777216E7, "outputRowCount" : 1.0 } }, { "pop" : "project", "@id" : 2, "exprs" : [ { "ref" : "`a`", "expr" : "`a`" }, { "ref" : "`b`", "expr" : "`b`" }, { "ref" : "`c`", "expr" : "`c`" } ], "child" : 3, "outputProj" : true, "initialAllocation" : 100, "maxAllocation" : 100, "cost" : { "memoryCost" : 1.6777216E7, "outputRowCount" : 1.0 } }, { "pop" : "project", "@id" : 1, "exprs" : [ { "ref" : "`a`", "expr" : "`a`" }, { "ref" : "`b`", "expr" : "`b`" }, { "ref" : "`c`", "expr" : "`c`" } ], "child" : 2, "outputProj" : true, "initialAllocation" : 100, "maxAllocation" : 100, "cost" : { "memoryCost" : 1.6777216E7, "outputRowCount" : 1.0 } }, { "pop" : "screen", "@id" : 0, "child" : 1, "initialAllocation" : 100, "maxAllocation" : 100, "cost" : { "memoryCost" : 1.6777216E7, "outputRowCount" : 1.0 } } ] {code} > Planner inserts project node even if scan handles project push-down > --- > > Key: DRILL-7451 > URL: https://issues.apache.org/jira/browse/DRILL-7451 > Project: Apache Drill > Issue Type: Bug >Reporter: Paul Rogers >Priority: Minor > > I created a "dummy" storage plugin for testing. The test does a simple query: > {code:sql} > SELECT a, b, c from dummy.myTable > {code} > The first test is to mark the plugin's group scan as supporting projection > push down. 
However, Drill still creates a projection node in the logical plan: > {code:json} > "graph" : [ { > "pop" : "DummyGroupScan", > "@id" : 2, > "columns" : [ "`**`" ], > "userName" : "progers", > "cost" : { > "memoryCost" : 1.6777216E7, > "outputRowCount" : 1.0 > } > }, { > "pop" : "project", > "@id" : 1, > "exprs" : [ { > "ref" : "`a`", > "expr" : "`a`" > }, { > "ref" : "`b`", > "expr" : "`b`" > }, { > "ref" : "`c`", > "expr" : "`c`" > } ], > "child" : 2, > "outputProj" : true, > "initialAllocation" : 100, > "maxAllocation" : 100, > "cost" : { > "memoryCost" : 1.6777216E7, > "outputRowCount" : 1.0 > } > }, { > "pop" : "screen", > "@id" : 0, > "child" : 1, > "initialAllocation" : 100, > "maxAllocation" : 100, > "cost" : { > "memoryCost" : 1.6777216E7, > "outputRowCount" : 1.0 > } > } ] > {code} > There is [a comment in the > code|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjectIntoScanRule.java#L109] > that suggests the project should be removed: > {code:java} > // project above scan may be removed in ProjectRemoveRule for > // the case when it is trivial > {code} > As shown in the example, the project is trivial. There is a subtlety: it may > be that the scan, unknown to the planner, produces additional columns, say > {{d}} and {{e}}, which the project operator is needed to remove. > If this is the reason the project remains, perhaps we can add a flag of some > kind where the group scan can insist that not only does it handle projection, > it will not insert additional columns. At that point, the project is > completely unnecessary in this case. > This is not a functional bug, just a performance issue: we exercise the > machinery of the project operator to do exactly nothing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7451) Planner inserts project node even if scan handles project push-down
Paul Rogers created DRILL-7451: -- Summary: Planner inserts project node even if scan handles project push-down Key: DRILL-7451 URL: https://issues.apache.org/jira/browse/DRILL-7451 Project: Apache Drill Issue Type: Bug Reporter: Paul Rogers I created a "dummy" storage plugin for testing. The test does a simple query: {code:sql} SELECT a, b, c from dummy.myTable {code} The first test is to mark the plugin's group scan as supporting projection push down. However, Drill still creates a projection node in the logical plan: {code:json} "graph" : [ { "pop" : "DummyGroupScan", "@id" : 2, "columns" : [ "`**`" ], "userName" : "progers", "cost" : { "memoryCost" : 1.6777216E7, "outputRowCount" : 1.0 } }, { "pop" : "project", "@id" : 1, "exprs" : [ { "ref" : "`a`", "expr" : "`a`" }, { "ref" : "`b`", "expr" : "`b`" }, { "ref" : "`c`", "expr" : "`c`" } ], "child" : 2, "outputProj" : true, "initialAllocation" : 100, "maxAllocation" : 100, "cost" : { "memoryCost" : 1.6777216E7, "outputRowCount" : 1.0 } }, { "pop" : "screen", "@id" : 0, "child" : 1, "initialAllocation" : 100, "maxAllocation" : 100, "cost" : { "memoryCost" : 1.6777216E7, "outputRowCount" : 1.0 } } ] {code} There is [a comment in the code|https://github.com/apache/drill/blob/master/exec/java-exec/src/main/java/org/apache/drill/exec/planner/logical/DrillPushProjectIntoScanRule.java#L109] that suggests the project should be removed: {code:java} // project above scan may be removed in ProjectRemoveRule for // the case when it is trivial {code} As shown in the example, the project is trivial. There is a subtlety: it may be that the scan, unknown to the planner, produces additional columns, say {{d}} and {{e}}, which the project operator is needed to remove. If this is the reason the project remains, perhaps we can add a flag of some kind where the group scan can insist that not only does it handle projection, it will not insert additional columns. At that point, the project is completely unnecessary in this case. 
This is not a functional bug, just a performance issue: we exercise the machinery of the project operator to do exactly nothing. -- This message was sent by Atlassian Jira (v8.3.4#803005)
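The flag suggested in the description can be sketched in a few lines. Everything below is hypothetical: {{SimpleGroupScan}}, {{emitsExactProjection()}}, and {{needsTopProject()}} are illustrative names, not Drill's actual {{GroupScan}} API.

```java
import java.util.List;

public class ProjectionFlagSketch {
  // Hypothetical scan interface; names are illustrative only.
  interface SimpleGroupScan {
    boolean canPushdownProjects(List<String> columns);

    // Proposed addition: the scan promises to emit exactly the projected
    // columns, so the planner may elide the trivial top project.
    default boolean emitsExactProjection() { return false; }
  }

  // The trivial project is only required when the scan might add extra
  // columns (d, e, ...) that must be removed before the screen.
  static boolean needsTopProject(SimpleGroupScan scan, List<String> cols) {
    return !(scan.canPushdownProjects(cols) && scan.emitsExactProjection());
  }

  public static void main(String[] args) {
    SimpleGroupScan exact = new SimpleGroupScan() {
      @Override public boolean canPushdownProjects(List<String> columns) { return true; }
      @Override public boolean emitsExactProjection() { return true; }
    };
    System.out.println(needsTopProject(exact, List.of("a", "b", "c"))); // false
  }
}
```

A scan that keeps the default {{emitsExactProjection()}} would still get the project node, preserving today's safe behavior.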
[jira] [Commented] (DRILL-7448) Fix warnings when running Drill memory tests
[ https://issues.apache.org/jira/browse/DRILL-7448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16976215#comment-16976215 ] Paul Rogers commented on DRILL-7448: Occurs in the vector module tests also. > Fix warnings when running Drill memory tests > > > Key: DRILL-7448 > URL: https://issues.apache.org/jira/browse/DRILL-7448 > Project: Apache Drill > Issue Type: Bug >Affects Versions: 1.16.0 >Reporter: Arina Ielchiieva >Assignee: Bohdan Kazydub >Priority: Minor > Fix For: 1.17.0 > > > {noformat} > -- drill-memory-base > [INFO] --- > [INFO] T E S T S > [INFO] --- > [INFO] Running org.apache.drill.exec.memory.TestEndianess > [INFO] Running org.apache.drill.exec.memory.TestAccountant > 16:21:45,719 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could > NOT find resource [logback.groovy] > 16:21:45,719 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found > resource [logback-test.xml] at > [jar:file:/Users/arina/Development/git_repo/drill/common/target/drill-common-1.17.0-SNAPSHOT-tests.jar!/logback-test.xml] > 16:21:45,733 |-INFO in > ch.qos.logback.core.joran.spi.ConfigurationWatchList@dbd940d - URL > [jar:file:/Users/arina/Development/git_repo/drill/common/target/drill-common-1.17.0-SNAPSHOT-tests.jar!/logback-test.xml] > is not of type file > 16:21:45,780 |-INFO in > ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not > set > 16:21:45,802 |-ERROR in ch.qos.logback.core.joran.conditional.IfAction - > Could not find Janino library on the class path. Skipping conditional > processing. 
> 16:21:45,802 |-ERROR in ch.qos.logback.core.joran.conditional.IfAction - See > also http://logback.qos.ch/codes.html#ifJanino > 16:21:45,803 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - > About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender] > 16:21:45,811 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - > Naming appender as [STDOUT] > 16:21:45,826 |-INFO in > ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default > type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] > property > 16:21:45,866 |-INFO in ch.qos.logback.classic.joran.action.LevelAction - ROOT > level set to ERROR > 16:21:45,866 |-ERROR in ch.qos.logback.core.joran.conditional.IfAction - > Could not find Janino library on the class path. Skipping conditional > processing. > 16:21:45,866 |-ERROR in ch.qos.logback.core.joran.conditional.IfAction - See > also http://logback.qos.ch/codes.html#ifJanino > 16:21:45,866 |-WARN in ch.qos.logback.classic.joran.action.RootLoggerAction - > The object on the top the of the stack is not the root logger > 16:21:45,866 |-WARN in ch.qos.logback.classic.joran.action.RootLoggerAction - > It is: ch.qos.logback.core.joran.conditional.IfAction > 16:21:45,866 |-INFO in > ch.qos.logback.classic.joran.action.ConfigurationAction - End of > configuration. 
> 16:21:45,867 |-INFO in > ch.qos.logback.classic.joran.JoranConfigurator@71d15f18 - Registering current > configuration as safe fallback point > 16:21:45,717 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Could > NOT find resource [logback.groovy] > 16:21:45,717 |-INFO in ch.qos.logback.classic.LoggerContext[default] - Found > resource [logback-test.xml] at > [jar:file:/Users/arina/Development/git_repo/drill/common/target/drill-common-1.17.0-SNAPSHOT-tests.jar!/logback-test.xml] > 16:21:45,729 |-INFO in > ch.qos.logback.core.joran.spi.ConfigurationWatchList@2698dc7 - URL > [jar:file:/Users/arina/Development/git_repo/drill/common/target/drill-common-1.17.0-SNAPSHOT-tests.jar!/logback-test.xml] > is not of type file > 16:21:45,778 |-INFO in > ch.qos.logback.classic.joran.action.ConfigurationAction - debug attribute not > set > 16:21:45,807 |-ERROR in ch.qos.logback.core.joran.conditional.IfAction - > Could not find Janino library on the class path. Skipping conditional > processing. > 16:21:45,807 |-ERROR in ch.qos.logback.core.joran.conditional.IfAction - See > also http://logback.qos.ch/codes.html#ifJanino > 16:21:45,808 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - > About to instantiate appender of type [ch.qos.logback.core.ConsoleAppender] > 16:21:45,814 |-INFO in ch.qos.logback.core.joran.action.AppenderAction - > Naming appender as [STDOUT] > 16:21:45,829 |-INFO in > ch.qos.logback.core.joran.action.NestedComplexPropertyIA - Assuming default > type [ch.qos.logback.classic.encoder.PatternLayoutEncoder] for [encoder] > property > 16:21:45,868 |-INFO in ch.qos.logback.classic.joran.action.LevelAction - ROOT > level set to ERROR > 16:21:45,868 |-ERROR in
[jira] [Created] (DRILL-7447) Simplify the Mock reader
Paul Rogers created DRILL-7447: -- Summary: Simplify the Mock reader Key: DRILL-7447 URL: https://issues.apache.org/jira/browse/DRILL-7447 Project: Apache Drill Issue Type: Improvement Reporter: Paul Rogers Assignee: Paul Rogers The mock reader is used to generate large volumes of data. It has evolved over time and has many crufty vestiges of prior implementations. Also, the Mock reader allows specifying that types are nullable, and the rate of null values. This change adds to the existing "encoding" to allow specifying this property via SQL: append an "n" to the column name to mark it nullable, followed by a number to specify the null percentage. To specify INT columns with 10%, 50% and 90% nulls: {noformat} SELECT a_in10, b_in50, c_in90 FROM mock.dummy1000 {noformat} The default is 25% nulls (which already existed in the code) if no numeric suffix is provided. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7446) Eclipse compilation issue in AbstractParquetGroupScan
Paul Rogers created DRILL-7446: -- Summary: Eclipse compilation issue in AbstractParquetGroupScan Key: DRILL-7446 URL: https://issues.apache.org/jira/browse/DRILL-7446 Project: Apache Drill Issue Type: Bug Reporter: Paul Rogers Assignee: Paul Rogers When the recent master branch is loaded in Eclipse, we get a compiler error in {{AbstractParquetGroupScan}}: {noformat} The method getFiltered(OptionManager, FilterPredicate) from the type AbstractGroupScanWithMetadata.GroupScanWithMetadataFilterer is not visible AbstractParquetGroupScan.java /drill-java-exec/src/main/java/org/apache/drill/exec/store/parquet line 242 Java Problem Type mismatch: cannot convert from AbstractGroupScanWithMetadata.GroupScanWithMetadataFilterer to AbstractParquetGroupScan.RowGroupScanFilterer AbstractParquetGroupScan.java /drill-java-exec/src/main/java/org/apache/drill/exec/store/parquet line 237 Java Problem {noformat} The issue appears to be due to using the raw type rather than the parameterized type. -- This message was sent by Atlassian Jira (v8.3.4#803005)
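The raw-vs-parameterized problem can be reproduced outside Drill. The class names below are invented; only the self-bounded shape mirrors {{GroupScanWithMetadataFilterer}}:

```java
public class RawTypeDemo {
  // Self-bounded base class, similar in shape to the Drill filterer
  // (names here are made up for illustration).
  static class Filterer<T extends Filterer<T>> {
    int count;

    @SuppressWarnings("unchecked")
    T filter() { count++; return (T) this; }
  }

  static class RowGroupFilterer extends Filterer<RowGroupFilterer> {
    int rowGroups() { return count; }
  }

  public static void main(String[] args) {
    // Parameterized use: filter() returns RowGroupFilterer, so the
    // fluent chain keeps its type.
    RowGroupFilterer ok = new RowGroupFilterer().filter().filter();
    System.out.println(ok.rowGroups()); // 2

    // Raw use: with the raw type, filter() returns the erased Filterer,
    // so a line like the one below fails to compile, as Eclipse reports
    // for AbstractParquetGroupScan:
    // RowGroupFilterer bad = new Filterer().filter();
  }
}
```

Using the subtype's own type parameter everywhere (instead of the raw base class) is what makes the fluent chain type-check.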
[jira] [Updated] (DRILL-7233) Format Plugin for HDF5
[ https://issues.apache.org/jira/browse/DRILL-7233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7233: --- Reviewer: Paul Rogers Labels: doc-impacting ready-to-commit (was: doc-impacting) > Format Plugin for HDF5 > -- > > Key: DRILL-7233 > URL: https://issues.apache.org/jira/browse/DRILL-7233 > Project: Apache Drill > Issue Type: New Feature >Affects Versions: 1.17.0 >Reporter: Charles Givre >Assignee: Charles Givre >Priority: Major > Labels: doc-impacting, ready-to-commit > Fix For: 1.18.0 > > > h2. Drill HDF5 Format Plugin > Per Wikipedia, Hierarchical Data Format (HDF) is a set of file formats > designed to store and organize large amounts of data. Originally developed at > the National Center for Supercomputing Applications, it is supported by The > HDF Group, a non-profit corporation whose mission is to ensure continued > development of HDF5 technologies and the continued accessibility of data > stored in HDF. > This plugin enables Apache Drill to query HDF5 files. > h3. Configuration > There are three configuration variables in this plugin: > type: This should be set to hdf5. > extensions: This is a list of the file extensions used to identify HDF5 > files. Typically HDF5 uses .h5 or .hdf5 as file extensions. This defaults to > .h5. > defaultPath: > h3. Example Configuration > For most uses, the configuration below will suffice to enable Drill to query > HDF5 files. > {{"hdf5": { > "type": "hdf5", > "extensions": [ > "h5" > ], > "defaultPath": null > }}} > h3. Usage > Since HDF5 can be viewed as a file system within a file, a single file can > contain many datasets. 
For instance, if you have a simple HDF5 file, a star > query will produce the following result: > {{apache drill> select * from dfs.test.`dset.h5`; > +---+---+---+--+ > | path | data_type | file_name | int_data > | > +---+---+---+--+ > | /dset | DATASET | dset.h5 | > [[1,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18],[19,20,21,22,23,24]] | > +---+---+---+--+}} > The actual data in this file is mapped to a column called int_data. In order > to effectively access the data, you should use Drill's FLATTEN() function on > the int_data column, which produces the following result. > {{apache drill> select flatten(int_data) as int_data from dfs.test.`dset.h5`; > +-+ > | int_data | > +-+ > | [1,2,3,4,5,6] | > | [7,8,9,10,11,12]| > | [13,14,15,16,17,18] | > | [19,20,21,22,23,24] | > +-+}} > Once you have the data in this form, you can access it similarly to how you > might access nested data in JSON or other files. > {{apache drill> SELECT int_data[0] as col_0, > . .semicolon> int_data[1] as col_1, > . .semicolon> int_data[2] as col_2 > . .semicolon> FROM ( SELECT flatten(int_data) AS int_data > . . . . . .)> FROM dfs.test.`dset.h5` > . . . . . .)> ); > +---+---+---+ > | col_0 | col_1 | col_2 | > +---+---+---+ > | 1 | 2 | 3 | > | 7 | 8 | 9 | > | 13| 14| 15| > | 19| 20| 21| > +---+---+---+}} > Alternatively, a better way to query the actual data in an HDF5 file is to > use the defaultPath field in your query. If the defaultPath field is defined > in the query, or via the plugin configuration, Drill will only return the > data, rather than the file metadata. > ** Note: Once you have determined which data set you are querying, it is > advisable to use this method to query HDF5 data. 
** > You can set the defaultPath variable in either the plugin configuration, or > at query time using the table() function as shown in the example below: > {{SELECT * > FROM table(dfs.test.`dset.h5` (type => 'hdf5', defaultPath => '/dset'))}} > This query will return the result below: > {{apache drill> SELECT * FROM table(dfs.test.`dset.h5` (type => 'hdf5', > defaultPath => '/dset')); > +---+---+---+---+---+---+ > | int_col_0 | int_col_1 | int_col_2 | int_col_3 | int_col_4 | int_col_5 | > +---+---+---+---+---+---+ > | 1 | 2 | 3 | 4 | 5 | 6 | > | 7 | 8 | 9 | 10| 11| 12| > | 13| 14| 15| 16| 17| 18| > | 19| 20| 21
[jira] [Comment Edited] (DRILL-7352) Introduce new checkstyle rules to make code style more consistent
[ https://issues.apache.org/jira/browse/DRILL-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956669#comment-16956669 ] Paul Rogers edited comment on DRILL-7352 at 11/17/19 2:08 AM: -- Start with the [existing set of rules|http://drill.apache.org/docs/apache-drill-contribution-guidelines/]. * Import order. Typical order: {{java}}, {{javax}}, {{org}}, {{com}}. Static imports at the top. * Use {{final}} aggressively on fields, do not use it on local variables or parameters. * {{case}} statements indent one level in from the {{switch}} statements. Once decisions are finalized, update the format files for Eclipse and IntelliJ. was (Author: paul.rogers): Start with the [existing set of rules|http://drill.apache.org/docs/apache-drill-contribution-guidelines/]. * Import order. Typical order: `java`, javax`, `org`, `com`. Static imports at the top. * Use `final` aggressively on fields, do not use it on local variables or parameters. * `case` statements indent one level in from the `switch` statements. Once decisions are finalized, update the format files for Eclipse and IntelliJ. > Introduce new checkstyle rules to make code style more consistent > - > > Key: DRILL-7352 > URL: https://issues.apache.org/jira/browse/DRILL-7352 > Project: Apache Drill > Issue Type: Task >Reporter: Vova Vysotskyi >Priority: Major > > Source - https://checkstyle.sourceforge.io/checks.html > List of rules to be enabled: > * [LeftCurly|https://checkstyle.sourceforge.io/config_blocks.html#LeftCurly] > - force placement of a left curly brace at the end of the line. 
> * > [RightCurly|https://checkstyle.sourceforge.io/config_blocks.html#RightCurly] > - force placement of a right curly brace > * > [NewlineAtEndOfFile|https://checkstyle.sourceforge.io/config_misc.html#NewlineAtEndOfFile] > * > [UnnecessaryParentheses|https://checkstyle.sourceforge.io/config_coding.html#UnnecessaryParentheses] > * > [MethodParamPad|https://checkstyle.sourceforge.io/config_whitespace.html#MethodParamPad] > * [InnerTypeLast > |https://checkstyle.sourceforge.io/config_design.html#InnerTypeLast] > * > [MissingOverride|https://checkstyle.sourceforge.io/config_annotation.html#MissingOverride] > * > [InvalidJavadocPosition|https://checkstyle.sourceforge.io/config_javadoc.html#InvalidJavadocPosition] > * > [ArrayTypeStyle|https://checkstyle.sourceforge.io/config_misc.html#ArrayTypeStyle] > * [UpperEll|https://checkstyle.sourceforge.io/config_misc.html#UpperEll] > and others -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (DRILL-7352) Introduce new checkstyle rules to make code style more consistent
[ https://issues.apache.org/jira/browse/DRILL-7352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16956669#comment-16956669 ] Paul Rogers edited comment on DRILL-7352 at 11/17/19 2:07 AM: -- Start with the [existing set of rules|http://drill.apache.org/docs/apache-drill-contribution-guidelines/]. * Import order. Typical order: `java`, javax`, `org`, `com`. Static imports at the top. * Use `final` aggressively on fields, do not use it on local variables or parameters. * `case` statements indent one level in from the `switch` statements. Once decisions are finalized, update the format files for Eclipse and IntelliJ. was (Author: paul.rogers): Start with the [existing set of rules|http://drill.apache.org/docs/apache-drill-contribution-guidelines/]. * Import order. Typical order: `java`, javax`, `org`, `com`. Static imports at the top. * Use `final` aggressively on fields, do not use it on local variables or parameters. Once decisions are finalized, update the format files for Eclipse and IntelliJ. > Introduce new checkstyle rules to make code style more consistent > - > > Key: DRILL-7352 > URL: https://issues.apache.org/jira/browse/DRILL-7352 > Project: Apache Drill > Issue Type: Task >Reporter: Vova Vysotskyi >Priority: Major > > Source - https://checkstyle.sourceforge.io/checks.html > List of rules to be enabled: > * [LeftCurly|https://checkstyle.sourceforge.io/config_blocks.html#LeftCurly] > - force placement of a left curly brace at the end of the line. 
> * > [RightCurly|https://checkstyle.sourceforge.io/config_blocks.html#RightCurly] > - force placement of a right curly brace > * > [NewlineAtEndOfFile|https://checkstyle.sourceforge.io/config_misc.html#NewlineAtEndOfFile] > * > [UnnecessaryParentheses|https://checkstyle.sourceforge.io/config_coding.html#UnnecessaryParentheses] > * > [MethodParamPad|https://checkstyle.sourceforge.io/config_whitespace.html#MethodParamPad] > * [InnerTypeLast > |https://checkstyle.sourceforge.io/config_design.html#InnerTypeLast] > * > [MissingOverride|https://checkstyle.sourceforge.io/config_annotation.html#MissingOverride] > * > [InvalidJavadocPosition|https://checkstyle.sourceforge.io/config_javadoc.html#InvalidJavadocPosition] > * > [ArrayTypeStyle|https://checkstyle.sourceforge.io/config_misc.html#ArrayTypeStyle] > * [UpperEll|https://checkstyle.sourceforge.io/config_misc.html#UpperEll] > and others -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7445) Create batch copier based on result set framework
Paul Rogers created DRILL-7445: -- Summary: Create batch copier based on result set framework Key: DRILL-7445 URL: https://issues.apache.org/jira/browse/DRILL-7445 Project: Apache Drill Issue Type: Improvement Reporter: Paul Rogers Assignee: Paul Rogers The result set framework now provides both a reader and writer. Provide a copier that copies batches using this framework. Such a copier can: * Copy selected records * Copy all records, such as for an SV2 or SV4 -- This message was sent by Atlassian Jira (v8.3.4#803005)
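The selected-records case can be sketched with plain lists standing in for batches and an {{int[]}} standing in for an SV2. This is an illustration of the copying idea only, not the result-set framework's API:

```java
import java.util.ArrayList;
import java.util.List;

public class CopierSketch {
  // Copy only the rows named by the selection vector (SV2-style
  // indirection). Copying all rows is the same loop over 0..n-1.
  static List<String> copySelected(List<String> batch, int[] sv2) {
    List<String> out = new ArrayList<>();
    for (int idx : sv2) {
      out.add(batch.get(idx));
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> batch = List.of("r0", "r1", "r2", "r3");
    System.out.println(copySelected(batch, new int[] {1, 3})); // [r1, r3]
  }
}
```

The real copier would walk column writers instead of list elements, but the indirection through the selection vector is the same.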
[jira] [Commented] (DRILL-7444) JSON blank result on SELECT when too much byte in multiple files on Drill embedded
[ https://issues.apache.org/jira/browse/DRILL-7444?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16974607#comment-16974607 ] Paul Rogers commented on DRILL-7444: This is an odd one; there are none of the usual schema ambiguity issues that can affect JSON. I'll take a look at this since I've got some JSON work pending. > JSON blank result on SELECT when too much byte in multiple files on Drill > embedded > -- > > Key: DRILL-7444 > URL: https://issues.apache.org/jira/browse/DRILL-7444 > Project: Apache Drill > Issue Type: Bug > Components: Storage - JSON >Affects Versions: 1.17.0 >Reporter: benj >Priority: Major > > 2 files (a.json and b.json) and the concat of these 2 files (ab.json) produce > different results on a simple _SELECT_ when using +Drill embedded+. > The problem appears above a certain number of bytes (~ 102 400 000 in my case) > {code:bash} > #!/bin/bash > # script gen.sh to reproduce the problem > for ((i=1;i<=$1;++i)); > do > echo -n '{"At":"' > for j in {1..999}; > do > echo -n 'ab' > done > echo '"}' > done > {code} > {noformat} > == I == > $ gen.sh 1 > a.json > $ gen.sh 239 > b.json > $ wc -c *.json > 1 a.json > 239 b.json > 10239 total > $ bash drill-embedded > apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; > ++ > | At | > ++ > | aab... | > ++ > => All is fine here > == II == > $ gen.sh 1 > a.json > $ gen.sh 240 > b.json > $ wc -c *.json > 1 a.json > 240 b.json > 10240 total > $ bash drill-embedded > apache drill> SELECT * FROM dfs.tmp.`*.json` LIMIT 1; > ++ > | At | > ++ > || > ++ > => Surprisingly, field `At` is empty > == III == > $ gen.sh 10240 > ab.json > $ wc -c *.json > 10240 ab.json > $ bash drill-embedded > apache drill> SELECT * FROM dfs.tmp.`ab.json` LIMIT 1; > ++ > |At | > ++ > | aab... 
| > ++ > => All is fine here although the number of lines is equal to case II > {noformat} > The version of Drill 1.17 tested here is the latest at 2019-11-13 > This problem doesn't appear with Drill embedded 1.16 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7442) Create multi-batch row set reader
Paul Rogers created DRILL-7442: -- Summary: Create multi-batch row set reader Key: DRILL-7442 URL: https://issues.apache.org/jira/browse/DRILL-7442 Project: Apache Drill Issue Type: Improvement Reporter: Paul Rogers Assignee: Paul Rogers The "row set" work provided a {{RowSetWriter}} and {{RowSetReader}} to write to and read from a single batch. The {{ResultSetLoader}} class provided a writer that spans multiple batches, handling schema changes across batches and so on. This ticket introduces a reader equivalent, the {{ResultSetReader}} that reads an entire result set of multiple batches, handling schema changes along the way. -- This message was sent by Atlassian Jira (v8.3.4#803005)
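The multi-batch reading pattern can be illustrated generically, with lists standing in for batches. This is not the actual {{ResultSetReader}} API, only the shape of the idea:

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class MultiBatchReaderSketch {
  // Iterates rows across a sequence of batches, hiding batch boundaries
  // from the caller, as the ResultSetReader idea does for value vectors.
  static class ResultSetReaderSketch implements Iterator<String> {
    private final Iterator<List<String>> batches;
    private Iterator<String> current = Collections.emptyIterator();

    ResultSetReaderSketch(List<List<String>> resultSet) {
      this.batches = resultSet.iterator();
    }

    @Override public boolean hasNext() {
      while (!current.hasNext() && batches.hasNext()) {
        // Crossing a batch boundary; the real reader would also detect
        // a schema change here before handing out the next row.
        current = batches.next().iterator();
      }
      return current.hasNext();
    }

    @Override public String next() { return current.next(); }
  }

  public static void main(String[] args) {
    var reader = new ResultSetReaderSketch(
        List.of(List.of("a", "b"), List.of(), List.of("c")));
    while (reader.hasNext()) {
      System.out.println(reader.next());
    }
  }
}
```

Note that the empty middle batch is skipped transparently; the caller sees one continuous stream of rows.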
[jira] [Created] (DRILL-7441) Fix issues with fillEmpties, offset vectors
Paul Rogers created DRILL-7441: -- Summary: Fix issues with fillEmpties, offset vectors Key: DRILL-7441 URL: https://issues.apache.org/jira/browse/DRILL-7441 Project: Apache Drill Issue Type: Bug Reporter: Paul Rogers Assignee: Paul Rogers Enable the vector validator with full testing of offset vectors. A number of operators trigger errors. Tracking down the issues, and adding detailed tests, it turns out that: * Drill has an informal standard that zero-length batches should have zero-length offset vectors, while a batch of size 1 will have offset vectors of size 2. Thus, zero-length is a special case. * Nullable, repeated and variable-width vectors have "fill empties" logic that is used in two places: when setting the value count and when preparing to write a new value. The current logic is not quite right for either case. Detailed vector checks fail due to inconsistencies in how the above works. This PR fixes those issues. -- This message was sent by Atlassian Jira (v8.3.4#803005)
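The offset-vector convention described above can be stated in a few lines. This is illustrative code, not Drill's vector API:

```java
import java.util.Arrays;

public class OffsetDemo {
  // Builds an offset vector for variable-width values: an empty batch
  // gets an empty offset vector (the zero-length special case), while
  // n > 0 values get n + 1 offsets.
  static int[] offsetsFor(int[] valueLengths) {
    if (valueLengths.length == 0) {
      return new int[0];
    }
    int[] offsets = new int[valueLengths.length + 1];
    for (int i = 0; i < valueLengths.length; i++) {
      offsets[i + 1] = offsets[i] + valueLengths[i];
    }
    return offsets;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(offsetsFor(new int[0])));       // []
    System.out.println(Arrays.toString(offsetsFor(new int[] {5})));    // [0, 5]
    System.out.println(Arrays.toString(offsetsFor(new int[] {3, 2}))); // [0, 3, 5]
  }
}
```

"Fill empties" then amounts to padding this array: a skipped row repeats the previous offset so the missing value has zero length.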
[jira] [Commented] (DRILL-7149) Kerberos Code Missing from Drill on YARN
[ https://issues.apache.org/jira/browse/DRILL-7149?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16968866#comment-16968866 ] Paul Rogers commented on DRILL-7149: I'm not a Kerberos expert, but I can perhaps provide a few hints. Drill information for enabling Kerberos is [here|http://drill.apache.org/docs/configuring-kerberos-security/]. My advice is to get one Drillbit working on CDH using these instructions. Then, use that information to configure DoY. The examples suggest putting the keytab file in the absolute location {{/etc/drill/conf}}. This is probably not the right choice on a CDH cluster. If the keytab is the same for all Drill nodes, then place the file in your {{$DRILL_SITE/conf}} directory. The site directory is copied from your DoY client machine to each Drill node ("localized" in YARN terminology.) You will need to change the config file to point to that location. IIRC, the {{$DRILL_SITE}} environment variable is available to Drill. The config file shown in the above-cited page is the one you create in your DoY client site directory. DoY will localize that file to every Drillbit running under YARN. If the documentation is accurate, then you only need the config options and the keytab file. You should be able to pass these along to Drill using the "stock" DoY. The trick would come in if you need to generate the keytab file per host. (Here my knowledge of Kerberos is very weak.) You will learn this as you try the step suggested above: running Drill on a CDH node by hand to learn what configuration is required. 
> Kerberos Code Missing from Drill on YARN > > > Key: DRILL-7149 > URL: https://issues.apache.org/jira/browse/DRILL-7149 > Project: Apache Drill > Issue Type: Bug > Components: Security >Affects Versions: 1.14.0 >Reporter: Charles Givre >Priority: Blocker > > My company is trying to deploy Drill using the Drill on Yarn (DoY) and we > have run into the issue that DoY does not seem to support passing Kerberos > credentials in order to interact with HDFS. > Upon checking the source code available in GIT > (https://github.com/apache/drill/blob/1.14.0/drill-yarn/src/main/java/org/apache/drill/yarn/core/) > and referring to Apache YARN documentation > (https://hadoop.apache.org/docs/current/hadoop-yarn/hadoop-yarn-site/YarnApplicationSecurity.html) > , we saw no section for passing the security credentials needed by the > application to interact with any Hadoop cluster services and applications. > This we feel needs to be added to the source code so that delegation tokens > can be passed inside the container for the process to be able access Drill > archive on HDFS and start. It probably should be added to the > ContainerLaunchContext within the ApplicationSubmissionContext for DoY as > suggested under Apache documentation. > > We tried the same DoY utility on a non-kerberised cluster and the process > started well. Although we ran into a different issue there of hosts getting > blacklisted > We tested with the Single Principal per cluster option. > -- This message was sent by Atlassian Jira (v8.3.4#803005)
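Following the advice in the comment above, the Drillbit-side settings from the cited documentation page would typically go in {{$DRILL_SITE}}'s {{drill-override.conf}}. A sketch only: verify the key names against the Kerberos documentation page for your Drill version, and the principal, realm, and keytab path below are placeholders to adapt to your cluster.

```
drill.exec.security: {
  user.auth.enabled: true,
  auth.mechanisms: ["KERBEROS"],
  # Placeholder principal and realm; substitute your cluster's values.
  auth.principal: "drill/_HOST@EXAMPLE.COM",
  # Keep the keytab in the site directory so DoY localizes it to each node.
  auth.keytab: "/path/to/site/conf/drill.keytab"
}
```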
[jira] [Created] (DRILL-7439) Batch count fixes for six additional operators
Paul Rogers created DRILL-7439: -- Summary: Batch count fixes for six additional operators Key: DRILL-7439 URL: https://issues.apache.org/jira/browse/DRILL-7439 Project: Apache Drill Issue Type: Bug Reporter: Paul Rogers Assignee: Paul Rogers Enables vector checks, and fixes batch count and vector issues for: * StreamingAggBatch * RuntimeFilterRecordBatch * FlattenRecordBatch * MergeJoinBatch * NestedLoopJoinBatch * LimitRecordBatch -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (DRILL-7434) TopNBatch constructs Union vector incorrectly
[ https://issues.apache.org/jira/browse/DRILL-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers reassigned DRILL-7434: -- Assignee: (was: Paul Rogers) > TopNBatch constructs Union vector incorrectly > - > > Key: DRILL-7434 > URL: https://issues.apache.org/jira/browse/DRILL-7434 > Project: Apache Drill > Issue Type: Bug >Reporter: Paul Rogers >Priority: Major > > The Union type is an "experimental" type that has never been completed. Yet, > we use it as if it works. > Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this > with the new batch validator enabled. This test creates a union vector. Here > is how the schema looks: > {noformat} > (UNION:OPTIONAL), subtypes=([FLOAT8, INT]), > children=([`internal` (MAP:REQUIRED), children=([`types` > (UINT1:REQUIRED)])]) > {noformat} > This is very hard to follow because the Union vector structure is complex > (and has many issues). Let's work through it. > We are looking at the {{MaterializedField}} for the union vector. It tells us > that this Union has two types: {{FLOAT8}} and {{INT}}. All good. > The Union has a vector per type, stored in an "internal map". That map shows > up as a child; it is there on the {{children}} list as {{internal}}. However, > the metadata claims that only one vector exists in that map: the {{types}} > vector (the one that tells us what type to use for each row.) The vectors > for {{FLOAT8}} and {{INT}} are missing. > If, however, we use our debugger and inspect the actual contents of the > {{internal}} map, we get the following: > {noformat} > [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` > (FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])] > {noformat} > That is, the internal map has the correct schema, but the Union vector itself > has the wrong (incomplete) schema. > This is an inherent design flaw with Union vector: it requires two copies of > the schema to be in sync. 
Further, {{MaterializedField}} was designed to be > immutable, but the map and Union types require mutation. If the Union simply > points to the actual Map vector {{MaterializedField}}, it will drift out of > date since the map vector creates a new schema each time we add fields; the > Union vector ends up pointing to the old one. > This is not a simple bug to fix, but the result of the bug is that the > vectors end up corrupted, as detected by the Batch Validator. In fact, the > bug itself is subtle. > The TopNBatch does pass vector validation. However, because of the incorrect > metadata, the downstream {{RemovingRecordBatch}} creates the derived Union > vector incorrectly: it fails to set the value count for the {{INT}} type. > {noformat} > Found one or more vector errors from RemovingRecordBatch > kl-type-INT - NullableIntVector: Row count = 3, but value count = 0 > {noformat} > Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} > type vector for a Union named {{kl}}. > The schema of the Union out of the {{RemovingRecordBatch}} has been truncated. > The Union itself: > {noformat} > [`kl` (UNION:OPTIONAL), subtypes=([FLOAT8, INT]), > children=([`internal` (MAP:REQUIRED), children=([`types` > (UINT1:REQUIRED)])])] > {noformat} > The internal map: > {noformat} > [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`int` > (INT:OPTIONAL)])] > {noformat} > Notice that the {{FLOAT8}} vector has disappeared: the Union vector metadata > claims we have such a vector, but the internal map does not actually contain > the vector. > The root cause is that the vector checker (indeed, any client) will call > {{UnionVector.getMember(type)}} to get a vector for a type. This method > includes a switch statement to call, say, {{getIntVector()}}. That method, in > turn, creates the vector if it does not exist. > But, since we are reading, we have an existing data batch. When we create a > new vector, we create it as zero size. 
Thus, we think we have n records > (three in this case), but we actually have zero. This kinda-sorta works > because the type vector won't ever contain an entry for the "runt" vector, so > we won't actually access data. But, this is an inconsistent structure. It > breaks if we peer inside, as we are doing in the batch validator. > If we check for this case, we now get: > {noformat} > Found one or more vector errors from RemovingRecordBatch > kl - UnionVector: Union vector includes type INT, but the internal map has no > matching member > {noformat} > This is why Union is such a mess: is this a bug or just a very fragile > design? I claim bug. -- This message was sent by Atlassian Jira (v8.3.4#803005)
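The "two copies of the schema" flaw described above can be sketched in a few lines. This is a minimal illustration with hypothetical stand-in classes, not Drill's real {{MaterializedField}} API: because the Union's metadata takes a snapshot of the internal map's children, members added to the map later never appear in the Union's copy of the schema.

```java
import java.util.ArrayList;
import java.util.List;

public class UnionSchemaDrift {

    // Stand-in for an immutable field schema: a name plus a snapshot of children.
    static class Field {
        final String name;
        final List<Field> children;

        Field(String name, List<Field> children) {
            this.name = name;
            this.children = new ArrayList<>(children); // defensive copy = snapshot
        }
    }

    // Returns {actual map child count, stale count cached by the union}.
    static int[] childCounts() {
        List<Field> mapChildren = new ArrayList<>();
        mapChildren.add(new Field("types", new ArrayList<>()));

        // The union captures the map's schema while it holds only `types`.
        Field unionView = new Field("internal", mapChildren);

        // Member vectors are added to the real map afterwards...
        mapChildren.add(new Field("float8", new ArrayList<>()));
        mapChildren.add(new Field("int", new ArrayList<>()));

        // ...but the union's cached schema still lists a single child:
        // the stale, incomplete metadata the batch validator flags.
        return new int[] { mapChildren.size(), unionView.children.size() };
    }

    public static void main(String[] args) {
        int[] counts = childCounts();
        System.out.println("map children = " + counts[0]
                + ", union's cached view = " + counts[1]);
    }
}
```

The only durable fixes are to share one mutable schema object or to refresh the cached copy on every mutation, which is why the ticket calls this a design flaw rather than a localized bug.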
[jira] [Commented] (DRILL-7434) TopNBatch constructs Union vector incorrectly
[ https://issues.apache.org/jira/browse/DRILL-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966320#comment-16966320 ] Paul Rogers commented on DRILL-7434: See DRILL-7436 for a workaround (to materialize all type vectors). Someone should look deeper for a longer-term fix, such as removing unused subtypes. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7435) Project operator incorrectly adds a LATE type to union vector
[ https://issues.apache.org/jira/browse/DRILL-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966319#comment-16966319 ] Paul Rogers commented on DRILL-7435: DRILL-7436 provides a workaround fix. Someone should probably look carefully to work out the detailed semantics in this area: how should we handle `LATE` with the Union vector? > Project operator incorrectly adds a LATE type to union vector > - > > Key: DRILL-7435 > URL: https://issues.apache.org/jira/browse/DRILL-7435 > Project: Apache Drill > Issue Type: Bug >Reporter: Paul Rogers >Priority: Major > > Run Drill with a fix for DRILL-7434. Now, another test fails: > {{TestJsonReader.testTypeCase()}} fails when it tries to set the value count. > Evidently the Project operator has added the {{LATE}} type to the Union > vector. However, there is no vector type associated with the {{LATE}} type. > An attempt to get the member of this type throws an exception. > The simple workaround is to special-case this type when setting the value > count. The longer-term fix is to not add the {{LATE}} type to a union vector. 
> The problem appears to occur here: > {noformat} > Daemon Thread [2240a19e-344e-9a8b-f3d9-2a1550662b1b:frag:0:0] (Suspended > (breakpoint at line 2091 in TypeProtos$MajorType$Builder)) > TypeProtos$MajorType$Builder.addSubType(TypeProtos$MinorType) line: > 2091 > DefaultReturnTypeInference.getType(List, > FunctionAttributes) line: 58 > FunctionTemplate$ReturnType.getType(List, > FunctionAttributes) line: 195 > > DrillSimpleFuncHolder(DrillFuncHolder).getReturnType(List) > line: 401 > DrillFuncHolderExpr.(String, DrillFuncHolder, > List, ExpressionPosition) line: 39 > DrillSimpleFuncHolder(DrillFuncHolder).getExpr(String, > List, ExpressionPosition) line: 113 > ExpressionTreeMaterializer.addCastExpression(LogicalExpression, > TypeProtos$MajorType, FunctionLookupContext, ErrorCollector, boolean) line: > 235 > > ExpressionTreeMaterializer$MaterializeVisitor(ExpressionTreeMaterializer$AbstractMaterializeVisitor).visitIfExpression(IfExpression, > FunctionLookupContext) line: 638 > > ExpressionTreeMaterializer$MaterializeVisitor(ExpressionTreeMaterializer$AbstractMaterializeVisitor).visitIfExpression(IfExpression, > Object) line: 332 > IfExpression.accept(ExprVisitor, V) line: 65 > ExpressionTreeMaterializer.materialize(LogicalExpression, > Map, ErrorCollector, FunctionLookupContext, > boolean, boolean) line: 165 > ExpressionTreeMaterializer.materialize(LogicalExpression, > VectorAccessible, ErrorCollector, FunctionLookupContext, boolean, boolean) > line: 143 > ProjectRecordBatch.setupNewSchemaFromInput(RecordBatch) line: 482 > ProjectRecordBatch.setupNewSchema() line: 571 > ProjectRecordBatch(AbstractUnaryRecordBatch).innerNext() line: 99 > ProjectRecordBatch.innerNext() line: 144 > ... 
> {noformat} > This appears to be processing the if statement in the following test query: > {noformat} > .sqlQuery("select case when is_bigint(field1) " + > "then field1 when is_list(field1) then field1[0] " + > "when is_map(field1) then t.field1.inner1 end f1 from > cp.`jsoninput/union/a.json` t") > {noformat} > The problem appears to be that a function says it takes data of type LATE, > and then that data is added to the Union. Not sure of the exact solution, but > simply omitting the LATE value from the Union seems to work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
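The suggested fix, omitting {{LATE}} when registering subtypes, can be sketched as follows. This is a hypothetical guard with a stand-in enum, not the real {{TypeProtos}} API: {{LATE}} means "type not yet known" and has no backing vector class, so it must never become a union member.

```java
import java.util.EnumSet;
import java.util.Set;

public class UnionSubtypeGuard {
    // Stand-in for a few of Drill's minor types, including the LATE placeholder.
    enum MinorType { INT, FLOAT8, LIST, MAP, LATE }

    final Set<MinorType> subtypes = EnumSet.noneOf(MinorType.class);

    // Returns true if the type was registered as a union member.
    // LATE is silently omitted, as the ticket suggests.
    boolean addSubType(MinorType type) {
        if (type == MinorType.LATE) {
            return false;
        }
        return subtypes.add(type);
    }
}
```

Guarding at the registration point keeps every downstream consumer honest: no caller can later ask for a member vector of a type that has no vector implementation.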
[jira] [Created] (DRILL-7436) Fix record count, vector structure issues in several operators
Paul Rogers created DRILL-7436: -- Summary: Fix record count, vector structure issues in several operators Key: DRILL-7436 URL: https://issues.apache.org/jira/browse/DRILL-7436 Project: Apache Drill Issue Type: Bug Reporter: Paul Rogers Assignee: Paul Rogers This is the next in a continuing series of fixes to the container record count, batch record count, and vector structure in several operators. This batch represents the smallest change needed to add checking for the Filter operator. In order to get Filter to pass checks, many of its upstream operators needed to be fixed. -- This message was sent by Atlassian Jira (v8.3.4#803005)
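The kind of check DRILL-7436 enables can be illustrated with a small sketch. This is not Drill's actual BatchValidator, just a hypothetical routine showing the invariant these fixes enforce: every vector's value count must equal the batch's row count, or downstream operators will read past the valid data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class RecordCountCheck {
    // Compare each named vector's value count against the batch row count,
    // producing messages in the style the Jira descriptions quote.
    static List<String> validate(String operator, int rowCount,
                                 Map<String, Integer> valueCounts) {
        List<String> errors = new ArrayList<>();
        for (Map.Entry<String, Integer> e : valueCounts.entrySet()) {
            if (e.getValue() != rowCount) {
                errors.add(e.getKey() + " - Row count = " + rowCount
                        + ", but value count = " + e.getValue());
            }
        }
        if (!errors.isEmpty()) {
            errors.add(0, "Found one or more vector errors from " + operator);
        }
        return errors;
    }
}
```

Running such a check after every operator is what surfaced the Union vector problems in the first place: a batch that "works" only because nobody reads the runt vector still fails this structural invariant.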
[jira] [Updated] (DRILL-7435) Project operator incorrectly adds a LATE type to union vector
[ https://issues.apache.org/jira/browse/DRILL-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7435: --- Description: Run Drill with a fix for DRILL-7434. Now, another test fails: {{TestJsonReader.testTypeCase()}} fails when it tries to set the value count. Evidently the Project operator has added the {{LATE}} type to the Union vector. However, there is no vector type associated with the {{LATE}} type. An attempt to get the member of this type throws an exception. The simple workaround is to special-case this type when setting the value count. The longer-term fix is to not add the {{LATE}} type to a union vector. The problem appears to occur here: {noformat} Daemon Thread [2240a19e-344e-9a8b-f3d9-2a1550662b1b:frag:0:0] (Suspended (breakpoint at line 2091 in TypeProtos$MajorType$Builder)) TypeProtos$MajorType$Builder.addSubType(TypeProtos$MinorType) line: 2091 DefaultReturnTypeInference.getType(List, FunctionAttributes) line: 58 FunctionTemplate$ReturnType.getType(List, FunctionAttributes) line: 195 DrillSimpleFuncHolder(DrillFuncHolder).getReturnType(List) line: 401 DrillFuncHolderExpr.(String, DrillFuncHolder, List, ExpressionPosition) line: 39 DrillSimpleFuncHolder(DrillFuncHolder).getExpr(String, List, ExpressionPosition) line: 113 ExpressionTreeMaterializer.addCastExpression(LogicalExpression, TypeProtos$MajorType, FunctionLookupContext, ErrorCollector, boolean) line: 235 ExpressionTreeMaterializer$MaterializeVisitor(ExpressionTreeMaterializer$AbstractMaterializeVisitor).visitIfExpression(IfExpression, FunctionLookupContext) line: 638 ExpressionTreeMaterializer$MaterializeVisitor(ExpressionTreeMaterializer$AbstractMaterializeVisitor).visitIfExpression(IfExpression, Object) line: 332 IfExpression.accept(ExprVisitor, V) line: 65 ExpressionTreeMaterializer.materialize(LogicalExpression, Map, ErrorCollector, FunctionLookupContext, boolean, boolean) line: 165 ExpressionTreeMaterializer.materialize(LogicalExpression, VectorAccessible, ErrorCollector, FunctionLookupContext, boolean, boolean) line: 143 ProjectRecordBatch.setupNewSchemaFromInput(RecordBatch) line: 482 ProjectRecordBatch.setupNewSchema() line: 571 ProjectRecordBatch(AbstractUnaryRecordBatch).innerNext() line: 99 ProjectRecordBatch.innerNext() line: 144 ... {noformat} This appears to be processing the if statement in the following test query: {noformat} .sqlQuery("select case when is_bigint(field1) " + "then field1 when is_list(field1) then field1[0] " + "when is_map(field1) then t.field1.inner1 end f1 from cp.`jsoninput/union/a.json` t") {noformat} The problem appears to be that a function says it takes data of type LATE, and then that data is added to the Union. Not sure of the exact solution, but simply omitting the LATE value from the Union seems to work. was: Run Drill with a fix for DRILL-7434. Now, another test fails: {{TestJsonReader.testTypeCase()}} fails when it tries to set the value count. Evidently the JSON reader has added the {{LATE}} type to the Union vector. However, there is no vector type associated with the {{LATE}} type. An attempt to get the member of this type throws an exception. The simple workaround is to special-case this type when setting the value count. The longer-term fix is to not add the {{LATE}} type to a union vector. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (DRILL-7435) Project operator incorrectly adds a LATE type to union vector
[ https://issues.apache.org/jira/browse/DRILL-7435?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7435: --- Summary: Project operator incorrectly adds a LATE type to union vector (was: JSON reader incorrectly adds a LATE type to union vector) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7435) JSON reader incorrectly adds a LATE type to union vector
Paul Rogers created DRILL-7435: -- Summary: JSON reader incorrectly adds a LATE type to union vector Key: DRILL-7435 URL: https://issues.apache.org/jira/browse/DRILL-7435 Project: Apache Drill Issue Type: Bug Reporter: Paul Rogers Run Drill with a fix for DRILL-7434. Now, another test fails: {{TestJsonReader.testTypeCase()}} fails when it tries to set the value count. Evidently the JSON reader has added the {{LATE}} type to the Union vector. However, there is no vector type associated with the {{LATE}} type. An attempt to get the member of this type throws an exception. The simple workaround is to special-case this type when setting the value count. The longer-term fix is to not add the {{LATE}} type to a union vector. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Assigned] (DRILL-7434) TopNBatch constructs Union vector incorrectly
[ https://issues.apache.org/jira/browse/DRILL-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers reassigned DRILL-7434: -- Assignee: Paul Rogers -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7434) TopNBatch constructs Union vector incorrectly
[ https://issues.apache.org/jira/browse/DRILL-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16966243#comment-16966243 ] Paul Rogers commented on DRILL-7434: A workaround is to force creation of the child type vectors in {{UnionVector.setValueCount()}}. This is a workaround because, if there are no values for a given type, we should not need the child vector. A better long-term solution would be to simply remove child types for which there are no values. This is left as an exercise for another time. -- This message was sent by Atlassian Jira (v8.3.4#803005)
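The workaround of forcing child-vector creation in {{UnionVector.setValueCount()}} can be sketched as follows. These are hypothetical minimal classes, not Drill's real {{UnionVector}}: the point is that setting the value count touches every declared subtype, so a lazily created "runt" vector can never disagree with the batch's row count.

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

public class UnionValueCountSketch {
    enum MinorType { INT, FLOAT8 }

    static class MemberVector {
        int valueCount; // a freshly created member starts at zero
    }

    final Set<MinorType> declaredSubtypes = EnumSet.noneOf(MinorType.class);
    final Map<MinorType, MemberVector> members = new EnumMap<>(MinorType.class);

    // Lazy creation on access: the behavior that produced zero-size vectors
    // when a reader asked for a member that had never held data.
    MemberVector getMember(MinorType type) {
        return members.computeIfAbsent(type, t -> new MemberVector());
    }

    // Workaround: materialize every declared member and give each the
    // batch's value count, so no member can later report zero values.
    void setValueCount(int count) {
        for (MinorType t : declaredSubtypes) {
            getMember(t).valueCount = count;
        }
    }
}
```

As the comment notes, this trades memory for consistency; the cleaner long-term fix is to drop declared subtypes that hold no values, so the metadata and the member map agree by construction.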
[jira] [Updated] (DRILL-7434) TopNBatch constructs Union vector incorrectly
[ https://issues.apache.org/jira/browse/DRILL-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Paul Rogers updated DRILL-7434: --- Description: The Union type is an "experimental" type that has never been completed. Yet, we use it as if it works. Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run this with the new batch validator enabled. This test creates a union vector. Here is how the schema looks: {noformat} (UNION:OPTIONAL), subtypes=([FLOAT8, INT]), children=([`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)])]) {noformat} This is very hard to follow because the Union vector structure is complex (and has many issues.) Let's work through it. We are looking at the {{MaterializedField}} for the union vector. It tells us that this Union has two types: {{FLOAT8}} and {{INT}}. All good. The Union has a vector per type, stored in an "internal map". That map shows up as a child; it appears on the {{children}} list as {{internal}}. However, the metadata claims that only one vector exists in that map: the {{types}} vector (the one that tells us what type to use for each row.) The vectors for {{FLOAT8}} and {{INT}} are missing. If, however, we use our debugger and inspect the actual contents of the {{internal}} map, we get the following: {noformat} [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` (FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])] {noformat} That is, the internal map has the correct schema, but the Union vector itself has the wrong (incomplete) schema. This is an inherent design flaw with the Union vector: it requires two copies of the schema to be in sync. Further, {{MaterializedField}} was designed to be immutable, but the map and Union types require mutation. If the Union simply points to the actual Map vector {{MaterializedField}}, it will drift out of date since the map vector creates a new schema each time we add fields; the Union vector ends up pointing to the old one. This is not a simple bug to fix, but the result of the bug is that the vectors end up corrupted, as detected by the Batch Validator. In fact, the bug itself is subtle. The TopNBatch does pass vector validation. However, because of the incorrect metadata, the downstream {{RemovingRecordBatch}} creates the derived Union vector incorrectly: it fails to set the value count for the {{INT}} type. {noformat} Found one or more vector errors from RemovingRecordBatch kl-type-INT - NullableIntVector: Row count = 3, but value count = 0 {noformat} Where {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} type vector for a Union named {{kl}}. The schema of the Union out of the {{RemovingRecordBatch}} has been truncated. The Union itself: {noformat} [`kl` (UNION:OPTIONAL), subtypes=([FLOAT8, INT]), children=([`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)])])] {noformat} The internal map: {noformat} [`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`int` (INT:OPTIONAL)])] {noformat} Notice that the {{FLOAT8}} vector has disappeared: the Union vector metadata claims we have such a vector, but the internal map does not actually contain the vector. The root cause is that the vector checker (indeed, any client) will call {{UnionVector.getMember(type)}} to get a vector for a type. This method includes a switch statement to call, say, {{getIntVector()}}. That method, in turn, creates the vector if it does not exist. But, since we are reading, we have an existing data batch. When we create a new vector, we create it as zero size. Thus, we think we have n records (three in this case), but we actually have zero. This kinda-sorta works because the type vector won't ever contain an entry for the "runt" vector, so we won't actually access data. But, this is an inconsistent structure. It breaks if we peer inside, as we are doing in the batch validator. If we check for this case, we now get: {noformat} Found one or more vector errors from RemovingRecordBatch kl - UnionVector: Union vector includes type INT, but the internal map has no matching member {noformat} This is why Union is such a mess: is this a bug or just a very fragile design? I claim bug. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (DRILL-7434) TopNBatch constructs Union vector incorrectly
Paul Rogers created DRILL-7434: -- Summary: TopNBatch constructs Union vector incorrectly Key: DRILL-7434 URL: https://issues.apache.org/jira/browse/DRILL-7434 Project: Apache Drill Issue Type: Bug Reporter: Paul Rogers

The Union type is an "experimental" type that has never been completed. Yet, we use it as if it works. Consider the test {{TestTopNSchemaChanges.testMissingColumn()}}. Run it with the new batch validator enabled. The test creates a union vector. Here is how the schema looks:

{noformat}
(UNION:OPTIONAL), subtypes=([FLOAT8, INT]), children=([`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)])])
{noformat}

This is very hard to follow because the Union vector structure is complex (and has many issues). Let's work through it. We are looking at the {{MaterializedField}} for the union vector. It tells us that this Union has two types: {{FLOAT8}} and {{INT}}. All good.

The Union has a vector per type, stored in an "internal map". That map shows up as a child: it appears on the {{children}} list as {{internal}}. However, the metadata claims that only one vector exists in that map: the {{types}} vector (the one that tells us which type to use for each row). The vectors for {{FLOAT8}} and {{INT}} are missing.

If, however, we use our debugger and inspect the actual contents of the {{internal}} map, we get the following:

{noformat}
[`internal` (MAP:REQUIRED), children=([`types` (UINT1:REQUIRED)], [`float8` (FLOAT8:OPTIONAL)], [`int` (INT:OPTIONAL)])]
{noformat}

That is, the internal map has the correct schema, but the Union vector itself has the wrong (incomplete) schema. This is an inherent design flaw in the Union vector: it requires two copies of the schema to be kept in sync. Further, {{MaterializedField}} was designed to be immutable, but the map and Union types require mutation.
If the Union simply points to the actual Map vector {{MaterializedField}}, it will drift out of date, since the map vector creates a new schema each time we add fields; the Union vector ends up pointing to the old one.

This is not a simple bug to fix, and the result is that the vectors end up corrupted, as detected by the Batch Validator. In fact, the bug itself is subtle. The TopNBatch does pass vector validation. However, because of the incorrect metadata, the downstream {{RemovingRecordBatch}} creates the derived Union vector incorrectly: it fails to set the value count for the {{INT}} type.

{noformat}
Found one or more vector errors from RemovingRecordBatch
kl-type-INT - NullableIntVector: Row count = 3, but value count = 0
{noformat}

Here {{kl-type-INT}} is an ad-hoc way of saying we are checking the {{INT}} type vector for a Union named {{kl}}.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
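The drift described above (an immutable schema object that is replaced, not updated, when children are added) can be sketched with plain-Java stand-ins. {{FieldSchema}} below is a hypothetical miniature of {{MaterializedField}}, not Drill code:

```java
import java.util.*;

// Sketch of the drift: the map's schema is rebuilt immutably on each
// child addition, so a stale reference held by the union keeps seeing
// the old child list. (Hypothetical classes, not Drill's.)
public class SchemaDrift {

    // Immutable schema node, standing in for MaterializedField.
    static final class FieldSchema {
        final String name;
        final List<String> children;
        FieldSchema(String name, List<String> children) {
            this.name = name;
            this.children = Collections.unmodifiableList(new ArrayList<>(children));
        }
        // Adding a child yields a NEW schema object; the old one is unchanged.
        FieldSchema withChild(String child) {
            List<String> next = new ArrayList<>(children);
            next.add(child);
            return new FieldSchema(name, next);
        }
    }

    public static void main(String[] args) {
        FieldSchema internal = new FieldSchema("internal", List.of("types"));
        // The union captures a reference to the map's schema at creation time...
        FieldSchema unionsCopy = internal;
        // ...but later additions replace the map's schema with a new object.
        internal = internal.withChild("float8").withChild("int");
        System.out.println("map sees:   " + internal.children);
        System.out.println("union sees: " + unionsCopy.children);
    }
}
```

The map ends up listing three children while the union's stale copy still lists only {{types}}, which is exactly the truncated schema shown in the report above.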
[jira] [Created] (DRILL-7428) Drill incorrectly allows a repeated map field to be projected to top level
Paul Rogers created DRILL-7428: -- Summary: Drill incorrectly allows a repeated map field to be projected to top level Key: DRILL-7428 URL: https://issues.apache.org/jira/browse/DRILL-7428 Project: Apache Drill Issue Type: Bug Reporter: Paul Rogers

Consider the following query from the [Mongo DB tests|https://github.com/apache/drill/blob/master/contrib/storage-mongo/src/test/java/org/apache/drill/exec/store/mongo/MongoTestConstants.java#L80]:

{noformat}
select t.name as name, t.topping.type as type from mongo.%s.`%s` t where t.sales >= 150
{noformat}

The query is used in [{{TestMongoQueries.testUnShardedDBInShardedClusterWithProjectionAndFilter()}}|https://github.com/apache/drill/blob/master/contrib/storage-mongo/src/test/java/org/apache/drill/exec/store/mongo/TestMongoQueries.java#L89]. Here it turns out that {{topping}} is a repeated map, and the query projects members of that map to the top level. The data has five rows, but 24 values in the repeated map. The Project operator allows the projection, resulting in an output batch in which most vectors have 5 values, but the {{topping}} column, now at the top level and no longer in the map, has 24 values.

As a result, the first five values, formerly associated with the first record, are now associated with the first five top-level records, while the values formerly associated with records 1-4 are lost. Thus, this is a data corruption bug.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
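The misalignment can be illustrated with a small stand-alone sketch (plain Java with a hypothetical layout; Drill's actual repeated vectors differ in detail): a repeated column is a flat values array plus per-row offsets, and discarding the offsets silently reassigns values to the wrong rows.

```java
import java.util.*;

// Sketch of why projecting a repeated column to the top level corrupts
// data: the per-row offsets are what tie values to rows. (Hypothetical
// layout, not Drill's actual vector classes.)
public class RepeatedProjection {

    // Row i owns the slice values[offsets[i] .. offsets[i+1]).
    static List<String> valuesForRow(int[] offsets, String[] values, int row) {
        return Arrays.asList(values).subList(offsets[row], offsets[row + 1]);
    }

    public static void main(String[] args) {
        // 5 rows, 24 total values; row 0 alone owns 5 of them.
        int[] offsets = {0, 5, 10, 16, 20, 24};
        String[] values = new String[24];
        for (int i = 0; i < 24; i++) { values[i] = "v" + i; }

        // Correct, offset-aware view of row 0:
        System.out.println("row 0 owns: " + valuesForRow(offsets, values, 0));

        // The buggy projection: the 24 values become a 24-entry top-level
        // column. "Row 0" now holds only values[0]; values[1..4], which
        // belonged to row 0, appear to belong to rows 1-4 instead.
        System.out.println("after bad projection, row 0 holds: " + values[0]);
    }
}
```

Once the offsets are dropped, no downstream operator can recover which row a value came from, which is why this is corruption rather than a mere cosmetic mismatch.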
[jira] [Commented] (DRILL-7426) Json support lists of different types
[ https://issues.apache.org/jira/browse/DRILL-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961336#comment-16961336 ] Paul Rogers commented on DRILL-7426: [~cgivre], I should have seen that one coming... But, seriously, a provided schema turns out to be the best way to predict the future. > Json support lists of different types > - > > Key: DRILL-7426 > URL: https://issues.apache.org/jira/browse/DRILL-7426 > Project: Apache Drill > Issue Type: Improvement > Components: Documentation >Affects Versions: 1.16.0 >Reporter: benj >Priority: Trivial > > With a file.json like > {code:json} > { > "name": "toto", > "info": [["LOAD", []]], > "response": 1 > } > {code} > A simple SELECT gives an error > {code:sql} > apache drill> SELECT * FROM dfs.test.`file.json`; > Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a > value of type LIST. Drill does not support lists of different types. > {code} > But there is an option _exec.enable_union_type_ that allows these request > {code:sql} > apache drill> ALTER SESSION SET `exec.enable_union_type` = true; > apache drill> SELECT * FROM dfs.test.`file.json`; > +--+---+--+ > | name | info | response | > +--+---+--+ > | toto | [["LOAD",[]]] | 1| > +--+---+--+ > 1 row selected (0.283 seconds) > {code} > The usage of this option is not evident. So, it will be useful to mention > after the error message the possibility to set it. > {noformat} > Error: UNSUPPORTED_OPERATION ERROR: In a list of type VARCHAR, encountered a > value of type LIST. Drill does not support lists of different types. SET > the option 'exec.enable_union_type' to true and try again; > {noformat} > This behaviour is used for other error, example: > {noformat} > ... > Error: UNSUPPORTED_OPERATION ERROR: This query cannot be planned possibly due > to either a cartesian join or an inequality join. 
> If a cartesian or inequality join is used intentionally, set the option > 'planner.enable_nljoin_for_scalar_only' to false and try again. > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (DRILL-7426) Json support lists of different types
[ https://issues.apache.org/jira/browse/DRILL-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961322#comment-16961322 ] Paul Rogers commented on DRILL-7426:

[~cgivre], the query in question used the wildcard, which asks to read all columns. In general, the reader cannot predict the future: it cannot tell that `info` will contain mixed data. However, Drill should work if the query were `SELECT name, response FROM ...`. If not, then that is a bug, and a fixable one.

The issue is that the user seems to need the data. One workaround is to rewrite the JSON so that the array is represented as an object:

{noformat}
{ "name": "toto", "info": { "command": "LOAD", "values": [] }, "response": 1 }
{noformat}

But here we run into the empty-array issue: we don't know the type of the `values` array...

In general, JSON can represent a wider set of data structures than relational tuples can, and it has always been an open question how much of that variety Drill should handle. I think most users end up running an ETL to convert the data into a relational format (and then store the data in Parquet for better performance). So, one could debate whether it is worth adding more complexity to Drill.

-- This message was sent by Atlassian Jira (v8.3.4#803005)
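Why the object-shaped workaround helps can be sketched with plain-Java stand-ins for JSON values (a hypothetical helper, not Drill code): Drill's readers need each field to carry a single type across its values, and the original inner array mixes a string with a list.

```java
import java.util.*;

// Sketch: per-field type uniformity, the property Drill's JSON reader
// needs. Java objects stand in for JSON values; not Drill code.
public class TypeConsistency {

    // True if every element of the list has the same runtime type.
    static boolean uniformType(List<?> values) {
        if (values.isEmpty()) { return true; }
        Class<?> first = values.get(0).getClass();
        for (Object v : values) {
            if (!first.isInstance(v)) { return false; }
        }
        return true;
    }

    public static void main(String[] args) {
        // Original form: "info": [["LOAD", []]] -- the inner list mixes a
        // string (VARCHAR) with a list (LIST), which is what the
        // UNSUPPORTED_OPERATION error complains about.
        List<Object> innerList = Arrays.asList("LOAD", Collections.emptyList());
        System.out.println("inner list uniform? " + uniformType(innerList));

        // Workaround: "info": { "command": "LOAD", "values": [] } -- each
        // named field now carries exactly one type.
        List<Object> commandField = Arrays.asList("LOAD");
        System.out.println("command field uniform? " + uniformType(commandField));
    }
}
```

The restructured form passes the uniformity check for every field, leaving only the (separate) empty-array typing question for `values`.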
[jira] [Commented] (DRILL-7426) Json support lists of different types
[ https://issues.apache.org/jira/browse/DRILL-7426?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16961257#comment-16961257 ] Paul Rogers commented on DRILL-7426:

As it turns out, this is a known limitation of Drill. Drill is a relational engine, designed to serve relational clients such as JDBC and ODBC. Although Drill has a Union data type, that type remains experimental and is not fully supported. At present, the Union type can be passed through the scan operator to a SqlLine client, where it is converted to a string for display, as shown in your example. However, it is not supported by most other operators, resulting in the failure you reported.

The fundamental problem is that it is not clear how the Union type should work with clients (JDBC, ODBC) that require a traditional relational schema. Drill does not support extended SQL syntax (such as SQL++), just traditional relational SQL.

We have seen cases in which JSON authors use arrays as a compact representation of a tuple:

{noformat}
[ 10, "fred", "flintstone", "male", 12.34 ]
{noformat}

Is this the case with your example, which contains, it seems, both a string and an array? At present, Drill has no way to map such a tuple into a relational structure. One could imagine converting the array into, say, a Map with field names defined somehow. Here, "all text mode" will not help, as that mode can't handle array/string conflicts, only string/number conflicts.
-- This message was sent by Atlassian Jira (v8.3.4#803005)