This issue is addressed in DRILL-3477. My proposed solution is to use VarBinary as the default type, as this has the lowest precedence in the rules for implicit cast.
Making this change breaks some other things though, so I haven't been able to merge the fix yet. On Tue, Jul 28, 2015 at 2:23 PM, Parth Chandra <[email protected]> wrote: > Hi Stefan > This is the same old issue: Drill does an initial scan to determine the > type of a field. In cases where Drill encounters nulls in the data it > defaults to using a Nullable Int as the type (not a good choice perhaps). > This leads to all sorts of issues (most of which you're hitting). > There is an effort to improve this (DRILL-3228) but it will be a while > before this work is completed. > > In the meantime, I can only suggest a workaround : use as cast around > your columns - > > select p.type, coalesce( cast(p.dimensions.dim_type as varchar(20)), > cast(p.dimensions.type as varchar(20))) dimensions_type, count(*) from > `test.json` as p where occurred_at > '2015-07-26' and p.type in > ('plan.item.added','plan.item.removed') group by p.type, > coalesce(cast(p.dimensions.dim_type as varchar(20)), cast(p.dimensions.type > as varchar(20))); > > > > > > On Mon, Jul 27, 2015 at 4:59 AM, Stefán Baxter <[email protected]> > wrote: > > > Hi, > > > > It seems that null values can trigger a column to be treated as a numeric > > one, in expressions evaluation, regardless of content or other indicators > > and that fields in substructures can affect same-named-fields in parent > > structure. > > (1.2-SNAPSHOT, parquet files) > > > > I have JSON data that can be reduced to to this: > > > > - {"occurred_at":"2015-07-26 > > > > > 08:45:41.234","type":"plan.item.added","dimensions":{"type":null,"dim_type":"Unspecified","category":"Unspecified","sub_category":null}} > > - {"occurred_at":"2015-07-26 > > > > > 08:45:43.598","type":"plan.item.removed","dimensions":{"type":"Unspecified","dim_type":null,"category":"Unspecified","sub_category":null}} > > - {"occurred_at":"2015-07-26 > > 08:45:44.241","type":"plan.item.removed","dimensions":{"type":"To > > See","category":"Nature","sub_category":"Waterfalls"}} > > > > * notice the discrepancy in the dimensions structure that the type field > is > > either called type or dim_type (slightly relevant for the rest of this > > case) > > > > > > *1. Query where dimensions are not involved* > > > > select p.type, count(*) from > > dfs.tmp.`/analytics/processed/<some-tenant>/events` as p where > occurred_at > > > '2015-07-26' and p.type in ('plan.item.added','plan.item.removed') > group > > by p.type; > > +--------------------+---------+ > > | type | EXPR$1 | > > +--------------------+---------+ > > | plan.item.removed | 947 | > > | plan.item.added | 40342 | > > +--------------------+---------+ > > 2 rows selected (0.508 seconds) > > > > > > *2. Same query but involves dimension.type as well* > > > > select p.type, coalesce(p.dimensions.dim_type, p.dimensions.type) > > dimensions_type, count(*) from > > dfs.tmp.`/analytics/processed/<some-tenant>/events` as p where > occurred_at > > > '2015-07-26' and p.type in ('plan.item.added','plan.item.removed') > group > > by p.type, coalesce(p.dimensions.dim_type, p.dimensions.type); > > > > Error: SYSTEM ERROR: NumberFormatException: To See > > Fragment 2:0 > > [Error Id: 4756f549-cc47-43e5-899e-10a11efb60ea on localhost:31010] > > (state=,code=0) > > > > > > I can provide test data if this is not enough to reproduce this bug. > > > > Regards, > > -Stefán > > > -- Steven Phillips Software Engineer mapr.com
