We experienced the same issue on first "null" value tripping up Drill's schema on-demand logic that assumes certain data type for "null" causing subsequent exception when a non-null is encountered. We added logic inside our storage plugin to make this work. But good to hear there is a general solution being worked on. Logically, one can defer the type determination until first non-null value.
Regarding "If Drill had a schema", I do fully understand the need for schemas for purposes of vectorization. Still having schema on demand has been a major benefit for our use case. So the tricky part is how to achieve the best of both worlds. -- Jiang On Wed, Jan 15, 2020 at 7:09 PM Paul Rogers <par0...@yahoo.com.invalid> wrote: > Hi Ted, > > Thanks much for the feedback! Good test cases indeed. The good news is > that we're close to finishing a "V2" JSON reader that smooths over a few > more JSON quirks like the "first column null" issue that can cause problems: > > {n: 1, v: null}{n: 2, v: "Gotcha! Wasn't Int, is actually VARCHAR!"} > > Glad your queries work. You gave an example that had fooled me multiple > times: > > select nest.u from dfs.root.`/Users/tdunning/x.json`; > > > The trick here is that Drill has no schema. All the parser can tell is, > "hey, I've got a two-part name, `nest.u`. For me a two part name means > schema.table or table.column, so, since `nest` isn't a schema, it must be a > table.Oh, look, no such table exists. FAIL!" Using a three-part name works > (IIRC): > > select t.nest.u from dfs.root.`/Users/tdunning/x.json` t; > > > Now Drill sees that `t` is a table name, and works its way down from there. > > > If Drill had a schema, then the planner could first check if `nest` is a > schema, then if it is a table, then if it is a structured field in the > query. Impala can do this because it has a schema; Drill can't. We can hope > that, with the new schema work being added to Drill, that your query will > "do the right thing" in the future. > > Adding `columns` to your query won't help: the `columns` name is valid in > only one place: when working with CSV (or, more generally, delimited) data > with no headers. > > This gets back to Jaing's point: we could really use better/more > documentation. We're good at the bare basics, "such-and-so syntax exists", > but we're not as good at explaining how to solve problems using Drill > features. The Learning Apache Drill book tries to address some holes. > Clearly, if you have a hard time with this, being part of the team that > created Drill, we've got a bit of work to do! (To be honest, neither Impala > nor Presto are much better in the "how to" department.) > > > Additional use cases/frustrations are very welcome as you find them. > > > Thanks, > - Paul > > > > On Wednesday, January 15, 2020, 3:44:09 PM PST, Ted Dunning < > ted.dunn...@gmail.com> wrote: > > On Wed, Jan 15, 2020 at 2:58 PM Paul Rogers <par0...@yahoo.com.invalid> > wrote: > > > ... > > > > For example, Ted, you mention lack of nullability on structure members. > > But, Drill represents structures as MAPs, and MAPs can have nullable > > members. So, there is likely more to your request than the short summary > > suggests. Perhaps you can help us understand this a bit more. > > > > This was quite a while ago. > > I was reading JSON data with substructures of variable form. > > I think, however, that this impression is old news. I just tried it and it > works the way I wanted. > > Here is my data: > > {"top":"a","nest":{"u":1, "v":"other"}} > {"top":"b","nest":{"v":"this", "w":"that"}} > > And here are some queries that behave just the way that I wanted: > > apache drill> *select* * *from* dfs.root.`/Users/tdunning/x.json`; > > +-----+-------------------------+ > > | *top* | * nest * | > > +-----+-------------------------+ > > | a | {"u":1,"v":"other"} | > > | b | {"v":"this","w":"that"} | > > +-----+-------------------------+ > > 2 rows selected (0.079 seconds) > > apache drill> *select* nest *from* dfs.root.`/Users/tdunning/x.json`; > > +-------------------------+ > > | * nest * | > > +-------------------------+ > > | {"u":1,"v":"other"} | > > | {"v":"this","w":"that"} | > > +-------------------------+ > > 2 rows selected (0.114 seconds) > > apache drill> *select* nest.u *from* dfs.root.`/Users/tdunning/x.json`; > > Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 11: Table > 'nest' not found > > > > [Error Id: b2100faf-adf7-453e-957f-56726b96e06f ] (state=,code=0) > > apache drill> *select* columns.nest.u *from* dfs.root. > `/Users/tdunning/x.json`; > > Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 14: Table > 'columns' not found > > > > [Error Id: a793e6bd-c2ed-477a-9f23-70d67b2b85df ] (state=,code=0) > > apache drill> *select* x.nest.u *from* dfs.root.`/Users/tdunning/x.json` x; > > +--------+ > > | *EXPR$0* | > > +--------+ > > | 1 | > > | null | > > +--------+ > > 2 rows selected (0.126 seconds) > apache drill> >