I was hoping to use Drill to aggregate information across a heterogeneous set of JSON files. With a defined schema, I could go to an RDBMS.
The function sqlType() doesn't seem to exist in Drill. I also found no reference to it in the manual. Also the function isNull(), which you mentioned, doesn't seem to exist. This also doesn't work: WHERE yourtag like '%foo%'. The idea was to just see whether the map as json-string contains the name. Casting yourtag to a string doesn't work either. I would need some function to test whether 'yourtag.foo' exists regardless of whether 'yourtag' or 'yourtag.foo' exits; and yourtag is a MAP type. Sebastian Paul Rogers <[email protected]> writes: > Hi Sebastian, > > The nested map is a very hard case! Without a schema, there is no way for the > reader of a.json to know it should create a map. Even if some part of Drill > were to infer that `yourtag` is a map, nothing would figure out the name and > type of the items within the map. This is a known weakness in the current > implementation. (Though, to be fair, without a schema, it is nearly > impossible to get consistency.) > > > If you only want the b.json data (the one with the nested map), then maybe > you can go by column type using a nested select: > > SELECT id, yourtag.foo.foo1 AS myFoo FROM ( > SELECT id, yourtag WHERE sqlType(yourTag) = 'MAP' FROM ...) > > The inner select returns only the rows where the yourtag column is a MAP. > This *might* weed out the batches in which the column did not exist and Drill > made up a nullable INT column. > > Not 100% sure that this will work, but worth a try. > > > Note that foo <> null will never be true. And foo <> 'null' tries to match a > string with the value of "null" (not a null string). If you do want to check > for null, use isNull(foo). > > Note also that a MAP is never null; though the fields within it can be. (The > MAP type is not nullable in Drill.) > > Thanks, > - Paul > > > > On Wednesday, September 18, 2019, 9:20:24 PM PDT, Sebastian Fischmeister > <[email protected]> wrote: > > Thanks to all for the replies so far. Some of the approaches work, if the > search targets a single tag. However, none work, if the search targets a > dictionary. For example: > > # file a.json > { "id": 1 } > # file b.json > { "id": 2, > "yourtag": { "foo": { "foo1" = 1 } } } > > > SELECT id, yourtag.foo.foo1 > FROM dfs.`/bla/*/*` > WHERE yourtag.foo is not null ## alt: REPEATED_CONTAINS(yourtag,'foo') # > alt2: yourtag.foo <> 'null' > > > I haven't found a way to write the search, so it returns the tuple (2, 1). > The problem is that instead of returning simply null, if yourtag.foo doesn't > exist, drill returns "Missing function implementation: > [repeated_contains(MAP-REQUIRED, VARCHAR-REQUIRED)]. or 'Table \'yourtag\' > not found'. > > Sebastian > > > > Paul Rogers <[email protected]> writes: > >> Hi Sebastian, >> >> On the query below, you've got a string comparison: mytag = 'hello'. I >> suspect this is your problem. >> >> By mentioning mytag, Drill knows you want to project that column from your >> file scan. When a reader notices that a file does not have that column, it >> will make one up and set it to null. As it turns out, the made-up column >> will be Nullable INT. As a result, your result set will contain some batches >> of data with VARCHAR columns (from scanners that found the column) and some >> batches in which the column is INT (for those readers that did not find the >> column.) >> >> Drill processes data in batches. Batches contain the data from one file (or, >> if the file is big, a single file may produce multiple batches.) At its >> lowest level, Drill can handle the case in which column mytag is INT in some >> batches (those where no such field was found in JSON), and VARCHAR in others >> (where the field was found.) Unfortunately, higher-level code in Drill >> cannot handle conflicting schema, causing endless user confusion. >> >> >> I suspect the NumberFormatException occurs because of the conflict between >> numeric and VARCHAR column types. Further, any attempt to group, sort or >> aggregate will also fail due to a schema conflict. >> >> This is a longstanding "feature" of Drill that we discuss in the Learning >> Apache Drill book. >> >> In this particular case, one would expect Drill to have used a short-circuit >> evaluation of the AND conjunction, evaluate mytag = 'hello' only when the >> column is not null. This should have filtered out all the INT batches since >> they are NULL. Perhaps there is a bug there somewhere. >> >> >> Note that the problem may be worse. If you have JSON like the following, >> Drill will also fail: >> >> {a: 10, mytag: null} >> {a: 20, mytag: "hello"} >> >> The above will fail because Drill must guess a column type when it sees the >> first mytag: null. It will guess nullable Int. Then, when it sees mytag: >> "hello", it will try to write a VARCHAR into an INT column and will fail. >> >> I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello' >> Not sure if this will work, but worth a try. >> >> The longer-term fix is that the team is working on a schema feature that >> will let you tell Drill that the column is VARCHAR even if it is not found >> in the data source. The feature is available for CSV and some other file >> types, but not yet for JSON. >> >> An obvious enhancement in this one case is that Drill itself can tell your >> intent is for the column to be VARCHAR. The analyzer should be able to infer >> that column type without you telling Drill this fact. As far as I know, this >> addition is not yet part of the schema system plan, but would be a nice >> additional tweak. >> >> >> Thanks, >> - Paul >> >> >> >> On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister >><[email protected]> wrote: >> >> While it's possible to test for 'is not null', you actually cannot query >>the tag, because it provides an system error due to the optimizer doing its >>job. Take this query as example: >> >> SELECT mytag >> FROM dfs.`/bla/*/*` >> WHERE mytag is not null and mytag = 'hello' >> >> Some files contain "mytag", others don't. >> >> The query works with just the clause 'mytag is not null', because all json >> files missing mytag will get mytag set to null, which is type compatible >> with the filter clause. However it actually does not work with "mytag is not >> null and mytag = 'hello'" because I get the following error for files where >> mytag is not present. >> >> SYSTEM ERROR: NumberFormatException: hello >> >> The physical plan shows that the query optimizer removes the clause 'mytag >> is not null', because it's redundant. However, it comes at the expense of >> not being able to query tags that are not present in all files. >> >> Is there a way to outsmart the optimizer and first execute a filter on >> "mytag is not null" before "mytag = 'hello'" in a single query? >> >> Sebastian >> >> >> Ted Dunning <[email protected]> writes: >> >>> Keep in mind the danger if testing Foo!=null. That doesn't work and catches >>> me by surprise all the time. Foo is null and variants are what you need. >>> >>> On Sat, Sep 14, 2019, 4:56 PM hanu mapr <[email protected]> wrote: >>> >>>> Hello Sebastian, >>>> >>>> By default Drill sets the field 'foo' to null for the files that don't >>>> contain it. I am of the opinion that the condition where foo = 'bar' should >>>> result in false for all those files which don't contain the field. >>>> Please can you send across the queries which you have run and the observed >>>> result. >>>> >>>> Just off the top of my head, some query like the below one might work >>>> select file_name from dfs.`/bla/*/*` where foo != null. --- You might want >>>> to remove duplicate entries. (of course this also results in the rows which >>>> contain the field and are null). >>>> >>>> Hope this helps. >>>> >>>> Thanks >>>> >>>> >>>> On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister < >>>> [email protected]> wrote: >>>> >>>> > Hi, >>>> > >>>> > When searching multiple directories, drill only searches fields that are >>>> > common to all files (see the json data model). Is there a way to query a >>>> > directory and list all files that contain a certain field? >>>> > >>>> > In other words, I would like to use the workaround in this way: >>>> > >>>> > select * from (select fqn from dfs.`/bla/*/*` where foo exists) where foo >>>> > = 'bar' >>>> > >>>> > Or is there another way to do this? I dynamically get more files, so >>>> > finding the files should be included in the query. >>>> > >>>> > An alternative would be to execute the query such that it sets the field >>>> > 'foo' to null for all files that don't contain it. However, I don't know >>>> > how to execute this. >>>> > >>>> > Thanks, >>>> > Sebastian >>>> > >>>> >>
