Thanks to all for the replies so far. Some of the approaches work, if the search targets a single tag. However, none work, if the search targets a dictionary. For example:
# file a.json { "id": 1 } # file b.json { "id": 2, "yourtag": { "foo": { "foo1" = 1 } } } SELECT id, yourtag.foo.foo1 FROM dfs.`/bla/*/*` WHERE yourtag.foo is not null ## alt: REPEATED_CONTAINS(yourtag,'foo') # alt2: yourtag.foo <> 'null' I haven't found a way to write the search, so it returns the tuple (2, 1). The problem is that instead of returning simply null, if yourtag.foo doesn't exist, drill returns "Missing function implementation: [repeated_contains(MAP-REQUIRED, VARCHAR-REQUIRED)]. or 'Table \'yourtag\' not found'. Sebastian Paul Rogers <par0...@yahoo.com.INVALID> writes: > Hi Sebastian, > > On the query below, you've got a string comparison: mytag = 'hello'. I > suspect this is your problem. > > By mentioning mytag, Drill knows you want to project that column from your > file scan. When a reader notices that a file does not have that column, it > will make one up and set it to null. As it turns out, the made-up column will > be Nullable INT. As a result, your result set will contain some batches of > data with VARCHAR columns (from scanners that found the column) and some > batches in which the column is INT (for those readers that did not find the > column.) > > Drill processes data in batches. Batches contain the data from one file (or, > if the file is big, a single file may produce multiple batches.) At its > lowest level, Drill can handle the case in which column mytag is INT in some > batches (those where no such field was found in JSON), and VARCHAR in others > (where the field was found.) Unfortunately, higher-level code in Drill cannot > handle conflicting schema, causing endless user confusion. > > > I suspect the NumberFormatException occurs because of the conflict between > numeric and VARCHAR column types. Further, any attempt to group, sort or > aggregate will also fail due to a schema conflict. > > This is a longstanding "feature" of Drill that we discuss in the Learning > Apache Drill book. > > In this particular case, one would expect Drill to have used a short-circuit > evaluation of the AND conjunction, evaluate mytag = 'hello' only when the > column is not null. This should have filtered out all the INT batches since > they are NULL. Perhaps there is a bug there somewhere. > > > Note that the problem may be worse. If you have JSON like the following, > Drill will also fail: > > {a: 10, mytag: null} > {a: 20, mytag: "hello"} > > The above will fail because Drill must guess a column type when it sees the > first mytag: null. It will guess nullable Int. Then, when it sees mytag: > "hello", it will try to write a VARCHAR into an INT column and will fail. > > I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello' Not > sure if this will work, but worth a try. > > The longer-term fix is that the team is working on a schema feature that will > let you tell Drill that the column is VARCHAR even if it is not found in the > data source. The feature is available for CSV and some other file types, but > not yet for JSON. > > An obvious enhancement in this one case is that Drill itself can tell your > intent is for the column to be VARCHAR. The analyzer should be able to infer > that column type without you telling Drill this fact. As far as I know, this > addition is not yet part of the schema system plan, but would be a nice > additional tweak. > > > Thanks, > - Paul > > > > On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister > <sfisc...@uwaterloo.ca> wrote: > > While it's possible to test for 'is not null', you actually cannot query the > tag, because it provides an system error due to the optimizer doing its job. > Take this query as example: > > SELECT mytag > FROM dfs.`/bla/*/*` > WHERE mytag is not null and mytag = 'hello' > > Some files contain "mytag", others don't. > > The query works with just the clause 'mytag is not null', because all json > files missing mytag will get mytag set to null, which is type compatible with > the filter clause. However it actually does not work with "mytag is not null > and mytag = 'hello'" because I get the following error for files where mytag > is not present. > > SYSTEM ERROR: NumberFormatException: hello > > The physical plan shows that the query optimizer removes the clause 'mytag is > not null', because it's redundant. However, it comes at the expense of not > being able to query tags that are not present in all files. > > Is there a way to outsmart the optimizer and first execute a filter on "mytag > is not null" before "mytag = 'hello'" in a single query? > > Sebastian > > > Ted Dunning <ted.dunn...@gmail.com> writes: > >> Keep in mind the danger if testing Foo!=null. That doesn't work and catches >> me by surprise all the time. Foo is null and variants are what you need. >> >> On Sat, Sep 14, 2019, 4:56 PM hanu mapr <hanu.m...@gmail.com> wrote: >> >>> Hello Sebastian, >>> >>> By default Drill sets the field 'foo' to null for the files that don't >>> contain it. I am of the opinion that the condition where foo = 'bar' should >>> result in false for all those files which don't contain the field. >>> Please can you send across the queries which you have run and the observed >>> result. >>> >>> Just off the top of my head, some query like the below one might work >>> select file_name from dfs.`/bla/*/*` where foo != null. --- You might want >>> to remove duplicate entries. (of course this also results in the rows which >>> contain the field and are null). >>> >>> Hope this helps. >>> >>> Thanks >>> >>> >>> On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister < >>> sfisc...@uwaterloo.ca> wrote: >>> >>> > Hi, >>> > >>> > When searching multiple directories, drill only searches fields that are >>> > common to all files (see the json data model). Is there a way to query a >>> > directory and list all files that contain a certain field? >>> > >>> > In other words, I would like to use the workaround in this way: >>> > >>> > select * from (select fqn from dfs.`/bla/*/*` where foo exists) where foo >>> > = 'bar' >>> > >>> > Or is there another way to do this? I dynamically get more files, so >>> > finding the files should be included in the query. >>> > >>> > An alternative would be to execute the query such that it sets the field >>> > 'foo' to null for all files that don't contain it. However, I don't know >>> > how to execute this. >>> > >>> > Thanks, >>> > Sebastian >>> > >>> >