While it's possible to test for 'is not null', you actually cannot query the 
tag, because it provides an system error due to the optimizer doing its job. 
Take this query as example:

SELECT  mytag
FROM dfs.`/bla/*/*` 
WHERE mytag is not null and mytag = 'hello'

Some files contain "mytag", others don't.

The query works with just the clause 'mytag is not null', because all json 
files missing mytag will get mytag set to null, which is type compatible with 
the filter clause. However it actually does not work with "mytag is not null 
and mytag = 'hello'" because I get the following error for files where mytag is 
not present.

SYSTEM ERROR: NumberFormatException: hello

The physical plan shows that the query optimizer removes the clause 'mytag is 
not null', because it's redundant. However, it comes at the expense of not 
being able to query tags that are not present in all files.

Is there a way to outsmart the optimizer and first execute a filter on "mytag 
is not null" before "mytag = 'hello'" in a single query?

  Sebastian


Ted Dunning <ted.dunn...@gmail.com> writes:

> Keep in mind the danger if testing Foo!=null. That doesn't work and catches
> me by surprise all the time. Foo is null and variants are what you need.
>
> On Sat, Sep 14, 2019, 4:56 PM hanu mapr <hanu.m...@gmail.com> wrote:
>
>> Hello Sebastian,
>>
>> By default Drill sets the field 'foo' to null for the files that don't
>> contain it. I am of the opinion that the condition where foo = 'bar' should
>> result in false for all those files which don't contain the field.
>> Please can you send across the queries which you have run and the observed
>> result.
>>
>> Just off the top of my head, some query like the below one might work
>> select file_name from dfs.`/bla/*/*` where foo != null. --- You might want
>> to remove duplicate entries. (of course this also results in the rows which
>> contain the field and are null).
>>
>> Hope this helps.
>>
>> Thanks
>>
>>
>> On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister <
>> sfisc...@uwaterloo.ca> wrote:
>>
>> > Hi,
>> >
>> > When searching multiple directories, drill only searches fields that are
>> > common to all files (see the json data model). Is there a way to query a
>> > directory and list all files that contain a certain field?
>> >
>> > In other words, I would like to use the workaround in this way:
>> >
>> > select * from (select fqn from dfs.`/bla/*/*` where foo exists) where foo
>> > = 'bar'
>> >
>> > Or is there another way to do this? I dynamically get more files, so
>> > finding the files should be included in the query.
>> >
>> > An alternative would be to execute the query such that it sets the field
>> > 'foo' to null for all files that don't contain it. However, I don't know
>> > how to execute this.
>> >
>> > Thanks,
>> >   Sebastian
>> >
>>

Reply via email to