Re: Check presence of field in json file

Paul Rogers Sun, 15 Sep 2019 11:06:07 -0700

Hi Sebastian,

On the query below, you've got a string comparison: mytag = 'hello'. I suspect 
this is your problem.

By mentioning mytag, Drill knows you want to project that column from your file 
scan. When a reader notices that a file does not have that column, it will make 
one up and set it to null. As it turns out, the made-up column will be Nullable 
INT. As a result, your result set will contain some batches of data with 
VARCHAR columns (from scanners that found the column) and some batches in which 
the column is INT (for those readers that did not find the column.)

Drill processes data in batches. Batches contain the data from one file (or, if 
the file is big, a single file may produce multiple batches.) At its lowest 
level, Drill can handle the case in which column mytag is INT in some batches 
(those where no such field was found in JSON), and VARCHAR in others (where the 
field was found.) Unfortunately, higher-level code in Drill cannot handle 
conflicting schema, causing endless user confusion.

I suspect the NumberFormatException occurs because of the conflict between 
numeric and VARCHAR column types. Further, any attempt to group, sort or 
aggregate will also fail due to a schema conflict.

This is a longstanding "feature" of Drill that we discuss in the Learning 
Apache Drill book.

In this particular case, one would expect Drill to have used a short-circuit 
evaluation of the AND conjunction, evaluate mytag = 'hello' only when the 
column is not null. This should have filtered out all the INT batches since 
they are NULL. Perhaps there is a bug there somewhere.

Note that the problem may be worse. If you have JSON like the following, Drill 
will also fail:

{a: 10, mytag: null}
{a: 20, mytag: "hello"}

The above will fail because Drill must guess a column type when it sees the 
first mytag: null. It will guess nullable Int. Then, when it sees mytag: 
"hello", it will try to write a VARCHAR into an INT column and will fail.

I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello' Not 
sure if this will work, but worth a try.

The longer-term fix is that the team is working on a schema feature that will 
let you tell Drill that the column is VARCHAR even if it is not found in the 
data source. The feature is available for CSV and some other file types, but 
not yet for JSON.

An obvious enhancement in this one case is that Drill itself can tell your 
intent is for the column to be VARCHAR. The analyzer should be able to infer 
that column type without you telling Drill this fact. As far as I know, this 
addition is not yet part of the schema system plan, but would be a nice 
additional tweak.

Thanks,
- Paul

    On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister 
<[email protected]> wrote:  

 While it's possible to test for 'is not null', you actually cannot query the 
tag, because it provides an system error due to the optimizer doing its job. 
Take this query as example:

SELECT  mytag
FROM dfs.`/bla/*/*` 
WHERE mytag is not null and mytag = 'hello'

Some files contain "mytag", others don't.

The query works with just the clause 'mytag is not null', because all json 
files missing mytag will get mytag set to null, which is type compatible with 
the filter clause. However it actually does not work with "mytag is not null 
and mytag = 'hello'" because I get the following error for files where mytag is 
not present.

SYSTEM ERROR: NumberFormatException: hello

The physical plan shows that the query optimizer removes the clause 'mytag is 
not null', because it's redundant. However, it comes at the expense of not 
being able to query tags that are not present in all files.

Is there a way to outsmart the optimizer and first execute a filter on "mytag 
is not null" before "mytag = 'hello'" in a single query?

  Sebastian

Ted Dunning <[email protected]> writes:

> Keep in mind the danger if testing Foo!=null. That doesn't work and catches
> me by surprise all the time. Foo is null and variants are what you need.
>
> On Sat, Sep 14, 2019, 4:56 PM hanu mapr <[email protected]> wrote:
>
>> Hello Sebastian,
>>
>> By default Drill sets the field 'foo' to null for the files that don't
>> contain it. I am of the opinion that the condition where foo = 'bar' should
>> result in false for all those files which don't contain the field.
>> Please can you send across the queries which you have run and the observed
>> result.
>>
>> Just off the top of my head, some query like the below one might work
>> select file_name from dfs.`/bla/*/*` where foo != null. --- You might want
>> to remove duplicate entries. (of course this also results in the rows which
>> contain the field and are null).
>>
>> Hope this helps.
>>
>> Thanks
>>
>>
>> On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister <
>> [email protected]> wrote:
>>
>> > Hi,
>> >
>> > When searching multiple directories, drill only searches fields that are
>> > common to all files (see the json data model). Is there a way to query a
>> > directory and list all files that contain a certain field?
>> >
>> > In other words, I would like to use the workaround in this way:
>> >
>> > select * from (select fqn from dfs.`/bla/*/*` where foo exists) where foo
>> > = 'bar'
>> >
>> > Or is there another way to do this? I dynamically get more files, so
>> > finding the files should be included in the query.
>> >
>> > An alternative would be to execute the query such that it sets the field
>> > 'foo' to null for all files that don't contain it. However, I don't know
>> > how to execute this.
>> >
>> > Thanks,
>> >  Sebastian
>> >
>>

Re: Check presence of field in json file

Reply via email to