Re: Check presence of field in json file

Sebastian Fischmeister Thu, 19 Sep 2019 22:20:45 -0700

I was hoping to use Drill to aggregate information across a heterogeneous set 
of JSON files. With a defined schema, I could go to an RDBMS.


The function sqlType() doesn't seem to exist in Drill. I also found no 
reference to it in the manual. Also the function isNull(), which you mentioned, 
doesn't seem to exist.

This also doesn't work: WHERE yourtag like '%foo%'. The idea was to just see 
whether the map as json-string contains the name. Casting yourtag to a string 
doesn't work either.

I would need some function to test whether 'yourtag.foo' exists regardless of 
whether 'yourtag' or 'yourtag.foo' exits; and yourtag is a MAP type.


  Sebastian

Paul Rogers <[email protected]> writes:

> Hi Sebastian,
>
> The nested map is a very hard case! Without a schema, there is no way for the 
> reader of a.json to know it should create a map. Even if some part of Drill 
> were to infer that `yourtag` is a map, nothing would figure out the name and 
> type of the items within the map. This is a known weakness in the current 
> implementation. (Though, to be fair, without a schema, it is nearly 
> impossible to get consistency.)
>
>
> If you only want the b.json data (the one with the nested map), then maybe 
> you can go by column type using a nested select:
>
> SELECT id, yourtag.foo.foo1 AS myFoo FROM (
>   SELECT id, yourtag WHERE sqlType(yourTag) = 'MAP' FROM ...)
>
> The inner select returns only the rows where the yourtag column is a MAP. 
> This *might* weed out the batches in which the column did not exist and Drill 
> made up a nullable INT column.
>
> Not 100% sure that this will work, but worth a try.
>
>
> Note that foo <> null will never be true. And foo <> 'null' tries to match a 
> string with the value of "null" (not a null string). If you do want to check 
> for null, use isNull(foo).
>
> Note also that a MAP is never null; though the fields within it can be. (The 
> MAP type is not nullable in Drill.)
>
> Thanks,
> - Paul
>
>  
>
>     On Wednesday, September 18, 2019, 9:20:24 PM PDT, Sebastian Fischmeister 
> <[email protected]> wrote:  
>  
>  Thanks to all for the replies so far. Some of the approaches work, if the 
> search targets a single tag. However, none work, if the search targets a 
> dictionary. For example:
>
> # file a.json
> { "id": 1 }
> # file b.json
> { "id": 2,
>   "yourtag": { "foo": { "foo1" = 1 } } }
>
>
> SELECT id, yourtag.foo.foo1
> FROM dfs.`/bla/*/*` 
> WHERE yourtag.foo is not null  ## alt: REPEATED_CONTAINS(yourtag,'foo') # 
> alt2: yourtag.foo <> 'null'
>
>
> I haven't found a way to write the search, so it returns the tuple (2, 1). 
> The problem is that instead of returning simply null, if yourtag.foo doesn't 
> exist, drill returns "Missing function implementation: 
> [repeated_contains(MAP-REQUIRED, VARCHAR-REQUIRED)]. or 'Table \'yourtag\' 
> not found'.
>
>   Sebastian
>
>
>
> Paul Rogers <[email protected]> writes:
>
>> Hi Sebastian,
>>
>> On the query below, you've got a string comparison: mytag = 'hello'. I 
>> suspect this is your problem.
>>
>> By mentioning mytag, Drill knows you want to project that column from your 
>> file scan. When a reader notices that a file does not have that column, it 
>> will make one up and set it to null. As it turns out, the made-up column 
>> will be Nullable INT. As a result, your result set will contain some batches 
>> of data with VARCHAR columns (from scanners that found the column) and some 
>> batches in which the column is INT (for those readers that did not find the 
>> column.)
>>
>> Drill processes data in batches. Batches contain the data from one file (or, 
>> if the file is big, a single file may produce multiple batches.) At its 
>> lowest level, Drill can handle the case in which column mytag is INT in some 
>> batches (those where no such field was found in JSON), and VARCHAR in others 
>> (where the field was found.) Unfortunately, higher-level code in Drill 
>> cannot handle conflicting schema, causing endless user confusion.
>>
>>
>> I suspect the NumberFormatException occurs because of the conflict between 
>> numeric and VARCHAR column types. Further, any attempt to group, sort or 
>> aggregate will also fail due to a schema conflict.
>>
>> This is a longstanding "feature" of Drill that we discuss in the Learning 
>> Apache Drill book.
>>
>> In this particular case, one would expect Drill to have used a short-circuit 
>> evaluation of the AND conjunction, evaluate mytag = 'hello' only when the 
>> column is not null. This should have filtered out all the INT batches since 
>> they are NULL. Perhaps there is a bug there somewhere.
>>
>>
>> Note that the problem may be worse. If you have JSON like the following, 
>> Drill will also fail:
>>
>> {a: 10, mytag: null}
>> {a: 20, mytag: "hello"}
>>
>> The above will fail because Drill must guess a column type when it sees the 
>> first mytag: null. It will guess nullable Int. Then, when it sees mytag: 
>> "hello", it will try to write a VARCHAR into an INT column and will fail.
>>
>> I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello' 
>> Not sure if this will work, but worth a try.
>>
>> The longer-term fix is that the team is working on a schema feature that 
>> will let you tell Drill that the column is VARCHAR even if it is not found 
>> in the data source. The feature is available for CSV and some other file 
>> types, but not yet for JSON.
>>
>> An obvious enhancement in this one case is that Drill itself can tell your 
>> intent is for the column to be VARCHAR. The analyzer should be able to infer 
>> that column type without you telling Drill this fact. As far as I know, this 
>> addition is not yet part of the schema system plan, but would be a nice 
>> additional tweak.
>>
>>
>> Thanks,
>> - Paul
>>
>>  
>>
>>    On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister 
>><[email protected]> wrote:  
>>  
>>  While it's possible to test for 'is not null', you actually cannot query 
>>the tag, because it provides an system error due to the optimizer doing its 
>>job. Take this query as example:
>>
>> SELECT  mytag
>> FROM dfs.`/bla/*/*` 
>> WHERE mytag is not null and mytag = 'hello'
>>
>> Some files contain "mytag", others don't.
>>
>> The query works with just the clause 'mytag is not null', because all json 
>> files missing mytag will get mytag set to null, which is type compatible 
>> with the filter clause. However it actually does not work with "mytag is not 
>> null and mytag = 'hello'" because I get the following error for files where 
>> mytag is not present.
>>
>> SYSTEM ERROR: NumberFormatException: hello
>>
>> The physical plan shows that the query optimizer removes the clause 'mytag 
>> is not null', because it's redundant. However, it comes at the expense of 
>> not being able to query tags that are not present in all files.
>>
>> Is there a way to outsmart the optimizer and first execute a filter on 
>> "mytag is not null" before "mytag = 'hello'" in a single query?
>>
>>   Sebastian
>>
>>
>> Ted Dunning <[email protected]> writes:
>>
>>> Keep in mind the danger if testing Foo!=null. That doesn't work and catches
>>> me by surprise all the time. Foo is null and variants are what you need.
>>>
>>> On Sat, Sep 14, 2019, 4:56 PM hanu mapr <[email protected]> wrote:
>>>
>>>> Hello Sebastian,
>>>>
>>>> By default Drill sets the field 'foo' to null for the files that don't
>>>> contain it. I am of the opinion that the condition where foo = 'bar' should
>>>> result in false for all those files which don't contain the field.
>>>> Please can you send across the queries which you have run and the observed
>>>> result.
>>>>
>>>> Just off the top of my head, some query like the below one might work
>>>> select file_name from dfs.`/bla/*/*` where foo != null. --- You might want
>>>> to remove duplicate entries. (of course this also results in the rows which
>>>> contain the field and are null).
>>>>
>>>> Hope this helps.
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister <
>>>> [email protected]> wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > When searching multiple directories, drill only searches fields that are
>>>> > common to all files (see the json data model). Is there a way to query a
>>>> > directory and list all files that contain a certain field?
>>>> >
>>>> > In other words, I would like to use the workaround in this way:
>>>> >
>>>> > select * from (select fqn from dfs.`/bla/*/*` where foo exists) where foo
>>>> > = 'bar'
>>>> >
>>>> > Or is there another way to do this? I dynamically get more files, so
>>>> > finding the files should be included in the query.
>>>> >
>>>> > An alternative would be to execute the query such that it sets the field
>>>> > 'foo' to null for all files that don't contain it. However, I don't know
>>>> > how to execute this.
>>>> >
>>>> > Thanks,
>>>> >  Sebastian
>>>> >
>>>>
>>

Re: Check presence of field in json file

Reply via email to