Re: Check presence of field in json file

Paul Rogers Fri, 20 Sep 2019 09:29:52 -0700

Hi Sebastian,

Drill has no magic to work with heterogeneous JSON files. Drill works best with 
homogeneous JSON with no null or missing fields: this pattern leverages Drill's 
schema-on-read technology. With heterogeneous JSON, or missing/null fields, and 
without a schema, there is no magic that will enable Drill to create a 
consistent schema across distributed readers. I realize there is a belief that 
Drill can do this, but that is true only for the "happy path" mentioned above.

As Boaz suggested, if you have single-level JSON, you can force all fields to 
VARCHAR. However, this does not work for nested fields as you will end up with 
a MAP column in those files that contain the map, but a VARCHAR column in those 
files without the map.
Sorry about the incorrect function name: it is sqlTypeOf(). See [1]. There is 
also a typeof() function, which is SQL compliant, but that function returns 
"NULL" if the value is null, while sqlTypeOf() returns the actual type name, 
even for null values.

My bad on "isNull", I confused Drill's function with another product. In Drill, 
you use the IS NULL operator as you were doing.

The idea of the nested select was to first check if yourtag is a MAP. (Handles 
the case that yourtag is missing.) Once we know it is, we use another nested 
query to check if yourtag.foo is a VARCHAR. (Handles the case where yourtag.foo 
is missing.) If it is, then you can apply the yourtag.foo LIKE %foo% pattern in 
the outer-most query.

Sorry that I've not actually tested the above, it is from memory; so caveat 
emptor...

One related point: recall that Drill is a SQL engine, to use anything other 
than the Drill native client (or, with work, JDBC), you must flatten your JSON 
into a single tuple. So, if your goal is to do JSON processing (preserve nested 
structures after querying), you'll need to write some special client code (or, 
I believe, use CTAS to create a new JSON file.) That is, Drill is good at 
structured-JSON to records; it is not as good a solution for structured-JSON to 
structured-JSON (except in CTAS.) Said another way, don't expect the REST API, 
say, to return your data as structured JSON. Do expect Drill to help you 
flatten your JSON into something that, say, Excel or Tableau can consume.

As a result, each time you mention a column, you pull it into the top-level 
record. That is:

SELECT yourtag.foo AS bar FROM ...

Creates a top-level column "bar"; you'd then use that column in your final 
check: bar LIKE "%foo%".

SELECT yourtag.foo AS bar FROM ... WHERE bar LIKE "%foo%" 

The question of schema is really more of a product feature discussion, it won't 
help you with your query today.

IMHO, the only workable solution to handle heterogeneous JSON with null or 
missing fields is to tell each reader the type of the missing columns so that 
those readers can create columns of that type.

A partial solution is the one that Ted suggested: have readers create a 
"DontKnow" column type, then modify each of a dozen operators to merge columns 
of type X/X, X/DontKnow and DontKnow/DontKnow. Might work, but we'd need a 
volunteer to implement such a sweeping change; it is a non-trivial exercise.

The solution is partial: it still does not solve the {a: 10}, {a: 10.1} problem 
(which currently fails the reader). Nor does it solve the problem of what to 
return to the JDBC/ODBC driver if the column never appears in some queries, and 
so the type is X when the column appears, but DontKnow when it does not appear: 
all clients would need to be ready for a column that is, say, an INT some 
times, and DontKnow other times. Most BI tools can't handle this.

So, although it is an old school concept, and frowned upon in Drill, declaring 
a schema is the simplest, most reliable solution to ensure that all parts of 
the pipeline (including the client) agree on data types. (Using a schema need 
not imply using the Hive Metastore, which has its own issues; the schema could 
be in a file...) Of course, Drill does not yet support a schema for JSON, so 
this is only a theoretical discussion for Drill...

Thanks,
- Paul

[1] https://drill.apache.org/docs/data-type-functions/

    On Thursday, September 19, 2019, 10:19:58 PM PDT, Sebastian Fischmeister 
<[email protected]> wrote:  

 I was hoping to use Drill to aggregate information across a heterogeneous set 
of JSON files. With a defined schema, I could go to an RDBMS.

The function sqlType() doesn't seem to exist in Drill. I also found no 
reference to it in the manual. Also the function isNull(), which you mentioned, 
doesn't seem to exist.

This also doesn't work: WHERE yourtag like '%foo%'. The idea was to just see 
whether the map as json-string contains the name. Casting yourtag to a string 
doesn't work either.

I would need some function to test whether 'yourtag.foo' exists regardless of 
whether 'yourtag' or 'yourtag.foo' exits; and yourtag is a MAP type.

  Sebastian

Paul Rogers <[email protected]> writes:

> Hi Sebastian,
>
> The nested map is a very hard case! Without a schema, there is no way for the 
> reader of a.json to know it should create a map. Even if some part of Drill 
> were to infer that `yourtag` is a map, nothing would figure out the name and 
> type of the items within the map. This is a known weakness in the current 
> implementation. (Though, to be fair, without a schema, it is nearly 
> impossible to get consistency.)
>
>
> If you only want the b.json data (the one with the nested map), then maybe 
> you can go by column type using a nested select:
>
> SELECT id, yourtag.foo.foo1 AS myFoo FROM (
>   SELECT id, yourtag WHERE sqlType(yourTag) = 'MAP' FROM ...)
>
> The inner select returns only the rows where the yourtag column is a MAP. 
> This *might* weed out the batches in which the column did not exist and Drill 
> made up a nullable INT column.
>
> Not 100% sure that this will work, but worth a try.
>
>
> Note that foo <> null will never be true. And foo <> 'null' tries to match a 
> string with the value of "null" (not a null string). If you do want to check 
> for null, use isNull(foo).
>
> Note also that a MAP is never null; though the fields within it can be. (The 
> MAP type is not nullable in Drill.)
>
> Thanks,
> - Paul
>
>  
>
>    On Wednesday, September 18, 2019, 9:20:24 PM PDT, Sebastian Fischmeister 
><[email protected]> wrote:  
>  
>  Thanks to all for the replies so far. Some of the approaches work, if the 
>search targets a single tag. However, none work, if the search targets a 
>dictionary. For example:
>
> # file a.json
> { "id": 1 }
> # file b.json
> { "id": 2,
>   "yourtag": { "foo": { "foo1" = 1 } } }
>
>
> SELECT id, yourtag.foo.foo1
> FROM dfs.`/bla/*/*` 
> WHERE yourtag.foo is not null  ## alt: REPEATED_CONTAINS(yourtag,'foo') # 
> alt2: yourtag.foo <> 'null'
>
>
> I haven't found a way to write the search, so it returns the tuple (2, 1). 
> The problem is that instead of returning simply null, if yourtag.foo doesn't 
> exist, drill returns "Missing function implementation: 
> [repeated_contains(MAP-REQUIRED, VARCHAR-REQUIRED)]. or 'Table \'yourtag\' 
> not found'.
>
>   Sebastian
>
>
>
> Paul Rogers <[email protected]> writes:
>
>> Hi Sebastian,
>>
>> On the query below, you've got a string comparison: mytag = 'hello'. I 
>> suspect this is your problem.
>>
>> By mentioning mytag, Drill knows you want to project that column from your 
>> file scan. When a reader notices that a file does not have that column, it 
>> will make one up and set it to null. As it turns out, the made-up column 
>> will be Nullable INT. As a result, your result set will contain some batches 
>> of data with VARCHAR columns (from scanners that found the column) and some 
>> batches in which the column is INT (for those readers that did not find the 
>> column.)
>>
>> Drill processes data in batches. Batches contain the data from one file (or, 
>> if the file is big, a single file may produce multiple batches.) At its 
>> lowest level, Drill can handle the case in which column mytag is INT in some 
>> batches (those where no such field was found in JSON), and VARCHAR in others 
>> (where the field was found.) Unfortunately, higher-level code in Drill 
>> cannot handle conflicting schema, causing endless user confusion.
>>
>>
>> I suspect the NumberFormatException occurs because of the conflict between 
>> numeric and VARCHAR column types. Further, any attempt to group, sort or 
>> aggregate will also fail due to a schema conflict.
>>
>> This is a longstanding "feature" of Drill that we discuss in the Learning 
>> Apache Drill book.
>>
>> In this particular case, one would expect Drill to have used a short-circuit 
>> evaluation of the AND conjunction, evaluate mytag = 'hello' only when the 
>> column is not null. This should have filtered out all the INT batches since 
>> they are NULL. Perhaps there is a bug there somewhere.
>>
>>
>> Note that the problem may be worse. If you have JSON like the following, 
>> Drill will also fail:
>>
>> {a: 10, mytag: null}
>> {a: 20, mytag: "hello"}
>>
>> The above will fail because Drill must guess a column type when it sees the 
>> first mytag: null. It will guess nullable Int. Then, when it sees mytag: 
>> "hello", it will try to write a VARCHAR into an INT column and will fail.
>>
>> I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello' 
>> Not sure if this will work, but worth a try.
>>
>> The longer-term fix is that the team is working on a schema feature that 
>> will let you tell Drill that the column is VARCHAR even if it is not found 
>> in the data source. The feature is available for CSV and some other file 
>> types, but not yet for JSON.
>>
>> An obvious enhancement in this one case is that Drill itself can tell your 
>> intent is for the column to be VARCHAR. The analyzer should be able to infer 
>> that column type without you telling Drill this fact. As far as I know, this 
>> addition is not yet part of the schema system plan, but would be a nice 
>> additional tweak.
>>
>>
>> Thanks,
>> - Paul
>>
>>  
>>
>>    On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister 
>><[email protected]> wrote:  
>>  
>>  While it's possible to test for 'is not null', you actually cannot query 
>>the tag, because it provides an system error due to the optimizer doing its 
>>job. Take this query as example:
>>
>> SELECT  mytag
>> FROM dfs.`/bla/*/*` 
>> WHERE mytag is not null and mytag = 'hello'
>>
>> Some files contain "mytag", others don't.
>>
>> The query works with just the clause 'mytag is not null', because all json 
>> files missing mytag will get mytag set to null, which is type compatible 
>> with the filter clause. However it actually does not work with "mytag is not 
>> null and mytag = 'hello'" because I get the following error for files where 
>> mytag is not present.
>>
>> SYSTEM ERROR: NumberFormatException: hello
>>
>> The physical plan shows that the query optimizer removes the clause 'mytag 
>> is not null', because it's redundant. However, it comes at the expense of 
>> not being able to query tags that are not present in all files.
>>
>> Is there a way to outsmart the optimizer and first execute a filter on 
>> "mytag is not null" before "mytag = 'hello'" in a single query?
>>
>>   Sebastian
>>
>>
>> Ted Dunning <[email protected]> writes:
>>
>>> Keep in mind the danger if testing Foo!=null. That doesn't work and catches
>>> me by surprise all the time. Foo is null and variants are what you need.
>>>
>>> On Sat, Sep 14, 2019, 4:56 PM hanu mapr <[email protected]> wrote:
>>>
>>>> Hello Sebastian,
>>>>
>>>> By default Drill sets the field 'foo' to null for the files that don't
>>>> contain it. I am of the opinion that the condition where foo = 'bar' should
>>>> result in false for all those files which don't contain the field.
>>>> Please can you send across the queries which you have run and the observed
>>>> result.
>>>>
>>>> Just off the top of my head, some query like the below one might work
>>>> select file_name from dfs.`/bla/*/*` where foo != null. --- You might want
>>>> to remove duplicate entries. (of course this also results in the rows which
>>>> contain the field and are null).
>>>>
>>>> Hope this helps.
>>>>
>>>> Thanks
>>>>
>>>>
>>>> On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister <
>>>> [email protected]> wrote:
>>>>
>>>> > Hi,
>>>> >
>>>> > When searching multiple directories, drill only searches fields that are
>>>> > common to all files (see the json data model). Is there a way to query a
>>>> > directory and list all files that contain a certain field?
>>>> >
>>>> > In other words, I would like to use the workaround in this way:
>>>> >
>>>> > select * from (select fqn from dfs.`/bla/*/*` where foo exists) where foo
>>>> > = 'bar'
>>>> >
>>>> > Or is there another way to do this? I dynamically get more files, so
>>>> > finding the files should be included in the query.
>>>> >
>>>> > An alternative would be to execute the query such that it sets the field
>>>> > 'foo' to null for all files that don't contain it. However, I don't know
>>>> > how to execute this.
>>>> >
>>>> > Thanks,
>>>> >  Sebastian
>>>> >
>>>>
>>

Re: Check presence of field in json file

Reply via email to