Thanks to all for the replies so far. Some of the approaches work, if the 
search targets a single tag. However, none work, if the search targets a 
dictionary. For example:

# file a.json
{ "id": 1 }
# file b.json
{ "id": 2,
  "yourtag": { "foo": { "foo1" = 1 } } }


SELECT id, yourtag.foo.foo1
FROM dfs.`/bla/*/*` 
WHERE yourtag.foo is not null  ## alt: REPEATED_CONTAINS(yourtag,'foo') # alt2: 
yourtag.foo <> 'null'


I haven't found a way to write the search, so it returns the tuple (2, 1). The 
problem is that instead of returning simply null, if yourtag.foo doesn't exist, 
drill returns "Missing function implementation: 
[repeated_contains(MAP-REQUIRED, VARCHAR-REQUIRED)]. or 'Table \'yourtag\' not 
found'.

  Sebastian



Paul Rogers <par0...@yahoo.com.INVALID> writes:

> Hi Sebastian,
>
> On the query below, you've got a string comparison: mytag = 'hello'. I 
> suspect this is your problem.
>
> By mentioning mytag, Drill knows you want to project that column from your 
> file scan. When a reader notices that a file does not have that column, it 
> will make one up and set it to null. As it turns out, the made-up column will 
> be Nullable INT. As a result, your result set will contain some batches of 
> data with VARCHAR columns (from scanners that found the column) and some 
> batches in which the column is INT (for those readers that did not find the 
> column.)
>
> Drill processes data in batches. Batches contain the data from one file (or, 
> if the file is big, a single file may produce multiple batches.) At its 
> lowest level, Drill can handle the case in which column mytag is INT in some 
> batches (those where no such field was found in JSON), and VARCHAR in others 
> (where the field was found.) Unfortunately, higher-level code in Drill cannot 
> handle conflicting schema, causing endless user confusion.
>
>
> I suspect the NumberFormatException occurs because of the conflict between 
> numeric and VARCHAR column types. Further, any attempt to group, sort or 
> aggregate will also fail due to a schema conflict.
>
> This is a longstanding "feature" of Drill that we discuss in the Learning 
> Apache Drill book.
>
> In this particular case, one would expect Drill to have used a short-circuit 
> evaluation of the AND conjunction, evaluate mytag = 'hello' only when the 
> column is not null. This should have filtered out all the INT batches since 
> they are NULL. Perhaps there is a bug there somewhere.
>
>
> Note that the problem may be worse. If you have JSON like the following, 
> Drill will also fail:
>
> {a: 10, mytag: null}
> {a: 20, mytag: "hello"}
>
> The above will fail because Drill must guess a column type when it sees the 
> first mytag: null. It will guess nullable Int. Then, when it sees mytag: 
> "hello", it will try to write a VARCHAR into an INT column and will fail.
>
> I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello' Not 
> sure if this will work, but worth a try.
>
> The longer-term fix is that the team is working on a schema feature that will 
> let you tell Drill that the column is VARCHAR even if it is not found in the 
> data source. The feature is available for CSV and some other file types, but 
> not yet for JSON.
>
> An obvious enhancement in this one case is that Drill itself can tell your 
> intent is for the column to be VARCHAR. The analyzer should be able to infer 
> that column type without you telling Drill this fact. As far as I know, this 
> addition is not yet part of the schema system plan, but would be a nice 
> additional tweak.
>
>
> Thanks,
> - Paul
>
>  
>
>     On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister 
> <sfisc...@uwaterloo.ca> wrote:  
>  
>  While it's possible to test for 'is not null', you actually cannot query the 
> tag, because it provides an system error due to the optimizer doing its job. 
> Take this query as example:
>
> SELECT  mytag
> FROM dfs.`/bla/*/*` 
> WHERE mytag is not null and mytag = 'hello'
>
> Some files contain "mytag", others don't.
>
> The query works with just the clause 'mytag is not null', because all json 
> files missing mytag will get mytag set to null, which is type compatible with 
> the filter clause. However it actually does not work with "mytag is not null 
> and mytag = 'hello'" because I get the following error for files where mytag 
> is not present.
>
> SYSTEM ERROR: NumberFormatException: hello
>
> The physical plan shows that the query optimizer removes the clause 'mytag is 
> not null', because it's redundant. However, it comes at the expense of not 
> being able to query tags that are not present in all files.
>
> Is there a way to outsmart the optimizer and first execute a filter on "mytag 
> is not null" before "mytag = 'hello'" in a single query?
>
>   Sebastian
>
>
> Ted Dunning <ted.dunn...@gmail.com> writes:
>
>> Keep in mind the danger if testing Foo!=null. That doesn't work and catches
>> me by surprise all the time. Foo is null and variants are what you need.
>>
>> On Sat, Sep 14, 2019, 4:56 PM hanu mapr <hanu.m...@gmail.com> wrote:
>>
>>> Hello Sebastian,
>>>
>>> By default Drill sets the field 'foo' to null for the files that don't
>>> contain it. I am of the opinion that the condition where foo = 'bar' should
>>> result in false for all those files which don't contain the field.
>>> Please can you send across the queries which you have run and the observed
>>> result.
>>>
>>> Just off the top of my head, some query like the below one might work
>>> select file_name from dfs.`/bla/*/*` where foo != null. --- You might want
>>> to remove duplicate entries. (of course this also results in the rows which
>>> contain the field and are null).
>>>
>>> Hope this helps.
>>>
>>> Thanks
>>>
>>>
>>> On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister <
>>> sfisc...@uwaterloo.ca> wrote:
>>>
>>> > Hi,
>>> >
>>> > When searching multiple directories, drill only searches fields that are
>>> > common to all files (see the json data model). Is there a way to query a
>>> > directory and list all files that contain a certain field?
>>> >
>>> > In other words, I would like to use the workaround in this way:
>>> >
>>> > select * from (select fqn from dfs.`/bla/*/*` where foo exists) where foo
>>> > = 'bar'
>>> >
>>> > Or is there another way to do this? I dynamically get more files, so
>>> > finding the files should be included in the query.
>>> >
>>> > An alternative would be to execute the query such that it sets the field
>>> > 'foo' to null for all files that don't contain it. However, I don't know
>>> > how to execute this.
>>> >
>>> > Thanks,
>>> >  Sebastian
>>> >
>>>
>   

Reply via email to