Re: Check presence of field in json file

2019-09-20 Thread Sebastian Fischmeister
I'm still trying out some things. However, one observation regarding 
'DontKnow': it is also relevant because, depending on the name order of the 
files in the directory, the same query can currently return different results.


> A partial solution is the one that Ted suggested: have readers create a 
> "DontKnow" column type, then modify each of a dozen operators to merge 
> columns of type X/X, X/DontKnow and DontKnow/DontKnow. Might work, but we'd 
> need a volunteer to implement such a sweeping change; it is a non-trivial 
> exercise.

If you have multiple files, and the tag only occurs in the second file:

file1.json: { }
file2.json: { "a": "foo" }
file3.json: { }

Then

select sqlTypeOf(a) from ...

will return

NULL
VARCHAR
VARCHAR

If you rename the files, the result is different.

This will make the implementation of DontKnow tricky, because the query should 
return:

DONTKNOW
VARCHAR
DONTKNOW

and not

DONTKNOW
VARCHAR
VARCHAR

Otherwise, you have query results that depend on the names of the files, and 
for more complex WHERE clauses, the optimizer might incorrectly throw out 
necessary checks (e.g., when first checking for not DONTKNOW and then checking 
for a specific value).
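
For example, a query along these lines would depend on the type check 
surviving planning (purely hypothetical: no DONTKNOW type exists yet, and the 
path is just a placeholder):

-- Hypothetical sketch: only compare 'a' where its type is actually known.
-- If the planner were to drop the first predicate as redundant (as it does
-- today for IS NOT NULL), the guard would be lost for files missing 'a'.
SELECT a
FROM dfs.`/data/*.json`
WHERE sqlTypeOf(a) <> 'DONTKNOW' AND a = 'foo';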

  Sebastian


Re: Check presence of field in json file

2019-09-20 Thread Paul Rogers
Hi Sebastian,

Drill has no magic to work with heterogeneous JSON files. Drill works best with 
homogeneous JSON with no null or missing fields: this pattern leverages Drill's 
schema-on-read technology. With heterogeneous JSON, or missing/null fields, and 
without a schema, there is no magic that will enable Drill to create a 
consistent schema across distributed readers. I realize there is a belief that 
Drill can do this, but that is true only for the "happy path" mentioned above.

As Boaz suggested, if you have single-level JSON, you can force all fields to 
VARCHAR. However, this does not work for nested fields as you will end up with 
a MAP column in those files that contain the map, but a VARCHAR column in those 
files without the map.
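
For reference, that option is a session (or system) setting:

ALTER SESSION SET `store.json.all_text_mode` = true;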
Sorry about the incorrect function name: it is sqlTypeOf(). See [1]. There is 
also a typeof() function, which is SQL compliant, but that function returns 
"NULL" if the value is null, while sqlTypeOf() returns the actual type name, 
even for null values.
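
For example (the path is a placeholder; 'a' is a column that is null or 
missing in some of the files):

SELECT typeof(a) AS type_of, sqlTypeOf(a) AS sql_type_of
FROM dfs.`/data/*.json`;
-- typeof() reports NULL for the rows where a is null, while sqlTypeOf()
-- reports the column's type name (VARCHAR, or the guessed INT) even there.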

My bad on "isNull", I confused Drill's function with another product. In Drill, 
you use the IS NULL operator as you were doing.

The idea of the nested select was to first check if yourtag is a MAP. (Handles 
the case that yourtag is missing.) Once we know it is, we use another nested 
query to check if yourtag.foo is a VARCHAR. (Handles the case where yourtag.foo 
is missing.) If it is, then you can apply the yourtag.foo LIKE '%foo%' pattern in 
the outer-most query.

Sorry that I've not actually tested the above, it is from memory; so caveat 
emptor...
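
For concreteness, the shape I have in mind is roughly the following (equally 
untested; the table path and LIKE pattern are just placeholders):

SELECT bar
FROM (
  SELECT t1.yourtag.foo AS bar
  FROM (
    SELECT yourtag
    FROM dfs.`/bla/*/*`
    WHERE sqlTypeOf(yourtag) = 'MAP'
  ) t1
  WHERE sqlTypeOf(t1.yourtag.foo) = 'VARCHAR'
) t2
WHERE bar LIKE '%foo%';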

One related point: recall that Drill is a SQL engine; to use anything other 
than the Drill native client (or, with work, JDBC), you must flatten your JSON 
into a single tuple. So, if your goal is to do JSON processing (preserve nested 
structures after querying), you'll need to write some special client code (or, 
I believe, use CTAS to create a new JSON file.) That is, Drill is good at 
structured-JSON to records; it is not as good a solution for structured-JSON to 
structured-JSON (except in CTAS.) Said another way, don't expect the REST API, 
say, to return your data as structured JSON. Do expect Drill to help you 
flatten your JSON into something that, say, Excel or Tableau can consume.
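
As a rough illustration of the CTAS route (untested; it assumes a writable 
workspace such as dfs.tmp, and the table and path names are placeholders):

ALTER SESSION SET `store.format` = 'json';

CREATE TABLE dfs.tmp.`flattened` AS
SELECT id, yourtag FROM dfs.`/bla/*/*`;
-- Writes the query results back out as JSON files under the dfs.tmp workspace,
-- preserving the nested yourtag structure.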

As a result, each time you mention a column, you pull it into the top-level 
record. That is:

SELECT yourtag.foo AS bar FROM ...

Creates a top-level column "bar"; you'd then use that column in your final 
check: bar LIKE '%foo%'.

SELECT yourtag.foo AS bar FROM ... WHERE bar LIKE '%foo%'

The question of schema is really more of a product feature discussion; it 
won't help you with your query today.

IMHO, the only workable solution to handle heterogeneous JSON with null or 
missing fields is to tell each reader the type of the missing columns so that 
those readers can create columns of that type.

A partial solution is the one that Ted suggested: have readers create a 
"DontKnow" column type, then modify each of a dozen operators to merge columns 
of type X/X, X/DontKnow and DontKnow/DontKnow. Might work, but we'd need a 
volunteer to implement such a sweeping change; it is a non-trivial exercise.

The solution is partial: it still does not solve the {a: 10}, {a: 10.1} problem 
(which currently fails the reader). Nor does it solve the problem of what to 
return to the JDBC/ODBC driver if the column never appears in some queries, and 
so the type is X when the column appears, but DontKnow when it does not appear: 
all clients would need to be ready for a column that is, say, an INT some 
times, and DontKnow other times. Most BI tools can't handle this.

So, although it is an old school concept, and frowned upon in Drill, declaring 
a schema is the simplest, most reliable solution to ensure that all parts of 
the pipeline (including the client) agree on data types. (Using a schema need 
not imply using the Hive Metastore, which has its own issues; the schema could 
be in a file...) Of course, Drill does not yet support a schema for JSON, so 
this is only a theoretical discussion for Drill...
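
For what it is worth, the provided-schema facility that exists today for text 
files looks roughly like this (illustrative only; the exact syntax and the 
options required to use it depend on the Drill version, and nothing equivalent 
exists for JSON yet):

CREATE OR REPLACE SCHEMA (mytag VARCHAR)
FOR TABLE dfs.`/bla/csv_data`;
-- Drill stores the schema in a hidden file in the table directory so that
-- every reader agrees mytag is VARCHAR, even when the column is missing.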

Thanks,
- Paul

[1] https://drill.apache.org/docs/data-type-functions/


 

On Thursday, September 19, 2019, 10:19:58 PM PDT, Sebastian Fischmeister 
 wrote:  
 
 I was hoping to use Drill to aggregate information across a heterogeneous set 
of JSON files. With a defined schema, I could go to an RDBMS.

The function sqlType() doesn't seem to exist in Drill. I also found no 
reference to it in the manual. Also the function isNull(), which you mentioned, 
doesn't seem to exist.

This also doesn't work: WHERE yourtag like '%foo%'. The idea was to just see 
whether the map, serialized as a JSON string, contains the name. Casting 
yourtag to a string doesn't work either.

I would need some function to test whether 'yourtag.foo' exists, regardless of 
whether 'yourtag' or 'yourtag.foo' exists and whether yourtag is a MAP type.


  Sebastian

Paul Rogers  writes:

> Hi Sebastian,
>
> The nested map is a very har

Re: Check presence of field in json file

2019-09-19 Thread Sebastian Fischmeister
I was hoping to use Drill to aggregate information across a heterogeneous set 
of JSON files. With a defined schema, I could go to an RDBMS.

The function sqlType() doesn't seem to exist in Drill. I also found no 
reference to it in the manual. Also the function isNull(), which you mentioned, 
doesn't seem to exist.

This also doesn't work: WHERE yourtag like '%foo%'. The idea was to just see 
whether the map, serialized as a JSON string, contains the name. Casting 
yourtag to a string doesn't work either.

I would need some function to test whether 'yourtag.foo' exists, regardless of 
whether 'yourtag' or 'yourtag.foo' exists and whether yourtag is a MAP type.


  Sebastian

Paul Rogers  writes:

> Hi Sebastian,
>
> The nested map is a very hard case! Without a schema, there is no way for the 
> reader of a.json to know it should create a map. Even if some part of Drill 
> were to infer that `yourtag` is a map, nothing would figure out the name and 
> type of the items within the map. This is a known weakness in the current 
> implementation. (Though, to be fair, without a schema, it is nearly 
> impossible to get consistency.)
>
>
> If you only want the b.json data (the one with the nested map), then maybe 
> you can go by column type using a nested select:
>
> SELECT id, yourtag.foo.foo1 AS myFoo FROM (
>   SELECT id, yourtag FROM ... WHERE sqlType(yourTag) = 'MAP')
>
> The inner select returns only the rows where the yourtag column is a MAP. 
> This *might* weed out the batches in which the column did not exist and Drill 
> made up a nullable INT column.
>
> Not 100% sure that this will work, but worth a try.
>
>
> Note that foo <> null will never be true. And foo <> 'null' tries to match a 
> string with the value of "null" (not a null string). If you do want to check 
> for null, use isNull(foo).
>
> Note also that a MAP is never null; though the fields within it can be. (The 
> MAP type is not nullable in Drill.)
>
> Thanks,
> - Paul
>
>  
>
> On Wednesday, September 18, 2019, 9:20:24 PM PDT, Sebastian Fischmeister 
>  wrote:  
>  
>  Thanks to all for the replies so far. Some of the approaches work if the 
> search targets a single tag. However, none work if the search targets a 
> dictionary. For example:
>
> # file a.json
> { "id": 1 }
> # file b.json
> { "id": 2,
>   "yourtag": { "foo": { "foo1": 1 } } }
>
>
> SELECT id, yourtag.foo.foo1
> FROM dfs.`/bla/*/*` 
> WHERE yourtag.foo is not null  ## alt: REPEATED_CONTAINS(yourtag,'foo') # 
> alt2: yourtag.foo <> 'null'
>
>
> I haven't found a way to write the search so that it returns the tuple 
> (2, 1). The problem is that instead of simply returning null when 
> yourtag.foo doesn't exist, Drill returns "Missing function implementation: 
> [repeated_contains(MAP-REQUIRED, VARCHAR-REQUIRED)]" or 'Table \'yourtag\' 
> not found'.
>
>   Sebastian
>
>
>
> Paul Rogers  writes:
>
>> Hi Sebastian,
>>
>> On the query below, you've got a string comparison: mytag = 'hello'. I 
>> suspect this is your problem.
>>
>> By mentioning mytag, Drill knows you want to project that column from your 
>> file scan. When a reader notices that a file does not have that column, it 
>> will make one up and set it to null. As it turns out, the made-up column 
>> will be Nullable INT. As a result, your result set will contain some batches 
>> of data with VARCHAR columns (from scanners that found the column) and some 
>> batches in which the column is INT (for those readers that did not find the 
>> column.)
>>
>> Drill processes data in batches. Batches contain the data from one file (or, 
>> if the file is big, a single file may produce multiple batches.) At its 
>> lowest level, Drill can handle the case in which column mytag is INT in some 
>> batches (those where no such field was found in JSON), and VARCHAR in others 
>> (where the field was found.) Unfortunately, higher-level code in Drill 
>> cannot handle conflicting schema, causing endless user confusion.
>>
>>
>> I suspect the NumberFormatException occurs because of the conflict between 
>> numeric and VARCHAR column types. Further, any attempt to group, sort or 
>> aggregate will also fail due to a schema conflict.
>>
>> This is a longstanding "feature" of Drill that we discuss in the Learning 
>> Apache Drill book.
>>
>> In this particular case, one would expect Drill to have used a short-circuit 
>> evaluation of the AND conjunction, evaluating mytag = 'hello' only when the 
>> column is not null. This should have filtered out all the INT batches since 
>> they are NULL. Perhaps there is a bug there somewhere.
>>
>>
>> Note that the problem may be worse. If you have JSON like the following, 
>> Drill will also fail:
>>
>> {a: 10, mytag: null}
>> {a: 20, mytag: "hello"}
>>
>> The above will fail because Drill must guess a column type when it sees the 
>> first mytag: null. It will guess nullable Int. Then, when it sees mytag: 
>> "hello", it will try to write a VARCHAR into an INT column and will fail.
>>
>> I wonder if you can try cas

Re: Check presence of field in json file

2019-09-18 Thread Paul Rogers
Hi Sebastian,

The nested map is a very hard case! Without a schema, there is no way for the 
reader of a.json to know it should create a map. Even if some part of Drill 
were to infer that `yourtag` is a map, nothing would figure out the name and 
type of the items within the map. This is a known weakness in the current 
implementation. (Though, to be fair, without a schema, it is nearly impossible 
to get consistency.)


If you only want the b.json data (the one with the nested map), then maybe you 
can go by column type using a nested select:

SELECT id, yourtag.foo.foo1 AS myFoo FROM (
  SELECT id, yourtag FROM ... WHERE sqlType(yourTag) = 'MAP')

The inner select returns only the rows where the yourtag column is a MAP. This 
*might* weed out the batches in which the column did not exist and Drill made 
up a nullable INT column.

Not 100% sure that this will work, but worth a try.


Note that foo <> null will never be true. And foo <> 'null' tries to match a 
string with the value of "null" (not a null string). If you do want to check 
for null, use isNull(foo).

Note also that a MAP is never null; though the fields within it can be. (The 
MAP type is not nullable in Drill.)

Thanks,
- Paul

 

On Wednesday, September 18, 2019, 9:20:24 PM PDT, Sebastian Fischmeister 
 wrote:  
 
 Thanks to all for the replies so far. Some of the approaches work if the 
search targets a single tag. However, none work if the search targets a 
dictionary. For example:

# file a.json
{ "id": 1 }
# file b.json
{ "id": 2,
  "yourtag": { "foo": { "foo1": 1 } } }


SELECT id, yourtag.foo.foo1
FROM dfs.`/bla/*/*` 
WHERE yourtag.foo is not null  ## alt: REPEATED_CONTAINS(yourtag,'foo') # alt2: 
yourtag.foo <> 'null'


I haven't found a way to write the search so that it returns the tuple (2, 1). 
The problem is that instead of simply returning null when yourtag.foo doesn't 
exist, Drill returns "Missing function implementation: 
[repeated_contains(MAP-REQUIRED, VARCHAR-REQUIRED)]" or 'Table \'yourtag\' not 
found'.

  Sebastian



Paul Rogers  writes:

> Hi Sebastian,
>
> On the query below, you've got a string comparison: mytag = 'hello'. I 
> suspect this is your problem.
>
> By mentioning mytag, Drill knows you want to project that column from your 
> file scan. When a reader notices that a file does not have that column, it 
> will make one up and set it to null. As it turns out, the made-up column will 
> be Nullable INT. As a result, your result set will contain some batches of 
> data with VARCHAR columns (from scanners that found the column) and some 
> batches in which the column is INT (for those readers that did not find the 
> column.)
>
> Drill processes data in batches. Batches contain the data from one file (or, 
> if the file is big, a single file may produce multiple batches.) At its 
> lowest level, Drill can handle the case in which column mytag is INT in some 
> batches (those where no such field was found in JSON), and VARCHAR in others 
> (where the field was found.) Unfortunately, higher-level code in Drill cannot 
> handle conflicting schema, causing endless user confusion.
>
>
> I suspect the NumberFormatException occurs because of the conflict between 
> numeric and VARCHAR column types. Further, any attempt to group, sort or 
> aggregate will also fail due to a schema conflict.
>
> This is a longstanding "feature" of Drill that we discuss in the Learning 
> Apache Drill book.
>
> In this particular case, one would expect Drill to have used a short-circuit 
> evaluation of the AND conjunction, evaluating mytag = 'hello' only when the 
> column is not null. This should have filtered out all the INT batches since 
> they are NULL. Perhaps there is a bug there somewhere.
>
>
> Note that the problem may be worse. If you have JSON like the following, 
> Drill will also fail:
>
> {a: 10, mytag: null}
> {a: 20, mytag: "hello"}
>
> The above will fail because Drill must guess a column type when it sees the 
> first mytag: null. It will guess nullable Int. Then, when it sees mytag: 
> "hello", it will try to write a VARCHAR into an INT column and will fail.
>
> I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello' Not 
> sure if this will work, but worth a try.
>
> The longer-term fix is that the team is working on a schema feature that will 
> let you tell Drill that the column is VARCHAR even if it is not found in the 
> data source. The feature is available for CSV and some other file types, but 
> not yet for JSON.
>
> An obvious enhancement in this one case is that Drill itself can tell your 
> intent is for the column to be VARCHAR. The analyzer should be able to infer 
> that column type without you telling Drill this fact. As far as I know, this 
> addition is not yet part of the schema system plan, but would be a nice 
> additional tweak.
>
>
> Thanks,
> - Paul
>
>  
>
>    On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister 
> wrote:  
>  
>  While it's possible to

Re: Check presence of field in json file

2019-09-18 Thread Sebastian Fischmeister
Thanks to all for the replies so far. Some of the approaches work if the 
search targets a single tag. However, none work if the search targets a 
dictionary. For example:

# file a.json
{ "id": 1 }
# file b.json
{ "id": 2,
  "yourtag": { "foo": { "foo1": 1 } } }


SELECT id, yourtag.foo.foo1
FROM dfs.`/bla/*/*` 
WHERE yourtag.foo is not null  ## alt: REPEATED_CONTAINS(yourtag,'foo') # alt2: 
yourtag.foo <> 'null'


I haven't found a way to write the search so that it returns the tuple (2, 1). 
The problem is that instead of simply returning null when yourtag.foo doesn't 
exist, Drill returns "Missing function implementation: 
[repeated_contains(MAP-REQUIRED, VARCHAR-REQUIRED)]" or 'Table \'yourtag\' not 
found'.

  Sebastian



Paul Rogers  writes:

> Hi Sebastian,
>
> On the query below, you've got a string comparison: mytag = 'hello'. I 
> suspect this is your problem.
>
> By mentioning mytag, Drill knows you want to project that column from your 
> file scan. When a reader notices that a file does not have that column, it 
> will make one up and set it to null. As it turns out, the made-up column will 
> be Nullable INT. As a result, your result set will contain some batches of 
> data with VARCHAR columns (from scanners that found the column) and some 
> batches in which the column is INT (for those readers that did not find the 
> column.)
>
> Drill processes data in batches. Batches contain the data from one file (or, 
> if the file is big, a single file may produce multiple batches.) At its 
> lowest level, Drill can handle the case in which column mytag is INT in some 
> batches (those where no such field was found in JSON), and VARCHAR in others 
> (where the field was found.) Unfortunately, higher-level code in Drill cannot 
> handle conflicting schema, causing endless user confusion.
>
>
> I suspect the NumberFormatException occurs because of the conflict between 
> numeric and VARCHAR column types. Further, any attempt to group, sort or 
> aggregate will also fail due to a schema conflict.
>
> This is a longstanding "feature" of Drill that we discuss in the Learning 
> Apache Drill book.
>
> In this particular case, one would expect Drill to have used a short-circuit 
> evaluation of the AND conjunction, evaluating mytag = 'hello' only when the 
> column is not null. This should have filtered out all the INT batches since 
> they are NULL. Perhaps there is a bug there somewhere.
>
>
> Note that the problem may be worse. If you have JSON like the following, 
> Drill will also fail:
>
> {a: 10, mytag: null}
> {a: 20, mytag: "hello"}
>
> The above will fail because Drill must guess a column type when it sees the 
> first mytag: null. It will guess nullable Int. Then, when it sees mytag: 
> "hello", it will try to write a VARCHAR into an INT column and will fail.
>
> I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello' Not 
> sure if this will work, but worth a try.
>
> The longer-term fix is that the team is working on a schema feature that will 
> let you tell Drill that the column is VARCHAR even if it is not found in the 
> data source. The feature is available for CSV and some other file types, but 
> not yet for JSON.
>
> An obvious enhancement in this one case is that Drill itself can tell your 
> intent is for the column to be VARCHAR. The analyzer should be able to infer 
> that column type without you telling Drill this fact. As far as I know, this 
> addition is not yet part of the schema system plan, but would be a nice 
> additional tweak.
>
>
> Thanks,
> - Paul
>
>  
>
> On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister 
>  wrote:  
>  
>  While it's possible to test for 'is not null', you actually cannot query the 
> tag, because it produces a system error due to the optimizer doing its job. 
> Take this query as an example:
>
> SELECT  mytag
> FROM dfs.`/bla/*/*` 
> WHERE mytag is not null and mytag = 'hello'
>
> Some files contain "mytag", others don't.
>
> The query works with just the clause 'mytag is not null', because all json 
> files missing mytag will get mytag set to null, which is type compatible with 
> the filter clause. However it actually does not work with "mytag is not null 
> and mytag = 'hello'" because I get the following error for files where mytag 
> is not present.
>
> SYSTEM ERROR: NumberFormatException: hello
>
> The physical plan shows that the query optimizer removes the clause 'mytag is 
> not null', because it's redundant. However, it comes at the expense of not 
> being able to query tags that are not present in all files.
>
> Is there a way to outsmart the optimizer and first execute a filter on "mytag 
> is not null" before "mytag = 'hello'" in a single query?
>
>   Sebastian
>
>
> Ted Dunning  writes:
>
>> Keep in mind the danger of testing Foo != null. That doesn't work and catches
>> me by surprise all the time. 'Foo IS NULL' and its variants are what you need.
>>
>> On Sat, Sep 14, 2019, 4:56 PM hanu mapr  wrote:
>>

Re: Check presence of field in json file

2019-09-16 Thread Boaz Ben-Zvi

 Hi Sebastian,

    Setting the option "store.json.all_text_mode" to "true" should make 
the JSON reader interpret all null-valued or missing columns/values as 
"text" instead of "int" (it should return the text value 'null').


This may be another workaround to the type-guessing problem that Paul 
described.
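
A quick way to try it (untested sketch, reusing the query from Sebastian's 
message quoted below):

ALTER SESSION SET `store.json.all_text_mode` = true;

SELECT mytag
FROM dfs.`/bla/*/*`
WHERE mytag IS NOT NULL AND mytag = 'hello';
-- With all-text mode on, missing or null mytag values should come back as
-- VARCHAR rather than the guessed INT, so the string comparison should no
-- longer trip over a NumberFormatException.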


    Thanks,

 Boaz

On 9/15/19 11:05 AM, Paul Rogers wrote:

Hi Sebastian,

On the query below, you've got a string comparison: mytag = 'hello'. I suspect 
this is your problem.

By mentioning mytag, Drill knows you want to project that column from your file 
scan. When a reader notices that a file does not have that column, it will make 
one up and set it to null. As it turns out, the made-up column will be Nullable 
INT. As a result, your result set will contain some batches of data with 
VARCHAR columns (from scanners that found the column) and some batches in which 
the column is INT (for those readers that did not find the column.)

Drill processes data in batches. Batches contain the data from one file (or, if 
the file is big, a single file may produce multiple batches.) At its lowest 
level, Drill can handle the case in which column mytag is INT in some batches 
(those where no such field was found in JSON), and VARCHAR in others (where the 
field was found.) Unfortunately, higher-level code in Drill cannot handle 
conflicting schema, causing endless user confusion.


I suspect the NumberFormatException occurs because of the conflict between 
numeric and VARCHAR column types. Further, any attempt to group, sort or 
aggregate will also fail due to a schema conflict.

This is a longstanding "feature" of Drill that we discuss in the Learning 
Apache Drill book.

In this particular case, one would expect Drill to have used a short-circuit 
evaluation of the AND conjunction, evaluating mytag = 'hello' only when the 
column is not null. This should have filtered out all the INT batches since 
they are NULL. Perhaps there is a bug there somewhere.


Note that the problem may be worse. If you have JSON like the following, Drill 
will also fail:

{a: 10, mytag: null}
{a: 20, mytag: "hello"}

The above will fail because Drill must guess a column type when it sees the first mytag: 
null. It will guess nullable Int. Then, when it sees mytag: "hello", it will 
try to write a VARCHAR into an INT column and will fail.

I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello' Not 
sure if this will work, but worth a try.

The longer-term fix is that the team is working on a schema feature that will 
let you tell Drill that the column is VARCHAR even if it is not found in the 
data source. The feature is available for CSV and some other file types, but 
not yet for JSON.

An obvious enhancement in this one case is that Drill itself can tell your 
intent is for the column to be VARCHAR. The analyzer should be able to infer 
that column type without you telling Drill this fact. As far as I know, this 
addition is not yet part of the schema system plan, but would be a nice 
additional tweak.


Thanks,
- Paul

  


 On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister 
 wrote:
  
 While it's possible to test for 'is not null', you actually cannot query the 
tag, because it produces a system error due to the optimizer doing its job. 
Take this query as an example:


SELECT  mytag
FROM dfs.`/bla/*/*`
WHERE mytag is not null and mytag = 'hello'

Some files contain "mytag", others don't.

The query works with just the clause 'mytag is not null', because all json files missing 
mytag will get mytag set to null, which is type compatible with the filter clause. 
However it actually does not work with "mytag is not null and mytag = 'hello'" 
because I get the following error for files where mytag is not present.

SYSTEM ERROR: NumberFormatException: hello

The physical plan shows that the query optimizer removes the clause 'mytag is 
not null', because it's redundant. However, it comes at the expense of not 
being able to query tags that are not present in all files.

Is there a way to outsmart the optimizer and first execute a filter on "mytag is not 
null" before "mytag = 'hello'" in a single query?

   Sebastian


Ted Dunning  writes:


Keep in mind the danger of testing Foo != null. That doesn't work and catches
me by surprise all the time. 'Foo IS NULL' and its variants are what you need.

On Sat, Sep 14, 2019, 4:56 PM hanu mapr  wrote:


Hello Sebastian,

By default Drill sets the field 'foo' to null for the files that don't
contain it. I am of the opinion that the condition where foo = 'bar' should
result in false for all those files which don't contain the field.
Please can you send across the queries which you have run and the observed
result.

Just off the top of my head, some query like the below one might work
select file_name from dfs.`/bla/*/*` where foo != null. --- You might want
to remove duplicate entries. (of course this also results in the rows which
contain the fi

Re: Check presence of field in json file

2019-09-15 Thread Ted Dunning
Another nice tweak would be to consider all missing fields to be "nullable
any".

On Sun, Sep 15, 2019, 8:05 PM Paul Rogers  wrote:

> Hi Sebastian,
>
> On the query below, you've got a string comparison: mytag = 'hello'. I
> suspect this is your problem.
>
> By mentioning mytag, Drill knows you want to project that column from your
> file scan. When a reader notices that a file does not have that column, it
> will make one up and set it to null. As it turns out, the made-up column
> will be Nullable INT. As a result, your result set will contain some
> batches of data with VARCHAR columns (from scanners that found the column)
> and some batches in which the column is INT (for those readers that did not
> find the column.)
>
> Drill processes data in batches. Batches contain the data from one file
> (or, if the file is big, a single file may produce multiple batches.) At
> its lowest level, Drill can handle the case in which column mytag is INT in
> some batches (those where no such field was found in JSON), and VARCHAR in
> others (where the field was found.) Unfortunately, higher-level code in
> Drill cannot handle conflicting schema, causing endless user confusion.
>
>
> I suspect the NumberFormatException occurs because of the conflict between
> numeric and VARCHAR column types. Further, any attempt to group, sort or
> aggregate will also fail due to a schema conflict.
>
> This is a longstanding "feature" of Drill that we discuss in the Learning
> Apache Drill book.
>
> In this particular case, one would expect Drill to have used a
> short-circuit evaluation of the AND conjunction, evaluating mytag = 'hello'
> only when the column is not null. This should have filtered out all the INT
> batches since they are NULL. Perhaps there is a bug there somewhere.
>
>
> Note that the problem may be worse. If you have JSON like the following,
> Drill will also fail:
>
> {a: 10, mytag: null}
> {a: 20, mytag: "hello"}
>
> The above will fail because Drill must guess a column type when it sees
> the first mytag: null. It will guess nullable Int. Then, when it sees
> mytag: "hello", it will try to write a VARCHAR into an INT column and will
> fail.
>
> I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello'
> Not sure if this will work, but worth a try.
>
> The longer-term fix is that the team is working on a schema feature that
> will let you tell Drill that the column is VARCHAR even if it is not found
> in the data source. The feature is available for CSV and some other file
> types, but not yet for JSON.
>
> An obvious enhancement in this one case is that Drill itself can tell your
> intent is for the column to be VARCHAR. The analyzer should be able to
> infer that column type without you telling Drill this fact. As far as I
> know, this addition is not yet part of the schema system plan, but would be
> a nice additional tweak.
>
>
> Thanks,
> - Paul
>
>
>
> On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister <
> sfisc...@uwaterloo.ca> wrote:
>
>  While it's possible to test for 'is not null', you actually cannot query
> the tag, because it produces a system error due to the optimizer doing its
> job. Take this query as an example:
>
> SELECT  mytag
> FROM dfs.`/bla/*/*`
> WHERE mytag is not null and mytag = 'hello'
>
> Some files contain "mytag", others don't.
>
> The query works with just the clause 'mytag is not null', because all json
> files missing mytag will get mytag set to null, which is type compatible
> with the filter clause. However it actually does not work with "mytag is
> not null and mytag = 'hello'" because I get the following error for files
> where mytag is not present.
>
> SYSTEM ERROR: NumberFormatException: hello
>
> The physical plan shows that the query optimizer removes the clause 'mytag
> is not null', because it's redundant. However, it comes at the expense of
> not being able to query tags that are not present in all files.
>
> Is there a way to outsmart the optimizer and first execute a filter on
> "mytag is not null" before "mytag = 'hello'" in a single query?
>
>   Sebastian
>
>
> Ted Dunning  writes:
>
> > Keep in mind the danger of testing Foo != null. That doesn't work and
> > catches me by surprise all the time. 'Foo IS NULL' and its variants are
> > what you need.
> >
> > On Sat, Sep 14, 2019, 4:56 PM hanu mapr  wrote:
> >
> >> Hello Sebastian,
> >>
> >> By default Drill sets the field 'foo' to null for the files that don't
> >> contain it. I am of the opinion that the condition where foo = 'bar'
> should
> >> result in false for all those files which don't contain the field.
> >> Please can you send across the queries which you have run and the
> observed
> >> result.
> >>
> >> Just off the top of my head, some query like the below one might work
> >> select file_name from dfs.`/bla/*/*` where foo != null. --- You might
> want
> >> to remove duplicate entries. (of course this also results in the rows
> which
> >> contain the field and are null).
> >

Re: Check presence of field in json file

2019-09-15 Thread Paul Rogers
Hi Sebastian,

On the query below, you've got a string comparison: mytag = 'hello'. I suspect 
this is your problem.

By mentioning mytag, Drill knows you want to project that column from your file 
scan. When a reader notices that a file does not have that column, it will make 
one up and set it to null. As it turns out, the made-up column will be Nullable 
INT. As a result, your result set will contain some batches of data with 
VARCHAR columns (from scanners that found the column) and some batches in which 
the column is INT (for those readers that did not find the column.)

Drill processes data in batches. Batches contain the data from one file (or, if 
the file is big, a single file may produce multiple batches.) At its lowest 
level, Drill can handle the case in which column mytag is INT in some batches 
(those where no such field was found in JSON), and VARCHAR in others (where the 
field was found.) Unfortunately, higher-level code in Drill cannot handle 
conflicting schema, causing endless user confusion.


I suspect the NumberFormatException occurs because of the conflict between 
numeric and VARCHAR column types. Further, any attempt to group, sort or 
aggregate will also fail due to a schema conflict.

This is a longstanding "feature" of Drill that we discuss in the Learning 
Apache Drill book.

In this particular case, one would expect Drill to have used a short-circuit 
evaluation of the AND conjunction, evaluating mytag = 'hello' only when the 
column is not null. This should have filtered out all the INT batches since 
they are NULL. Perhaps there is a bug there somewhere.


Note that the problem may be worse. If you have JSON like the following, Drill 
will also fail:

{a: 10, mytag: null}
{a: 20, mytag: "hello"}

The above will fail because Drill must guess a column type when it sees the 
first mytag: null. It will guess nullable Int. Then, when it sees mytag: 
"hello", it will try to write a VARCHAR into an INT column and will fail.

I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello' Not 
sure if this will work, but worth a try.
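
Spelled out, that would be something like this (again, untested):

SELECT mytag
FROM dfs.`/bla/*/*`
WHERE mytag IS NOT NULL
  AND CAST(mytag AS VARCHAR) = 'hello';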

The longer-term fix is that the team is working on a schema feature that will 
let you tell Drill that the column is VARCHAR even if it is not found in the 
data source. The feature is available for CSV and some other file types, but 
not yet for JSON.

An obvious enhancement in this one case is that Drill itself can tell your 
intent is for the column to be VARCHAR. The analyzer should be able to infer 
that column type without you telling Drill this fact. As far as I know, this 
addition is not yet part of the schema system plan, but would be a nice 
additional tweak.


Thanks,
- Paul

 

On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister 
 wrote:  
 
 While it's possible to test for 'is not null', you actually cannot query the 
tag, because it produces a system error due to the optimizer doing its job. 
Take this query as an example:

SELECT  mytag
FROM dfs.`/bla/*/*` 
WHERE mytag is not null and mytag = 'hello'

Some files contain "mytag", others don't.

The query works with just the clause 'mytag is not null', because all json 
files missing mytag will get mytag set to null, which is type compatible with 
the filter clause. However it actually does not work with "mytag is not null 
and mytag = 'hello'" because I get the following error for files where mytag is 
not present.

SYSTEM ERROR: NumberFormatException: hello

The physical plan shows that the query optimizer removes the clause 'mytag is 
not null', because it's redundant. However, it comes at the expense of not 
being able to query tags that are not present in all files.

Is there a way to outsmart the optimizer and first execute a filter on "mytag 
is not null" before "mytag = 'hello'" in a single query?

  Sebastian


Ted Dunning  writes:

> Keep in mind the danger of testing Foo != null. That doesn't work and catches
> me by surprise all the time. 'Foo IS NULL' and its variants are what you need.
>
> On Sat, Sep 14, 2019, 4:56 PM hanu mapr  wrote:
>
>> Hello Sebastian,
>>
>> By default Drill sets the field 'foo' to null for the files that don't
>> contain it. I am of the opinion that the condition where foo = 'bar' should
>> result in false for all those files which don't contain the field.
>> Please can you send across the queries which you have run and the observed
>> result.
>>
>> Just off the top of my head, some query like the below one might work
>> select file_name from dfs.`/bla/*/*` where foo != null. --- You might want
>> to remove duplicate entries. (of course this also results in the rows which
>> contain the field and are null).
>>
>> Hope this helps.
>>
>> Thanks
>>
>>
>> On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister <
>> sfisc...@uwaterloo.ca> wrote:
>>
>> > Hi,
>> >
>> > When searching multiple directories, drill only searches fields that are
>> > common to all files (see the json data model). Is there a way to query a
>> > directory and

Re: Check presence of field in json file

2019-09-15 Thread Sebastian Fischmeister
While it's possible to test for 'is not null', you actually cannot query the 
tag, because it produces a system error due to the optimizer doing its job. 
Take this query as an example:

SELECT  mytag
FROM dfs.`/bla/*/*` 
WHERE mytag is not null and mytag = 'hello'

Some files contain "mytag", others don't.

The query works with just the clause 'mytag is not null', because all JSON 
files missing mytag will get mytag set to null, which is type compatible with 
the filter clause. However, it actually does not work with "mytag is not null 
and mytag = 'hello'" because I get the following error for files where mytag is 
not present.

SYSTEM ERROR: NumberFormatException: hello

The physical plan shows that the query optimizer removes the clause 'mytag is 
not null' because it considers it redundant. However, this comes at the expense 
of not being able to query tags that are not present in all files.

Is there a way to outsmart the optimizer and first execute a filter on "mytag 
is not null" before "mytag = 'hello'" in a single query?
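
One candidate, sketched here only for concreteness (I have not verified whether 
the planner keeps the filters separate, which is exactly the question):

SELECT t.mytag
FROM (
  SELECT mytag
  FROM dfs.`/bla/*/*`
  WHERE mytag IS NOT NULL
) t
WHERE t.mytag = 'hello';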

  Sebastian


Ted Dunning  writes:

> Keep in mind the danger of testing Foo != null. That doesn't work and catches
> me by surprise all the time. 'Foo IS NULL' and its variants are what you need.
>
> On Sat, Sep 14, 2019, 4:56 PM hanu mapr  wrote:
>
>> Hello Sebastian,
>>
>> By default Drill sets the field 'foo' to null for the files that don't
>> contain it. I am of the opinion that the condition where foo = 'bar' should
>> result in false for all those files which don't contain the field.
>> Please can you send across the queries which you have run and the observed
>> result.
>>
>> Just off the top of my head, some query like the below one might work
>> select file_name from dfs.`/bla/*/*` where foo != null. --- You might want
>> to remove duplicate entries. (of course this also results in the rows which
>> contain the field and are null).
>>
>> Hope this helps.
>>
>> Thanks
>>
>>
>> On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister <
>> sfisc...@uwaterloo.ca> wrote:
>>
>> > Hi,
>> >
>> > When searching multiple directories, drill only searches fields that are
>> > common to all files (see the json data model). Is there a way to query a
>> > directory and list all files that contain a certain field?
>> >
>> > In other words, I would like to use the workaround in this way:
>> >
>> > select * from (select fqn from dfs.`/bla/*/*` where foo exists) where foo
>> > = 'bar'
>> >
>> > Or is there another way to do this? I dynamically get more files, so
>> > finding the files should be included in the query.
>> >
>> > An alternative would be to execute the query such that it sets the field
>> > 'foo' to null for all files that don't contain it. However, I don't know
>> > how to execute this.
>> >
>> > Thanks,
>> >   Sebastian
>> >
>>


Re: Check presence of field in json file

2019-09-15 Thread Ted Dunning
Keep in mind the danger of testing Foo != null. That doesn't work and catches
me by surprise all the time. 'Foo IS NULL' and its variants are what you need.
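
For example, reusing the fqn column from Sebastian's original query (untested):

-- foo <> null (or foo != null) never evaluates to TRUE under SQL's
-- three-valued logic; the IS [NOT] NULL predicates are what work.
SELECT fqn FROM dfs.`/bla/*/*` WHERE foo IS NOT NULL;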

On Sat, Sep 14, 2019, 4:56 PM hanu mapr  wrote:

> Hello Sebastian,
>
> By default Drill sets the field 'foo' to null for the files that don't
> contain it. I am of the opinion that the condition where foo = 'bar' should
> result in false for all those files which don't contain the field.
> Please can you send across the queries which you have run and the observed
> result.
>
> Just off the top of my head, some query like the below one might work
> select file_name from dfs.`/bla/*/*` where foo != null. --- You might want
> to remove duplicate entries. (of course this also results in the rows which
> contain the field and are null).
>
> Hope this helps.
>
> Thanks
>
>
> On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister <
> sfisc...@uwaterloo.ca> wrote:
>
> > Hi,
> >
> > When searching multiple directories, drill only searches fields that are
> > common to all files (see the json data model). Is there a way to query a
> > directory and list all files that contain a certain field?
> >
> > In other words, I would like to use the workaround in this way:
> >
> > select * from (select fqn from dfs.`/bla/*/*` where foo exists) where foo
> > = 'bar'
> >
> > Or is there another way to do this? I dynamically get more files, so
> > finding the files should be included in the query.
> >
> > An alternative would be to execute the query such that it sets the field
> > 'foo' to null for all files that don't contain it. However, I don't know
> > how to execute this.
> >
> > Thanks,
> >   Sebastian
> >
>


Re: Check presence of field in json file

2019-09-14 Thread hanu mapr
Hello Sebastian,

By default Drill sets the field 'foo' to null for the files that don't
contain it. I am of the opinion that the condition where foo = 'bar' should
result in false for all those files which don't contain the field.
Please can you send across the queries which you have run and the observed
result.

Just off the top of my head, some query like the below one might work
select file_name from dfs.`/bla/*/*` where foo != null. --- You might want
to remove duplicate entries. (of course this also results in the rows which
contain the field and are null).
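
A variant of the above using IS NOT NULL and the fqn column from Sebastian's 
query below (untested sketch):

SELECT DISTINCT fqn
FROM dfs.`/bla/*/*`
WHERE foo IS NOT NULL;
-- DISTINCT drops the duplicate entries mentioned above. Files where foo is
-- present but always null will not be listed, since the predicate requires a
-- non-null value.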

Hope this helps.

Thanks


On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister <
sfisc...@uwaterloo.ca> wrote:

> Hi,
>
> When searching multiple directories, drill only searches fields that are
> common to all files (see the json data model). Is there a way to query a
> directory and list all files that contain a certain field?
>
> In other words, I would like to use the workaround in this way:
>
> select * from (select fqn from dfs.`/bla/*/*` where foo exists) where foo
> = 'bar'
>
> Or is there another way to do this? I dynamically get more files, so
> finding the files should be included in the query.
>
> An alternative would be to execute the query such that it sets the field
> 'foo' to null for all files that don't contain it. However, I don't know
> how to execute this.
>
> Thanks,
>   Sebastian
>