Re: Check presence of field in json file
I'm still trying out some things. However, one observation regarding the 'DontKnow'. It's also relevant because, depending on the name order of the files in the directory, the same query currently can return different results.

> A partial solution is the one that Ted suggested: have readers create a
> "DontKnow" column type, then modify each of a dozen operators to merge
> columns of type X/X, X/DontKnow and DontKnow/DontKnow. Might work, but we'd
> need a volunteer to implement such a sweeping change; it is a non-trivial
> exercise.

If you have multiple files, and the tag only occurs in the second file:

file1.json: { }
file2.json: { "a": "foo" }
file3.json: { }

then select sqlTypeOf(a) from ... will return

NULL
VARCHAR
VARCHAR

If you rename the files, the result is different. This will make the implementation of DontKnow tricky, because the query should return

DONTKNOW
VARCHAR
DONTKNOW

and not

DONTKNOW
VARCHAR
VARCHAR

Otherwise, you have query results that depend on the names of the files, and for more complex WHERE clauses, the optimizer might incorrectly throw out necessary checks (e.g., when first checking for not DONTKNOW and then checking for a specific value).

Sebastian
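For what it's worth, the order dependence above can be modeled outside Drill. The following is a hypothetical Python sketch (not Drill code) of a reader that infers a column type per file and carries the last-seen type forward for missing columns, which is roughly why renaming the files changes the result:

```python
# Toy model of per-file type inference with carry-forward. Assumption: a
# reader that has not yet seen column "a" reports None ("don't know") for it.
def infer_types(files, column):
    types = []
    last_seen = None  # no type known yet
    for data in files:
        if column in data:
            last_seen = "VARCHAR" if isinstance(data[column], str) else "BIGINT"
            types.append(last_seen)
        else:
            # Missing column: reuse the last known type, else "don't know".
            types.append(last_seen)
    return types

# Tag only occurs in the second file (file1, file2, file3).
print(infer_types([{}, {"a": "foo"}, {}], "a"))
# [None, 'VARCHAR', 'VARCHAR']

# Rename so the tagged file scans last: the unknown slots move.
print(infer_types([{}, {}, {"a": "foo"}], "a"))
# [None, None, 'VARCHAR']
```

The carry-forward step is what makes the result depend on scan order: whichever file happens to be read before the first occurrence of the tag gets the "don't know" slot.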
Re: Check presence of field in json file
Hi Sebastian,

Drill has no magic to work with heterogeneous JSON files. Drill works best with homogeneous JSON with no null or missing fields: this pattern leverages Drill's schema-on-read technology. With heterogeneous JSON, or missing/null fields, and without a schema, there is no magic that will enable Drill to create a consistent schema across distributed readers. I realize there is a belief that Drill can do this, but that is true only for the "happy path" mentioned above.

As Boaz suggested, if you have single-level JSON, you can force all fields to VARCHAR. However, this does not work for nested fields: you will end up with a MAP column in those files that contain the map, but a VARCHAR column in those files without the map.

Sorry about the incorrect function name: it is sqlTypeOf(). See [1]. There is also a typeof() function, which is SQL compliant, but that function returns "NULL" if the value is null, while sqlTypeOf() returns the actual type name, even for null values.

My bad on "isNull"; I confused Drill's function with another product. In Drill, you use the IS NULL operator as you were doing.

The idea of the nested select was to first check if yourtag is a MAP. (This handles the case where yourtag is missing.) Once we know it is, we use another nested query to check if yourtag.foo is a VARCHAR. (This handles the case where yourtag.foo is missing.) If it is, then you can apply the yourtag.foo LIKE '%foo%' pattern in the outermost query. Sorry that I've not actually tested the above; it is from memory, so caveat emptor...

One related point: recall that Drill is a SQL engine. To use anything other than the Drill native client (or, with work, JDBC), you must flatten your JSON into a single tuple. So, if your goal is to do JSON processing (preserve nested structures after querying), you'll need to write some special client code (or, I believe, use CTAS to create a new JSON file.)

That is, Drill is good at structured-JSON to records; it is not as good a solution for structured-JSON to structured-JSON (except in CTAS.) Said another way, don't expect the REST API, say, to return your data as structured JSON. Do expect Drill to help you flatten your JSON into something that, say, Excel or Tableau can consume. As a result, each time you mention a column, you pull it into the top-level record. That is:

SELECT yourtag.foo AS bar FROM ...

creates a top-level column "bar"; you'd then use that column in your final check: bar LIKE '%foo%'.

SELECT yourtag.foo AS bar FROM ... WHERE bar LIKE '%foo%'

The question of schema is really more of a product feature discussion; it won't help you with your query today. IMHO, the only workable solution to handle heterogeneous JSON with null or missing fields is to tell each reader the type of the missing columns so that those readers can create columns of that type.

A partial solution is the one that Ted suggested: have readers create a "DontKnow" column type, then modify each of a dozen operators to merge columns of type X/X, X/DontKnow and DontKnow/DontKnow. Might work, but we'd need a volunteer to implement such a sweeping change; it is a non-trivial exercise.

The solution is partial: it still does not solve the {a: 10}, {a: 10.1} problem (which currently fails in the reader). Nor does it solve the problem of what to return to the JDBC/ODBC driver if the column never appears in some queries, so that the type is X when the column appears but DontKnow when it does not: all clients would need to be ready for a column that is, say, an INT some times and DontKnow other times. Most BI tools can't handle this.

So, although it is an old-school concept, and frowned upon in Drill, declaring a schema is the simplest, most reliable solution to ensure that all parts of the pipeline (including the client) agree on data types.

(Using a schema need not imply using the Hive Metastore, which has its own issues; the schema could be in a file...) Of course, Drill does not yet support a schema for JSON, so this is only a theoretical discussion for Drill...

Thanks,
- Paul

[1] https://drill.apache.org/docs/data-type-functions/

On Thursday, September 19, 2019, 10:19:58 PM PDT, Sebastian Fischmeister wrote:

I was hoping to use Drill to aggregate information across a heterogeneous set of JSON files. With a defined schema, I could go to an RDBMS.

The function sqlType() doesn't seem to exist in Drill. I also found no reference to it in the manual. Also the function isNull(), which you mentioned, doesn't seem to exist.

This also doesn't work: WHERE yourtag like '%foo%'. The idea was to just see whether the map as a JSON string contains the name. Casting yourtag to a string doesn't work either.

I would need some function to test whether 'yourtag.foo' exists regardless of whether 'yourtag' or 'yourtag.foo' exists; and yourtag is a MAP type.

Sebastian

Paul Rogers writes:
> Hi Sebastian,
>
> The nested map is a very har
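The X/DontKnow merge rules described above could look roughly like this (a hypothetical Python model; Drill's operators are Java, and no such type exists in Drill today):

```python
# Hypothetical merge rule for a "DontKnow" column type: a known type wins
# over DontKnow; two known types must agree or the schemas conflict.
DONTKNOW = "DONTKNOW"

def merge_types(left, right):
    if left == DONTKNOW:
        return right          # DontKnow/X -> X (also DontKnow/DontKnow)
    if right == DONTKNOW or left == right:
        return left           # X/DontKnow -> X, X/X -> X
    raise ValueError(f"schema conflict: {left} vs {right}")

print(merge_types("VARCHAR", DONTKNOW))   # VARCHAR
print(merge_types(DONTKNOW, DONTKNOW))    # DONTKNOW
# merge_types("BIGINT", "VARCHAR") would raise: the {a: 10}, {a: 10.1}
# class of conflict is not solved by DontKnow.
```

Note that this model says nothing about which of the three cases a given batch falls into; that depends on scan order, which is exactly the order-dependence problem Sebastian raises.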
Re: Check presence of field in json file
I was hoping to use Drill to aggregate information across a heterogeneous set of JSON files. With a defined schema, I could go to an RDBMS.

The function sqlType() doesn't seem to exist in Drill. I also found no reference to it in the manual. Also the function isNull(), which you mentioned, doesn't seem to exist.

This also doesn't work: WHERE yourtag like '%foo%'. The idea was to just see whether the map as a JSON string contains the name. Casting yourtag to a string doesn't work either.

I would need some function to test whether 'yourtag.foo' exists regardless of whether 'yourtag' or 'yourtag.foo' exists; and yourtag is a MAP type.

Sebastian

Paul Rogers writes:
> Hi Sebastian,
>
> The nested map is a very hard case! Without a schema, there is no way for the
> reader of a.json to know it should create a map. Even if some part of Drill
> were to infer that `yourtag` is a map, nothing would figure out the name and
> type of the items within the map. This is a known weakness in the current
> implementation. (Though, to be fair, without a schema, it is nearly
> impossible to get consistency.)
>
> If you only want the b.json data (the one with the nested map), then maybe
> you can go by column type using a nested select:
>
> SELECT id, yourtag.foo.foo1 AS myFoo FROM (
>   SELECT id, yourtag WHERE sqlType(yourTag) = 'MAP' FROM ...)
>
> The inner select returns only the rows where the yourtag column is a MAP.
> This *might* weed out the batches in which the column did not exist and Drill
> made up a nullable INT column.
>
> Not 100% sure that this will work, but worth a try.
>
> Note that foo <> null will never be true. And foo <> 'null' tries to match a
> string with the value of "null" (not a null string). If you do want to check
> for null, use isNull(foo).
>
> Note also that a MAP is never null; though the fields within it can be. (The
> MAP type is not nullable in Drill.)
> > Thanks, > - Paul > > > > On Wednesday, September 18, 2019, 9:20:24 PM PDT, Sebastian Fischmeister > wrote: > > Thanks to all for the replies so far. Some of the approaches work, if the > search targets a single tag. However, none work, if the search targets a > dictionary. For example: > > # file a.json > { "id": 1 } > # file b.json > { "id": 2, > "yourtag": { "foo": { "foo1" = 1 } } } > > > SELECT id, yourtag.foo.foo1 > FROM dfs.`/bla/*/*` > WHERE yourtag.foo is not null ## alt: REPEATED_CONTAINS(yourtag,'foo') # > alt2: yourtag.foo <> 'null' > > > I haven't found a way to write the search, so it returns the tuple (2, 1). > The problem is that instead of returning simply null, if yourtag.foo doesn't > exist, drill returns "Missing function implementation: > [repeated_contains(MAP-REQUIRED, VARCHAR-REQUIRED)]. or 'Table \'yourtag\' > not found'. > > Sebastian > > > > Paul Rogers writes: > >> Hi Sebastian, >> >> On the query below, you've got a string comparison: mytag = 'hello'. I >> suspect this is your problem. >> >> By mentioning mytag, Drill knows you want to project that column from your >> file scan. When a reader notices that a file does not have that column, it >> will make one up and set it to null. As it turns out, the made-up column >> will be Nullable INT. As a result, your result set will contain some batches >> of data with VARCHAR columns (from scanners that found the column) and some >> batches in which the column is INT (for those readers that did not find the >> column.) >> >> Drill processes data in batches. Batches contain the data from one file (or, >> if the file is big, a single file may produce multiple batches.) At its >> lowest level, Drill can handle the case in which column mytag is INT in some >> batches (those where no such field was found in JSON), and VARCHAR in others >> (where the field was found.) Unfortunately, higher-level code in Drill >> cannot handle conflicting schema, causing endless user confusion. 
>> >> >> I suspect the NumberFormatException occurs because of the conflict between >> numeric and VARCHAR column types. Further, any attempt to group, sort or >> aggregate will also fail due to a schema conflict. >> >> This is a longstanding "feature" of Drill that we discuss in the Learning >> Apache Drill book. >> >> In this particular case, one would expect Drill to have used a short-circuit >> evaluation of the AND conjunction, evaluate mytag = 'hello' only when the >> column is not null. This should have filtered out all the INT batches since >> they are NULL. Perhaps there is a bug there somewhere. >> >> >> Note that the problem may be worse. If you have JSON like the following, >> Drill will also fail: >> >> {a: 10, mytag: null} >> {a: 20, mytag: "hello"} >> >> The above will fail because Drill must guess a column type when it sees the >> first mytag: null. It will guess nullable Int. Then, when it sees mytag: >> "hello", it will try to write a VARCHAR into an INT column and will fail. >> >> I wonder if you can try cas
Re: Check presence of field in json file
Hi Sebastian,

The nested map is a very hard case! Without a schema, there is no way for the reader of a.json to know it should create a map. Even if some part of Drill were to infer that `yourtag` is a map, nothing would figure out the name and type of the items within the map. This is a known weakness in the current implementation. (Though, to be fair, without a schema, it is nearly impossible to get consistency.)

If you only want the b.json data (the one with the nested map), then maybe you can go by column type using a nested select:

SELECT id, yourtag.foo.foo1 AS myFoo FROM (
  SELECT id, yourtag FROM ... WHERE sqlType(yourTag) = 'MAP')

The inner select returns only the rows where the yourtag column is a MAP. This *might* weed out the batches in which the column did not exist and Drill made up a nullable INT column.

Not 100% sure that this will work, but worth a try.

Note that foo <> null will never be true. And foo <> 'null' tries to match a string with the value of "null" (not a null string). If you do want to check for null, use isNull(foo).

Note also that a MAP is never null; though the fields within it can be. (The MAP type is not nullable in Drill.)

Thanks,
- Paul

On Wednesday, September 18, 2019, 9:20:24 PM PDT, Sebastian Fischmeister wrote:

Thanks to all for the replies so far. Some of the approaches work if the search targets a single tag. However, none work if the search targets a dictionary. For example:

# file a.json
{ "id": 1 }

# file b.json
{ "id": 2,
  "yourtag": { "foo": { "foo1": 1 } } }

SELECT id, yourtag.foo.foo1
FROM dfs.`/bla/*/*`
WHERE yourtag.foo is not null  ## alt: REPEATED_CONTAINS(yourtag,'foo')  # alt2: yourtag.foo <> 'null'

I haven't found a way to write the search so it returns the tuple (2, 1). The problem is that instead of simply returning null if yourtag.foo doesn't exist, Drill returns "Missing function implementation: [repeated_contains(MAP-REQUIRED, VARCHAR-REQUIRED)]" or "Table 'yourtag' not found".
Sebastian Paul Rogers writes: > Hi Sebastian, > > On the query below, you've got a string comparison: mytag = 'hello'. I > suspect this is your problem. > > By mentioning mytag, Drill knows you want to project that column from your > file scan. When a reader notices that a file does not have that column, it > will make one up and set it to null. As it turns out, the made-up column will > be Nullable INT. As a result, your result set will contain some batches of > data with VARCHAR columns (from scanners that found the column) and some > batches in which the column is INT (for those readers that did not find the > column.) > > Drill processes data in batches. Batches contain the data from one file (or, > if the file is big, a single file may produce multiple batches.) At its > lowest level, Drill can handle the case in which column mytag is INT in some > batches (those where no such field was found in JSON), and VARCHAR in others > (where the field was found.) Unfortunately, higher-level code in Drill cannot > handle conflicting schema, causing endless user confusion. > > > I suspect the NumberFormatException occurs because of the conflict between > numeric and VARCHAR column types. Further, any attempt to group, sort or > aggregate will also fail due to a schema conflict. > > This is a longstanding "feature" of Drill that we discuss in the Learning > Apache Drill book. > > In this particular case, one would expect Drill to have used a short-circuit > evaluation of the AND conjunction, evaluate mytag = 'hello' only when the > column is not null. This should have filtered out all the INT batches since > they are NULL. Perhaps there is a bug there somewhere. > > > Note that the problem may be worse. If you have JSON like the following, > Drill will also fail: > > {a: 10, mytag: null} > {a: 20, mytag: "hello"} > > The above will fail because Drill must guess a column type when it sees the > first mytag: null. It will guess nullable Int. 
Then, when it sees mytag: > "hello", it will try to write a VARCHAR into an INT column and will fail. > > I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello' Not > sure if this will work, but worth a try. > > The longer-term fix is that the team is working on a schema feature that will > let you tell Drill that the column is VARCHAR even if it is not found in the > data source. The feature is available for CSV and some other file types, but > not yet for JSON. > > An obvious enhancement in this one case is that Drill itself can tell your > intent is for the column to be VARCHAR. The analyzer should be able to infer > that column type without you telling Drill this fact. As far as I know, this > addition is not yet part of the schema system plan, but would be a nice > additional tweak. > > > Thanks, > - Paul > > > > On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister > wrote: > > While it's possible to
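The type-filtering idea behind Paul's nested select can be modeled in plain Python (a hypothetical sketch; row dictionaries stand in for the JSON records, and an isinstance check stands in for sqlTypeOf(...) = 'MAP'):

```python
# Keep only rows where "yourtag" is actually a map (dict), then project the
# nested field -- the same filter-by-type idea as the nested select above.
rows = [
    {"id": 1},                                    # a.json: no yourtag
    {"id": 2, "yourtag": {"foo": {"foo1": 1}}},   # b.json: nested map
]

result = [
    (r["id"], r["yourtag"]["foo"]["foo1"])
    for r in rows
    if isinstance(r.get("yourtag"), dict)         # sqlTypeOf(yourtag) = 'MAP'
]
print(result)  # [(2, 1)] -- the tuple Sebastian wanted
```

The key point is that the type check runs before the nested access, so rows where the column is missing (or was guessed as nullable INT) never reach the projection.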
Re: Check presence of field in json file
Thanks to all for the replies so far. Some of the approaches work if the search targets a single tag. However, none work if the search targets a dictionary. For example:

# file a.json
{ "id": 1 }

# file b.json
{ "id": 2,
  "yourtag": { "foo": { "foo1": 1 } } }

SELECT id, yourtag.foo.foo1
FROM dfs.`/bla/*/*`
WHERE yourtag.foo is not null  ## alt: REPEATED_CONTAINS(yourtag,'foo')  # alt2: yourtag.foo <> 'null'

I haven't found a way to write the search so it returns the tuple (2, 1). The problem is that instead of simply returning null if yourtag.foo doesn't exist, Drill returns "Missing function implementation: [repeated_contains(MAP-REQUIRED, VARCHAR-REQUIRED)]" or "Table 'yourtag' not found".

Sebastian

Paul Rogers writes:
> Hi Sebastian,
>
> On the query below, you've got a string comparison: mytag = 'hello'. I
> suspect this is your problem.
>
> By mentioning mytag, Drill knows you want to project that column from your
> file scan. When a reader notices that a file does not have that column, it
> will make one up and set it to null. As it turns out, the made-up column will
> be Nullable INT. As a result, your result set will contain some batches of
> data with VARCHAR columns (from scanners that found the column) and some
> batches in which the column is INT (for those readers that did not find the
> column.)
>
> Drill processes data in batches. Batches contain the data from one file (or,
> if the file is big, a single file may produce multiple batches.) At its
> lowest level, Drill can handle the case in which column mytag is INT in some
> batches (those where no such field was found in JSON), and VARCHAR in others
> (where the field was found.) Unfortunately, higher-level code in Drill cannot
> handle conflicting schema, causing endless user confusion.
>
> I suspect the NumberFormatException occurs because of the conflict between
> numeric and VARCHAR column types. Further, any attempt to group, sort or
> aggregate will also fail due to a schema conflict.
> > This is a longstanding "feature" of Drill that we discuss in the Learning > Apache Drill book. > > In this particular case, one would expect Drill to have used a short-circuit > evaluation of the AND conjunction, evaluate mytag = 'hello' only when the > column is not null. This should have filtered out all the INT batches since > they are NULL. Perhaps there is a bug there somewhere. > > > Note that the problem may be worse. If you have JSON like the following, > Drill will also fail: > > {a: 10, mytag: null} > {a: 20, mytag: "hello"} > > The above will fail because Drill must guess a column type when it sees the > first mytag: null. It will guess nullable Int. Then, when it sees mytag: > "hello", it will try to write a VARCHAR into an INT column and will fail. > > I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello' Not > sure if this will work, but worth a try. > > The longer-term fix is that the team is working on a schema feature that will > let you tell Drill that the column is VARCHAR even if it is not found in the > data source. The feature is available for CSV and some other file types, but > not yet for JSON. > > An obvious enhancement in this one case is that Drill itself can tell your > intent is for the column to be VARCHAR. The analyzer should be able to infer > that column type without you telling Drill this fact. As far as I know, this > addition is not yet part of the schema system plan, but would be a nice > additional tweak. > > > Thanks, > - Paul > > > > On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister > wrote: > > While it's possible to test for 'is not null', you actually cannot query the > tag, because it provides an system error due to the optimizer doing its job. > Take this query as example: > > SELECT mytag > FROM dfs.`/bla/*/*` > WHERE mytag is not null and mytag = 'hello' > > Some files contain "mytag", others don't. 
> > The query works with just the clause 'mytag is not null', because all json > files missing mytag will get mytag set to null, which is type compatible with > the filter clause. However it actually does not work with "mytag is not null > and mytag = 'hello'" because I get the following error for files where mytag > is not present. > > SYSTEM ERROR: NumberFormatException: hello > > The physical plan shows that the query optimizer removes the clause 'mytag is > not null', because it's redundant. However, it comes at the expense of not > being able to query tags that are not present in all files. > > Is there a way to outsmart the optimizer and first execute a filter on "mytag > is not null" before "mytag = 'hello'" in a single query? > > Sebastian > > > Ted Dunning writes: > >> Keep in mind the danger if testing Foo!=null. That doesn't work and catches >> me by surprise all the time. Foo is null and variants are what you need. >> >> On Sat, Sep 14, 2019, 4:56 PM hanu mapr wrote: >>
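Ted's warning about Foo != null follows from SQL's three-valued logic: any comparison with NULL yields UNKNOWN, and a WHERE clause keeps a row only when its predicate is strictly true. A hypothetical Python model of that rule:

```python
# SQL three-valued logic in miniature: comparisons with NULL yield UNKNOWN
# (modeled here as None), and WHERE keeps only rows where the predicate is
# strictly True -- so `foo != null` can never match anything.
NULL = None

def sql_ne(a, b):
    if a is NULL or b is NULL:
        return None            # UNKNOWN, never True
    return a != b

rows = [NULL, "bar", "baz"]
kept = [v for v in rows if sql_ne(v, NULL) is True]
print(kept)  # [] -- use IS NULL / IS NOT NULL instead
```

This is why the IS NULL / IS NOT NULL operators exist: they are the only predicates that treat NULL as a testable value rather than as "unknown".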
Re: Check presence of field in json file
Hi Sebastian,

Setting the option "store.json.all_text_mode" to "true" should make the JSON reader interpret all the null-valued or missing columns/values as "text" instead of "int" (it should return the text value 'null'). This may be another workaround to the type-guessing problem that Paul described.

Thanks,
Boaz

On 9/15/19 11:05 AM, Paul Rogers wrote:

Hi Sebastian,

On the query below, you've got a string comparison: mytag = 'hello'. I suspect this is your problem.

By mentioning mytag, Drill knows you want to project that column from your file scan. When a reader notices that a file does not have that column, it will make one up and set it to null. As it turns out, the made-up column will be Nullable INT. As a result, your result set will contain some batches of data with VARCHAR columns (from scanners that found the column) and some batches in which the column is INT (for those readers that did not find the column.)

Drill processes data in batches. Batches contain the data from one file (or, if the file is big, a single file may produce multiple batches.) At its lowest level, Drill can handle the case in which column mytag is INT in some batches (those where no such field was found in JSON), and VARCHAR in others (where the field was found.) Unfortunately, higher-level code in Drill cannot handle conflicting schema, causing endless user confusion.

I suspect the NumberFormatException occurs because of the conflict between numeric and VARCHAR column types. Further, any attempt to group, sort or aggregate will also fail due to a schema conflict.

This is a longstanding "feature" of Drill that we discuss in the Learning Apache Drill book.

In this particular case, one would expect Drill to have used a short-circuit evaluation of the AND conjunction, evaluating mytag = 'hello' only when the column is not null. This should have filtered out all the INT batches since they are NULL. Perhaps there is a bug there somewhere.

Note that the problem may be worse. If you have JSON like the following, Drill will also fail:

{a: 10, mytag: null}
{a: 20, mytag: "hello"}

The above will fail because Drill must guess a column type when it sees the first mytag: null. It will guess nullable Int. Then, when it sees mytag: "hello", it will try to write a VARCHAR into an INT column and will fail.

I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello'. Not sure if this will work, but worth a try.

The longer-term fix is that the team is working on a schema feature that will let you tell Drill that the column is VARCHAR even if it is not found in the data source. The feature is available for CSV and some other file types, but not yet for JSON.

An obvious enhancement in this one case is that Drill itself can tell your intent is for the column to be VARCHAR. The analyzer should be able to infer that column type without you telling Drill this fact. As far as I know, this addition is not yet part of the schema system plan, but would be a nice additional tweak.

Thanks,
- Paul

On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister wrote:

While it's possible to test for 'is not null', you actually cannot query the tag, because it produces a system error due to the optimizer doing its job. Take this query as an example:

SELECT mytag
FROM dfs.`/bla/*/*`
WHERE mytag is not null and mytag = 'hello'

Some files contain "mytag", others don't.

The query works with just the clause 'mytag is not null', because all JSON files missing mytag will get mytag set to null, which is type compatible with the filter clause. However, it actually does not work with "mytag is not null and mytag = 'hello'" because I get the following error for files where mytag is not present:

SYSTEM ERROR: NumberFormatException: hello

The physical plan shows that the query optimizer removes the clause 'mytag is not null', because it's redundant. However, it comes at the expense of not being able to query tags that are not present in all files.

Is there a way to outsmart the optimizer and first execute a filter on "mytag is not null" before "mytag = 'hello'" in a single query?

Sebastian

Ted Dunning writes:

Keep in mind the danger of testing Foo != null. That doesn't work and catches me by surprise all the time. Foo is null and variants are what you need.

On Sat, Sep 14, 2019, 4:56 PM hanu mapr wrote:

Hello Sebastian,

By default Drill sets the field 'foo' to null for the files that don't contain it. I am of the opinion that the condition where foo = 'bar' should result in false for all those files which don't contain the field. Please can you send across the queries which you have run and the observed result.

Just off the top of my head, some query like the below one might work: select file_name from dfs.`/bla/*/*` where foo != null. --- You might want to remove duplicate entries. (Of course this also results in the rows which contain the fi
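The all_text_mode behavior Boaz describes can be approximated in Python (a rough model only; Drill's actual reader is Java, and this simply shows why reading every scalar as text avoids the INT/VARCHAR flip-flop):

```python
import json

# Model of store.json.all_text_mode: every scalar leaf is read as text, so a
# column's type never flips between INT and VARCHAR across files or batches.
def read_all_text(doc):
    def conv(v):
        if isinstance(v, dict):
            return {k: conv(x) for k, x in v.items()}
        if isinstance(v, list):
            return [conv(x) for x in v]
        return "null" if v is None else str(v)   # nulls become the text 'null'
    return conv(json.loads(doc))

print(read_all_text('{"a": 10, "mytag": null}'))
# {'a': '10', 'mytag': 'null'}
print(read_all_text('{"a": 20, "mytag": "hello"}'))
# {'a': '20', 'mytag': 'hello'}
```

The trade-off, as noted earlier in the thread, is that everything becomes VARCHAR, so numeric comparisons then require explicit casts.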
Re: Check presence of field in json file
Another nice tweak would be to consider all missing fields to be "nullable any". On Sun, Sep 15, 2019, 8:05 PM Paul Rogers wrote: > Hi Sebastian, > > On the query below, you've got a string comparison: mytag = 'hello'. I > suspect this is your problem. > > By mentioning mytag, Drill knows you want to project that column from your > file scan. When a reader notices that a file does not have that column, it > will make one up and set it to null. As it turns out, the made-up column > will be Nullable INT. As a result, your result set will contain some > batches of data with VARCHAR columns (from scanners that found the column) > and some batches in which the column is INT (for those readers that did not > find the column.) > > Drill processes data in batches. Batches contain the data from one file > (or, if the file is big, a single file may produce multiple batches.) At > its lowest level, Drill can handle the case in which column mytag is INT in > some batches (those where no such field was found in JSON), and VARCHAR in > others (where the field was found.) Unfortunately, higher-level code in > Drill cannot handle conflicting schema, causing endless user confusion. > > > I suspect the NumberFormatException occurs because of the conflict between > numeric and VARCHAR column types. Further, any attempt to group, sort or > aggregate will also fail due to a schema conflict. > > This is a longstanding "feature" of Drill that we discuss in the Learning > Apache Drill book. > > In this particular case, one would expect Drill to have used a > short-circuit evaluation of the AND conjunction, evaluate mytag = 'hello' > only when the column is not null. This should have filtered out all the INT > batches since they are NULL. Perhaps there is a bug there somewhere. > > > Note that the problem may be worse. 
If you have JSON like the following, > Drill will also fail: > > {a: 10, mytag: null} > {a: 20, mytag: "hello"} > > The above will fail because Drill must guess a column type when it sees > the first mytag: null. It will guess nullable Int. Then, when it sees > mytag: "hello", it will try to write a VARCHAR into an INT column and will > fail. > > I wonder if you can try casting: ... AND CAST(mytag AS VARCHAR) = 'hello' > Not sure if this will work, but worth a try. > > The longer-term fix is that the team is working on a schema feature that > will let you tell Drill that the column is VARCHAR even if it is not found > in the data source. The feature is available for CSV and some other file > types, but not yet for JSON. > > An obvious enhancement in this one case is that Drill itself can tell your > intent is for the column to be VARCHAR. The analyzer should be able to > infer that column type without you telling Drill this fact. As far as I > know, this addition is not yet part of the schema system plan, but would be > a nice additional tweak. > > > Thanks, > - Paul > > > > On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister < > sfisc...@uwaterloo.ca> wrote: > > While it's possible to test for 'is not null', you actually cannot query > the tag, because it provides an system error due to the optimizer doing its > job. Take this query as example: > > SELECT mytag > FROM dfs.`/bla/*/*` > WHERE mytag is not null and mytag = 'hello' > > Some files contain "mytag", others don't. > > The query works with just the clause 'mytag is not null', because all json > files missing mytag will get mytag set to null, which is type compatible > with the filter clause. However it actually does not work with "mytag is > not null and mytag = 'hello'" because I get the following error for files > where mytag is not present. 
> > SYSTEM ERROR: NumberFormatException: hello > > The physical plan shows that the query optimizer removes the clause 'mytag > is not null', because it's redundant. However, it comes at the expense of > not being able to query tags that are not present in all files. > > Is there a way to outsmart the optimizer and first execute a filter on > "mytag is not null" before "mytag = 'hello'" in a single query? > > Sebastian > > > Ted Dunning writes: > > > Keep in mind the danger if testing Foo!=null. That doesn't work and > catches > > me by surprise all the time. Foo is null and variants are what you need. > > > > On Sat, Sep 14, 2019, 4:56 PM hanu mapr wrote: > > > >> Hello Sebastian, > >> > >> By default Drill sets the field 'foo' to null for the files that don't > >> contain it. I am of the opinion that the condition where foo = 'bar' > should > >> result in false for all those files which don't contain the field. > >> Please can you send across the queries which you have run and the > observed > >> result. > >> > >> Just off the top of my head, some query like the below one might work > >> select file_name from dfs.`/bla/*/*` where foo != null. --- You might > want > >> to remove duplicate entries. (of course this also results in the rows > which > >> contain the field and are null). > >
Re: Check presence of field in json file
Hi Sebastian, On the query below, you've got a string comparison: mytag = 'hello'. I suspect this is your problem. By mentioning mytag, Drill knows you want to project that column from your file scan. When a reader notices that a file does not have that column, it will make one up and set it to null. As it turns out, the made-up column will be Nullable INT. As a result, your result set will contain some batches of data with VARCHAR columns (from scanners that found the column) and some batches in which the column is INT (for those readers that did not find the column.) Drill processes data in batches. Batches contain the data from one file (or, if the file is big, a single file may produce multiple batches.) At its lowest level, Drill can handle the case in which column mytag is INT in some batches (those where no such field was found in JSON), and VARCHAR in others (where the field was found.) Unfortunately, higher-level code in Drill cannot handle conflicting schema, causing endless user confusion. I suspect the NumberFormatException occurs because of the conflict between numeric and VARCHAR column types. Further, any attempt to group, sort or aggregate will also fail due to a schema conflict. This is a longstanding "feature" of Drill that we discuss in the Learning Apache Drill book. In this particular case, one would expect Drill to have used a short-circuit evaluation of the AND conjunction, evaluate mytag = 'hello' only when the column is not null. This should have filtered out all the INT batches since they are NULL. Perhaps there is a bug there somewhere. Note that the problem may be worse. If you have JSON like the following, Drill will also fail: {a: 10, mytag: null} {a: 20, mytag: "hello"} The above will fail because Drill must guess a column type when it sees the first mytag: null. It will guess nullable Int. Then, when it sees mytag: "hello", it will try to write a VARCHAR into an INT column and will fail. I wonder if you can try casting: ... 
... AND CAST(mytag AS VARCHAR) = 'hello'

Not sure if this will work, but it is worth a try.

The longer-term fix is a schema feature the team is working on that will let you tell Drill the column is VARCHAR even if it is not found in the data source. The feature is available for CSV and some other file types, but not yet for JSON. An obvious enhancement in this one case: Drill itself can tell that your intent is for the column to be VARCHAR, so the analyzer should be able to infer that column type without you telling Drill this fact. As far as I know, this addition is not yet part of the schema system plan, but it would be a nice additional tweak.

Thanks,
- Paul

On Sunday, September 15, 2019, 4:22:48 AM PDT, Sebastian Fischmeister wrote:

While it's possible to test for 'is not null', you actually cannot query the tag, because doing so produces a system error due to the optimizer doing its job. Take this query as an example:

SELECT mytag FROM dfs.`/bla/*/*` WHERE mytag is not null and mytag = 'hello'

Some files contain "mytag", others don't. The query works with just the clause 'mytag is not null', because all JSON files missing mytag will get mytag set to null, which is type compatible with the filter clause. However, it does not work with "mytag is not null and mytag = 'hello'": I get the following error for files where mytag is not present.

SYSTEM ERROR: NumberFormatException: hello

The physical plan shows that the query optimizer removes the clause 'mytag is not null' because it is redundant. However, this comes at the expense of not being able to query tags that are not present in all files. Is there a way to outsmart the optimizer and first execute a filter on "mytag is not null" before "mytag = 'hello'" in a single query?

Sebastian

Ted Dunning writes:

> Keep in mind the danger of testing Foo != null. That doesn't work and catches
> me by surprise all the time. Foo is null and variants are what you need.
>
> On Sat, Sep 14, 2019, 4:56 PM hanu mapr wrote:
>
>> Hello Sebastian,
>>
>> By default Drill sets the field 'foo' to null for the files that don't
>> contain it. I am of the opinion that the condition where foo = 'bar' should
>> result in false for all those files which don't contain the field.
>> Please send across the queries which you have run and the observed
>> result.
>>
>> Just off the top of my head, some query like the one below might work:
>>
>> select file_name from dfs.`/bla/*/*` where foo != null
>>
>> You might want to remove duplicate entries. (Of course, this also returns
>> the rows which contain the field and are null.)
>>
>> Hope this helps.
>>
>> Thanks
>>
>> On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister <
>> sfisc...@uwaterloo.ca> wrote:
>>
>> > Hi,
>> >
>> > When searching multiple directories, Drill only searches fields that are
>> > common to all files (see the JSON data model). Is there a way to query a
>> > directory and list all files that contain a certain field?
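The per-reader type guessing Paul describes can be sketched in a few lines of plain Python. This is a hypothetical simulation, not Drill code; the `infer_column_type` helper and the file contents are invented for illustration:

```python
# Hypothetical sketch (plain Python, not Drill code) of per-reader type
# guessing: each reader infers a type for a projected column from the data
# it actually sees, and a column that is missing (or always null) falls
# back to a default guess of nullable INT, as described above.

def infer_column_type(records, column):
    """Mimic one reader's type guess for a single projected column."""
    for rec in records:
        value = rec.get(column)
        if isinstance(value, str):
            return "VARCHAR"
        if isinstance(value, int):
            return "BIGINT"
    # Column missing or always null in this file: the reader must invent a type.
    return "NULLABLE INT"

file1 = [{"a": 10}]                    # mytag absent
file2 = [{"a": 20, "mytag": "hello"}]  # mytag present as a string

print([infer_column_type(f, "mytag") for f in (file1, file2)])
# ['NULLABLE INT', 'VARCHAR'] -- two batches, two conflicting schemas
```

Downstream operators that receive both batches then face exactly the INT-versus-VARCHAR conflict that Paul describes.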
Re: Check presence of field in json file
While it's possible to test for 'is not null', you actually cannot query the tag, because doing so produces a system error due to the optimizer doing its job. Take this query as an example:

SELECT mytag FROM dfs.`/bla/*/*` WHERE mytag is not null and mytag = 'hello'

Some files contain "mytag", others don't. The query works with just the clause 'mytag is not null', because all JSON files missing mytag will get mytag set to null, which is type compatible with the filter clause. However, it does not work with "mytag is not null and mytag = 'hello'": I get the following error for files where mytag is not present.

SYSTEM ERROR: NumberFormatException: hello

The physical plan shows that the query optimizer removes the clause 'mytag is not null' because it is redundant. However, this comes at the expense of not being able to query tags that are not present in all files. Is there a way to outsmart the optimizer and first execute a filter on "mytag is not null" before "mytag = 'hello'" in a single query?

Sebastian

Ted Dunning writes:

> Keep in mind the danger of testing Foo != null. That doesn't work and catches
> me by surprise all the time. Foo is null and variants are what you need.
>
> On Sat, Sep 14, 2019, 4:56 PM hanu mapr wrote:
>
>> Hello Sebastian,
>>
>> By default Drill sets the field 'foo' to null for the files that don't
>> contain it. I am of the opinion that the condition where foo = 'bar' should
>> result in false for all those files which don't contain the field.
>> Please send across the queries which you have run and the observed
>> result.
>>
>> Just off the top of my head, some query like the one below might work:
>>
>> select file_name from dfs.`/bla/*/*` where foo != null
>>
>> You might want to remove duplicate entries. (Of course, this also returns
>> the rows which contain the field and are null.)
>>
>> Hope this helps.
>>
>> Thanks
>>
>> On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister <
>> sfisc...@uwaterloo.ca> wrote:
>>
>> > Hi,
>> >
>> > When searching multiple directories, Drill only searches fields that are
>> > common to all files (see the JSON data model). Is there a way to query a
>> > directory and list all files that contain a certain field?
>> >
>> > In other words, I would like to use the workaround in this way:
>> >
>> > select * from (select fqn from dfs.`/bla/*/*` where foo exists) where foo = 'bar'
>> >
>> > Or is there another way to do this? I dynamically get more files, so
>> > finding the files should be included in the query.
>> >
>> > An alternative would be to execute the query such that it sets the field
>> > 'foo' to null for all files that don't contain it. However, I don't know
>> > how to execute this.
>> >
>> > Thanks,
>> > Sebastian
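The effect Sebastian describes can be mimicked outside Drill. The following is a loose, hypothetical Python analogue (not Drill code; the `eq_varchar` helper is invented to stand in for a typed comparison against a column some readers typed as INT): with short-circuit AND, the IS NOT NULL guard protects the comparison, but once an optimizer drops the guard as redundant, the comparison is evaluated against every row, including the made-up null values.

```python
# Hypothetical analogue (plain Python, not Drill) of dropping a redundant
# IS NOT NULL guard: a strictly typed comparison that fails on non-string
# input, standing in for comparing a VARCHAR literal against a mixed column.

rows = [{"mytag": "hello"}, {"mytag": None}]  # second row: field missing/null

def eq_varchar(value, literal):
    """Typed equality: raises if the value is not a string."""
    if not isinstance(value, str):
        raise TypeError(f"cannot compare {value!r} as VARCHAR")
    return value == literal

# Plan with the guard: short-circuit AND never evaluates eq_varchar on null.
guarded = [r["mytag"] for r in rows
           if r["mytag"] is not None and eq_varchar(r["mytag"], "hello")]
print(guarded)  # ['hello']

# "Optimized" plan with the guard removed: fails on the null row.
try:
    [r["mytag"] for r in rows if eq_varchar(r["mytag"], "hello")]
except TypeError as err:
    print("error:", err)  # error: cannot compare None as VARCHAR
```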
Re: Check presence of field in json file
Keep in mind the danger of testing Foo != null. That doesn't work and catches me by surprise all the time. Foo is null and variants are what you need.

On Sat, Sep 14, 2019, 4:56 PM hanu mapr wrote:

> Hello Sebastian,
>
> By default Drill sets the field 'foo' to null for the files that don't
> contain it. I am of the opinion that the condition where foo = 'bar' should
> result in false for all those files which don't contain the field.
> Please send across the queries which you have run and the observed
> result.
>
> Just off the top of my head, some query like the one below might work:
>
> select file_name from dfs.`/bla/*/*` where foo != null
>
> You might want to remove duplicate entries. (Of course, this also returns
> the rows which contain the field and are null.)
>
> Hope this helps.
>
> Thanks
>
> On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister <
> sfisc...@uwaterloo.ca> wrote:
>
> > Hi,
> >
> > When searching multiple directories, Drill only searches fields that are
> > common to all files (see the JSON data model). Is there a way to query a
> > directory and list all files that contain a certain field?
> >
> > In other words, I would like to use the workaround in this way:
> >
> > select * from (select fqn from dfs.`/bla/*/*` where foo exists) where foo = 'bar'
> >
> > Or is there another way to do this? I dynamically get more files, so
> > finding the files should be included in the query.
> >
> > An alternative would be to execute the query such that it sets the field
> > 'foo' to null for all files that don't contain it. However, I don't know
> > how to execute this.
> >
> > Thanks,
> > Sebastian
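Ted's warning is standard SQL three-valued logic: comparing anything to NULL with = or != yields UNKNOWN, so the predicate never matches. This can be demonstrated with SQLite from Python's standard library (plain SQL NULL semantics, not Drill itself, but the behavior here is the same as far as I know):

```python
import sqlite3

# Demonstration of SQL three-valued logic using SQLite:
# foo != NULL evaluates to UNKNOWN for every row, so it matches nothing;
# only IS NULL / IS NOT NULL tests for NULL reliably.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (foo TEXT)")
con.executemany("INSERT INTO t VALUES (?)", [("bar",), (None,)])

neq = con.execute("SELECT COUNT(*) FROM t WHERE foo != NULL").fetchone()[0]
isn = con.execute("SELECT COUNT(*) FROM t WHERE foo IS NULL").fetchone()[0]
print(neq)  # 0 -- != NULL matches no rows, not even the NULL one
print(isn)  # 1 -- IS NULL finds the NULL row
```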
Re: Check presence of field in json file
Hello Sebastian,

By default Drill sets the field 'foo' to null for the files that don't contain it. I am of the opinion that the condition where foo = 'bar' should result in false for all those files which don't contain the field. Please send across the queries which you have run and the observed result.

Just off the top of my head, some query like the one below might work:

select file_name from dfs.`/bla/*/*` where foo != null

You might want to remove duplicate entries. (Of course, this also returns the rows which contain the field and are null.)

Hope this helps.

Thanks

On Fri, Sep 13, 2019 at 10:53 PM Sebastian Fischmeister <
sfisc...@uwaterloo.ca> wrote:

> Hi,
>
> When searching multiple directories, Drill only searches fields that are
> common to all files (see the JSON data model). Is there a way to query a
> directory and list all files that contain a certain field?
>
> In other words, I would like to use the workaround in this way:
>
> select * from (select fqn from dfs.`/bla/*/*` where foo exists) where foo = 'bar'
>
> Or is there another way to do this? I dynamically get more files, so
> finding the files should be included in the query.
>
> An alternative would be to execute the query such that it sets the field
> 'foo' to null for all files that don't contain it. However, I don't know
> how to execute this.
>
> Thanks,
> Sebastian
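For what it's worth, when the file set is small enough, the "which files contain field 'foo'" step can also be done outside Drill by scanning the JSON directly. A minimal Python sketch; the file names and contents are invented for illustration, and it only checks top-level fields (nested fields would need a recursive walk):

```python
import json

# Hypothetical stand-in for a directory of JSON files: name -> contents.
files = {
    "file1.json": '{"a": 10}',
    "file2.json": '{"a": 20, "foo": "bar"}',
}

# List the files whose top-level object contains the field "foo".
with_foo = [name for name, text in files.items() if "foo" in json.loads(text)]
print(with_foo)  # ['file2.json']
```

The resulting file list could then feed a second, Drill-based query restricted to those files.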