Paul Rogers created DRILL-7522:
----------------------------------

             Summary: JSON reader (v1) omits null columns in SELECT *
                 Key: DRILL-7522
                 URL: https://issues.apache.org/jira/browse/DRILL-7522
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.17.0
            Reporter: Paul Rogers


Run the following unit test: {{TestStarQueries.testSelStarOrderBy}}, runs the 
following query:

{code:sql}
select * from cp.`employee.json` order by last_name
{code}

The query reads a Foodmart file {{customer.json}} that has records like this:

{code:json}
{"employee_id":53,...","end_date":null,"salary":...}
{code}

The field {{end_date}} turns out to be null for all records in 
{{customer.json}}.

Then, look at the verification query. It carefully includes all fields *except* 
{{end_date}}. That is, the test was written to expect that the JSON reader will 
omit a column that has all NULL values.

While it might seem OK to omit all-NULL columns (they don't have any data), the 
problem is that Drill is a distributed system. Suppose we query a directory of 
50 such files, some of which have all-NULLs in one field, some of which have 
all-NULLs in another. Although the files have the same schema, {{SELECT *}} 
will return different schemas (depending on which file has which non-NULL 
columns.)

A downstream operator will have to merge these schemas. And, since Drill fills 
in a Nullable INT field for missing columns, we might end up with a schema 
change exception because the actual field type is VARCHAR when it appears.

One can argue that {{SELECT *}} means "return all columns", not "return all 
columns except those that happen to be null in the first batch." Yes, we have 
the problem of not knowing the actual field type. Eventually, provided schemas 
will resolve such issues.

Note that in the "V2" JSON reader, {{end_date}} is included in the query.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to