Jinfeng Ni created DRILL-5464:
---------------------------------

             Summary: Fix JSON reader when it deals with empty file
                 Key: DRILL-5464
                 URL: https://issues.apache.org/jira/browse/DRILL-5464
             Project: Apache Drill
          Issue Type: Bug
            Reporter: Jinfeng Ni


An empty JSON file is one that contains no JSON objects.  If we query an empty 
JSON file and ask it to return column 'A', Drill's JSON record reader returns 
a batch with 0 rows and adds column 'A' as a nullable int column. A better 
name for such a column might be "phantom column", since the record reader has 
no knowledge of the column's schema and nullable int is just a guessed 
type. 

However, this behavior can introduce many issues. Consider a directory that 
contains multiple JSON files, at least one of which is empty. If the reader 
over the empty file returns column 'A' as a nullable int column while the 
other JSON files contain a real typed column 'A', the query can hit several 
kinds of problems: 1) a SchemaChangeException, 2) a failure in an operator 
that does not detect the schema change, or 3) an incorrect query result, 
because the run-time code is generated for the phantom column type rather 
than the real type.
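
The conflict described above can be modeled with a small sketch. This is not 
Drill code: the Batch class, the read_json_file and schema_changes helpers, 
and the type strings are all hypothetical names chosen for illustration, and 
the type inference is deliberately simplistic (every value seen is treated as 
a nullable bigint). It only shows how a guessed type from an empty file ends 
up disagreeing with the type inferred from real data.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical type labels for this sketch; they only mirror the
# nullable int / nullable bigint types named in the issue.
NULLABLE_INT = "INT:OPTIONAL"
NULLABLE_BIGINT = "BIGINT:OPTIONAL"


@dataclass
class Batch:
    """A toy record batch: column name -> type, plus a row count."""
    schema: Dict[str, str]
    row_count: int


def read_json_file(rows: List[dict], projected: List[str]) -> Batch:
    """Toy reader: infer a type from data, or guess nullable int if none exists."""
    schema = {}
    for col in projected:
        values = [r[col] for r in rows if col in r]
        if values:
            # Simplified inference (assumption): any value seen means BIGINT.
            schema[col] = NULLABLE_BIGINT
        else:
            # No value seen: the "phantom" column with a guessed type.
            schema[col] = NULLABLE_INT
    return Batch(schema, len(rows))


def schema_changes(batches: List[Batch]) -> List[str]:
    """Return the columns whose type differs from the first type seen."""
    seen: Dict[str, str] = {}
    changed: List[str] = []
    for batch in batches:
        for col, typ in batch.schema.items():
            if col in seen and seen[col] != typ and col not in changed:
                changed.append(col)
            seen.setdefault(col, typ)
    return changed


# One file with real data, one empty file, same projected column.
real = read_json_file([{"stars": 4}, {"stars": 5}], ["stars"])
empty = read_json_file([], ["stars"])
conflicts = schema_changes([real, empty])
```

In this model the empty file yields a 0-row batch whose 'stars' column is 
nullable int, so any downstream step that assumes one type per column sees a 
schema change on 'stars'.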

For instance, the following query against a Yelp JSON file runs successfully.
{code}
select count(*), stars  from dfs.`/tmp/yelp/yelp_academic_dataset_review.json` 
group by stars;
{code}

If an empty JSON file is added to the directory, the query fails with the 
following error (which falls into the 2nd category: PartitionSender did not 
detect the schema change properly).  

{code}
select count(*), stars  from dfs.`/tmp/yelp` group by stars;
Error: SYSTEM ERROR: IllegalStateException: Failure while reading vector.  
Expected vector class of org.apache.drill.exec.vector.NullableIntVector but was 
holding vector class org.apache.drill.exec.vector.NullableBigIntVector, field= 
stars(BIGINT:OPTIONAL)[$bits$(UINT1:REQUIRED), stars(BIGINT:OPTIONAL)]
{code}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
