My JSON data looks, simplified, like this:
{"ID":1,"a":"some text"}
{"ID":2,"a":"some text","b":"some other text"}
{"ID":3,"a":"some text"}
Column b is only physically serialized when it is not null.
It is the equivalent of a nullable VARCHAR column in SQL.
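In relational terms the structure would be roughly this (table name and types are just for illustration):

CREATE TABLE my_data (
  ID INTEGER NOT NULL,
  a  VARCHAR NOT NULL,
  b  VARCHAR          -- nullable; omitted from the JSON when NULL
)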
I run queries like these:
SELECT b
FROM dfs.`D:\MyData\test.json`
WHERE b IS NOT NULL
And normally all is fine.
However, among my thousands of data files, I have two files where the first
occurrence of b happens a few thousand records down the file.
These two data files would look like this:
{"ID":1,"a":"some text"}
{"ID":2,"a":"some text"}
... 5000 more records without column b ...
{"ID":5002,"a":"some text","b":"some other text"}
{"ID":5003,"a":"some text"}
In this case, my simple SQL query above fails:
[30027]Query execution error. Details:[
DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar type when
you are using a ValueWriter of type NullableIntWriterImpl.
File /D:/MyData/test.json
Record 5002 Fragment ...
It seems that Drill's schema inference mechanism only samples a certain number of bytes (or records) to determine the schema.
If the first occurrence of a column happens too far down the file, the query blows up.
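Presumably something like the following would show which type Drill infers for b, although I am not sure typeof() actually reflects what the JSON reader decided:

SELECT typeof(b) AS b_type
FROM dfs.`D:\MyData\test.json`
LIMIT 1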
I am now looking for a sane way to work around this, preferably by extending the query rather than by altering my massive amounts of data.
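To illustrate the kind of query-side fix I am hoping for; I do not know whether a cast or a session option such as all_text_mode actually side-steps the inference problem:

-- untested ideas, only to show what "extending the query" could mean
ALTER SESSION SET `store.json.all_text_mode` = true

SELECT CAST(t.b AS VARCHAR) AS b
FROM dfs.`D:\MyData\test.json` t
WHERE t.b IS NOT NULL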
BTW, I tried altering the data by changing the first line to
{"ID":1,"a":"some text","b":null}
but that does not help.
Of course, changing the first line to
{"ID":1,"a":"some text","b":""}
does solve the problem, but patching thousands of files is not a practical solution.
Any help appreciated.
Alexander