My JSON data looks, simplified, like this:
{"ID":1,"a":"some text"}
{"ID":2,"a":"some text","b":"some other text"}
{"ID":3,"a":"some text"}
Column b is only physically serialized when it is not null.
It is the equivalent of a nullable VARCHAR column in SQL.
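In relational terms the structure would be roughly this (table name and types are just for illustration):

CREATE TABLE my_data (
  ID INTEGER NOT NULL,
  a  VARCHAR NOT NULL,
  b  VARCHAR          -- nullable; omitted from the JSON when NULL
)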
I run queries like these:
SELECT b
FROM dfs.`D:\MyData\test.json`
WHERE b IS NOT NULL
And normally all is fine.
However, among my thousands of data files, I have two files where the first
occurrence of b happens a few thousand records down the file.
These two data files would look like this:
{"ID":1,"a":"some text"}
{"ID":2,"a":"some text"}
... 5000 more records without column b ...
{"ID":5002,"a":"some text","b":"some other text"}
{"ID":5003,"a":"some text"}
In this case, my simple SQL query above fails:
[30027]Query execution error. Details:[
DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar type when
you are using a ValueWriter of type NullableIntWriterImpl.
File /D:/MyData/test.json
Record 5002 Fragment ...
It seems that Drill's schema inference mechanism only samples a certain number of bytes (or records) to determine the schema.
If the first occurrence of a column happens too far down the file, the query blows up.
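Presumably something like the following would show which type Drill infers for b, although I am not sure typeof() actually reflects what the JSON reader decided:

SELECT typeof(b) AS b_type
FROM dfs.`D:\MyData\test.json`
LIMIT 1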
I am now looking for a sane way to work around this, preferably by extending the query rather than by altering my massive amounts of data.
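To illustrate the kind of query-side fix I am hoping for; I do not know whether a cast or a session option such as all_text_mode actually side-steps the inference problem:

-- untested ideas, only to show what "extending the query" could mean
ALTER SESSION SET `store.json.all_text_mode` = true

SELECT CAST(t.b AS VARCHAR) AS b
FROM dfs.`D:\MyData\test.json` t
WHERE t.b IS NOT NULL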
BTW, I tried altering the data by changing the first line to
{"ID":1,"a":"some text","b":null}
but that does not help.
Of course, changing the first line to
{"ID":1,"a":"some text","b":""}
does solve the problem, but patching thousands of files is not a practical solution.
Any help appreciated.
Alexander