Hi Alexander,

Please try turning on the union type:
ALTER SESSION SET `exec.enable_union_type` = true;

Kind regards,
Vitalii

2016-07-08 10:50 GMT+00:00 Holy Alexander <[email protected]>:
> My JSON data looks - simplified - like this:
>
> {"ID":1,"a":"some text"}
> {"ID":2,"a":"some text","b":"some other text"}
> {"ID":3,"a":"some text"}
>
> Column b is only physically serialized when it is not null.
> It is the equivalent of a nullable VARCHAR column in SQL.
>
> I run queries like these:
>
> SELECT b
> FROM dfs.`D:\MyData\test.json`
> WHERE b IS NOT NULL
>
> and normally all is fine.
> However, among my thousands of data files, I have two files where the
> first occurrence of b happens a few thousand records down the file.
> These two data files look like this:
>
> {"ID":1,"a":"some text"}
> {"ID":2,"a":"some text"}
> ... 5000 more records without column b ...
> {"ID":5002,"a":"some text","b":"some other text"}
> {"ID":5003,"a":"some text"}
>
> In this case, my simple SQL query above fails:
>
> [30027]Query execution error. Details:[
> DATA_READ ERROR: Error parsing JSON - You tried to write a VarChar type
> when you are using a ValueWriter of type NullableIntWriterImpl.
> File /D:/MyData/test.json
> Record 5002 Fragment ...
>
> It seems that the schema inference mechanism of Drill only samples a
> certain number of bytes (or records) to determine the schema.
> If the first occurrence of a schema detail happens too far down, things
> go boom.
>
> I am now looking for a sane way to work around this, preferably by
> extending the query rather than by altering my massive amounts of data.
>
> BTW, I tried altering the data by changing the first line to
> {"ID":1,"a":"some text","b":null}
> but that does not help.
>
> Of course, changing the first line to
> {"ID":1,"a":"some text","b":""}
> solves the problem, but it is not a practical solution.
>
> Any help appreciated.
> Alexander
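Putting the suggested setting together with the original query, the workaround would run in a single Drill session like this (a sketch: `exec.enable_union_type` was an experimental option at the time, so behavior may vary between Drill versions):

```sql
-- Enable Drill's (experimental) union vector type, so a column whose
-- first few thousand records are all missing (inferred as nullable INT)
-- can later carry VARCHAR values without a schema-change error.
ALTER SESSION SET `exec.enable_union_type` = true;

-- The original query from the thread; with the union type enabled,
-- rows where b was never serialized simply yield NULL and are
-- filtered out by the predicate.
SELECT b
FROM dfs.`D:\MyData\test.json`
WHERE b IS NOT NULL;
```

Note that `ALTER SESSION` only affects the current session; to apply the option across all sessions one would use `ALTER SYSTEM` instead.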
