RE: Looking for workaround to Schema detection problems

Holy Alexander Fri, 08 Jul 2016 06:37:45 -0700

Hi Vitalii!


This is what I tried:

Altered the setting system-wide:

ALTER SYSTEM SET `exec.enable_union_type` = true

Verified that the setting was really altered:

SELECT * 
FROM sys.options 
WHERE type in ('SYSTEM','SESSION') order by name

And re-ran the query.

Unfortunately this does not solve the problem.
It just causes a different error:

[30027]Query execution error. Details:[ 
SYSTEM ERROR: NullPointerException
Fragment 0:0
[Error Id: 0f9cb7ae-d2d5-474c-ad57-2d558041e2c6 on

(I tried this on Drill 1.7 and 1.6)

Best regards,
Alexander


-----Original Message-----
From: Vitalii Diravka [mailto:[email protected]] 
Sent: 08 July 2016 13:30
To: [email protected]
Subject: Re: Looking for workaround to Schema detection problems

Hi Alexander,

Please try with turning on the union type:

ALTER SESSION SET `exec.enable_union_type` = true;

Kind regards
Vitalii

2016-07-08 10:50 GMT+00:00 Holy Alexander <[email protected]>:

> My JSON data looks - simplified - like this
>
> {"ID":1,"a":"some text"}
> {"ID":2,"a":"some text","b":"some other text"}
> {"ID":3,"a":"some text"}
>
> Column b is only physically serialized when it is not null.
> It is the equivalent of a NULLable VARCHAR() column in SQL.
>
> I run queries like these:
>
> SELECT b
> FROM dfs.`D:\MyData\test.json`
> WHERE b IS NOT NULL
>
> And normally all is fine.
> However, among my thousands of data files, I have two files where the 
> first occurrence of b happens a few thousand records down the file.
> These two data files would look like this:
>
> {"ID":1,"a":"some text"}
> {"ID":2,"a":"some text"}
> ... 5000 more records without column b ...
> {"ID":5002,"a":"some text","b":"some other text"}
> {"ID":5003,"a":"some text"}
>
> In this case, my simple SQL query above fails:
>
> [30027]Query execution error. Details:[ DATA_READ ERROR: Error parsing 
> JSON - You tried to write a VarChar type when you are using a 
> ValueWriter of type NullableIntWriterImpl.
> File  /D:/MyData/test.json
> Record 5002 Fragment ...
>
> It seems that Drill's schema inference mechanism only samples a
> certain number of bytes (or records) to determine the schema.
> If the first occurrence of a schema detail happens too far down, things
> go boom.
>
> I am now looking for a sane way to work around this.
> Preferably by extending the query and not by altering my massive 
> amounts of data.
>
> BTW, I tried altering the data by changing the first line:
> {"ID":1,"a":"some text","b":null}
> does not help.
>
> Of course, changing the first line to
> {"ID":1,"a":"some text","b":""}
> solves the problem, but this is not a practical solution.
>
> Any help appreciated.
> Alexander
>
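
For anyone who needs to find out which files are affected, here is a rough sketch in Python (not Drill itself) that scans a newline-delimited JSON file and reports the columns whose first non-null value shows up late. The function names first_seen and late_keys are made up for this example, and the default threshold of 4096 records is only an illustrative guess, not a confirmed Drill setting. Explicit nulls are treated as "not seen yet", because the report above says that adding "b":null to the first line does not help.

```python
import json
from typing import Dict, Iterable


def first_seen(records: Iterable[str]) -> Dict[str, int]:
    """Map each top-level key to the 1-based record number of its first non-null value."""
    seen: Dict[str, int] = {}
    record_no = 0
    for line in records:
        line = line.strip()
        if not line:
            continue  # blank lines are not records
        record_no += 1
        for key, value in json.loads(line).items():
            # An explicit null counts as "not seen yet" (assumption based on the report).
            if value is not None and key not in seen:
                seen[key] = record_no
    return seen


def late_keys(records: Iterable[str], threshold: int = 4096) -> Dict[str, int]:
    """Return the keys whose first non-null value appears after `threshold` records.

    The default threshold is a hypothetical value chosen for illustration.
    """
    return {key: n for key, n in first_seen(records).items() if n > threshold}
```

To check a single file, pass an open file handle, for example late_keys(open(path, encoding="utf-8")), where path is whatever file is under suspicion. An empty result means every column has a non-null value within the first threshold records.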
