Re: Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Abdel Hakim Deneche Thu, 23 Jul 2015 09:25:48 -0700

I don't think Drill is supposed to "ignore" data. My understanding is that
the reader will read the new fields, which causes a schema change, and
depending on the query (whether all the operators involved can handle the
schema change) the query should either succeed or fail.
My understanding is that Drill will most likely fail rather than display
incorrect results; if it does display incorrect results, that's a bug that
needs to be fixed.
Sometimes the reader itself may fail. For example, if you have a list of
numbers and the first 1000 values are int, any later value that is a
double or string will cause the JSON reader to fail.
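
To make the list-of-numbers case concrete, here is a rough Python sketch of
the behaviour I'm describing. It is only a toy model and not Drill's actual
JSON reader: the names read_numbers and SchemaChangeError are made up for
illustration, and I'm assuming the type is fixed by the first value seen.

```python
# Toy model only: this is NOT Drill's JSON reader. It illustrates
# "fix the type from the first value, then fail on a later type change".


class SchemaChangeError(Exception):
    """Hypothetical error raised when a later value has a different type."""


def read_numbers(values):
    """Return the values as a list, or raise if any value changes type."""
    values = list(values)
    if not values:
        return []
    # Assumption: the first value decides the type for the whole list.
    inferred = type(values[0])
    for index, value in enumerate(values):
        if type(value) is not inferred:
            # Report the 1-based position of the offending value.
            raise SchemaChangeError(
                f"value #{index + 1} is {type(value).__name__}, "
                f"expected {inferred.__name__}"
            )
    return values
```

With this model, read_numbers([1] * 1000 + [2.5]) raises SchemaChangeError
at value #1001 instead of silently dropping or coercing the double, which is
the "fail rather than show incorrect results" behaviour I'd expect.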


On Thu, Jul 23, 2015 at 9:16 AM, Matt <[email protected]> wrote:

> On 23 Jul 2015, at 10:53, Abdel Hakim Deneche wrote:
>
>  When you try to read schema-less data, Drill will first investigate the
>> 1000 rows to figure out a schema for your data, then it will use this
>> schema for the remaining of the query.
>>
>
> To clarify, if the JSON schema changes on the 1001st 1MMth record, is
> Drill supposed to report an error, or ignore new data elements and only
> consider those discovered in the first 1000 objects?
>



-- 

Abdelhakim Deneche

Software Engineer
