Inaccurate data representation when selecting from json sub structures and loss of data creating Parquet files from it

Stefán Baxter Wed, 22 Jul 2015 15:52:33 -0700

Hi,

I keep coming across *quirks* in Drill that are quite time-consuming to
deal with and are now causing mounting concerns.


This last one, though, is far more serious than the previous ones because it
deals with loss of data.

I'm working with a small(ish) dataset of around 1m records (which I'm more
than happy to hand over to replicate this).

The problem goes like this:

   1. With dfs.tmp.`/test.json` containing a structure like this (simplified):
   - 800k x
   {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
   - 100k x
   {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}

   2. Selecting: select some, t.others from dfs.tmp.`/test.json` as t;
   - returns only this for all the records:
   "yes", {"other":"true","all":"false","sometimes":"yes"}

The query never returns this:
"yes", {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}
so the last entries in the file are incorrectly represented.
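
For anyone who wants to replicate this without the original dataset, a
minimal Python sketch that writes a file of the same shape might look like
the following. The function names, the one-record-per-line layout and the
output path are assumptions made for illustration; only the two record
shapes and the 800k/100k split come from the description above.

```python
import json

# Record shape used for the first 800k entries (no "additional" key).
BASE = {
    "some": "yes",
    "others": {"other": "true", "all": "false", "sometimes": "yes"},
}

# Record shape used for the last 100k entries (adds others.additional).
EXTENDED = {
    "some": "yes",
    "others": {
        "other": "true",
        "all": "false",
        "sometimes": "yes",
        "additional": "last entries only",
    },
}


def write_test_file(path, n_base=800_000, n_extended=100_000):
    """Write n_base BASE records followed by n_extended EXTENDED records,
    one JSON object per line (hypothetical helper, not part of Drill)."""
    base_line = json.dumps(BASE, separators=(",", ":")) + "\n"
    extended_line = json.dumps(EXTENDED, separators=(",", ":")) + "\n"
    with open(path, "w", encoding="utf-8") as f:
        for _ in range(n_base):
            f.write(base_line)
        for _ in range(n_extended):
            f.write(extended_line)


def count_with_additional(path):
    """Count records in the file that carry others.additional, as a
    reference figure to compare against what a query returns."""
    count = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            if "additional" in json.loads(line)["others"]:
                count += 1
    return count
```

Calling write_test_file("test.json") and placing the result wherever the
dfs.tmp workspace points (which depends on the local configuration) should
give a file comparable to the one described; count_with_additional then
gives the number of records a correct query would be expected to show with
the extra property.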

To make matters a lot worse, the property is completely ignored in:
create table X as select * from dfs.tmp.`/test.json`
and the resulting Parquet file does not include it at all.

It looks, to me, like the dynamic schema discovery has stopped looking for
schema changes and is quite set in its ways; so set, in fact, that it's
ignoring data.

I'm guessing that this is potentially affecting more people than me.

I believe I have produced this under 1.1 and 1.2-SNAPSHOT.

Regards,
 -Stefan
