Hi,
I keep coming across *quirks* in Drill that are quite time-consuming to
deal with and are now causing mounting concerns.
This last one, though, is far more serious than the previous ones because it
involves loss of data.
I'm working with a small(ish) dataset of around 1m records (which I'm more
than happy to hand over so this can be replicated).
The problem goes like this:
1. with dfs.tmp.`/test.json`
- containing a structure like this (simplified):
- 800k x {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes"}}
- 100k x {"some":"yes","others":{"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}}
2. selecting: select some, t.others from dfs.tmp.`/test.json` as t;
- returns only this for all the records:
"yes", {"other":"true","all":"false","sometimes":"yes"}
- never returns this:
"yes", {"other":"true","all":"false","sometimes":"yes","additional":"last entries only"}
so the last entries in the file are incorrectly represented.
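For anyone who wants to replicate this without the full dataset, something like the following should generate an equivalent file. It is a minimal sketch: the function name write_test_file and its parameters are made up for illustration, the counts are the simplified 800k/100k split described above, and it assumes that dfs.tmp points at /tmp on the local file system (adjust the path if your storage plugin is configured differently).

```python
import json


def write_test_file(path, n_base=800_000, n_extended=100_000):
    """Write n_base records without "additional", then n_extended records with it."""
    base = {
        "some": "yes",
        "others": {"other": "true", "all": "false", "sometimes": "yes"},
    }
    # Same record, but with the extra property appended to the nested map.
    extended = {
        "some": "yes",
        "others": dict(base["others"], additional="last entries only"),
    }
    with open(path, "w") as f:
        # One JSON object per line; the extended records come last in the file.
        for _ in range(n_base):
            f.write(json.dumps(base, separators=(",", ":")) + "\n")
        for _ in range(n_extended):
            f.write(json.dumps(extended, separators=(",", ":")) + "\n")
```

Calling write_test_file("/tmp/test.json") would then produce the file queried in step 2.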
To make matters a lot worse, the property is completely ignored in:
create table X as select * from dfs.tmp.`/test.json`
and the resulting Parquet file does not include it at all.
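To confirm, independently of Drill, which properties the source file really contains, a small check along these lines can be run against the JSON. This is only a sketch: count_others_keys is a hypothetical helper, and it assumes one JSON object per line with the nested map stored under "others", as in the simplified structure above.

```python
import json
from collections import Counter


def count_others_keys(lines):
    """Count records per distinct (sorted) set of keys found under "others"."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        counts[tuple(sorted(record.get("others", {})))] += 1
    return counts
```

Running it as count_others_keys(open("/tmp/test.json")) should report two distinct key sets, one of which includes "additional"; that can then be compared with what the query and the Parquet file return.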
It looks to me as if the dynamic schema discovery has stopped looking for
schema changes and is quite set in its ways, so set, in fact, that it's
ignoring data.
I'm guessing that this is potentially affecting more people than me.
I believe I have reproduced this under both 1.1 and 1.2-SNAPSHOT.
Regards,
-Stefan