Using Drill's CTAS statements, I've run into a schema inconsistency issue and
I'm not sure how to solve it. For reference, the CTAS syntax is:
CREATE TABLE name [ (column list) ] AS query;
Say I have a directory called Cities containing JSON files that look like this:
a.json:
{ "city":"San Francisco", "zip":"94105"}
{ "city":"San Jose", "zip":"94088"}
b.json:
{ "city":"Toronto", "zip": null}
{ "city":"Montreal", "zip": null}
If I create a Parquet table from the Cities directory, I end up with files
named 1_0_0.parquet through 1_5_1.parquet.
Now I have a problem:
Most of the Parquet files have a column type of char for zip.
Some of the Parquet files have a column type of int for zip, because the zip
value for every record in that group was NULL.
This produces schema change errors later when I query the Parquet
directory.
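
To make the failure mode concrete, here is a small Python sketch of what I think is happening. It is not Drill code. The function names (infer_type, infer_all) and the "fall back to int when every value is null" rule are my own assumptions, chosen only to mirror the types I see in the Parquet output. The records are the ones from a.json and b.json above.

```python
import json

# Sample records copied from the two files in this post.
FILES = {
    "a.json": [
        '{ "city":"San Francisco", "zip":"94105"}',
        '{ "city":"San Jose", "zip":"94088"}',
    ],
    "b.json": [
        '{ "city":"Toronto", "zip": null}',
        '{ "city":"Montreal", "zip": null}',
    ],
}

# Assumption: a reader that never sees a non-null value falls back to a
# default type. "int" mirrors what I observe, not a documented rule.
DEFAULT_TYPE = "int"


def infer_type(lines, column):
    """Infer a column type from one group of records, in isolation."""
    for line in lines:
        value = json.loads(line).get(column)
        if value is not None:
            # The first non-null value decides the type for this group.
            return "char" if isinstance(value, str) else type(value).__name__
    # Every value was null, so nothing in the data indicates the real type.
    return DEFAULT_TYPE


def infer_all(files, column):
    """Infer the column type separately for each file (hypothetical helper)."""
    return {name: infer_type(lines, column) for name, lines in files.items()}


schemas = infer_all(FILES, "zip")
print(schemas)  # {'a.json': 'char', 'b.json': 'int'}
```

Because each group is typed on its own, the all-null group ends up with a different type from the rest, and the mismatch only surfaces when the outputs are read back together.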
Is it possible for Drill to do a better job of inferring a schema across all
the JSON files in a directory before creating Parquet?