Unfortunately the JSON source files I'm trying to convert into nested Parquet
have 4,000+ possible keys with multiple levels of nesting. It would be ideal
if you could inject the schema definition into a Drill query instead of relying
on schema learning.
Like:
Contact
    First name
    Last name
    Education
        School[]
        Degrees[]
    Work Experience
        Company[]
        Title
        Years
    Addresses
        Address[]
            Street
            City
            State
            Zip
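Today the closest workaround I can see is spelling out CASTs by hand, roughly
like the sketch below (the dfs.tmp workspace, table name, and source path are
made up), which obviously doesn't scale to 4,000+ keys:

    -- Hypothetical sketch: forcing types per column instead of relying on
    -- schema inference. Column names mirror the outline above.
    CREATE TABLE dfs.tmp.contacts AS
    SELECT
        CAST(t.Contact.`First name` AS VARCHAR) AS first_name,
        CAST(t.Contact.`Last name`  AS VARCHAR) AS last_name,
        CAST(t.Contact.Addresses.Address[0].Zip AS VARCHAR) AS zip
    FROM dfs.`/data/contacts` t;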
-----Original Message-----
From: Andries Engelbrecht [mailto:[email protected]]
Sent: Friday, February 23, 2018 12:34 PM
To: [email protected]
Subject: Re: Schema problems trying to convert JSON to Parquet
This is a challenge when dealing with JSON. You can either force the data type
in the CTAS statement (likely the better option) or deal with the data type
change in the Parquet table(s) by using CAST, etc. In the case of zip codes you
need to consider whether it will be 5 digits or the extended 5-4 digits to
decide if the data type should be INT or VARCHAR.
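For example, something like this (a minimal sketch; the dfs.tmp workspace and
source path are assumptions):

    -- Force zip to VARCHAR up front so NULL-only record batches don't get
    -- written out as INT:
    CREATE TABLE dfs.tmp.cities AS
    SELECT city, CAST(zip AS VARCHAR) AS zip
    FROM dfs.`/data/Cities`;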
Also look into the TYPEOF function, which you can use with CASE to deal with
these types of issues.
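Roughly like this (a sketch; the exact type names TYPEOF returns can vary by
Drill version, so treat 'NULL' here as an assumption):

    SELECT city,
           CASE WHEN typeof(zip) = 'NULL' THEN CAST(NULL AS VARCHAR)
                ELSE CAST(zip AS VARCHAR)
           END AS zip
    FROM dfs.`/data/Cities`;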
I prefer to deal with data issues as soon as possible in the pipeline, so the
tables you create are consistent and clean.
--Andries
On 2/23/18, 12:04 PM, "Lee, David" <[email protected]> wrote:
Using Drill's CTAS statements I've run into a schema inconsistency issue
and I'm not sure how to solve it.
CREATE TABLE name [ (column list) ] AS query;
If I have a directory called Cities which contains JSON files that look like:
a.json:
{ "city":"San Francisco", "zip":"94105"}
{ "city":"San Jose", "zip":"94088"}
b.json:
{ "city":"Toronto ", "zip": null}
{ "city":"Montreal", "zip" null}
If I create a Parquet table out of the Cities directory I will end up with
files called:
1_0_0.parquet through 1_5_1.parquet
Now I have a problem:
Most of the Parquet files have a column type of VARCHAR for zip.
Some of the Parquet files have a column type of INT for zip because the zip
value for an entire group of records was NULL.
This produces schema change errors later when trying to query the Parquet
directory.
Is it possible for Drill to do a better job of learning schemas across all
JSON files in a directory before creating the Parquet files?