Re: CTAS query fails

Vimal Jain Sun, 20 Sep 2020 23:51:29 -0700

Thanks, Paul, for the quick response.
So, reading your response, it looks like this has something to do with
Parquet rather than Drill? I will post this question to the Parquet
community group as well to see if we can get an answer.


*Thanks and Regards,*
*Vimal Jain*


On Fri, Sep 18, 2020 at 10:45 PM Paul Rogers <[email protected]> wrote:

> Hi Vimal,
>
> You've stumbled across one of the more frustrating bits of Drill. Drill is
> "schema-free", meaning that the only information which Drill has to read
> your data is the data itself. In your case, the JSON reader can infer that
> "abc" is a MAP (Drill's term; Hive would call it a STRUCT). Each file is
> read in a different "fragment". One fragment says that "abc" is an empty
> MAP, another says that it has some schema. These are merged sometime later
> in the query.
>
> If you had had a null value instead, Drill wouldn't know that "abc" is a
> map and would have guessed INT as the type. So it's good that you have an
> empty object; it avoids ambiguity.
>
> Sounds like the issue is in the Parquet writer: that it has some limitation
> on an empty group. Why is the group empty? Because, when writing the first
> file with the empty group, the Parquet writer has no way to predict that
> your "abc" field will eventually include a non-empty group. In fact, when
> the non-empty group does appear, the Parquet schema must change. Not sure
> what Parquet will do in that case: you may end up with some files with one
> schema, other files with another schema.
>
> What you want, of course, is for Drill to combine your files to create a
> single schema for Parquet, setting fields to null when they are missing.
> Drill can't currently do that effectively because it involves predicting
> the future, which Drill cannot do.
>
> Does anyone have more direct knowledge of how Parquet handles this case?
>
> Thanks,
>
> - Paul
>
> On Fri, Sep 18, 2020 at 4:10 AM Vimal Jain <[email protected]> wrote:
>
> > Hi,
> > I am trying to convert my JSON data into Parquet format using a CTAS
> > query like the one below :-
> >
> > *create table ds2.root.`parquetOutput` as select * from
> > TABLE(ds1.root.`jsonInput/` (type =>'json'));*
> >
> > But it fails with error :-
> >
> > *Error: SYSTEM ERROR: InvalidSchemaException: Cannot write a schema with
> > an empty group: optional group abc {}
> > Fragment 0:0
> > Please, refer to logs for more information.
> > [Error Id: fa3c0390-0093-4c4a-9b32-098d5cc68c7e on
> > ip-172-30-3-153.ec2.internal:31010] (state=,code=0)*
> >
> > Can someone explain what the issue is here? Can't my JSON files have a
> > key "abc" whose value is an empty object "{}"?
> > It is empty in some JSON files in ds1, but in others it has a value.
> > Any help to resolve this would be appreciated.
> >
> > *Thanks and Regards,*
> > *Vimal Jain*
> >
>
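
To make the point in the thread concrete, here is a minimal Python sketch of
why a per-file schema can contain an empty group while a schema merged across
all files does not. It is only an illustration, not how Drill or the Parquet
writer is implemented: the helper names (infer_schema, merge_schema,
empty_groups) and the two sample records are assumptions made up for this
example.

```python
import json


def infer_schema(value):
    """Infer a toy schema: dicts become nested groups, scalars become type names."""
    if isinstance(value, dict):
        return {key: infer_schema(sub) for key, sub in value.items()}
    if value is None:
        return "null"
    return type(value).__name__


def merge_schema(a, b):
    """Union two toy schemas; groups are merged key by key."""
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for key, sub in b.items():
            merged[key] = merge_schema(merged[key], sub) if key in merged else sub
        return merged
    # For non-group fields, keep the first non-null type seen.
    return a if a != "null" else b


def empty_groups(schema, path=""):
    """Return dotted paths of groups that have no fields at all."""
    found = []
    for key, sub in schema.items():
        full = f"{path}.{key}" if path else key
        if isinstance(sub, dict):
            if not sub:
                found.append(full)
            else:
                found.extend(empty_groups(sub, full))
    return found


# Hypothetical records standing in for two JSON files in the input directory.
file1 = '{"id": 1, "abc": {}}'
file2 = '{"id": 2, "abc": {"x": "y"}}'

per_file = [infer_schema(json.loads(text)) for text in (file1, file2)]
merged = merge_schema(per_file[0], per_file[1])

print(empty_groups(per_file[0]))  # schema seen when only the first file is read
print(empty_groups(merged))       # schema after looking at both files
```

Looking at the first record alone, "abc" is a group with no fields, which is
the shape the error message complains about. Only after both records have
been seen does "abc" gain a field, which is the "predicting the future"
problem described above.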
